You want to run something on pyspark, but you cannot use conda, pip, or Docker.


  • a machine
  • a terminal
  • Java 8/11 installed & JAVA_HOME set
  • Python 3.8+ installed
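
A quick way to confirm the list above from the terminal is sketched below. The `have` helper is a made-up name for this example, not part of any tool; it only checks whether a command is on your PATH.

```shell
# Hypothetical helper: succeeds if the given command is on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Report each prerequisite without aborting when one is missing.
# java prints its version to stderr, hence the 2>&1.
if have java; then java -version 2>&1 | head -n 1; else echo "java: not found"; fi
if have python3; then python3 --version; else echo "python3: not found"; fi
echo "JAVA_HOME=${JAVA_HOME:-<not set>}"
```

If any line reports "not found" or "<not set>", fix that before moving on.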


Choose a Spark binary release that is pre-built with Hadoop here. The commands below use Spark 3.4.1 as an example, in zsh on macOS.

export BINARY_TGZ_URL=""
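# download (-L follows redirects, -O keeps the remote file name)
curl -LO "${BINARY_TGZ_URL}"
# checksum
curl "${BINARY_TGZ_URL}.sha512" | shasum -a 512 --check
# extract (basename yields the archive file name from the URL)
tar -xvzf "$(basename "${BINARY_TGZ_URL}")"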

pyspark will then be available in the bin/ directory of the extracted folder, which you can add to $PATH.
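
As a minimal sketch of the $PATH step: this assumes the archive was extracted into the current directory and produced a folder named spark-3.4.1-bin-hadoop3 (the name depends on the release you picked, so adjust it). Setting SPARK_HOME is a common convention rather than something this article requires.

```shell
# Assumption: extracted into the current directory; the folder name is an
# example and will differ for other Spark/Hadoop versions.
export SPARK_HOME="${PWD}/spark-3.4.1-bin-hadoop3"
export PATH="${SPARK_HOME}/bin:${PATH}"

# Confirm the launcher resolves; print a hint if the folder name is wrong.
command -v pyspark || echo "pyspark not found under ${SPARK_HOME}/bin"
```

To keep this across sessions, put the two export lines in ~/.zshrc with an absolute path in place of ${PWD}.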


Isn't pip included with Python now? Why download manually?

Traffic to PyPI may be restricted and, well, you gotta do what you gotta do.

I’m on Windows. It’s not working :(

You likely need the Hadoop .dll and some environment variables. To be continued in a future article…

