PySpark by download

Published on Mar 4, 2024

Prelude

You want to run something on PySpark, but you cannot use conda, pip, or Docker.

Prerequisites

a machine
a terminal
Java 8/11 installed & JAVA_HOME set
Python 3.8+ installed
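
A quick way to check the last two prerequisites from the terminal is sketched below. It assumes the Python interpreter is invoked as python3 (on some machines it is python), and have is a hypothetical helper name introduced only for this example.

```shell
# have CMD: succeeds if CMD can be found on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Report the Java and Python versions, or note that they are missing.
# (java -version writes to stderr, hence the 2>&1.)
if have java; then java -version 2>&1 | head -n 1; else echo "java: not found"; fi
if have python3; then python3 --version; else echo "python3: not found"; fi

# JAVA_HOME should point at the Java installation directory.
echo "JAVA_HOME=${JAVA_HOME:-<not set>}"
```

If any line reports "not found" or "<not set>", fix that before continuing.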

Terminal

Choose a Spark binary release that includes Hadoop, for example from the Apache archive at https://archive.apache.org/dist/spark/. The commands below use Spark 3.4.1 as an example, in zsh on macOS.

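export BINARY_TGZ_URL="https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz"
# download the archive into the current directory
curl -O "$BINARY_TGZ_URL"
# verify the download against the published SHA-512 checksum
curl "${BINARY_TGZ_URL}.sha512" | shasum -a 512 --check
# extract the archive (basename gives the file name, spark-3.4.1-bin-hadoop3.tgz)
tar -xvzf "$(basename "$BINARY_TGZ_URL")"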

pyspark will then be available under bin/ in the extracted directory (spark-3.4.1-bin-hadoop3 in this example), which you can add to $PATH.
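
A minimal sketch of the $PATH step, assuming the archive was extracted in the current directory and kept its default name. SPARK_HOME is a conventional variable name used here for readability; nothing in the commands above sets it.

```shell
# Assumption: the tarball was extracted in the current directory
# and the extracted directory kept its default name.
export SPARK_HOME="$PWD/spark-3.4.1-bin-hadoop3"
# Put Spark's launcher scripts (pyspark, spark-submit, ...) on PATH.
export PATH="$SPARK_HOME/bin:$PATH"

# Report whether the shell can now find pyspark.
if command -v pyspark >/dev/null 2>&1; then
  echo "pyspark found at $(command -v pyspark)"
else
  echo "pyspark not found; check that $SPARK_HOME exists" >&2
fi
```

To keep the setting across terminal sessions, the two export lines can go in ~/.zshrc, with $PWD replaced by the absolute path of the directory you extracted into.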

Addendum

Isn't pip included with Python now? Why download manually?

Traffic to PyPI may be restricted and, well, you gotta do what you gotta do.

I’m on Windows. It’s not working :(

You likely need the Hadoop .dll and some environment variables. To be continued in a future article…