PySpark by download
Published on Mar 4, 2024
Prelude
You want to run something on PySpark, but you cannot use conda, pip, or Docker.
Prerequisites
- a machine
- a terminal
- Java 8/11 installed & JAVA_HOME set
- Python 3.8+ installed
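A quick way to confirm the prerequisites before downloading anything is a small check like the one below. This is only a sketch: the `have` helper is a hypothetical name of mine, and it assumes the interpreter is exposed as `python3`, which may differ on your machine.

```shell
# have: hypothetical helper, succeeds if the named command is on PATH
have() { command -v "$1" >/dev/null 2>&1; }

# report each required command (python3 is an assumed name for Python 3.8+)
for cmd in java python3; do
  if have "$cmd"; then
    echo "ok: $cmd"
  else
    echo "missing: $cmd"
  fi
done

# JAVA_HOME must be set as well as Java being installed
if [ -n "${JAVA_HOME:-}" ]; then
  echo "ok: JAVA_HOME=$JAVA_HOME"
else
  echo "missing: JAVA_HOME"
fi
```

This only checks that the tools are present, not their versions, so still run `java -version` and `python3 --version` by hand if in doubt.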
Terminal
Choose a Spark binary that includes Hadoop here. The example below uses Spark 3.4.1 in a zsh shell on a Mac.
export BINARY_TGZ_URL="https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz"
# download
curl -O "$BINARY_TGZ_URL"
# verify the download against the published SHA-512 checksum
curl "${BINARY_TGZ_URL}.sha512" | shasum -a 512 --check
# extract (the archive name is the last path segment of the URL)
tar -xvzf "$(basename "$BINARY_TGZ_URL")"
Then pyspark will be available under bin/ in the extracted directory, which can be added to $PATH.
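Adding it to $PATH could look like the sketch below. It assumes the archive was extracted in the current directory, so the folder is named spark-3.4.1-bin-hadoop3. Setting SPARK_HOME is my addition rather than a step from the commands above.

```shell
# assumption: the tarball was extracted in the current directory
export SPARK_HOME="$PWD/spark-3.4.1-bin-hadoop3"

# put Spark's launcher scripts (pyspark, spark-submit, ...) on PATH
export PATH="$SPARK_HOME/bin:$PATH"

# confirm the shell now resolves pyspark, or say where it was expected
command -v pyspark || echo "pyspark not found under $SPARK_HOME/bin"
```

To keep this across sessions, the two export lines can go in ~/.zshrc with an absolute path in place of $PWD. Running `spark-submit --version` afterwards should print the Spark version banner.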
Addendum
Isn't pip included with Python now? Why download manually?
Traffic to PyPI may be restricted and, well, you gotta do what you gotta do.
I’m on Windows. It’s not working :(
You likely need the Hadoop .dll and some environment variables. To be continued in a future article…