PySpark with pip

Published on Jan 8, 2024 · 1 min read


Prelude

You want to run something on PySpark, and you cannot use conda.

Prerequisites

  • a machine
  • a terminal
  • Java 8/11 installed & JAVA_HOME set
  • Python 3.8+ including pip installed
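A quick way to confirm the prerequisites from a terminal (the version numbers are the ones assumed above):

```shell
# Check the Java side: version and JAVA_HOME
java -version                  # expect 8 or 11
echo "JAVA_HOME=$JAVA_HOME"    # should point at that JDK

# Check the Python side: interpreter and pip
python3 --version              # expect 3.8 or newer
python3 -m pip --version
```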

Terminal

pip install pyspark==3.4
pyspark

Addendum

I’m on Windows. It’s not working :(

You likely need the Hadoop .dll and some environment variables. To be continued in a future article…

This looks even easier than with conda. How so?

Is it, though? Both Python and Java need to be installed on the machine. Isolating them would require additional tools, such as pyenv or venv for Python and jenv for Java. Windows adds further pain (see above).
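Isolating the Python side could look like the following sketch (the `.venv` directory name is arbitrary; jenv would play the analogous role for the Java side):

```shell
# Create and activate an isolated Python environment
python3 -m venv .venv
. .venv/bin/activate

# Installs pyspark only inside .venv, not system-wide
pip install pyspark==3.4
```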

What about other Spark / Python / Java versions?

See requirements here.

Notice something wrong? Have an additional tip?

Contribute to the discussion here