PySpark with Docker
Published on Feb 19, 2024
1 min read
Prelude
You want to run something on PySpark. You want to use Docker.
Prerequisites
- a machine
- a terminal
- Docker installed
Terminal
We use the official Apache Spark image here.
docker run -it spark:3.4.1-scala2.12-java11-python3-ubuntu /opt/spark/bin/pyspark
To run existing code, consider a bind mount that maps the code on your laptop/host into a directory in the container.
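A minimal sketch, assuming your script is called main.py, sits in the current directory, and that the image's working directory is /opt/spark/work-dir:

# Mount the current directory into the container and submit the script
docker run -it \
  -v "$(pwd)":/opt/spark/work-dir \
  spark:3.4.1-scala2.12-java11-python3-ubuntu \
  /opt/spark/bin/spark-submit main.py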
Alternatively, Docker IDE plugins can handle the entire Docker workflow for you.
Addendum
Why bother with pip or conda? This seems so much easier!
Well, Docker Desktop requires a paid license in larger businesses. A containerized environment also typically uses more resources than a plain pip or conda environment.
How can I configure Spark in there?
The easiest would be to use Docker Compose (see the configuration options here). One could also pass environment variables, or Spark settings via the --conf flag.
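A minimal sketch of such a docker-compose.yml; the service name, the mounted path, and the spark.driver.memory value are illustrative assumptions:

services:
  pyspark:
    image: spark:3.4.1-scala2.12-java11-python3-ubuntu
    # Spark settings can be passed as --conf flags to the pyspark shell
    command: /opt/spark/bin/pyspark --conf spark.driver.memory=2g
    environment:
      # Environment variables are another way to influence the setup,
      # e.g. which Python interpreter PySpark should use
      - PYSPARK_PYTHON=python3
    volumes:
      - ./:/opt/spark/work-dir
    stdin_open: true   # keep STDIN open for the interactive shell
    tty: true          # allocate a pseudo-TTY

Running docker compose run pyspark should then drop you into the interactive shell with those settings applied.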
Does this replicate a real cluster setup?
One could indeed use Docker to set up Spark in cluster mode. The Bitnami images are easier to use for this purpose.
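A rough sketch of a one-master, one-worker Compose file using the Bitnami image; the service names, ports, and worker resources are illustrative assumptions, not a vetted production setup:

services:
  spark-master:
    image: bitnami/spark:3.4.1
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"   # master web UI
      - "7077:7077"   # master RPC port
  spark-worker:
    image: bitnami/spark:3.4.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1g
      - SPARK_WORKER_CORES=1
    depends_on:
      - spark-master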
However, at this point, consider instead using one of the many Spark offerings available on the cloud.