PySpark with Docker

Published on Feb 19, 2024 · 1 min read


Prelude

You want to run something on PySpark. You want to use Docker.

Prerequisites

  • a machine
  • a terminal
  • Docker installed

Terminal

We use the official Apache Spark image here.

docker run -it spark:3.4.1-scala2.12-java11-python3-ubuntu /opt/spark/bin/pyspark

To run existing code, consider a bind mount that maps the code on your laptop/host to a directory in the container, as sketched below.
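For example, a minimal sketch assuming your code lives in the current directory on the host and that a script named my_job.py (a hypothetical file name) should run inside the container:

# mount the current directory into the container and submit a script from it
docker run -it \
  -v "$(pwd)":/opt/spark/work-dir \
  spark:3.4.1-scala2.12-java11-python3-ubuntu \
  /opt/spark/bin/spark-submit /opt/spark/work-dir/my_job.py

The mount target /opt/spark/work-dir is just one convenient choice inside the official image; any path in the container works, as long as you point spark-submit (or pyspark) at it.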

Alternatively, one could use a Docker IDE plugin that handles the entire Docker workflow for you.

Addendum

Why bother with pip or conda? This seems so much easier!

Well, Docker Desktop requires a paid subscription for larger businesses. A containerized environment also typically uses more resources.

How can I configure Spark in there?

The easiest route is Docker Compose, see the configuration options here. One could also pass Spark options on the command line or as environment variables, as sketched below.
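For instance, a minimal sketch that hands settings straight to the pyspark shell via its spark-submit-style flags (the shuffle-partitions value here is just an illustration):

# configure the driver memory and an arbitrary Spark property at launch
docker run -it \
  spark:3.4.1-scala2.12-java11-python3-ubuntu \
  /opt/spark/bin/pyspark \
  --driver-memory 2g \
  --conf spark.sql.shuffle.partitions=8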

Does this replicate a real cluster setup?

One could indeed use Docker to set up Spark in cluster mode; the Bitnami images are easier to use for this purpose. A sketch follows below.
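As a rough sketch, assuming the Bitnami image's documented SPARK_MODE and SPARK_MASTER_URL environment variables and the default master port 7077:

# a standalone master and one worker on a shared Docker network
docker network create spark-net
docker run -d --name spark-master --network spark-net \
  -e SPARK_MODE=master \
  bitnami/spark:3.4.1
docker run -d --name spark-worker --network spark-net \
  -e SPARK_MODE=worker \
  -e SPARK_MASTER_URL=spark://spark-master:7077 \
  bitnami/spark:3.4.1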

However, at this point, consider instead using one of the many Spark offerings available on the cloud.

Notice something wrong? Have an additional tip?

Contribute to the discussion here