PySpark with Docker

Published on Feb 19, 2024 · 1 min read


Prelude

You want to run something on PySpark. You want to use Docker.

Prerequisites

  • a machine
  • a terminal
  • Docker installed

Terminal

We use the official Apache Spark image here.

docker run -it spark:3.4.1-scala2.12-java11-python3-ubuntu /opt/spark/bin/pyspark

To run existing code, consider a bind mount that maps the code on your laptop/host to a directory in the container, as sketched below.
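For example, a minimal sketch assuming your code lives in the current directory on the host and that a script named my_job.py (a hypothetical file name) should run inside the container:

# mount the current directory into the container and submit a script from it
docker run -it \
  -v "$(pwd)":/opt/spark/work-dir \
  spark:3.4.1-scala2.12-java11-python3-ubuntu \
  /opt/spark/bin/spark-submit /opt/spark/work-dir/my_job.py

The mount target /opt/spark/work-dir is just one convenient choice inside the official image; any path in the container works, as long as you point spark-submit (or pyspark) at it.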

Alternatively, one could use a Docker IDE plugin that handles the entire Docker workflow for you.

Addendum

Why bother with pip or conda? This seems so much easier!

Well, Docker Desktop requires a paid subscription for larger businesses. A containerized environment also typically uses more resources.

How can I configure Spark in there?

The easiest route is Docker Compose, see the configuration options here. One could also pass Spark options on the command line or as environment variables, as sketched below.
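For instance, a minimal sketch that hands settings straight to the pyspark shell via its spark-submit-style flags (the shuffle-partitions value here is just an illustration):

# configure the driver memory and an arbitrary Spark property at launch
docker run -it \
  spark:3.4.1-scala2.12-java11-python3-ubuntu \
  /opt/spark/bin/pyspark \
  --driver-memory 2g \
  --conf spark.sql.shuffle.partitions=8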

Does this replicate a real cluster setup?

One could indeed use Docker to set up Spark in cluster mode; the Bitnami images are easier to use for this purpose. A sketch follows below.
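As a rough sketch, assuming the Bitnami image's documented SPARK_MODE and SPARK_MASTER_URL environment variables and the default master port 7077:

# a standalone master and one worker on a shared Docker network
docker network create spark-net
docker run -d --name spark-master --network spark-net \
  -e SPARK_MODE=master \
  bitnami/spark:3.4.1
docker run -d --name spark-worker --network spark-net \
  -e SPARK_MODE=worker \
  -e SPARK_MASTER_URL=spark://spark-master:7077 \
  bitnami/spark:3.4.1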

However, at this point, consider instead using one of the many Spark offerings available on the cloud.

Notice something wrong? Have an additional tip?

Contribute to the discussion here