PySpark jar dependencies
Published on Mar 18, 2024
2 min read
Prelude
You are using pyspark and require a Java/Scala dependency.
Prerequisites
- pyspark set up
JVM Dependencies
Practically, in pyspark, one can easily add dependencies dynamically before getOrCreate(). They are then automatically distributed to the driver and the worker nodes.
Published on Maven
Here the dependencies will be downloaded for you. They need to be comma-separated, in the Maven group:artifact:version format.
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0")
    .getOrCreate()
)
Local / Downloaded
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .config("spark.jars", "path-to-file.jar")
    .getOrCreate()
)
Alternatives
Recall that configuration can also be set via the CLI or config files. Note, however, that the precedence order might differ depending on how pyspark is run.
- CLI (spark-submit or pyspark) via --packages or --jars, for Maven coordinates or local jars, respectively;
- Spark config files, such as conf/spark-defaults.conf;
- the PYSPARK_SUBMIT_ARGS environment variable (see the sketch after this list).
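As a minimal sketch of the last alternative, PYSPARK_SUBMIT_ARGS can also be set from Python itself, as long as it happens before the first SparkSession is created. The Delta coordinates are just the example package from above; the trailing pyspark-shell token is what pyspark expects when the JVM is launched from a plain Python process.

import os

# Reuse the example Maven coordinates from above; the trailing "pyspark-shell"
# is required when pyspark builds the JVM command from this variable.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages io.delta:delta-core_2.12:2.2.0 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()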
Addendum
Hey, it does not download the dependencies! Why?
You are likely behind a proxy.
Since the download is handled by the JVM process, the proxy needs to be configured via JVM options.
The easiest way is to set the JAVA_TOOL_OPTIONS environment variable.
This needs to happen before running pyspark (which starts the JVM process).
These are the most relevant:
-Dhttp.proxyPort -Dhttp.proxyHost
-Dhttps.proxyPort -Dhttps.proxyHost
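As a minimal sketch, with placeholder proxy host and port values (adjust them to your environment), the variable can be set from Python before the first SparkSession is created, or equivalently exported in the shell before launching pyspark:

import os

# Placeholder proxy host and port -- adjust to your environment. This must run
# before the first SparkSession is created, since JAVA_TOOL_OPTIONS is only
# read when the JVM process starts.
os.environ["JAVA_TOOL_OPTIONS"] = (
    "-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 "
    "-Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080"
)

The SparkSession can then be created with spark.jars.packages as shown above.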
Does it download the dependencies on each run?
That depends on the cluster setup. A cache directory is used by default; however, on ephemeral clusters, such as Databricks Job Clusters, the cache starts empty, so yes, the dependencies will be downloaded on every run by default.
Note that for long-running clusters, cleaning up the cache might also be necessary.
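If you want more control over where that cache lives, one option is the spark.jars.ivy setting, which points Spark's Ivy cache at a directory of your choosing; the path below is only a placeholder, and whether it persists across runs depends on your cluster setup.

from pyspark.sql import SparkSession

# Placeholder path -- the Ivy cache used for spark.jars.packages downloads
# will be kept under this directory.
spark = (
    SparkSession
    .builder
    .config("spark.jars.ivy", "/path/to/persistent/ivy-cache")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0")
    .getOrCreate()
)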