PySpark jar dependencies
Published on Mar 18, 2024
2 min read
Prelude
You are using pyspark and require a Java/Scala dependency.
Prerequisites
- pyspark set up
JVM Dependencies
Practically, in pyspark, one can easily add dependencies dynamically before getOrCreate(). They are then automatically distributed to the driver and the worker nodes.
Published on Maven
Here the dependencies will be downloaded for you. They need to be comma-separated, in the Maven group:artifact:version format.
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0")
    .getOrCreate()
)
Local / Downloaded
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .config("spark.jars", "path-to-file.jar")
    .getOrCreate()
)
Alternatives
Recall that configuration can also be set via the CLI or config files. Note, however, that the precedence order might differ depending on how pyspark is run.
- CLI (spark-submit or pyspark) via --packages or --jars, for Maven coordinates or local jars, respectively;
- Spark config files, such as conf/spark-defaults.conf;
- the PYSPARK_SUBMIT_ARGS environment variable (see the sketch after this list).
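As a minimal sketch of the last alternative, PYSPARK_SUBMIT_ARGS can also be set from Python itself, as long as it happens before the first SparkSession is created. The Delta coordinates are just the example package from above; the trailing pyspark-shell token is what pyspark expects when the JVM is launched from a plain Python process.

import os

# Reuse the example Maven coordinates from above; the trailing "pyspark-shell"
# is required when pyspark builds the JVM command from this variable.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages io.delta:delta-core_2.12:2.2.0 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()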
Addendum
Hey, it does not download the dependencies! Why?
You are likely behind a proxy.
Since the download is handled by the JVM process, the proxy needs to be configured via JVM options.
The easiest way is to set the JAVA_TOOL_OPTIONS environment variable.
This needs to happen before running pyspark (which starts the JVM process).
These are the most relevant:
-Dhttp.proxyPort -Dhttp.proxyHost
-Dhttps.proxyPort -Dhttps.proxyHost
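As a minimal sketch, with placeholder proxy host and port values (adjust them to your environment), the variable can be set from Python before the first SparkSession is created, or equivalently exported in the shell before launching pyspark:

import os

# Placeholder proxy host and port -- adjust to your environment. This must run
# before the first SparkSession is created, since JAVA_TOOL_OPTIONS is only
# read when the JVM process starts.
os.environ["JAVA_TOOL_OPTIONS"] = (
    "-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 "
    "-Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080"
)

The SparkSession can then be created with spark.jars.packages as shown above.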
Does it download the dependencies on each run?
That depends on the cluster setup. A cache directory is used by default; however, on ephemeral clusters, such as Databricks Job Clusters, the cache starts empty, so yes, the dependencies will be downloaded on every run by default.
Note that for long-running clusters, cleaning up the cache might also be necessary.
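If you want more control over where that cache lives, one option is the spark.jars.ivy setting, which points Spark's Ivy cache at a directory of your choosing; the path below is only a placeholder, and whether it persists across runs depends on your cluster setup.

from pyspark.sql import SparkSession

# Placeholder path -- the Ivy cache used for spark.jars.packages downloads
# will be kept under this directory.
spark = (
    SparkSession
    .builder
    .config("spark.jars.ivy", "/path/to/persistent/ivy-cache")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0")
    .getOrCreate()
)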