Spark and all the versions

Published on Jan 15, 2024

Prelude

You have a (Python) Spark project. You want to migrate to a newer Spark version and need to first test it out. You want to ensure your code will work in production.

Perhaps you work in a corporate environment, where requesting the necessary tools is a long manual process and any mistake costs you weeks.

Is there a quick checklist I could go through?

Read on :)

Compatibility

Compiled from GitHub tags. For the Python libraries, these are the minimum versions.

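| Spark | Scala | Java | Python | pandas | numpy | pyarrow |
|-------|-------|------|--------|--------|-------|---------|
| 3.5 | 2.12 / 2.13 | 8 | 3.7 | 1.0.5 | 1.15 | 1.0.0 |
| 3.4.1 | 2.12 / 2.13 | 8 | 3.7 | 1.0.5 | 1.15 | 1.0.0 |
| 3.3.3 | 2.12 / 2.13 | 8 | 3.7 | 1.0.5 | 1.15 | 1.0.0 |
| 3.2.4 | 2.12 / 2.13 | 8 | 3.6 | 0.23.2 | 1.14 | 1.0.0 |
| 3.1.3 | 2.12 | 8 | 3.6 | 0.23.2 | 1.7 | 1.0.0 |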
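
If you would rather have a script tell you whether an environment clears these minimums, here is a minimal sketch. It assumes the Spark 3.4.1 row of the table above; `MINIMUMS`, `as_tuple`, `check`, and `installed_versions` are names made up for this example, and the version parsing is deliberately naive (it ignores pre-release suffixes).

```python
import sys
from importlib import metadata

# Minimum versions copied from the Spark 3.4.1 row of the table above.
MINIMUMS = {"pandas": "1.0.5", "numpy": "1.15", "pyarrow": "1.0.0"}
MIN_PYTHON = (3, 7)


def as_tuple(version):
    """Turn '1.21.5' into (1, 21, 5); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check(installed, minimums=MINIMUMS):
    """Return {package: problem} for every package that is missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        found = installed.get(name)
        if found is None:
            problems[name] = "not installed"
        elif as_tuple(found) < as_tuple(minimum):
            problems[name] = f"{found} < {minimum}"
    return problems


def installed_versions(names):
    """Look up installed versions; missing packages map to None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    if sys.version_info[:2] < MIN_PYTHON:
        print(f"Python {sys.version_info[:2]} is older than {MIN_PYTHON}")
    print(check(installed_versions(MINIMUMS)) or "all minimums satisfied")
```

Swap the values in `MINIMUMS` and `MIN_PYTHON` for the row that matches your target Spark version.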

Java

Technically, Spark also runs on Java 11 and 17. However, Java UDFs compiled with Java 11 or 17 do not run.
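
To find out which Java a machine actually picks up, something like the sketch below could help. It relies on `java -version` printing to stderr and on Java 8 and older reporting themselves as `1.8.x` while newer releases use `11.x` or `17.x`; the function names `java_major` and `installed_java_major` are invented for this example.

```python
import re
import subprocess


def java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Java 8 and older report themselves as "1.8.0_392"; Java 9+ as "17.0.9".
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    if first == "1" and second is not None:
        return int(second)  # legacy "1.x" scheme: the major version is x
    return int(first)


def installed_java_major():
    """Run `java -version` (which prints to stderr) and parse the result."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None  # no java on PATH
    return java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    print("Java major version:", installed_java_major())
```

If this prints anything other than 8 and you ship Java UDFs, that is worth a second look before you deploy.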

Databricks

Databricks uses Databricks Runtime environments (DBR) built using specific library versions. This means the versions they use are expected to work well together and can serve as great guidelines. A summary of the Long Term Support (LTS) versions:

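| DBR | Spark | Scala | Java | Python | pandas | numpy | pyarrow |
|-----|-------|-------|------|--------|--------|-------|---------|
| 13.3 LTS | 3.4.1 | 2.12.15 | Zulu 8.70.0.23 | 3.10.12 | 1.4.4 | 1.21.5 | 8.0.0 |
| 12.2 LTS | 3.3.2 | 2.12.15 | Zulu 8.68.0.21 | 3.9.5 | 1.4.2 | 1.21.5 | 7.0.0 |
| 11.3 LTS | 3.3.0 | 2.12.14 | Zulu 8.56.0.21 | 3.9.5 | 1.3.4 | 1.20.3 | 7.0.0 |
| 10.4 LTS | 3.2.1 | 2.12.14 | Zulu 8.56.0.21 | 3.8.10 | 1.2.4 | 1.20.1 | 4.0.0 |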

Addendum

Why even make this blog post?

There have been countless moments and projects where something crashed because of a version mismatch. Data types (looking at you, datetime) and upgrading to the latest versions have always been tricky to deal with.

Ever run pandas in production?

In those moments, I wished I had these lists easily accessible.

Can’t I just let pip handle the installation of pyspark?

Yes, but that only covers the versions of the Python libraries. See, e.g., the pyspark==3.4.1 requirements here.

Also, install pyspark[sql] rather than bare pyspark; otherwise pip will not constrain pandas and pyarrow to compatible versions.
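
If you are curious what the `sql` extra actually pulls in for the pyspark you have installed, you can ask the package metadata. This is a rough sketch: `importlib.metadata.requires` is standard library, while `requirements_for_extra` and `pyspark_sql_requirements` are helper names made up here, and the marker parsing only handles the simple `extra == "..."` form.

```python
import re
from importlib import metadata

# Matches the environment marker that gates a requirement behind an extra.
EXTRA_MARKER = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def requirements_for_extra(requirements, extra):
    """Keep only the requirement strings gated behind the given extra."""
    selected = []
    for line in requirements:
        marker = EXTRA_MARKER.search(line)
        if marker and marker.group(1) == extra:
            # Drop the marker part, keep e.g. "pandas>=1.0.5".
            selected.append(line.split(";")[0].strip())
    return selected


def pyspark_sql_requirements():
    """Read the 'sql' extra from the installed pyspark, if there is one."""
    try:
        declared = metadata.requires("pyspark") or []
    except metadata.PackageNotFoundError:
        return None  # pyspark is not installed in this environment
    return requirements_for_extra(declared, "sql")


if __name__ == "__main__":
    print(pyspark_sql_requirements())
```

A result of `None` just means pyspark is not installed in the environment you ran it from.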

Why care about pandas or pyarrow versions?

There is precedent for behavior changing unexpectedly between versions, even patch versions. The handling of datetime between Spark and Python/pandas is such an example [1] [2].

Code that runs on your laptop might not behave the same on your cluster, and then you spend days scratching your head over the source of the weird exceptions.
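
One cheap way to shorten that head scratching is to take the same snapshot on the laptop and on the cluster and compare the two. A minimal sketch, with `snapshot`, `diff`, and the `PACKAGES` list being names chosen for this example rather than anything Spark provides:

```python
import json
import platform
from importlib import metadata

PACKAGES = ("pyspark", "pandas", "numpy", "pyarrow")


def snapshot(packages=PACKAGES):
    """Collect the Python version and the installed version of each package."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


def diff(local, remote):
    """Return {name: (local_version, remote_version)} for every mismatch."""
    names = sorted(set(local) | set(remote))
    return {
        name: (local.get(name), remote.get(name))
        for name in names
        if local.get(name) != remote.get(name)
    }


if __name__ == "__main__":
    # Run this on the laptop and on the cluster, then compare the two outputs.
    print(json.dumps(snapshot(), indent=2))
```

Paste both JSON outputs into `diff` (or just eyeball them) and the mismatched library usually jumps out.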

Do I really need these versions specifically?

Using higher patch versions should be fine, but unless there is some specific newer feature you need, just pin these versions; don’t make it difficult.
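
Pinning can be as dull as generating a requirements file from one row of the Databricks table above. The sketch below does that; `DBR_PINS` and `requirements_lines` are made-up names, and pinning the open-source `pyspark` release to the Spark version listed for a runtime is an assumption on my part, since the runtime ships its own Spark build.

```python
# Library versions copied from the Databricks LTS table above.
DBR_PINS = {
    "13.3 LTS": {"pyspark": "3.4.1", "pandas": "1.4.4", "numpy": "1.21.5", "pyarrow": "8.0.0"},
    "12.2 LTS": {"pyspark": "3.3.2", "pandas": "1.4.2", "numpy": "1.21.5", "pyarrow": "7.0.0"},
    "11.3 LTS": {"pyspark": "3.3.0", "pandas": "1.3.4", "numpy": "1.20.3", "pyarrow": "7.0.0"},
    "10.4 LTS": {"pyspark": "3.2.1", "pandas": "1.2.4", "numpy": "1.20.1", "pyarrow": "4.0.0"},
}


def requirements_lines(runtime):
    """Build exact-version pins (name==version) for one runtime."""
    return [f"{name}=={version}" for name, version in DBR_PINS[runtime].items()]


if __name__ == "__main__":
    # Redirect this into requirements.txt for the runtime you target.
    print("\n".join(requirements_lines("13.3 LTS")))
```

Remember that the Python version itself is not something a requirements file can pin; match that one through your interpreter or environment manager.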

Wait, do I need Scala and Java on my machine to run pyspark?

Scala, no. Java, yes.

Where can I just download Java?

Examples:

  • Zulu manually here
  • apt-get install openjdk-8-jdk
  • brew install openjdk@8

References

Notice something wrong? Have an additional tip?

Contribute to the discussion here