Spark and all the versions
Published on Jan 15, 2024
2 min read
Prelude
You have a (Python) Spark project. You want to migrate to a newer Spark version and need to first test it out. You want to ensure your code will work in production.
Perhaps you work in a corporate environment where requesting the necessary tools is a long, manual process and any mistake costs you weeks.
Is there a quick checklist I could go through?
Read on :)
Compatibility
Compiled from the GitHub tags. For the Python libs, the listed versions are minimums. (A snippet to check your own environment against this table follows it.)
Spark | Scala | Java | Python | pandas | numpy | pyarrow |
---|---|---|---|---|---|---|
3.5 | 2.12 / 2.13 | 8 | 3.7 | 1.0.5 | 1.15 | 1.0.0 |
3.4.1 | 2.12 / 2.13 | 8 | 3.7 | 1.0.5 | 1.15 | 1.0.0 |
3.3.3 | 2.12 / 2.13 | 8 | 3.7 | 1.0.5 | 1.15 | 1.0.0 |
3.2.4 | 2.12 / 2.13 | 8 | 3.6 | 0.23.2 | 1.14 | 1.0.0 |
3.1.3 | 2.12 | 8 | 3.6 | 0.23.2 | 1.7 | 1.0.0 |
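To compare against the table, a quick way is to print what is actually installed in the environment you are about to run on. A minimal sketch, assuming pyspark, pandas, numpy and pyarrow are already installed:

```python
# Minimal sketch: print the versions in the current environment so they can be
# compared against the compatibility table above.
import sys

import numpy
import pandas
import pyarrow
import pyspark

print("Python :", sys.version.split()[0])
print("PySpark:", pyspark.__version__)
print("pandas :", pandas.__version__)
print("numpy  :", numpy.__version__)
print("pyarrow:", pyarrow.__version__)
```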
Java
Technically, Spark also runs on Java 11/17. However, Java UDFs compiled on 11/17 do not run.
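If you are unsure which JVM your PySpark session actually runs on, you can ask it directly. A small sketch; note that `_jvm` is an internal py4j handle, so treat this as a local debugging aid rather than a stable API:

```python
# Hedged sketch: ask the running SparkSession which Java version its JVM uses.
# sparkContext._jvm is a private py4j attribute, fine for local inspection only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
print(spark.sparkContext._jvm.System.getProperty("java.version"))  # e.g. "1.8.0_392"
spark.stop()
```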
Databricks
Databricks uses Databricks Runtime (DBR) environments built with specific library versions. This means the versions they ship are expected to work well together and can serve as great guidelines. A summary of the Long Term Support (LTS) versions:
DBR | Spark | Scala | Java | Python | pandas | numpy | pyarrow |
---|---|---|---|---|---|---|---|
13.3 LTS | 3.4.1 | 2.12.15 | Zulu 8.70.0.23 | 3.10.12 | 1.4.4 | 1.21.5 | 8.0.0 |
12.2 LTS | 3.3.2 | 2.12.15 | Zulu 8.68.0.21 | 3.9.5 | 1.4.2 | 1.21.5 | 7.0.0 |
11.3 LTS | 3.3.0 | 2.12.14 | Zulu 8.56.0.21 | 3.9.5 | 1.3.4 | 1.20.3 | 7.0.0 |
10.4 LTS | 3.2.1 | 2.12.14 | Zulu 8.56.0.21 | 3.8.10 | 1.2.4 | 1.20.1 | 4.0.0 |
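If you want your local environment to mirror one of these LTS rows, a small check can catch drift early. A sketch pinned to the DBR 13.3 LTS row, with the version strings taken straight from the table above; adjust them for the runtime you target:

```python
# Hedged sketch: compare the local environment against the DBR 13.3 LTS row
# from the table above and report any mismatches.
import numpy
import pandas
import pyarrow
import pyspark

expected = {
    "pyspark": (pyspark, "3.4.1"),
    "pandas": (pandas, "1.4.4"),
    "numpy": (numpy, "1.21.5"),
    "pyarrow": (pyarrow, "8.0.0"),
}
for name, (module, version) in expected.items():
    if module.__version__ != version:
        print(f"{name}: have {module.__version__}, DBR 13.3 LTS ships {version}")
```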
Addendum
Why even make this blog post?
There have been countless moments and projects where something crashed because of a version mismatch. Data types (looking at you, `datetime`) and upgrading to the latest versions have always been tricky to deal with.
Ever run pandas in production?
In those moments, I wish I had these lists easily accessible.
Can’t I just let pip handle the installation of pyspark?
Yes, but that only covers the Python library versions.
See, e.g., the `pyspark==3.4.1` requirements here.
Also, you should install `pyspark[sql]` instead of barebones `pyspark`, otherwise pip will not manage pandas and pyarrow to compatible versions.
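To see which dependency versions pip actually resolves for your pyspark, you can read the metadata of the installed distribution. A sketch using only the standard library; the exact requirement strings depend on the pyspark version you have installed:

```python
# Hedged sketch: list the requirements declared by the installed pyspark
# distribution, including those gated behind extras such as "sql".
from importlib.metadata import requires

for requirement in requires("pyspark") or []:
    print(requirement)  # e.g. pandas>=1.0.5; extra == "sql"
```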
Why care about pandas or pyarrow versions?
There are precedents of unexpected behavior between versions, even patch versions. The handling of datetime between Spark and Python/pandas is one such example [1] [2].
Code that runs fine in your laptop environment might not behave the same on your cluster, and then days of head-scratching ensue while you track down the source of the weird exceptions.
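The place this usually bites is the Spark to pandas conversion path, which goes through pyarrow when Arrow is enabled. A small round-trip worth running after any upgrade; a sketch, not an exhaustive test:

```python
# Hedged sketch: round-trip a timestamp between Spark and pandas and inspect
# what comes back; dtype and timezone behaviour has varied across versions.
from datetime import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # via pyarrow

df = spark.createDataFrame([(datetime(2024, 1, 15, 12, 0),)], ["ts"])
pdf = df.toPandas()                        # Spark -> pandas
print(pdf.dtypes)                          # which dtype did we actually get?
print(spark.createDataFrame(pdf).schema)   # and the round-trip back to Spark
spark.stop()
```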
Do I really need these versions specifically?
Using higher patch versions should be fine, but unless there is some specific newer feature you need, just pin these versions; don’t make it difficult.
Wait, do I need Scala and Java on my machine to run pyspark?
Scala, no. Java, yes.
Where can I just download Java?
Examples:
- Zulu manually here
- `apt-get install openjdk-8-jdk`
- `brew install openjdk@8`