Highlights from PyCon DE & PyData Berlin 2024

Published on May 6, 2024 · 9 min read


Prelude

Every Pythonista must have heard of PyCon, the annual conference series about everything Python. While it originated in the US, editions are now held in various locations around the globe. Focus on Python data projects, supported by NumFOCUS, and you get a PyData conference. This year in Europe, PyCon DE & PyData Berlin were held together on April 22-24. And yes, there will also be another PyData in London in mid-June :)

Every year there are thousands of participants and so many interesting talks that it is hard to choose! In this post, I want to highlight some of the trends I noticed, along with a few talks that stood out.

LLMs

Everything is LLM

LLMs are taking over; obviously, right?

There have been many talks on integrating LLMs into various workflows. We saw RAG helping doctors find information quicker across thousands of pages of medication instructions, and the challenges in making it good enough to provide value while being usable, feasible, and viable. These four criteria were in fact first presented by Marty Cagan in his book INSPIRED, which was highly recommended by the speaker. I definitely bumped it high up my reading list.
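For readers who have not touched RAG yet, the pattern itself is compact: retrieve the most relevant document chunks, then let the model answer grounded in them. Below is a minimal sketch of that loop; the embed and llm helpers are placeholders for whatever embedding and chat models you use, not a specific library API.

    import numpy as np

    def embed(text: str) -> np.ndarray:
        raise NotImplementedError("plug in an embedding model here")

    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in a chat model here")

    def answer(question: str, chunks: list[str], top_k: int = 3) -> str:
        # Retrieve: rank chunks by similarity to the question
        # (dot product, assuming normalized embeddings).
        q = embed(question)
        scored = sorted(chunks, key=lambda c: float(np.dot(q, embed(c))), reverse=True)
        context = "\n---\n".join(scored[:top_k])
        # Generate: ground the answer in the retrieved context only.
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        return llm(prompt)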

Another use case is extracting knowledge graphs from text with the help of LLMs. They prove highly capable at detecting entities, though they do require careful prompt engineering. As highlighted in other talks too, providing a well-formulated context first is critical to obtaining good outputs. Nevertheless, LLMs are still far from being an actual AI with reasoning capabilities; let us not forget that.
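To make the prompt-engineering point concrete, here is a minimal sketch of such an extraction prompt: the context is stated up front and the output format pinned down. The llm_complete helper is again a placeholder, and the schema is my own invention, not the one from the talk.

    import json

    def llm_complete(prompt: str) -> str:
        raise NotImplementedError("wire up your LLM client of choice here")

    PROMPT = (
        "You are an information-extraction system.\n"
        "The text below is a technical document; extract a knowledge graph "
        "from it as JSON with two keys:\n"
        '  "entities":  a list of {"name", "type"} objects\n'
        '  "relations": a list of {"source", "relation", "target"} objects\n'
        "Return only valid JSON.\n\nText:\n{text}\n"
    )

    def extract_graph(text: str) -> dict:
        # The up-front context (what the text is, what schema we expect)
        # is what makes the output machine-usable downstream.
        return json.loads(llm_complete(PROMPT.replace("{text}", text)))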

And there were many other talks in this area. Hermione’s time turner would have been great.

The same old software practices

The Engineer is, however, no expert in ML/AI. He sees things from the software and data perspective, and so drifted towards talks in that domain.

One observation is that building an LLM-backed project still comes with many of the same challenges as any other software product; and this is not a bad thing!

We have been building software since long before AI was a thing, and we can reuse some of the same techniques. Applications still need a UI for users to interact with, and the backend still requires well-configured CI/CD for development to be effective.

Where it does drift apart is that ML models need to be tested perhaps even more thoroughly than a simple code change, then further fine-tuned, and often deployed via techniques such as A/B testing to compare their effectiveness. It is no wonder that MLOps has become its own field with its own dedicated tools.

Every cloud provider now has an MLOps offering, plus there is a multitude of great open source alternatives; Neptune AI provides a great overview here, which looks fairly similar to the image below in terms of complexity.

MLOps tool landscape

At the conference, we saw a great example of how building an MVP can take as little as a day using both the cloud and open source; here, Azure ML along with MLflow and Streamlit. Recall how having a UI and being able to deploy quickly is critical?
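As a flavor of how little glue such an MVP needs, here is a minimal sketch of the Streamlit + MLflow side, assuming a model already sits in an MLflow registry; the model name, stage, and feature column are placeholders, and this is my own reconstruction, not the speaker's code.

    # app.py -- run with: streamlit run app.py
    import mlflow.pyfunc
    import pandas as pd
    import streamlit as st

    st.title("Model MVP")

    @st.cache_resource  # load the model once per server, not on every rerun
    def load_model():
        # "models:/my_model/Production" is a placeholder registry URI;
        # assumes MLFLOW_TRACKING_URI points at your MLflow server.
        return mlflow.pyfunc.load_model("models:/my_model/Production")

    value = st.number_input("Feature value", value=0.0)
    if st.button("Predict"):
        prediction = load_model().predict(pd.DataFrame({"feature": [value]}))
        st.write("Prediction:", prediction)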

Non-technical Challenges

Let us not ignore the non-technical challenges either. Projects need financial backing to exist, and it is not easy to justify the value they bring, particularly when things are moving so fast.

We saw a great talk on this by Hannes - the man behind our new favorite, DuckDB - which can already be found online and should serve as inspiration for those trying to bring something new into tech. “Money = Evil” stands on one of the slides, and the key takeaway is that something must be sacrificed in order for a project to gain money and thus survive, be it time, ownership, or control; a topic that not many want to talk about! How many tools have visibly changed once VCs got involved?

Money = Evil

Another talk pointed out the irony that despite how much value open source projects bring, people - or rather companies - are not willing to pay for them, nor to contribute to maintaining them. It is one of the reasons NumFOCUS exists, namely to sponsor our favorite tools: pandas, conda, and Jupyter, to name a few. Yet who’s to say this is entirely out of goodwill? Just take a look at how sponsorships come to be and you’ll remember Hannes’ talk.

Testing

Now here is The Engineer’s turf. We have already seen the need to follow good software practices when it comes to AI, but that does not mean dealing with data and being able to deliver value fast is any less important; it is perhaps even more important!

Knowing when and what kind of tests bring the most value is critical. It is easy to fall into the trap of writing unit tests for every single function, plus integration & regression tests, end-to-end tests, and so on; but are all of them really needed in every project, regardless of its size (in both code and team), domain, exposure, or tech stack? Where is the boundary between “this helps prevent downtime” and “this is slowing development significantly”?

Test Speed vs Safety

The speaker provided some helpful rules of thumb:

  • consider doing a risk analysis against the ISO 25010 product quality model, particularly for products directly impacting users; pick the top areas that provide value and test those first
  • prefer broader coverage over narrow, in-depth testing
  • do not forget to unit test the bad scenarios too, i.e. when things are expected to fail (see the sketch after this list)
  • bugs should not reach users, but when they do, have detailed-enough crash reports and ways for users to provide feedback
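On that third point, here is a minimal pytest illustration of what testing the bad scenarios means; the parse_ratio function is a toy of my own, not from the talk.

    import pytest

    def parse_ratio(text: str) -> float:
        """Parse an 'a/b' ratio string; a toy function for illustration."""
        numerator, denominator = text.split("/")
        return int(numerator) / int(denominator)

    def test_parse_ratio_happy_path():
        assert parse_ratio("1/4") == 0.25

    # The "bad scenarios": assert that failures are loud and predictable.
    def test_parse_ratio_rejects_garbage():
        with pytest.raises(ValueError):
            parse_ratio("not a ratio")

    def test_parse_ratio_rejects_zero_denominator():
        with pytest.raises(ZeroDivisionError):
            parse_ratio("1/0")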

pydantic

In the same general direction, pydantic popped up several times as the go-to tool for validating data interfaces, not only in connection to APIs, but also when configuring an app. While fully-fledged config managers exist, such as dynaconf or hydra, sometimes one needs to tackle things slightly differently. In this talk, the speaker walks through in-depth validation of the configuration of several apps while still allowing defaults and overrides.
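The flavor of it, as a minimal sketch assuming pydantic v2; the fields and the validator are invented for illustration, not taken from the talk:

    from pydantic import BaseModel, Field, ValidationError, field_validator

    class AppConfig(BaseModel):
        host: str = "localhost"                  # defaults...
        port: int = Field(8000, ge=1, le=65535)  # ...with constraints
        debug: bool = False

        @field_validator("host")
        @classmethod
        def host_must_not_be_empty(cls, v: str) -> str:
            if not v.strip():
                raise ValueError("host must not be empty")
            return v

    # Overrides (e.g. loaded from YAML or the environment) are validated in depth:
    config = AppConfig(**{"port": 9000})
    print(config)  # host='localhost' port=9000 debug=False

    try:
        AppConfig(port=70_000)  # out of range -> caught at startup, not at 3 a.m.
    except ValidationError as exc:
        print(exc)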

Testing SQL

Another talk went over testing SQL queries, a topic that sounds easy, but veterans will acknowledge it is not always straightforward to set up. There are many databases and SQL flavors out there. Typically one needs a live instance to run the queries against without affecting PROD performance. The queries should also run in isolation to allow developers to, well, do their jobs uninterrupted.

How many reading this are thinking of a past - or current! - project with a single DB instance that everyone is competing for? The speaker worked on a simple SQL testing framework based on two key aspects:

  1. Single rows of base test data. Rows for other use cases are then derived from these, but always by modifying specific column values of the base data.
  2. Overriding the tables used in SQL queries. Essentially, virtual tables are created via CTEs and injected into the existing queries; these virtual tables are namespaced to ensure isolation (a minimal sketch of the idea follows below).
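Here is a minimal sketch of both aspects combined, assuming PostgreSQL-style VALUES lists; it simply shadows the table name with a CTE rather than doing the proper namespacing of the speaker's framework, and all names are my own.

    BASE_CUSTOMER = {"id": 1, "name": "Alice", "country": "DE"}  # one base row

    def derive(base: dict, **overrides) -> dict:
        """Derive a test row by overriding specific columns of the base row."""
        return {**base, **overrides}

    def with_virtual_table(query: str, table: str, rows: list[dict]) -> str:
        """Shadow `table` in `query` with an inline CTE built from `rows`."""
        columns = list(rows[0])
        # repr() happens to produce valid literals for the ints/strings here;
        # a real framework would do proper SQL quoting.
        values = ",\n        ".join(
            "(" + ", ".join(repr(row[c]) for c in columns) + ")" for row in rows
        )
        return (
            f"WITH {table} ({', '.join(columns)}) AS (\n"
            f"    VALUES\n        {values}\n)\n{query}"
        )

    rows = [BASE_CUSTOMER, derive(BASE_CUSTOMER, id=2, country="FR")]
    print(with_virtual_table(
        "SELECT country, count(*) FROM customers GROUP BY country",
        "customers", rows,
    ))

In PostgreSQL, a CTE with the same name takes precedence over the real table within the query, which is exactly what makes the injection work.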

Nevertheless, an instance of the database to run those queries against is still required; one that hopefully can be spun up locally, thus leaving PROD performance unaffected.

Rust

Rust is used in more and more Python projects. While no longer new, Polars is a prime example, showing great performance on smaller datasets; even better than DuckDB on some queries! The speaker here detailed the recent speed-ups in Dask via techniques such as column projection and predicate pushdown. Sound familiar? Some will immediately think of Spark, though these techniques have been studied for far longer.
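For those unfamiliar with the two techniques, here is a minimal sketch using Polars' lazy API rather than Dask, simply because the optimized plan is easy to inspect there; the file and column names are invented.

    import polars as pl

    lazy = (
        pl.scan_parquet("events.parquet")    # nothing is read yet
        .filter(pl.col("country") == "DE")   # predicate: pushed down to the scan
        .select("user_id", "amount")         # projection: only needed columns
        .group_by("user_id")
        .agg(pl.col("amount").sum())
    )

    # The optimizer rewrites the plan so the scan skips irrelevant columns
    # and row groups before any data reaches memory.
    print(lazy.explain())
    df = lazy.collect()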

“All tools suck at something”, mentions the speaker, who further evaluates the performance of our favorite data processing tools over the TPC-H queries. DuckDB comes out as a great all-rounder, on all dataset sizes up to a couple of TB. It however struggles on particular queries over larger datasets, where Dask and Spark still shine, with Spark requiring more memory. Polars finally holds the trophy on datasets of up to a couple hundred GB.

All tools suck at something

Polars is also continuously extended with further functionality, such as time series and custom plugins. The speaker cleverly provides examples which would be much more complicated to implement in pandas but are a breeze in Polars.
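In that spirit, a small toy example of my own (not the speaker's) of the built-in time-series support: calendar-aware windowing via group_by_dynamic.

    from datetime import datetime
    import polars as pl

    df = pl.DataFrame({
        "ts": [datetime(2024, 4, 22, h) for h in (0, 6, 12, 18)]
              + [datetime(2024, 4, 23, h) for h in (0, 12)],
        "value": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0],
    })

    daily = (
        df.sort("ts")
        .group_by_dynamic("ts", every="1d")  # calendar-day buckets
        .agg(pl.col("value").mean().alias("daily_mean"))
    )
    print(daily)  # one row per day: 2.5 and 15.0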

Other project examples include pixi, another attempt at a Python package manager, meant to replace pip and conda while providing incredible speed thanks to its Rust backend. The speaker is also behind the mamba project, a similar package manager implemented in C++. At this point though, I must admit: how many package managers until we have the one and only?

One and Only package manager

Side-track: Spark on Rust, anyone?

All these talks on data processing, Rust, and Arrow made me wonder: why don't we have alternatives to Spark written in Rust? Which we do! Or do we? There was an initiative called Ballista which, however, as of 2023, seems to be dead. Despite it apparently having been moved under Apache as part of DataFusion, we don't seem to hear much about it.

This is a shame because, even though we see machines growing larger and larger, at some point most tools struggle, and only distributed computing, such as with Spark, may still perform well.

We should however not forget that the vast majority of businesses do not have data big enough for such processing. As highlighted in this boldly-named article by none other than a founding engineer of BigQuery, the average is more around the 100 GB mark, perfect for a typical data warehouse.

Most businesses do not operate at Google scale, as exemplified in the graph below, and even when data does accumulate, most queries do not touch all the historical data, just the most recent day or year. On top of that, we have already mentioned optimization techniques such as predicate pushdown, which shrink the data scanned even further.

Data Size

But who knows, Spark on Rust might one day become a thing.

Other topics

Going into full detail on all the talks would definitely be beyond the point of this article. For those curious, here are some others I thought were interesting:

All (most?) talks should eventually become available on YouTube :)

Addendum

The conference was a great opportunity to learn many new things and meet like-minded individuals. The organizers did an amazing job catering to all needs, and let us not forget the many volunteers who helped keep all sessions on track. We saw a mix of topics from AI to deep Python internals, along with tangents from the industry, be it mathematics or the practicalities of maintaining an open source project. A highly recommended event.

Notice something wrong? Have an additional tip?

Contribute to the discussion here