If we want to use a single Apache Spark notebook to process a list of tables, the straightforward code is easy to write: a simple loop that processes one table after another (sequentially). But if none of the tables is very large, it is faster to have Spark load them concurrently (in parallel) using multithreading. There are a few ways to do this; here I share the easiest approach I have found when working with a PySpark notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
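As a quick illustration, here is a minimal sketch of the pattern covered in the video, assuming a PySpark notebook where the spark session is predefined; the table names, source path, and target schema are hypothetical placeholders:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of small tables to load.
tables = ["customers", "orders", "products", "stores"]

def load_table(table_name):
    # Read a source file and save it as a table.
    # The path, format, and target schema are assumptions for illustration.
    df = spark.read.parquet(f"/mnt/raw/{table_name}")
    df.write.mode("overwrite").saveAsTable(f"bronze.{table_name}")
    return table_name

# Sequential version: a plain loop processes one table at a time.
# for t in tables:
#     load_table(t)

# Multithreaded version: submitting Spark jobs from multiple Python threads
# is safe, so a thread pool lets several small loads run concurrently on
# the same cluster instead of waiting on each other.
with ThreadPoolExecutor(max_workers=4) as executor:
    for name in executor.map(load_table, tables):
        print(f"Finished loading {name}")

Note the thread pool only parallelizes job submission; Spark still schedules the underlying work across the cluster, so this helps most when each table is small enough that a single load cannot keep the cluster busy on its own.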
Written tutorial and links to code:
https://dustinvannoy.com/2022/05/06/p...
More from Dustin:
Website: https://dustinvannoy.com
LinkedIn: https://www.linkedin.com/in/dustinvannoy
Twitter: https://twitter.com/dustinvannoy
Github: https://github.com/datakickstart
CHAPTERS:
0:00 Intro and Use Case
1:05 Code example single thread
4:36 Code example multithreaded
7:15 Demo run - Databricks
8:46 Demo run - Azure Synapse
11:48 Outro