ericxiao251/spark-syntax
This is a repo documenting the best practices in PySpark.
| | |
| --- | --- |
| repo name | ericxiao251/spark-syntax |
| repo link | https://github.com/ericxiao251/spark-syntax |
| homepage | https://ericxiao251.github.io/spark-syntax/ |
| language | Jupyter Notebook |
| size (curr.) | 4858 kB |
| stars (curr.) | 389 |
| created | 2017-08-27 |
| license | |
Spark-Syntax
This is a public repo documenting all of the “best practices” of writing PySpark code that I have learned from working with PySpark for 3 years. It mainly focuses on the Spark DataFrame and SQL library.

You can also visit ericxiao251.github.io/spark-syntax/ for an online book version.
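
If you want a quick taste of the syntax covered here before diving into the notebooks, below is a minimal, illustrative sketch (not an excerpt from any chapter) of the kind of DataFrame code walked through in Chapters 1 and 2. It assumes a local Spark installation with `pyspark` available:

```python
# Illustrative sketch only: create a small DataFrame and apply a few of the
# transformations covered in Chapter 2 (coalesce, when/otherwise, filtering).
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("spark-syntax-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 23), (2, "Bob", None)],
    ["id", "name", "age"],
)

(
    df
    .withColumn("age", F.coalesce(F.col("age"), F.lit(0)))                      # 2.2.9 - filling in null values
    .withColumn("is_adult", F.when(F.col("age") >= 18, True).otherwise(False))  # 2.2.8 - case statements
    .where(F.col("name").isNotNull())                                           # 2.2.6/2.2.7 - filtering and null comparisons
    .show()                                                                     # 2.2.1 - looking at your data
)
```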
Contributing/Topic Requests
If you notice anything that could be improved (typos, spelling, grammar, etc.), feel free to create a PR and I'll review it 😁; you'll most likely be right.
If you have any topics that I could potentially go over, please create an issue and describe the topic. I’ll try my best to address it 😁.
Acknowledgement
Huge thanks to Levon for turning everything into a GitBook. You can follow his GitHub at https://github.com/tumregels.
Table of Contents:

Chapter 1 - Getting Started with Spark:

- 1.1 - Useful Material
- 1.2 - Creating your First DataFrame
- 1.3 - Reading your First Dataset
- 1.4 - More Comfortable with SQL?

Chapter 2 - Exploring the Spark APIs:

- 2.1 - Non-Trivial Data Structures in Spark
    - 2.1.1 - Struct Types (`StructType`)
    - 2.1.2 - Arrays and Lists (`ArrayType`)
    - 2.1.3 - Maps and Dictionaries (`MapType`)
    - 2.1.4 - Decimals and Why did my Decimals overflow :( (`DecimalType`)
- 2.2 - Performing your First Transformations
    - 2.2.1 - Looking at Your Data (`collect`/`head`/`take`/`first`/`toPandas`/`show`)
    - 2.2.2 - Selecting a Subset of Columns (`drop`/`select`)
    - 2.2.3 - Creating New Columns and Transforming Data (`withColumn`/`withColumnRenamed`)
    - 2.2.4 - Constant Values and Column Expressions (`lit`/`col`)
    - 2.2.5 - Casting Columns to a Different Type (`cast`)
    - 2.2.6 - Filtering Data (`where`/`filter`/`isin`)
    - 2.2.7 - Equality Statements in Spark and Comparisons with Nulls (`isNotNull()`/`isNull()`)
    - 2.2.8 - Case Statements (`when`/`otherwise`)
    - 2.2.9 - Filling in Null Values (`fillna`/`coalesce`)
    - 2.2.10 - Spark Functions aren't Enough, I Need my Own! (`udf`/`pandas_udf`)
    - 2.2.11 - Unionizing Multiple Dataframes (`union`)
    - 2.2.12 - Performing Joins (clean one) (`join`)
- 2.3 - More Complex Transformations
    - 2.3.1 - One to Many Rows (`explode`)
    - 2.3.2 - Range Join Conditions (WIP) (`join`)
- 2.4 - Potential Performance Boosting Functions
    - 2.4.1 - (`repartition`)
    - 2.4.2 - (`coalesce`)
    - 2.4.3 - (`cache`)
    - 2.4.4 - (`broadcast`)

Chapter 3 - Aggregates:

- 3.1 - Clean Aggregations
- 3.2 - Non Deterministic Behaviours

Chapter 4 - Window Objects:

Chapter 5 - Error Logs:

Chapter 6 - Understanding Spark Performance:

- 6.1 - Primer to Understanding Your Spark Application
    - 6.1.1 - Understanding how Spark Works
    - 6.1.2 - Understanding the SparkUI
    - 6.1.3 - Understanding how the DAG is Created
    - 6.1.4 - Understanding how Memory is Allocated
- 6.2 - Analyzing your Spark Application
    - 6.2.1 - Looking for Skew in a Stage
    - 6.2.2 - Looking for Skew in the DAG
    - 6.2.3 - How to Determine the Number of Partitions to Use
- 6.3 - How to Analyze the Skew of Your Data

Chapter 7 - High Performance Code:

- 7.0 - The Types of Join Strategies in Spark
    - 7.0.1 - You got a Small Table? (`Broadcast Join`)
    - 7.0.2 - The Ideal Strategy (`BroadcastHashJoin`)
    - 7.0.3 - The Default Strategy (`SortMergeJoin`)
- 7.1 - Improving Joins
    - 7.1.1 - Filter Pushdown
    - 7.1.2 - Joining on Skewed Data (Null Keys)
    - 7.1.3 - Joining on Skewed Data (High Frequency Keys I)
    - 7.1.4 - Joining on Skewed Data (High Frequency Keys II)
    - 7.1.5 - Join Ordering
- 7.2 - Repeated Work on a Single Dataset (`caching`)
    - 7.2.1 - caching layers
- 7.3 - Spark Parameters
    - 7.3.1 - Running Multiple Spark Applications at Scale (`dynamic allocation`)
    - 7.3.2 - The magical number `2001` (`partitions`)
    - 7.3.3 - Using a lot of `UDF`s? (`python memory`)
- 7. - Bloom Filters :o?
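
As a small, illustrative sketch of the join strategies discussed in Chapter 7 (not an excerpt from the chapters), the snippet below uses the `broadcast` hint so the small table is shipped to every executor, which typically produces a `BroadcastHashJoin` in the physical plan instead of the default `SortMergeJoin`:

```python
# Illustrative sketch only: hinting a broadcast join for a small dimension table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "US", 100.0), (2, "CA", 50.0), (3, "US", 75.0)],
    ["order_id", "country_code", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("CA", "Canada")],
    ["country_code", "country_name"],
)

# Broadcasting the small table avoids shuffling the (potentially huge) orders table.
joined = orders.join(broadcast(countries), on="country_code", how="left")

joined.explain()  # look for BroadcastHashJoin in the physical plan
joined.show()
```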