The Most Complete Guide to PySpark DataFrames

Jun 24, 2020

Big Data has become synonymous with data engineering, but the line between data engineering and data science is blurring by the day. At this point, I think Big Data skills belong in every data scientist's repertoire.

The reason: more and more data is being generated every day.

And that brings us to Spark, one of the most widely used tools for working with Big Data.

While Spark once relied heavily on RDD manipulations, it now provides a DataFrame API for us data scientists to work with. Here is the documentation for the adventurous folks. But while the documentation is good, it does not explain things from a data scientist's perspective, nor does it properly document the most common data science use cases.

In this post, I will talk about installing Spark, the standard Spark functionality you will need to work with DataFrames, and finally some tips for handling the inevitable errors you will face.

This post is going to be quite long, actually one of my longest posts on Medium, so go on and pick up a coffee.

Also here is the Table of Contents, if you want to skip to a specific section:

Installation
Data
Basic Functions
- Read
- See a few rows in the file
- Change Column Names
- Select Columns
- Sort
- Cast
- Filter
- GroupBy
- Joins
Broadcast/Map Side Joins
Use SQL with DataFrames
Create New Columns
- Using Spark Native Functions
- Using Spark UDFs
- Using RDDs
- Using Pandas UDF
Spark Window Functions
- Ranking
- Lag Variables
- Rolling Aggregations
Pivot Dataframes
Unpivot/Stack Dataframes
Salting
Some More Tips and Tricks
- Caching
- Save and Load from an intermediate step
- Repartitioning
- Reading Parquet File in Local
Conclusion
