5 Ways to add a new column in a PySpark Dataframe
Too much data is getting generated day by day.
Although we can sometimes manage our big data using tools like Rapids or parallelization, Spark is an excellent tool to have in your repertoire if you are working with terabytes of data.
In my last post on Spark, I explained how to work with PySpark RDDs and Dataframes.
Although that post explains a lot about how to work with RDDs and basic Dataframe operations, I missed quite a lot when it comes to working with PySpark Dataframes.
It was only when I needed more functionality that I read up and found multiple ways to do one single thing.
How to create a new column in Spark?
Now, this might sound trivial, but believe me, it isn’t. With so much you might want to do with your data, I am pretty sure you will end up using most of these column creation processes in your workflow: sometimes to utilize Pandas functionality, occasionally to use RDD-based partitioning, and sometimes to make use of the mature Python ecosystem.
This post is…