MLWhiz | AI Unwrapped


Learning Spark using Python: Basics and Applications

Rahul Agarwal
Sep 07, 2015


I regularly have a use case for Hadoop in my daily job. It has made my life easier in the sense that I can get results I could never get with SQL queries. But I still find it painfully slow. The work I do is procedural: merge these two datasets, then filter, then merge another dataset, then filter on some condition, and so on. You get the gist. And in Hadoop it's painstakingly tedious to do this. You end up writing maybe three or more MapReduce jobs, each of which reads the data line by line and writes its output back to disk.

There is a lot of data movement between these jobs, which slows things down further. Another thing I hate is that there is no straightforward way to pass files to the mappers and reducers, which generally adds yet another MapReduce job to the whole sequence.

And that is just for procedural tasks. Implementing an iterative algorithm, even after you have worked out the parallelization logic, is another challenge. There would be a lot of MapReduce…
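For contrast, here is roughly how the merge-and-filter chain described above looks in PySpark. This is a minimal sketch, not the exact pipeline from my job: the file paths, the tab-separated (key, value) layout, and the filter conditions are all hypothetical placeholders.

```python
# A minimal sketch of the merge-then-filter pipeline described above,
# written against PySpark's RDD API. File paths, the tab-separated
# (key, value) layout, and the filter conditions are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="merge_filter_sketch")

def parse(line):
    # Assume each line is "key<TAB>value".
    key, value = line.split("\t", 1)
    return (key, value)

a = sc.textFile("hdfs:///data/a.tsv").map(parse)
b = sc.textFile("hdfs:///data/b.tsv").map(parse)

# Merge the first two datasets on the key, then filter on a condition.
ab = a.join(b).filter(lambda kv: kv[1][1] != "")

# Merge a third dataset and filter again. No extra hand-written jobs,
# and intermediate results stay in memory instead of hitting the disk.
c = sc.textFile("hdfs:///data/c.tsv").map(parse)
result = ab.join(c).filter(lambda kv: kv[1][1].startswith("x"))

result.saveAsTextFile("hdfs:///output/merged_filtered")
```

Nothing actually runs until saveAsTextFile is called: Spark builds the whole chain lazily and schedules the stages itself, which is exactly the part that costs you several hand-written MapReduce jobs in Hadoop.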
