More and more data is being generated every day. Although we can sometimes manage big data with tools like Rapids or parallelization, Spark is an excellent tool to have in your repertoire if you are working with terabytes of data. In my last post on Spark, I explained how to work with PySpark RDDs and DataFrames. Although that post explains a lot about working with RDDs and basic DataFrame operations, I missed quite a lot when it comes to working with PySpark DataFrames.
Many of my followers ask me: How difficult is it to get a job in the Data Science field? What should they study? What path should they take? The answer is not one everyone would like: getting into Data Science is pretty difficult, and you have to toil hard. I mean you have to devote time to learning data science, understand algorithms, upgrade your skills as the market progresses, keep your older, conventional skills sharp, and, of course, search for a job in the meantime and prepare for interviews.
Have you ever been frustrated by doing data exploration and manipulation with Pandas? With so many ways to do the same thing, I get spoiled for choice and end up doing absolutely nothing. For a beginner, the problem is just the opposite: how to do even a simple thing is not appropriately documented. Understanding Pandas syntax can be hard for the uninitiated. So what should one do?
XGBoost is one of the most used libraries for data science. At the time XGBoost came into existence, it was lightning fast compared to its nearest rival, Python’s Scikit-learn GBM. But as time has progressed, it has been rivaled by some awesome libraries like LightGBM and CatBoost, on both speed and accuracy. I, for one, use LightGBM for most of the use cases where I only have a CPU for training.
A Machine Learning project is never really complete if we don’t have a good way to showcase it. While in the past, a well-made visualization or a small PPT used to be enough for showcasing a data science project, with the advent of dashboarding tools like RShiny and Dash, a good data scientist needs to have a fair bit of knowledge of web frameworks to get along. As Sten Sootla says in his satire piece which I thoroughly enjoyed:
Recently I was working on tuning hyperparameters for a huge Machine Learning model. Manual tuning was not an option since I had to tweak a lot of parameters. Hyperopt was also not an option since, by default, it works serially: only a single model is built at a time. So it was taking a lot of time to train each model, and I was pretty short on time.
A Machine Learning project is never really complete if we don’t have a good way to showcase it. While in the past, a well-made visualization or a small PPT used to be enough for showcasing a data science project, with the advent of dashboarding tools like RShiny and Dash, a good data scientist needs to have a fair bit of knowledge of web frameworks to get along. And web frameworks are hard to learn.