Feeling helpless? I know I am. With the whole shutdown situation, what I once thought would be a paradise for my introverted self doesn’t look so good now that it is actually happening. I really cannot fathom being at home much longer, and the helplessness of not being able to do anything doesn’t help. Honestly, I would like to help with so much more in this dire situation, but here are some small ideas on how we as AI practitioners and data scientists can be of use.
I know: Spark is sometimes frustrating to work with. Although we can sometimes manage our big data using tools like Rapids or parallelization, there is no way around using Spark if you are working with terabytes of data. In my last few posts on Spark, I explained how to work with PySpark RDDs and Dataframes. Although these posts explain a lot about RDD and Dataframe operations, they are still not quite enough.
More and more data is being generated every day. Although we can sometimes manage our big data using tools like Rapids or parallelization, Spark is an excellent tool to have in your repertoire if you are working with terabytes of data. In my last post on Spark, I explained how to work with PySpark RDDs and Dataframes. Although that post explains a lot about working with RDDs and basic Dataframe operations, I missed quite a lot when it comes to working with PySpark Dataframes.
Many of my followers ask me: How difficult is it to get a job in the data science field? What should they study? What path should they take? The answer is not one everyone would like: getting into data science is pretty difficult, and you have to toil hard. I mean you have to devote time to learning data science, understand algorithms, upgrade your skills as the market progresses, keep your older, conventional skills sharp, and, of course, search for a job in the meantime and prepare for interviews.
Have you ever been frustrated by doing data exploration and manipulation with Pandas? With so many ways to do the same thing, I get spoiled for choice and end up doing absolutely nothing. For a beginner, the problem is just the opposite: how to do even a simple thing is not appropriately documented. Understanding Pandas syntax can be hard for the uninitiated. So what should one do?
XGBoost is one of the most used libraries for data science. When XGBoost came into existence, it was lightning fast compared to its nearest rival, Python’s Scikit-learn GBM. But as time has progressed, it has been rivaled by some awesome libraries like LightGBM and CatBoost, on both speed and accuracy. I, for one, use LightGBM for most use cases where I only have a CPU for training.
A Machine Learning project is never really complete if we don’t have a good way to showcase it. While in the past, a well-made visualization or a small PPT used to be enough for showcasing a data science project, with the advent of dashboarding tools like RShiny and Dash, a good data scientist needs to have a fair bit of knowledge of web frameworks to get along. As Sten Sootla says in his satire piece which I thoroughly enjoyed: