How did I learn Data Science?

I am a mechanical engineer by education, and I started my career with a core job in the steel industry.

But I didn’t like it, so I left.

I made it my goal to move into the analytics and data science space around 2013. Since then, the shift has taken a lot of failures and a lot of effort.

Now, people on social networks ask me how I got started in the data science field. So I thought of giving a definitive answer.

It is not really impossible to do this but it will take a lot of time and effort. Fortunately, I had an ample supply of both.

Below is the path I took, which any aspiring person could follow to become a self-trained data scientist.

Some of the courses are not the exact ones I took, since some no longer exist and others have been merged into bigger specializations. But I have tried to keep the list as close to my experience as possible.

Also, I hope that you don’t lose hope after seeing the long list. You have to start with one or two courses. The rest will follow with time. Remember we have ample time.

Follow in order. I have tried to include everything that comes to my mind, including some post links which I think could be beneficial.


Introduction to Probability and Statistics

Stat 110: The quintessential Probability and Statistics course you gotta take. All the lectures and notes are available for free on YouTube and the course site.

Take it, if not for the content, then for Prof. Joseph Blitzstein’s sense of humor. The above picture is a testament to that.

I took this course to enhance my understanding of probability distributions and statistics, but this course taught me a lot more than that.

Apart from learning to think conditionally, I also learned how to explain difficult concepts with a story.

This is a challenging class for a beginner but most definitely fun. The focus was not only on working through mathematical proofs but also on understanding the intuition behind them and how that intuition can help in deriving the proofs quickly. Sometimes the same proof was done in different ways to facilitate the learning of a concept.

One of the things I liked most about this course is the focus on concrete examples while explaining abstract concepts.

The inclusion of the Gambler’s Ruin Problem, Matching Problem, Birthday Problem, Monty Hall, Simpson’s Paradox, St. Petersburg Paradox, etc. made this course much more exciting and enjoyable than any ordinary statistics course.
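To get a feel for why these puzzles are fun, here is a minimal simulation sketch of the Monty Hall problem. It is my own illustration, not material from the course; the function names `play_monty_hall` and `win_rate` are made up for this example, and it assumes the standard rules (three doors, and the host always opens a goat door that you did not pick).

```python
import random


def play_monty_hall(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the only door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning for a fixed strategy."""
    rng = random.Random(seed)
    wins = sum(play_monty_hall(switch, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    print(f"stay:   {win_rate(switch=False):.3f}")
    print(f"switch: {win_rate(switch=True):.3f}")
```

The estimates should land close to 1/3 for staying and 2/3 for switching, which is exactly the kind of result that conditional thinking explains.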

It will help you understand Discrete (Bernoulli, Binomial, Hypergeometric, Geometric, Negative Binomial, First Success, Poisson) and Continuous (Uniform, Normal, Exponential, Beta, Gamma) distributions.

Prof. Blitzstein has also written a textbook based on this course, which is excellent and a must for any bookshelf.


Introduction to Python and Data Science

Do first, understand later

We need to get a taste of machine learning before understanding it fully. This segment is made up of three parts. These are not the exact courses I took to learn Python and get an intro to data science, but they are quite similar and serve the purpose.

a) Introduction to Data Science in Python

This course is about learning to use Python and creating things on your own. You will learn about Python libraries for data science such as NumPy and pandas.

While going through this course, you might also like my posts on Minimal Pandas for Data Scientists and my short posts on advanced Python.

Course description from the website:

This course will introduce the learner to the basics of the python programming environment, including fundamental python programming techniques such as lambdas, reading and manipulating csv files, and the numpy library. The course will introduce data manipulation and cleaning techniques using the popular python pandas data science library and introduce the abstraction of the Series and DataFrame as the central data structures for data analysis, along with tutorials on how to use functions such as groupby, merge, and pivot tables effectively. By the end of this course, students will be able to take tabular data, clean it, manipulate it, and run basic inferential statistical analyses.
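To make the groupby, merge, and pivot table operations mentioned in that description concrete, here is a small sketch. The `orders` and `customers` tables and their values are invented for illustration and are not from the course; the example assumes pandas is installed.

```python
import pandas as pd

# Hypothetical toy data, made up for this example.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "month": ["Jan", "Feb", "Jan", "Jan", "Feb", "Feb"],
    "amount": [20.0, 35.0, 15.0, 50.0, 10.0, 5.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
})

# groupby: total spend per customer.
spend = orders.groupby("customer_id", as_index=False)["amount"].sum()

# merge: attach each customer's city to every order (a left join).
merged = orders.merge(customers, on="customer_id", how="left")

# pivot_table: total amount per city (rows) and month (columns).
pivot = merged.pivot_table(
    index="city", columns="month", values="amount",
    aggfunc="sum", fill_value=0,
)

print(spend)
print(pivot)
```

Most day-to-day data cleaning is some combination of these three calls, so they are worth practicing on your own data early.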

b) Applied Machine Learning in Python

This course gives an intro to many modern machine learning methods that you should know about. It is not a thorough deep dive, but you will get the tools to build your own models. You will learn scikit-learn, the Python library for creating all sorts of models.

The focus here is to start creating things as soon as possible. No one likes to wait too long to get something useful, and you will become useful after this course.

This course will introduce the learner to applied machine learning, focusing more on the techniques and methods than on the statistics behind these methods. The course will start with a discussion of how machine learning is different than descriptive statistics, and introduce the scikit learn toolkit through a tutorial.

c) Visualizations

A well made visualization is worth more than any PPT

You also need to learn about visualization. This area is constantly evolving, with new libraries appearing frequently. The libraries I use most are Seaborn and Plotly.

You could take a look at the posts below to get started with both basic and advanced visualizations.

Python’s One-Liner graph creation library with animations Hans Rosling Style

3 Awesome Visualization Techniques for every dataset


Machine Learning Fundamentals

After doing the above courses, you will gain the status of what I would like to call a “Beginner.”

Congrats! You know stuff; you know how to implement things.

Yet you do not fully understand all the math and grind that go into these models.

You need to understand what goes on behind clf.fit.

If you don’t understand it you won’t be able to improve it
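As a rough idea of what "behind the fit" can mean, here is a minimal sketch that fits a straight line by batch gradient descent in plain Python. This is an illustration only: `fit_line` is a made-up helper, the data is invented, and scikit-learn's own estimators use different and more sophisticated solvers than this loop.

```python
def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (no noise), made up for this example.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Once you can write a loop like this yourself, knobs such as the learning rate and the number of iterations stop being magic and become things you can reason about and tune.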

Here comes the game-changer Machine Learning course. It covers the math behind many machine learning algorithms.

I rank this as the one course you gotta take, as it motivated me to get into this field, and Andrew Ng is a great instructor. It was also the first course I took when I started.

This course has a little of everything — Regression, Classification, Anomaly Detection, Recommender systems, Neural networks, plus a lot of great advice.

After this one, you are done with the three musketeers of the trade.

You know Python, you understand Statistics, and you have gotten a taste of the math behind ML approaches. Now it is time for the new kid on the block: D’Artagnan. This kid has skills. While the three musketeers are masters of their trade, this guy brings qualities that add a new freshness to our data science journey.

Here comes Big Data for you.


Big Data Analytics Using Spark

Big Data is omnipresent. Deal with it.

The whole big data ecosystem has changed a lot since the time I learned Hadoop, when Spark was the new kid on the block. Those days…

The courses I took are pretty outdated by now, so I will try to recommend something suitable for this era. The best course I could find that embodies most of what I learned through scattered sources is Big Data Analytics Using Spark.

According to the course website, this course covers:

  • Programming Spark using PySpark
  • Identifying the computational tradeoffs in a Spark application
  • Performing data loading and cleaning using Spark and Parquet
  • Modeling data through statistical and machine learning methods

You could also take a look at my recent post on Spark.

The Hitchhikers guide to handle Big Data using Spark


Understand Linux Shell

Not a hard requirement, but a good-to-have skill. The shell is a big friend of data scientists. It lets you do simple data-related tasks in the terminal itself. I can’t emphasize enough how much time the shell saves me every day.

You can read my post below to learn more: Impress Onlookers with your newly acquired Shell Skills

If you would like to take a course, you can look at The Unix Workbench course on Coursera.

Congrats, you are a “Hacker” now.

You have got all the main tools in your belt to be a data scientist.

On to more advanced topics. From here, it depends on you what you want to learn.

From here, you may want to take a totally different approach than I did. There is no particular order. “All Roads Lead to Rome” as long as you are moving.


Learn Statistical Inference

Mine Çetinkaya-Rundel teaches this course on Inferential Statistics. And it cannot get simpler than this one.

She is a great instructor and explains the fundamentals of Statistical inference nicely — a must-take course.

You will learn about hypothesis testing, confidence intervals, and statistical inference methods for numerical and categorical data.
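As a small taste of those ideas, here is a sketch of a confidence interval and a hypothesis test for a proportion, using only the Python standard library. It is my own illustration rather than course material: the helper names and the numbers (60 successes out of 100) are made up, and both functions rely on the normal approximation, which assumes a reasonably large sample.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    # Critical value, e.g. about 1.96 for 95% confidence.
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: p = p0. Returns (z statistic, p-value)."""
    p_hat = successes / n
    # Standard error computed under the null hypothesis.
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 60 successes in 100 trials, testing against p0 = 0.5.
low, high = proportion_ci(successes=60, n=100)
z_stat, p_value = proportion_z_test(successes=60, n=100, p0=0.5)
print(f"95% CI: ({low:.3f}, {high:.3f})")
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
```

The interval and the test answer related questions: the interval gives a range of plausible values for the true proportion, while the p-value says how surprising the data would be if the true proportion were exactly 0.5.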


Deep Learning

It is all about layers

Intro — Making neural nets uncool again. This is a code-first class for neural nets. An excellent Deep learning class from Kaggle Master Jeremy Howard. Entertaining and enlightening at the same time.

Advanced — You can try out this Deep Learning Specialization by Andrew Ng again. Pure Gold.

Advanced Math Book — A math-intensive book by Yoshua Bengio & Ian Goodfellow

Take a look at the post below if you want to learn PyTorch.

Moving from Keras to Pytorch


Learn NLP, Use Deep Learning with Text and create Chatbots

Reading is overrated. Let the machine do it.

Natural Language Processing is something which captured my attention a while back.

I wrote a series of 6 posts on it. If you want, you can take a look.

NLP Learning Series — Towards Data Science


Algorithms, Graph Algorithms, and More

Algorithms. Yes, you need them.

Apart from that, if you want to learn about Python and the underlying intricacies of the language, you can also take the Computer Science Mini Specialization from Rice University.

This is a series of 6 short but good courses.

I worked through these courses because data science requires a lot of programming, and the best way to learn to program is by doing it.

The lectures are good, but the problems and assignments are awesome. If you work on this, you will learn Object-Oriented Programming, Graph algorithms, and creating games in Python. Pretty cool stuff.

You could also take a look at:

The 5 Feature Selection Algorithms every Data Scientist should know

The 5 Sampling Algorithms every Data Scientist need to know


Some Advanced Math Topics

Math — The power behind it all

I am listing it last, but don’t underestimate the importance of math in data science. You might want to look into these courses if you want to refresh your concepts.

Linear Algebra by Gilbert Strang: a great class by a great teacher. I would definitely recommend this class to anyone who wants to learn Linear Algebra.

Multivariate Calculus: MIT OpenCourseWare

Convex Optimization: a MOOC on optimization from Stanford, by Stephen Boyd, an authority on the subject.


Conclusion

The machine learning field is evolving, and new advancements are made every day. That’s why I didn’t add a third tier.

The maximum I can call myself is a “Hacker,” and my learning continues.

Everyone has their own path to becoming a data scientist, and here I have shared mine. It is in no way perfect, and obviously a lot could be added to it.

Though I did not complete any professional training, I consider myself more of a Computer science engineer than a mechanical engineer now due to the above courses.

I hope they help you too.

Thanks for the read. I am going to be writing more beginner-friendly posts in the future too. Follow me on Medium or subscribe to my blog to be informed about them. As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.
