One of the main tasks when working with text data is creating a lot of text-based features. One might want to find certain patterns in a large text, such as email addresses or phone numbers. While such functionality may sound fairly trivial to implement, it becomes much simpler if we use the power of Python’s regex module. For example, let’s say you are tasked with finding the number of punctuation marks in a particular piece of text.
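As a rough illustration of the punctuation example, here is a minimal sketch. It assumes the standard-library `re` module and treats "punctuation" as the ASCII characters in `string.punctuation`; the function name `count_punctuation` is made up for this example.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
# re.escape guards characters such as "]", "\\", "^" and "-" that are
# special inside a character class.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def count_punctuation(text: str) -> int:
    """Return how many ASCII punctuation characters appear in text."""
    return len(PUNCT_RE.findall(text))


if __name__ == "__main__":
    sample = "Hello, world! Is this working? Yes."
    print(count_punctuation(sample))
```

Non-ASCII punctuation such as curly quotes is not counted by this pattern, so the character class would need extending for text that contains it.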
I am a mechanical engineer by education, and I started my career with a core job in the steel industry. But I didn’t like it, so I left. I made it my goal to move into the analytics and data science space sometime around 2013. Since then, it has taken me a lot of failures and a lot of effort to make the shift. Now, people on social networks ask me how I got started in the data science field.
Data Science is the study of algorithms. I grapple with many algorithms on a day-to-day basis, so I thought of listing some of the most common and most used algorithms one will end up using in this new DS Algorithm series. How many times has it happened that you create a lot of features and then need to come up with ways to reduce their number?
Data Science is the study of algorithms. I grapple with many algorithms on a day-to-day basis, so I thought of listing some of the most common and most used algorithms one will end up using in this new DS Algorithm series. This post is about some of the most common sampling techniques one can use while working with data. Simple Random Sampling: Say you want to select a subset of a population in which each member of the population has an equal probability of being chosen.
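A minimal sketch of simple random sampling, using only Python's standard library. The helper name `simple_random_sample` and the `seed` parameter are assumptions made for this illustration, not something taken from the original post.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct items so that every item is equally likely to be picked.

    population: any iterable of items
    k: sample size, which must not exceed the population size
    seed: optional seed, included here only to make runs reproducible
    """
    rng = random.Random(seed)
    # random.Random.sample draws without replacement.
    return rng.sample(list(population), k)


if __name__ == "__main__":
    members = range(1, 101)  # a toy population of 100 ids
    print(simple_random_sample(members, 10, seed=42))
```

For data held in a pandas DataFrame, `DataFrame.sample(n=...)` offers the same kind of draw.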
Exploration and exploitation play a key role in any business. Any good business will try to “explore” various opportunities where it can make a profit. At the same time, any good business also tries to focus on a particular opportunity it has already found and “exploit” it. Let me explain this further with a thought experiment. Thought Experiment: Assume that we have infinite slot machines. Every slot machine has some win probability.
Pandas is a vast library. Data manipulation is a breeze with pandas, and it has become such a standard that a lot of parallelization libraries like RAPIDS and Dask are being built in line with pandas syntax. Still, I generally have some issues with it. There are multiple ways of doing the same thing in pandas, and that can make it troublesome for beginners.
Big Data has become synonymous with data engineering. But the line between data engineers and data scientists is blurring day by day. At this point in time, I think that Big Data must be in the repertoire of all data scientists. Reason: Too much data is getting generated day by day. And that brings us to Spark. Now, most of the Spark documentation, while good, does not explain it from the perspective of a data scientist.