Shell Basics Every Data Scientist Should Know - Part I

Oct 09, 2015

Shell commands are powerful. "Life would be hell without the shell" is how I like to put it (and that is probably why I dislike Windows).

Consider a case where you have a 6 GB pipe-delimited file sitting on your laptop and you want to count the distinct values in one particular column. There is more than one way to do this: you could load the file into a database and run SQL commands, or you could write a Python or Perl script.

But whatever you do, it probably won't be simpler or less time-consuming than this:

cat data.txt | cut -d "|" -f 1 | sort | uniq | wc -l

And it will likely run faster than a Perl or Python script you write for the job.
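To try the one-liner on a small scale first, here is a minimal sketch. It assumes a tiny sample file with made-up contents (created with mktemp, not part of the original example) that uses the same pipe-delimited layout. It also shows an equivalent, shorter form in which cut reads the file directly and sort -u replaces sort | uniq.

```shell
# Create a tiny pipe-delimited sample file (hypothetical data for illustration).
sample="$(mktemp)"
printf '%s\n' 'a|1|x' 'b|2|y' 'a|3|z' 'c|4|x' 'b|5|y' > "$sample"

# The one-liner from the post, pointed at the sample file.
distinct_count=$(cat "$sample" | cut -d "|" -f 1 | sort | uniq | wc -l)

# Equivalent form: cut reads the file itself, and sort -u drops duplicates.
distinct_count_short=$(cut -d "|" -f 1 "$sample" | sort -u | wc -l)

echo "distinct values in column 1: $distinct_count"
echo "same count, shorter pipeline: $distinct_count_short"

# Clean up the temporary file.
rm -f "$sample"
```

The first column of the sample holds a, b, a, c, b, so both forms report 3 distinct values. Either form can be swapped in for data.txt once the result looks right.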

Reading the cat pipeline from left to right, this command says:

Use the cat command to print/stream the contents of the file to stdout.
Pipe the streamed contents from the cat command to the next command, cut.
The cut command takes the delimiter from the -d argument and the column from the -f argument, and streams its output to stdout.
Pipe the streaming content to the sort comman…
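One way to see what each stage of the pipeline contributes is to run it one stage at a time and look at the intermediate output. The sketch below does that for the cat | cut | sort | uniq | wc -l command, assuming a small made-up sample file; the stage_* variable names are hypothetical and exist only for this illustration.

```shell
# Hypothetical sample data, created only for this illustration.
sample="$(mktemp)"
printf '%s\n' 'b|2|y' 'a|1|x' 'b|5|y' 'c|4|x' 'a|3|z' > "$sample"

# cat streams the file; cut keeps only the first pipe-delimited column.
stage_cut=$(cat "$sample" | cut -d "|" -f 1)

# sort orders the values so that identical ones sit next to each other.
stage_sort=$(printf '%s\n' "$stage_cut" | sort)

# uniq collapses adjacent duplicate lines, which is why sort runs before it.
stage_uniq=$(printf '%s\n' "$stage_sort" | uniq)

# wc -l counts the lines that remain, i.e. the number of distinct values.
stage_count=$(printf '%s\n' "$stage_uniq" | wc -l)

printf 'after cut:\n%s\n\n' "$stage_cut"
printf 'after sort:\n%s\n\n' "$stage_sort"
printf 'after uniq:\n%s\n\n' "$stage_uniq"
printf 'line count: %s\n' "$stage_count"

# Clean up the temporary file.
rm -f "$sample"
```

Storing each stage in a variable is only for inspection. On a multi-gigabyte file the plain pipeline is the better choice, because it streams data between commands instead of holding whole intermediate results in shell variables.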

Continue reading this post for free, courtesy of Rahul Agarwal.

MLWhiz: Recs|ML|GenAI