What REALLY is Data Science? 


Jonathan Ma, Набижонов Камолиддин, Алексей Натёкин

History:

Before data science, the popular term was data mining. A 1996 article called “From Data Mining to Knowledge Discovery in Databases” used it to refer to the overall process of discovering useful information from data.

In 2001, William S. Cleveland wanted to take data mining to another level by combining it with computer science. He made statistics a lot more technical, which he believed would expand the possibilities of data mining and produce a powerful force for innovation. Statistics could now take advantage of compute power, and he called this combination data science.

The “Journal of Data Science” then described data science as almost everything that has something to do with data: collecting, analyzing, modeling. Yet the most important part is its applications, such as machine learning.

Around 2010, the new abundance of data made it possible to train machines with a data-driven approach rather than a knowledge-driven one. Machine learning and AI then dominated the media, overshadowing every other aspect of data science, such as exploratory analysis, experimentation and the skills we traditionally called business intelligence.

Now the general public thinks of data scientists as researchers focused on machine learning and AI, but the industry is hiring data scientists as analysts, so there is a misalignment. The reason is that, yes, most of these data scientists could probably work on more technical problems, but companies have so much low-hanging fruit for improving their products that no advanced machine learning or statistical knowledge is needed to find these impacts in their analysis.

“Being a good data scientist isn't about how advanced your models are, it's about how much impact you can have with your work.”

Real-life examples of data science jobs:

This very useful chart shows the needs of data science as a pyramid.

At the bottom of the pyramid, highlighted by the green rectangle, are COLLECT, MOVE/STORE and EXPLORE/TRANSFORM. You obviously have to collect, move, store, explore, transform and label data before you can use it. So all of this data engineering effort is important, and it is actually captured pretty well in the media because of “big data”.

Less well known is the stuff in between, highlighted by the red rectangle. This is actually one of the most important things for companies, because it tells the company what to do with its product. An analytics expert uses the data to give you insights such as “what is happening to my users?” and then builds metrics that picture “what's going on with your product?”. These metrics tell you whether you are successful or not. A/B testing is experimentation that lets you find out which product version performs best. These things are really important, but they get little coverage in the media.
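To make the A/B testing idea above a bit more concrete, here is a minimal sketch in Python. It assumes the metric is a conversion rate and compares two product versions with a two-proportion z-test, which is one common choice and not necessarily what any particular company uses. The function name `ab_test` and all the numbers are made up for illustration.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test on conversion rates.

    conv_a, conv_b: number of converted users in versions A and B.
    n_a, n_b: total number of users shown versions A and B.
    Returns (rate_a, rate_b, z, p_value).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis "A and B convert equally well".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data, the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


if __name__ == "__main__":
    # Hypothetical experiment: 1000 users per version.
    rate_a, rate_b, z, p_value = ab_test(100, 1000, 130, 1000)
    print(f"A: {rate_a:.3f}  B: {rate_b:.3f}  z: {z:.2f}  p: {p_value:.4f}")
```

A small p-value suggests the difference between the versions is unlikely to be pure chance. What threshold to use and how long to run the experiment are decisions the sketch leaves out.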

At the top of the pyramid, highlighted by the blue rectangle, are AI and deep learning. We have heard about them on and on. Deep learning is a subfield of machine learning, which in turn is treated here as part of data science. Machine learning builds on all the previous steps, which is why AI and deep learning sit at the top of the hierarchy of needs.

Distribution map:

As you can see, ML engineers focus on both math and development, while ML researchers focus on the math behind ML algorithms.

Our group’s main focus is the two fields highlighted by the red square: ML engineering and research.

Job distribution:

So here's how it looks for a large company:

Software Engineers - Instrumentation, logging, sensors

Data Engineers - Cleaning and building data pipelines

Data Science Analytics - A/B testing, simple ML algorithms, analytics, metrics

Research scientists - AI and deep learning

Research scientists are backed by machine learning engineers, who train models, evaluate them, implement solutions in code and test them on real cases.
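The train-and-evaluate loop mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the "model" is a least-squares line, and the helper names `fit_line` and `mean_squared_error` are invented for this example. Real work would use real data and usually a library.

```python
import random


def fit_line(xs, ys):
    """Train: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(slope, intercept, xs, ys):
    """Evaluate: average squared prediction error on the given data."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Synthetic data (an assumption for the example): y = 2x + 1 plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

# Hold out the last 50 points so the model is judged on unseen data.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

slope, intercept = fit_line(train_x, train_y)
test_mse = mean_squared_error(slope, intercept, test_x, test_y)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MSE={test_mse:.3f}")
```

Keeping a held-out test set separate from the training data is the part that carries over to real projects, whatever model is used.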

Summary:

From the distribution map above we can see that data science is a general term that covers ML engineers, ML researchers, data analysts, data engineers and DevOps. Data science can be any of these, depending on the company you are in.

Links:

[1] What REALLY is Data Science? Told by a Data Scientist

[2] What is the difference between a data analyst, a data engineer and a data scientist – Алексей Натёкин

[3] IUT Machine Learning telegram group:

https://t.me/iutml


