Data science from scratch: first principles with Python

Python.Engineering

Who is a Data Scientist?

A data scientist studies data to find hidden patterns and to predict how things will behave in the future. The work combines mathematical models, programming and statistics within a specific professional area (finance, banking, etc.) and targets specific tasks, such as recognizing fraudulent transactions, identifying a set of genes associated with a particular disease, or assessing financial risks for companies.

To solve these problems, such a specialist needs knowledge and skills in several areas. The most important are mathematics, programming, and an understanding of business and strategy (from the book "Data Science from Scratch: First Principles with Python" by Joel Grus).


What kind of specialists work with data

Data Analyst - works with structured data from internal analytics systems and helps the business summarize and interpret it. Uses Excel, SQL and internal analytics systems. SkillFactory offers a Data Analyst Specialization course.

BI Developer (Business Intelligence Developer) - designs internal data warehouses, links data from different systems, and creates dashboards and analytical reports. Uses BI systems (Oracle, IBM, etc.), SQL, ETL tools and programming languages.

Data Engineer - builds and maintains data infrastructure, particularly for Big Data. Responsible for collecting, storing and managing real-time data streams. A highly qualified IT specialist who works with Linux server clusters, cloud systems, and Big Data systems such as Hadoop and Spark. SkillFactory offers a course called "Data Engineer Specialization".

Data Scientist - performs intelligent analysis of structured and unstructured data. Uses statistics, machine learning, and advanced predictive analytics to solve key business problems. Compared with a data analyst, a data scientist must not only analyze the information received but also have excellent programming skills, be able to develop new algorithms and process large amounts of information, and understand the domain in which that knowledge is applied.


Data science: what it is, where it is used

Data science is a relatively new discipline concerned with the search, storage and processing of information. Do not let the word "science" mislead you: it is an applied tool that is used everywhere. In other words, it is a science of working with data.

Businesses actively use data science (DS) to predict events, build and segment target audiences, and study demand for particular products. This article explains what data science is, what experts in the field do, and what you need to start working in it.

What Data Science is

Data Science is a discipline about making data useful. You can find many definitions of the concept, and each of them contains the word "data". In other words, Data Science is applied very widely.

As a result, the work of a specialist in this field is hard to pin down. It is not always clear what exactly such a person does with data, because data is needed for many things: creating reports, forecasting demand in one area or another, building complex mathematical models of dynamic pricing, and configuring stream processing for high-load services that run in real time.

The word "science" in the name is there for a reason. Mathematics is the basis of Data Science: data analysis rests on the classical mathematical apparatus of optimization theory, linear algebra, mathematical statistics, and more. However, science is the foundation rather than the main occupation of specialists, most of whom deal not with theory but with practice, solving specific problems.

Of course, there are large corporations with large teams engaged exclusively in research; they create new machine learning algorithms and methods and improve existing ones.

Today, businesses want first and foremost to understand what positive effect Data Science can have on them. What matters is not how models are built using machine learning algorithms, but why the need for them arose in the first place, how it was formulated mathematically, and how it was implemented in specific ways to solve problems.

It is also very important to run honest experiments, which help to assess properly how effective the models are in a particular business.

Principle of Data Science

Consider the theoretical foundations of data science. In the Russian-speaking world the term is simply transliterated rather than translated. It refers to a set of interrelated disciplines and methods from computer science and mathematics.

First part: data

In data science, the data itself is obviously crucial. Methods of collecting, storing and processing data, and of extracting useful information from the whole body of data, are especially important. This collection and preparation takes up to 80% of the working time of specialists in the field.

Some data cannot be collected and processed by traditional methods because of its large volume and/or diversity; it is called big data.

Importantly, working with big data is a subset of data science, not a synonym for it. In practice, however, data analysts often work with big data.

Second part: science

The collected data must be processed in some way, and any chart built from it should lead to specific conclusions. To get there, the information must be analyzed, and useful patterns must be extracted and then used. This is where the second part of data science comes into play: disciplines such as statistics, machine learning, and optimization.

Together they make up data analysis. Machine learning searches for patterns in existing data so that the required information can later be predicted for new objects.

The basic idea of machine learning is quite simple: discover a pattern and apply it to new data. But there is another group of key tasks, which aims not to predict values but to split the data into groups.
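
The "discover a pattern, then apply it to new data" idea can be sketched with a least-squares line fitted from first principles. The data and the names `fit_line` and `predict` are invented for this illustration; a straight line is only one of many possible patterns.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def predict(model, x):
    """Apply the discovered pattern to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data: office head count vs. kilograms of coffee used per month.
employees = [10, 20, 30, 40]
coffee_kg = [5.0, 10.0, 15.0, 20.0]

model = fit_line(employees, coffee_kg)   # discover the pattern
forecast = predict(model, 50)            # apply it to new data
print(forecast)  # 25.0
```

The fitting step only sees existing observations; the prediction step is where the pattern is reused for an object that was not in the data.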

A data science project is applied research in which certain steps are mandatory: stating a hypothesis, designing an experiment, and evaluating whether the result is suitable for solving a particular problem. This matters greatly in business, when it is necessary to understand whether a particular decision will pay off.

Returning to the coffee example, the results of the study could determine how much of the beverage office employees need during a month, so that purchases match people's real needs. After the calculations, however, the resulting model must be compared with the existing one to identify the better of the two.

Anomalous objects (outliers) harm the model-building process and the quality of the model, so they should be treated separately. Sometimes, however, they are exactly what the study is about. This is the case, for example, when non-standard banking operations are detected in order to prevent fraud.
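
One simple way to flag such unusual values is the modified z-score, which measures distance from the median in units of the median absolute deviation (MAD). This is a minimal sketch, not the article's method: the transaction amounts are invented, and the constant 0.6745 with a threshold of 3.5 is a common rule of thumb rather than a universal setting.

```python
from statistics import median

def find_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # No spread to measure against; this rule cannot flag anything.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical card transaction amounts; one is far outside the usual range.
amounts = [42, 38, 45, 40, 41, 39, 44, 5000]
suspicious = find_outliers(amounts)
print(suspicious)  # [5000]
```

Whether the flagged values are then dropped before model building or passed on for a fraud review depends on the goal of the study.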

Applications of Data Science

According to Kaggle, a professional social network for specialists in this field, data science is used today by businesses of all sizes. IDC and Hitachi report that 78 percent of businesses have recently seen significant growth in the amount of information they can analyze and use.

Entrepreneurs are aware of the importance of information and the need to structure it in order to positively influence their own activities, regardless of their focus. Let's list the industries in which Data Science is actively used to address current challenges:

  • online commerce and entertainment services: recommendation systems for users;
  • healthcare: disease prediction and health recommendations;
  • logistics: planning and optimization of delivery routes;
  • digital advertising: automated content placement and targeting;
  • finance: scoring, fraud detection and prevention;
  • industry: predictive analytics for repair and production planning;
  • real estate: search and offer of the most suitable objects for the buyer;
  • public administration: forecasting of employment and economic situation, fighting crime;
  • sports: selecting promising players and developing game strategies.

This is not a complete list of the areas and ways in which data science is applied. The number of very different use cases grows every year.

Data Science is encountered not only by specialists in the field but also by ordinary users of websites and online services, because these services are built with data science tools. For example, the well-known audio service Spotify uses them to tailor music selections to each user's preferences.

This also applies to video streaming services like Netflix, which try to offer their viewers content that is relevant to their interests. Uber is actively studying data in order to predict demand, improve the quality of its products, and automate work processes.

The results of Data Science should not be relied on exclusively, but its tools are extremely useful: they help businesses navigate their field better and make approximate predictions about the future.

Who is a Data Science Specialist

A data scientist is a specialist who processes a body of data, extracts useful information from it, and finds relationships and patterns using machine learning algorithms. The outcome is a model: an algorithm that can be used to solve business problems.

For example, in taxi services models are built to forecast demand, select optimal routes, and monitor the condition of drivers. Successful use of models makes it possible to reduce the cost of trips and improve the quality of service. In the banking sector, models can be used to optimize decisions on whether to grant credit to a potential borrower.

In insurance companies, they help assess the likelihood of an insured event. For those who sell their products online, the implementation of the model can help increase advertising conversion rates.

The data scientist works with machine learning, building a model to a given specification so that it delivers a defined result.

Tasks of a Data Science Specialist

A data scientist's tasks differ from company to company. In large corporations, one specialist is usually responsible for several areas. In a bank, for instance, a data scientist may work on both borrower assessment and speech recognition.

Let's take a look at the standard stages of a data scientist's work. As a rule, there are five of them.

  • Data collection. Information of all kinds (structured and unstructured) is gathered from a variety of sources that are relevant to the task and domain of operation. The methods of work are varied, including manual input, scraping web pages, and collecting data from proprietary systems.
  • Data storage. The specialist looks for ways to store the collected information so that it can be processed later with the tools already available. At this stage, data is filtered, duplicates are removed, etc.
  • Pre-processing. The collected data undergoes a preliminary analysis that identifies its most prominent correlations. At this stage it is also necessary to trace patterns and to check that the information is reliable and relevant to the tasks at hand.
  • Processing. The collected data is processed with the data scientist's special tools: artificial intelligence, machine learning models, analytical algorithms, etc.
  • Communication. The specialist visualizes the results of the work by creating tables, graphs, lists, etc. The form of data presentation is chosen depending on the specific situation, the tasks performed, and the category of information consumers.
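
The stages above can be sketched end to end on a toy data set. Everything here is hypothetical: the records, the field names and the helper functions `clean`, `average_by_city` and `report` are invented for illustration, and real projects usually rely on dedicated tools rather than plain lists and dictionaries.

```python
from statistics import mean

# Collection stage: hypothetical raw records gathered from some source.
raw_records = [
    {"id": 1, "city": "Oslo", "amount": 120.0},
    {"id": 2, "city": "Bergen", "amount": 80.0},
    {"id": 1, "city": "Oslo", "amount": 120.0},   # duplicate of the first record
    {"id": 3, "city": "Oslo", "amount": None},    # missing value
    {"id": 4, "city": "Bergen", "amount": 100.0},
]

def clean(records):
    """Storage stage: drop repeated ids and rows with a missing amount."""
    seen, result = set(), []
    for row in records:
        if row["id"] in seen or row["amount"] is None:
            continue
        seen.add(row["id"])
        result.append(row)
    return result

def average_by_city(records):
    """Processing stage: average amount per city."""
    groups = {}
    for row in records:
        groups.setdefault(row["city"], []).append(row["amount"])
    return {city: mean(amounts) for city, amounts in groups.items()}

def report(summary):
    """Communication stage: a plain-text table, one city per line."""
    return "\n".join(f"{city:<8}{avg:>8.2f}" for city, avg in sorted(summary.items()))

cleaned = clean(raw_records)
summary = average_by_city(cleaned)
print(report(summary))
```

Keeping each stage in its own function makes it easy to rerun one step when the data or the question changes.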

Although the work of data scientists in each field follows its own rules, some features are common to all areas. Almost every specialist must:

  • determine the tasks the customer wants to solve;
  • find out whether machine learning methods are an appropriate way to solve the problem;
  • collect, process, label, and prepare the data for further use;
  • determine the metrics for evaluating the effectiveness of the model;
  • develop and test machine learning models;
  • prove the predicted economic effect of implementing the model;
  • implement the model in business processes;
  • support the model in the process of its use.

Statistics are very important in data science. They are collected continuously, and the stages of work are repeated to refine the data and improve the models.

Data Science tools can be used in a business of any size; what differs is the size of the team and the scale of the tasks. The project manager carries the bulk of the work: maintaining contact with customers, obtaining clear terms of reference, and then assigning tasks to subordinates (analysts of various levels). A data scientist working alone both communicates with customers and performs the assigned tasks.

Data can also be collected by several specialists or by a single one - it all depends on the level of the company and the scale of the analytics department. At the same time, tools that simplify and automate this process are usually used. They also help to pre-filter and systematize the information received.

When a model is created, metrics are also defined for evaluating its effectiveness. As a rule, there are two types: business metrics and technical metrics. The former track the economic effect of implementing the model, and the latter measure its quality (say, the accuracy of its predictions).
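
The two kinds of metric can be computed side by side, here for a model compared against a trivial baseline that never flags anything. This is a sketch under invented assumptions: the labels, predictions and amounts are made up, and `prevented_loss` is a hypothetical business metric chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Technical metric: share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def prevented_loss(y_true, y_pred, amounts):
    """Business metric (hypothetical): total amount of fraud the model flagged."""
    return sum(a for t, p, a in zip(y_true, y_pred, amounts) if t == 1 and p == 1)

# Hypothetical transactions: 1 means fraud, 0 means a normal operation.
actual   = [0, 0, 1, 0, 1, 0, 0, 1]
model    = [0, 0, 1, 0, 0, 0, 1, 1]   # predictions of the new model
baseline = [0] * len(actual)          # existing approach: never flag anything
amounts  = [20, 35, 900, 15, 400, 60, 25, 1200]

print(accuracy(actual, model), accuracy(actual, baseline))   # 0.75 0.625
print(prevented_loss(actual, model, amounts))                # 2100
print(prevented_loss(actual, baseline, amounts))             # 0
```

In this toy case the accuracy gap between the model and the baseline looks modest, while the business metric shows a much larger difference, which is one reason to track both.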

It is also necessary to assess how controllable and safe the resulting model is. In medicine, for example, this is critical, especially in diagnostics. A tested model can be built into a production process (e.g., a credit pipeline) or a product (e.g., a mobile app), and its effect can be monitored in real time.

Basic Data Science Tools

Data Science professionals should have theoretical and practical training in programming and application development, because it broadens their toolkit and career opportunities. It is important to know at least one of the two most popular programming languages in Data Science.

R. An open-source language and software environment for statistical computing. It contains many libraries and handy tools for filtering and preprocessing data. R provides ample opportunities for data visualization and for testing a machine learning model.

Python. A versatile, object-oriented programming language. In data science it is used for a wide variety of tasks and works with data of almost any format.

Other data science tools worth mentioning are Apache Spark, Tableau and Microsoft Power BI, and the list goes on.

Knowledge and skills needed to get started in Data Science

Today, the basics of data science can be learned from numerous courses and specialized books. A specialist in this field needs fairly extensive knowledge of the exact sciences, machine learning, programming languages, and data collection.

Statistics, mathematics, linear algebra

Data science training involves learning a basic course in probability theory, mathematical analysis, linear algebra and mathematical statistics. This is necessary to carry out a competent analysis of the results of data processing algorithms.


Machine learning

Machine learning lets you set up computers to make decisions autonomously and automatically.

To master Data Science from the ground up, you must learn the three main branches of machine learning.

Supervised learning. Makes it possible to build predictions from pre-labeled data. When the value to predict is a category (say, distinguishing images of sailboats from images of cars and boats), it is a classification task; when it is a number (for example, estimating the price of a car from its characteristics), it is a regression task.
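
A classification task can be sketched with a k-nearest-neighbors classifier written from scratch. The labeled points below are invented (two numeric features stand in for whatever would be extracted from an image), and k-NN is just one simple algorithm picked for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical pre-labeled data: (length in metres, mass in tonnes) -> class.
train = [
    ((4.2, 1.3), "car"),
    ((4.5, 1.5), "car"),
    ((4.8, 1.8), "car"),
    ((9.0, 4.0), "sailboat"),
    ((10.5, 5.5), "sailboat"),
    ((12.0, 7.0), "sailboat"),
]

print(knn_predict(train, (4.4, 1.4)))    # car
print(knn_predict(train, (11.0, 6.0)))   # sailboat
```

The same neighbor search would turn into a regression method if the function averaged a numeric target of the neighbors instead of counting class votes.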

Unsupervised learning. Here the data is not labeled, and neither the result nor the way the data will be processed is known in advance. This is how you can look for anomalies (non-standard bank card transactions), erroneous sensor readings, etc.

Reinforcement learning. Here too there are no labels, but the model (often a neural network) receives positive or negative feedback in response to its actions. This is how, for example, machines learn to play computer games such as Dota 2 or StarCraft II.
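
Working without labels can also be illustrated by grouping: the sketch below runs k-means (Lloyd's algorithm) on one-dimensional data. The algorithm is a choice made for this example and is not named in the article; the purchase amounts and starting centers are invented.

```python
from statistics import mean

def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on one-dimensional data with given starting centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # Centers stopped moving, so the grouping is stable.
        centers = new_centers
    return centers, clusters

# Hypothetical purchase amounts with no labels attached.
purchases = [12, 15, 14, 13, 95, 102, 99, 98]
centers, clusters = kmeans_1d(purchases, centers=[0, 50])
print(centers)   # [13.5, 98.5]
print(clusters)  # [[12, 15, 14, 13], [95, 102, 99, 98]]
```

No record says which group a purchase belongs to; the two groups emerge from the distances between the values alone.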

Literature

  • "Machine Learning. The Science and Art of Building Algorithms that Extract Knowledge from Data" by P. Flach. This is a book about model building techniques and machine learning algorithms.
  • "Probabilistic programming in Python: Bayesian inference and algorithms," by C. Davidson-Pilon. A book on data processing algorithms and on developing analytical thinking and skills.
  • "Introduction to machine learning with Python," by A. Muller, S. Guido. The book focuses on building practical machine learning skills.

Programming with Python

Data science is directly related to programming. For a data scientist it is quite enough (at least at first) to know one language, and it is best to start with Python. It is a versatile, feature-rich language with a simple syntax that is often used for data processing.

Literature

  • "Python for Complex Problems. Data Science and Machine Learning," by J. VanderPlas. The book is a guide to statistical and analytical data processing techniques.
  • "Python and Data Analysis," by Wes McKinney. The author discusses applications of the Python programming language in data science.
  • "Automating Routine Tasks with Python," by Al Sweigart. A good primer for beginners.
  • "Learning Python," by M. Lutz. A versatile textbook with an emphasis on practice. Suitable for those just starting out on their data journey as well as experienced developers.

Once you have mastered the basics of Python, you can start learning its libraries for Data Science.

Collecting Data

Data Mining is a serious analytical process of examining data. It can be used to identify hidden patterns and thereby obtain useful new information for decision-making. It also involves data visualization.

Before looking for a job, you can take part in open projects or competitions. This will let you gauge your level of knowledge and test your skills in practice.
