What Is Data Science? Definition, Tasks, Process And Examples
Data science, the application of scientifically grounded methods to data analysis, is becoming more and more critical. But it is often unclear what the procedure entails, what training is required, and what advantages data scientists bring.
This article defines data science and explains the underlying process and the roles involved. To move from theory to practice, we show a few examples as an outlook to illustrate the added value of data science.
What is data science?
Simply put, it is an interdisciplinary approach to using data to generate added value. The procedure consists of methods from statistics, computer science, and economics, the combination of which results in the possibility of developing solutions based on (large) amounts of data.
The term “data science” was coined (Peter Naur, 1960) to distinguish data-focused processing from computer science. However, it was only after the turn of the millennium that efforts began to establish the discipline independently of statistics.
The idea was a multidisciplinary study of data that uses statistics with a view to practical application. Since then, the field has grown steadily, and data science has conquered more and more areas of our daily life.
When people talk about data science today, they primarily mean using big data and machine learning to develop problem-oriented solutions.
This approach also rests on the trinity of data science: statistics/mathematics, data/computer science, and economy/business. The process presented in detail in the next section has established itself as the method for arriving at a solution.
When the Harvard Business Review named the data scientist the “sexiest job of the 21st century” in 2012, there was a run on the topic, and the demand for experienced data scientists has not let up since.
However, following the hype of recent years, disillusionment has set in: It is still often unclear how exactly data science “works,” what tasks data scientists have and how one can derive exact added value for companies and organizations from the analysis of data. We want to remove this ambiguity.
The data science process: tasks and methods
Using data science is about understanding a problem and developing a data-based solution for it. This solution can – but does not have to – be based on advanced analytics such as machine learning.
However, it is vital that an iterative, mutual understanding between business and technical expertise is established along the way so that the solution does not bypass the “customer.” Therefore, we present the data science process in detail in this section.
USE CASE DEFINITION: UNDERSTAND THE USE CASE
The first and most crucial step is identifying and understanding a specific use case and developing a suitable solution. There are seldom “greenfield” approaches in which one can work in a purely innovative manner.
Therefore, the data scientist also acts as a service provider within the company: the task is to create added value for other business areas such as sales, marketing, or production.
The easiest way to understand the problems and needs of these business areas is to talk to them. Whether through a workshop, a use case form, or a chat over coffee, the most effective route differs from company to company. In all cases, however, the aim is to identify a use case and, ideally, to assess its feasibility right away.
DATA IDENTIFICATION & ENGINEERING
Once the use case is clear, the next step follows: relevant data for the solution is identified, acquired, and prepared for evaluation.
Ideally, the data is documented in a data catalog and stored in a data warehouse or data lake, allowing easy access. Often, however, there is no pertinent data (yet), in which case it must be generated or acquired.
Each of these processes – extraction or acquisition – falls either into the area of responsibility of a data scientist as a generalist or, in diversified companies, that of a data engineer.
The data engineer takes care of the consolidation, storage, and management of data to then make it available to consumers such as the data scientist.
The range of methods for acquiring, storing, and documenting data sets is extensive, and many tools address this fundamental step in the data science process.
It is not for nothing that skilled data engineers are currently in great demand. You can find details on tools & systems in our article about the data engineer and this role’s area of responsibility.
If it becomes apparent in this step that the data does not adequately represent the use case, is not available, or is of insufficient quality, a step back to the use case definition must be taken.
It is then essential to decide whether to proceed with the existing data or to rebuild the data basis.
The last thing to do is to prepare the data for the next steps. This includes merging different data sets, generating metrics, and cleaning the data. The goal of this step is to prepare a reliable data set for further processing or evaluation.
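The three preparation tasks named above (merging, metric generation, cleaning) can be sketched in a few lines of plain Python. This is only a minimal illustration under assumptions of our own: the `orders` and `customers` data, the `prepare` function, and the margin metric are all hypothetical and not taken from any real project.

```python
# Hypothetical raw data: orders and customers from two different sources.
orders = [
    {"order_id": 1, "customer_id": "C1", "revenue": 120.0, "cost": 80.0},
    {"order_id": 2, "customer_id": "C2", "revenue": 200.0, "cost": 150.0},
    {"order_id": 2, "customer_id": "C2", "revenue": 200.0, "cost": 150.0},  # duplicate
    {"order_id": 3, "customer_id": "C1", "revenue": None, "cost": 40.0},    # missing value
    {"order_id": 4, "customer_id": "C9", "revenue": 50.0, "cost": 30.0},    # unknown customer
]
customers = {"C1": {"region": "North"}, "C2": {"region": "South"}}


def prepare(orders, customers):
    """Clean the orders, merge in customer attributes, and add a margin metric."""
    seen_ids = set()
    prepared = []
    for order in orders:
        # Cleaning: skip duplicates and rows with missing revenue.
        if order["order_id"] in seen_ids or order["revenue"] is None:
            continue
        seen_ids.add(order["order_id"])
        # Merging: attach the region; customers missing from the lookup get "unknown".
        region = customers.get(order["customer_id"], {}).get("region", "unknown")
        # Metric generation: relative margin per order.
        margin = (order["revenue"] - order["cost"]) / order["revenue"]
        prepared.append({**order, "region": region, "margin": round(margin, 3)})
    return prepared


dataset = prepare(orders, customers)
for row in dataset:
    print(row)
```

In practice, this step is usually done with a data frame library or in SQL; the plain-Python version is chosen here only so that the sketch runs without additional dependencies.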
PROCESSING, ANALYTICS / MACHINE LEARNING & EVALUATION
The core of data science is to generate insights from the data. In a somewhat broader sense, the pure processing of data also counts as part of data science.
As a result, there are several ways to accomplish this process step, ranging from processing and analyzing data to the flagship discipline, machine learning. We briefly explain each of these three categories here.
As already mentioned, the pure processing of data can also count as data science. Object recognition is one example.
Capturing imagery and algorithmically recognizing particular objects can be a very challenging task. These and other tasks, such as natural language processing in cognitive computing, enable a great deal of automation and value generation.
Many consider classic analytics an intermediate step on the way to machine learning. Nevertheless, a purely statistical, descriptive analysis of data can also be the core solution of a data science project.
The fast Fourier transform (FFT) and the corresponding analysis of sound data can be cited as an example. If such a use case fits into the overall data science process, analytics can also be the endpoint of the evaluation.
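To make the sound-analysis example more tangible, here is a small sketch that finds the dominant frequency in a signal. It uses a textbook recursive radix-2 FFT written in plain Python; the synthetic signal, the sample rate, and the function names `fft` and `dominant_frequency` are our own assumptions for illustration. A real project would typically use an optimized library routine instead.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return list(samples)
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest component below the Nyquist limit."""
    spectrum = fft(samples)
    half = len(samples) // 2
    # Skip bin 0 (the constant offset) and look only at the non-mirrored half.
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(samples)


# Synthetic "sound": a 50 Hz tone plus a weaker 120 Hz tone, 1024 samples at 1024 Hz.
sample_rate = 1024
signal = [
    math.sin(2 * math.pi * 50 * t / sample_rate)
    + 0.4 * math.sin(2 * math.pi * 120 * t / sample_rate)
    for t in range(1024)
]
peak_hz = dominant_frequency(signal, sample_rate)
print(peak_hz)
```

Because the signal length equals the sample rate here, each frequency bin is exactly 1 Hz wide, so the stronger 50 Hz tone falls on a single bin and is reported as the peak.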
Much more often, however, data science is associated with machine learning. Artificial intelligence is a topic of great importance, and data scientists combine the necessary skill set to implement this approach.
This step of the data science process thus includes the entire machine learning process of feature engineering, model training, evaluation, and optimization.
Use cases include the prediction of values or categories (supervised learning), such as sales forecasting or object recognition; the identification of similar behavior (unsupervised learning); and the implementation of recommendation or reinforcement learning systems (e.g., product recommendations or autonomous pathfinding).
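As a rough illustration of the supervised case, the following sketch fits a straight line to made-up monthly sales figures and evaluates it on months the model has not seen. The data, the single feature (the month index), and the helper names `fit_line` and `mean_absolute_error` are assumptions chosen for brevity; real sales forecasting would normally involve more features and a dedicated library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up monthly sales figures (units sold); the month index is the only feature.
months = list(range(1, 13))
sales = [102, 108, 115, 117, 126, 131, 133, 142, 147, 151, 158, 163]

# Model training on the first nine months; the last three are held out for evaluation.
train_x, test_x = months[:9], months[9:]
train_y, test_y = sales[:9], sales[9:]

slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
holdout_mae = mean_absolute_error(test_y, predictions)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, holdout MAE={holdout_mae:.2f}")
```

Evaluating on held-out months rather than on the training data is the simplest guard against an overly optimistic picture of forecast quality.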
Overall, it can be said that this step in the data science process is the most delicate: Only if the correct data is available in good quality will the results be of high quality (“garbage in, garbage out”).
And only if you, as a data scientist, produce a result that is trusted will your expertise be heard in the future.
As far as tools & systems are concerned, this landscape is again vast. In general, there are three main strands for analyzing and modeling data: you use programming/scripting languages such as Python or R, data mining tools such as KNIME or RapidMiner, or cloud services such as Azure Analytics or Google AutoML. Since each of these options has a different focus, we refer to our detailed article on machine learning.
The solution must be re-evaluated once you have achieved a result that represents the best balance between forecast quality and avoidance of overfitting. Another collaborative look at the results allows the business to exert further influence and contribute its domain expertise.
DEPLOYMENT & MONITORING OF THE SOLUTION
If the decision is made to transfer the solution (the machine learning model) into production and use it operationally, deployment is the next step.
Model deployment means that either the information is made available via a dashboard or the model itself is made available via a machine learning pipeline. Other company systems or channels can then access the results and process them further.
This task is usually found in data engineering or IT DevOps, as the technology has to be integrated into the IT landscape.
Once a solution has been put into production, it must be monitored and, if necessary, repaired or improved, for example when the relationship between the input data and the target changes over time (“concept drift”).
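One very simple way to monitor a deployed model is to compare its recent prediction errors with the errors observed at release time. The sketch below does exactly that; rising error is only a symptom that may point to concept drift, not proof of it. The function `drift_detected`, the tolerance factor of 1.5, and all error values are hypothetical.

```python
from statistics import mean


def drift_detected(baseline_errors, recent_errors, tolerance=1.5):
    """Flag possible drift when the recent mean error exceeds tolerance times the baseline."""
    return mean(recent_errors) > tolerance * mean(baseline_errors)


# Hypothetical absolute forecast errors: at release time and in two later periods.
baseline = [1.1, 0.9, 1.3, 1.0, 1.2]
stable_period = [1.0, 1.4, 1.1, 1.2, 0.9]
drifted_period = [2.1, 2.6, 1.9, 3.0, 2.4]

print(drift_detected(baseline, stable_period))   # errors stay close to the baseline
print(drift_detected(baseline, drifted_period))  # errors have roughly doubled
```

A check like this can run on a schedule and trigger an alert or a retraining job; the right threshold depends on the use case and has to be agreed with the business side.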
This post-deployment service maintenance must be considered early on. It must be neatly transferred into the IT processes; otherwise, processes or channels may try to access a service that is unavailable, not up-to-date, or provides incorrect information.
Involved roles in data science
As mentioned several times, there are many roles involved in data science. Here we list all roles, sorted by how frequently they occur in the process.
DATA SCIENTIST
The role at the center of data science is, of course, the data scientist. There are different interpretations of which tasks the role should take on.
As a generalist, the data scientist usually covers the entire process, while more and more organizations tend to specialize the role. In general, the following tasks fall into the data scientist’s portfolio:
- Understanding and definition of the use case, conceptualization of the solution
- Data identification and extraction for the use case
- Exploratory data analysis, feature engineering
- Machine learning modeling, evaluation, and optimization
- Delivery of the information or deployment of the model
BUSINESS STAKEHOLDER / DOMAIN EXPERT
Domain experts are the second most important role in any data science project because they are the interface to practical experience and evaluate the success of use cases.
As a result, close cooperation between them and the technical experts is fundamental to developing meaningful and commercially viable use cases.
- Definition of use cases
- Assessment of strategic importance and expected return
- Contact person for domain expertise and experience
- Evaluation of the usability and the success of the result
DATA ENGINEER
Where there is no data, there is no analysis. Even if data scientists are often in the limelight when analyzing data, data engineers first lay the foundations for it. They support the data science process through the following:
- Development and maintenance of data infrastructure, databases, and cloud services
- Establishment and maintenance of data pipelines for the acquisition and consolidation of data
- Providing interfaces for data consumers
- If necessary, delivery of the solutions and model deployment
DATA ARCHITECT
If you deal with data infrastructure at the enterprise level, you will quickly encounter data architects. They oversee the entire IT infrastructure landscape and are responsible for the following:
- Classification of data infrastructure in the company’s IT landscape
- Definition and, if necessary, development of solutions for requirements of use cases, including data warehouse and data lake
- Possibly responsibility for issues such as security and access control
DATA ANALYSTS / BUSINESS ANALYSTS
“What is the difference between data analysts and data scientists?” is one of the most frequently asked questions in data science. In short, data analysts primarily work with structured data from data warehouses and handle ad-hoc analyses from the domain. In contrast, data scientists work with greater variance in each of these aspects. Nevertheless, data analysts support the process in the following ways:
- Definition of data sources that fit the application
- Support with descriptive data analysis and feature engineering
- Help with the visualization of data, for example, using dashboards
DEVOPS / IT
As described in the data science process, at some point a machine learning model or other scripts will be operationalized. To solve this task with the appropriate software expertise, resources from the IT department are brought in. DevOps supports the process through:
- Integration of data science solutions into the overall IT landscape
- Provision of interfaces between data science solutions and other channels (e.g., website, apps, ERP, CRM)
- Maintenance and monitoring of uptime and functionality of the solution
DATA TRANSLATOR / DATA AMBASSADOR
Last but not least, a role that is not yet widespread but is experiencing a certain amount of hype. The data translator or data ambassador mediates between the technical expertise in the data science area and the stakeholders in the domain. In concrete terms, this means:
- Inspiration and definition of use cases
- Consulting and knowledge transfer into the domain and from the domain into the technical team
- Translation of technical results from the data science process so that the business clearly understands them
Definition of terms
One of the main difficulties in data science is the unclear terminology and how to tell the terms apart. Therefore, we would like to clear up the buzzword jungle by contrasting data science with other terms and defining the differences.
DATA SCIENCE VS. DATA MINING
Data mining is the exploratory examination of existing data for new patterns using statistical and machine learning methods. Data science is more comprehensive than data mining in both process (use case definition, data acquisition, etc.) and methodology.
DATA SCIENCE VS. ARTIFICIAL INTELLIGENCE (AI)
Artificial intelligence refers to the simulation of intelligent behavior using algorithms. Data science makes effective use of this idea by using machine learning methods or other algorithms.
However, AI and data science are not the same. Data science describes a process that goes far beyond algorithms (use cases, data acquisition, etc.), while AI as a topic is not fully represented by data science.
DATA SCIENCE VS. MACHINE LEARNING (ML)
As shown in the data science process, machine learning is just one of many methods or technologies for analyzing data. Machine learning is therefore a tool in the analysis step and certainly one of the figureheads of data science, but the two are not equivalent.
DATA SCIENCE VS. DATA ANALYTICS
Data analytics describes a structured procedure for evaluating recorded and organized data against precise requirements (e.g., KPIs). Data science, on the other hand, encompasses a greater range of technologies, data types, evaluation approaches, and purposes.
For details on the difference between data science and data analytics, visit our article “Data Scientist vs. Data Analyst: What’s the Difference?”.
DATA SCIENCE VS. ADVANCED ANALYTICS
Advanced Analytics refers to the methodology used in data science, so the correct comparison should be “Analytics vs. Advanced Analytics.”
The difference here is that analytics mainly analyzes descriptively (“What happened?”) or partly diagnostically (“Why did something happen?”), while advanced analytics, using machine learning and cognitive computing, also analyzes predictively (“What will happen?”) or prescriptively (“How to react?”).
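The four questions can be illustrated on one tiny data set. The sketch below is a deliberately simplified assumption of ours: the weekly figures, the target of 140 units, and the budget options are invented, and a single straight-line fit stands in for the far richer methods that advanced analytics uses in practice.

```python
from statistics import mean

# Made-up weekly figures: advertising spend and units sold.
ad_spend = [10, 12, 9, 15, 14, 18]
units = [100, 110, 95, 130, 124, 150]

mean_x, mean_y = mean(ad_spend), mean(units)
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, units))
var_x = sum((x - mean_x) ** 2 for x in ad_spend)
var_y = sum((y - mean_y) ** 2 for y in units)

# Descriptive ("What happened?"): summarize the past.
average_units = mean_y

# Diagnostic ("Why did something happen?"): how strongly do sales move with spend?
correlation = cov / (var_x * var_y) ** 0.5

# Predictive ("What will happen?"): least-squares line, extrapolated to a spend of 20.
slope = cov / var_x
intercept = mean_y - slope * mean_x
forecast_at_20 = slope * 20 + intercept

# Prescriptive ("How to react?"): smallest budget option predicted to reach the target.
target_units = 140
budget_options = [12, 14, 16, 18, 20]
recommended_spend = min(
    (b for b in budget_options if slope * b + intercept >= target_units),
    default=None,
)

print(average_units, round(correlation, 3), round(forecast_at_20, 1), recommended_spend)
```

Note that a high correlation alone does not establish why sales changed; a real diagnostic analysis would have to rule out other explanations.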
DATA SCIENCE VS. DATA ENGINEERING
As mentioned in the process, data engineering covers the acquisition and connection of data, the setup and maintenance of database systems, and the setup of cloud services.
All of these things fall into the phase of the data science process that prepares for analysis. If companies only have data scientists as generalists, data engineering often falls into their area of responsibility. However, it is better if the company has its own data engineers who take care of this aspect.