What is data science? The applications and approaches

Data science is the application of scientific techniques and mathematics to making business decisions. More specifically, it has become known for the data mining, machine learning (ML) and artificial intelligence (AI) processes increasingly applied to very large (“big”) and often heterogeneous collections of semi-structured and unstructured data.

The term was first suggested in the 1970s as a synonym for "computer science" and then in the 1980s as an alternative phrase for "statistics." Finally, in the 1990s, a consensus began to form that data science is an interdisciplinary practice combining data collection, computer processing and analysis. It is seen as "scientific" because it applies systematic analysis to observable, real-world data.

Since then it has connoted that full span, from the aggregation of source data to its application in technical and business decision-making and processes.

But it has also come to be more narrowly associated with the specialized role and function of "data scientists" in burgeoning data departments who are managing ever more data in the modern enterprise.

Data science in the broader sense

In its broader sense, data science can be seen as the application of scientific techniques and mathematics to making business decisions. This work can be broken into three major areas:

Just as data science is sometimes used in this broader sense, “business intelligence” (BI) and “data analytics” may likewise be more generally applied. Depending on the history, scale and focus of an enterprise’s data department, the department itself, its function or its key staffers may carry these broader responsibilities or titles.

However, these terms have different origins and are most often applied to narrower functions today.

What is the function of data science in a larger data department?

Teams of developers, or software engineers, work with data scientists and data analysts to create tools and solutions that collect data from a wide variety of sources, integrate it, analyze it and then deliver reports or dashboards that everyone can use to make decisions.
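The collect, integrate, analyze and report steps can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two in-memory "sources" (crm_rows and order_rows), the field names and the helper functions integrate, analyze and report are hypothetical and do not come from any particular product. A real pipeline would read from databases, files or APIs and would typically rely on dedicated tooling.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data "collected" from two separate sources.
crm_rows = [
    {"customer_id": 1, "region": "east"},
    {"customer_id": 2, "region": "west"},
    {"customer_id": 3, "region": "east"},
]
order_rows = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": 80.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 3, "amount": 60.0},
]


def integrate(customers, orders):
    """Join each order to its customer's region using customer_id."""
    region_by_id = {c["customer_id"]: c["region"] for c in customers}
    return [
        {**order, "region": region_by_id[order["customer_id"]]}
        for order in orders
        if order["customer_id"] in region_by_id  # skip orders with no known customer
    ]


def analyze(rows):
    """Compute order count, total and mean order amount per region."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row["region"]].append(row["amount"])
    return {
        region: {"orders": len(values), "total": sum(values), "mean": mean(values)}
        for region, values in amounts.items()
    }


def report(summary):
    """Render the summary as plain text, one line per region."""
    return "\n".join(
        f"{region}: {stats['orders']} orders, "
        f"total {stats['total']:.2f}, mean {stats['mean']:.2f}"
        for region, stats in sorted(summary.items())
    )


summary = analyze(integrate(crm_rows, order_rows))
print(report(summary))
```

Keeping each stage as a separate function mirrors the division of labor described above: the integration step can change when a new source is added without touching the analysis or the report.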

Many of these approaches and tools have been given names. Some of the most common are the following:

How are some of the major companies approaching data science?

The major cloud companies devote substantial resources to helping their customers manage and analyze large datasets that often are measured in petabytes or exabytes.

In all of these cases, these major cloud platforms offer more services than can be summarized in a short article. They offer multiple options for both storage and analysis so data scientists can choose the best tools for their jobs.

IBM

IBM integrates its data storage with a collection of statistical analysis packages and artificial intelligence algorithms. These tools, marketed in a number of forms such as the Cloud Pak for Data, manage access and establish rules for protecting privacy from the beginning.

The tools, which are available both as services and for local installations, can integrate data across multiple clouds and servers.

IBM also offers a collection of AI tools and services under its Watson brand that provide algorithms for classifying datasets and searching for signals.

Oracle

Oracle offers a wide range of databases that can act as the foundation for data lakes and warehouses, either on premises, in Oracle’s cloud data centers or in hybrids of both.

Oracle Cloud Infrastructure supports some of the standard data science tools, such as R, Python and MATLAB, so that the information from these databases can be turned into notebooks, reports or dashboards filled with tables, charts and graphs.

The company has also invested heavily in providing pathways for training artificial intelligence models and deploying them into production environments.

Oracle is purchasing companies and devoting developers to produce more customized solutions for particular industries with data-intensive needs, like healthcare.

Microsoft

Microsoft’s Azure cloud offers databases and data storage options such as Azure Cosmos DB, which developers can access via a SQL or NoSQL API. Microsoft's data science services range from statistical packages to artificial intelligence routines.

One option, the Data Science Virtual Machine, allows users to boot up a cloud instance with all of the common packages that are optimized for big data analysis and machine learning projects.

Another tool, the Azure Machine Learning Studio, handles most of the details of data storage and analysis so the user can build notebooks that explore the signals in a dataset without worrying about software configuration.

Amazon

Amazon offers a diverse collection of data storage options, ranging from managed versions of open-source databases like PostgreSQL to cold storage for maintaining copies of archives at a low price. Data scientists can also choose between Amazon Web Services’ (AWS) own products and some from other companies that are hosted in the AWS cloud.

Tools like QuickSight, for instance, are designed to simplify creating responsive data visualizations that can adapt as users ask questions. Other products like Kinesis focus on particular data types, like real-time video or website clickstreams. SageMaker supports teams that want to build, train and deploy machine learning models with predictive power.

Google

Google Cloud Platform (GCP) can collect and process large amounts of data using a variety of databases, such as BigQuery, which is optimized for extremely large datasets. Google’s data analysis options include raw tools for creating large data fabrics as well as data analysis studios for exploring. Colab, for example, hosts Jupyter notebooks for data science work that have seamless access to a large collection of GPUs for compute-intensive work.

The company has invested heavily in AI and offers a wide range of tools that develop models for extracting insights from data. The Vertex AI Workbench, for example, is a Jupyter-based front end that connects to all of the backend AI services available on Google’s cloud.

How are startups and challengers handling data science?

A variety of companies want to help others understand the wisdom that may be hidden inside their data. Some are building platforms for storing and analyzing data. Others are just creating tools that can be installed on local machines. Some offer a service that may be measured by the byte.

At the core of many of these products and services are open-source languages like R and Python, which are widely used by data scientists. There are also several good open-source packages that offer an integrated data analysis environment. RStudio, Eric and Eclipse are just a few examples of tools that deliver a comfortable environment for exploring data.

JetBrains sells PyCharm, an integrated development environment for creating Python applications. Many programmers work on Python-based data science there. The company also distributes a free community edition that is popular with many schools.

Snowflake makes a cloud-based data storage platform with a wide range of features including cybersecurity, collaboration and governance control. There are many uses for this data lake or data warehouse service; supporting machine learning and data science is one of the most popular. Snowflake’s cloud supports many common applications and can run many Python applications on the data stored in its cloud.

Kaggle is a data science platform that offers both storage and analysis, for private datasets as well as for many public ones from sources like governments and universities. The data science is often done with notebook-based code that runs in Kaggle’s cloud using either standard hardware or specialized graphics processing units (GPUs) or tensor processing units (TPUs). The company also sponsors data science contests, which some companies use to tap into the creativity and wisdom of the open community.

The Databricks Lakehouse Platform supports storing and analyzing data either in its cloud, in many of the big clouds, or on premises. The tool helps orchestrate complex workflows that collect data from multiple sources, integrate it and then generate charts, graphs, tables and other reports. Many of the most common data science routines are easily applied as steps in these workflows. The goal is to provide a powerful data collection and storage platform that also produces good data science in the process.

Is there anything that data science can’t do?

The limitations of science have long been a deep and often philosophical question for scientists. The same questions about the power and precision of mathematical tools are important for users who want to understand how businesses and other organizations function. The limits of statistical analysis and machine learning apply just as readily to data science work.

In many cases, the problems aren’t with the mathematics or the algorithms. Simply gathering good-quality data is a challenge. Analysis can’t really begin to be trustworthy until data scientists ensure that their data is reliable and consistent.
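As an illustration of what checking for reliable and consistent data can involve, the sketch below counts rows with missing values and exact duplicate rows before any analysis begins. It is a minimal example using only the Python standard library; the function name quality_report, the sample sensor readings and the field names are all hypothetical, and real projects usually apply many more checks (value ranges, types, timestamps and so on).

```python
def quality_report(rows, required_fields):
    """Count rows with missing required values and exact duplicate rows.

    A value is treated as missing if it is None or an empty string.
    Two rows are duplicates if they match on every required field.
    """
    rows_with_missing = 0
    duplicate_rows = 0
    seen = set()
    for row in rows:
        values = tuple(row.get(field) for field in required_fields)
        if any(value in (None, "") for value in values):
            rows_with_missing += 1
        if values in seen:
            duplicate_rows += 1
        else:
            seen.add(values)
    return {
        "rows": len(rows),
        "rows_with_missing": rows_with_missing,
        "duplicate_rows": duplicate_rows,
    }


# Hypothetical sensor readings with two incomplete rows and one repeated row.
readings = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": None},
    {"id": 1, "temp_c": 21.5},
    {"id": 3, "temp_c": ""},
]

result = quality_report(readings, ("id", "temp_c"))
print(result)
```

A report like this does not repair anything by itself. It only tells the data scientist how much cleanup is needed before the results of an analysis can be trusted.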