Join top executives in San Francisco on July 11-12 to hear how leaders are integrating and optimizing AI investments for success.
Data science is the application of scientific techniques and mathematics to making business decisions. More specifically, it has become known for the data mining, machine learning (ML) and artificial intelligence (AI) processes increasingly applied to very large (“big”) and often heterogeneous collections of semi-structured and unstructured data.
The term was first suggested in the 1970s as a synonym for “computer science” and then in the 1980s as an alternative phrase for “statistics.” Finally, in the 1990s, a consensus began to form that data science is an interdisciplinary practice combining data collection, computer processing and analysis. It is seen as “scientific” because it applies systematic analysis to observable, real-world data.
Since then it has connoted that full span, from the aggregation of source data to its application in technical and business decision-making and processes.
But it has also come to be more narrowly associated with the specialized role and function of “data scientists” in burgeoning data departments who are managing ever more data in the modern enterprise.
Data science in the broader sense
In its broader sense, data science can be seen as the application of scientific techniques and mathematics to making business decisions. This work can be broken into three major areas:
- Gathering: Simply collecting the information from disparate computer systems can be a challenge in itself. The data is often in different formats and it may contain spurious or incomplete records. Once the data is cleaned and standardized, it must be stored so that data science algorithms can be run on it again and again in the future.
- Analyzing: Looking for patterns, and understanding how the demands upon each stage of the enterprise are changing, requires a mixture of statistical analysis and artificial intelligence.
- Reporting: Reports can summarize activity, flag anomalous behavior and predict trends and opportunities. Tables, charts, visualizations and animated summaries can tell a story and guide decision-makers.
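The three areas above can be illustrated with a minimal sketch that uses only the Python standard library. The records, field names and the "more than twice the median" anomaly rule are all assumptions made up for illustration; a real pipeline would choose its own sources and thresholds.

```python
import statistics

# Hypothetical raw records, as they might arrive from different systems.
raw_records = [
    {"day": "2023-01-01", "units": "120"},
    {"day": "2023-01-02", "units": "125"},
    {"day": "2023-01-03", "units": None},   # incomplete record
    {"day": "2023-01-04", "units": "131"},
    {"day": "2023-01-05", "units": "n/a"},  # spurious value
    {"day": "2023-01-06", "units": "410"},  # unusual spike
    {"day": "2023-01-07", "units": "138"},
]


def gather(records):
    """Gathering: drop unusable rows and standardize units to integers."""
    cleaned = []
    for rec in records:
        try:
            cleaned.append({"day": rec["day"], "units": int(rec["units"])})
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric values
    return cleaned


def analyze(cleaned):
    """Analyzing: summarize the data and flag values far above the median."""
    median = statistics.median(rec["units"] for rec in cleaned)
    # Assumed rule for this sketch: anything above 2x the median is anomalous.
    anomalies = [rec for rec in cleaned if rec["units"] > 2 * median]
    return {"median": median, "anomalies": anomalies}


def report(summary):
    """Reporting: turn the analysis into lines a decision-maker can read."""
    lines = [f"Median daily units: {summary['median']}"]
    for rec in summary["anomalies"]:
        lines.append(f"Anomaly on {rec['day']}: {rec['units']} units")
    return lines


cleaned = gather(raw_records)
summary = analyze(cleaned)
for line in report(summary):
    print(line)
```

Keeping the three stages as separate functions mirrors the division of labor described above: the cleaned data can be stored once and analyzed many times.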
Just as data science is sometimes used in this broader sense, “business intelligence” (BI) and “data analytics” may likewise be more generally applied. Depending on the history, scale and focus of an enterprise’s data department, the department itself, its function or its key staffers may be tasked more broadly, titled accordingly, or both.
However, these terms have different origins, and today they are most often applied to narrower functions.
What is the function of data science in a larger data department?
Teams of developers, or software engineers, work with data scientists and data analysts to create tools and solutions that collect data from a wide variety of sources, integrate it, analyze it and then deliver reports or dashboards that everyone can use to make decisions.
Many of these approaches and tools have been given names. Some of the most common are the following:
- Data warehouse: In a data warehouse the information is stored in a collection of well-ordered tables and structures, often in relational databases. The data is usually well-filtered and sometimes already analyzed. In industries with questions about legal compliance, the data is already checked for anomalies and issues for investigation.
- Data lake: In a data lake the idea is to gather the information in a central repository. This is similar to a data warehouse, and the differences aren’t always clear. In general, data lakes hold more raw data that is less filtered or processed. If questions appear, the data is readily available to be examined, but often this work isn’t done unless there’s a demand for the answers.
- Data store: Data stores tend to be simpler systems that offer more transitory and temporary collections. An example might be all of the data collected by a factory on one day or week. The data is often processed and sent to a lake or warehouse.
- Data mart: Data marts can offer either internal or external users highly processed data collections for immediate consumption. Inside companies, they may hold official reports that have been checked and certified. Some companies also offer external marts that sell data collections or offer them for free.
- Predictive analytics: Some use this term to emphasize how data science can help plan for the future with predictions based upon past data.
- Customer data platform: Some tools are focused on tracking customers to aid in marketing. These often integrate with third-party data sources to build better models of individuals so that marketing efforts can be customized for them.
- Data as a service: Some companies specialize in packaging collections of data so that customers can integrate them into their own data science work.
- Integrated development environments (IDE): These software packages are also used by developers. They collect many of the common tools for analysis, like a Python or R package, and marry them with an editor and file manager so that data scientists can experiment with writing and running new analyses in one place.
- Notebook: Notebooks are often thought of as dynamic or living documents. They bundle together text, charts, tables and data with the software that produced them. This allows data scientists to share both their results and the analysis that created those results. Readers can not only read the text, they can make changes and explore immediately.
- Notebook host: Many teams of data scientists dedicate servers to hosting notebooks. These systems store the data and text alongside the results so the notebooks can be read and easily experimented with. Some companies offer hosting as a service.
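The distinction between a data lake, a data warehouse and a data mart can be sketched in a few lines of Python, using the standard `json` and `sqlite3` modules. This is only an illustration under stated assumptions: the sensor events, the `readings` table and the helper names are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import json
import sqlite3

# "Lake": raw events kept as they arrived, with inconsistent fields.
lake = [
    '{"sensor": "A", "temp_c": 21.5, "note": "ok"}',
    '{"sensor": "B", "temp_c": null}',
    '{"sensor": "A", "temp_c": 22.5}',
    '{"sensor": "B", "temp_c": 19.0, "firmware": "1.2"}',
]


def load_warehouse(raw_events):
    """"Warehouse": filter the raw events into a well-ordered table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings (sensor TEXT NOT NULL, temp_c REAL NOT NULL)"
    )
    for line in raw_events:
        event = json.loads(line)
        if event.get("temp_c") is None:
            continue  # unusable rows stay in the lake only
        conn.execute(
            "INSERT INTO readings VALUES (?, ?)",
            (event["sensor"], event["temp_c"]),
        )
    conn.commit()
    return conn


def build_mart(conn):
    """"Mart": a small, processed summary ready for immediate consumption."""
    query = (
        "SELECT sensor, AVG(temp_c), COUNT(*) "
        "FROM readings GROUP BY sensor ORDER BY sensor"
    )
    return conn.execute(query).fetchall()


warehouse = load_warehouse(lake)
mart = build_mart(warehouse)
for sensor, avg_temp, count in mart:
    print(f"{sensor}: average {avg_temp:.1f} C over {count} readings")
```

The raw list is never modified, which reflects the idea that a lake keeps less-processed data available in case new questions appear later.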
How are some of the major companies approaching data science?
The major cloud companies devote substantial resources to helping their customers manage and analyze large datasets that often are measured in petabytes or exabytes.
Each of these platforms offers more services than can be summarized in a short article. They provide multiple options for both storage and analysis so data scientists can choose the best tools for their jobs.
IBM integrates its data storage with a collection of statistical analysis packages and artificial intelligence algorithms. These tools, marketed in a number of forms such as the Cloud Pak for Data, manage access and establish rules for protecting privacy from the beginning.
The tools, which are available both as services and for local installations, can integrate data across multiple clouds and servers.
IBM also offers a collection of AI tools and services under its Watson brand that provide algorithms for classifying datasets and searching for signals.
Oracle offers a wide range of databases that can act as the foundation for data lakes and warehouses, either on premises, in Oracle’s cloud data centers or in hybrids of both.
Oracle Cloud Infrastructure supports some of the standard data science tools, using R, Python and MATLAB, so that the information from these databases can be turned into notebooks, reports or dashboards filled with tables, charts and graphs.
The company has also invested heavily in providing pathways for training artificial intelligence models and deploying them into production environments.
Oracle is purchasing companies and devoting developers to produce more customized solutions for particular industries with data-intensive needs, like healthcare.
Microsoft’s Azure cloud offers databases and data storage options such as Azure Cosmos DB, which developers can access via a SQL or NoSQL API. Microsoft’s data science services range from statistical packages to artificial intelligence routines.
One option, the Data Science Virtual Machine, allows users to boot up a cloud instance with all of the common packages that are optimized for big data analysis and machine learning projects.
Another tool, the Azure Machine Learning Studio, handles most of the details of data storage and analysis so the user can build notebooks that explore the signals in a dataset without worrying about software configuration.
Amazon offers a diverse collection of data storage options, ranging from managed versions of open-source databases like PostgreSQL to cold storage for maintaining copies of archives at a low price. Data scientists can also choose between Amazon Web Services’ (AWS) own products and some from other companies that are hosted in the AWS cloud.
Tools like QuickSight, for instance, are designed to simplify creating good, responsive data visualizations that can also adapt as users ask questions. Other products like Kinesis focus on particular data types, like real-time video or website clickstreams. SageMaker supports teams that want to build and deploy machine learning models with predictive power.
Google Cloud Platform (GCP) can collect and process large amounts of data using a variety of databases, such as BigQuery, which is optimized for extremely large datasets. Google’s data analysis options include raw tools for creating large data fabrics as well as data analysis studios for exploring. Colab, for example, hosts Jupyter notebooks for data science work that have seamless access to a large collection of GPUs for compute-intensive work.
The company has invested heavily in AI and offers a wide range of tools that develop models for extracting insights from data. The Vertex AI Workbench, for example, is a Jupyter-based front end that connects to all of the backend AI services available on Google’s cloud.
How are startups and challengers handling data science?
A variety of companies want to help others understand the wisdom that may be hidden inside their data. Some are building platforms for storing and analyzing data. Others are just creating tools that can be installed on local machines. Some offer a service that may be measured by the byte.
At the core of many of these products and services are open-source software packages like R or Python, the common languages used by data scientists. There are also several good open-source packages that offer an integrated data analysis environment. RStudio, Eric and Eclipse are just a few examples of tools that deliver a comfortable environment for exploring data.
JetBrains sells PyCharm, an integrated development environment for creating Python applications. Many programmers work on Python-based data science there. The company also distributes a free community edition that is popular with many schools.
Snowflake makes a cloud-based data storage platform with a wide range of features including cybersecurity, collaboration and governance control. There are many uses for this data lake or data warehouse service; supporting machine learning and data science is one of the most popular. Snowflake’s cloud supports many common applications and can run many Python applications on the data stored in its cloud.
Kaggle is a data science platform that offers storage and analysis, both for private datasets and for many of the public ones from sources like governments and universities. The data science is often done with notebook-based code that runs in Kaggle’s cloud using either standard hardware or specialized Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs). The company also sponsors data science contests, which some companies use to tap into the creativity and wisdom of the open community.
The Databricks Lakehouse Platform supports storing and analyzing data either in its cloud, in many of the big clouds, or on premises. The tool helps orchestrate complex workflows that collect data from multiple sources, integrate it and then generate charts, graphs, tables and other reports. Many of the most common data science routines are easily applied as steps in these workflows. The goal is to provide a powerful data collection and storage platform that also produces good data science in the process.
Is there anything that data science can’t do?
Questions about the limitations of science have long been deep and often philosophical ones for scientists. The same questions about the power and precision of mathematical tools are important for users who want to understand how businesses and other organizations function. The limits of statistical analysis and machine learning apply just as readily to data science work.
In many cases, the problems aren’t with the mathematics or the algorithms. Simply gathering good-quality data is a challenge. Analysis can’t really begin to be trustworthy until data scientists ensure that their data is reliable and consistent.
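As a rough illustration of what checking for reliable and consistent data can mean in practice, the following standard-library Python sketch counts rows with missing required fields and duplicated keys before any analysis is attempted. The rows, the field names and the `quality_report` and `is_trustworthy` helpers are hypothetical; real checks would depend on the dataset.

```python
from collections import Counter

# Hypothetical customer rows with typical quality problems.
rows = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": "", "amount": 5.0},               # missing email
    {"id": 2, "email": "b@example.com", "amount": 5.0},  # duplicate id
    {"id": 3, "email": "c@example.com", "amount": None}, # missing amount
]


def quality_report(rows, required, key):
    """Count rows with missing required fields and duplicated key values."""
    missing = sum(
        1 for row in rows
        if any(row.get(field) in (None, "") for field in required)
    )
    key_counts = Counter(row.get(key) for row in rows)
    # Each extra occurrence of a key beyond the first counts as a duplicate.
    duplicates = sum(count - 1 for count in key_counts.values() if count > 1)
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}


def is_trustworthy(report):
    """Assumed rule for this sketch: no missing fields and no duplicates."""
    return report["missing"] == 0 and report["duplicates"] == 0


result = quality_report(rows, required=["email", "amount"], key="id")
print(result)
print("Ready for analysis:", is_trustworthy(result))
```

A report like this does not fix the data, but it makes the problems visible before they can distort an analysis.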
VentureBeat's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact.