Oracle's Autonomous Data Warehouse expansion offers potential upside for tech professionals

In March, Oracle announced an expansion to their Autonomous Data Warehouse that can bring the benefits of ADW -- automating previously manual tasks -- to large groups of new potential users. Oracle calls the expansion "the first self-driving database," and its goal with the new features is to "completely transform cloud data warehousing from a complex ecosystem ... that requires extensive expertise into an intuitive, point-and-click experience" that will enable all types of professionals to access, work with, and build business insights with data, from engineers to analysts and data scientists to business users, all without the help of IT.

A serious bottleneck to data work delivering business value across industries is the amount of expertise required at many steps along the data pipeline. The democratization of data tooling is about increasing ROI when it comes to an organization's data capabilities, as well as increasing the total addressable market for Oracle's ADW. Oracle is also reducing the total cost of ownership with elastic scaling and auto-scaling for changing workloads. We spoke with George Lumpkin, Neil Mendelson, and William Endress from Oracle, who shared their time and perspective for this article.

The landscape: democratization of data tooling

There is a growing movement toward democratizing data tooling, and the space is increasingly crowded with tools such as AWS SageMaker Studio (which we have reviewed previously), DataRobot, Qlik, Tableau, and Looker. It is telling that Google has recently acquired Looker and Salesforce has acquired Tableau. On top of this, the three major cloud providers all offer drag-and-drop data tooling to varying extents: AWS has a growing number of GUI-based data transformation and machine learning tools; Microsoft Azure has a point-and-click visual interface for machine learning, covering "data preparation, feature engineering, training algorithms, and model evaluation"; and Google Cloud Platform has similar functionality as part of its Cloud AutoML offering.

In their announcement, Oracle frames the ADW enhancements as self-service tools for:

Analysts, including loading and transforming data, building business models, and extracting insights from data (note that ADW also provides some interesting third-party integrations, such as automatically building data models that can be consumed by Tableau or Qlik).

Data scientists (and "citizen data scientists"), including building and deploying machine learning models (in a video, Andrew Mendelsohn, executive VP of Oracle Database Server Technologies, describes how data scientists can "easily create models with AutoML" and "integrate ML models into apps via REST or SQL").

LoB developers, including Low-Code App Dev and API-Driven Development.

Oracle Autonomous Data Warehouse competes with incumbent products including Amazon Redshift, Azure Synapse, Google BigQuery, and Snowflake. But Oracle does not necessarily see ADW as directly competitive, targeting existing on-premises customers in the short run but with an eye to self-service ones in the longer term. As Lumpkin explained, "Many of Oracle's Autonomous Data Warehouse customers are existing on-prem users of Oracle who are looking to migrate to the cloud. However, we have also designed Autonomous Data Warehouse for the self-service market, with easy interfaces that allow sales and marketing operations teams to move their team's workloads to the cloud."

Oracle's strategy highlights a tension in tech: Traditional CIOs with legions of database administrators (DBAs) are worried about the migration to the cloud. DBAs who have built entire careers around patching and tuning databases may find themselves short of work in a self-service world where cloud providers like Oracle patch and tune enterprise databases.

CIOs who measure their success based on headcount and on-premises spend might also be worried. As Mendelson put it: "70% of what the DBA used to do should be automated." Given that Oracle's legacy business still caters to DBAs and CIOs, how does the company feel about potentially upsetting its traditional advocates? While the Oracle team acknowledged that automation would reduce some of the tasks traditionally performed by DBAs, they were not worried about complete job redundancy. Lumpkin explained, "By lowering the total cost of ownership for analytics, the business will be demanding 5x the number of databases." In other words, DBAs and CIOs will see the same transformation that accountants saw with the advent of the spreadsheet, and there should be plenty of higher-level strategic work for DBAs in the new era of Oracle's cloud.

Of course, this isn't to say there won't be any changes. After all, change is inevitable as certain functions are automated away. DBAs need to refocus on their unique value add. "Some DBAs may have built their skill sets around patching Oracle databases," explains Lumpkin. "That's now automated because it was the same for every customer, and we could do it more consistently and reliably in the cloud. It was never adding value to the customer. What you want is your people doing work that is unique to your datasets and your organization."

We did a deep dive into different parts of ADW tools. Here's what we found.

Autonomous Data Warehouse setup

The automated provisioning and database setup tools were well done. The in-app screens and tutorials mostly matched one another, and we could get set up in about five minutes. That said, there were still some rather annoying steps. For example, the user needs to create both a "database user" and an "analytics user." This makes a lot of sense on centrally administered databases serving an entire enterprise, but is overkill for a tool for a single analyst trying to get started (much less a tutorial for an analyst tool). The vast majority of data scientists and data analysts do not want to be database administrators, and the tool could benefit from a mode that hides this detail from the end user. This is a shortcoming that Oracle understands. As Lumpkin explains, "We have been looking at how to simplify the create-user flow for new databases. There are competing best practices for security [separation of duties between multiple users] and fastest onboarding experiences [with only one user]." Overall, the documentation is very well done, and onboarding is straightforward, though it could be a bit smoother.

Data insights

The automated insights tool is also interesting and could prove powerful. The tool runs many queries against your dataset, generating predicted values for a target column. It then highlights the unexpected values, where the predicted values deviate significantly from the actual values. The algorithm appears to be running multiple groupbys and identifying groups with highly unexpected values. While this may lead to some risk of data dredging if used naively, it does provide some quick speedups: A large fraction of data analysis comes from understanding unexpected results, and this feature can help with that.
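To make the idea concrete, here is a minimal sketch of the kind of groupby-and-compare logic the tool appears to use. It is our own illustration, not Oracle's algorithm: the function name, the sample sales data, and the choice of the overall mean as the "predicted" value are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def unexpected_groups(rows, group_cols, target, threshold=1.0):
    """Flag groups whose mean target value is far from the overall mean.

    Assumption: the overall mean stands in for the "predicted" value, and
    the deviation is measured in units of the overall standard deviation.
    """
    values = [row[target] for row in rows]
    overall = mean(values)
    spread = pstdev(values) or 1.0  # avoid dividing by zero on constant data
    findings = []
    for col in group_cols:
        groups = defaultdict(list)
        for row in rows:
            groups[row[col]].append(row[target])
        for key, group_values in groups.items():
            actual = mean(group_values)
            score = (actual - overall) / spread
            if abs(score) >= threshold:
                findings.append((col, key, actual, round(score, 2)))
    # Largest deviations first
    return sorted(findings, key=lambda f: -abs(f[3]))


# Hypothetical sample data
sales = [
    {"region": "east", "channel": "web", "sales": 100},
    {"region": "east", "channel": "store", "sales": 110},
    {"region": "west", "channel": "web", "sales": 105},
    {"region": "west", "channel": "store", "sales": 95},
    {"region": "north", "channel": "web", "sales": 300},
    {"region": "north", "channel": "store", "sales": 290},
]
findings = unexpected_groups(sales, ["region", "channel"], "sales")
```

On this sample, only the "north" region is flagged. The data-dredging risk is easy to see in the sketch: every extra grouping column adds more comparisons, so some groups will look unusual by chance alone.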

Business model

One of the pervasive challenges with data modeling is defining business logic on raw enterprise data. Typically, this logic might reside in the heads of individual business analysts, leading to the inconsistent application of business logic across reports by different analysts. Oracle's Data Tools provide a "Business Model" centralizing business logic into the database, increasing consistency and improving performance via caching. The tool offers some excellent features, like automatically detecting schemas and finding the keys for table joins. However, some of these features may not be very robust. While the tool could identify many valuable potential table joins in the tutorial movie dataset, it could only find a small subset of the relationships in the publicly available MovieLens dataset. Nonetheless, this is a valuable tool for solving a critical enterprise problem.
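One common heuristic for proposing join keys is value containment: a column is a candidate foreign key if nearly all of its values appear in a unique column of another table. The sketch below illustrates that heuristic under our own assumptions; we do not know how Oracle's tool detects keys, and the function name and sample tables are hypothetical.

```python
def candidate_join_keys(tables, min_overlap=0.9):
    """Suggest (child column, parent column, overlap) join candidates.

    tables maps table name -> {column name: list of values}.
    Assumption: a parent key must be unique, and a child column qualifies
    when at least min_overlap of its distinct values appear in the parent.
    """
    candidates = []
    for child_name, child_cols in tables.items():
        for parent_name, parent_cols in tables.items():
            if child_name == parent_name:
                continue
            for parent_col, parent_values in parent_cols.items():
                parent_set = set(parent_values)
                if len(parent_set) != len(parent_values):
                    continue  # not unique, so not usable as a key
                for child_col, child_values in child_cols.items():
                    child_set = set(child_values)
                    if not child_set:
                        continue
                    overlap = len(child_set & parent_set) / len(child_set)
                    if overlap >= min_overlap:
                        candidates.append(
                            (f"{child_name}.{child_col}",
                             f"{parent_name}.{parent_col}",
                             overlap)
                        )
    return candidates


# Hypothetical miniature of a movie ratings schema
tables = {
    "movies": {"movie_id": [1, 2, 3], "title": ["A", "B", "C"]},
    "ratings": {
        "user_id": [10, 11, 10, 12],
        "movie_id": [1, 1, 2, 3],
        "rating": [4.5, 3.5, 5.0, 2.5],
    },
}
joins = candidate_join_keys(tables)
```

A heuristic like this is brittle in the ways we observed: keys stored with different types or formatting in the two tables, or with partial coverage, fall below the threshold and are silently missed.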

Data transform

The data transform tool provides a GUI to specify functions to clean data. Cleaning data is the No. 1 job of a data scientist or data analyst, making this a critical feature. Unfortunately, the tool has made certain questionable design choices. They stem from the use of a GUI: Rather than specifying the transformation using a CREATE TABLE query in SQL, they ask you to write code in a GUI, awkwardly connecting functions with lines and clicking through menus to select options. While the end result is a CREATE TABLE query, this abandons the syntax that data scientists and analysts are familiar with, makes code less reproducible and less portable, and ultimately makes analysts and their queries more dependent on Oracle's GUI. Data professionals may wish to avoid this feature if they are eager to develop transferable skills and sidestep tool lock-in.
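For comparison, here is what a small cleaning step looks like as a plain-text CREATE TABLE query. This is a sketch that uses Python's built-in SQLite module rather than Oracle, purely so that it runs anywhere; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_customers (id INTEGER, name TEXT, country TEXT);
INSERT INTO raw_customers VALUES
  (1, '  Ada Lovelace ', 'uk'),
  (2, 'Grace Hopper', 'US'),
  (2, 'Grace Hopper', 'US'),
  (3, NULL, 'us');

-- The whole transformation is a single reviewable, versionable statement:
-- trim whitespace, normalize country codes, drop duplicates and null names.
CREATE TABLE clean_customers AS
SELECT DISTINCT id, TRIM(name) AS name, UPPER(country) AS country
FROM raw_customers
WHERE name IS NOT NULL;
""")

clean_rows = conn.execute(
    "SELECT id, name, country FROM clean_customers ORDER BY id"
).fetchall()
```

Because the transformation is plain text, it can be diffed, code-reviewed, and ported to another database with modest changes, which is the reproducibility and portability that a purely graphical flow gives up.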

To be clear, there are useful drag-and-drop features in a SQL integrated development environment (IDE). For example, Count.co, which offers a BI notebook for analysts, supports drag and drop for table and field names into SQL queries. This nicely connects the data catalog to the SQL IDE query and helps prevent misspelled table or field names without abandoning the fundamental text-based query scripts we are used to. Overall, it felt much more natural as an interface.

Oracle Machine Learning

Oracle's Machine Learning offering is growing and now includes ML notebooks, AutoML capabilities, and model deployment tools. One of the big challenges for Oracle and its competitors will be to demonstrate utility to data scientists and, more generally, people working in both ML and AI. While these new capabilities have come a long way, there's still room for improvement. Making data scientists use Apache Zeppelin-based notebooks will likely hamper adoption when so many of us are Jupyter natives; so will preventing users from custom-installing Python packages, such as PyTorch and TensorFlow.

The problem Oracle is attempting to solve here is one of the biggest in the space: How do you get data scientists and machine learners to use enterprise data that sits in databases such as Oracle DBs? The ability to use familiar objects such as pandas data frames and APIs such as matplotlib and scikit-learn is a good step in the right direction, as is the decision to host notebooks. However, we need to see more: Data scientists often prototype code on their laptops in Jupyter Notebooks, VSCode, or PyCharm (among many other choices) with cutting-edge OSS package releases. When they move their code to production, they need enterprise tools that mimic their local workflows and allow them to utilize the full suite of OSS packages.

A representative of Oracle said that the ability to custom install packages on Autonomous Database is a road map item to address in future releases. In the meantime, the inclusion of scikit-learn in OML4Py allows users to work with familiar Python ML algorithms directly in notebooks or through embedded Python execution, where user-defined Python functions run in database-spawned and controlled Python engines. This supplements the scalable, parallelized, and distributed in-database algorithms and provides the ability to manipulate data in database tables and views using Python syntax. Overall, this is a step in the right direction.

Oracle Machine Learning's documentation and example notebook library is extensive and valuable, allowing us to get up and running in a notebook in a matter of minutes with intuitive SQL and Python examples of anomaly detection, classification, and clustering among many others. This is welcome in a tooling landscape that all too often falls short in useful DevRel material. Learning new tooling is a serious bottleneck, and Oracle has removed a lot of friction here with their extensive documentation.

Oracle has also recognized that the MLOps space is heating up and that table stakes include the ability to deploy and productionize machine learning models. To this end, OML4Py provides a REST API for Embedded Python Execution, as well as a REST API that allows users to store ML models and create scoring endpoints for them. It is welcome that this functionality supports not only classification and regression OML models, but also Open Neural Network Exchange (ONNX) format models, which can include TensorFlow models. Once again, the documentation here is extensive and very useful.

Graph Analytics

Oracle's Graph Analytics offers the ability to run graph queries on databases. It is unique in that it allows users to query their data warehouse data directly. In contrast, Neptune, AWS' graph solution, requires loading data from the data warehouse (Redshift). Graph Analytics uses PGQL, an Oracle-supported language that queries graph data in the same way that SQL queries structured tabular data. The language's design is close to SQL's, and it is released under the open-source Apache 2.0 License. However, the main contributor is an Oracle employee, and Oracle is the only vendor supporting PGQL. The preferred mode of interacting with PGQL is through the company's proprietary Graph Studio tool, which doesn't promote reproducibility, advanced workflows, or interfacing with the rest of the development ecosystem. Lumpkin promised that REST APIs with Python and Java would be coming soon.
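For readers new to graph querying, the sketch below shows the kind of question a graph query language expresses declaratively: find every two-hop path along edges with a given label. It is hand-written Python over a hypothetical edge list, not PGQL, and it does not use any Oracle API.

```python
def two_hop_matches(edges, label):
    """Return (a, b, c) paths where a -[label]-> b -[label]-> c and c != a.

    edges is a list of (source, label, destination) triples.
    """
    outgoing = {}
    for src, edge_label, dst in edges:
        if edge_label == label:
            outgoing.setdefault(src, []).append(dst)
    return [
        (a, b, c)
        for a, mids in outgoing.items()
        for b in mids
        for c in outgoing.get(b, [])
        if c != a
    ]


# Hypothetical social graph
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("carol", "follows", "alice"),
    ("alice", "likes", "post1"),
]
paths = sorted(two_hop_matches(edges, "follows"))
```

In a graph query language, the same pattern is typically a single line describing the shape of the path, and the engine decides how to traverse it. Expressing it in SQL over an edge table requires a self-join for every hop.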

Perhaps unsurprisingly, Oracle's graph query language appears to be less popular than Cypher, the query language supported by Neo4j, a rival graph database (for example, the PGQL language has 114 stars on GitHub, while Neo4j has 8K+ stars). A proposal to bring together PGQL, Cypher, and G-Core has over 95% support from users across nearly 4K votes, has its own landing page, and is gaining traction internationally. While the survey methodology may be questionable -- the proposal is authored by the Neo4j team on a Neo4j website -- it's understandable why graph database users would prefer a more commonly used open standard. Hopefully, a common graph query standard will emerge to consolidate the competing languages and simplify graph querying for data scientists.

Final thoughts

Oracle is a large incumbent in an increasingly crowded space that's moving rapidly. The company is playing catch-up with recent developments in open source tooling and the long tail of emerging data tooling businesses, as well as with the ever-growing total addressable market of the space. We're talking not only about well-seasoned data scientists and machine learning engineers, but also about the increasing number of data analysts and citizen data scientists.

For Oracle, best known for its database software, these recent moves are intended to update its offerings to the data analytics, data science, machine learning, and AI spaces. In many ways, this is the data tooling equivalent of Disney making moves to streaming with Disney+. For the most part, Oracle's recent expansion of its Autonomous Data Warehouse delivers on its promise: to bring the benefits of ADW to large groups of new potential users. There are some lingering questions around whether these tools will meet all the needs of working data professionals, such as being able to work with their open-source packages of choice. We urge Oracle to prioritize such developments on its road map, as access to open source tooling is now table stakes for working data scientists.