Oracle cranks up MySQL HeatWave’s thermostat for in-database machine learning

Since acquiring Sun Microsystems well over a decade ago, Oracle has owned MySQL. Under Oracle’s watch, MySQL has remained distinct. But until a couple of years ago, few outside of MariaDB gave Oracle’s stewardship much thought. And as each of the major cloud providers rolled out its own managed MySQL database service, Oracle provided relatively few reasons to draw customers to its own MySQL offering.

Well, that's no more. Fifteen months ago, Oracle introduced MySQL HeatWave, featuring its own optimized implementation of MySQL running in Oracle Cloud Infrastructure (OCI, Oracle’s public cloud platform). Those optimizations should be transparent to the application. Now Oracle is releasing HeatWave 3.0, which scales up node size in a way that will reduce costs for a number of workloads and introduces in-database machine learning, which could benefit from higher-density data nodes.

HeatWave isn’t plain vanilla open source MySQL, as it is differentiated by Oracle-developed extensions (outlined below). That’s not particularly unusual in open source: Amazon Aurora and Azure PostgreSQL Hyperscale, not to mention the countless other PostgreSQL variants on the market, show that open source databases provide clean slates for differentiation.

How Oracle makes HeatWave (and MySQL) different

In making its move to become a serious competitor in the MySQL space, Oracle took the database in a unique direction with HeatWave: It optimized for analytics in addition to transaction processing by taking advantage of MySQL’s support for pluggable storage engines. In this case, it plugged in an in-memory columnar storage engine that operates side by side with the row store, incorporating optimizations tailored for processing analytic queries.

Plugging in a columnar storage engine working side by side with a row-oriented engine isn’t unusual; MariaDB has done it and, in fact, Oracle took a similar path but with different technology for its flagship database several years ago. But to this day, Oracle is the only one to pull off an analytic-optimized engine for MySQL.

In the latest release, Oracle has introduced enhancements to reduce compute costs and bring machine learning in-database.

Let’s start with operating costs. HeatWave version 3.0 doubles data density in each compute node without changing the pricing. So you can now consume (and pay for) just half the number of nodes to run the same workload. And, by the way, Oracle set the stage for all this in the previous HeatWave 2.0 release, where it doubled the maximum cluster size to 64 nodes.

Combined, compute cost efficiencies and scale should come in handy now that machine learning models can be run in-database. Hold that thought.

Beyond data density, HeatWave 3.0 makes it more economical to scale, in that you can add any number of nodes (up to a maximum of 64) in any increment. This is consistent with what Oracle introduced for its Autonomous Database cloud service, getting rid of the so-called standard “T-shirt sizes.” So elasticity with HeatWave doesn't mean that you have to double the number of active nodes each time your workload needs a burst of compute. HeatWave also improves availability while resizing, with querying paused for a few microseconds at most.
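To make the cost argument concrete, here is a rough capacity-planning sketch. The data volume and per-node capacities are made-up round numbers, not Oracle's published figures, and `nodes_needed` and `next_power_of_two` are hypothetical helper names; only the 64-node ceiling comes from the article.

```python
import math

MAX_NODES = 64  # cluster ceiling cited in the article


def nodes_needed(data_gb: float, gb_per_node: float) -> int:
    """Smallest node count whose combined capacity holds data_gb."""
    nodes = max(math.ceil(data_gb / gb_per_node), 1)
    if nodes > MAX_NODES:
        raise ValueError("workload exceeds the 64-node cluster limit")
    return nodes


def next_power_of_two(n: int) -> int:
    """Node count a hypothetical doubling-only ("T-shirt size") policy would force."""
    return 1 << (n - 1).bit_length()


# Hypothetical workload: 9,000 GB of data, 400 GB per node before the
# density doubling and 800 GB per node after it.
before = nodes_needed(9000, 400)
after = nodes_needed(9000, 800)

print(before, after)             # 23 12
print(next_power_of_two(after))  # 16
```

Under these assumed numbers, doubling density cuts the cluster from 23 nodes to 12, and being able to provision exactly 12 nodes rather than rounding up to 16 avoids paying for four idle nodes.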

HeatWave 3.0 adds a few tricks to further speed up processing. Like any columnar storage engine, HeatWave makes ample use of data compression. And it applies common techniques such as Bloom filters to reduce the amount of intermediate memory required for query processing. Specifically, HeatWave has implemented blocked Bloom filters, which perform the necessary data lookups with much less overhead.
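For readers unfamiliar with the technique, the following is a minimal sketch of a generic blocked Bloom filter. It is not Oracle's implementation; the class name, block size, and hash scheme are illustrative assumptions. The idea is that every probe bit for a key lands in a single small block (here 512 bits, the size of a typical cache line), so a lookup touches one block of memory instead of several scattered locations.

```python
import hashlib


class BlockedBloomFilter:
    """Bloom filter whose probe bits for a key all fall in one fixed-size block."""

    def __init__(self, num_blocks: int = 1024, block_bits: int = 512, num_hashes: int = 4):
        self.num_blocks = num_blocks
        self.block_bits = block_bits    # 512 bits = one 64-byte cache line
        self.num_hashes = num_hashes
        self.blocks = [0] * num_blocks  # each block is an int used as a bitmask

    def _probe(self, key: str):
        """Return the block index and the bitmask of probe bits for key."""
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # odd stride for double hashing
        block = h1 % self.num_blocks
        base = h1 // self.num_blocks
        mask = 0
        for i in range(self.num_hashes):
            mask |= 1 << ((base + i * h2) % self.block_bits)
        return block, mask

    def add(self, key: str) -> None:
        block, mask = self._probe(key)
        self.blocks[block] |= mask

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        block, mask = self._probe(key)
        return (self.blocks[block] & mask) == mask


# Typical use in a join: record the keys of the smaller table, then skip
# rows of the larger table whose key cannot possibly match.
bloom = BlockedBloomFilter()
for customer_id in ("c-101", "c-102", "c-103"):
    bloom.add(customer_id)

print(bloom.might_contain("c-102"))  # True
```

A filter like this can return false positives but never false negatives, which is why it is safe to use as an early discard step before the real join.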

These capabilities, in turn, clear the way for Oracle to introduce the capability to process machine learning models inside the database, without the need for an external ETL engine or a machine learning execution environment. In so doing, Oracle is following a trend that has also been seen at AWS (Amazon Redshift ML), Google (BigQuery ML), Microsoft (SQL Server with in-database R and Python functions), Snowflake (with Snowpark), and Teradata (via extended SQL). But comparing these approaches is like comparing apples and oranges, as the providers take different paths: some develop models externally, some provide limited, curated choices for running ML, and others extend SQL itself.

HeatWave goes the curated route. It’s an approach suited for business analysts or “citizen data scientists,” democratizing machine learning in the same way that self-service visualization placed BI into the hands of the average user. By contrast, the external route is aimed at data scientists in organizations competing on their capability to develop their own unique, highly sophisticated models.

A bonus of the curated approach is that it doesn't require external tools, meaning that selecting, configuring, training and running ML models is performed entirely inside the database. That eliminates the overhead and cost of moving data to tools or ML services running on separate nodes. Oracle also touts the fact that keeping it all in-database reduces potential attack surfaces and consequently reduces security exposure.

Here’s how HeatWave’s AutoML approach works. The user chooses the table, columns and the type of algorithm (e.g., regression or classification), and then specifies where the model artifacts should be stored. The system automatically determines the best algorithm, the appropriate features, and the optimal hyperparameters and generates a tuned model.
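As a rough illustration of the idea only, the sketch below runs a toy automated search in plain Python: it tries a few candidate regression algorithms and hyperparameter values, scores each on held-out rows, and returns the winner along with a record of what was chosen. This is not HeatWave's API or Oracle's search method; `auto_train`, `SEARCH_SPACE`, and the three candidate algorithms are assumptions made for the example, and feature selection is omitted.

```python
import random
from statistics import mean


def fit_mean(train, _param):
    """Baseline: always predict the average target value."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def fit_linear(train, _param):
    """Ordinary least squares on a single feature."""
    x_bar = mean(x for x, _ in train)
    y_bar = mean(y for _, y in train)
    var = sum((x - x_bar) ** 2 for x, _ in train)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in train)
    slope = cov / var if var else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_knn(train, k):
    """Average the targets of the k training rows nearest to x."""
    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict


# Candidate algorithms, each with the hyperparameter values to try.
SEARCH_SPACE = {
    "mean": (fit_mean, [None]),
    "linear": (fit_linear, [None]),
    "knn": (fit_knn, [1, 3, 5]),
}


def mse(model, rows):
    return mean((model(x) - y) ** 2 for x, y in rows)


def auto_train(rows, seed=0):
    """Pick the (algorithm, hyperparameter) pair with the lowest hold-out error."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    split = int(len(rows) * 0.8)
    train, valid = rows[:split], rows[split:]
    best = None
    for name, (fit, params) in SEARCH_SPACE.items():
        for param in params:
            score = mse(fit(train, param), valid)
            if best is None or score < best["validation_mse"]:
                best = {"algorithm": name, "hyperparameter": param,
                        "validation_mse": score}
    # Refit the winner on all rows and keep the selection record with it.
    best["model"] = SEARCH_SPACE[best["algorithm"]][0](rows, best["hyperparameter"])
    return best


# Synthetic, exactly linear data: y = 3x + 2.
data = [(float(x), 3.0 * x + 2.0) for x in range(50)]
result = auto_train(data)
print(result["algorithm"])  # linear
```

The user supplies only the rows and the task; the search picks the algorithm and its settings, and the returned record documents that choice.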

How AutoML works in HeatWave

AutoML streamlines key steps. For instance, when testing a candidate model, it separates the individual tasks or steps that the model performs, with each step evaluated using proxies or stubs that simulate the algorithm against a representative sampling of hyperparameters. It then automatically documents the choice of data, algorithms and hyperparameters to make the model explainable, as shown in the figure below.

The advantage of in-database ML processing is a flatter architecture and elimination of the overhead of data movement. While the flipside of bringing any application processing into the database is heavier processing overhead, there are several design features that make this issue moot.

Cloud-native architecture, which allows compute to be scaled as necessary, eliminates the issue of contention for limited resources. Furthermore, most cloud analytics platforms supporting in-database ML either streamline or support only limited libraries of models to prevent the AI equivalent of the workload from hell, especially for training runs that tend to be the most time-consuming and compute-intensive. Oracle has published ML benchmarks for HeatWave 3.0 that are available on GitHub for customers and prospects to run themselves and verify.

Oracle’s introduction of ML processing in HeatWave complements an ML-related feature from its last release, version 2.0 from last summer. That release featured MySQL Autopilot, which uses internalized machine learning to help customers operate the database, such as suggesting how to provision and load the database, while offering closed-loop automation for failure handling/error recovery and query execution.

With version 3.0, MySQL HeatWave comes full circle, using ML to help run the database and support running ML models inside it. This is another example of a prediction I made for this year, that machine learning will take center stage, both for optimizing the operation of the database and for providing customers the capability to develop and/or run models in-database.
