What are graph database query languages?

A new generation of graph databases has taken hold, and a generation of query languages has arrived alongside them. The assorted graph database query languages include the likes of Gremlin, Cypher, and GQL and serve to unpack the information inside graphs.

All databases need a way to talk with their clients, and the query languages they speak define what the database can do. Good graph database query languages unlock the power of graph databases by making it possible -- and sometimes easy -- for developers to ask complex questions about the networks defined in the databases. In the beginning, the languages were proprietary and invented for each new database, but there has been a recent push to create open standards.

In the world of relational databases, SQL (Structured Query Language) has been the dominant standard for years. It defines a way to search for the rows in a table that match specific criteria. If the data spans several tables, it offers a way to align the tables so all the information is joined together in one consistent collection. It's good at finding a particular set of entries with a particular field that matches some rule, but it was not designed around following long chains of relationships.

Classic relational databases can store graphs, and before graph databases it was common for developers to use them because they were the only option. SQL can answer basic questions, but traditional query languages generally can't answer the most useful and tantalizing questions. Ironically, perhaps, relational databases are not nearly as good at representing very complex relations as graph databases are. Often, the only solution for a relational database query is to return large blocks of data so the client software can run the analysis.
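
To make that limitation concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name (follows), the column names, and the sample people are invented for illustration. It stores a small graph as a table of edges and finds everyone reachable in two hops with a self-join.

```python
import sqlite3

# Hypothetical edge table: each row is one directed "follows" link.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [("ann", "bob"), ("bob", "cat"), ("bob", "dan"), ("cat", "eve")],
)

def two_hops(conn, start):
    """Return the people reachable from start by following two links, sorted."""
    rows = conn.execute(
        """
        SELECT DISTINCT hop2.dst
        FROM follows AS hop1
        JOIN follows AS hop2 ON hop1.dst = hop2.src
        WHERE hop1.src = ?
        ORDER BY hop2.dst
        """,
        (start,),
    )
    return [name for (name,) in rows]

print(two_hops(conn, "ann"))  # ['cat', 'dan']
```

Each extra hop in this style needs another join written into the query, which is part of why questions about paths of unknown length get awkward in a table-oriented language.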

Graph query languages were created to answer more complex questions like:

In a family tree, how many second cousins does a person have?
In a social media graph recording friends or followers, how many degrees of separation are there between two users?
In a graph of a company's supply chain, what is the largest number of hops between the factory and a customer?
In a collection of banking transactions, are there some people who are connected to an above-average number of fraudulent transactions?
In a computer network, where can a new connection with higher bandwidth fix a bottleneck?
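
As an illustration of the degrees-of-separation question above, the following Python sketch runs a breadth-first search over a small, made-up friendship graph. The names and the FRIENDS mapping are assumptions for the example. A graph database would run a comparable traversal inside the engine rather than in client code.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency mapping.
FRIENDS = {
    "ann": {"bob"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"bob", "eve"},
    "dan": {"bob"},
    "eve": {"cat"},
    "zed": set(),
}

def degrees_of_separation(graph, start, goal):
    """Breadth-first search: return the fewest hops from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        for friend in graph.get(person, ()):
            if friend == goal:
                return hops + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None  # the two users are not connected at all

print(degrees_of_separation(FRIENDS, "ann", "eve"))  # 3
```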

Graph databases require different models because the analysis must go deeper than the basic relations that can be stored in tables. Some queries require following several links or hops before calculating certain statistics. At first, each graph database shipped its own proprietary query language. Lately, the graph database companies have been cross-pollinating by implementing one another's languages and working toward an open standard. The most common graph query languages are:

Gremlin -- A graph traversal language developed as part of the Apache TinkerPop project that allows procedural or declarative queries.
Cypher -- First created by Neo4j and later opened to others through the openCypher project, this declarative language allows searching for nodes and edges that match particular properties.
GQL -- This proposed standard attempts to unify the styles of Cypher, GSQL, and PGQL.
SPARQL -- A standard developed for querying knowledge graphs stored in the RDF format.
PGQL -- Oracle's original language for searching and collecting information from nodes that match specifications.
GSQL -- TigerGraph's original procedural language.
AQL -- ArangoDB's original procedural language.
GraphQL -- Although the name suggests it supports graph querying, this is a more general query language for APIs that is typically layered over document and relational databases. It is finding some uses with graph databases, but only for supporting the same general queries as it does with those other databases.

There are a number of major differences between the query languages. Some are said to be "declarative," while others are "procedural." That is, some let the developer declare what they want by writing simple rules for defining a subset. The database takes the rules, constructs a search plan using any available indices and then finds all potential matches.

One declarative query might find all bank transactions over $10,000 that took place within 10 miles of each other. Another might search for all social media users who are connected to each other and haven't posted in two weeks. The rules can include all of the filtering on values found in standard query languages ("WHERE AGE<20"), as well as other more complex rules about the network of connections ("IS RELATED TO"). In general, the graph query languages are most successful when they search through the graph of relationships.

The procedural versions come closer to traditional computer languages by allowing the developer to control how the database searches through the items, often by writing loops or other control structures. In general, declarative languages are easier to understand and use because they hide much of the work of searching, but procedural languages are more powerful. Some databases offer a combination of both.
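
The contrast can be sketched in a few lines of Python. This is a toy, not Cypher or Gremlin: the match function, the sample data, and the age limit are all invented for illustration. In the declarative style the caller hands over only a rule; in the procedural style the caller writes the loop and decides when to skip.

```python
# Hypothetical data: user ages and directed "follows" edges.
AGES = {"ann": 17, "bob": 34, "cat": 19, "dan": 52}
FOLLOWS = [("ann", "bob"), ("cat", "ann"), ("dan", "cat"), ("bob", "dan")]

def match(edges, where):
    """Declarative style: the caller supplies a rule; this toy engine decides how to scan."""
    return sorted((src, dst) for src, dst in edges if where(src, dst))

def young_followers_procedural(edges, ages, limit):
    """Procedural style: the caller spells out the loop and the order of the checks."""
    found = []
    for src, dst in edges:
        if ages[src] >= limit:
            continue  # skipping early is a choice the developer makes by hand
        found.append((src, dst))
    found.sort()
    return found

declarative = match(FOLLOWS, lambda src, dst: AGES[src] < 20)
procedural = young_followers_procedural(FOLLOWS, AGES, 20)
print(declarative == procedural)  # True
```

Both calls return the same edges. The difference is who owns the search strategy: a real declarative engine is free to use an index or reorder the work, while the procedural version runs exactly as written.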

Another major difference comes from the structure of the database itself. Some support the RDF model, while others support so-called property graphs. The RDF model is a W3C standard first designed to encode semantic information. Property graph models tend to be more general and flexible, and some databases support both models.
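
A rough way to picture the two models is shown below in Python. The data, the field names, and the helper functions are assumptions made for the example, and real RDF stores identify subjects and predicates with IRIs rather than short strings. The RDF style reduces everything to subject-predicate-object triples, while the property graph style lets nodes and edges carry their own key-value properties.

```python
# RDF style: every fact is a (subject, predicate, object) triple.
TRIPLES = [
    ("ann", "type", "Person"),
    ("ann", "age", 17),
    ("bob", "type", "Person"),
    ("ann", "follows", "bob"),
]

# Property-graph style: nodes and edges each carry their own properties.
NODES = {
    "ann": {"label": "Person", "age": 17},
    "bob": {"label": "Person"},
}
EDGES = [
    {"src": "ann", "dst": "bob", "label": "follows", "since": 2021},
]

def objects(triples, subject, predicate):
    """Look up values in the triple list by scanning for matching triples."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def edge_property(edges, src, dst, key):
    """Read a property stored directly on an edge; returns None if absent."""
    for edge in edges:
        if edge["src"] == src and edge["dst"] == dst:
            return edge.get(key)
    return None

print(objects(TRIPLES, "ann", "age"))              # [17]
print(edge_property(EDGES, "ann", "bob", "since"))  # 2021
```

Note that the "since" value sits directly on the edge in the property graph. Recording the same detail in plain triples would take additional triples that describe the "follows" statement itself.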

How do legacy players approach graph query languages?

Oracle brought graph capabilities to its main database by adding graph-searching functions to its regular SQL query language. Extensions called PGQL (Property Graph Query Language) offer a concise way to search graphs and create reports about nodes that match criteria. Its graph analytics framework starts with dozens of common algorithms that can be extended to build complex summaries of the underlying data. The database supports both property graphs and RDF-style graphs.

Microsoft added graph capabilities to SQL Server in 2017 and extended its version of SQL with a MATCH clause that matches patterns of nodes and edges. The searching can be extended with stored procedures for imperative queries. Microsoft's Azure Cosmos DB supports the Apache TinkerPop API, and thus Gremlin-style queries.

Amazon's main graph database -- Amazon Neptune -- supports both property graphs and RDF-style graphs. The property graphs can be searched with Gremlin-style queries, while SPARQL is used for the RDF-style graphs.

IBM has been working with a number of graph databases, like Neo4j, and also offers its own product as a service in its cloud. The service, called IBM Graph, uses the TinkerPop API with Gremlin, as well as a simpler API for basic retrieval.

How are the upstarts responding?

Neo4j has in recent years become one of the most influential graph databases, and it remains a leader in the field. Because it is still an independent company rather than part of a larger vendor, it is grouped here with the upstarts, even though several of these graph database players have a long lineage.

Neo4j has vigorously encouraged other companies to use its query language, Cypher, via the openCypher project. Neo4j is also a big supporter of the GQL standardization process, and the company supports GraphQL for some queries.

TigerGraph stores property graphs and queries them with GSQL, a procedural approach that simplifies parallel processing for scaling to larger datasets. The company behind that database offers a sophisticated visual tool for exploring and querying the dataset. Called GraphStudio, it is available as both a product and a cloud service.

OrientDB is an open source database that uses Gremlin and SQL for querying. It was built by a company that was purchased by SAP, which is now integrating it with the SAP product line.

ArangoDB is designed to support both graph and NoSQL document datasets. The open source database is available as both a community edition and a commercial version that can be purchased as a service. Its associated query language, known as AQL, offers a procedural approach to searching through the data.

AllegroGraph stores RDF-style graphs that can be queried with SPARQL and RDFS++, as well as through programming languages such as Prolog, a logic programming language, and Allegro Common Lisp. Its knowledge graph explorer, Gruff, runs in browsers for visual querying. The product is available for local installation and in clouds like AWS.

Ontotext is focused on creating big knowledge graphs, and its GraphDB supports SPARQL queries for RDF-style graphs. Ontotext offers three versions (Free, Standard, and Enterprise) with most of the same features, although the free version is limited to two concurrent queries.

Is there anything that graph database query languages can't do?

The graph query languages can offer a concise way to search for particular combinations of entries that fit specific patterns. Some questions, however well-specified, can be difficult to answer in an efficient way.

Certain graph problems, like finding subsets of highly connected nodes called cliques, fall into a class known as NP-complete and may be difficult to solve efficiently. The answers may take exponentially longer to find as the size of the problem grows -- in other words, these won't scale. And it can be dangerously simple to write a query that will take a very long time to solve.
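
The scaling problem is easy to see in a brute-force sketch. The Python below is an illustrative approach, not how any particular database works, and the tiny graph is invented for the example. It looks for the largest clique by testing subsets of nodes from largest to smallest. With n nodes there are up to 2**n subsets to examine, so every added node can double the work.

```python
from itertools import combinations

# Hypothetical undirected graph: a, b, c, d are all mutually linked; e hangs off d.
EDGES = {frozenset(pair) for pair in [
    ("a", "b"), ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e"),
]}
NODES = sorted({node for edge in EDGES for node in edge})

def is_clique(nodes, edges):
    """True if every pair of the given nodes is directly connected."""
    return all(frozenset(pair) in edges for pair in combinations(nodes, 2))

def largest_clique(nodes, edges):
    """Brute force: try subsets from largest to smallest until one is a clique."""
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if is_clique(subset, edges):
                return list(subset)
    return []

print(largest_clique(NODES, EDGES))  # ['a', 'b', 'c', 'd']
```

Five nodes finish instantly. The same approach on a few hundred nodes would not finish in any practical amount of time, which is the kind of trap a short, innocent-looking query can hide.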
