Building Data Services with Apache Cassandra

All applications depend on data, yet application developers don’t like to think about databases. Learning the internals and query language of a particular database adds cognitive load, and requires context switching that detracts from productivity. Still, successful applications must be responsive, resilient and scalable, all characteristics that will be determined by the choice of database. How, then, should an application developer balance these considerations?

What if we could shift the balance, providing a data service in developer-friendly idioms, rather than expecting developers to learn database-specific idioms?

At the Stargate project, the open source API data gateway designed to work with Apache Cassandra, we’re excited to start talking publicly about our upcoming JSON API that meets JSON-oriented developers on their terms. Not only is this great news for JSON-oriented developers, but the technique we’ve followed constitutes a new design pattern for leveraging data APIs and advanced data modeling to produce data services.

In this article I’ll discuss how to provide developer-friendly idioms using Cassandra together with Stargate, and how we’re working to do just that for JSON.

Data Models: Interoperability vs. Idiom

In the early days, Cassandra was sometimes described as “a machine for making indexes.” This was a testament to Cassandra’s inherent resilience and flexibility, a clay out of which more robust structures could be molded. Cassandra today is a richer clay with greater possibilities. Not only is it a great database, it’s a great machine for making databases. Here at the Stargate project, we’re using the JSON API to prove this as the first example of a new paradigm in database development.

It’s not unusual for one database to be built out of another. Even MongoDB is built on top of WiredTiger, if you dig deep enough. AWS is known for its extensive use of MySQL behind the scenes, including using the MySQL storage engine for DynamoDB. So the idea of using Cassandra, with its inherent scalability and performance, as the building block for other data systems makes sense.

Yet application developers don’t really interact with databases. Even if your organization manages its own database infrastructure and builds applications against that infrastructure, the first step is generally to define and implement the data models your applications require.

Those data models mediate between application and database. In some sense, data modeling limits a database; it takes unformed, and thus general-purpose, clay and molds it into something purpose-built for a particular application idiom. We sacrifice interoperability for something idiomatic.

Is it a good idea to trade for something idiomatic and give up something interoperable? If you want to beat the averages, then the answer is an emphatic “yes.” We don’t think this way much when choosing databases, but we’ve long thought this way when choosing programming languages.

This idea was well expressed by Paul Graham decades ago when he explained how Viaweb won the early dot-com race to create the first broadly successful, web-based e-commerce platform.

Viaweb wasn’t necessarily the fastest or most scalable e-commerce platform. In Graham’s words it was “reasonably efficient.” Instead, Graham argues that, for programming languages, on a scale of machine-readable to human-readable, the more human-readable (and thus higher-level) languages are more powerful because they improve developer productivity. And at the time of Viaweb, Graham thought the most powerful language was Lisp. The crux of Graham’s argument is this:

“Our hypothesis was that if we wrote our software in Lisp, we’d be able to get features done faster than our competitors, and also to do things in our software that they couldn’t do. And because Lisp was so high level, we wouldn’t need a big development team, so our costs would be lower. If this were so, we could offer a better product for less money and still make a profit. We would end up getting all the users, and our competitors would get none and eventually go out of business.”

Unlocking Developer Productivity

Graham wrote those words 20 years ago, and developer productivity remains the North Star that guides much of the innovation in technology. Where Graham talks about the power of higher-level languages, we express that same concept as providing developers with tools that are more idiomatic to their software development experience.

Graham praises Lisp (rightly), and since the dot-com time we have seen a proliferation of new higher-level languages: Ruby and Rust, to name a couple. We’ve also seen the birth and proliferation of mobile device developer languages and frameworks, such as Swift, Flutter and Dart.

So why are languages like C and C++ still important? The old joke about C holds an important truth: “Combining the power of assembly language with the ease of use of assembly language.” If you want to write a compiler, then you need to move closer to the machine language idiom and away from the natural language idiom.

In other words, among other virtues, C and C++ are machines for building new languages. What’s easy to overlook in Graham’s praise of Lisp is that Lisp also has some of this “machine for building languages” characteristic.

Lisp was the first broadly used language to introduce the concept of macros, and it is often the concept of macros that trips up those new to Lisp. Once you understand macros, you understand that Lisp is more of a meta language than a language, and that macros can be used to build a purpose-built language for a specific problem domain.

Designing and creating an initial set of macros is hard, intellectually challenging work. But once Graham and the Viaweb team did that, in effect they had an e-commerce programming language to work with, and that unlocked the developer productivity that enabled them to outpace their competition.

Twenty years later, all of this seems clear enough in the context of programming languages. So, what has happened in the database world? The short answer is that databases have evolved more slowly.

The Data API Revolution

If tabular data is the assembly language of the database world, then SQL is the C/C++ of query languages. We developed tabular data structures and the concept of data normalization in an era when compute and storage were expensive, and for use cases that were well defined with relatively infrequent schema changes. In that context, to operate efficiently at any kind of scale, databases needed to closely mimic the way computers stored and accessed information.

Today’s world is the opposite, making that earlier time seem archaic by comparison: Compute and storage costs are highly commoditized, but in a world of real-time data combined with machine learning and artificial intelligence, use cases are open ended and schema changes are frequent.

The most recent revolution in database technology was the NoSQL revolution, a direct response to the canon of tabular, normalized data laid down by the high priests of the relational database world. When we say “NoSQL revolution,” we refer to the period from 2004, when Google released its MapReduce white paper, until 2007, when Amazon published its Dynamo white paper.

What emerged from this period was a family of databases that achieved unprecedented speed, scalability and resilience by dropping two cherished relational tenets: NoSQL databases favored denormalized data over data normalization, and favored eventual consistency over transactional consistency. Cassandra, first released in 2008, emerged out of this revolution.

Data APIs will be the next major revolution in database technology, a revolution that’s only just getting started. Changes in the database world tend to lag behind changes in programming languages and application development. So while RESTful APIs have been around for a while, and have helped usher in the architecture for distributed, service-oriented applications, we’re only just beginning to see data APIs manifest as a key part of application infrastructure.

To understand the significance of this revolution, and how, 20 years after Paul Graham’s proclamation, the database world is finally delivering on developer productivity, let’s look at Stargate’s own story. It starts by returning to the theme of interoperable versus idiomatic.

Stargate: A High-Fidelity, Idiomatic Developer Experience

When we decided that the Cassandra ecosystem needed a data gateway, we built the original set of Stargate APIs with a sense of urgency. That meant a monolithic architecture; monoliths are faster to build, yet not always better. We launched with a Cassandra Query Language (CQL) API, a REST API and a RESTful Document API. We quickly added GraphQL as an additional API. To date, Stargate has been interoperable; everything from Stargate is stored using a native CQL data model, so in principle, you could query any table from any API.

We’ve learned that in practice, no one really does this. Developers stick to their particular idiom. By favoring interoperability, we bled Cassandra-isms into the developer experience, thus impeding developer productivity. Because Stargate’s original version required developers to understand Cassandra’s wide-column tabular data structures, along with keyspaces and partitions, we anchored too close to the machine idiom and too far from the human idiom.

The interoperability trap is to favor general purpose over purpose-built in design thinking. We’ve pivoted to thinking in terms of purpose-built, which trades some general capability for a more specific mode of expression, moving us closer to the human idiom and further away from the machine idiom. And so we started to think: Could we deliver a high-fidelity idiomatic developer experience while retaining the virtues of Cassandra’s NoSQL foundations (scale, availability and performance)?

The key lies in data modeling. In order to turn Cassandra into the “Lisp of databases,” we needed a data model that could serve a purpose analogous to Lisp macros, together with a Stargate API that would enable developers to interact idiomatically with that data model. We started with JSON, the greatest common denominator of data structures among application developers, and thus started building the JSON API for Stargate. Then we had to figure out how best to model JSON in Cassandra.

Stargate already has a Document API. That original API uses a data model we call “shredding” to render a JSON document in Cassandra: one document maps to multiple rows in a single Cassandra table, which preserves interoperability. If you use CQL to query the resulting table, you’ll get back meaningful results.
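To make the idea of one document becoming many rows concrete, here is a minimal Python sketch. It is an illustration only: the row layout (document id, path, leaf value), the path syntax and the function name shred are assumptions made for this example, not the actual schema or code used by Stargate’s Document API.

```python
from typing import Any, Iterator, Tuple

# Hypothetical row shape for illustration: (document id, path to leaf, leaf value).
Row = Tuple[str, str, Any]


def shred(doc_id: str, node: Any, path: str = "") -> Iterator[Row]:
    """Yield one row per leaf value found in a JSON-like document."""
    if isinstance(node, dict):
        for key, child in node.items():
            # Extend the path with the field name, e.g. "address.city".
            yield from shred(doc_id, child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            # Extend the path with the array position, e.g. "tags[0]".
            yield from shred(doc_id, child, f"{path}[{index}]")
    else:
        # A scalar leaf becomes a single row.
        yield (doc_id, path, node)


doc = {"name": "Ada", "tags": ["db", "json"], "address": {"city": "Paris"}}
rows = list(shred("doc-1", doc))
for row in rows:
    print(row)
```

In this sketch a three-field document turns into four rows that all share the same document id, which is why a row-oriented query can still read them, and also why the document as a whole no longer lives in any single row.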

This original shredding data model has downsides. It does not preserve metadata about a document. For example, for any document containing arrays, once the document is written, we don’t know anything about array size without fully inspecting the document. More significantly, we’ve departed from Cassandra’s expectations about indexing. Cassandra indexes on rows, but we’ve now spread our document out across multiple rows, making a native Cassandra index of documents impossible.
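A small Python sketch can illustrate both downsides. The rows below are hand-written stand-ins for shredded documents, using a hypothetical (document id, path, value) layout that is an assumption of this example and not Stargate’s real schema. Answering “how long is this array?” or “which documents match on two fields?” means walking many rows rather than reading one.

```python
import re
from typing import Any, Dict, List, Set, Tuple

# Hypothetical shredded rows: (document id, path to leaf, leaf value).
Row = Tuple[str, str, Any]

rows: List[Row] = [
    ("doc-1", "name", "Ada"),
    ("doc-1", "tags[0]", "db"),
    ("doc-1", "tags[1]", "json"),
    ("doc-2", "name", "Grace"),
    ("doc-2", "tags[0]", "db"),
]


def array_size(all_rows: List[Row], doc_id: str, array_path: str) -> int:
    """Recover an array's length by scanning every row of the document."""
    pattern = re.compile(re.escape(array_path) + r"\[(\d+)\]")
    indexes = set()
    for row_doc_id, path, _value in all_rows:
        match = pattern.match(path)
        if row_doc_id == doc_id and match:
            indexes.add(int(match.group(1)))
    return len(indexes)


def docs_matching(all_rows: List[Row], conditions: Dict[str, Any]) -> Set[str]:
    """Find documents that satisfy every path == value condition.

    No single row can answer this, so matches are grouped per document.
    """
    matched: Dict[str, Set[str]] = {}
    for row_doc_id, path, value in all_rows:
        if path in conditions and conditions[path] == value:
            matched.setdefault(row_doc_id, set()).add(path)
    return {d for d, paths in matched.items() if paths == set(conditions)}


print(array_size(rows, "doc-1", "tags"))
print(docs_matching(rows, {"name": "Ada", "tags[0]": "db"}))
```

Both helpers have to reassemble per-document facts from scattered rows, which is the work a row-level index cannot do on its own.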

To make Cassandra a suitable storage engine for JSON, we were going to need a new data model, something superior to shredding. We called it “super shredding.” You can learn more about super shredding at Aaron Morton’s Cassandra Summit talk in December, but here’s a bit of a teaser: We take advantage of Cassandra’s wide-column nature to store one document per row, knowing that a Cassandra row can handle even very large documents.

We also have a set of columns in that row that are explicitly for storing standard metadata characteristics of a JSON document. Now we have something more easily indexable, as well as a means of preserving and retrieving metadata.
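As a rough illustration of the one-document-per-row idea, the Python sketch below builds a single row that carries the document plus some metadata about it. The column names (doc_json, array_sizes, leaf_values) and the function name super_shred are hypothetical choices for this example; the real super shredding schema is not described here and may look quite different.

```python
import json
from typing import Any, Dict


def super_shred(doc_id: str, doc: Any) -> Dict[str, Any]:
    """Build one row per document, with metadata kept in dedicated columns."""
    array_sizes: Dict[str, int] = {}   # hypothetical column: array path -> length
    leaf_values: Dict[str, Any] = {}   # hypothetical column: leaf path -> value

    def walk(node: Any, path: str) -> None:
        if isinstance(node, dict):
            for key, child in node.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            # Record the array length at write time so it need not be recomputed.
            array_sizes[path] = len(node)
            for index, child in enumerate(node):
                walk(child, f"{path}[{index}]")
        else:
            leaf_values[path] = node

    walk(doc, "")
    return {
        "doc_id": doc_id,
        "doc_json": json.dumps(doc, sort_keys=True),  # the whole document in one column
        "array_sizes": array_sizes,
        "leaf_values": leaf_values,
    }


row = super_shred("doc-1", {"name": "Ada", "tags": ["db", "json"]})
print(row["array_sizes"])
print(row["leaf_values"])
```

Because everything about the document sits in one row, a lookup such as the size of tags becomes a read of one column in one row, and a row-level index has a whole document to point at.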

Contributing Back to Cassandra

Yes, to get all this to work at scale we will need some underlying changes to Cassandra. Accord, which Apple is contributing to Cassandra 5, will help us handle data changes in a more transactional manner. Storage-attached indexing (SAI) and Global Sort, which DataStax is contributing to Cassandra 5, will help us handle ranged queries against JSON documents in a more performant manner.

Cassandra is not a static piece of software; it’s a vibrant and evolving open source project. So we’re also continuing a longstanding Cassandra tradition of using requirements that emerge client-side to foster changes on the database side. User needs have prompted the proposals for Accord, SAI and Global Sort. These will not only make Stargate’s JSON API better, but will make Cassandra better. This is a great reminder that data engineers and application developers are not two different communities, but complementary cohorts of the extended Cassandra community.

And JSON is just the first step. Essentially, out of Cassandra, Stargate and a reasonably efficient Cassandra data model, we will have built a document database that you interact with through a JSON API. Super shredding is our macro. This approach turns Cassandra into a machine for making databases.

Could this approach be followed by another database besides Cassandra? Not easily, and here’s why. There’s a sort of database analog of the second law of thermodynamics that works in Cassandra’s favor. We start with something that is fast, scalable and resilient, but not very idiomatic to developers. Within the constraints of reasonable efficiency, we trade some of that speed, scale and resilience for a more idiomatic interface to present to developers. What one can’t easily do is go in the reverse direction. Starting with something that is highly idiomatic and then trying to figure out how to make it fast, scalable and resilient is a daunting task that might not even be possible.

That thermodynamic principle is why data APIs are the new database revolution, and why Cassandra is the database that will power this revolution.

Datastax Blog