Global Big Data Conference

MongoDB: The Frankenstein Monster of NoSQL Databases

Posted on: Mar 15, 2016

NoSQL databases are on a collision course with decades-old relational technology. In Gartner’s 2015 Magic Quadrant for operational databases, no fewer than four NoSQL databases entered the leader quadrant (36% of the leaders).

Stated differently, 8 years ago no one knew what NoSQL was, and now, more than a third of all leaders in the operational database category are NoSQL databases!

Yet, choosing a NoSQL database is daunting. There are dozens of NoSQL databases, each one with its own set of features, its own strengths and weaknesses, all battling to be the next Oracle.

Increasingly, I have begun to believe that one of these databases will not win the war, due to a series of severe product missteps. That database is none other than MongoDB, currently the most popular NoSQL database (at least, as judged by DB-Engines).

Just three months ago, I exposed the fact that MongoDB packaged up a competing database as a “BI connector” (one with horrendous performance, since it pulled all queried data out of MongoDB to perform any analytics).

MongoDB quietly acknowledged this fact deep in developer documentation, but continues to omit the details from its public marketing.

Yet, there are other, much more ancient reasons for questioning MongoDB’s product decisions. Reasons I’ve been forced to examine as part of my day job at SlamData.

I helped develop a MongoDB connector for the Quasar open source project (one of the many open source projects that SlamData depends on). The connector translates SQL2 queries into efficient MongoDB queries that run 100% in-database. This work has given me intimate knowledge of dark skeletons buried in MongoDB’s closet.

What I’ve learned is every bit as gruesome as the title of this post suggests.

Much like Mary Shelley’s Frankenstein monster, MongoDB’s data access layer is sewn together from ragged pieces that don’t fit together. Pieces that were never designed to fit together.

The result, depending on your point of view, is either an Enterprise-grade NoSQL database destined to supplant Oracle, or an unholy abomination of nature, deserving of an angry mob bearing torches and pitchforks.

Let me dissect this creature so you can decide for yourself.

MongoDB’s Data Access Layer

Evolution never re-engineers something when it can just hack new slop on top. As a result, our species has a hindbrain (vital functions such as breathing), a midbrain (our base capabilities), and a forebrain (higher thinking).

In exactly the same way, MongoDB does not have a single, unified data access layer. It actually has three completely different layers, each introduced at a different time, and each piling new capabilities onto the core database.

As we shall see, these layers bear no evidence of being the product of a principled, unified, and well-engineered design. Instead, they address ad hoc requirements in totally ad hoc ways; they are tangled together; and they overlap in functionality.

They are, in analogy, the ragged dismembered pieces of various bodies that are sewn together in haphazard fashion to create something that never should have been created.

Let’s take a quick look at each of these layers in turn.

Query

The earliest “brain” in MongoDB is the query framework. A primitive version of this framework existed in the earliest known commit on MongoDB (June 8, 2008).

The query framework, as it exists in modern day MongoDB, supports the following functionality:

Filtering. The rough equivalent of the WHERE clause in SQL.

Paging. The rough equivalent of LIMIT and OFFSET in SQL.

Sorting. The rough equivalent of ORDER BY in SQL.
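
For readers more comfortable with SQL, the three capabilities can be sketched in plain Python over an in-memory list of documents. This is only an illustration of the semantics, not MongoDB's implementation; the `people` data and the `find` helper are hypothetical. With a real driver such as PyMongo, the rough equivalent would be a `find(...)` call chained with `sort`, `skip`, and `limit`.

```python
# Hypothetical in-memory "collection"; the field names are made up for illustration.
people = [
    {"name": "Ada", "age": 36},
    {"name": "Bob", "age": 25},
    {"name": "Cy", "age": 41},
    {"name": "Di", "age": 52},
]


def find(docs, predicate, sort_key, skip=0, limit=None):
    """Filter (WHERE), sort (ORDER BY), then page (OFFSET / LIMIT)."""
    matched = [d for d in docs if predicate(d)]
    ordered = sorted(matched, key=lambda d: d[sort_key])
    end = None if limit is None else skip + limit
    return ordered[skip:end]


# SELECT * FROM people WHERE age > 30 ORDER BY age LIMIT 2 OFFSET 1
page = find(people, lambda d: d["age"] > 30, sort_key="age", skip=1, limit=2)
print([d["name"] for d in page])  # ['Cy', 'Di']
```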

The query framework’s language is basically a language for building boolean expressions, with a few tacked-on knobs for the other stuff. The building blocks of that language are so limited that most types of filters on JSON data cannot be expressed in it. For example, it is not possible to filter for documents containing an array all of whose elements match some predicate, or for documents containing some keys but not others, and so on.
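
To make the two examples concrete, here is what those filters mean when written as ordinary Python predicates over documents represented as dicts. The sample documents and field names are invented for illustration; the sketch only shows that a general-purpose expression language states each filter in a single line.

```python
# Hypothetical documents; "scores", "email" and "phone" are made-up fields.
docs = [
    {"_id": 1, "scores": [7, 9, 8], "email": "a@example.com"},
    {"_id": 2, "scores": [7, 2, 8], "email": "b@example.com", "phone": "555-0100"},
    {"_id": 3, "scores": [9, 9], "phone": "555-0101"},
]

# Documents whose "scores" array consists entirely of elements >= 5.
all_high = [d["_id"] for d in docs if all(s >= 5 for s in d["scores"])]

# Documents that contain the key "email" but not the key "phone".
email_only = [d["_id"] for d in docs if "email" in d and "phone" not in d]

print(all_high)    # [1, 3]
print(email_only)  # [1]
```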

MongoDB seems to have recognized the extreme limitations of the query framework operators, since it added a $where operator to the language that allows one to execute JavaScript (JavaScript, for all its many warts, at least allows building boolean expressions based on arbitrary criteria).

MapReduce

Early users of MongoDB needed a way to circumvent the limitations of the query framework. Back then, MapReduce was all the rage in the emerging world of big data (for all the wrong reasons), so MongoDB got its own map/reduce.

Except MongoDB’s map/reduce used JavaScript, didn’t scale particularly well, and could grind a production cluster to a halt in ways that require a hard reset (it still can, a fact I'm happy to demonstrate on your production cluster!).
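
For readers unfamiliar with the model, the map/reduce idea itself can be sketched in a few lines of plain Python: a map function emits key/value pairs for each document, the pairs are grouped by key, and a reduce function folds each group into a single value. This is a generic in-memory sketch with hypothetical data and helper names, not MongoDB's JavaScript-based implementation.

```python
from collections import defaultdict

# Hypothetical order documents.
orders = [
    {"customer": "acme", "total": 20},
    {"customer": "zeta", "total": 5},
    {"customer": "acme", "total": 15},
]


def map_order(doc):
    """Emit (key, value) pairs for one document."""
    yield doc["customer"], doc["total"]


def reduce_totals(key, values):
    """Fold all values emitted for one key into a single result."""
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    """Run the map step, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


totals = map_reduce(orders, map_order, reduce_totals)
print(totals)  # {'acme': 35, 'zeta': 5}
```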

For better or worse, map/reduce is still around, and it’s the only way of solving most analytics problems. However, in all my years of watching the MongoDB User Forum, I don’t think I’ve ever seen anyone bring up map/reduce without a MongoDB employee telling them, “Don’t use that!”

Indeed, it would not be necessary to use map/reduce if MongoDB had a general-purpose analytics framework.

That, in theory, anyway, was the point of the aggregation framework.

Aggregate

To keep users out of map/reduce land in more cases, MongoDB introduced the aggregation framework, cobbled together specifically for the analytics and data processing use cases that map/reduce was being hijacked for.

The aggregation framework breaks data processing into a linear pipeline — a rather poor approximation of a Directed Acyclic Graph (DAG), which is actually the correct abstraction to use for representing an analytic workflow.
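
The difference can be sketched in plain Python. A linear pipeline is a list of stages, each consuming only the previous stage's output; a DAG lets one source feed several independent steps whose results are later combined. The stage functions and data below are hypothetical and purely illustrative.

```python
from functools import reduce

# Hypothetical event documents.
events = [
    {"user": "ann", "ms": 120},
    {"user": "bob", "ms": 300},
    {"user": "ann", "ms": 90},
]


def run_pipeline(docs, stages):
    """Linear pipeline: each stage sees only the output of the stage before it."""
    return reduce(lambda acc, stage: stage(acc), stages, docs)


slow = run_pipeline(events, [
    lambda docs: [d for d in docs if d["ms"] >= 100],   # filter stage
    lambda docs: sorted(docs, key=lambda d: d["ms"]),   # sort stage
    lambda docs: [d["user"] for d in docs],             # project stage
])

# DAG: the same source feeds independent branches that are joined at the end.
count = len(events)                                          # branch A
total_ms = sum(d["ms"] for d in events)                      # branch B
summary = {"slow_users": slow, "mean_ms": total_ms / count}  # join

print(summary)  # {'slow_users': ['ann', 'bob'], 'mean_ms': 170.0}
```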

The aggregation framework has its own means of projecting data, which is confusingly similar to, yet distinct from, the way the query framework works (for example, in the query framework, $foo.0 will project the first element of an array field called foo, while it will do no such thing in the aggregation framework!).

Astoundingly, the aggregation framework has two ways of building boolean expressions!

Not only can you build boolean expressions through the expression operators of the aggregation framework, but the aggregation framework embeds the query framework’s language inside the $match stage! Well, not exactly, it’s actually a crippled version of the query framework’s language, one without the ability to use $where expressions.
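
The duplication is easiest to see side by side. The two Python dicts below express the same condition, a quantity greater than 10, once in the query-language shape used by $match and once in the aggregation expression shape. The toy evaluators are hypothetical helpers written for this sketch and handle only `$gt`; they say nothing about how MongoDB evaluates either form. The `qty` field is likewise an assumption.

```python
# Query-language shape: the field name is the key, the operator is nested under it.
query_form = {"qty": {"$gt": 10}}

# Aggregation-expression shape: the operator is the key, the operands are a list,
# and fields are referenced with a "$" prefix.
expr_form = {"$gt": ["$qty", 10]}


def eval_query(query, doc):
    """Toy evaluator for the query shape (supports only $gt)."""
    return all(doc[field] > cond["$gt"] for field, cond in query.items())


def eval_expr(expr, doc):
    """Toy evaluator for the expression shape (supports only $gt)."""
    def resolve(operand):
        is_field = isinstance(operand, str) and operand.startswith("$")
        return doc[operand[1:]] if is_field else operand

    left, right = (resolve(o) for o in expr["$gt"])
    return left > right


doc = {"qty": 12}
print(eval_query(query_form, doc), eval_expr(expr_form, doc))  # True True
```

Two shapes, two sets of rules for referring to fields, one meaning.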

From the perspective of analytics use cases, the aggregation framework fares better than the query framework, but most analytical use cases are well beyond the reach of the aggregation framework.

Three Wrongs Don't Make a Right

After this whirlwind tour of MongoDB's data access layer, you probably have some legitimate questions.

Why does MongoDB have three query mechanisms? Why did they feel the need to continuously hack new query mechanisms on top of the database?

In my view, it’s because every query language has been a failure, unable to accommodate the diverse needs of JSON-based data access.

And I know exactly why that is!

Composability: The Essence of Principled Engineering

The physical universe is inherently composable. Subatomic particles form atoms. Atoms form molecules. Molecules form compounds. And so on.

Everything is built from a tiny number of critical building blocks. Because these building blocks have the right “shape”, they combine in a myriad ways that give rise to all the diversity in the entire universe.

Principled, well-engineered systems are also composable. Instead of solving very specific problems, they provide a set of building blocks that you can combine in an infinite variety of ways to solve a large class of problems.

More succinctly, ad hoc systems solve N predetermined problems, but principled systems give you N building blocks you can use to solve any problem.
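
As a rough illustration of what building blocks with the right "shape" could look like for predicates, here is a minimal sketch in Python. The four combinators (`field`, `every`, `both`, `negate`) are hypothetical names chosen for this example; each one is tiny, yet they compose into filters that nobody had to anticipate in advance.

```python
def field(name, pred):
    """Apply pred to doc[name]; a missing field never matches."""
    return lambda doc: name in doc and pred(doc[name])


def every(pred):
    """Lift a predicate on one element to a predicate on a whole array."""
    return lambda items: all(pred(x) for x in items)


def both(p, q):
    """Logical AND of two predicates."""
    return lambda value: p(value) and q(value)


def negate(p):
    """Logical NOT of a predicate."""
    return lambda value: not p(value)


def is_text(x):
    """True for non-empty strings."""
    return isinstance(x, str) and len(x) > 0


# "tags is an array of non-empty strings, and the document is not archived"
wanted = both(
    field("tags", every(is_text)),
    negate(field("archived", lambda flag: flag is True)),
)

docs = [
    {"tags": ["a", "b"]},
    {"tags": ["a", ""], "archived": False},
    {"tags": ["c"], "archived": True},
]
print([wanted(d) for d in docs])  # [True, False, False]
```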

My contention is that MongoDB’s data access layers are inherently non-composable (except map/reduce, which is composable but functionally almost useless).

Instead of providing developers with a core, well-engineered set of properly shaped building blocks that can combine together to solve all problems, MongoDB provides a set of weirdly shaped, poorly connecting blocks that solve whatever ad hoc user problems prompted their development.

MongoDB is not designed, so much as undesigned.

The Smell of the Undead

I realize my opinion is highly contentious, but I believe I can back it up.

Below, I present a laundry list of complaints about MongoDB’s various query mechanisms. I believe they build a compelling case that MongoDB is, without question, the Frankenstein monster of NoSQL databases:

The query framework can only express simple filters on arbitrary JSON data. Acknowledging this fact, MongoDB added $where, which allows running arbitrary JavaScript (but is too slow for most uses). Instead, a well-engineered, composable query framework would allow forming arbitrary predicates on arbitrary data (much like JavaScript or SQL2, for example).

A crippled version of the query framework is eerily grafted into the aggregation framework. To use the aggregation framework, you have to learn two different languages for building expressions: the query framework language, and the aggregation framework language. They are overlapping but inconsistent. It is almost as if MongoDB didn’t realize that a composable expression-oriented language can form boolean values all by itself, without the need for a separate “boolean expression” query language!

For over two years now, I’ve watched MongoDB query questions on StackOverflow and subscribed to mongodb-user. Many of the questions on StackOverflow are about how to query oddly-shaped data for some specific use case. MongoDB’s stock answer to all these questions is, “Don’t do that!” It’s as if MongoDB cannot conceive of a composable query language that can handle arbitrary queries over JSON data, so instead, MongoDB tells people to change their data model. Well, in my opinion, if the answer to every query question is, “Change your data model to fit our limitations!” then your database is broken by design.

Despite the problems with MongoDB’s map/reduce, the framework is relied upon in production by numerous companies, because there’s no other way to do the analytics they want. If MongoDB actually invested in map/reduce, to improve performance and stability, then perhaps that would become the default access layer for analytics. But to date, their response has been, “Bad user! Don’t use map/reduce! Don’t ask that question!”

Indeed, writing this list now, I can just imagine MongoDB’s responses (or the responses of many MongoDB fans):

“Don’t store your data like that.”

“Pull back the data into the client to query it!”

“MongoDB could not scale if it had a better query language!”

Ad infinitum.

Actually, I think the answer is much simpler. MongoDB’s numerous query languages are crap. A database should have a single query language, and it should be carefully engineered to have the right building blocks to solve arbitrary problems.

The solution is not re-engineering your questions or your data to fit the limitations of the unholy abomination that is MongoDB’s data access layer — instead, it’s re-engineering MongoDB so it can meet the needs of its users!

Imagine that, a NoSQL database that actually allows you to query arbitrary NoSQL data!

MongoDB’s Last Hope

Despite the monstrous issues with MongoDB’s data access layer, there is, in my opinion, a small window of opportunity to turn things around.

If I were driving product vision at MongoDB, here’s what I would do:

1. Take the high-level architecture of the aggregation framework, and re-engineer its interface from the ground up. The goal should be a JSON-based query language that will not alienate existing users, but which is engineered for composability from day one. Completely ditch the existing query framework, aggregation framework and map/reduce framework, and ship the new query interface in MongoDB 4.0, with driver-based emulators for the old query and aggregation frameworks. Above all, stop telling MongoDB users they’re storing the wrong data and asking the wrong questions, and deal with the damn problem.

2. Immediately kill distractions such as the MongoDB BI Connector (AKA the cleverly-disguised PostgreSQL database), MongoDB Compass, and so on, and focus on the damn database. A database company needs to make money off its database (no shit!), and cultivate an ecosystem for ancillary needs. Exactly like it’s done in the world of RDBMS. The crippling mindset that at least two NoSQL vendors have is that they’re too big to fail and can build and sell an entire ecosystem around their databases.

As for those who think (1) cannot be solved, I point to other NoSQL databases that have already solved it (or at least, have come awfully close).

MongoDB’s hypothetical next-generation query language needs to support the right building blocks for arbitrary queries across arbitrary JSON data:

Tearing down structures into more basic components. For example, tearing down an object into keys and values, tearing a string down into an array of characters, and tearing down an array into indexes and elements (the latter is already poorly supported, through $unwind, and MongoDB refuses to address the former, though tickets have been open for years).

Building up structures from basic components. For example, dynamically building an object from keys and values, dynamically building a string from an array of characters, and dynamically building an array from an ordering and elements.

Combining structures. For example, combining two arrays to form another array, combining two objects to form another object, and so on.

Accessing and unnesting arbitrarily nested components of data. Until 3.0, MongoDB had no way to access a single element of an array in the aggregation framework. That’s right, a database designed for storing arrays and other nested data had no way to access nested components of that data (other than map/reduce)!

Converting between values. For example, converting between strings and numbers, numbers and dates, strings and dates, and so on. Semi-structured data is often messy and such transformations are necessary to deal with cases where, for example, a date is stored as a string (for pragmatic or historical reasons).

Generic filtering, sorting, and aggregation on any dimension of nested data. For example, it should be possible to filter an object’s values by its keys, then aggregate the arrays stored inside its values.

Runtime type identification. For a database that encourages storage of heterogeneous structures, MongoDB only recently introduced an operator, $isArray, to determine if a value is an array; other types can be identified only through a little-known hack that relies on MongoDB's total ordering of values. MongoDB doesn't even have type identification on the roadmap!
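
To give a feel for how small such a set of primitives could be, the sketch below expresses several of them as plain Python functions over JSON-like values. The function names are hypothetical and this is not a proposal for actual MongoDB operators; it only shows tearing down, building up, combining, converting, and type identification working together, using the example of filtering an object's values by key and aggregating the arrays inside them.

```python
from datetime import date


def tear_down(obj):
    """Object -> list of (key, value) pairs."""
    return list(obj.items())


def build_up(pairs):
    """Iterable of (key, value) pairs -> object."""
    return dict(pairs)


def combine(a, b):
    """Merge two objects, or concatenate two arrays."""
    return {**a, **b} if isinstance(a, dict) else a + b


def type_of(value):
    """Runtime type identification for JSON-like values."""
    names = {dict: "object", list: "array", str: "string",
             bool: "bool", int: "number", float: "number", type(None): "null"}
    return names[type(value)]


def to_date(text):
    """Convert an ISO-8601 date string into a date value."""
    return date.fromisoformat(text)


# Hypothetical document: date-like keys hold arrays, one key holds a string.
readings = {"2016-03-01": [3, 4], "2016-03-02": [10], "note": "calibrated"}

# Filter entries by key, keep only array values, then sum each array.
daily = build_up(
    (to_date(k), sum(v))
    for k, v in tear_down(readings)
    if k.startswith("2016-") and type_of(v) == "array"
)
print(daily)

# Combining two arrays into one.
merged = combine(readings["2016-03-01"], readings["2016-03-02"])
print(merged)  # [3, 4, 10]
```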

With a few weeks of principled design, MongoDB could have a query language that provides the same level of expressiveness for JSON data that SQL provides for relational data. A few primitives with the right shape and semantics are all that’s necessary to enable rich data access on arbitrary JSON data structures!

My guess? It won’t happen. Not because of the engineering team, which certainly has the capability to turn this ship around. Rather, it won’t happen because management at MongoDB is too in love with its Frankenbaby to see it for the monster that it is!

Beyond MongoDB

MongoDB may be the most widely downloaded NoSQL database, but it’s not the only NoSQL database.

For those who prefer their NoSQL databases with a healthy dose of thoughtful, principled engineering, what are the other choices?

The verdict is still out on that. However, I’ve at least looked at two other databases that stand a good chance of claiming higher ground: Couchbase and MarkLogic.

Couchbase supports a data access layer called N1QL. N1QL provides an engineered, well-thought-out solution to the problem of accessing and processing JSON data. N1QL is just about as composable as ANSI SQL, and beats the pants off MongoDB’s query or aggregation frameworks.

MarkLogic, meanwhile, supports XQuery. XQuery is the product of extensive engineering, and despite its unfamiliarity, it’s both extremely powerful and composable.

Other databases to keep an eye on include ArangoDB and OrientDB, although of course there are many more players (Aerospike, Clusterpoint, DocumentDB, and on and on).

I have more work to do before officially endorsing any of these players, but I’m hopeful. Several of the databases I’m looking at now feel well-engineered and principled.

Check back in a few months for my amateur guide to principled NoSQL databases. In the meantime, I recommend grabbing your pitchforks and torches and sending MongoDB’s Frankenstein monster back to the unholy laboratory where it was first reanimated!

By John De Goes, CTO at SlamData Inc.
