Document-oriented database

A document-oriented database, or document store, is a computer program designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" has grown^[1] with the use of the term NoSQL itself. XML databases are a subclass of document-oriented databases that are optimized to work with XML documents. Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal.

Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept. The difference lies in the way the data is processed; in a key-value store the data is considered to be inherently opaque to the database, whereas a document-oriented system relies on internal structure in the document in order to extract metadata that the database engine uses for further optimization. Although the difference is often moot due to tools in the systems,^{[lower-alpha 1]} conceptually the document-store is designed to offer a richer experience with modern programming techniques.

Document databases^{[lower-alpha 2]} contrast strongly with the traditional relational database (RDB). Relational databases generally store data in separate tables that are defined by the programmer, and a single object may be spread across several tables. Document databases store all information for a given object in a single instance in the database, and every stored object can be different from every other. This makes mapping objects into the database a simple task, normally eliminating anything similar to an object-relational mapping. This makes document stores attractive for programming web applications, which are subject to continual change in place, and where speed of deployment is an important issue.

Documents

The central concept of a document-oriented database is the notion of a document. While each document-oriented database implementation differs on the details of this definition, in general, they all assume documents encapsulate and encode data (or information) in some standard formats or encodings. Encodings in use include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on).

Documents in a document store are roughly equivalent to the programming concept of an object. They are not required to adhere to a standard schema, nor will they have all the same sections, slots, parts, or keys. Generally, programs using objects have many different types of objects, and those objects often have many optional fields. Every object, even those of the same class, can look very different. Document stores are similar in that they allow different types of documents in a single store, allow the fields within them to be optional, and often allow them to be encoded using different encoding systems. For example, the following is a document, encoded in JSON:

{
    "FirstName": "Bob",
    "Address": "5 Oak St.",
    "Hobby": "sailing"
}

A second document might be encoded in XML as:

  <contact>
    <firstname>Bob</firstname>
    <lastname>Smith</lastname>
    <phone type="Cell">(123) 555-0178</phone>
    <phone type="Work">(890) 555-0133</phone>
    <address>
      <type>Home</type>
      <street1>123 Back St.</street1>
      <city>Boys</city>
      <state>AR</state>
      <zip>32225</zip>
      <country>US</country>
    </address>
  </contact>

These two documents share some structural elements with one another, but each also has unique elements. The structure, text, and other data inside the document are usually referred to as the document's content and may be referenced via retrieval or editing methods (see below). Unlike a relational database, where every record contains the same fields and unused fields are left empty, there are no empty 'fields' in either document (record) in the above example. This approach allows new information to be added to some records without requiring that every other record in the database share the same structure.
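The comparison between the two documents can be sketched in Python. This is only an illustration: the XML document is shortened, the variable names are hypothetical, and lower-casing the JSON keys so that they line up with the XML tag names is an assumption made for the comparison, not something a document store is required to do.

```python
import json
import xml.etree.ElementTree as ET

# The JSON document from the text, and a shortened copy of the XML document.
json_doc = '{"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"}'
xml_doc = """<contact>
  <firstname>Bob</firstname>
  <lastname>Smith</lastname>
  <phone type="Cell">(123) 555-0178</phone>
</contact>"""

first = json.loads(json_doc)
# Map each child element's tag to its text content.
second = {child.tag: child.text for child in ET.fromstring(xml_doc)}

# Assumption: compare field names case-insensitively.
first_keys = {key.lower() for key in first}
second_keys = set(second)

shared = first_keys & second_keys        # fields present in both documents
only_first = first_keys - second_keys    # fields unique to the JSON document
only_second = second_keys - first_keys   # fields unique to the XML document
```

Neither document carries a placeholder for the fields it lacks; each simply omits them.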

Document databases typically provide for additional metadata to be associated with and stored along with the document content. That metadata may be related to facilities the datastore provides for organizing documents, providing security, or other implementation specific features.

CRUD operations

The core operations that a document-oriented database supports on documents are similar to those of other databases. While the terminology is not perfectly standardized, most practitioners will recognize them as CRUD:

Creation (or insertion)
Retrieval (or query, search, finds)
Update (or edit)
Deletion (or removal)

Keys

Documents are addressed in the database via a unique key that represents that document. This key is a simple identifier (or ID), typically a string, a URI, or a path. The key can be used to retrieve the document from the database. Typically the database retains an index on the key to speed up document retrieval, and in some cases the key is required to create or insert the document into the database.
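As a rough illustration of the four CRUD operations addressed by key, the following sketch keeps documents in an in-memory dictionary. The class name `DocumentStore` and its method names are hypothetical and do not correspond to the API of any particular product; a real store would add persistence, indexing, and concurrency control.

```python
import uuid


class DocumentStore:
    """Toy in-memory store: maps a unique string key to one document (a dict)."""

    def __init__(self):
        self._docs = {}

    def create(self, doc, key=None):
        # Some stores require the caller to supply the key; here it is optional
        # and a random UUID string is generated when none is given.
        key = key or str(uuid.uuid4())
        if key in self._docs:
            raise KeyError(f"key already exists: {key}")
        self._docs[key] = dict(doc)
        return key

    def retrieve(self, key):
        # Return a shallow copy so callers cannot modify the stored document.
        return dict(self._docs[key])

    def update(self, key, doc):
        if key not in self._docs:
            raise KeyError(key)
        self._docs[key] = dict(doc)

    def delete(self, key):
        del self._docs[key]
```

A caller would typically keep the key returned by `create` and pass it to the other three operations.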

Retrieval

Another defining characteristic of a document-oriented database is that, beyond the simple key-to-document lookup that can be used to retrieve a document, the database offers an API or query language that allows the user to retrieve documents based on content (or metadata). For example, you may want a query that retrieves all the documents with a certain field set to a certain value. The set of query APIs or query language features available, as well as the expected performance of the queries, varies significantly from one implementation to another. Likewise, the specific set of indexing options and configuration that are available vary greatly by implementation.

It is here that the document store varies most from the key-value store. In theory, the values in a key-value store are opaque to the store; they are essentially black boxes. Key-value stores may offer search systems similar to those of a document store, but they may have less understanding of the organization of the content. Document stores use the metadata in the document to classify the content, allowing them, for instance, to understand that one series of digits is a phone number and another is a postal code. This allows them to search on those types of data, for instance, for all phone numbers containing 555, which would ignore the zip code 55555.
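The phone-number example can be sketched as follows. The sample documents, the field names `phone` and `zip`, and the helper functions are all hypothetical; the point is only to contrast a field-aware query with a search over values treated as opaque text.

```python
import json

docs = {
    "c1": {"name": "Bob", "phone": "(123) 555-0178", "zip": "32225"},
    "c2": {"name": "Ann", "phone": "(890) 444-0133", "zip": "55555"},
    "c3": {"name": "Cy", "zip": "10001"},
}


def find(docs, field, predicate):
    """Field-aware query: test only the named field, skipping documents without it."""
    return [key for key, doc in docs.items()
            if field in doc and predicate(doc[field])]


def opaque_search(docs, text):
    """Opaque search: each value is just a serialized blob of characters."""
    blobs = {key: json.dumps(doc) for key, doc in docs.items()}
    return [key for key, blob in blobs.items() if text in blob]


phone_matches = find(docs, "phone", lambda value: "555" in value)
blob_matches = opaque_search(docs, "555")
```

Here `phone_matches` contains only `c1`, while `blob_matches` also picks up `c2`, whose zip code happens to contain the same digits.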

Editing

Document databases typically provide some mechanism for updating or editing the content (or other metadata) of a document, either by allowing for replacement of the entire document, or individual structural pieces of the document.
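The two styles of editing can be sketched in Python. The function names are hypothetical, and addressing a nested field by a dotted path such as `address.city` is an assumption chosen for the example; implementations differ in how they name parts of a document.

```python
import copy


def replace_document(store, key, new_doc):
    """Whole-document replacement: the old content is discarded."""
    store[key] = copy.deepcopy(new_doc)


def update_fields(store, key, changes):
    """Partial update: set individual fields, addressed by dotted path."""
    doc = store[key]
    for path, value in changes.items():
        *parents, leaf = path.split(".")
        target = doc
        for name in parents:
            # Create intermediate sub-documents on demand.
            target = target.setdefault(name, {})
        target[leaf] = value


store = {"bob": {"name": "Bob", "address": {"city": "Boys", "state": "AR"}}}
update_fields(store, "bob", {"address.city": "Little Rock", "hobby": "sailing"})
```

After the call, only the city and the new hobby field have changed; the name and state are untouched, which is what distinguishes a partial update from a replacement.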

Organization

Document database implementations offer a variety of ways of organizing documents, including notions of

Collections: groups of documents, where depending on implementation, a document may be enforced to live inside one collection, or may be allowed to live in multiple collections
Tags and non-visible metadata: additional data outside the document content
Directory hierarchies: groups of documents organized in a tree-like structure, typically based on path or URI

These organizational notions vary in how far they are logical rather than physical (e.g. on-disk or in-memory) representations.
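A minimal sketch of the three organizational notions, kept as purely logical structures over a single dictionary of documents. The class `OrganizedStore`, its methods, and the sample paths are hypothetical; this version lets a document belong to several collections, which only some implementations permit.

```python
from collections import defaultdict


class OrganizedStore:
    def __init__(self):
        self.docs = {}                       # path -> document content
        self.collections = defaultdict(set)  # collection name -> paths
        self.tags = defaultdict(set)         # tag -> paths (metadata outside the content)

    def put(self, path, content, collections=(), tags=()):
        self.docs[path] = content
        for name in collections:
            self.collections[name].add(path)
        for tag in tags:
            self.tags[tag].add(path)

    def in_collection(self, name):
        return sorted(self.collections.get(name, ()))

    def with_tag(self, tag):
        return sorted(self.tags.get(tag, ()))

    def under(self, prefix):
        # Directory-style lookup: every document whose path lies below the prefix.
        prefix = prefix.rstrip("/") + "/"
        return sorted(path for path in self.docs if path.startswith(prefix))
```

Because collections and tags are stored beside the documents rather than inside them, regrouping a document does not change its content.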

Relationship to other databases

Relationship to key-value stores

A document-oriented database is a specialized key-value store, which itself is another NoSQL database category. In a simple key-value store, the document content is opaque. A document-oriented database provides APIs or a query/update language that exposes the ability to query or update based on the internal structure of the document. This difference may be moot for users who do not need the richer query, retrieval, or editing APIs that document databases typically provide. Modern key-value stores often include features for working with metadata, blurring the line between the two kinds of store.

Relationship to search engines

Some search engine (information retrieval) systems, such as Elasticsearch, provide enough of the core operations on documents to fit the definition of a document-oriented database.

Relationship to relational databases

In a relational database, data is first categorized into a number of predefined types, and tables are created to hold individual entries, or records, of each type. The tables define the data within each record's fields, meaning that every record in the table has the same overall form. The administrator also defines the relations between the tables, and selects certain fields that they believe will be most commonly used for searching and defines indexes on them. A key concept in the relational design is that any data that may be repeated is normally placed in its own table, and if these instances are related to each other, a column is selected to group them together, the foreign key. This design is known as database normalization.^[2]

For example, an address book application will generally need to store the contact name, an optional image, one or more phone numbers, one or more mailing addresses, and one or more email addresses. In a canonical relational database solution, tables would be created for each of these items with predefined fields for each bit of data: the CONTACT table might include FIRST_NAME, LAST_NAME and IMAGE columns, while the PHONE_NUMBER table might include COUNTRY_CODE, AREA_CODE, PHONE_NUMBER and TYPE (home, work, etc.). The PHONE_NUMBER table also contains a foreign key column, "CONTACT_ID", which holds the unique ID number assigned to the contact when it was created. In order to recreate the original contact, the database engine uses the foreign keys to look for the related items across the group of tables and reconstruct the original data.
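The relational layout described above can be sketched with Python's built-in `sqlite3` module. The table and column names follow the text, but the IMAGE and COUNTRY_CODE columns are left out for brevity, and the sample phone numbers are borrowed from the XML example earlier in the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (
        contact_id INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT
    );
    CREATE TABLE phone_number (
        contact_id   INTEGER REFERENCES contact(contact_id),  -- foreign key
        area_code    TEXT,
        phone_number TEXT,
        type         TEXT
    );
""")

# One contact is spread over two tables: one row here, two rows below.
cur = conn.execute(
    "INSERT INTO contact (first_name, last_name) VALUES (?, ?)", ("Bob", "Smith")
)
contact_id = cur.lastrowid
conn.executemany(
    "INSERT INTO phone_number VALUES (?, ?, ?, ?)",
    [(contact_id, "123", "555-0178", "Cell"),
     (contact_id, "890", "555-0133", "Work")],
)

# Reassembling the contact requires following the foreign key with a join.
rows = conn.execute("""
    SELECT c.first_name, c.last_name, p.type, p.area_code, p.phone_number
    FROM contact AS c
    JOIN phone_number AS p ON p.contact_id = c.contact_id
    WHERE c.contact_id = ?
    ORDER BY p.type
""", (contact_id,)).fetchall()
```

The join returns one row per phone number, with the contact's name repeated in each; the application then has to fold these rows back into a single contact.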

In contrast, in a document-oriented database there may be no internal structure that maps directly onto the concept of a table, and the fields and relations generally don't exist as predefined concepts. Instead, all of the data for an object is placed in a single document, and stored in the database as a single entry. In the address book example, the document would contain the contact's name, image, and any contact info, all in a single record. That entry is accessed through its key, which allows the database to retrieve and return the document to the application. No additional work is needed to retrieve the related data; all of this is returned in a single object.
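For comparison, the same contact can be held as a single document. This sketch uses a plain dictionary as a stand-in for the database and JSON as the encoding; the key `contact:1`, the field names, and the email address are hypothetical.

```python
import json

contact = {
    "first_name": "Bob",
    "last_name": "Smith",
    "phones": [
        {"type": "Cell", "number": "(123) 555-0178"},
        {"type": "Work", "number": "(890) 555-0133"},
    ],
    "emails": ["bob@example.com"],
}

store = {}                                   # stand-in for the database
store["contact:1"] = json.dumps(contact)     # one entry holds the whole object

# A single lookup by key returns the name and all contact info together.
loaded = json.loads(store["contact:1"])
```

No join is involved: the phone numbers and email addresses travel with the contact because they are part of the same entry.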

A key difference between the document-oriented and relational models is that the data formats are not predefined in the document case. In most cases, any sort of document can be stored in any database, and those documents can change in type and form at any time. If one wishes to add a COUNTRY_FLAG to a CONTACT, this field can be added to new documents as they are inserted; this will have no effect on the database or on the existing documents already stored. To aid retrieval of information from the database, document-oriented systems generally allow the administrator to provide hints to the database to look for certain types of information. These work in a similar fashion to indexes in the relational case. Most also offer the ability to add additional metadata outside of the content of the document itself, for instance, tagging entries as being part of an address book, which allows the programmer to retrieve related types of information, like "all the address book entries". This provides functionality similar to a table, but separates the concept (categories of data) from its physical implementation (tables).
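The COUNTRY_FLAG case can be sketched as follows. The store is again a plain dictionary, and `build_index` is a hypothetical helper standing in for the kind of hint an administrator might give; real systems build and maintain such indexes internally.

```python
from collections import defaultdict

# An existing document, stored before COUNTRY_FLAG was introduced.
store = {"c1": {"FIRST_NAME": "Bob", "LAST_NAME": "Smith"}}

# A new document simply carries the extra field; nothing else has to change.
store["c2"] = {"FIRST_NAME": "Ann", "LAST_NAME": "Lee", "COUNTRY_FLAG": "US"}


def build_index(store, field):
    """Map each value of `field` to the keys of the documents containing it."""
    index = defaultdict(list)
    for key, doc in store.items():
        if field in doc:
            index[doc[field]].append(key)
    return dict(index)


flag_index = build_index(store, "COUNTRY_FLAG")
```

The older document `c1` is left exactly as it was and is simply absent from `flag_index`, rather than appearing with an empty value.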

In the classic normalized relational model, objects in the database are represented as separate rows of data with no inherent structure beyond that given to them as they are retrieved. This leads to problems when trying to translate programming objects to and from their associated database rows, a problem known as object-relational impedance mismatch.^[3] Document stores more closely, or in some cases directly, map programming objects into the store. This eliminates the impedance mismatch problem, and is offered as one of the main advantages of the NoSQL approach.
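How closely a programming object can map onto a document may be illustrated with Python dataclasses. The `Contact` and `Phone` classes and the two helper functions are hypothetical, and JSON is assumed as the document encoding; this is a sketch of the idea rather than the mapping layer of any particular store.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Phone:
    type: str
    number: str


@dataclass
class Contact:
    first_name: str
    last_name: str
    phones: List[Phone] = field(default_factory=list)


def to_document(contact):
    # asdict() converts nested dataclasses recursively, so the object's shape
    # carries over to the document unchanged.
    return json.dumps(asdict(contact))


def from_document(text):
    data = json.loads(text)
    phones = [Phone(**item) for item in data.pop("phones", [])]
    return Contact(phones=phones, **data)
```

The nested list of phones is stored inside the contact's document, so no separate rows or foreign keys are needed to rebuild the object.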

Implementations

Name	Publisher	License	Languages supported	Notes	RESTful API
BaseX	BaseX Team	BSD License	Java, XQuery	Support for XML, JSON and binary formats; client-/server based architecture; concurrent structural and full-text searches and updates.	Yes
Cloudant	Cloudant, Inc.	Proprietary	Erlang, Java, Scala, and C	Distributed database service based on BigCouch, the company's open source fork of the Apache-backed CouchDB project. Uses JSON model.	Yes
Clusterpoint Database	Clusterpoint Ltd.	Proprietary with free download	JavaScript, SQL, PHP, .NET, Java, Python, Node.js, C, C++,	Distributed document-oriented XML / JSON database platform with ACID-compliant transactions; high-availability data replication and sharding; built-in full text search engine with relevance ranking; JS/SQL query language; GIS; Available as pay-per-use database as a service or as an on-premise free software download.^[4]	Yes
Couchbase Server	Couchbase, Inc.	Apache License	.NET, Java, Python, Node.js, PHP, SQL, GoLang, Spring Framework, LINQ	Distributed NoSQL Document Database, JSON model and SQL based Query Language.	Yes^[5]
CouchDB	Apache Software Foundation	Apache License	Erlang	JSON over REST/HTTP with Multi-Version Concurrency Control and limited ACID properties. Uses map and reduce for views and queries.^[6]	Yes^[7]
CrateIO	CRATE Technology GmbH	Apache License	Java	Use familiar SQL syntax for real time distributed queries across a cluster. Based on Lucene / Elasticsearch ecosystem with built-in support for binary objects (BLOBs).	Yes^[8]
DocumentDB	Microsoft	Proprietary	.NET, Java, Python, Node.js, JavaScript, SQL	Platform-as-a-Service offering, part of the Microsoft Azure platform.	Yes
Elasticsearch	Shay Banon	Apache License	Java	JSON, Search engine.	Yes
eXist	eXist	LGPL	XQuery, Java	XML over REST/HTTP, WebDAV, Lucene Fulltext search, binary data support, validation, versioning, clustering, triggers, URL rewriting, collections, ACLS, XQuery Update	Yes^[9]
HyperDex	hyperdex.org	BSD License	C, C++, Go, Node.js, Python, Ruby	Support for JSON and binary documents.	No
Informix	IBM	Proprietary, with no-cost editions^[10]	Various (Compatible with MongoDB API)	RDBMS with JSON, replication, sharding and ACID compliance.	Yes
Jackrabbit	Apache Foundation	Apache License	Java	Java Content Repository implementation	?
Lotus Notes (IBM Lotus Domino)	IBM	Proprietary	LotusScript, Java, Lotus @Formula	MultiValue	Yes
MarkLogic	MarkLogic Corporation	Free Developer license or Commercial^[11]	REST, Java, JavaScript, Node.js, XQuery, SPARQL, XSLT, C++	Distributed document-oriented database for JSON, XML, and RDF triples. Built-in Full text search, ACID transactions, High availability and Disaster recovery, certified security.	Yes
MongoDB	MongoDB, Inc	GNU AGPL v3.0 for the DBMS, Apache 2 License for the client drivers^[12]	C, C++, C#, Java, Perl, PHP, Python, Node.js, Ruby, Scala ^[13]	Document database with replication and sharding, BSON store (binary format JSON).	Yes^[14]
MUMPS Database	?	Proprietary and Affero GPL^[15]	MUMPS	Commonly used in health applications.	?
ObjectDatabase++	Ekky Software	Proprietary	C++, C#, TScript	Binary Native C++ class structures	?
OrientDB	Orient Technologies	Apache License	Java	JSON over HTTP, SQL support, ACID transactions	Yes
PostgreSQL	PostgreSQL	PostgreSQL Free License	C	HStore, JSON store (9.2+), JSON function (9.3+), HStore2 (9.4+), JSONB (9.4+)	No
Qizx	Qualcomm	Commercial	REST, Java, XQuery, XSLT, C, C++, Python	Distributed document-oriented XML database with integrated full text search; support for JSON, text, and binaries.	Yes
RavenDB	Hibernating Rhinos	GNU Affero General Public License	C#, VB.net, Java	2nd generation document database, JSON format with replication and sharding.	Yes
RethinkDB	?	GNU AGPL for the DBMS, Apache 2 License for the client drivers	C++, Python, JavaScript, Ruby, Java	Distributed document-oriented JSON database with replication and sharding.	No
Rocket U2	Rocket Software	Proprietary	?	UniData, UniVerse	Yes (Beta)
Sedna	sedna.org	Apache License	C++, XQuery	XML database	No
SimpleDB	Amazon	Proprietary online service	Erlang		?
Solr	Apache	Apache License	Java	Search engine	Yes
TokuMX	Tokutek	GNU Affero General Public License	C++, C#, Go	MongoDB with Fractal Tree indexing	?
OpenLink Virtuoso	OpenLink Software	GPLv2[1] and proprietary	C++, C#, Java, SPARQL	Middleware and database engine hybrid	?

XML database implementations

Most XML databases are document-oriented databases.

Notes

↑ To the point that document-oriented and key-value systems can often be interchanged in operation.
↑ And key-value stores in general.

References

↑ DB-Engines Ranking per database model category
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Document-oriented Database. Clusterpoint. Retrieved on 2015-10-08.
↑ Documentation. Couchbase. Retrieved on 2013-09-18.
↑ CouchDB Overview
↑ CouchDB Document API
↑ [1]
↑ eXist-db Open Source Native XML Database. Exist-db.org. Retrieved on 2013-09-18.
↑ http://www.ibm.com/developerworks/data/library/techarticle/dm-0801doe/
↑ http://developer.marklogic.com/licensing
↑ MongoDB Licensing
↑ Additional 30+ community MongoDB supported drivers
↑ MongoDB REST Interfaces
↑ GTM MUMPS FOSS on SourceForge

External links

DB-Engines Ranking of Document Stores by popularity, updated monthly


