In this article we are going to compare from a technical point of view the two most used data management models on the market: the relational model, which is implemented in most corporate legacy systems, and the document model, that was recently rediscovered because it is able to overcome some inflexibility of the relational model.
So, let’s see the main features, the different approaches to the models, and a series of useful questions to decide which model to adopt.
The Relational Model
The relational model, built around the concept of relationship, structures the data into models that relate to each other through constraints that are made explicit through foreign keys: for example, the Child Table has a reference to the Master table through its ID. This relationship is expressed in a language, usually a Data Definition Language (DDL).
One of the reasons for the popularity of this model is the fact that several tools for database management are available (also for free, such as MySQL Workbench or DBeaver for database design or schema migration).
For the sake of simplicity, in the examples that will follow, we will assume that the relational model is implemented on a SQL compliant database.
The Document Model
The document model focuses on data, rather than relationships. In this regard, the data is not seen as a tuple (or row) within a table, but as a document with properties, which are not theoretically defined. There is no DDL language that defines a schema and therefore some documents may or may not have these properties.
In this model, the relationships are not represented by foreign keys but by nested data. Since the document is unstructured, it may include another document or an array of documents, thus creating an N: 1 or 1: N relationship between the two entities. Therefore, nesting is preferred over an explicit relationship by one or more properties.
In the examples that will follow, we will assume that the document model is implemented with the MongoDB database, which has recently become very popular in the enterprise environment.
Two different approaches to data
In a relational approach, the writing activity is performed by a component that writes data in one place and there is one and only one table that contains that data. If you add a record, a new row will be added to that table. On the contrary, if you want to read that data, the reading query will typically find that information by performing the JOIN: an operation that allows you to query multiple tables, combine their data and return them in filtered form.
On the contrary, a document model pushes the developer to replicate data in different places so that the reading can be performed considering only one data model - therefore without a solution like the JOIN.
Example: a library
Let’s make a simple example, like a library: how do you represent authors and books in the data model? The use case shows the search by book or by the author which is performed by the users of the library through the designated terminals.
In a relational approach, a possible solution could be implemented by using two tables - the Authors table and the Books table - both linked together by a property of the Books table which logically relates the two tables. In this model it is also possible to force all the books to have an author, who must be registered in the database by using the Books table foreign key, that does not allow a null value.
By using a document-oriented approach, instead, it is possible to take advantage of the nesting to describe the relationship. There could three approaches:
- You could create a single Authors collection. Each author would have a property called Books, which contains an array of documents. Each subdocument is, therefore, a book. In this way, once you pick the author, all his books can also be found.
- You could create a Books collection whereby each book has an Author field that contains a document describing the author. In this way, after choosing a book, the author is also found.
- By putting together the two previous points, you could create a collection that contains the books and a collection that contains the authors - as described above. In this way, both types of reading data can be satisfied without finding unnecessary data.
By choosing the document model, it is possible to make some data consistency checks, but not others. For example: by using the third approach, it is possible to survey the books of a non-existent author. Indeed, in the Books collection, you could insert a document whose author is not registered in the Authors collection.
Reading access to data
By using the relational model, the data is always accessible. As a matter of fact, it is possible to access all the information from both tables: starting from Authors and - through JOIN - it is possible to find the data of the books and vice versa.
SELECT * from Authors JOIN Books on Authors.id = Books.author_id where Author.id=??;
SELECT * from Authors JOIN Books on Authors.id = Books.author_id where Book.id=??;
Both readings involve both entities, so the editing to the schema affects the reading queries: adding or removing a property the query can return more or fewer properties. As a matter of fact, if you add a column to the Authors table, both read queries will not be affected.
Using the document model and referring to the three hypotheses mentioned above bring to three possibilities:
- In the first case, the access is simple, if you have to locate an author, but to locate a book you have to select the author and all his books will necessarily return;
- In the second case, the access is simple if you have to locate a single book, but in order to locate all the books of an author, you have to look for the author in the Books collection as a grafted property. In this way, however, you would have a repetition of the author information for each book;
- In the third model, however, by choosing the starting collection, you can easily obtain the necessary data without duplication of information.
In the document model, therefore, it is very important to choose the most suitable data model for the required reading activities.
Writing access to data
The relational model allows you to write the data in one place. In this way, updating or removing a row can be done by a single command without the use of JOIN: automatically, all new reading queries will inherit the change that was made to the data.
Based on the approach, the document model can provide for a greater complexity of writing implementation:
- In the first approach, updating the author or a book can be executed by a single request to the database;
- In the second case, updating a book can be performed with a single query; even the updating of an author can be performed with a single query, but this must consider different documents;
- Finally, in the third case, which involves both collections, updating the book or author must be performed with two different queries (one per collection) to ensure data consistency.
Therefore, the document model might not respect the atomicity and consistency requirements of the transactions: the writing must be performed in two sequential actions, thus violating the atomicity principle. There is also a time, albeit short, in which a collection contains some documents that have not yet been updated.
Which should you choose?
As always, there is no silver bullet: it depends on the type of project and on the critical issues of the application.
However, we can highlight some questions that can be useful in choosing which approach to use.
Do you foresee more readings or more writings?
If you think of a book management application, such as a library, it is more likely that the readings will be executed more frequently than the writings, since the catalog consultation is the main feature. As a matter of fact, the requests for writing would be limited to the booking and/or editing of data, less frequent compared to consulting the catalog.
In these types of contexts, writing and reading operations have different weights.
In a relational database, which includes a large number of tables or records, the reading queries might be less performing, if they involve many tables. Databases with more than a thousand tables are difficult to consult since the number of crossings on these tables is large. Similarly, relational databases with a very high number of records (even billions of rows per table) and with a high number of columns can be affected by the same performance problem. In this case, a document model may be the most appropriate choice.
Otherwise, if your project involves more frequent writings, it may be more useful to adopt a relational model which, as we have seen, does not provide copies of data.
Is the synchronization important?
In a SQL relational database, being ACID, the problem does not exist: if the data is updated, each query executed after the update reads the new data.
In a document database, on the contrary, if the data is replicated in different collections, its updating must be propagated in all its copies: it is necessary to keep track of where each copy of the data is made.
In this regard, there are architectural solutions that decouple the writing on the main collection from the propagation on the copies through a queue system. For example, by using a queue system - such as Kafka - it is possible to update the various copies asynchronously, without having a central census point of all the copies: this facilitates the management of the copies' writings.
In any case, a delay during the data updating must be endured.
Do you foresee many evolutions in the database schema?
During the development of a project, it may happen that you need to change the data schema to add, remove or edit properties. Based on the diversity of the data present and on the type of modification, this operation can be very complex.
In fact, the schema changes can impact all reading and writing queries. This means that for each change, a review of all queries - involving the modified tables - must be performed.
Suppose that you want to evolve the library database schema by adding further information on the author, such as literary prizes, won or participation in competitions.
In a relational model, you can add columns to the Authors table or make one or more tables related to the author.
In the first case, in order to take into account all the possible cases, it is necessary to add a series of columns, one for each information. This implementation allows, for each author, the census of only one data per property thus limiting the capabilities of the model.
In the second case, it is necessary to add a table for each case, with the related foreign keys. For example, you can create an Awards table, a Competitions table and two bridging tables between Authors and the other two tables. For any other information that you want to add, it would be necessary to create two tables. The limitation of this implementation is the maintainability of a large number of tables.
Using a document model instead, one possibility could be to add a field to the Authors collection document, inserting an array of documents: this way you can keep track of all the information requested.
In addition, by attributing a logical meaning to the null value, you can distinguish who has not received awards from those for whom the information is unknown: if the field is an empty array, it means that the author has no awards; otherwise, if the field is null, it means that the information cannot be found.
One doesn’t exclude the other one: primary/secondary approach.
In complex architectures with complex business models, the existing applications are often based on relational databases, whose configurations are based on business requirements that have followed over time - new products, changes in regulations etc.
This evolution has necessarily transformed the underneath data models, to accommodate the new volume of information and organize it within multiple tables. The data models have been able to effectively respond to business requests by increasing their structure as the volume of information to be organized increases.
In an increasingly digitized world, companies are trying to reach users with as many channels as possible and to offer them updated and available information 24/7. The proliferation of data access points and the new data availability needs cannot be satisfied by those same systems, which are not predisposed to withstand such a load of requests. For example, the reading query that was previously performed once a day is now performed continuously, every time the user accesses the app or site. This leads to performance problems,, since the different required properties are allocated on hundreds of different tables.
This problem is even more evident when the touchpoint requires data from different databases. Although some database engines allow you to execute a JOIN between different databases, located on different hosts, this approach is not recommended for three reasons:
- Performance: as we have seen, JOIN can lead to inefficiencies;
- Very difficult evolutions: the database schema is shared with other applications that the development team may not have access to;
- Abstraction of the model: the database is an implementation detail of the representation of the business model that must not be shared outside the development team.
To solve this problem, Mia-Platform has developed an architectural solution called Fast Data which involves the creation of a document database that collects information asynchronously and organizes it to meet the needs of all touchpoints.
In this way,you can experience the benefits of both models: the applications write the data on the relational systems only once and the touchpoints interrogate the document database to access the appropriately organized data safely and quickly, without weighing down the systems underneath.
There is no silver bullet: both models have their strengths and weaknesses. We have seen that, in some cases, it is better to adopt a relational model, while in other contexts the document model best applies. Therefore, there is no better model than another, but only a model that, in certain circumstances, fits better according to the type of operation to be carried out more frequently.
If the number of readings far exceeds the number of other operations, the use of a document model should be taken into consideration, especially if the readings involve many tables or many data. On the contrary, in order to avoid the duplication of data at each update, a relational model is recommended when there are many writings on the database.
The article was written by Tommaso Allevi, R&D Leader of Mia-Platform.