Databases are part of everyday life, and for most of your day you are reading from or writing to one without noticing.

When you buy from a shop:

  • The first database to be affected is the shop’s logistics database, where the shop stores the quantity of each product it has in stock.
  • Then, in many countries, such as Greece, every purchase is reported online to the tax authority, where it is again saved in a database.
  • Finally, if you pay by card, your card issuer and the shop’s bank each log the transaction in their own databases.

Then you may park your car, and the car park probably has a database of its free spots. Finally, when you open your online GPS app to set your route, you are requesting data from a database.

One question easily comes to mind: how do databases hold all that data?

The solution came in 1960 from the Sabre Global Distribution System, which was built to store the travel services offered by airlines: make the database distributed.

For a typical database, serving a single shop for example, one good server is enough. Most databases work like that: they run on one machine, and that’s it. If your data grows beyond what one machine can handle, you are stuck.
A distributed database, by contrast, splits the data across many machines, and when it needs to find something it asks all of them. The difference is like one person playing an instrument versus many people playing together, that is, an orchestra. A distributed database is an orchestra of machines.
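To make the orchestra a bit more concrete, here is a minimal Python sketch of the idea: rows are spread across machines by hashing a key, and a search asks every machine and merges the answers. Everything in it (the `Machine` and `DistributedTable` classes, the machine names, the extra user "Maria") is made up for illustration; real distributed databases also deal with networking, replication, and failures, which this sketch ignores.

```python
import hashlib


class Machine:
    """One hypothetical server that keeps its share of the rows in memory."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def find(self, predicate):
        # Search only the rows stored on this machine.
        return [row for row in self.rows if predicate(row)]


class DistributedTable:
    """Splits rows across machines and queries all of them."""

    def __init__(self, machines):
        self.machines = machines

    def _machine_for(self, key):
        # A stable hash of the key decides which machine stores the row.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.machines)
        return self.machines[index]

    def insert(self, key, row):
        self._machine_for(key).rows.append(row)

    def find_everywhere(self, predicate):
        # Ask every machine, then merge the partial answers.
        results = []
        for machine in self.machines:
            results.extend(machine.find(predicate))
        return results


machines = [Machine("m0"), Machine("m1"), Machine("m2")]
table = DistributedTable(machines)
for name, drink in [("Irina", "Tea"), ("Rantouan", "Coffee"), ("Maria", "Tea")]:
    table.insert(name, {"name": name, "drink": drink})

tea_drinkers = sorted(
    row["name"] for row in table.find_everywhere(lambda r: r["drink"] == "Tea")
)
print(tea_drinkers)  # ['Irina', 'Maria']
```

No single machine holds all the rows, yet the caller still gets one complete answer, because the table collects the partial results from every machine.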

Sabre didn’t build a distributed database system; it built one database that happened to be distributed. Even though they distributed their own database, they gave other projects no way to do the same; they created a wheel, not a machine that creates wheels. The system is still in use today and holds much more data, including hotel availability and other travel-related data, but it has been completely rewritten with more modern tools than the database they split across many machines back then. In the end, they bought wheels from the factories that later started producing them.

Until a few decades ago, only very large systems, like the one mentioned above, needed huge databases. The more computers and the internet entered our lives, the more it made sense to build a wheel factory rather than have every big system create its own wheel.

In 1998, the Italian developer Carlo Strozzi developed the Strozzi NoSQL Open Source Relational Database, the first attempt to create a wheel factory that would standardize distributed databases. The database itself wasn’t really distributed, but it solved most of the structural problems of distributed databases. It was as if he had built not a wheel factory but a wheel artisan’s workshop.

The main problem he solved was that in a distributed system data can’t, and shouldn’t, be related the way people were used to in non-distributed databases. In a non-distributed database, you relate one piece of data to another. For example, you create a table “Users” and add a row for the user “Irina,” assigning her the ID 0, and a row for “Rantouan,” assigning him the ID 1. Then, in the other tables, for example a table “favorite_drinks,” you don’t write “Irina” again but her ID, 0. That’s smart, because you save the space and computing power that repeating the same things would cost! But you can’t split this across many machines: one machine will have Rantouan and the other Irina, and a reference stored on one machine may point to data that only the other machine knows about.
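The broken reference can be shown in a few lines of Python. This is only a sketch: the two dictionaries stand in for two machines, and the unlucky placement of the rows (each drink row landing on the machine that does not hold its user) is an assumption chosen to show the problem.

```python
# Two hypothetical machines, each holding part of every table.
machine_a = {
    "Users": {0: "Irina"},
    "favorite_drinks": [{"id": 1, "user_id": 1, "drink": "Coffee"}],
}
machine_b = {
    "Users": {1: "Rantouan"},
    "favorite_drinks": [{"id": 0, "user_id": 0, "drink": "Tea"}],
}


def local_join(machine):
    """Resolve each drink row's user_id using only this machine's Users rows."""
    return [
        (machine["Users"].get(row["user_id"]), row["drink"])
        for row in machine["favorite_drinks"]
    ]


print(local_join(machine_a))  # [(None, 'Coffee')]
print(local_join(machine_b))  # [(None, 'Tea')]
```

Each machine knows that someone likes coffee or tea, but not who: the ID it stores points at a row that lives on the other machine.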

Carlo’s solution was mostly to repeat yourself… along with some other ideas… and he named it NoSQL, in contrast to the SQL that non-distributed databases favored.

An SQL database:

Users:
| Id | Name |
| --- | --- |
| 0 | Irina |
| 1 | Rantouan |

Favourite_drinks:
| Id | User_Id | Drink |
| --- | --- | --- |
| 0 | 0 | Tea |
| 1 | 1 | Coffee |

A NoSQL database:

Users:
| Name |
| --- |
| Irina |
| Rantouan |

User_favourited_drinks:
| Name | Drink |
| --- | --- |
| Irina | Tea |
| Rantouan | Coffee |
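Here is a rough Python sketch of why the repeated layout is easier to split, and what it costs. The two dictionaries again stand in for two machines; the helper functions `drinks_on` and `rename_user` and the new name "Irene" are invented for this example and are not part of any real database.

```python
# Denormalized rows: every row of User_favourited_drinks carries the name itself.
machine_a = {
    "Users": [{"name": "Irina"}],
    "User_favourited_drinks": [{"name": "Irina", "drink": "Tea"}],
}
machine_b = {
    "Users": [{"name": "Rantouan"}],
    "User_favourited_drinks": [{"name": "Rantouan", "drink": "Coffee"}],
}


def drinks_on(machine):
    """Answer from local rows only; no ID has to be looked up elsewhere."""
    return [(row["name"], row["drink"]) for row in machine["User_favourited_drinks"]]


def rename_user(machine, old_name, new_name):
    """The price of repetition: every stored copy of the name is rewritten."""
    changed = 0
    for rows in machine.values():
        for row in rows:
            if row["name"] == old_name:
                row["name"] = new_name
                changed += 1
    return changed


combined = drinks_on(machine_a) + drinks_on(machine_b)
print(combined)  # [('Irina', 'Tea'), ('Rantouan', 'Coffee')]

copies_rewritten = rename_user(machine_a, "Irina", "Irene")  # "Irene" is a made-up new name
print(copies_rewritten)  # 2
```

Every machine can answer for its own rows without asking anyone else, which is what makes the split possible. The trade-off is the one SQL was avoiding: the same name is stored in several places, so a single change has to touch every copy.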

Once the question of how to store data in a distributed database was solved, it became easy to create the distributed database factories.

The next stop on our distributed databases trip, after airlines and NoSQL, is telecommunications. Mikael Ronström began developing NDB (“Network DataBase”) at Ericsson in the 1990s for the telecommunications market. This project ended up being the predecessor of MySQL Cluster, a technology that lets you, if you want to, run a distributed MySQL, one of the most famous databases, widely used on the web (this blog, like WordPress in general, uses MySQL) and beyond. Note that the NDB we refer to here has nothing to do with the NDB from Plan 9 that we talk about in “Origins of Key-Value Document Databases: NDB, the Hidden Gem from Plan 9”; they just share the same name.

Then came databases that were designed to be distributed from the start. Apache Cassandra is one of my favorite distributed databases. There are many others out there, such as MongoDB, Google’s internal Bigtable, CockroachDB (a publicly available database inspired by Google’s Spanner and built by ex-Google engineers), etc.

Distributed databases not only solved the big problems of computing but also opened new horizons: today’s AI tools that we all love store their enormous amounts of data, scraped from all over the internet, in such databases.

This essay was grammar-checked by ChatGPT, which made thousands of requests to its huge NoSQL database just to check my grammar by comparing it with what everyone else writes on the internet.

Last modified: 22 July, 2024
