How Distributed Databases Affect Backups

Big data has shaken up the database landscape. Working with big data means working with distributed databases, and in that scenario obtaining a complete, reliable copy of several petabytes of data is far from simple.

Hadoop, MongoDB and Cassandra are some of the most widely used products in big data. They spread data across multiple servers instead of packing it into a single, massive server.

The main advantage of this approach is flexibility: to accommodate more petabytes, you only have to add one or two inexpensive machines instead of paying a lot of money for a large server. However, there is one point of friction: backups and their subsequent recovery.

Backup problems and solutions in distributed big data databases

Traditional backup products struggle with very large amounts of data. The scale-out nature of the architecture can also be difficult for traditional backup applications to handle.

Today, horizontally scalable databases include some availability and recovery capabilities, but they are not as robust as those we are used to in traditional systems.

This is a problem that can leave large companies vulnerable when outages occur. However, it is also an opportunity for a new class of data protection products that is beginning to appear. One example is RecoverX, from the company Datos IO.

RecoverX is a new-generation product that enables backups when data is distributed among several small machines. In these cases, traditional backup products cannot provide a solution.

In a distributed database, the concept of a durable log no longer exists because there is no master. Each node works on its own tasks. Different nodes may have different privileges, and each node has a different view of an operation.

This is partly due to the need to account for what are commonly known as the three Vs of big data: volume, velocity and variety. More specifically, in order to offer scalability while housing huge amounts of diverse data arriving at increasingly alarming speeds, distributed databases have had to move away from the ACID criteria (Atomicity, Consistency, Isolation, Durability) that traditional relational databases follow. Instead, they have adopted what are known as BASE principles (Basically Available, Soft state, Eventual consistency).

This is a critical distinction. Where traditional databases promise strong consistency in everything (the C of ACID), distributed databases settle for what is called eventual consistency. Updates will be reflected in all the nodes of the database sooner or later, but there is a time delay.
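
The delay described above can be illustrated with a minimal sketch. It is a toy model, not the behavior of any particular database: the class names `Replica` and `EventuallyConsistentStore` and the `propagate` method are hypothetical, and replication is simulated with a simple in-memory queue.

```python
from collections import deque


class Replica:
    """One node holding its own copy of the data (hypothetical toy class)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class EventuallyConsistentStore:
    """Toy cluster: a write lands on one replica and reaches the others later."""

    def __init__(self, names):
        self.replicas = {name: Replica(name) for name in names}
        self.pending = deque()  # updates not yet applied on every replica

    def write(self, node, key, value):
        # The write is acknowledged as soon as one replica has it.
        self.replicas[node].data[key] = value
        for name in self.replicas:
            if name != node:
                self.pending.append((name, key, value))

    def propagate(self):
        # Stand-in for background replication: applies all queued updates.
        while self.pending:
            name, key, value = self.pending.popleft()
            self.replicas[name].data[key] = value


store = EventuallyConsistentStore(["a", "b", "c"])
store.write("a", "balance", 100)

stale = store.replicas["b"].read("balance")  # update has not reached "b" yet
store.propagate()
fresh = store.replicas["b"].read("balance")  # replicas have now converged
print(stale, fresh)
```

Between the write and the call to `propagate`, node "b" still returns the old state. A backup that copies "b" during that window captures data that is already out of date.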

In practice, if you need this kind of scalability, you have to give up strong consistency: you trade one for the other. That makes it difficult to obtain a reliable and complete backup of a big data database so that it can be recovered just when it is needed. Not only is it difficult to track what data might have been moved in a distributed database at a given time, but it is also difficult to stay protected if the data is corrupted.
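
To make the difficulty concrete, here is a small sketch of what can happen when each node is simply copied as it is. Everything in it is an assumption made for illustration: the helpers `naive_backup` and `reconcile` are hypothetical, each value is stored as a `(timestamp, payload)` pair, and conflicts are resolved by keeping the newest timestamp (last-write-wins). It is not a description of how any specific product works.

```python
def naive_backup(nodes):
    """Copy each node as-is, one after another (hypothetical helper)."""
    return {name: dict(rows) for name, rows in nodes.items()}


def reconcile(backup):
    """Merge per-node copies, keeping the newest version of each key.

    Each stored value is a (timestamp, payload) pair. Last-write-wins is
    an assumption of this sketch, not a universal rule.
    """
    merged = {}
    for rows in backup.values():
        for key, (ts, payload) in rows.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, payload)
    return merged


# Node "b" has not yet received the update written at time 2.
nodes = {
    "a": {"order-1": (2, "shipped")},
    "b": {"order-1": (1, "pending")},
    "c": {"order-1": (2, "shipped")},
}

backup = naive_backup(nodes)
versions = {rows["order-1"] for rows in backup.values()}
consistent = reconcile(backup)

print(len(versions))          # 2: the per-node copies disagree
print(consistent["order-1"])  # (2, 'shipped')
```

Restoring any single node's copy could bring back the stale "pending" record, so a usable backup needs some reconciliation step across nodes.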

This is what Datos IO is trying to solve with RecoverX. It attempts to address those concerns through features that include what the company calls scalable versioning and semantic deduplication. The result is distributed database backups that are space-efficient and available in native formats.
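
The article does not describe how RecoverX implements deduplication, so the following is only a generic sketch of the underlying idea: when the same record is replicated on several nodes, a backup can store it once. It uses plain content hashing with SHA-256, which is a swapped-in technique and not the product's semantic deduplication. The function `deduplicate` and the sample records are hypothetical.

```python
import hashlib


def deduplicate(node_dumps):
    """Store each distinct record once, keyed by a SHA-256 content hash."""
    unique = {}
    total = 0
    for records in node_dumps.values():
        for record in records:
            total += 1
            digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
            unique.setdefault(digest, record)  # keep the first copy only
    return unique, total


# Assumed replication factor of 3: every record appears on three nodes.
node_dumps = {
    "a": ["user:1,alice", "user:2,bob"],
    "b": ["user:1,alice", "user:2,bob"],
    "c": ["user:1,alice", "user:2,bob"],
}

unique, total = deduplicate(node_dumps)
print(total, len(unique))  # 6 2
```

In this toy case, six stored records shrink to two in the backup, which is where the space efficiency of a deduplicated backup comes from.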