With today’s applications being mostly data-driven, selecting the right database is the most critical decision that an application development team needs. The more complex your application, the more complicated your decision becomes. The decision in complicated cases is a trade-off between multiple factors. It is not only which database provider to select but also which type of database to select, making it a perfect fit for your application for the present and the scale-up required in the future.
For SaaS, big data pipelines, analytics platforms, machine learning, and artificial intelligence-based applications deciding on the database becomes crucial. Today many providers claim they have application-specific data solutions, but it is for you to test them out and check if it is a perfect fit for your use case.
This discussion will be centered around nosql vs rdbms and NewSQL. But before that, lets us get a clear understanding of the criteria used to decide on the right database solution.
You might be tempted to judge a database on the three main factors such as capacity, reading speed, writing speed, disaster recovery, and ease of use. If that is what you were stopping at, well, there are a lot more. Your system administrators or database administrators and database architects might know these factors. These additional factors include,
and a few more.
Databases are usually modeled on the type and structure of the data being stored.
Data can come in structured, unstructured, or semi-structured formats. Databases suitable for structured data might not be efficient for unstructured data, and databases great for unstructured data will not work out for structured and semi-structured data. As you can see, a decision on selecting the right database hinges very much on the type of data you want to store. Advanced high-level databases manage to integrate most types of data under one umbrella, using multiple databases and tools, like in a big data framework like Hadoop.
Let’s quickly go through the types.
Structured data is data that has a defined method of storing data, like in a table. Data items are stacked on top of each other, with the attributes or columns or properties aligned perfectly, like in a table with row and column arrangement. It does not have to be a row and column arrangement to be declared as structured data. You could have structured data in the form of a JSON file or an XML file.
Unstructured data is data where there is no specific repeatable layout within the data. By scanning through the data, you cannot be sure of what data items or property is currently being read. In fact, there are no properties or attributes, or columns. These might have to be worked on and extracted out from the data.
Semi-structured data is that in which you have structured metadata about the unstructured data that accompanies it. An email or a picture from your mobile camera is a good example of semi-structured data.
The scale of your application decides what database should be serving at the back end.
Why are we discussing Distributed systems in a database context? In short, databases today have to scale to accommodate a worldwide user base, think Facebook, YouTube, Twitter. Unless you are absolutely sure that your application will not expand to cater to a world audience, you are good without Distributed database systems. Still, if you do see your application scaling heights, you have to be ready to scale up with distributed systems.
What is a distributed database? Distributed databases store data using techniques like duplication and replication, improving data consistency and disaster tolerance. Distributed database systems, break data into chunks, store them across multiple machines, possibly spread over different locations in a parallel fashion, and massively increase performance along with the scale.
Choice of a database will also depend on how well or quick, how reliably, how consistently the database carries out every transaction. A transaction within a database is a single logical unit of work carried out by the database. When it comes to transactions, factors like atomicity, integrity, consistency, isolation, and durability of the transactions become relevant. ACID and BASE are nothing but acronyms that describe how trustworthy and efficient a transaction is within a database. ACID stands for Atomicity, Consistency, Isolation, Durability. The BASE stands for Basically Available, Soft state, Eventual consistency.
Atomicity is the ability of the database to execute a transaction and save it to the database only if all the operations involved in the transaction have been completed successfully. If not, the entire transaction is rolled back. It is either all or nothing. Allowing transactions with suspect atomicity on a database will leave a database in an inconsistent state.
Consistency explains the validity of data held in the database. Atomicity ensures consistency, but there are other ways the data might be rendered inconsistent. How a database ensures consistency is important to look at when selecting a database.
Isolation is a form of concurrency control, where concurrent transactions are queued and executed efficiently, using serializability, among other techniques.
Durability indicates the ability of the database system to ensure committed transactions are persisted under any conditions.
In the BASE, Basically Available assesses the ability of a database to be able to respond to any data request. Every data request should go answered; even if it fails to read the data, a response should be sent back indicating so. The S in BASE standing for Soft state indicates that before reaching an eventual state of consistency, the database system will remain in a soft or fluid state. The E in the BASE is Eventual consistency. It refers to the delay in reaching a consistent state within a Distributed Database, also called a soft state, before a final consistent state is guaranteed. Although a distributed system is guaranteeing high availability, it is not able to guarantee perfect consistency.
Already in the realm of trade-offs, might as well go deeper with a few more aspects related to distributed databases. According to the CAP theorem, a distributed system can’t provide consistency, availability, and partition tolerance, all at the same time. We have gone through both consistency and availability in the discussion above. Partition Tolerance ensures system function despite arbitrary partitioning caused by network failures. Partition here refers to a network partition wherein one component of a network cannot communicate with another. Network partitions are a result of note failures in a distributed file or database system.
This means a significant trade-off has to be settled for when using distributed databases, where network failures can cause network partitions. These trade-off cases are,
AP-high availability and partition tolerance, but inconsistent.
CP-Consistent and partition tolerant, but availability is suspect.
CA-High availability and consistency but low on partition tolerance.
These trade-offs should be considered and configured appropriately when siding with a distributed database system.
With network failures less likely, another trade-off should be considered. The PACELC formulation. It states that for network partitioned (P) distributed systems, the choice is between availability(A) and consistency(C), else(E) in the absence of partitions, the choice between Latency(L) and consistency(C) is advised.
With those concepts out of the way, let’s make use of them in deciding nosql vs rdbms and NewSQL.
Relational Database Management Systems are typically implemented as a single server as opposed to Distributed systems, are very well suited for storing structured relational data. These a characterized as CA systems with a lot of giving on partition tolerance. This setup will need highly reliable and expensive hardware.
RDBMS are good at consistency and ACID transactions while being stable. RDBMS is good at the resolution of complex querying and offline analytics. On the downside, RDBMS includes a lack of data modeling flexibility and schema rigidity.
NoSQL databases are best suited for unstructured data with huge volumes. They enjoy the benefits of scalability and high availability, favoring the Eventual Consistency model. NoSQL offers benefits like simplicity, flexibility, scalability, performance, low latency and high throughput, high availability, and partition tolerance. Limited support for SQL querying, NoSQL databases use bespoke querying languages. Downsides to NoSQL databases are general lack of ACID transactions, compromise on consistency and stability.
NewSQL is the middle path, combining RDBMS, bringing consistency and transactional ACID to NoSQL-like scalability.
A high-level procedure can be followed in order to identify the right database for your application. Something similar to what is shown below might help you get to your answer faster.
If you are interested in making a career in the Data Science domain, our 11-month in-person Postgraduate Certificate Diploma in Data Science course can help you immensely in becoming a successful Data Science professional.