How to Choose a Database


The choice of database rarely affects a system's functional requirements directly, but it strongly shapes how well the non-functional requirements (scalability, consistency, availability, latency) are served.

Some Common Choices

| Category | Databases | Description |
| --- | --- | --- |
| Relational Databases | MySQL, Oracle Database (Oracle DB), SQLite | Relational databases with structured data storage |
| Non-Relational Databases | | |
| Document Databases | MongoDB, Couchbase | NoSQL databases storing data as flexible documents |
| Key-Value Stores | Redis, Amazon DynamoDB | NoSQL databases storing data as key-value pairs |
| Columnar (Wide-Column) Databases | Apache Cassandra, Apache HBase | NoSQL databases that organize data into column families |
| Graph Databases | Neo4j, Amazon Neptune | NoSQL databases designed for graph data models |
| Time Series Databases | InfluxDB, Prometheus, OpenTSDB | NoSQL databases optimized for time-series data storage |
| Other Data Stores | | |
| File Storage | BLOB (Binary Large Object) storage, Amazon S3 (with a CDN) | Stores large binary data objects, such as images or documents |
| Analytics-Based DB | Data warehousing solutions such as Hadoop (Hadoop Distributed File System, HDFS) | Distributed file system used for big data analytics |
| Caching Solutions | Redis, Memcached, HCD, Hazelcast | |
| Text Search / Fuzzy Search Engines | Elasticsearch, Solr (both built on Apache Lucene) | |
| Application Metric Tracking Systems | Time series databases such as InfluxDB, OpenTSDB | |

SQL vs NoSQL?

| Feature | SQL Databases | NoSQL Databases |
| --- | --- | --- |
| Data Model | Relational, structured data, fixed schema | Non-relational, flexible schema (document, key-value, columnar, graph) |
| Consistency vs. Availability | Generally prioritize consistency over availability. ACID transactions ensure data consistency. | Often prioritize availability over consistency. Can implement eventual consistency models. |
| Data Retrieval | Primarily SQL (Structured Query Language) queries. Join operations are common. | Varies by database type. Typically, data retrieval involves key-value lookups, document queries, or graph traversals. |
| Scalability | Vertical scalability is common, meaning scaling up resources (CPU, RAM) on a single server. | Horizontal scalability is favored, achieved through sharding and distributed architecture. Scales out by adding more servers. |
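
To make the data model and data retrieval rows concrete, here is a minimal sketch that stores the same information twice: once as relational tables queried with a join, and once as a single denormalized document. The table names, the sample user, and the helper functions are illustrative assumptions, and a plain Python dict stands in for a document database.

```python
import sqlite3

# Relational model: users and orders live in separate tables linked by a key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER NOT NULL REFERENCES users(id),
                         item TEXT NOT NULL);
    INSERT INTO users  VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1, 'keyboard'), (11, 1, 'mouse');
""")

def items_relational(user_id):
    """Fetch a user's items by joining the two tables."""
    rows = conn.execute(
        "SELECT o.item FROM users u JOIN orders o ON o.user_id = u.id "
        "WHERE u.id = ? ORDER BY o.id",
        (user_id,),
    ).fetchall()
    return [item for (item,) in rows]

# Document model: the same data, denormalized into one self-contained record.
# A dict keyed by user id stands in for a document collection (assumption).
user_documents = {
    1: {
        "name": "Asha",
        "orders": [{"id": 10, "item": "keyboard"}, {"id": 11, "item": "mouse"}],
    },
}

def items_document(user_id):
    """Fetch a user's items with a single key lookup, no join needed."""
    return [order["item"] for order in user_documents[user_id]["orders"]]
```

Both functions return the same list of items. The relational version keeps each fact in one place and pays for a join at read time; the document version reads in one lookup but duplicates data if the same order details are embedded elsewhere.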

Do you need ACID guarantees?

ACID names four fundamental properties of database transactions:

  1. Atomicity: A transaction either succeeds completely or fails completely, leaving no partial changes behind.

  2. Consistency: Transactions maintain data consistency, adhering to integrity constraints during execution.

  3. Isolation: Transactions occur independently, preventing interference between concurrent transactions.

  4. Durability: Committed transactions persist despite system failures, ensuring data reliability.
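
As a rough illustration of atomicity and consistency, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The accounts table, the sample balances, and the transfer and balances helpers are assumptions made for the example. The credit is deliberately applied before the debit so that a failed debit has to undo a change that was already written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the integrity rule: no account may go negative.
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit was rolled back too.
        return False

def balances(conn):
    """Return the current balances as a dict for easy inspection."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

Calling transfer(conn, "alice", "bob", 30) succeeds and moves the money. Calling it with an amount larger than alice's balance returns False and leaves both balances exactly as they were, even though the credit to bob had already been executed inside the transaction.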

To determine if your software system requires ACID guarantees, it's essential to consider several factors:

  1. Data Criticality: Assess the criticality of your data. Systems dealing with financial transactions, inventory management, or healthcare records often require ACID guarantees to maintain data integrity. For instance, a banking system must ensure that funds are transferred accurately and reliably between accounts, without risking inconsistencies or errors.

  2. Transaction Complexity: Evaluate the complexity of transactions in your system. E-commerce platforms handling orders and inventory updates may require ACID properties to ensure that a purchase transaction deducts items from inventory and updates order status atomically. For example, an order processing system at the scale of Amazon's needs the inventory deduction and the order status update to succeed or fail together, so that the components involved in order fulfillment never see a half-completed purchase.

  3. Concurrency Control: Consider the level of concurrency in your system. Online reservation systems, like those used by airlines or hotels, must handle multiple users concurrently booking seats or rooms. ACID compliance helps prevent issues such as double bookings or overselling by maintaining data consistency across distributed systems.

  4. Regulatory Compliance: Many industries, including banking, healthcare, and e-commerce, are subject to regulatory requirements that mandate data consistency and auditability. For instance, electronic health record systems must ensure that patient data remains consistent and confidential.

  5. Performance and Scalability: ACID guarantees offer strong consistency, but they can introduce performance overhead, especially in distributed systems where a transaction must coordinate across nodes. Social media platforms like Facebook are generally understood to combine techniques such as eventual consistency and optimistic concurrency control to maintain performance and scalability while preserving data integrity.

  6. Hybrid Approaches: In some cases, systems employ hybrid approaches, combining ACID transactions for critical operations with eventual consistency for less critical data. This approach is common in large-scale applications. Distributed SQL databases such as Google's Spanner take a different route and provide globally distributed transactions with strong consistency.
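
The double-booking problem from the concurrency point above can be sketched with a conditional update: the seat is claimed only if it is still free, and the database applies that check and the write as one atomic statement. This is a minimal single-node sketch using sqlite3; the seats table, the seat number, and the book_seat helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (seat TEXT PRIMARY KEY, booked_by TEXT)")
conn.execute("INSERT INTO seats VALUES ('12A', NULL)")
conn.commit()

def book_seat(conn, seat, passenger):
    """Claim the seat only if it is still free; return True on success."""
    with conn:  # one transaction per booking attempt
        cur = conn.execute(
            "UPDATE seats SET booked_by = ? WHERE seat = ? AND booked_by IS NULL",
            (passenger, seat),
        )
    # rowcount is 1 if this caller won the seat, 0 if someone else already had it.
    return cur.rowcount == 1
```

The first caller to book seat 12A gets True; any later caller gets False and the original booking is untouched. The alternative of reading the seat first and then writing in a separate step would leave a window in which two users could both see the seat as free.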

How would you scale?

| Feature | Horizontal Scaling (Sharding) | Vertical Scaling |
| --- | --- | --- |
| Definition | Adding more machines or instances to distribute load across multiple servers. | Increasing the capacity of existing servers (e.g., adding more CPU, RAM, storage). |
| Scalability | Highly scalable. Can handle increased traffic and data by adding more servers. | Limited scalability. Eventually, hardware limitations may restrict further expansion. |
| Cost | Generally more cost-effective. Can start with fewer resources and scale as needed. | May be more expensive. Upgrading hardware components can involve higher upfront costs. |
| Fault Tolerance | Provides better fault tolerance. If one server fails, others can continue to handle requests. | Relies heavily on the reliability of a single server. Failure of the main server can cause downtime until it's restored. |
| Complexity | May introduce complexity in managing distributed systems, including data partitioning, replication, and synchronization. Utilizes techniques like consistent hashing for efficient data distribution. | Typically simpler to implement and manage since it involves upgrading existing hardware without changing the overall system architecture. |
| Performance Impact | May experience network latency and communication overhead between distributed components. Distributed queries are common, where data is fetched from multiple nodes simultaneously. | Can provide immediate performance improvements by increasing the resources available to the system. |
| Downtime Impact | Can usually perform upgrades and maintenance without significant downtime since traffic can be rerouted to other nodes. | Upgrades or maintenance may require downtime since resources are being added or replaced on the main server. |
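
Consistent hashing, mentioned in the complexity row, can be sketched in a few lines. The idea is to place nodes and keys on the same hash ring so that adding or removing a node only reassigns the keys next to it. The class below is a minimal illustration, not any particular database's implementation; the class name, the vnodes parameter, and the choice of MD5 as the ring hash are assumptions.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    # Stable across processes, unlike Python's built-in hash() for strings.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class ConsistentHashRing:
    """Maps keys to nodes; each node owns several points (virtual nodes) on the ring."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key):
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First ring point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

With a ring of three nodes, adding a fourth moves only the keys that now fall to the new node; every other key stays where it was. A plain `hash(key) % node_count` scheme, by contrast, would remap most keys whenever the node count changes.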

The Big Four Factors

  1. Data Structure and ACID Compliance: e.g., UPI transactions for Paytm

    • If your data is highly structured and demands strict transactional integrity (ACID compliance), traditional RDBMS solutions like MySQL, Oracle, SQL Server, PostgreSQL, or MariaDB are well suited to the task.
  2. Structured Data and Scalability: e.g., Content Management Systems (CMS)

    • For structured data without stringent ACID requirements, SQL databases still offer viable options. Additionally, consider document-oriented databases such as MongoDB or Couchbase. These NoSQL databases provide scalability advantages while accommodating structured data models.
  3. Unstructured Data and Read-Heavy Workloads: e.g., Amazon Product Catalogue

    • If your data lacks a rigid structure and involves a diverse range of types, particularly in scenarios with high read query volumes, document databases like MongoDB are a good fit. They handle unstructured data well and support efficient retrieval.
  4. Write-Heavy Systems with Finite Queries: e.g., a growing number of Uber drivers sending location pings, or Amazon orders being placed during a flash sale

    • Where writes dominate the workload and data volume keeps growing while the set of query patterns stays small and fixed (ever-increasing data with finite queries), consider columnar (wide-column) databases such as Cassandra or HBase. Their log-structured storage turns writes into sequential appends, and tables are modeled around the known queries, which keeps both ingestion and retrieval efficient.
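
The four factors can be read as a small decision procedure. The sketch below encodes them as a rule-of-thumb helper; the Workload fields, the suggest_database function, and the order in which the rules are checked are assumptions made for illustration, and real selection involves more inputs than four booleans.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    structured: bool           # data fits a well-defined schema
    needs_acid: bool           # strict transactional integrity required
    read_heavy: bool = False   # high read query volume
    write_heavy: bool = False  # ever-growing data, small fixed set of queries

def suggest_database(w: Workload) -> str:
    """Return a rule-of-thumb database family for the given workload."""
    if w.structured and w.needs_acid:
        return "relational (e.g. MySQL, PostgreSQL)"            # factor 1
    if w.write_heavy:
        return "columnar / wide-column (e.g. Cassandra, HBase)"  # factor 4
    if w.structured:
        return "relational or document (e.g. PostgreSQL, MongoDB)"  # factor 2
    if w.read_heavy:
        return "document (e.g. MongoDB)"                         # factor 3
    return "no rule of thumb applies; evaluate case by case"
```

For instance, a payments workload (structured, ACID required) maps to a relational database, while a stream of location pings (write-heavy, no strict ACID need) maps to a wide-column store.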

Cheat Sheet for Databases

| Database | Type / Data Model / Query Language & Schema Flexibility | CAP Preference & Scalability | Read/Write Balance | Applications |
| --- | --- | --- | --- | --- |
| MySQL | Relational / Tables / SQL, Fixed Schema | Consistency, Vertical (Scaling Up) | Balanced (supports both read and write operations) | Enterprise Applications, E-commerce, Data Warehousing |
| MongoDB | Non-Relational / Binary JSON (BSON) documents / MongoDB Query Language, Flexible | Consistency by default (tunable toward availability), Horizontal (Sharding) | Read Heavy (limited write concurrency on documents) | Content Management Systems, E-commerce, Blogs |
| Cassandra | Non-Relational / Wide-Column Tables / CQL (Cassandra Query Language), Flexible | Availability, Horizontal (Scaling Out) | Write Heavy | Real-time Analytics, Sensor Data, Event Logging |
| InfluxDB | Non-Relational / Time Series / InfluxQL, Flexible | Availability, Horizontal (Sharding) | Very Write Heavy | IoT Devices, Monitoring, Financial Trading |
| Neo4j | Graph / Property Graph / Cypher, Flexible | Consistency (ACID), Horizontal (Scaling Out) | Read Heavy (graph-based queries) | Social Networks, Recommendations, Fraud Detection |
| SQLite | Relational / Tables / SQL, Fixed Schema | Consistency, Limited | Balanced | Embedded Systems, Mobile Applications, Small Websites |
| Amazon DynamoDB | NoSQL (Key-Value and Document) / Key-Value pairs / AWS SDKs, NoSQL Workbench, Flexible | Partition tolerance, Availability, Horizontal (Auto Scaling) | Balanced | Mobile Apps, Gaming, Ad Tech |
| PostgreSQL | Relational / Tables / SQL, Fixed Schema with flexible JSON/JSONB columns | Consistency, Vertical and Horizontal Scaling | Balanced | GIS, Data Analysis, High-Volume Transaction Systems |

Thank you for reading.

Additional Notes

Data Partitioning in SQL databases

| Partitioning Type | Horizontal Partitioning | Vertical Partitioning |
| --- | --- | --- |
| Definition | Divides a table's rows into multiple partitions based on a partition key or specific criteria. | Involves splitting a table into smaller tables with fewer columns. |
| Scaling Benefit | Improves scalability by distributing data across multiple partitions to handle larger volumes of data and higher query loads. | Improves scalability and performance by reducing the amount of data read from disk and optimizing storage efficiency. |
| Example | Customer database partitioned by geographic regions | User profile database storing frequently accessed and less frequently accessed columns in separate partitions |
| Implementation | Range partitioning, list partitioning, or hash partitioning are commonly used techniques to distribute data across partitions. | Requires analysis of data access patterns and query requirements to determine column placement in each partition. |
| Advantages | Enables horizontal scaling by adding more servers or nodes to the cluster. | Allows efficient utilization of storage resources and optimized query execution by minimizing the amount of data accessed for each query. |
| Joins and Queries | Horizontal partitioning may complicate joins involving data from multiple partitions, potentially leading to performance overhead due to network communication. | Queries that touch only one column group avoid reading the rest of the row. However, retrieving a complete row requires joining the partitions on the primary key. |
| Normalization | Horizontal partitioning may require denormalization techniques to avoid cross-partition joins and maintain data integrity. | Vertical partitioning aligns with normalization principles by segregating related columns into separate tables, promoting data integrity and reducing redundancy. |
| Data Distribution | Horizontal partitioning distributes data based on partition keys, potentially leading to uneven data distribution across partitions. | Vertical partitioning distributes data by column, so columns that are accessed together are stored together. Every partition holds one entry per row, so row counts stay even across partitions. |
| Scalability Considerations | Scaling horizontally requires adding more nodes or servers to accommodate increasing data and query loads, necessitating efficient partitioning strategies to maintain performance. | Scaling vertically involves optimizing table designs and column partitioning to accommodate growing data volumes while ensuring efficient data access and query performance. |
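
The three horizontal partitioning techniques named in the implementation row (range, list, hash) differ only in how a row's key is mapped to a partition. The functions below are a minimal sketch of each mapping, not the syntax of any specific SQL engine; the function names and the use of CRC32 as the hash are assumptions.

```python
import bisect
import zlib

def range_partition(key, upper_bounds):
    """Partition i holds keys below upper_bounds[i]; the last partition is open-ended."""
    return bisect.bisect_right(upper_bounds, key)

def list_partition(value, groups, default=None):
    """`groups` maps a partition name to the set of values stored in it."""
    for name, values in groups.items():
        if value in values:
            return name
    return default

def hash_partition(key, partition_count):
    """Spread keys across partitions using a hash that is stable across runs."""
    return zlib.crc32(str(key).encode("utf-8")) % partition_count
```

For example, with upper bounds [1000, 2000], customer ID 500 lands in partition 0, ID 1000 in partition 1, and ID 2500 in the open-ended partition 2. List partitioning suits the geographic example in the table: a mapping such as {"emea": {"FR", "DE"}, "apac": {"IN", "JP"}} routes each customer by country code.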

Data Partitioning in NoSQL databases

| Aspect | Horizontal Partitioning | Vertical Partitioning |
| --- | --- | --- |
| Definition | Divides data across multiple nodes or partitions based on a shard key or specific criteria. | Segregates data within a document or table into smaller, more manageable units based on columns or attributes. |
| Scaling Benefit | Improves scalability by distributing data across multiple nodes or partitions to handle larger datasets and higher query loads. | Enhances scalability and performance by organizing data efficiently within documents or tables, reducing the need to access unnecessary data. |
| Example | MongoDB database sharded based on user IDs | Cassandra table with different columns for frequently accessed and less frequently accessed attributes |
| Implementation | Sharding can be achieved using range-based, hash-based, or composite sharding methods, distributing data across shards based on shard keys. | Vertical partitioning involves organizing data within documents or tables based on access patterns and query requirements, typically using column families or attributes. |
| Advantages | Enables horizontal scaling by adding more nodes or shards to the cluster, accommodating growing datasets and query workloads. | Facilitates efficient data access and retrieval by optimizing document or table structures, reducing data redundancy, and improving query performance. |
| Joins and Queries | Joins are less relevant as NoSQL databases often denormalize data to avoid complex joins, favoring embedded documents or key-value pairs for related data. | Denormalization is encouraged to optimize query performance and simplify data access, minimizing the need for traditional normalization principles. |
| Data Distribution | Horizontal partitioning distributes data based on shard keys, which may lead to uneven data distribution and data hotspots across shards. | Vertical partitioning distributes data based on document or table structures, optimizing data distribution and minimizing data skew within documents or tables. |
| Scalability Considerations | Scaling horizontally involves adding more nodes or shards to the cluster, necessitating effective sharding strategies and monitoring to maintain data balance and performance. | Scaling vertically requires optimizing document or table structures and column families to accommodate growing data volumes while ensuring efficient data access and query performance. |
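
The hotspot risk noted in the data distribution row is easiest to see with a monotonically increasing shard key. The simulation below is a simplified sketch, not how any specific database routes writes: the shard count, the size of each ID range, and the function names are assumptions. It compares range-based and hash-based sharding for a burst of new documents with sequential IDs.

```python
import hashlib
from collections import Counter

SHARDS = 4

def range_shard(doc_id, ids_per_shard=1_000_000):
    """Range-based: contiguous blocks of IDs share a shard (capped at the last shard)."""
    return min(doc_id // ids_per_shard, SHARDS - 1)

def hashed_shard(doc_id):
    """Hash-based: a stable hash of the key picks the shard."""
    digest = hashlib.md5(str(doc_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % SHARDS

def write_load(shard_fn, doc_ids):
    """Count how many writes each shard receives."""
    return Counter(shard_fn(doc_id) for doc_id in doc_ids)

# 1,000 new documents with sequential IDs, as an auto-increment or timestamp key would produce.
new_ids = range(2_500_000, 2_501_000)
range_load = write_load(range_shard, new_ids)    # every write lands on a single shard
hashed_load = write_load(hashed_shard, new_ids)  # writes spread over all shards
```

With range-based sharding all 1,000 writes hit the one shard that owns the current ID range, while the hashed key spreads them across all four. The trade-off is that hashing scatters neighboring IDs, so range queries over the key must contact every shard.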
