Best of Database — July 2024

1
Article
Substack·2y
System Design: How to Scale a Database
As an application grows, scaling its database becomes essential to maintaining performance. Strategies include: vertical scaling, adding resources to one server; indexing, creating indexes on frequently queried columns; sharding, splitting data across different servers; vertical partitioning, separating columns into smaller tables; caching, storing frequently accessed data in a faster storage layer; replication, creating copies of the database in different regions; materialized views, pre-computing and storing complex query results; and data denormalization, introducing redundancy to optimize reads by combining tables. Each method has trade-offs and can be combined based on application needs.
1.3K
26
2
Article
Community Picks·2y
Design a Robust School Bus Tracker System
This post discusses the architecture for a school bus tracker system, focusing on real-time monitoring and parental notifications. Key functional requirements include frequent location updates, real-time map visualization, proximity notifications, and data isolation between schools. Non-functional requirements involve scalability, high availability, reliability, security, and privacy. The post elaborates on various technical aspects such as API design, geohashing for location indexing, and using Redis for real-time updates. It also covers the system’s read/write ratio, proposing DynamoDB, and explores various scalability strategies using AWS managed services.
267
11
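The geohashing idea mentioned in the summary above can be sketched as follows. This is a minimal, standalone version of the standard geohash encoding (alternating longitude and latitude bisection, five bits per base-32 character); it is not taken from the linked post, and the function name `geohash_encode` and the default precision are assumptions made for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    """Encode a latitude/longitude pair as a geohash string (illustrative sketch)."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # geohash bits alternate, starting with longitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0

    return "".join(chars)


# Hypothetical bus position; a shorter hash names a larger enclosing cell.
cell = geohash_encode(57.64911, 10.40744, precision=5)
```

Because a longer geohash always starts with the shorter one for the same point, a tracker could index positions by prefix and look up "buses in this cell" with a prefix match. Whether the linked design does exactly this is not stated in the summary.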
3
Article
System Design Codex·2y
7 Techniques for Database Performance & Scaling
The performance and scalability of databases are crucial for enhancing user experience. Important factors affecting database performance include item size, item type, dataset size, and throughput requirements. Seven effective techniques for optimizing database performance include indexing, materialized views, denormalization, vertical scaling, caching, replication, and sharding. Each technique offers unique benefits and trade-offs. Indexing improves query speed but uses additional disk space, while materialized views reduce query time but require extra storage. Denormalization enhances read performance at the cost of data redundancy. Vertical scaling boosts performance but has hardware limits. Caching decreases database load and read time but may involve data staleness. Replication enhances read performance and availability yet introduces replication lag. Finally, sharding enables horizontal scaling and cost reductions but adds complexity in data management.
153
1
4
Article
KDnuggets·2y
5 Tips for Improving SQL Query Performance
Strong SQL skills are crucial in data roles, where optimizing query performance can significantly impact application efficiency. Key tips include avoiding SELECT * by specifying columns, using GROUP BY instead of SELECT DISTINCT, limiting query results, and employing indexes with caution. Balancing these techniques can improve query performance and ensure efficient database operations.
149
3
5
Article
Hacker News·2y
QuestDB
QuestDB is an open-source time-series database with SQL analytics designed to efficiently handle data ingestion and analysis. The post details the development and debugging of a primary-replica replication feature, addressing a performance issue related to excessive network bandwidth usage. The author implemented a custom network profiling tool using Rust to capture and analyze network traffic, identifying the root cause of the problem. The solution involved optimizing how metadata was uploaded, ultimately improving bandwidth efficiency. Techniques used within QuestDB for high ingestion performance were also highlighted.
142
1
6
Article
Community Picks·2y
The Performance Impact of Writing Bad SQL Queries
Poorly written SQL queries can severely degrade database performance, leading to slow response times and inefficient resource utilization. Common mistakes include using 'SELECT *', ignoring execution plans, and inefficient joins. SQL’s simplicity can lead to writing slow queries, especially without proper knowledge or under tight deadlines. Sometimes, systems can tolerate inefficient queries in non-critical applications or low-concurrency environments. However, these bad queries can cause hidden bottlenecks and increased resource consumption. Using tools like execution plans and IDE plugins can help optimize SQL queries, ensuring better system efficiency and scalability.
119
8
7
Article
Laravel News·2y
Visual EXPLAIN for MySQL and Laravel
The MySQL Visual Explain tool by Tobias Petry simplifies the analysis of slow queries by providing a visual representation of MySQL's EXPLAIN output. An API and a Laravel package are available, adding methods to the query builder and offering various options to visualize and debug queries easily.
111
8
Video
developedbyed·2y
SQL Indexes Explained in 20 Minutes
This video covers SQL indexing: its purpose, how it works, and its benefits and drawbacks. It includes a practical example of creating and using indexes to optimize query performance and discusses how too many indexes can affect database size and update operations.
109
1
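As a rough illustration of the indexing example described above, the sketch below uses Python's built-in `sqlite3` module to compare the query plan before and after creating an index. The `users` table, its columns, and the index name `idx_users_email` are hypothetical and not taken from the video; plan wording also differs between database engines and SQLite versions.

```python
import sqlite3


def plan_for(conn, sql, params=()):
    """Return SQLite's query plan for a statement as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

query = "SELECT name FROM users WHERE email = ?"
args = ("user500@example.com",)

# Without an index, the lookup has to scan the whole table.
plan_before = plan_for(conn, query, args)

# With an index on the filtered column, SQLite can search the index instead.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan_for(conn, query, args)

print("before:", plan_before)
print("after: ", plan_after)
```

The trade-off mentioned in the summary still applies: each extra index takes space and has to be maintained on every insert and update.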
9
Article
Machine Learning News·2y
Korvus: An All-in-One Open-Source RAG (Retrieval-Augmented Generation) Pipeline Built for Postgres
Korvus aims to simplify the Retrieval-Augmented Generation (RAG) pipeline by executing the entire process within a Postgres database using PostgresML. This approach eliminates the need for multiple external tools, reduces development complexity, and improves efficiency by leveraging in-database machine learning for tasks like embedding generation and data retrieval. Korvus supports multiple programming languages, facilitating easier integration and maintenance of search applications, although its performance metrics are yet to be quantified.
89
10
Article
System Design Codex·2y
How Reddit Serves 100K Metadata Requests Per Second
Reddit faced challenges handling scattered metadata across multiple systems. To address this, they built a unified media metadata store using AWS Aurora Postgres. This solution supports over 100K read requests per second with low latency. The setup included dual writes, data backfill, and robust data validation using Kafka for Change Data Capture (CDC). They also implemented range-based partitioning to ensure performance and scalability, enabling Reddit to handle expected volume growth efficiently.
71
2
11
Article
Community Picks·2y
Performance Benchmarks: Comparing Query Latency across TypeScript ORMs & Databases
Performance benchmarks compare query latencies of three TypeScript ORMs (Prisma, TypeORM, Drizzle) across PostgreSQL databases on AWS RDS, Supabase, and Neon. Benchmarking methodology includes 14 queries, executed 500 times on an EC2 instance to measure query latencies. Results show that performance varies based on the specific query, dataset, schema, and infrastructure. Most queries perform similarly across different ORMs, with some exceptions like 'Nested find all' queries. Factors such as network latency and limited API features impact the results. Prisma Optimize offers insights and recommendations for better query performance.
56
4
12
Article
Hacker News·2y
The Great Database Migration
Shepherd successfully migrated its pricing engine database from SQLite to Postgres with zero downtime. The new architecture improves scalability, performance, and developer experience. The migration included converting synchronous functions to asynchronous, leveraging a serverless architecture with Neon, and automating ETL processes. The project highlighted performance optimizations, including caching strategies and connection pooling, resulting in significantly improved response times.
56
13
Article
Javarevisited·2y
System Design — Tips. Designing a robust and scalable system…
Designing a robust and scalable system involves understanding both functional and non-functional requirements, choosing the right architecture (monolithic vs microservices), and implementing strategies for scalability, database design, fault tolerance, security, and monitoring. Techniques like caching, load balancing, redundancy, and message queues can enhance performance, while considerations like distributed locking, data replication, and API gateways ensure reliability and efficiency in operations.
44
1
14
Article
Community Picks·2y
Dealing with Race Conditions: A Practical Example
The post describes a practical example of dealing with race conditions in an application managing on-call shifts for doctors. It explains how naive API implementations can lead to race conditions and demonstrates two PostgreSQL-based solutions—serializable transaction isolation and advisory locks—to handle these issues. The article includes SQL snippets and code examples for implementing these solutions and discusses the importance of addressing race conditions in various real-life scenarios.
43
15
Article
Hacker News·2y
PostgreSQL and UUID as primary key
UUIDs are often used as primary keys in databases due to their uniqueness and ease of generation. While not always the optimal choice due to size concerns, PostgreSQL offers a dedicated UUID type that is more efficient than storing UUIDs as text. Experiments show that using the `uuid` type significantly reduces table and index size compared to `text`. Furthermore, UUID v7, which generates time-sorted values, improves insert performance over the more common UUID v4. These optimizations are crucial for large datasets and high-traffic applications.
40
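The size difference between a native UUID and its text form can be seen without a database. The sketch below, using Python's standard `uuid` module, only compares the raw value sizes (16 bytes versus 36 characters); it does not measure PostgreSQL's on-disk table or index sizes, which the linked article reports from its own experiments. The UUID value is an arbitrary example.

```python
import uuid

# Arbitrary example value, not taken from the article.
u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

as_text = str(u)    # canonical hyphenated form, as it would be stored in a text column
as_bytes = u.bytes  # the 128-bit value itself, as a binary uuid type can store it

print(len(as_text), "characters as text")
print(len(as_bytes), "bytes as binary")

# The binary form round-trips to the same UUID, so nothing is lost.
restored = uuid.UUID(bytes=as_bytes)
```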
16
Article
PlanetScale·2y
Sharding strategies: directory-based, range-based, and hash-based
Discover the different types of sharding strategies—directory-based, range-based, and hash-based—along with their pros and cons. Understand how solutions like Vitess and PlanetScale are making sharding more approachable, even though it remains a complex task. Learn how to choose the right sharding strategy based on your database needs while considering the potential challenges like uneven data distribution and added query complexity.
37
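The three strategies named above differ mainly in how a key is mapped to a shard. The following is a minimal sketch of each routing rule, not PlanetScale's or Vitess's implementation; the shard count, range boundaries, tenant names, and function names are all hypothetical.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def hash_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash-based: a stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Range-based: shard 0 holds ids below 1000, shard 1 holds 1000-1999,
# shard 2 holds 2000-2999, and shard 3 holds everything from 3000 up.
RANGE_BOUNDARIES = [1000, 2000, 3000]


def range_shard(user_id: int) -> int:
    """Range-based: find the range the key falls into."""
    return bisect.bisect_right(RANGE_BOUNDARIES, user_id)


# Directory-based: an explicit lookup table decides placement.
DIRECTORY = {"acme": 2, "globex": 0, "initech": 1}


def directory_shard(tenant: str) -> int:
    """Directory-based: look the key up; unknown keys raise KeyError."""
    return DIRECTORY[tenant]
```

The sketch also hints at the trade-offs the summary mentions: range boundaries can leave shards unevenly filled, the directory is an extra component to keep available and consistent, and changing `NUM_SHARDS` in the simple modulo scheme remaps most keys.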
17
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
GROUPING SETS in SQL
Learn how to efficiently run multiple aggregations in SQL using GROUPING SETS, which allows scanning the table just once. This method is more efficient compared to using UNION with separate queries. The post provides a detailed example and a link to a Jupyter Notebook for practical implementation.
35
18
Article
This Dot·2y
The Dangers of ORMs and How to Avoid Them
A legacy ASP.NET application using Entity Framework faced performance issues due to misuse of ORM tools. The post discusses common ORM misuse patterns like N+1 queries, eager versus lazy loading, and lack of indexes. It provides examples using TypeORM with practical suggestions to improve performance, including prefetching data, avoiding unnecessary data loading, and optimizing database indexes.
33
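The N+1 pattern mentioned above can be shown without an ORM. The sketch below uses plain `sqlite3` rather than the TypeORM examples from the post; the `authors` and `posts` tables and both function names are hypothetical. Each function returns its result together with the number of queries it issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO authors VALUES (1, 'Ada'), (2, 'Linus'), (3, 'Grace');
INSERT INTO posts VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'L1'), (4, 3, 'G1');
""")


def titles_n_plus_one(conn):
    """One query for the authors, then one more query per author (N+1)."""
    queries = 1
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn):
    """The same data fetched up front with a single JOIN."""
    rows = conn.execute(
        "SELECT a.name, p.title FROM authors a "
        "JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id"
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result, 1


slow_result, slow_queries = titles_n_plus_one(conn)
fast_result, fast_queries = titles_single_join(conn)
print(slow_queries, "queries vs", fast_queries)
```

With three authors the loop issues four queries where the join issues one, and the gap grows linearly with the number of parent rows. Note that an inner join drops authors with no posts, so a real replacement may need a LEFT JOIN.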
19
Article
Community Picks·2y
Uber’s Secret to Handle Millions of Logs per second with ClickHouse
Uber overhauled its logging infrastructure by switching to ClickHouse, an open-source OLAP database, to handle millions of logs per second. The change addressed key issues they faced with Elasticsearch, such as developer productivity, performance, and scalability. ClickHouse offers high-throughput ingestion, fast query performance, efficient storage, dynamic indexing, and clustering capabilities, making it a robust and scalable solution for Uber's massive logging needs.
29
20
Article
Community Picks·2y
How Halo Scaled to 11.6 Million Users Using the Saga Design Pattern 🎮
Halo scaled to 11.6 million users using the Saga design pattern, which manages failure in distributed systems by dividing transactions into sub-transactions. It uses an Orchestrator for transaction management and a durable log for state tracking, thereby maintaining data consistency and avoiding single points of failure. Saga is commonly used in microservices architectures, such as e-commerce, travel booking systems, and banking.
25
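The orchestrated Saga described above can be sketched in a few lines. This is an illustrative toy, not Halo's implementation: the orchestrator is a single function, the "durable log" is an in-memory list standing in for real durable storage, and the step names are hypothetical.

```python
from typing import Callable, List, Tuple

# A step is (name, action, compensation). The compensation undoes the action.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run sub-transactions in order; on failure, compensate completed ones in reverse."""
    completed = []
    for name, action, compensate in steps:
        log.append(f"start:{name}")
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, done_compensate in reversed(completed):
                done_compensate()
                log.append(f"compensated:{done_name}")
            return False
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True


# Hypothetical two-step order flow where the second step fails.
state = {"reserved": False}
log: List[str] = []


def reserve_item() -> None:
    state["reserved"] = True


def release_item() -> None:
    state["reserved"] = False


def charge_card() -> None:
    raise RuntimeError("card declined")


def refund_card() -> None:
    pass  # nothing to refund in this sketch


ok = run_saga(
    [("reserve", reserve_item, release_item), ("charge", charge_card, refund_card)],
    log,
)
```

In a real system the log would be written to durable storage before and after each step, so that a restarted orchestrator can resume or compensate from where it stopped.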
21
Article
Community Picks·2y
Autoscaling in Action: Postgres Load Testing with pgbench
Learn how to use pgbench to load test a Postgres database and demonstrate autoscaling in action on the Neon platform. The load test simulates 30 clients running a computationally expensive query, which triggers dynamic resource allocation through autoscaling. Key steps, such as enabling autoscaling and monitoring performance metrics, are highlighted. pgbench and EXPLAIN ANALYZE are used to understand the query's performance and execution plan.
24
22
Article
Community Picks·2y
Laravel v11.17.0 Released: Add whereLike clause, Allow microsecond travel, Add method QueryExecuted::toRawSql(), Reduce the number of queries with Cache::many and Cache::putMany
Laravel v11.17.0 introduces several new features: the `whereLike` clause to enhance query builders with LIKE queries supporting case sensitivity, microsecond travel precision, the `QueryExecuted::toRawSql()` method to facilitate SQL debugging, and optimized query handling with `Cache::many` and `Cache::putMany` which reduces database queries for caching operations. Several minor improvements and bug fixes are also included.
23
23
Article
Community Picks·2y
Benchmarking PostgreSQL connection poolers: PgBouncer, PgCat and Supavisor
Connecting to a PostgreSQL database for short-lived queries is resource-intensive. To address this, three popular connection poolers are compared: PgBouncer, PgCat, and Supavisor. PgBouncer, often criticized for its limited support for replica failover, has the best latency at low connection counts but is single-threaded. PgCat supports sharding and load balancing, is multithreaded, and performs best at high connection counts. Supavisor, designed for cloud-native environments, handles modern connection demands but shows higher latency. Overall, PgCat delivers higher throughput and is more scalable.
23
1
24
Article
ITNEXT·2y
Database Migrations with Go and Kubernetes
Deploying applications with a database layer often requires database migrations. Goose and migrate are tools that facilitate this process. Containerizing migrations and tagging Docker images with semantic versioning integrates well into deployment pipelines. In Kubernetes, running migrations using initContainers is preferred, offering a snapshot of the latest database state and reducing downtime. When multiple pods would otherwise run the same migrations concurrently, leader election among initContainers can coordinate them.
22
25
Article
Lobsters·2y
Gotchas with SQLite in Production
SQLite is gaining attention as an excellent database for production web applications, especially for those seeking simplicity. Despite its suitability for many applications, it has several 'gotchas' that can pose challenges, including configuration, lack of network connections, issues with network and ephemeral file systems, concurrency limitations, transactional overhead, backup complexities, and migration limitations. While SQLite offers lower operational complexity, applications requiring multiple machines or heavy write workloads may find MySQL or Postgres more suitable.
19
