Everything I know about good system design

https://www.seangoedecke.com/good-system-design

I see a lot of bad system design advice. One classic is the LinkedIn-optimized “bet you never heard of queues” style of post, presumably aimed at people who are new to the industry. Another is the Twitter-optimized “you’re a terrible engineer if you ever store booleans in a database” clever trick[1]. Even good system design advice can be kind of bad. I love Designing Data-Intensive Applications, but I don’t think it’s particularly useful for most system design problems engineers will run into.

What is system design? In my view, if software design is how you assemble lines of code, system design is how you assemble services. The primitives of software design are variables, functions, classes, and so on. The primitives of system design are app servers, databases, caches, queues, event buses, proxies, and so on.

This post is my attempt to write down, in broad strokes, everything I know about good system design. A lot of the concrete judgment calls do come down to experience, which I can’t convey in this post. But I’m trying to write down what I can.

Recognizing good design

What does good system design look like? I’ve written before that it looks underwhelming. In practice, it looks like nothing going wrong for a long time. You can tell that you’re in the presence of good design if you have thoughts like “huh, this ended up being easier than I expected”, or “I never have to think about this part of the system, it’s fine”. Paradoxically, good design is self-effacing: bad design is often more impressive than good. I’m always suspicious of impressive-looking systems. If a system has distributed-consensus mechanisms, many different forms of event-driven communication, CQRS, and other clever tricks, I wonder if there’s some fundamental bad decision that’s being compensated for (or if the system is just straightforwardly over-designed).

I’m often alone on this. Engineers look at complex systems with many interesting parts and think “wow, a lot of system design is happening here!” In fact, a complex system usually reflects an absence of good design. I say “usually” because sometimes you do need complex systems. I’ve worked on many systems that earned their complexity. However, a complex system that works always evolves from a simple system that works. Beginning from scratch with a complex system is a really bad idea.

State and statelessness

The hard part about system design is state. If you’re storing any kind of information for any amount of time, you have a lot of tricky decisions to make about how you save, store and serve it. If you’re not storing information[2], your app is “stateless”. As a non-trivial example, GitHub has an internal API that takes a PDF file and returns an HTML rendering of it. That’s a real stateless service. Anything that writes to a database is stateful.

You should try to minimize the number of stateful components in any system. (In a sense this is trivially true, because you should try to minimize the number of components in a system overall, but stateful components are particularly dangerous.) The reason you should do this is that stateful components can get into a bad state. Our stateless PDF-rendering service will safely run forever, as long as you’re doing broadly sensible things: e.g. running it in a restartable container so that if anything goes wrong it can be automatically killed and restored to working order. A stateful service can’t be automatically repaired like this. If your database gets a bad entry in it (for instance, an entry with a format that triggers a crash in your application), you have to manually go in and fix it up. If your database runs out of room, you have to figure out some way to prune unneeded data or expand it.

What this means in practice is having one service that knows about the state – i.e. it talks to a database – and other services that do stateless things. Avoid having five different services all write to the same table. Instead, have four of them send API requests (or emit events) to the first service, and keep the writing logic in that one service. If you can, it’s worth doing this for the read logic as well, although I’m less absolutist about this. It’s sometimes better for services to do a quick read of the user_sessions table than to make a 2x slower HTTP request to an internal sessions service.

Databases

Since managing state is the most important part of system design, the most important component is usually where that state lives: the database. I’ve spent most of my time working with SQL databases (MySQL and PostgreSQL), so that’s what I’m going to talk about.

Schemas and indexes

If you need to store something in a database, the first thing to do is define a table with the schema you need. Schema design should be flexible, because once you have thousands or millions of records, it can be an enormous pain to change the schema. However, if you make it too flexible (e.g. by sticking everything in a “value” JSON column, or using “keys” and “values” tables to track arbitrary data) you load a ton of complexity into the application code (and likely buy some very awkward performance constraints). Drawing the line here is a judgment call and depends on specifics, but in general I aim to have my tables be human-readable: you should be able to go through the database schema and get a rough idea of what the application is storing and why.

If you expect your table to ever be more than a few rows, you should put indexes on it. Try to make your indexes match the most common queries you’re sending (e.g. if you query by email and type, create an index with those two fields). Indexes work like nested dictionaries, so make sure to put the highest-cardinality fields first (otherwise each index lookup will have to scan all users of type to find the one with the right email). Don’t index on every single thing you can think of, since each index adds write overhead.
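
As a concrete sketch of that email-and-type example, here’s what the index looks like in SQLite (the table and index names are made up; the same idea applies to MySQL or PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, type TEXT)")
# email goes first: it is the highest-cardinality field, so each lookup
# narrows straight to one user instead of scanning every user of a given type
conn.execute("CREATE INDEX idx_users_email_type ON users (email, type)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ? AND type = ?",
    ("a@example.com", "admin"),
).fetchone()
print(plan[-1])  # e.g. "SEARCH users USING COVERING INDEX idx_users_email_type ..."
```

The query planner output confirms the lookup hits the index rather than scanning the table.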

Bottlenecks

Accessing the database is often the bottleneck in high-traffic applications. This is true even when the compute side of things is relatively inefficient (e.g. Ruby on Rails running on a preforking server like Unicorn). That’s because complex applications need to make a lot of database calls – hundreds and hundreds for every single request, often sequentially (because you don’t know if you need to check whether a user is part of an organization until after you’ve confirmed they’re not abusive, and so on). How can you avoid getting bottlenecked?

When querying the database, query the database. It’s almost always more efficient to get the database to do the work than to do it yourself. For instance, if you need data from multiple tables, JOIN them instead of making separate queries and stitching them together in-memory. Particularly if you’re using an ORM, beware accidentally making queries in an inner loop. That’s an easy way to turn a select id, name from table to a select id from table and a hundred select name from table where id = ?.
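
Here’s a minimal illustration of that N+1 trap and the JOIN that avoids it, using SQLite and invented users/orders tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# N+1 anti-pattern: one query per user (easy to do by accident with an ORM)
slow = []
for (user_id, name) in conn.execute("SELECT id, name FROM users"):
    for (total,) in conn.execute("SELECT total FROM orders WHERE user_id = ?", (user_id,)):
        slow.append((name, total))

# Better: let the database do the work with a single JOIN
fast = conn.execute("""
    SELECT u.name, o.total FROM users u
    JOIN orders o ON o.user_id = u.id
    ORDER BY o.id
""").fetchall()

assert slow == fast  # same result, one round-trip instead of N+1
```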

Send as many read queries as you can to database replicas. A typical database setup will have one write node and a bunch of read-replicas. The more you can avoid reading from the write node, the better – that write node is already busy enough doing all the writes. The exception is when you really, really can’t tolerate any replication lag (since read-replicas are always running at least a handful of ms behind the write node). But in most cases replication lag can be worked around with simple tricks: for instance, when you update a record but need to use it right after, you can fill in the updated details in-memory instead of immediately re-reading after a write.

Every so often you do want to break queries apart. It doesn’t happen often, but I’ve run into queries that were ugly enough that it was easier on the database to split them up than to try to run them as a single query. I’m sure it’s always possible to construct indexes and hints such that the database can do it better, but the occasional tactical query-split is a tool worth having in your toolbox.

Beware spikes of queries (particularly write queries, and particularly transactions). Once a database gets overloaded, it gets slow, which makes it more overloaded. Transactions and writes are good at overloading databases, because they require a lot of database work for each query. If you’re designing a service that might generate massive query spikes (e.g. some kind of bulk-import API), consider throttling your queries.
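
One way to throttle a bulk import is to space the writes out at a fixed rate rather than firing them all at once. This is an illustrative sketch, not a real client library; the injected clock just lets the example run instantly:

```python
import time

def throttled_writes(items, writes_per_second, write_fn,
                     now=time.monotonic, sleep=time.sleep):
    # Spread a burst of writes evenly instead of firing them all at once,
    # so a bulk import can't spike the write node.
    interval = 1.0 / writes_per_second
    next_slot = now()
    for item in items:
        wait = next_slot - now()
        if wait > 0:
            sleep(wait)  # back off until our slot comes up
        write_fn(item)
        next_slot += interval

# Demo with a fake clock so the example finishes immediately
clock = {"t": 0.0}
slept = []
throttled_writes(
    range(5), writes_per_second=2,
    write_fn=lambda row: None,
    now=lambda: clock["t"],
    sleep=lambda s: (slept.append(s), clock.__setitem__("t", clock["t"] + s)),
)
print(sum(slept))  # 5 writes at 2/s -> 2.0 seconds of total backoff
```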

Slow operations, fast operations

A service has to do some things fast. If a user is interacting with something (say, an API or a web page), they should see a response within a few hundred ms[3]. But a service has to do other things that are slow. Some operations just take a long time (converting a very large PDF to HTML, for instance). The general pattern for this is splitting out the minimum amount of work needed to do something useful for the user and doing the rest of the work in the background. In the PDF-to-HTML example, you might render the first page to HTML immediately and queue up the rest in a background job.

What’s a background job? It’s worth answering this in detail, because “background jobs” are a core system design primitive. Every tech company will have some kind of system for running background jobs. There will be two main components: a collection of queues, e.g. in Redis, and a job runner service that will pick up items from the queues and execute them. You enqueue a background job by putting an item like {job_name, params} on the queue. It’s also possible to schedule background jobs to run at a set time (which is useful for periodic cleanups or summary rollups). Background jobs should be your first choice for slow operations, because they’re typically such a well-trodden path.
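
A toy version of that job-runner shape, with a plain Python list standing in for the Redis queue and a made-up handler name (a real system would use something like LPUSH/BRPOP and a registry of handler functions):

```python
import json

# A list stands in for a Redis queue here (LPUSH/BRPOP in a real setup)
queue = []

def enqueue(job_name, **params):
    # Jobs go on the queue as {job_name, params}, serialized
    queue.append(json.dumps({"job_name": job_name, "params": params}))

# Hypothetical handler; a real job runner maps names to registered functions
HANDLERS = {"render_pdf_pages": lambda params: f"rendered {params['pages']} pages"}

def run_one():
    # The job runner pops an item and dispatches on job_name
    job = json.loads(queue.pop(0))
    return HANDLERS[job["job_name"]](job["params"])

enqueue("render_pdf_pages", pages=41)
print(run_one())  # -> "rendered 41 pages"
```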

Sometimes you want to roll your own queue system. For instance, if you want to enqueue a job to run in a month, you probably shouldn’t put an item on the Redis queue. Redis persistence is typically not guaranteed over that period of time (and even if it is, you likely want to be able to query for those far-future enqueued jobs in a way that would be tricky with the Redis job queue). In this case, I typically create a database table for the pending operation with columns for each param plus a scheduled_at column. I then use a daily job to check for these items with scheduled_at <= today, and either delete them or mark them as complete once the job has finished.
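
Sketching that database-backed long-term queue, assuming a hypothetical pending_emails table (the column and template names are invented):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
# One column per param, plus scheduled_at, as described above
conn.execute("""
    CREATE TABLE pending_emails (
        id INTEGER PRIMARY KEY,
        user_id INTEGER,
        template TEXT,
        scheduled_at TEXT,
        completed_at TEXT
    )
""")
today = date(2025, 6, 21)
conn.execute("INSERT INTO pending_emails (user_id, template, scheduled_at) VALUES (?, ?, ?)",
             (1, "one_month_checkin", (today - timedelta(days=1)).isoformat()))
conn.execute("INSERT INTO pending_emails (user_id, template, scheduled_at) VALUES (?, ?, ?)",
             (2, "one_month_checkin", (today + timedelta(days=30)).isoformat()))

def run_daily_job(conn, today):
    # Pick up everything that has come due, run it, then mark it complete
    due = conn.execute(
        "SELECT id, user_id, template FROM pending_emails "
        "WHERE scheduled_at <= ? AND completed_at IS NULL", (today.isoformat(),)
    ).fetchall()
    for (job_id, user_id, template) in due:
        # ... send the email here ...
        conn.execute("UPDATE pending_emails SET completed_at = ? WHERE id = ?",
                     (today.isoformat(), job_id))
    return [user_id for (_, user_id, _) in due]

print(run_daily_job(conn, today))  # -> [1]; the far-future job stays queued
```

Unlike a Redis queue, these far-future jobs survive restarts and can be queried like any other row.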

Caching

Sometimes an operation is slow because it needs to do an expensive (i.e. slow) task that’s the same between users. For instance, if you’re calculating how much to charge a user in a billing service, you might need to do an API call to look up the current prices. If you’re charging users per-use (like OpenAI does per-token), that could (a) be unacceptably slow and (b) cause a lot of traffic for whatever service is serving the prices. The classic solution here is caching: only looking up the prices every five minutes, and storing the value in the meantime. It’s easiest to cache in-memory, but using some fast external key-value store like Redis or Memcached is also popular (since it means you can share one cache across a bunch of app servers).
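
The five-minute price lookup might look something like this minimal in-memory TTL cache (the field names and per-token price are illustrative):

```python
import time

class TTLCache:
    # Only call the slow fetch function once per TTL window; serve the
    # stored value in between.
    def __init__(self, ttl_seconds, fetch, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.fetch = fetch          # the slow lookup we're protecting
        self.clock = clock
        self.value = None
        self.fetched_at = None

    def get(self):
        now = self.clock()
        if self.fetched_at is None or now - self.fetched_at > self.ttl:
            self.value = self.fetch()   # refresh at most once per window
            self.fetched_at = now
        return self.value

calls = []
prices = TTLCache(300, fetch=lambda: calls.append(1) or {"per_token": 0.002})
prices.get(); prices.get(); prices.get()
print(len(calls))  # the slow price lookup ran only once
```

Swapping the dict for Redis or Memcached gives you the shared-across-servers version of the same idea.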

The typical pattern is that junior engineers learn about caching and want to cache everything, while senior engineers want to cache as little as possible. Why is that? It comes down to the first point I made about the danger of statefulness. A cache is a source of state. It can get weird data in it, or get out-of-sync with the actual truth, or cause mysterious bugs by serving stale data, and so on. You should never cache something without first making a serious effort to speed it up. For instance, it’s silly to cache an expensive SQL query that isn’t covered by a database index. You should just add the database index!

I use caching a lot. One useful caching trick to have in the toolbox is using a scheduled job and a document storage like S3 or Azure Blob Storage as a large-scale persistent cache. If you need to cache the result of a really expensive operation (say, a weekly usage report for a large customer), you might not be able to fit the result in Redis or Memcached. Instead, stick a timestamped blob of the results in your document storage and serve the file directly from there. Like the database-backed long-term queue I mentioned above, this is an example of using the caching idea without using a specific cache technology.

Events

As well as some kind of caching infrastructure and background job system, tech companies will typically have an event hub. The most common implementation of this is Kafka. An event hub is just a queue – like the one for background jobs – but instead of putting “run this job with these params” on the queue, you put “this thing happened” on the queue. One classic example is firing off a “new account created” event for each new account, and then having multiple services consume that event and take some action: a “send a welcome email” service, a “scan for abuse” service, a “set up per-account infrastructure” service, and so on.
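
The fan-out shape described above can be sketched as a tiny in-process pub/sub; the handlers are stand-ins for the real welcome-email, abuse-scan, and infrastructure services:

```python
from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(event_name, handler):
    subscribers[event_name].append(handler)

def publish(event_name, payload):
    # The publisher doesn't know or care what consumers do with the event
    return [handler(payload) for handler in subscribers[event_name]]

subscribe("account.created", lambda e: f"welcome email to {e['email']}")
subscribe("account.created", lambda e: f"abuse scan for {e['email']}")
subscribe("account.created", lambda e: f"provision infra for {e['email']}")

print(publish("account.created", {"email": "new@user.com"}))
# each consumer acted on the same event independently
```

With Kafka the mechanics differ (partitions, consumer groups, offsets), but the producer-doesn’t-care property is the same.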

You shouldn’t overuse events. Much of the time it’s better to just have one service make an API request to another service: all the logs are in the same place, it’s easier to reason about, and you can immediately see what the other service responded with. Events are good for when the code sending the event doesn’t necessarily care what the consumers do with the event, or when the events are high-volume and not particularly time-sensitive (e.g. abuse scanning on each new Twitter post).

Pushing and pulling

When you need data to flow from one place to a lot of other places, there are two options. The simplest is to pull. This is how most websites work: you have a server that owns some data, and when a user wants it they make a request (via their browser) to the server to pull that data down to them. The problem here is that users might do a lot of pulling down the same data – e.g. refreshing their email inbox to see if they have any new emails, which will pull down and reload the entire web application instead of just the data about the emails.

The alternative is to push. Instead of allowing users to ask for the data, you allow them to register as clients, and then when the data changes, the server pushes the data down to each client. This is how GMail works: you don’t have to refresh the page to get new emails, because they’ll just appear when they arrive.

If we’re talking about background services instead of users with web browsers, it’s easy to see why pushing can be a good idea. Even in a very large system, you might only have a hundred or so services that need the same data. For data that doesn’t change much, it’s much easier to make a hundred HTTP requests (or RPC, or whatever) whenever the data changes than to serve up the same data a thousand times a second.

Suppose you did need to serve up-to-date data to a million clients (like GMail does). Should those clients be pushing or pulling? It depends. Either way, you won’t be able to run it all from a single server, so you’ll need to farm it out to other components of the system. If you’re pushing, that will likely mean sticking each push on an event queue and having a horde of event processors each pulling from the queue and sending out your pushes. If you’re pulling, that will mean standing up a bunch (say, a hundred) of fast[4] read-replica cache servers that will sit in front of your main application and handle all the read traffic[5].

Hot paths

When you’re designing a system, there are lots of different ways users can interact with it or data can flow through it. It can get a bit overwhelming. The trick is to mainly focus on the “hot paths”: the part of the system that is most critically important, and the part of the system that is going to handle the most data. For instance, in a metered billing system, those pieces might be the part that decides whether or not a customer gets charged, and the part that needs to hook into all user actions on the platform to identify how much to charge.

Hot paths are important because they have fewer possible solutions than other design areas. There are a thousand ways you can build a billing settings page and they’ll all mainly work. But there might be only a handful of ways that you can sensibly consume the firehose of user actions. Hot paths also go wrong more spectacularly. You have to really screw up a settings page to take down the entire product, but any code you write that’s triggered on all user actions can easily cause huge problems.

Logging and metrics

How do you know if you’ve got problems? One thing I’ve learned from my most paranoid colleagues is to log aggressively during unhappy paths. If you’re writing a function that checks a bunch of conditions to see if a user-facing endpoint should respond 422, you should log out the condition that was hit. If you’re writing billing code, you should log every decision made (e.g. “we’re not billing for this event because of X”). Many engineers don’t do this because it adds a bunch of logging boilerplate and makes it hard to write beautifully elegant code, but you should do it anyway. You’ll be happy you did when an important customer is complaining that they’re getting a 422 – even if that customer did something wrong, you still need to figure out what they did wrong for them.
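
A small sketch of what “log every decision on the unhappy path” looks like in practice; the endpoint, conditions, and field names are hypothetical:

```python
import logging

log = logging.getLogger("billing")

def validate_charge(request):
    # Every rejection names the exact condition that was hit, so a 422
    # complaint from a customer is explainable weeks later from the logs.
    if request.get("amount", 0) <= 0:
        log.warning("rejecting charge: non-positive amount=%r user=%s",
                    request.get("amount"), request.get("user_id"))
        return 422
    if request.get("currency") not in {"USD", "EUR"}:
        log.warning("rejecting charge: unsupported currency=%r user=%s",
                    request.get("currency"), request.get("user_id"))
        return 422
    log.info("accepting charge: amount=%r user=%s",
             request["amount"], request["user_id"])
    return 200
```

It’s boilerplate-heavy, but each log line answers the question “why did this specific request get rejected?”.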

You should also have basic observability into the operational parts of the system. That means CPU/memory on the hosts or containers, queue sizes, average time per-request or per-job, and so on. For user-facing metrics like time per-request, you also need to watch the p95 and p99 (i.e. how slow your slowest requests are). Even one or two very slow requests are scary, because they’re disproportionately from your largest and most important users. If you’re just looking at averages, it’s easy to miss the fact that some users are finding your service unusable.

Killswitches, retries, and failing gracefully

I wrote a whole post about killswitches that I won’t repeat here, but the gist is that you should think carefully about what happens when the system fails badly.

Retries are not a magic bullet. You need to make sure you’re not putting extra load on other services by blindly retrying failed requests. If you can, put high-volume API calls inside a “circuit breaker”: if you get too many 5xx responses in a row, stop sending requests for a while to let the service recover. You also need to make sure you’re not retrying write events that may or may not have succeeded (for instance, if you send a “bill this user” request and get back a 5xx, you don’t know if the user has been billed or not). The classic solution to this is to use an “idempotency key”, which is a special UUID in the request that the other service uses to avoid re-running old requests: every time they do something, they save the idempotency key, and if they get another request with the same key, they silently ignore it.
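
Here’s a rough sketch of both ideas: a consecutive-failure circuit breaker and the server side of an idempotency key. A production breaker would also re-close after a cool-off period; this one only shows the threshold logic, and the receipt structure is invented:

```python
class CircuitBreaker:
    # Stop calling a downstream service after too many consecutive failures,
    # so retries don't pile extra load onto a service that's already struggling.
    def __init__(self, call, max_failures=5):
        self.call = call
        self.max_failures = max_failures
        self.failures = 0

    def request(self, *args):
        if self.failures >= self.max_failures:
            raise RuntimeError("circuit open: not calling downstream")
        try:
            result = self.call(*args)
        except Exception:
            self.failures += 1  # a 5xx or timeout counts against the breaker
            raise
        self.failures = 0       # any success closes the breaker again
        return result


# Server side of an idempotency key: save each key's response, and silently
# return the saved response if the same key is ever sent again.
billed = {}

def bill_user(idempotency_key, user_id, amount):
    if idempotency_key in billed:
        return billed[idempotency_key]  # retried request: don't bill twice
    receipt = {"user_id": user_id, "amount": amount}  # ...charge happens here...
    billed[idempotency_key] = receipt
    return receipt
```

With this pair, a client that got a 5xx back from bill_user can safely retry with the same key: either the charge never happened and runs now, or it happened and the saved receipt is returned.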

It’s also important to decide what happens when part of your system fails. For instance, say you have some rate limiting code that checks a Redis bucket to see if a user has made too many requests in the current window. What happens when that Redis bucket is unavailable? You have two options: fail open and let the request through, or fail closed and block the request with a 429.

Whether you should fail open or closed depends on the specific feature. In my view, a rate limiting system should almost always fail open. That means that a problem with the rate limiting code isn’t necessarily a big user-facing incident. However, auth should (obviously) always fail closed: it’s better to deny a user access to their own data than to give a user access to some other user’s data. There are a lot of cases where it’s not clear what the right behavior is. It’s often a difficult tradeoff.
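
A fail-open rate limiter might look like this sketch, where the counter store (imagine a per-window Redis key) can be unavailable:

```python
class BucketUnavailable(Exception):
    pass

def allow_request(user_id, incr_bucket, limit=100):
    # incr_bucket stands in for "increment this user's counter for the
    # current window" against something like Redis.
    try:
        count = incr_bucket(user_id)
    except BucketUnavailable:
        return True  # fail open: a rate-limiter outage shouldn't block users
    return count <= limit

def healthy(user_id):
    return 3  # well under the limit

def broken(user_id):
    raise BucketUnavailable("counter store is down")

print(allow_request("u1", healthy))  # True: under the limit
print(allow_request("u1", broken))   # True: bucket down, so we fail open
```

An auth check would invert the except branch: on any doubt, deny.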

Final thoughts

There are some topics I’m deliberately not covering here. For instance, whether or when to split your monolith out into different services, when to use containers or VMs, tracing, good API design. Partly this is because I don’t think it matters that much (in my experience, monoliths are fine), or because I think it’s too obvious to talk about (you should use tracing), or because I just don’t have the time (API design is complicated).

The main point I’m trying to make is what I said at the start of this post: good system design is not about clever tricks, it’s about knowing how to use boring, well-tested components in the right place. I’m not a plumber, but I imagine good plumbing is similar: if you’re doing something too exciting, you’re probably going to end up with crap all over yourself.

Especially at large tech companies, where these components already exist off the shelf (i.e. your company already has some kind of event bus, caching service, etc), good system design is going to look like nothing. There are very, very few areas where you want to do the kind of system design you could talk about at a conference. They do exist! I have seen hand-rolled data structures make features possible that wouldn’t have been possible otherwise. But I’ve only seen that happen once or twice in ten years. I see boring system design every single day.

edit: this post was discussed on Hacker News with lots of good comments. I was amused by the comments that said “why even mention ‘don’t read your writes’, who would do that” right next to the comments that said “hmm, it seems way too fiddly to not read your writes”.


  1. You’re supposed to store timestamps instead, and treat the presence of a timestamp as true. I do this sometimes but not always – in my view there’s some value in keeping a database schema immediately-readable.
  2. Technically any service stores information of some kind for some duration, at least in-memory. Typically what’s meant here is storing information outside of the request-response lifecycle (e.g. persistently on-disk somewhere, such as in a database). If you can stand up a new version of the app by simply spinning up the application server, that’s a stateless app.
  3. Gamedevs on Twitter will say that anything slower than 10ms is unacceptable. Whether or not that ought to be the case, it’s just factually not true of successful tech products – users will accept slower responses if the app is doing something that’s useful to them.
  4. They’re fast because they don’t have to talk to a database in the way the main server does. In theory, this could just be a static file on-disk that they serve up when asked, or even data held in-memory.
  5. Incidentally, those cache servers will either poll your main server (i.e. pulling) or your main server will send the new data to them (i.e. pushing). I don’t think it matters too much which you do. Pushing will give you more up-to-date data but pulling is simpler.


June 21, 2025 │ Tags: good engineers, software design
