Have Cloudflare solved the n+1 query problem?

n+1 queries are a common problem in web development - a footgun often encountered in your junior years, when you're doing anything to avoid writing raw SQL. Should we blame ORMs? Probably not, but do we anyway? Absolutely.

Let’s take a look at a simple example, using the database that serves this blog:

SELECT id, content, user_id FROM comments WHERE post_id = 1

Unsurprisingly, this gets the id, content, and user_id for all comments on my first post. Which is great, except the user_id is 01925ccc-78f4-7bd8-917d-31d94825ea7a.

Perhaps I should fetch the user's name before I display it on my website?

SELECT display_name FROM users WHERE id = '01925ccc-78f4-7bd8-917d-31d94825ea7a'

Now we have the name of a single user, and their comment. Perfect for a blog like this where there’s only one reader!

Consider instead that I was ThePrimeagen (Netflix btw) and I had a thousand viewers all writing comments. The issue we'll soon discover is that we're using one query to get the comments, and N queries to get the user details for each comment. This is the n+1 query problem.

const comments = await sql`SELECT id, content, user_id FROM comments WHERE post_id = 1`;

for (const comment of comments) {
  // One extra round trip per comment - this is the "N" in n+1
  const user = await sql`SELECT display_name FROM users WHERE id = ${comment.user_id}`;
  console.log(user[0].display_name, comment.content);
}

But why is this a problem?

Imagine you bring in McKinsey. An intern you’re paying $1300 an hour informs you that the reason you have no users is that your blog is too slow. They’ve seen a tweet about Serverless Servers - It’s time to deploy to the edge.

You go ahead and build a fancy distributed architecture. Your blog now runs on a v8 isolate close to your user, backed by a fully managed database.

The latency to the user is 10ms, and to the database 20ms. This is a pretty ideal scenario. Unless your users play CS2 for a living, no one is going to notice a 60ms round trip (10ms each way to the user, plus 20ms each way to the database), right? Well, if your post has 1000 comments, you're looking at (1 + 1000) * (20ms * 2) = 40040ms just to load the comments for a single post.
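The arithmetic above can be sketched in a few lines (the latencies are the hypothetical figures from this scenario, not measurements):

```javascript
// Back-of-envelope: every query pays a full round trip to the database.
// Latency figures are the made-up numbers from the scenario above.
const dbOneWayMs = 20;
const roundTripMs = dbOneWayMs * 2;  // 40ms per query
const queries = 1 + 1000;            // 1 for the comments + N for the users
const totalMs = queries * roundTripMs;
console.log(totalMs);                // 40040
```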

40 seconds??? lmao.

So how do we solve this?

There are many ways, actually. Generally you would think about your query patterns ahead of time, and join the users and comments tables in a single query or view. This fetches all the data up front, and saves on the true performance bottleneck - the network. But that's hardly worth writing about.
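As a sketch, the joined version of the two queries from earlier might look like this (same table and column names as above):

```sql
SELECT c.id, c.content, u.display_name
FROM comments c
JOIN users u ON u.id = c.user_id
WHERE c.post_id = 1;
```

One query, one round trip, regardless of how many comments the post has.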

Instead, Cloudflare have done something interesting: Shipped SQLite in Durable Objects.

SQLite is not client/server. It runs in the same process address space as your application. Which means that a thousand queries is no longer an issue. In fact, it's hardly less performant than if you pre-loaded everything with a join.

Let's see it in action.

Let’s compare the performance of a traditional database setup (D1) with Cloudflare’s Durable Objects using SQLite.

We have two examples using workers. In this first example, we’re spinning up one with Smart Placement to get it as close to the DB as possible. This worker connects to a D1 instance in Australia.

Note: This test is slightly unfair as there's an initial cold-start time for D1. Run it twice if you want apples-to-apples.

Now we’ll try using a Durable Object, which gets spawned close to your location - except it has access to SQLite directly.

Note: With the above examples, the performance API doesn't increment timers the same way in production as it does running on localhost. [1]

The keen-eyed will note that not only was the latency eliminated, we also increased the number of requests significantly. If you'd told me six months ago that you were intentionally spinning up a thousand SQL queries from the server, I would have slapped you.

// Just to prove it's still n+1, here's what we run:
// Get the comments
const commentsDuration = await this.measureTime(async () => {
  const comments = await this.ctx.storage.sql.exec(
    "SELECT id, content, user_id FROM comments WHERE post_id = ?",
    postId
  ).toArray();
  this.comments = comments;
});
traces.push({ description: "Fetch all comments", duration: parseFloat(commentsDuration.toFixed(6)) });

// For each comment, get the user
for (const comment of this.comments) {
  const userDuration = await this.measureTime(async () => {
    const user = await this.ctx.storage.sql.exec(
      "SELECT display_name FROM users WHERE id = ?",
      comment.user_id
    ).one();
    comment.user_name = user.display_name;
  });
  traces.push({ description: `Fetched user ${comment.user_name} for comment ${comment.id}`, duration: parseFloat(userDuration.toFixed(6)) });
}

// And then we print out some stats
const summary = {
  totalQueries: traces.length,
  totalTime: parseFloat(traces.reduce((sum, trace) => sum + trace.duration, 0).toFixed(6)),
  averageQueryTime: parseFloat((traces.reduce((sum, trace) => sum + trace.duration, 0) / traces.length).toFixed(6))
};

Cloudflare haven't invented anything new here. Yet packaging up Durable Objects with a semi-managed database AND letting me stay relational without joins??? It might finally be enough to convert me to the edge… lord.

Appendix

Just to be fair to D1, here's what the same test looks like when done in a sane way with a single-query join. This is still fast. But it's not instant.

Footnotes

[1] To mitigate potential Spectre attacks, the performance API doesn't increment timers the same way when deployed to hosted workerd. Instead, it only advances time after the last I/O occurs. When benchmarking locally, we can see more accurate timings, which are typically single-digit microseconds for the SQLite example.