z, ? | toggle help (this) |
space, → | next slide |
shift-space, ← | previous slide |
d | toggle debug mode |
## <ret> | go to slide # |
r | reload slides |
n | toggle notes |
Hello everyone. It is great to be here in Kiev. This talk is entitled “Deep Dive into Eager Loading Limited Associations”.
RubyC
2014
My name is Jeremy Evans, and I have been the lead developer of the ruby Sequel library for the last 6 years.|For those of you who have not heard of the Sequel library before, it is an SQL database library for ruby and it contains an object relational mapper which is similar in basic usage to ActiveRecord.
We start this talk on a boat, getting briefed on the conditions underwater. Unfortunately, the water is murky, with only a few meters of visibility.
While I’m sure most of you are familiar with associations, let me briefly talk about associations in general. Associations are really a shortcut for querying for related objects in other tables.
Let us consider a database that stores blog posts in one table, and replies to those blog posts in a separate table. Here is a basic Sequel model for the posts table.
class Post > Sequel::Model end
If you want to add a method that returns all replies for a post, a simple solution is something like this. This adds an instance method named replies that returns all replies where the post_id for the reply matches the id of the current post.|This is simple and straight forward. However, since methods like this are very common, it makes sense to abstract them and offer a higher level interface.
class Post > Sequel::Modeldef replies Reply.where(post_id: id).all endend
That higher level interface is associations. Instead of manually writing an instance method to get the replies for the post, we call a class method that writes the instance method for us.
class Post > Sequel::Model
one_to_many :replies
end
Calling the replies method on a post instance then runs this SQL query and returns all related replies. Now, this works great if you want to get the replies for a single post object.
post = Post[1] post.repliesSELECT * FROM replies WHERE (post_id = 1)
However, what if you have multiple post objects, and you want to get the replies for all of them?
posts = Post.where(id: [1,2,3,4,5]).all
You could just loop over the posts, and call the replies method separately for each post.
posts = Post.where(id: [1,2,3,4,5]).all
posts.each{|p| p.replies}
However, this will run a separate query for every post, which can be inefficient, especially if there is a significant latency between the database and the application.
posts = Post.where(id: [1,2,3,4,5]).all posts.each{|p| p.replies}SELECT * FROM replies WHERE (post_id = 1) SELECT * FROM replies WHERE (post_id = 2) SELECT * FROM replies WHERE (post_id = 3) SELECT * FROM replies WHERE (post_id = 4) SELECT * FROM replies WHERE (post_id = 5)
What would be optimal would be to run a single query similar to the one shown, which would return all replies for those 5 posts, and then automatically associate each reply with the correct post.
posts = Post.where(id: [1,2,3,4,5]).all posts.each{|p| p.replies}SELECT * FROM replies WHERE (post_id IN (1,2,3,4,5))
This is the reason that eager loading was introduced. In Sequel, the default eager loading method is called eager. By calling eager, you let Sequel know that you want to eagerly load the replies for all of the returned posts.|This will automatically run the query to get all replies for those 5 posts, and it will associate each reply to the appropriate post.
posts = Post.where(id: [1,2,3,4,5]).
eager(:replies).all
SELECT *
FROM replies
WHERE (post_id IN (1,2,3,4,5))
The reason for our underwater excursion today to find out in detail how to correctly handle the association if it is limited to a given number of records.
Here is an association that returns the first 10 replies for a post, using the association’s limit option.
class Post > Sequel::Model
one_to_many :first_10_replies,
limit: 10, order: :id,
class: :Reply
end
Generally, when adding a limit, you will also want to add an order, so that you get consistent results. Assuming that the replies table has a typical autoincremented primary key, ordering by id should make sure that the replies are returned in the order they were received.
class Post > Sequel::Model
one_to_many :first_10_replies,
limit: 10, order: :id,
class: :Reply
end
Taking the limit and order together, this association should return the first 10 replies received for a given post.
class Post > Sequel::Model
one_to_many :first_10_replies,
limit: 10, order: :id,
class: :Reply
end
Note that for regular association loading, where you have a single post and want the first 10 replies for that post, the query to use is simple. You start with the initial query to use for regular loading
post = Post[1] post.first_10_repliesSELECT * FROM replies WHERE (post_id = 1)
You then order by the reply’s id so that earlier replies come before later replies.
post = Post[1]
post.first_10_replies
SELECT *
FROM replies
WHERE (post_id = 1)
ORDER BY id
Finally you limit the number of returned records to 10 using the LIMIT clause.|While this strategy works fine for loading the replies for a single post, it cannot be used when eagerly loading.
post = Post[1]
post.first_10_replies
SELECT *
FROM replies
WHERE (post_id = 1)
ORDER BY id
LIMIT 10
Let us look back at the query used for eagerly loading. Now let us try to use the same strategy of using the LIMIT clause.
Post. where(id: [1,2,3,4,5]). eager(:first_10_replies)SELECT * FROM replies WHERE (post_id IN (1, 2, 3, 4, 5))
You would end up with the SQL used here.|The reason this does not work is that the LIMIT 10 clause here applies to the entire result set, giving you the first 10 replies to any of the 5 posts. This may return 5 replies each for posts 1 and 2, and no replies for posts 3, 4, or 5.|What you want is first 10 replies to each of the 5 posts, which a LIMIT clause cannot handle. So if you cannot use a LIMIT clause to do eager loading, how can you get the correct results?
Post. where(id: [1,2,3,4,5]). eager(:first_10_replies) SELECT * FROM replies WHERE (post_id IN (1, 2, 3, 4, 5))ORDER BY id LIMIT 10
To find out, you cannot stay above the water, you need to dive in!|Just like you have rules when diving, I think there should be a rule for eager loading.
That rule is: “Eager loading an association for a group of records should give the same results as loading the association separately for each individual record.”|Eager loading should be considering an optimization, and if it changes what values are returned, I think it is broken.
Eager loading an association for a group of records should give the same results as loading the association separately for each individual record.
With Sequel, the rule should be followed for limited associations. If you are eagerly loading the first 10 replies for a set of posts, calling the first_10_replies method on any of the returned posts will return an array containing the first 10 replies, even if the post has 1000 replies.
Post.eager(:first_10_replies).
first.
first_10_replies.length
# => 10
This is not true if you are using ActiveRecord. ActiveRecord’s documented behavior for eagerly loading an association with a limit, is to ignore the limit.|If the first post returned has 1000 replies, calling the first_10_replies method here will return an array containing all 1000 replies.
Post.includes(:first_10_replies).
first.
first_10_replies.length
# => 1000
While it cannot really be considered a bug as the behavior is documented, it violates the rule, as eager loading changes the results of the first_10_replies method. It certainly would be nice if the behavior could be changed so that it did not violate the rule.
Post.includes(:first_10_replies). first. first_10_replies.length # => 1000
Since we are still near the surface, let us consider simple ways to handle eager loading limited associations correctly, so that the rule is not violated.
The simplest way to fix it is to just not eager load in that case. Basically, if asked to eager load for an association with a limit, just do not eager load. When the first_10_replies method is called for each post, issue a query to get the first 10 replies for that post.
If you are only eagerly loading for a small number of posts, or the latency between the database and the application is low, this actually performs quite well in most cases.|However, as the number of posts you are eagerly loading for increases, or the latency between the database and the application increases, performance starts to degrade.
Another simple way to handle eagerly loading the first 10 replies association is to eagerly load all associated replies, but take the resulting array of replies for each post, and use array slicing to extract the first 10 replies before returning the results to the user.
This was Sequel’s original strategy for eager loading limited associations. What Sequel does is fairly simple, just using a range to slice the array to return only the first 10 repiles.|At the database level, this does not perform any differently than the ActiveRecord approach of loading all objects, and in many cases it performs worse than skipping the eager load, but at least it ensures that the user gets correct results and the rule is not violated.
posts.each do |post| a = post.associationsa[:first_10_replies] = a[:first_10_replies][0...10]end
While skipping the eager load or slicing the resulting array are easy solutions, what would be better would be to send a single query to the database that returned the first 10 replies for each of the posts being eagerly loaded.
It turns out that there are multiple ways to do that in SQL. However, you need to leave the shallows and dive deeper to figure it out.
One strategy that uses a single SQL query to eagerly load limited associations is to use a correlated subquery. A correlated subquery is when a subquery references columns from the query that contains it, and for each row returned by the outer query, the subquery is evaluated separately.
Here is the SQL that implements a correlated subquery strategy for eager loading. It is basically the same query as used for eager loading without a limit, but with the addition of the highlighted subquery to remove replies that would not be in the first 10.
SELECT * FROM replies WHERE ((post_id IN (1, 2, 3, 4, 5))AND (id IN ( SELECT t1.id FROM replies AS t1 WHERE t1.post_id = replies.post_id ORDER BY id LIMIT 10)))ORDER BY id
This highlighted part shows the correlation. The subquery’s WHERE clause filters for rows where the replies in the subquery have the same post id as replies in the outer query. So the WHERE clause in the subquery makes it so the subquery consists of only rows in the replies table that have the same post_id as the current row in the main query.
SELECT * FROM replies WHERE ((post_id IN (1, 2, 3, 4, 5)) AND (id IN ( SELECT t1.id FROM replies AS t1 WHERE t1.post_id = replies.post_id ORDER BY id LIMIT 10))) ORDER BY id
The LIMIT clause inside the subquery then ensures that the subquery consists of only the first 10 rows in the replies table that have the same post_id as the current row in the main query.
SELECT *
FROM replies
WHERE ((post_id IN (1, 2, 3, 4, 5))
AND (id IN (
SELECT t1.id
FROM replies AS t1
WHERE t1.post_id = replies.post_id
ORDER BY id
LIMIT 10)))
ORDER BY id
Since the subquery now returns the ids of the first 10 replies for the current post, the outer query can use a IN clause, which will remove replies where the reply is not in the first 10 replies for its associated post.
SELECT *
FROM replies
WHERE ((post_id IN (1, 2, 3, 4, 5))
AND (id IN (
SELECT t1.id
FROM replies AS t1
WHERE t1.post_id = replies.post_id
ORDER BY id
LIMIT 10)))
ORDER BY id
Unfortunately, this approach can be slow for large datasets, unless the database is good at optimizing correlated subqueries, and it is not too difficult to determine why. Let us say that posts 1, 2, 3, and 4 have few replies, but post 5 has 10,000 replies.
SELECT * FROM replies WHERE ((post_id IN (1, 2, 3, 4, 5)) AND (id IN ( SELECT t1.id FROM replies AS t1 WHERE t1.post_id = replies.post_id ORDER BY id LIMIT 10))) ORDER BY id
Unless the database is good at optimizing correlated subqueries, what the database will probably do is run this subquery for every one of the 10,000 replies to post 5.
SELECT * FROM replies WHERE ((post_id IN (1, 2, 3, 4, 5)) AND (id IN (SELECT t1.id FROM replies AS t1 WHERE t1.post_id = replies.post_id ORDER BY id LIMIT 10)))ORDER BY id
Note that it is possible for the query to be optimized such that this subquery is only run 5 times. A good correlated subquery optimizer will realize that only expression in the subquery that references the outer query is the highlighted replies.post_id. Since this is the only expression in the correlated subquery that varies based on the current row in the outer query, you can cache the results of previous subquery evaluations by the post_id and only reevaluate the subquery if you see a post_id that is not in the cache.|This is one area where Microsoft SQL Server’s optimizer is significantly better than PostgreSQL’s optimizer. On Microsoft SQL Server, using a correlated subquery for eager loading can be faster than the default of slicing the array, but in PostgreSQL it can be orders of magnitude slower.
SELECT *
FROM replies
WHERE ((post_id IN (1, 2, 3, 4, 5))
AND (id IN (
SELECT t1.id
FROM replies AS t1
WHERE t1.post_id = replies.post_id
ORDER BY id
LIMIT 10)))
ORDER BY id
Unfortunately, while most SQL databases support this type of correlated subquery, MySQL will not even execute it, as it does not support using the LIMIT clause on a subquery that is used in an IN clause.
SELECT * FROM replies WHERE ((post_id IN (1, 2, 3, 4, 5)) AND (id IN ( SELECT t1.id FROM replies AS t1 WHERE t1.post_id = replies.post_id ORDER BY id LIMIT 10))) ORDER BY id
Sequel added support for using correlated subqueries to do eager loading in Sequel 3.28.
However, because on databases without good optimizers it performs poorly, and databases with good optimizers can use a more performant strategy, support for using correlated subqueries for eager loading was removed in Sequel 4.
Another strategy to eagerly load limited associations in a single query is to use a window function to compute the order of replies for each post.|Window functions are similar to aggregate functions, but instead of using GROUP BY to combine multiple input rows into a single output row, they compute their results based on a set of rows related to the current row, called a frame.|That is probably hard to understand, so let me provide a simple example.
Consider the following dataset of replies, which I have stripped down to just the two columns we are interested in, the id and the post id.|In this case, there are 3 replies to post 1, 2 replies to post 2, and 1 reply to post 3. We can use a window function to determine what the order of the replies should be for each post.
SELECT id, post_id FROM replies
id | post_id |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 2 |
5 | 2 |
6 | 3 |
That window function is called row_number. What this row_number window function call does is provide a number for the position of the reply relative to other replies with the same post_id.
SELECT id, post_id,row_number() OVER (PARTITION BY post_id ORDER BY id)FROM replies
id | post_id | row_number |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 1 | 3 |
4 | 2 | 1 |
5 | 2 | 2 |
6 | 3 | 1 |
As you can see, the window functions are specified by using the OVER clause after the function call.
SELECT id, post_id,
row_number() OVER
(PARTITION BY post_id
ORDER BY id)
FROM replies
id | post_id | row_number |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 1 | 3 |
4 | 2 | 1 |
5 | 2 | 2 |
6 | 3 | 1 |
The OVER clause contains a PARTITION BY clause, which specifies the column or columns that you want to use for the grouping. Here we are partitioning by post_id, which means the row_number calculation is done separately for each set of replies with the same post_id.
SELECT id, post_id,
row_number() OVER
(PARTITION BY post_id
ORDER BY id)
FROM replies
id | post_id | row_number |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 1 | 3 |
4 | 2 | 1 |
5 | 2 | 2 |
6 | 3 | 1 |
We are also specifying the ORDER BY clause, which specifies in which order the individual rows should be processed inside the partition. Here we are ordering by the reply’s id, so for replies with post_id 2, it will process reply 4 before reply 5.
SELECT id, post_id,
row_number() OVER
(PARTITION BY post_id
ORDER BY id)
FROM replies
id | post_id | row_number |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 1 | 3 |
4 | 2 | 1 |
5 | 2 | 2 |
6 | 3 | 1 |
Putting this all together, what row_number does is for each of the replies for post 1, it orders them by their id. The first row has the row number of 1, the second row has the row number of 2, and the third row has the row number of 3.
SELECT id, post_id, row_number() OVER (PARTITION BY post_id ORDER BY id) FROM replies
id | post_id | row_number |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 1 | 3 |
4 | 2 | 1 |
5 | 2 | 2 |
6 | 3 | 1 |
Likewise for post 2, the first reply has row number 1 and the second reply has row number 2.
SELECT id, post_id, row_number() OVER (PARTITION BY post_id ORDER BY id) FROM replies
id | post_id | row_number |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 1 | 3 |
4 | 2 | 1 |
5 | 2 | 2 |
6 | 3 | 1 |
For post 3, there is only a single reply, and that reply has row number 1.|Now that we know how to specify the order of the rows, let us apply this to eager loading.
SELECT id, post_id, row_number() OVER (PARTITION BY post_id ORDER BY id) FROM replies
id | post_id | row_number |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 1 | 3 |
4 | 2 | 1 |
5 | 2 | 2 |
6 | 3 | 1 |
You start by taking the basic eager loading query that uses IN with the key values.
SELECT * FROM replies WHERE (post_id IN (1, 2, 3, 4, 5))
What would be nice is if you could just use the row_number window function in the WHERE clause to remove the rows that are over the limit for the association.|Unfortunately, this does not work as it is not valid SQL syntax. The reason for that is that window functions are not computed until after the WHERE clause has been processed, since the WHERE clause modifies the inputs to the window function. So you cannot use window functions in WHERE clauses.
SELECT * FROM replies WHERE (post_id IN (1, 2, 3, 4, 5))AND row_number() OVER (PARTITION BY post_id ORDER BY id) <= 10
Since you cannot use a window function in the WHERE clause, you have to move it into the SELECT clause. Note that this does not remove replies over the limit, it just returns the row number as an additional column.
SELECT *,row_number() OVER (PARTITION BY post_id ORDER BY id)FROM replies WHERE (post_id IN (1, 2, 3, 4, 5))
In order to remove replies over the limit, you need to make that query a subquery and select from it, and have the outer query use a WHERE clause to remove replies where the row number is over the limit.|This can be significantly faster than skipping eager loading or slicing the resulting array. Unfortunately, it is still suboptimal. Considering our earlier example with 10,000 replies for post 5, this will probably compute the row number for all 10,000 replies returned by the subquery, and later remove 9,990 replies in the WHERE clause in the outer query.
SELECT *
FROM (
SELECT *,
row_number() OVER
(PARTITION BY post_id
ORDER BY id)
FROM replies
WHERE (post_id IN (1, 2, 3, 4, 5))
) AS t1
WHERE (row_number <= 10)
ORDER BY row_number
Unless the database’s query optimizer is smart enough to avoid computing the row numbers for rows after the limit, using window functions will always be suboptimal for eager loading limited associations.
Additionally, window functions are not supported on simpler databases such as MySQL or SQLite, so you will have to use a different strategy for those databases.
The window function strategy was the first SQL eager loading strategy added to Sequel in Sequel 3.28.
It became the default eager loading strategy on databases that support window functions in Sequel 4.
So the main problem with using window functions is that it computes the row number for all rows after the limit. Similarly, the main problem for correlated subqueries is that they also perform unnecessary processing. Is there a way to eager load while avoiding unnecessary processing?
There is actually a fairly simple way to do so. You can use a UNION to combine the results of each of queries you would use when regularly loading. The UNION strategy requires a significantly different approach than the other methods previously discussed. Instead of starting with the query for an unlimited association and adding a filter to remove rows that are over the limit, you just UNION separate regular association queries.
It would be nice if you could do something like this, just sticking a UNION ALL in between all of the separate clauses used to regularly load the association. Unfortunately, SQL is not that simple. The UNION clause needs to come before the ORDER and LIMIT clauses in SQL.
SELECT * FROM replies WHERE (post_id = 1) ORDER BY id LIMIT 10 UNION ALL SELECT * FROM replies WHERE (post_id = 2) ORDER BY id LIMIT 10 UNION ALL SELECT * FROM replies WHERE (post_id = 3) ORDER BY id LIMIT 10 UNION ALL SELECT * FROM replies WHERE (post_id = 4) ORDER BY id LIMIT 10 UNION ALL SELECT * FROM replies WHERE (post_id = 5) ORDER BY id LIMIT 10
So what you need to do is wrap the queries in subqueries. This certainly looks ugly, but this is the fastest way to eagerly load associations with limits. As virtually all SQL databases support the UNION ALL clause, this strategy is more widely applicable than the window function strategy. It even works on MySQL!
SELECT * FROM (SELECT * FROM replies WHERE (post_id = 1) ORDER BY id LIMIT 10) AS t1 UNION ALL SELECT * FROM (SELECT * FROM replies WHERE (post_id = 2) ORDER BY id LIMIT 10) AS t1 UNION ALL SELECT * FROM (SELECT * FROM replies WHERE (post_id = 3) ORDER BY id LIMIT 10) AS t1 UNION ALL SELECT * FROM (SELECT * FROM replies WHERE (post_id = 4) ORDER BY id LIMIT 10) AS t1 UNION ALL SELECT * FROM (SELECT * FROM replies WHERE (post_id = 5) ORDER BY id LIMIT 10) AS t1
As I mentioned, this is the fastest way to eagerly load associations with limits. It is faster than the correlated subquery and window function strategies on all databases I have tested on because it avoids unnecessary processing.
Note that it is only the fastest if you index correctly. If you do not index correctly, it is much slower than the window function or array slicing strategies, since it will do a sequential scan of the replies table for every post you are eagerly loading for, instead of a single sequential scan of the replies table.
How does a UNION query avoid the unnecessary processing required by the other strategies? It is because for each subquery in the union query, it can use a single index, read only the number of index entries specified by association limit, fetch only those rows from the table, and move on to the next subquery. This avoids processing rows beyond the limit for the association, which is why it is the fastest strategy.
Unfortunately, there are some problems with the UNION strategy. On PostgreSQL, somewhere between eagerly loading replies for 5,000 posts and 10,000 posts, you start hitting the default stack limit of 2MB. You can increase PostgreSQL’s stack limit up to 32MB, but at some point you will hit the stack limit and the query will fail. The UNION approach also takes up significantly more memory than the window function approach, so it is more likely to hit a memory limit as well.
The good news is these limits are easy to work around. Instead of sending a single union query with 10,000 subqueries, you can send 10 union queries each with 1,000 subqueries, or 100 union queries each with 100 subqueries.
In my testing on PostgreSQL, I found that sending a single UNION query with 1000 subqueries was signficantly slower than sending 10 UNION queries each with 100 subqueries, which surprised me. It turns out that a UNION query with 200 subqueries is more than 2 times slower than a UNION query with 100 subqueries.| However, if you decrease the number of subqueries per UNION query, you increase the total number of queries you need to send to the database.|What I have found is that on PostgreSQL, the optimum number of subqueries per UNION query is actually a function of the latency between the database and the application.
On a local database, where there is almost zero latency between the database and the application, the optimum number of subqueries per union query is between 5 and 20.
If you are using Amazon web services with a database and application on different servers in the same availability zone, there is about a half millisecond latency, and the optimium number of subqueries per union is between 20 and 30.
If the database is in a different availability zone in the same datacenter, with a 2 millisecond latency, the optimium number of subqueries per union is between 40 and 50.
If the database is in a different datacenter, with a 20 millisecond latency, the optimium number of subqueries per union is between 100 and 150.
In most cases where Sequel is used, the database is probably in the same datacenter, so Sequel assumes about a 2 millisecond latency, and therefore sets the default number of subqueries per UNION to 40. Users can modify this value via an option.
Eager loading limited associations via a UNION was added as the default strategy in Sequel 4.10. You can still select the window function or array-slicing strategy manually for the rare cases where they are faster, such as when the table does not have the appropriate indexes.
So far I have just been discussing query-per-association eager loading. That is because it is the simpler case.
Handling eager loading when using JOINs is even more complex, and requires diving even deeper. First, let us discuss why you would want to eager load via JOINs at all. Here are a couple examples.
If you want to eagerly load replies for multiple posts, but wanted the posts to be ordered based on the time of the first reply or most recent reply, you need to JOIN.
If you want to eagerly load replies for multiple posts, but you do not want posts unless they have a reply that is within the last hour, you need to JOIN.
Let me briefly describe how eager loading via JOINs is done for an association with no limit. In Sequel, the method that does this is called eager_graph, and it has the same API as the eager method.
Post.where(posts__id: [1,2]).
order(:replies__id).
eager_graph(:replies).all
SELECT posts.id,
replies.id AS replies_id,
replies.post_id
FROM posts
LEFT OUTER JOIN replies
ON (replies.post_id = posts.id)
WHERE (posts.id IN (1, 2))
ORDER BY replies.id
This method selects all of the columns of both tables explicitly, specifying aliases where the column name would be ambiguous.
Post.where(posts__id: [1,2]). order(:replies__id). eager_graph(:replies).allSELECT posts.id, replies.id AS replies_id, replies.post_idFROM posts LEFT OUTER JOIN replies ON (replies.post_id = posts.id) WHERE (posts.id IN (1, 2)) ORDER BY replies.id
It joins replies to posts via a LEFT OUTER JOIN, so that the query will return posts that do not have any replies.
Post.where(posts__id: [1,2]). order(:replies__id). eager_graph(:replies).all SELECT posts.id, replies.id AS replies_id, replies.post_id FROM postsLEFT OUTER JOIN replies ON (replies.post_id = posts.id)WHERE (posts.id IN (1, 2)) ORDER BY replies.id
Note that we are specifying an order by replies.id so that the query will return posts in the order of the earliest reply. This is the reason we have to eagerly load via JOINs instead of issuing a separate query for the replies.
Post.where(posts__id: [1,2]). order(:replies__id). eager_graph(:replies).all SELECT posts.id, replies.id AS replies_id, replies.post_id FROM posts LEFT OUTER JOIN replies ON (replies.post_id = posts.id) WHERE (posts.id IN (1, 2)) ORDER BY replies.id
Let us consider the case where you want to eager load the first 10 replies for each post via JOINs. Here is what the basic SQL would look like without any limits applied.
Post.where(posts__id: [1,2]). order(:first_10_replies__id). eager_graph(:first_10_replies).all SELECT posts.id, first_10_replies.id AS first_10_replies_id, first_10_replies.post_id FROM posts LEFT OUTER JOIN replies AS first_10_replies ON (first_10_replies.post_id = posts.id) WHERE (posts.id IN (1, 2)) ORDER BY first_10_replies.id
Note that you cannot just add a LIMIT clause. This is because the LIMIT clause would operate on the joined result set, which will almost certainly not give the result that you want. Adding a LIMIT 10 clause here, you might get only get post 1 returned, instead of getting post 1 and post 2.
Post.where(posts__id: [1,2]).
order(:first_10_replies__id).
eager_graph(:first_10_replies).all
SELECT posts.id,
first_10_replies.id AS first_10_replies_id,
first_10_replies.post_id
FROM posts
LEFT OUTER JOIN replies AS first_10_replies
ON (first_10_replies.post_id = posts.id)
WHERE (posts.id IN (1, 2))
ORDER BY first_10_replies.id
LIMIT 10
The rule that eager loading an association for a group of records should give the same results as loading the association separately for each individual record still applies when eager loading via JOINs. What strategies can be used when eager loading via JOINs so that the rule is not violated?
First, we cannot skip eager loading in this case. Since the associated table has been JOINed into the query, there is no way to undue that.
However, while we cannot skip eager loading, we can slice the resulting array of replies for each post after eager loading. Just like in the query per association case, this does not save us anything at the database level, but at least it returns correct results.|Array slicing is Sequel’s default strategy when eager loading via JOINs.
Note that just as in the query per association case, ActiveRecord does not even do this, it will return an array with records beyond the limit.
We can also eliminate the union approach. We do not know which posts will be returned until the query is run, therefore we do not know which subqueries to use in the UNION query.
It is possible to use correlated subqueries for eager loading via JOINs.
This SQL query implements a correlated subquery strategy for eager loading via JOINs. The approach is similar to the approach used for a separate query per association.|The main difference is that we filter the associated table before joining using the highlighted inline subquery.
Post.where(posts__id: [1,2]). order(:first_10_replies__id). eager_graph_with_options(:first_10_replies, limit_strategy: :correlated_subquery).all SELECT posts.id, first_10_replies.id AS first_10_replies_id, first_10_replies.post_id FROM posts LEFT OUTER JOIN (SELECT * FROM replies WHERE (id IN ( SELECT t1.id FROM replies AS t1 WHERE t1.post_id = replies.post_id ORDER BY id LIMIT 10))) AS first_10_replies ON (first_10_replies.post_id = posts.id) WHERE (posts.id IN (1, 2)) ORDER BY first_10_replies.id
That subquery uses an IN expression with a correlated subquery to filter rows over the limit for the association.|On PostgreSQL and Microsoft SQL Server, this approach can perform decently, sometimes even being the fastest implementation. On databases with poorer optimizers, such as SQLite, it can perform orders of magnitude slower in some cases.
Post.where(posts__id: [1,2]). order(:first_10_replies__id). eager_graph_with_options(:first_10_replies, limit_strategy: :correlated_subquery).all SELECT posts.id, first_10_replies.id AS first_10_replies_id, first_10_replies.post_id FROM posts LEFT OUTER JOIN ( SELECT * FROM replies WHERE (id IN (SELECT t1.id FROM replies AS t1 WHERE t1.post_id = replies.post_id ORDER BY id LIMIT 10))) AS first_10_replies ON (first_10_replies.post_id = posts.id) WHERE (posts.id IN (1, 2)) ORDER BY first_10_replies.id
In Sequel, to use an correlated subquery strategy when eager loading via JOINs, you have to use a method called eager_graph_with_options. This is because eager_graph’s API does not allow for an options hash.|Sequel cannot determine the optimal strategy when eager loading via JOINs, as it depends on how many replies and posts would be returned. For this reason, you must specify the strategy to use on a per-call basis.
Post.where(posts__id: [1,2]). order(:first_10_replies__id). eager_graph_with_options(:first_10_replies, limit_strategy: :correlated_subquery).all SELECT posts.id, first_10_replies.id AS first_10_replies_id, first_10_replies.post_id FROM posts LEFT OUTER JOIN ( SELECT * FROM replies WHERE (id IN ( SELECT t1.id FROM replies AS t1 WHERE t1.post_id = replies.post_id ORDER BY id LIMIT 10)) ) AS first_10_replies ON (first_10_replies.post_id = posts.id) WHERE (posts.id IN (1, 2)) ORDER BY first_10_replies.id
Using correlated subqueries for eager loading via JOINs was added in Sequel 4.10. While in some cases it is much slower than other strategies, there are situations where it is the fastest strategy. In general it performs the best when eagerly loading for a small number of posts, and worst for a large number of posts.
You can also use a window function to eager load limited associations via JOINs.
This SQL query implements a window function strategy for eager loading via JOINs. This is similar to the approach used in the correlated subquery strategy. The main difference is that the subquery uses the row_number window function instead of a correlated subquery.
Post.where(posts__id: [1,2]). order(:first_10_replies__id). eager_graph_with_options(:first_10_replies, limit_strategy: :window_function).all SELECT posts.id, first_10_replies.id AS first_10_replies_id, first_10_replies.post_id FROM posts LEFT OUTER JOIN (SELECT id, post_id FROM ( SELECT *, row_number() OVER (PARTITION BY replies.post_id ORDER BY id) FROM posts ) AS t1 WHERE (row_number <= 10)) AS first_10_replies ON (first_10_replies.posts_id = posts.id) WHERE (posts.id IN (1, 2)) ORDER BY first_10_replies.id
As when using a separate query per association, you cannot filter directly via a window function, you have to select the window function value in a subquery and then filter using the result of the window function in the outer query.|In general the window function strategy is the fastest strategy when the number of replies is large. It performs worst when the number of posts is small.
Post.where(posts__id: [1,2]). order(:first_10_replies__id). eager_graph_with_options(:first_10_replies, limit_strategy: :window_function).all SELECT posts.id, first_10_replies.id AS first_10_replies_id, first_10_replies.post_id FROM posts LEFT OUTER JOIN ( SELECT id, post_id FROM ( SELECT *, row_number() OVER (PARTITION BY replies.post_id ORDER BY id) FROM posts ) AS t1 WHERE (row_number <= 10) ) AS first_10_replies ON (first_10_replies.posts_id = posts.id) WHERE (posts.id IN (1, 2)) ORDER BY first_10_replies.id
Support for using window functions to eagerly load associations via JOINs was added in Sequel 4.8. It is not used unless you specifically select it as an option, as it can perform much worse than the default of array slicing in cases where the number of posts is small.
Note that if you are filtering or ordering based on columns in the associated table, the results can significantly change between using array slicing in ruby and using a SQL-based strategy.
Let us say you want to eagerly load the first 10 replies for posts 1 and 2, but you only want posts which have a reply that contains the string diving. If you use the array-slicing approach, this will look for replies that contain the word diving and return the first 10 of those replies for each post. If you use one of the SQL-based approaches, this will take the first 10 replies to each post, and remove replies that do not contain diving.|Let us say post 2 has a 100 replies, and only the 50th reply contains the word diving. With the array slicing approach, this will return post 2 with the 50th reply associated. With an SQL-based approach, this will not return post 2 at all, since none of the first 10 replies to post 2 contain diving.
Post.where(posts__id: [1,2]).
order(:first_10_replies__id).
eager_graph(:first_10_replies).
where(first_10_replies__text: /diving/)
all
I do not think there is a single correct approach here, as there are good arguments to be made for either approach. Sequel lets you choose which approach you want to take.
This brings us to the deepest level of our dive, where I will discuss filtering by associations.
First, in case you are unfamiliar with this concept, both Sequel and ActiveRecord support filtering by many-to-one associations. For a many-to-one association, this is basically sugar for using objects instead of their foreign keys when filtering.
post = Post[1] Reply.where(post: post) SELECT * FROM replies WHERE (post_id = 1)
To filter by an association, you use the association symbol as the key in a hash you pass to the where method.
post = Post[1]
Reply.where(post: post)
SELECT *
FROM replies
WHERE (post_id = 1)
It will recognize that an association was used, and instead of using the post column, it will use the foreign key of the association, post_id, with the primary key value of the object passed in, which in this case is 1.
post = Post[1] Reply.where(post: post) SELECT * FROM replies WHERE (post_id = 1)
Sequel goes a step farther than ActiveRecord, allowing filtering by all types of associations. So this code allows you to select all posts that have that reply. Since this is a one-to-many association and not a many-to-many association, this should return the single post associated with the reply.|Note that while ActiveRecord appears to support this syntax, it will generate incorrect SQL, in the worst case resulting in silent failure.
reply = Reply.first(post_id: 1) Post.where(replies: reply)
This is the SQL that would be used. It is nice and simple, just looking for all posts with id 1. Since id is the primary key, this should return a single post.
reply = Reply.first(post_id: 1) Post.where(replies: reply) SELECT * FROM posts WHERE (id = 1)
Now consider the case where the association you are filtering by has a limit. In this case, using the same query is incorrect. This reply might be the hundredth reply to that post, in which case it is not one of the first 10 replies, and therefore the query should not return any posts.
reply = Reply.first(post_id: 1) Post.where(first_10_replies: reply) SELECT * FROM posts WHERE (id = 1)
So how can you handle filtering by limited associations? Just as in the eager loading cases, there are strategies that can be used. In general, variants on the strategies used for eager loading via JOINs will also work for filtering by limited associations.
As you are not eagerly loading and you must add some sort of filter to the dataset, you cannot avoid the problem by skipping it.
Unlike when eager loading, there is no way to slice an array to fix the problem. This is because in the eager load cases, you end up with a separate array of replies for each post which you can slice based on the limit for the association. In the filtering by associations cases, it returns an array of posts, so you cannot use the association limit. You must handle the issue purely at the SQL level to get correct results.
Just like when the eager loading via JOINs, you cannot use a UNION approach, since you do not know in advance the related posts.
Similar to the eager loading cases, it is possible to filter by a limited association using a correlated subquery.
Here is the filtering by associations query. We have to add a filter to this query that will remove posts if the given reply is not in the first 10 replies for that post.
reply = Reply.first(post_id: 1) # reply.id = 2 Post.where(first_10_replies: reply) SELECT * FROM posts WHERE (id = 1)
Here is a filter that can be added to the query that will remove such posts. This filter uses IN with a subquery to restrict the returned posts to ones that contain that reply in the first 10 of their replies.
reply = Reply.first(post_id: 1) # reply.id = 2 Post.where(first_10_replies: reply) SELECT * FROM posts WHERE ((posts.id = 1)AND (posts.id IN ( SELECT replies.post_id FROM replies WHERE ((replies.id IN ( SELECT t1.id FROM replies AS t1 WHERE (t1.post_id = replies.post_id) ORDER BY id LIMIT 10)) AND (replies.id = 2)))))
This subquery has two conditions. One condition is that the reply’s id is 2, since we are only interested in posts that have that reply.
reply = Reply.first(post_id: 1) # reply.id = 2 Post.where(first_10_replies: reply) SELECT * FROM posts WHERE ((posts.id = 1) AND (posts.id IN ( SELECT replies.post_id FROM replies WHERE ((replies.id IN ( SELECT t1.id FROM replies AS t1 WHERE (t1.post_id = replies.post_id) ORDER BY id LIMIT 10)) AND (replies.id = 2)))))
The other subquery condition is the more interesting one. This uses an additional subquery with IN, which removes rows from the outer subquery that are beyond the limit for the association, using the same correlated subquery strategy that you’ve seen while eager loading.
reply = Reply.first(post_id: 1) # reply.id = 2 Post.where(first_10_replies: reply) SELECT * FROM posts WHERE ((posts.id = 1) AND (posts.id IN ( SELECT replies.post_id FROM repliesWHERE ((replies.id IN ( SELECT t1.id FROM replies AS t1 WHERE (t1.post_id = replies.post_id) ORDER BY id LIMIT 10))AND (replies.id = 2)))))
This approach is actually quite fast if the number of posts being matched is small, and the number of replies for each post is not very large. However, for a large number of posts or replies, it can take a very long time.
Just like when eager loading via JOINs, using a correlated subquery strategy for filtering by associations was added to Sequel 4.10. It is the default strategy used on databases that do not support window functions.
As you might expect, the other strategy for filtering by limited associations is using window functions.
Here is the SQL for the window function strategy. This is basically the same SQL as used in the correlated subquery strategy, except for the inner subquery. Instead of using a correlated subquery, it uses a window function, and just as before, since you cannot filter directly on the window function, it has to run the window function in a subquery and filter that in an outer query. So instead of 2 nested subqueries it uses 3 nested subqueries.
reply = Reply.first(post_id: 1) # reply.id = 2 Post.where(first_10_replies: reply) SELECT * FROM posts WHERE ((posts.id = 1) AND (posts.id IN ( SELECT replies.post_id FROM replies WHERE ((replies.id IN (SELECT id FROM ( SELECT id, row_number() OVER (PARTITION BY post_id ORDER BY id) FROM replies) AS t1 WHERE (row_number <= 10))AND (replies.id = 2)))))
Unfortunately, it is not all roses with the window function approach. Especially in the case where you are only looking for the posts using a single reply, the correlated subquery approach can be much faster. However, the window function approach is more consistent, dealing better with larger datasets. At least, that is true on PostgreSQL.
On Microsoft SQL Server, the correlated subquery strategy was always much faster in my testing. The correlated subquery strategy also works on databases that do not support window functions, so it has wider applicability.
Support for using window functions to implement filtering by limited associations was added in Sequel 4.8.
So, to review what I discussed today. I started the talk with a basic discussion of loading associations, as well as the general rule that eager loading an association for a group of records should give the same results as loading the association separately for each individual record.
I talked about easy ways to handle that, such as skipping the eager loading and falling back to regular loading, or using array slicing to at least return the correct results.
I then talked about eager loading limited associations using a separate query per association. I compared the UNION, correlated subquery, and window function stragies, and explained why using a UNION is almost always the fastest approach if the table is indexed correctly, and it works on the widest variety of databases.
I then talked about eager loading limited associations using JOINs in a single query. I talked about how array slicing can still be done to limit results correctly at the ruby level, and also showed how to join to subqueries that use correlated subqueries or window functions to limit results correctly at the SQL level. I also explained that when you are filtering or ordering based on the columns in the associated table, the results change based on whether array slicing or an SQL-based strategy is used.
Finally, I talked about filtering by limited associations, how no approaches at the ruby level can handle them, and how correlated subqueries and window functions can be used filter by limited associations at the SQL level.
In the interests of time and your collective sanity, I did not even talk about more complex cases, such as handling limits for associations with one or more join tables, or associations that use composite keys, or associations that have conditions or offsets in addition to limits.
But worry not, as Sequel handles all of those cases for you.
That wraps this presentation up. Thank you very much for taking the deep dive with me today.
If you have any questions, I’ll be happy to answer them now.
Thanks for for the pictures!
Title Boat Dive In Diving Deeper Even Deeper Deepest Dive Sunset