Hello everyone. It is great to be here in Moscow. This presentation is titled "Give-and-Go with PostgreSQL and Sequel".
Give-and-Go
with
PostgreSQL
and
Sequel
RailsClub 2013 Moscow
My name is Jeremy Evans, and I have been the lead developer of the ruby Sequel library for the past 5 years, and a PostgreSQL user for over 10 years.
To eliminate the possibility of confusion, when I refer to Sequel, I'm talking about the ruby library. When I refer to SQL, I'm talking about the database query language.
Also, because PostgreSQL isn't easy to pronounce, I'll be referring to it as Postgres in the rest of the presentation.
For those of you who aren't familiar with Sequel, it is a database abstraction library for ruby. It sits above the underlying database drivers such as pg or mysql2, and offers an interface that works with any SQL database. It has mostly replaced the use of DBI in the ruby community.|The main distinguishing feature of Sequel is that it attempts to represent SQL concepts using ruby objects, instead of using strings of SQL code fragments. Sequel also ships with an object relational mapper called Sequel::Model, which is built on top of Sequel and is similar to ActiveRecord in basic usage.
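As a quick, hypothetical sketch of that approach (the albums table and artist_id column here are made up for illustration), a Sequel dataset builds its SQL from ruby objects rather than string fragments:
DB[:albums].where(:artist_id=>1).order(:name).sql
# SELECT * FROM "albums" WHERE ("artist_id" = 1) ORDER BY "name"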
There is a saying in hockey, don't go through life without goals. With that in mind, the goal of this presentation is to show off a lot of awesome Postgres features, and how you can easily make use of them with Sequel.
I think one of the reasons that Postgres is not used in more places, is that the advanced features that Postgres offers are sometimes not easy to use without good client library support. Sequel makes it easy to use many advanced Postgres features, and this presentation will separate those features into three categories.
The first category is related to the data types that Postgres supports. Unlike many SQL databases, Postgres supports a number of complex datatypes, such as arrays, hash tables, JSON, ranges, and composite types.
The second category is query types that Postgres supports. I will be talking about common table expressions, window functions, full text search, as well as returning data from insert, update, and delete queries.
The final category is really just a catch all for features that do not fit cleanly into one of the previous two categories. I will be talking about delayed validation of constraints, prepared transactions, procedural languages, and notification channels.
Let's jump right in and talk about some of the advanced data types supported by Postgres. Most SQL databases deal with simple scalar types, such as strings, numbers, and dates, requiring multiple columns or references to other tables to deal with combinations of scalar types.|Postgres has numerous types that store combinations of scalar types in a single column, allowing you to store more complex data in a single row.
The simplest combination type offered by Postgres is the array. A Postgres array type contains an arbitrary number of values of the same type of data. Postgres supports both single and multidimensional arrays.
In standard SQL, you would generally store this information in a separate table. But then you need to do a JOIN or separate queries to work with the results. This can be slow for large datasets. By using an array type, you can eliminate JOINs from your queries, allowing substantial speedups.
Depending on the type of queries you are doing, with proper indexes, using arrays instead of a separate join table can be up to 20x faster.
There are a couple of tradeoffs you have to make to get such a speedup. One is that your application becomes Postgres-specific. The second is that you need to have support in your client library for the array types.|Note that these tradeoffs are pretty much the case for all advanced Postgres types, so while I won't mention them again, you should assume they apply to the other types as well.
If you don't have special support in your client library, you'll probably have arrays returned as strings in Postgres array format.
"{1,2,3}"
You need client library support so that the array types are recognized and parsed correctly into ruby array-like objects.
[1,2,3]
Sequel ships with such support in the form of an extension. If you are going to be using array types, you just load the pg_array extension into your Sequel::Database instance, and then it correctly deals with Postgres array types. It supports most common Postgres array types by default, and it also supports registering custom array types.
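As a minimal sketch, assuming DB is your Sequel::Database and tag_ids is a hypothetical integer array column:
DB.extension :pg_array
DB[:albums].get(:tag_ids)
# => array-like object wrapping [1, 2, 3]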
Sequel also ships with an additional extension called pg_array_ops, which provides DSL support for Postgres array functions and operators.
Let's say you are looking for all rows where the array column contains both 3 and 1. You can use the pg_array method to tell Sequel that the argument should be treated as a Postgres array, and then you can call the contains method to check if it contains the given elements. This translates into the SQL you see on screen.
Sequel.pg_array(:array_column).
contains([3,1])
"array_column" @> ARRAY[3,1]
So not only does Sequel allow you to treat Postgres arrays as ruby array-like objects, it also represents the SQL operations on Postgres arrays as ruby expression objects.
This extends into emulating the Postgres type system. For example, the return value of the contains operator in Postgres is a boolean, so the value Sequel returns for the contains method is a boolean expression. The return value of the concatenation operator on two Postgres arrays is a Postgres array, so the value Sequel returns for the concat method is an array expression.
op = Sequel.pg_array(:array_column)
op.contains([3,1])  # Boolean
op.concat([1])      # Array
This allows you to chain the methods together intelligently. If you have a row with two array columns, you can see if their concatenation contains both numbers, by calling contains on the array expression object returned by concat.
Sequel.pg_array(:array_column).
  concat(:other_array_column).
  contains([3,1])
In addition to making it easier to use Postgres arrays in your own code, Sequel also ships with a model plugin called pg_array_associations, which implements many to many associations by storing foreign keys in an array.
Here is the classic JOIN table usage in a many to many association. With a table for albums and a table for tags, the association between albums and tags is stored in a join table, which has foreign keys pointing to both the albums table and the tags table.
      albums_tags
      ------------
 -----| album_id |
 |    | tag_id   |-----
 |    ------------    |
 |                    |
 |  albums     tags   |
 |  --------   --------
 -->| id   |   | id   |<--
    | name |   | name |
    --------   --------
Using the pg_array_associations plugin, you can eliminate the join table, by storing the tag foreign keys in an array column in the albums table. This means you can then get all tags related to a given album without doing a JOIN.|The Sequel::Model API is almost identical to the JOIN table case, and the plugin handles advanced Sequel::Model association features, such as multiple methods of eager loading, filtering by associations, and dataset associations.
albums            tags
-----------       --------
| id      |    -->| id   |
| name    |    |  | name |
| tag_ids |----|  --------
-----------
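Here is a rough sketch of that setup, assuming the tag_ids integer array column from the diagram above:
class Album < Sequel::Model
  plugin :pg_array_associations
  pg_array_to_many :tags
end

class Tag < Sequel::Model
  plugin :pg_array_associations
  many_to_pg_array :albums
end

Album[1].tags  # no join table involved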
Unfortunately, the current downside to storing foreign keys in an array is that you lose automatic referential integrity, so you can't ensure that all of the foreign keys in the array continue to reference existing rows.|There have been patches for Postgres 9.2 and 9.3 to add support for array element foreign keys, but they were not merged. Hopefully a future version of Postgres will support them. In the meantime, it is possible to emulate referential integrity for foreign keys in arrays using triggers.
The next type I want to talk about is hstore. The hstore type is a hash table stored in a single column, where keys and values are both strings. Unlike arrays, hstore is not a core Postgres type. It ships with Postgres as an extension, and it is one of the most commonly used extensions.
In standard SQL, the general way to represent a hash table is using the "entity-attribute-value" model, with a separate table that has a foreign key pointing to the current row, and two string columns, one for the attribute name and one for the attribute value.
With proper indexes, using hstores instead of a separate entity attribute value table can be up to 15x faster.
So I've used the phrase "with proper indexes" a couple of times, which is pretty vague. Postgres advanced types would be mostly useless without good index support.
The hockey great Gordie Howe once said, You can always get someone to do your thinking for you. Thankfully, Postgres does a good job of thinking for you, supporting multiple advanced index types that you probably wouldn't be able to think up yourself.
Postgres has GIN and GIST index types that it can use to index the advanced data types that Postgres supports. Arrays can be indexed with GIN, allowing index usage for queries where the array contains an entry, a subset or superset of entries, or has any overlap with set of entries. Hstores can be indexed with both GIN and GIST, allowing index usage for queries where the hstore contains a given key, any or all of a set of keys, or a subset of keys and values.
Sequel ships with a pg_hstore extension for the hstore type. Loading this extension into your Database object makes it return the hstore type as a ruby hash-like object.
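A minimal sketch, assuming a hypothetical items table with an hstore column named properties:
DB.extension :pg_hstore
h = DB[:items].get(:properties)
h['color']  # hash-like access, string keys and values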
Similarly, Sequel ships with a pg_hstore_ops extension for DSL support for hstore operators and functions.
Let's say you are looking for all rows where the hstore column contains a key of foo with a value of bar. You can use code very similar to the array code I showed earlier to do so, just wrapping the column using the hstore method, and using the contains method.
Sequel.hstore(:hstore_column).
contains('foo'=>'bar')
"hstore_column" @> '"foo"=>"bar"'
In addition to replacing the entity-attribute-value model, hstores can also be used to combine multiple columns into a single column. If you have 10 columns where the values for most rows are NULL, you could combine them into a single hstore column for ease of maintenance and possibly better performance.
hstores were the first Postgres type that allowed for schemaless design, even if the implementation was limited to just strings.
What if you want to do a full schemaless design, similar to document databases such as MongoDB? While the merits of a fully schemaless design are debatable, if you want to go that route, Postgres has a type for that.
That is the JSON type. As you might expect, this allows you to store nested data using Javascript objects, arrays, numbers, and strings. The JSON type is a fairly recent addition to Postgres, appearing first in 9.2, but with many additional features in 9.3.
As you might expect, Sequel ships with an extension for the json type, returning values as ruby hash-like or array-like objects.
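A minimal sketch, assuming a hypothetical table with a json column named attributes:
DB.extension :pg_json
attrs = DB[:collection].get(:attributes)
attrs['name']  # hash-like access for json objects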
A Postgres replacement for a MongoDB collection would look something like the following in Sequel. Basically, just a UUID primary key and a schemaless JSON hash for the attributes.
DB.create_table(:collection) do
  uuid :_id, :primary_key=>true
  json :attributes
end
Let's take a look at an example of using these types in a real application. This application is called KaeruEra, which is an error aggregator for ruby web applications, similar to Airbrake or Errbit. Basically, it logs errors that happen in other web applications, and provides a way to view, search, and track fixes to the errors.
As you might expect, the main table in the application stores errors. Using Postgres types, there is only a single row created for every error. Here are some of the fields that are stored.|First is the exception class and message, which are both stored as plain strings. Then there is the backtrace for the exception, which is stored as an array of strings. It also stores the request environment, which fits nicely into an hstore, since keys and values are generally strings. Finally, it stores the request params and session variables as json, since they often contain nested structures.|Without the advanced types provided by Postgres, storing similar information in any sort of structured and queryable format would require inserting hundreds of rows across 5 different tables.
You may be wondering why you would want to use an hstore type if you can use the json type, since the json type supports nested structures and the hstore type does not. One reason is that there is no automatic indexing support for the json type, unlike the hstore type.|You can still create indexes based on a specific member in a json column, even a nested object member, but you have to create specific indexes for each type of search.
Note that these tradeoffs will probably be going away in future versions of Postgres. It is likely by Postgres 9.4 or 9.5 that the hstore type will support nesting and values other than just strings. It is also likely that the json type will gain the ability to use GIST and GIN indexes. I expect that by Postgres 9.5, hstore and json will be fairly similar in capabilities.
Let me get back to discussing existing Postgres features. I mentioned earlier that Postgres added a whole lot more capabilities to json in 9.3. Those capabilities were in the form of additional json operators and functions. Sequel's pg_json_ops extension adds DSL support for these new operators and functions.
Let's say you were using the JSON displayed here, which is an object containing an array element for the key a. Let's also say there was a common search that depended on the third element of this array, which would be 9 in this example.
# {"a":[1,4,9,16,25]}
You could add an index on this array element using the following Sequel code. The highlighted Sequel code returns the third array element of the json object member with the key a.
# {"a":[1,4,9,16,25]}
DB.add_index :collection,
Sequel.pg_json(:attributes)[%w'a 2']
What if you want the benefits of having nested data in your row but want the consistency benefits that come from using a defined schema?
Postgres supports this with composite types, where a single column contains a data structure with multiple subtypes. When you combine composite types with array types, you basically have the power to store full documents with unlimited nesting in a single row.
Postgres allows for user defined types using the CREATE TYPE syntax, which is very similar to CREATE TABLE, taking the list of columns in the type.
CREATE TYPE
Additionally, for every table you create with CREATE TABLE, a corresponding type is created.
CREATE TABLE
Let's say you have a music database and want to store each album in a single row, including the full tracklist. A typical JSON representation would be something like this, storing the name and length of the album, and then an array of tracks, with each track having a name, length, and reference to the related artist.
{"name": "Some Album", "length": 3600, "tracks": [ {"name": "Track 1", "length": 290, "artist_id": 1}, {"name": "Track 2", "length": 310, "artist_id": 2}]}
Here is how you could structure the data using Postgres arrays and composite types. You have a separate track type with a name, length, and reference to artist. Then your albums table has a column for the name, a column for the album length, and a column containing the tracklist as an array of track types.|This approach allows you to keep an entire album including all tracks in a single row. It combines the benefit of a document database, getting an entire object in a single JOIN-less query, but still retains the data consistency benefits that come from using a defined schema. I actually use a design similar to this in one of my applications, though for a different reason.
track type:
  name       text
  length     integer
  artist_id  foreign key

album type:
  name    text
  length  integer
  tracks  track[]
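As a hedged sketch, declaring that schema from Sequel might look something like this, using DB.run for the CREATE TYPE statement. Note that composite types cannot carry real foreign key constraints, so the artist_id reference would have to be enforced by the application or by triggers:
DB.run <<SQL
CREATE TYPE track AS (
  name text,
  length integer,
  artist_id integer
)
SQL

DB.create_table(:albums) do
  text :name
  Integer :length
  column :tracks, 'track[]'
end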
I'm sure most of you have heard of Heroku, the platform-as-a-service provider. One of the many benefits of Heroku is that they offer a completely free way to run low-traffic web applications, including a free Postgres database.|When Heroku first started, this free database was limited to 5MB of data. Last year, they switched from 5MB of data, to 10,000 rows of data. At the time, I had an application that was over 10,000 rows, but had less than 5MB of data. It stored tracks in a separate table from albums with one row per track.|Using an array of composite types, I was able to eliminate the separate tracks table and just store the tracks alongside the related album, getting me under the 10,000 row limit. I was able to do this while increasing performance and still allowing for indexed querying by track name, using Postgres advanced indexing features.
Postgres allows you to combine the schemaless and relational approaches and get the benefits of both worlds, using a defined schema for the columns you know you need, but having a schemaless json or hstore column in the cases where you want to trade consistency for flexibility.
track type:
  name          text
  length        integer
  times_played  integer
  attributes    json/hstore

album type:
  name        text
  length      integer
  artist_id   foreign key
  tracks      track[]
  attributes  json/hstore
Sequel ships with a pg_row extension that enables composite type support, returning values as ruby hash-like objects. Additionally, Sequel also ships with a pg_row model plugin, which allows you to represent composite types as Sequel::Model instances.
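A minimal sketch, registering the track composite type from the earlier example:
DB.extension :pg_row
DB.register_row_type(:track)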
The final Postgres data type I'd like to talk about is the range type. Range types in Postgres are similar to ruby ranges, but they are a bit more flexible. Postgres ranges can be unbounded on either end, and they allow for exclusive beginnings, neither of which ruby allows.|Postgres has built in support for numeric, date, and timestamp ranges, and also allows you to define your own custom range types.
Sequel ships with a pg_range extension that enables support for range types, returning Postgres range types as ruby range-like objects, and allowing you to use ruby ranges to represent range types in queries.
And to make using Postgres ranges easier, Sequel also ships with a pg_range_ops extension that adds DSL support for range operators and functions.
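A minimal sketch, assuming the reservations table used in the examples below:
DB.extension :pg_range
r = DB[:reservations].get(:reservation_dates)
r.to_range  # a ruby Range, when the bounds allow the conversion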
Let's say you have a hotel reservation system, and you want to see which reservations include today's date. You could use the following code. By design, it looks similar to the hstore and array code, just wrapping the object with pg_range and using the contains method.
Sequel.pg_range(:reservation_dates).
contains(Date.today)
"reservation_dates" @> '2013-09-28'
Range types by themselves are only marginally useful, since they can be emulated by using separate columns for the beginning and the end of the range. However, there is another Postgres feature that makes range types very useful.
That feature is exclusion constraints. Exclusion constraints are similar to unique constraints, but instead of protecting against duplicate values, they protect against overlapping values. For any work where overlapping values are not allowed, such as when scheduling access to a limited resource, Postgres exclusion constraints are probably the best approach to the problem.
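As a hedged sketch, the reservations table used in the next example could be defined like this (the btree_gist extension is needed to mix the equality operator on room_id with the range overlap operator):
DB.run 'CREATE EXTENSION IF NOT EXISTS btree_gist'

DB.create_table(:reservations) do
  primary_key :id
  Integer :room_id
  daterange :reservation_dates
  exclude [[:room_id, '='], [:reservation_dates, '&&']]
end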
Let's say you want to book a reservation for room 1 from September 26 to September 28. Assuming nothing else has already reserved room 1 in that time period, this will work correctly.
r = Date.new(2013, 9, 26)..
Date.new(2013, 9, 28)
DB[:reservations].
insert(:room_id=>1,
:reservation_dates=>r)
Insert successful
Later, if something else tries to reserve room 1 from September 28 to September 30, it will fail, because that overlaps with the existing reservation from September 26 to September 28.
r = Date.new(2013, 9, 28)..
Date.new(2013, 9, 30)
DB[:reservations].
insert(:room_id=>1,
:reservation_dates=>r)
Error raised!
When the exclusion constraint fails, Sequel will recognize it as an exclusion constraint violation, and raise a specific exception class. You can then rescue that exception class in your application code and be confident that the exclusion constraint is what caused the error.|You can then alert the user that the room is already reserved during that time frame, and offer to book a different room at the same time or the same room at a different time.
Sequel::Postgres::
ExclusionConstraintViolation
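A minimal sketch of that rescue, with the user alerting left as a comment:
begin
  DB[:reservations].
    insert(:room_id=>1, :reservation_dates=>r)
rescue Sequel::Postgres::ExclusionConstraintViolation
  # room already reserved for an overlapping period;
  # offer a different room or a different time
end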
So let me briefly review the types I discussed.
First was the array type, good for keeping multiple instances of the same type of data, instead of using a separate table.
Next was the hstore type, for keeping simple attribute keys and values in a single column.
Then I talked about the JSON type, for fully schemaless documents.
After that I talked about composite types, for the nesting benefits of using documents, but with the data consistency benefits that come from using a defined schema.
Finally, I talked about range types and how using them with exclusion constraints can make it much easier to automatically protect against overlapping values.
So that brings us to the second part of the presentation, where I'll be talking about some of the advanced query types that Postgres supports.
The first type of query I'll be talking about is the common table expression query.
A basic common table expression is a way to give a subquery a name for the duration of a single query, equivalent to defining a temporary view.|In this example, we are creating a named subquery called current_reservations, which selects the rows in the reservations table that include the current date. The outer query then selects from that subquery, and then joins to the rooms table.|Note that in this example, this doesn't give you any additional features that you didn't already have, since you could just filter the rows in the main query.|So why did I spend time discussing this? Well, it turns out that while basic common table expressions are uninteresting, they have an interesting variant.
WITH current_reservations AS (
  SELECT * FROM reservations
  WHERE reservation_dates @> CURRENT_DATE)
SELECT * FROM current_reservations
JOIN rooms ON (room_id = rooms.id)
That variant is the recursive common table expression. A recursive expression has similar syntax to a basic common table expression, but it recursively queries itself, using the output from the previous iteration as input to the following iteration. When no more output is produced, the combined output from all iterations is returned as the result set of the view.|That's hard to reason about abstractly, so let me give you a concrete example.
Here's the simplest possible recursive common table expression I can think of. Note the use of RECURSIVE to signal that this is a recursive query. This query will return ten rows with a value starting at 1 and ending at 10. Let me break down how it works.|First, as you can see here, recursive expressions are broken into two parts separated by a union. The part before the union is referred to as the base case, and cannot refer to the recursive expression internally.|The part after the union is the recursive case, and when it refers to the recursive expression, for the first iteration it receives the output of the base case. On future iterations, it receives the results of the previous iteration. This continues until the recursive part produces no output.
WITH RECURSIVE a AS (
  SELECT 1 AS b
  UNION ALL
  SELECT a.b + 1
  FROM a
  WHERE a.b < 10)
SELECT * FROM a
Here is the base case. This is pretty simple, it just returns a single column with the value of 1.
WITH RECURSIVE a AS (
SELECT 1 AS b
UNION ALL
SELECT a.b + 1
FROM a
WHERE a.b < 10)
SELECT * FROM a
Here is the recursive case. The first time the recursive case runs, the value of a.b will be 1, since that is what the base case returned. Since a.b is less than 10, this will return 1 row, with the value of a.b plus 1, which is 2.|Since it returned output, another iteration will run, this time where a.b is 2. These iterations will keep running until a.b is 10. When a.b is 10, the recursive case will not return any rows, at which point the recursion stops.
WITH RECURSIVE a AS (
SELECT 1 AS b
UNION ALL
SELECT a.b + 1
FROM a
WHERE a.b < 10)
SELECT * FROM a
When the recursion has finished, all output from both the base case and all recursive case iterations are combined into a single result set, which is returned. That result set will be all numbers between 1 and 10.
WITH RECURSIVE a AS (
SELECT 1 AS b
UNION ALL
SELECT a.b + 1
FROM a
WHERE a.b < 10)
SELECT * FROM a
If a simple counter is fun, surely using a recursive expression to return all fibonacci numbers under 100 is even more fun. You can do that by iterating using two columns, where each iteration returns the larger number and the sum of the two numbers from the previous iteration. Since these examples might be too simplistic, let's move on to a real example.
WITH RECURSIVE a AS (
  SELECT 1 AS b, 1 AS c
  UNION ALL
  SELECT a.c, a.b + a.c
  FROM a
  WHERE a.b < 100)
SELECT b FROM a
Let's say you are using the reservations table we discussed earlier, storing a date range. You get a new 5-day reservation and attempt to save it, but it turns out that it overlaps with an existing reservation, so an ExclusionConstraintViolation is raised. What should you do?|Well, to be helpful to the user, you may want to scan the reservations table and try to find the next available 5-day reservation that doesn't overlap with an existing reservation. How can you do that?
Sequel::Postgres::
ExclusionConstraintViolation
If you guessed using a recursive expression, have the person next to you pat you on the back. Here's example code that does that. While this is a fairly complex example, I'll try to break down how it works.
WITH RECURSIVE range AS (
  SELECT CURRENT_DATE AS i,
         daterange(CURRENT_DATE, (CURRENT_DATE + 5)) AS r
  UNION ALL
  (SELECT (SELECT max(upper(reservation_dates))
           FROM reservations
           WHERE (reservation_dates &&
                  daterange((range.i), (range.i + 5)))),
          daterange((range.i), (range.i + 5))
   FROM range
   WHERE ((range.i - CURRENT_DATE) < 30)))
SELECT r FROM range WHERE (i IS NULL)
As usual, we'll start with the base case, which returns two values. The first value is the date to start looking for reservations, which for this example is the current date. The second value is the starting range of dates we are looking in, which for this example is the 5-day date range starting on the current date.
WITH RECURSIVE range AS (
  SELECT CURRENT_DATE AS i,
         daterange(CURRENT_DATE, (CURRENT_DATE + 5)) AS r
  UNION ALL
  (SELECT (SELECT max(upper(reservation_dates))
           FROM reservations
           WHERE (reservation_dates &&
                  daterange((range.i), (range.i + 5)))),
          daterange((range.i), (range.i + 5))
   FROM range
   WHERE ((range.i - CURRENT_DATE) < 30)))
SELECT r FROM range WHERE (i IS NULL)
To test for overlap, in the recursive case we create a 5-day date range using the starting date returned by the previous iteration. We test for overlap using a subquery of the reservations table, comparing the date range we created in this iteration to the existing date ranges in the reservation_dates column.
WITH RECURSIVE range AS (
  SELECT CURRENT_DATE AS i,
         daterange(CURRENT_DATE, (CURRENT_DATE + 5)) AS r
  UNION ALL
  (SELECT (SELECT max(upper(reservation_dates))
           FROM reservations
           WHERE (reservation_dates &&
                  daterange((range.i), (range.i + 5)))),
          daterange((range.i), (range.i + 5))
   FROM range
   WHERE ((range.i - CURRENT_DATE) < 30)))
SELECT r FROM range WHERE (i IS NULL)
We select two values in the recursive case. The first value is the maximum upper bound of any overlapping date range. If there is overlap, this will be the starting point of the next iteration. If there is no overlap, this will be NULL. We choose this value because it allows us to skip over days inside the overlapping range, which we know would also overlap.|We also select the new date range we used to test for overlap. On the next iteration, if there was no overlap on the previous iteration, the WHERE clause evaluates to NULL and the recursion will stop.|We limit our search to 30 days in the future so that it only searches the time frame we are interested in.
WITH RECURSIVE range AS (
  SELECT CURRENT_DATE AS i,
         daterange(CURRENT_DATE, (CURRENT_DATE + 5)) AS r
  UNION ALL
  (SELECT (SELECT max(upper(reservation_dates))
           FROM reservations
           WHERE (reservation_dates &&
                  daterange((range.i), (range.i + 5)))),
          daterange((range.i), (range.i + 5))
   FROM range
   WHERE ((range.i - CURRENT_DATE) < 30)))
SELECT r FROM range WHERE (i IS NULL)
When the recursion stops, the common table expression will return the rows used for all iterations. The only row we care about is the one where there is no overlap, which is when the first value is NULL, so we filter for rows where the first value is NULL, and select the second value, which is the non-overlapping date range.|The advantage of using this approach to find open reservations is that it is fairly fast as it runs completely on the server, it skips unnecessary testing of date ranges that we know would overlap, it returns as soon as it finds an open reservation, and it is probably easier than writing a function in a database procedural language that does the same thing.
WITH RECURSIVE range AS (
  SELECT CURRENT_DATE AS i,
         daterange(CURRENT_DATE, (CURRENT_DATE + 5)) AS r
  UNION ALL
  (SELECT (SELECT max(upper(reservation_dates))
           FROM reservations
           WHERE (reservation_dates &&
                  daterange((range.i), (range.i + 5)))),
          daterange((range.i), (range.i + 5))
   FROM range
   WHERE ((range.i - CURRENT_DATE) < 30)))
SELECT r FROM range WHERE (i IS NULL)
As you've seen in the previous example, recursive expression syntax can be somewhat complex, and it does take some time to learn how to write recursive queries. However, as hockey great Bobby Orr used to say, Forget about style, worry about results.
Well, Sequel doesn't always forget about style. It offers helper methods for writing recursive queries. The SQL in the previous example wasn't written by hand, it was generated from this Sequel dataset.|Sequel has a with_recursive method to generate recursive expressions. You pass this method the name of the recursive expression, a dataset for the base case, represented here with a red background, and a dataset for the recursive case, represented here with a white background.
range = proc{|d| Sequel.function(
  :daterange, d, Sequel.+(d, 5))}
c = :reservation_dates
cd = Sequel::CURRENT_DATE

DB[:range].
  select(:r).
  where(:i=>nil).
  with_recursive(:range,
    DB.select(cd.as(:i), range[cd].as(:r)),
    DB[:range].
      select{[DB[:reservations].
                select{max(upper(c))}.
                where(Sequel.pg_range(c).
                  overlaps(range[:range__i])),
              range[:range__i]]}.
      where{range__i - cd < 30})
Note that this Sequel code is more concise than the SQL it generates, because it is able to reuse expressions. There are three separate date range constructions in the SQL, but the Sequel code builds all three with the same proc.
range = proc{|d| Sequel.function(
  :daterange, d, Sequel.+(d, 5))}
c = :reservation_dates
cd = Sequel::CURRENT_DATE

DB[:range].
  select(:r).
  where(:i=>nil).
  with_recursive(:range,
    DB.select(cd.as(:i), range[cd].as(:r)),
    DB[:range].
      select{[DB[:reservations].
                select{max(upper(c))}.
                where(Sequel.pg_range(c).
                  overlaps(range[:range__i])),
              range[:range__i]]}.
      where{range__i - cd < 30})
It's also easier to change the starting date or the number of days in the reservation by modifying a single value instead of modifying values in multiple places in the SQL.
range = proc{|d| Sequel.function(
  :daterange, d, Sequel.+(d, 5))}
c = :reservation_dates
cd = Sequel::CURRENT_DATE

DB[:range].
  select(:r).
  where(:i=>nil).
  with_recursive(:range,
    DB.select(cd.as(:i), range[cd].as(:r)),
    DB[:range].
      select{[DB[:reservations].
                select{max(upper(c))}.
                where(Sequel.pg_range(c).
                  overlaps(range[:range__i])),
              range[:range__i]]}.
      where{range__i - cd < 30})
The previous examples I've given haven't shown this, but recursive expressions allow Postgres to run the same types of queries that you previously would use a graph database for. You can write queries that traverse trees and graphs and return all related objects.
Sequel ships with a model plugin that uses recursive expressions for tree traversal, called rcte_tree.
rcte_tree
You just load the plugin into your model class, and it adds methods to load all ancestors or descendants for a given node or set of nodes in the tree in a single recursive query.
class Node < Sequel::Model
  plugin :rcte_tree
end
If you call the descendants method, it runs a recursive query similar to this. The base case starts with selecting all children of the current node, which are nodes where the parent_id is 2. The recursive case selects children of the output of the previous case, until there are no such children.
Node[2].descendants

WITH t AS (
  SELECT * FROM nodes
  WHERE (parent_id = 2)
  UNION ALL
  SELECT nodes.*
  FROM nodes
  INNER JOIN t ON (t.id = nodes.parent_id)
) SELECT * FROM t
Sequel even has the ability to eagerly load only the first few levels of descendants, instead of loading all descendants. This can be a good middle ground between loading all descendants at once, and loading each level of descendants in a separate query.|This query is similar to the previous query, except we are keeping track of the root node separately as we are eagerly loading, and we are incrementing the level on each iteration and only returning results up to the level selected.
Node.
  where(:id=>2).
  eager(:descendants=>3)

WITH t AS (
  SELECT parent_id AS root, nodes.*, 0 AS level
  FROM nodes
  WHERE (parent_id IN (2))
  UNION ALL
  SELECT t.root, nodes.*, t.level + 1 AS level
  FROM nodes
  INNER JOIN t ON (t.id = nodes.parent_id)
  WHERE t.level < 2
) SELECT * FROM t
If you are familiar with nested sets, using recursive queries to load trees allows basically all of the features of nested sets, but it has much simpler storage, a much simpler implementation, and is orders of magnitude faster in many use cases.
I'd now like to discuss another interesting query type that Postgres supports called window functions. Window functions are similar to aggregate functions, but instead of combining all grouped input rows into a single output row, they allow you to perform an aggregate calculation for a specified grouping related to the current row, called a frame, without grouping multiple input rows into a single output row.
Let's consider the following dataset, which lists some hockey equipment with an assigned category and price. Let's say you want to return this data, with the difference between the price of the item and the average price of the item in the same category. What's the easiest way to accomplish that?
Category | Name | Price |
---|---|---|
1 | Elbow Pad | 1000 |
1 | Shin Guard | 1500 |
1 | Helmet | 2000 |
1 | Gloves | 1200 |
2 | Jersey | 500 |
2 | Socks | 200 |
2 | Mouth Guard | 250 |
2 | Puck | 150 |
3 | Stick | 2500 |
3 | Skates | 3500 |
If you guessed using a window function, pat the person next to you on the back. An SQL window function looks similar to a regular SQL function, but with the addition of the OVER clause.|In this case, the OVER clause partitions by category to separate the groupings, so that the average price is computed per group, and that average is subtracted from the item's price.
SELECT *, price - avg(price) OVER (PARTITION BY category)
FROM items
Here are the results of such a query. Notice how the most expensive items in each category, highlighted in red, have a positive difference, while the cheapest items in the category, highlighted in white, have a negative difference.|What if you want to order the items by their price relative to their category, so that the cheapest items in each category appear first?
Category | Name | Price | Difference |
---|---|---|---|
1 | Elbow Pad | 1000 | -425 |
1 | Shin Guard | 1500 | 75 |
1 | Helmet | 2000 | 575 |
1 | Gloves | 1200 | -225 |
2 | Jersey | 500 | 225 |
2 | Socks | 200 | -75 |
2 | Mouth Guard | 250 | -25 |
2 | Puck | 150 | -125 |
3 | Stick | 2500 | -500 |
3 | Skates | 3500 | 500 |
There's a window function for that called row_number. In this case, in addition to partitioning by category, we are also ordering the window function by price. When we used the window function to compute an average, we didn't care about ordering, but the row_number function depends on the order of rows, so we need to specify it.|Finally, we order the entire result set by row_number and then category, to ensure that the cheapest items in each category are listed first.
SELECT *, row_number() OVER (PARTITION BY category ORDER BY price)
FROM items
ORDER BY row_number, category
Running that query, you get the following output. Notice how the cheapest items in each category are all on top. More expensive items follow, ordered by their price relative to the price of other items in their category.
Category | Name | Price | Row Number |
---|---|---|---|
1 | Elbow Pad | 1000 | 1 |
2 | Puck | 150 | 1 |
3 | Stick | 2500 | 1 |
1 | Gloves | 1200 | 2 |
2 | Socks | 200 | 2 |
3 | Skates | 3500 | 2 |
1 | Shin Guard | 1500 | 3 |
2 | Mouth Guard | 250 | 3 |
1 | Helmet | 2000 | 4 |
2 | Jersey | 500 | 4 |
As you might expect, Sequel has a helper for creating window functions. The previous window function SQL wasn't written by hand, it was generated from this Sequel dataset. Using Sequel's syntax makes it easy to create complex SQL window functions using ruby objects instead of string interpolation.
DB[:items].
  select_append{
    row_number(:over,
      :order=>:price,
      :partition=>:category){}
  }.
  order(:row_number, :category)
Not only does Sequel allow you to use a DSL to create window functions, it uses window functions internally to efficiently support the eager loading of model associations with limits.
Consider the following association for the HockeyPlayer class, allowing you to get the first 10 games played by the hockey player.
class HockeyPlayer < Sequel::Model
one_to_many :first_10_games,
:class=>:Game,
:order=>:date,
:limit=>10
end
For a single hockey player, this association works as you would expect, just limiting the number of returned games to 10 using the LIMIT clause. However, this approach cannot be used for eager loading.
player.first_10_games
SELECT *
FROM games
WHERE (player_id = 1)
ORDER BY date
LIMIT 10
If you tried to use the same approach for eager loading, you would end up with the SQL used here. The reason this doesn't work is that the LIMIT 10 clause here applies to the entire result set, giving you the first 10 games played by any of the team members, when you want the first 10 games played by each team member.
HockeyPlayer.
  where(:team_id=>1).
  eager(:first_10_games)

SELECT *
FROM games
WHERE (player_id IN (1, 2, 3, 4, 5))
ORDER BY date
LIMIT 10
Thankfully, Sequel handles this situation for you automatically using a window function, similar to the example I gave about ordering hockey equipment. Sequel uses a subselect that returns the row_number in addition to the rest of the query results, and then has the main select remove rows where the row_number is over the limit for the association.|The reason for this slightly convoluted approach of using a subselect and filtering results in the outer select is because SQL window functions are not allowed in WHERE clauses.
SELECT * FROM (
  SELECT *, row_number() OVER (
    PARTITION BY player_id ORDER BY date)
  FROM games
  WHERE (player_id IN (1, 2, 3, 4, 5))
) AS t1
WHERE (row_number <= 10)
The next query type I'd like to discuss is the use of the RETURNING clause. Unlike standard SQL, where only SELECT can return rows, Postgres supports the RETURNING clause on INSERT, UPDATE, and DELETE statements, which returns the rows inserted, updated, or deleted.
If you want to insert a row into a table, and have the values of the inserted row returned, you can use a query like this, with RETURNING *. This is useful as it automatically returns any default values set by the database.|While you could get the information separately using a SELECT query, this combines the INSERT and SELECT into a single query, which is faster and avoids race conditions.
INSERT INTO players (name)
VALUES ('Jeremy')
RETURNING *
By default, when you create new model instances, Sequel will use the primary key of the inserted row to retrieve the inserted row values with a SELECT query.|However, since Postgres supports RETURNING, Sequel does the insert and select in a single query that uses RETURNING *, speeding up all model creation automatically.
Sequel::Model.create
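As a rough sketch of what this looks like on Postgres (Album is a hypothetical model):
album = Album.create(:name=>'RC')
# INSERT INTO "albums" ("name") VALUES ('RC') RETURNING *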
Additionally, Sequel datasets have a returning method, allowing you to specify which columns to return in insert, update, and delete queries. This allows you to express the previous SQL query as a simple dataset.
DB[:players].
  returning.
  insert(:name=>'Jeremy')
You can use this syntax to do things like update archive tables by automatically inserting rows deleted from one table into another table.|This example uses a common table expression with DELETE RETURNING, and has the outer query insert the output of the common table expression into the archive table.
DB[:old_players].
  with(:deleted_players,
    DB[:players].
      where(:name=>'Jeremy').
      returning.
      with_sql(:delete_sql)).
  insert(DB[:deleted_players])
The last query type I'd like to talk about is full-text search. Unlike most SQL databases, Postgres has good built-in full text search features, with full transactional support.
Here is the basic full text search query in Postgres. You wrap the column containing the text to search in a tsvector, and the terms you are searching for in a tsquery, and use the @@ operator to perform the search.|The first argument to these functions is the language you want to use for the search. Since that is pretty ugly, Sequel simplifies it for you.
SELECT * FROM players
WHERE to_tsvector('english', biography) @@
      to_tsquery('english', 'example text');
Sequel has a full_text_search method that takes the column to search, the search terms, and accepts the language as an option so you don't need to specify it twice.
DB[:players].
  full_text_search(
    :biography,
    'example text',
    :language=>'english')
One of the best things about the full text search in Postgres is how flexible it is. If you remember the KaeruEra example I discussed earlier, let's say I want to add full text searching to the backtrace column, which is an array, and the environment variable column, which is an hstore.|We can convert the backtrace array to a string, and the environment hstore to an array and then to a string, and then do full text searching on the concatenation of those two strings.|Now, all these conversions are going to be slow for large numbers of records by default. To speed things up, you'll have to add an index.
DB[:errors].
  full_text_search(
    Sequel.expr{[
      array_to_string(backtrace, ' '),
      array_to_string(hstore_to_array(env), ' ')
    ]},
    'example text',
    :language=>'english')
Adding a full text index in Sequel is very easy, since the syntax is very similar to the syntax you use for a full text search. The great part about this is that you don't need to actually store the string containing the concatenated full text of both the backtrace array and the environment hstore in a column, which saves space and makes querying faster and easier. Once you set up a full text index, full text searches are generally very fast.
DB.alter_table(:errors) do
  add_full_text_index(
    Sequel.expr{[
      array_to_string(backtrace, ' '),
      array_to_string(hstore_to_array(env), ' ')
    ]},
    :language=>'english',
    :name=>:errors_ftidx)
end
Let me briefly review the query types I discussed.
First is the recursive common table expression, allowing for incremental scanning of tables and efficient traversal of trees and graphs.
Next was window functions, for performing aggregate calculations on rows without combining them, and useful for limited eager loading.
After that, I talked about the RETURNING clause, used to return rows from INSERT, UPDATE, and DELETE statements.
Finally, I talked about the advanced full text search features offered by Postgres.
That brings us to the final section of the talk, which will cover some other interesting features of Postgres.
The first feature I'm going to discuss in this section is defining foreign key and check constraints with delayed validation, using NOT VALID. NOT VALID constraints fix a long standing issue: adding a constraint to a table normally requires a scan of the table, during which the table is locked, to ensure that no existing row violates the constraint, and if any row does, adding the constraint fails.|However, when the constraint is marked NOT VALID, the table scan and lock are skipped. The constraint is still enforced for all new rows and updates to existing rows, so you can be sure that new invalid data is not introduced. After adding the constraint, you can clean up the existing invalid data by updating invalid rows to no longer violate the constraint.
NOT VALID
After you have removed the invalid rows from the table, you can then validate the constraint using VALIDATE CONSTRAINT. This still requires a table scan to ensure existing rows do not violate the constraint, but it requires a lesser lock on the table.
VALIDATE CONSTRAINT
Sequel supports both of these syntaxes, with a :not_valid option when defining constraints, and a validate_constraint method for validating them later.
DB.alter_table(:players) do
  add_constraint({
      :not_valid=>true,
      :name=>:valid_player_number},
    :number=>0..99)

  validate_constraint \
    :valid_player_number
end
The next feature I'd like to discuss is the use of LISTEN and NOTIFY to send and receive messages on notification channels. This is basically a self contained interprocess communication mechanism, where database clients can send messages to each other. One of the most common uses for this is to notify clients when data has changed.|You can add an insert trigger that uses NOTIFY to notify a channel of all inserted rows, with the primary key of the inserted row. Then you can have another process that listens for these notifications and acts on them.
LISTEN
NOTIFY
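Here is a hedged sketch of such a trigger, assuming a players table with an id primary key; the function and trigger names are made up for illustration:
DB.run <<SQL
CREATE FUNCTION notify_players() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('players', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER players_notify AFTER INSERT ON players
FOR EACH ROW EXECUTE PROCEDURE notify_players();
SQL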
Let's assume that you have added such a trigger, and you want to automatically tweet every time a player is added to your favorite team. You can use Sequel's support for listening on channels to receive notifications on the players table, and then automatically tweet the results. There are a couple things to note here.|First is the arguments to listen, the first of which is the notification channel name. This can also be an array if you want to listen on multiple channels simultaneously. The second argument is the options hash. In this case, the loop option is used, so that you continually listen for notifications, yielding them as they come in.|The second thing to note is the three arguments the block receives. The first is the channel the message was received on. The second is the pid of the backend process that generated the notification. The third is the payload of the message notification, which in this case we assume is the id of the player added.
DB.listen(:players, :loop=>true) do |chan, pid, id|
  p = DB[:players].
    first(:id=>id, :team_id=>2)
  if p
    s = "#{p[:name]} joined HC CSKA Moscow"
    Twitter.update(s)
  end
end
After we receive the notification, we need to check the players table to get the information on the inserted player. Since we only care about players that join our favorite team, we check to see if the player has that team id. If so, we tweet that the player joined our favorite team.|This is a fairly contrived example, but a similar approach can be used to solve a real problem.
DB.listen(:players, :loop=>true) do |chan, pid, id|
  p = DB[:players].
    first(:id=>id, :team_id=>2)
  if p
    s = "#{p[:name]} joined HC CSKA Moscow"
    Twitter.update(s)
  end
end
Phil Karlton, in addition to having an awesome mustache, working on X11, and playing roller hockey, was also famous for the saying that There are only two hard things in Computer Science: cache invalidation and naming things. While this is obviously a tongue-in-cheek comment, it is telling that of all the hard problems he could pick to stand next to naming things, he chose cache invalidation.
It turns out that the combination of LISTEN and NOTIFY is actually great at handling cache invalidation. You have a trigger that uses NOTIFY whenever a table changes, and you then have your caching application use LISTEN to be notified whenever changes are committed, so it can modify the related cache.|Sequel ships with a model plugin called static_cache that caches entire small and infrequently updated model tables, so that many common model operations do not require database queries.|Historically, the static_cache plugin could only be used if the underlying model table was never updated at runtime. However, with some recent changes to the plugin, it has an option that allows the table to be updated at runtime, but it requires the user to manually reload the cache in that case.
static_cache
To make this easier on the user, Sequel ships with a pg_static_cache_updater extension, which makes it easy to define triggers on the underlying model tables that use NOTIFY to signal changes. It also has a method that creates a thread in your application that LISTENs for changes, so that any time the underlying database table for a model changes, all of the application processes automatically update their in-process cache for that model. So while naming things is still hard, cache invalidation becomes a little bit easier.
pg_static_cache_updater
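A rough sketch of the workflow, assuming a hypothetical Team model cached with static_cache:
DB.extension :pg_static_cache_updater
DB.create_static_cache_update_function        # one-time database setup
DB.create_static_cache_update_trigger(:teams) # per cached table

# in each application process, start the listening thread:
DB.listen_for_static_cache_updates([Team])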
The next feature I would like to talk about is prepared transactions. Prepared transactions are also known as two-phase commit, and are used in distributed database transactions to ensure that either all sides commit or neither side commits.|With distributed databases, using a regular transaction is not sufficient. You can't commit before receiving confirmation from the other database, since the other database may not commit, and then you cannot roll the transaction back. You can't wait until after receiving confirmation from the other database to commit, because there may be a deferred constraint violation or some other issue that causes the current database to rollback.|In either of those cases, the two databases would be out of sync, and that's not a good situation to be in.
It just so happens that Sequel has full support for Postgres prepared transactions. To create a prepared transaction, you just use the prepare option to Sequel's transaction method. The value for this option should be the transaction id string related to the current transaction. Then you just write the rest of the transaction code like you would normally.|The difference between a prepared transaction and a regular transaction is that after the transaction block ends, it appears to the outside world like the transaction has not been committed. However, the locks required by the transaction are still in effect, so you can't do things that will cause the prepared transaction to not be committed.
DB.transaction(:prepare=>'foobar') do
# Your code here
end
After you have received confirmation from the other side that the transaction has been committed, you can call the commit_prepared_transaction method with that transaction id string to commit the transaction you just prepared.
DB.commit_prepared_transaction('foobar')
If for some reason the other side was not able to successfully commit the transaction, you can roll it back on the current database using the rollback_prepared_transaction method. This ensures that the two databases are kept in sync.
DB.rollback_prepared_transaction('foobar')
The final feature I'd like to talk about is Postgres's support for procedural languages. Unlike most SQL databases, which support a single procedural language or maybe two, Postgres supports many different procedural languages.
Here are some of the procedural languages supported by Postgres. The languages in red ship with Postgres, and the ones in black are available as external downloads. Hopefully some languages here are familiar to you.
Unfortunately, as you can see, I put a line through Ruby. While there is support for using ruby as a procedural language, in my experience it's not stable and shouldn't be used.
Another interesting feature is the ability to execute anonymous functions at runtime, without defining them separately, using a DO statement.
Naturally, Sequel supports this, using the do method. Here's an example that uses an anonymous function written in the plpgsql procedural language to insert a row. What if you want to get a little meta?
DB.do(<<DO, :language=>'plpgsql')
BEGIN
  EXECUTE $$INSERT INTO players (name) VALUES ('Jeremy');$$;
END
DO
Let's say you want to have your anonymous plpgsql function call an anonymous function written in perl to insert the row.
DB.do(<<DO, :language=>'plpgsql')
BEGIN
  EXECUTE $$DO LANGUAGE plperl $pl$
    spi_exec_query(<<'PL')
INSERT INTO players (name) VALUES ('Jeremy');
PL
$pl$;$$;
END
DO
Or have that perl function call an anonymous function written in python.
DB.do(<<DO, :language=>'plpgsql')
BEGIN
  EXECUTE $$DO LANGUAGE plperl $pl$
    spi_exec_query(<<'PL')
DO LANGUAGE plpythonu $py$
plpy.execute("""INSERT INTO players
(name) VALUES ('Jeremy');""")
$py$
PL
$pl$;$$;
END
DO
Or have that python function call an anonymous function written in javascript.
DB.do(<<DO, :language=>'plpgsql')
BEGIN
  EXECUTE $$DO LANGUAGE plperl $pl$
    spi_exec_query(<<'PL')
DO LANGUAGE plpythonu $py$
plpy.execute("""DO LANGUAGE plv8 $js$
plv8.execute("INSERT INTO players" +
  "(name) VALUES ('Jeremy');");
$js$""")$py$
PL
$pl$;$$;
END
DO
Or even have that javascript function run some shell code. I've tested this and it actually works. So this is an anonymous plpgsql function calling an anonymous perl function calling an anonymous python function calling an anonymous javascript function running some shell code that uses the psql program to insert a row into the database.|The hardest part in getting this to work was getting the quoting correct. This makes extensive use of Postgres dollar quoting, which is similar to but more flexible than ruby heredoc syntax.
DB.do(<<DO, :language=>'plpgsql')
BEGIN
  EXECUTE $$DO LANGUAGE plperl $pl$
    spi_exec_query(<<'PL')
DO LANGUAGE plpythonu $py$
plpy.execute("""DO LANGUAGE plv8 $js$
plv8.execute("DO LANGUAGE plsh " +
  "$sh$#!/bin/sh\\\\npsql -c\\\\" " +
  "INSERT INTO players (name) " +
  "VALUES ('Jeremy');\\\\"$sh$");$js$
""")$py$
PL
$pl$;$$;
END
DO
So that finishes up the final section of this presentation. Let me briefly review what I talked about.
I talked about adding constraints and delaying validation until after you have had a chance to clean up existing invalid rows.
NOT VALID
I talked about the use of LISTEN and NOTIFY for simple interprocess communication and cache invalidation using Postgres notification channels.
LISTEN
NOTIFY
I talked about using prepared transactions to keep distributed databases in sync.
Finally, I talked about Postgres's support for many procedural languages, and using anonymous functions written in those languages.
That is pretty much all of the Postgres features I wanted to talk about. I'd like to leave you with a quote from one of the greatest hockey players, Wayne Gretzky, who said that a good hockey player plays where the puck is. A great hockey player plays where the puck is going to be.|When you are designing an application where the data actually matters, when it comes time to choose a database, you can't just think about today, but about many years in the future. In any situation where the data you are dealing with is important, your database will probably last longer than your application.|Over the last 10 years, I've watched Postgres get better with every single release. I fully believe Postgres is a great database choice for most applications, and that if you are writing your application in ruby, that Sequel is the best way to access it. Hopefully this presentation has shown off some of the cool things you can do with Postgres.|If you have any questions about anything I've covered, I'll be happy to answer them now.
Photo Credits
Gordie Howe photo (c) Jordan Cooper http://www.flickr.com/photos/jordoncooper/
Bobby Orr photo (c) Andrew Reilly http://www.flickr.com/photos/reillyandrew/
Wayne Gretzky photo (c) Martin Labelle http://www.flickr.com/photos/bestter/
Other hockey photos (c) Dinur Blum http://www.flickr.com/photos/dinur/