Surprises in SQL – State-of-the-art options in the standard query language
New Voice

© Lead Image © Olly, fotolia.com
In recent times, many developers have come to view SQL as inflexible and limited, but this classical database language has some tricks and special features that many users don't know about.
State-of-the-art SQL can do much more than you might think. Despite its popular image as a fairly limited database tool, SQL is no longer restricted to the relational data model but can also handle nested objects and structured documents, features more commonly associated with later technologies like NoSQL. Of course, it all depends on what you call SQL. Not all vendors implement all the features of the various SQL standards that have appeared through the years. In this article, I take you on a tour of some of interesting tricks available through standards-based SQL.
92 and 99
The SQL:92 standard is the starting point for the complete, classical SQL database system we think of today. SQL:92 was already the second major version of the SQL standard, and it achieved a certain level of completeness as an embodiment of the classic relational model. However, developers knew even back in 1992 that the relational model is not ideal for all data.
The third major version of the SQL standard in 1999 brought an end to the plain vanilla relational SQL. All signs pointed to object-oriented programming. The standard featured the concept of the object-relational database, but a couple of years too late as it turned out. Object-relational mappers (ORMs) had already begun to build a bridge between object-oriented programming and the relational data model.
SQL:1999 also introduced some other new features, such as loops. Even though SQL:1999 broke with many traditions, it still remained a declarative language. In other words, it is impossible by definition to tell a database how to execute a query. As long as the results are okay, the database has full freedom. This declarative nature made it difficult for the standard to address loops, because a loop defines the solution's approach. The trick SQL used in SQL:1999 was to define constructs that can only be executed as loops. Programmers can then use these constructs as loops, which should not really exist in a declarative language.
As an example, consider the problem of converting the PHP and SQL code from Listing 1 into pure SQL. The example starts by loading a list of categories into a PHP array and then runs a further query for each category; the query returns the three most popular products from each category.
Listing 1
PHP SQL Pseudocode
Performance-conscious users could object at this point that placing database queries in loops of an imperative language – PHP in this case – could cause trouble. A join is typically a better option to make sure that the database returns all the required data at the same time. In this case, there is a problem: the example cannot be handled with a simple join. The examples from Listing 2 show that, no matter how you twist and turn, the result is always the three most popular products all told – and not the three most popular per category.
Listing 2
Two Attempted Joins
The problem is that LIMIT
does not act in a category-specific way. To achieve the desired results, you need to run LIMIT
in a subquery that is restricted to a specific category (Listing 3). However, this is no longer valid SQL; subqueries in the FROM
clause cannot access data external to the subquery.
Listing 3
Invalid Subquery
The WHERE
clause that uses k.category
to access a table external to the subquery is thus invalid. Or at least, that was the case in SQL:92. SQL:1999 supports this kind of access if the user precedes the subquery with the new LATERAL
keyword (see Listing 4).
Listing 4
With LATERAL
This query is equivalent to the PHP code in Listing 1, except that the database executes the loop itself and the latencies between the application and the database are thus avoided. Figure 1 shows the similarities between the Foreach
loop and LATERAL
.
Another advantage of using SQL for everything is that the user can process the results with SQL downstream. For example, you could sort the overall results differently using an ORDER BY
clause or write the results directly to a caching table with INSERT INTO
… SELECT
… . The latter case avoids transporting data from the database to the application and then back to the database.
The benefit of being able to process the results with SQL farther down the line applies to any SQL query, of course. As a general rule, users should not assume that SQL databases are just storage bins. Data processing is often easier with SQL than with other programming languages. The results are typically more correct and the performance better. This approach only fails when faced with massively parallel access, such as experienced by Google and Facebook. If you are that big, however, you will definitely have the means to build a proprietary solution, and until you get there, the flexibility that SQL offers is often the better approach.
SQL:1999 introduced a second construct that can also be used like a loop: WITH RECURSIVE
. The details of how this works are fairly complex and well beyond the scope of this article, but Figure 2 shows the basics.
The WITH RECURSIVE
variant has three benefits:
- It is better supported by today's databases than
LATERAL
– for example, it is also supported by SQLite (see Figure 3). - Data can pass from one iteration to the next.
- It is possible to formulate a dynamic termination condition.
The disadvantage is that the loop body cannot be transferred one-to-one to SQL: It must be merged with the part after UNION ALL
.
WITH RECURSIVE
supports some important use cases. The query shown in Figure 2 is a row generator; it simply returns 10 numbered rows – very practical for generating test data. A more important use case is traversing graphs, such as finding the shortest connection between two persons on a social network.
Rapid Steps – SQL:2003
After ditching relational-only thought patterns in SQL:1999, in only took four years for the next major revision of the SQL standard. The focus of SQL:2003 was on two points: XML and analytical functions.
XML support is interesting because SQL databases became document stores, as one would call them today. In SQL:2003, users can store XML both as text and as validated documents that can be processed using SQL and XQuery.
Although some databases support the XML extension today, XML was unable to assert itself in web development. Its competitor JSON was simply too attractive. All popular SQL databases have introduced features for handling JSON documents in recent years. These extensions are purely proprietary – each database offers a different feature set.
But the second SQL:2003 focus – analytic functions – has asserted itself. In particular, the window functions are supported by many databases today, and they vastly simplify data preparation.
A window function lets the database programmers use aggregate functions, such as SUM
or COUNT
, without GROUP BY
. Of course, you still need to define the rows to use for the aggregate, but if you don't use GROUP BY
, because rows are grouped, you can now use the new OVER
clause directly after the aggregate function.
The following query clarifies the effect: it adds an additional column to a query (SELECT *
) that has the number of rows in the overall result when done:
SELECT * , COUNT(*) OVER () FROM [...]
It makes no difference how you proceed: JOIN
, WHERE
, HAVING
, GROUP BY
, ORDER BY
– everything is possible. The COUNT(*) OVER()
window does not have any side effects on the rest of the query. The empty bracket in the OVER
clause means the COUNT
function runs against all the rows of the result.
The window function essentially says: Count all rows! The result of this query is thus an additional column containing the number of rows in the result. To clarify things once again, the row count comes with each row and is thus returned multiple times; however, this does not mean it is determined multiple times!
Of course, there is no need to return the same results umpteen times with the window function. You can use the OVER
clause to define the rows against which to run the function. The most important tool for defining the rows is PARTITION BY
:
SELECT * , COUNT(*) OVER (PARTITION BY category) FROM [...]
You can read the preceding expression as: Count the rows with the same value in the category column!. In other words, PARTITION BY
delimits the rows just like GROUP BY
, but it does not group the rows like GROUP BY
does; instead, it simply specifies the rows to which the window function is applied.
If you have multiple rows of the same category, you will receive the same result in each case of these rows. Remember that a window function does not have any side effects on the remaining results of the query. In particular, this means that multiple window functions can be used in a single query without them influencing one another. As an example, if you need both the number of rows in the overall results, and the number of rows per category, you can use the relevant above examples together in a query.
The rows that are visible for a window function can be further delimited if the rows are sorted. That allows OVER
clauses to mean something like, all rows before or three rows before to rows after. "Before" and "after" refer to a sort order, which is freely definable using ORDER BY
in the OVER
clause. This means you can, say, compute a subtotal (Sum of all rows before) or a moving average (Average of three rows before to three rows after) Listing 5); value
and time
are the column names in this example.
Listing 5
ORDER BY with OVER()
The formulation of the BETWEEN
range follows similar rules to BETWEEN
in the WHERE
clause: The start must be somewhere before the finish, and the specified values themselves are part of the range. The AVG
function in the example thus generates the average value of up to seven rows: the three rows before, the current row itself, and the next three rows. If you don't have enough rows to fill the window, then it is smaller – this would be the case in the first row, for example, which cannot be preceded by three rows.
You can define the window size by row, but also by value range. For example, you can define windows that cover three days before and three days after – no matter how many rows these days cover (Listing 6) – just define the BETWEEN
range with RANGE
instead of ROWS
.
Listing 6
With Value Ranges
In addition to the well-known aggregate functions, SQL:2003 also introduced ranking functions that strictly require an OVER
clause with ORDER BY
.
Not Just Aggregate Functions
These ranking functions in particular are ROW_NUMBER
, RANK
, and DENSE_RANK
. As the name would suggest, ROW _NUMBER
lets you enumerate rows. RANK
and DENSE_RANK
are less intuitive. Both return a rank as per the ORDER BY
clause. RANK
and DENSE_RANK
differ from ROW_NUMBER
in that ex aequo placements (ties) with RANK
and DENSE_RANK
take the same rank, as is typical in sports. In other words, you could have two first-place contenders.
RANK
and DENSE_RANK
differ in the question of how to rank the next contender. RANK
omits a placement in this case – thus returning 3
; DENSE_RANK
uses placements without gaps – that is, 2
in this example.
SQL:2011 introduced more window functions for access to individual rows of a sorted data window. LAG
and LEAD
let the user access the previous and next row. FIRST_VALUE
, LAST_VALUE
, and NTH_VALUE
provide access to the first, last, or nth row of the window. The example in Listing 7 illustrates this function by means of a competition.
Listing 7
LAG and FIRST
The first column determines the placement using RANK
. This step is followed by the name and the number of points achieved. The last two columns show the gap to the competitor just ahead and to the winner. The query uses LAG
to access the score of the competitor in front and FIRST_VALUE
to see how many points the winner has. The number of points achieved by the current candidate is deducted (-points
) to give you the required gaps.
A word of warning on this example: this OVER
clause uses ORDER BY
without using BETWEEN
to delimit the window. In this case, a default clause takes effect: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
. This default is surprising for two reasons. First, the subsequent rows are excluded from the window. Second, this exclusion is based on RANGE
not on ROWS
. In other words, you still have competitors with the same score in the window.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Direct Download
Read full article as PDF:
Price $2.95
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Find SysAdmin Jobs
News
-
Kubuntu Focus Announces XE Gen 2 Linux Laptop
Another Kubuntu-based laptop has arrived to be your next ultra-portable powerhouse with a Linux heart.
-
MNT Seeks Financial Backing for New Seven-Inch Linux Laptop
MNT Pocket Reform is a tiny laptop that is modular, upgradable, recyclable, reusable, and ships with Debian Linux.
-
Ubuntu Flatpak Remix Adds Flatpak Support Preinstalled
If you're looking for a version of Ubuntu that includes Flatpak support out of the box, there's one clear option.
-
Gnome 44 Release Candidate Now Available
The Gnome 44 release candidate has officially arrived and adds a few changes into the mix.
-
Flathub Vying to Become the Standard Linux App Store
If the Flathub team has any say in the matter, their product will become the default tool for installing Linux apps in 2023.
-
Debian 12 to Ship with KDE Plasma 5.27
The Debian development team has shifted to the latest version of KDE for their testing branch.
-
Planet Computers Launches ARM-based Linux Desktop PCs
The firm that originally released a line of mobile keyboards has taken a different direction and has developed a new line of out-of-the-box mini Linux desktop computers.
-
Ubuntu No Longer Shipping with Flatpak
In a move that probably won’t come as a shock to many, Ubuntu and all of its official spins will no longer ship with Flatpak installed.
-
openSUSE Leap 15.5 Beta Now Available
The final version of the Leap 15 series of openSUSE is available for beta testing and offers only new software versions.
-
Linux Kernel 6.2 Released with New Hardware Support
Find out what's new in the most recent release from Linus Torvalds and the Linux kernel team.