`

Emulating Analytic (AKA Ranking) Functions with MySQL

阅读更多

Emulating Analytic (AKA Ranking) Functions with MySQL

by Stéphane Faroult
03/29/2007

http://www.oreilly.com/pub/a/mysql/2007/03/29/emulating-analytic-aka-ranking-functions-with-mysql.html

One of the most hailed extensions brought to SQL in recent years has been these functions that Oracle calls analytic functions, DB2 calls OLAP functions, and SQL Server 2005 calls ranking functions--but which MySQL, so far, still lacks. The good news is that they can be (relatively) easily and efficiently emulated.

<!----><!----><script language="JavaScript" type="text/javascript"> //<!----><\/script>'); //]]> </script><script language="JavaScript" src="http://ad.doubleclick.net/adj/ttm.oreillynet/mysqlart;sz=336x280;tile=3;ord=2634021333294635.5?" type="text/javascript"></script><script language="JavaScript" src="http://view.atdmt.com/MRT/jview/tchtkvse0120000116mrt/direct;vt.1/01?buster_url=&amp;pub_view_url=&amp;click=" type="text/javascript"> </script><noscript></noscript><script language="javascript" src="HTTP://rmd.atdmt.com/tl/NMMRTUMISVSE/47b24f42c73c4e92b5aa81005850fc75/8d1dff611a584178b583699548bf481b47b24f42c73c4e92b5aa81005850fc75.js?ver=117&amp;atdmt="> </script>
<script src="HTTP://rmd.atdmt.com/tl/NMMRTUMISVSE/47b24f42c73c4e92b5aa81005850fc75/8d1dff611a584178b583699548bf481b47b24f42c73c4e92b5aa81005850fc75a.js?spd=342&amp;atdmt=A4ERecordNumber=0a4edelim"> </script>
<script src="HTTP://rmd.atdmt.com/tl/TopLayer.v2k.js"> </script>
<noscript></noscript><!---->

Oracle introduced these functions early (with version 8.1.6, back in 1999 or about) and still offers a greater number of such functions than the other major players (who usually provide the most basic functions, from which the others can be derived.) Therefore, I will take my references from Oracle, and will show how you can, in most cases, obtain the same result with MySQL.

What Are Analytic Functions?

Anyone who has dabbled with SQL is familiar with aggregate functions such as SUM() or COUNT(). But as their name implies, when you use these functions and perform a group by, the detail is lost in the aggregate. Very simply, analytic functions allow you to return the detail (simple, unaggregated rows) and at the same time, on the very same row, a result that is computed from operations on several rows that are related in one way or another to the current one. Let me give an example. Suppose we have the unavoidable Oracle EMP table:

SQL> desc emp
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------
 EMPNO                                     NOT NULL NUMBER(4)
 ENAME                                              VARCHAR2(10)
 JOB                                                VARCHAR2(9)
 MGR                                                NUMBER(4)
 HIREDATE                                           DATE
 SAL                                                NUMBER(7,2)
 COMM                                               NUMBER(7,2)
 DEPTNO                                             NUMBER(2)

We can easily compute with a group by the total salary amount, department by department:

SQL> select deptno, sum(sal)
  2  from emp
  3  group by deptno;

    DEPTNO   SUM(SAL)
---------- ----------
        30       9400
        20      10875
        10       8750

By doing so, we lose the detail. However, if we use SUM() in an analytical way, we can return each row, and compute on the same row the total salary amount, specifying in the over clause that the SUM() function actually applies to all rows that refer to the same department number as the current one:

SQL> select deptno, ename, sal, sum(sal) over (partition by deptno)
  2  from emp
  3  order by 1, 3
  4  /

    DEPTNO ENAME         SAL     SUM(SAL)OVER(PARTITIONBYDEPTNO)
---------- ---------- ---------- -------------------------------
        10 MILLER           1300                            8750
        10 CLARK            2450                            8750
        10 KING             5000                            8750
        20 SMITH             800                           10875
        20 ADAMS            1100                           10875
        20 JONES            2975                           10875
        20 SCOTT            3000                           10875
        20 FORD             3000                           10875
        30 JAMES             950                            9400
        30 MARTIN           1250                            9400
        30 WARD             1250                            9400
        30 TURNER           1500                            9400
        30 ALLEN            1600                            9400
        30 BLAKE            2850                            9400

14 rows selected.

SQL>

Such a result can be very useful in computing, for instance, what percentage of the salary mass of a department does the Pointy Haired Boss's salary represent? Needless to say, all we need to do with MySQL is to join a regular select to the aggregate to obtain a similar result:

mysql> select a.DEPTNO, a.ENAME, a.SAL, b.TOTSAL
    -> from EMP as a
    ->       inner join (select DEPTNO, sum(SAL) TOTSAL
    ->                   from EMP
    ->                   group by DEPTNO) as b
    ->               on a.DEPTNO = b.DEPTNO
    -> order by 1, 3;
+--------+--------+------+--------+
| DEPTNO | ENAME  | SAL  | TOTSAL |
+--------+--------+------+--------+
|     10 | MILLER | 1300 |   8750 |
|     10 | CLARK  | 2450 |   8750 |
|     10 | KING   | 5000 |   8750 |
|     20 | SMITH  |  800 |  10875 |
|     20 | ADAMS  | 1100 |  10875 |
|     20 | JONES  | 2975 |  10875 |
|     20 | SCOTT  | 3000 |  10875 |
|     20 | FORD   | 3000 |  10875 |
|     30 | JAMES  |  950 |   9400 |
|     30 | WARD   | 1250 |   9400 |
|     30 | MARTIN | 1250 |   9400 |
|     30 | TURNER | 1500 |   9400 |
|     30 | ALLEN  | 1600 |   9400 |
|     30 | BLAKE  | 2850 |   9400 |
+--------+--------+------+--------+
14 rows in set (0.00 sec)

mysql>

We must, though, be careful about one thing: the analytic function applies to the result set defined by the query in which it appears. This means that if we apply a restrictive condition, the sum as computed by the analytic function changes:

SQL> select deptno, ename, sal, sum(sal) over (partition by deptno)
  2  from emp
  3  where job not in ('PRESIDENT', 'MANAGER')
  4  order by 1, 3
  5  /

    DEPTNO ENAME             SAL SUM(SAL)OVER(PARTITIONBYDEPTNO)
---------- ---------- ---------- -------------------------------
        10 MILLER           1300                            1300
        20 SMITH             800                            7900
        20 ADAMS            1100                            7900
        20 SCOTT            3000                            7900
        20 FORD             3000                            7900
        30 JAMES             950                            6550
        30 WARD             1250                            6550
        30 MARTIN           1250                            6550
        30 TURNER           1500                            6550
        30 ALLEN            1600                            6550

10 rows selected.

SQL>

If we want to have the same result in our MySQL implementation, we must apply the restrictive condition on the job description twice, once when computing the aggregate, and once when returning the individual rows. So, those highly praised analytic functions are just shortcuts for joining a table to an aggregate over the same table? Big deal! Well, not quite. Things get a little more complicated when we introduce a minor twist: relating rows in the "window" defined by the over clause to each other, which is usually performed by adding an order by to the clause, and optionally a range restriction.

Adding Ordering to the OVER Clause

SUM() is not the best function to apply an order by to, because whatever the order of the rows, the sum remains the same. It is a different matter when you want to compare rows within a window to each other, for instance ranking them with regard to a particular column. Oracle allows you to use regular aggregate functions in an analytical way, but it also introduces a number of purely analytic functions, of which RANK() can be considered to be the archetype. Actually, the typical question that calls for an analytic function is along the lines of "what are the top two salaries per department?". This is a question that calls for ranking rows relative to each other (hence the name of ranking function used with SQL Server 2005).

The per department in the previous question clearly tells us that we must partition by department. We could have omitted the partition by to apply the function to the whole of the table, but then a simple order by and a limit clause would have done the trick in MySQL. What allows us to assign a particular rank is the salary, which is what we must order by.

SQL> select deptno, empno, ename, sal, rank() over (partition by deptno
  2                                                 order by sal desc) "RANK"
  3  from emp;

    DEPTNO      EMPNO ENAME             SAL       RANK
---------- ---------- ---------- ---------- ----------
        10       7839 KING             5000          1
        10       7782 CLARK            2450          2
        10       7934 MILLER           1300          3
        20       7788 SCOTT            3000          1
        20       7902 FORD             3000          1
        20       7566 JONES            2975          3
        20       7876 ADAMS            1100          4
        20       7369 SMITH             800          5
        30       7698 BLAKE            2850          1
        30       7499 ALLEN            1600          2
        30       7844 TURNER           1500          3
        30       7654 MARTIN           1250          4
        30       7521 WARD             1250          4
        30       7900 JAMES             950          6

14 rows selected.

SQL>

We get a result set that we just need to "push" inside a subquery in the from clause. By applying to it a filter referring to the newly computed column RANK we get what we want:

SQL> select *
  2  from (select deptno, empno, ename, sal,
  3               rank() over (partition by deptno
  4                            order by sal desc) "RANK"
  5        from emp)
  6  where "RANK" <= 2
  7  order by deptno, "RANK"
  8  /

    DEPTNO      EMPNO ENAME             SAL       RANK
---------- ---------- ---------- ---------- ----------
        10       7839 KING             5000          1
        10       7782 CLARK            2450          2
        20       7788 SCOTT            3000          1
        20       7902 FORD             3000          1
        30       7698 BLAKE            2850          1
        30       7499 ALLEN            1600          2

6 rows selected.

SQL>

Note that there is a tie in department 20.

How can we compute a rank with MySQL? A quick Web search will provide you with an answer: by using a subquery in the select list. Saying that Blake has the highest salary in department 30 only means that no one from this department earns more. Because Allen has the second highest salary, only one person in the same department, our friend Blake, gets a bigger bank transfer each month. Computing how many people from the same department earn more than the current employee and adding one to the result yields the desired answer. In the very same fashion as we have done it with Oracle, we can compute the rank for every row, push the query into the from clause and filter on the rank in MySQL. The only difference is that with MySQL, associating an alias to a subquery in the from clause is mandatory.

mysql> select *
    -> from (select a.DEPTNO, a.EMPNO, a.ENAME, a.SAL,
    ->              (select 1 + count(*)
    ->               from EMP b
    ->               where b.DEPTNO = a.DEPTNO
    ->                 and b.SAL > a.SAL) RANK
    ->       from EMP as a) as x
    -> where x.RANK <= 2
    -> order by x.DEPTNO, x.RANK;
+--------+-------+-------+------+------+
| DEPTNO | EMPNO | ENAME | SAL  | RANK |
+--------+-------+-------+------+------+
|     10 |  7839 | KING  | 5000 |    1 |
|     10 |  7782 | CLARK | 2450 |    2 |
|     20 |  7902 | FORD  | 3000 |    1 |
|     20 |  7788 | SCOTT | 3000 |    1 |
|     30 |  7698 | BLAKE | 2850 |    1 |
|     30 |  7499 | ALLEN | 1600 |    2 |
+--------+-------+-------+------+------+
6 rows in set (0.01 sec)

mysql>

A lovely result. Are we happy? Well, if there is one thing that too many years spent with databases has taught me, it is that tables have a tendency to grow over time. That the nice little trick that gives exactly the result we want had better be tested against a table a tad bigger than 14 lines. I have therefore created, both in Oracle and MySQL, an EMPLOYEES table vaguely similar in structure to EMP, and I have dutifully populated it with a mere 10,000 rows. The is by no means an impressive number of rows, even if companies with 10,000 employees are big corporations. The employees are uniformly (but randomly) spread among six departments. I have asked Oracle to time its queries, and Oracle (10.2) and MySQL (5.0) are running on the very same server. Let's ask for the five biggest salaries for each department.

Here is what we get with Oracle:

SQL> select *
  2  from (select deptno, empno, lastname, firstname, sal,
  3               rank() over (partition by deptno
  4                            order by sal desc) "RANK"
  5        from employees)
  6  where "RANK" <= 5
  7  order by deptno, "RANK"
  8  /

    DEPTNO      EMPNO LASTNAME             FIRSTNAME                   SAL       RANK
---------- ---------- -------------------- -------------------- ---------- ----------
        10       4751 WOLFORD              KATHY                   7997.34          1
        10      10517 HARRIS               GARLAND                 7993.74          2
        10       8135 GRAHAM               LIANA                   7988.67          3
        10       7028 PRATT                GLORIA                  7980.45          4
        10       6694 GEARY                SANDRA                  7973.03          5
        20       7158 WHALEY               PATRICIA                7997.44          1
        20      10008 JUNKER               TOM                     7992.63          2
        20       1782 SCOTT                JOHN                    7991.65          3
        20       4158 OLIVER               BLANCHE                 7989.13          4
        20       3537 BUCHANAN             LESLEY                  7982.61          5
        30       7892 VALENCIA             EDWARD                  7993.51          1
        30       5491 PEREZ                MARTIN                  7990.48          2
        30       5751 LARSON               LEE                      7990.4          3
        30       5743 PACHECO              YOLANDA                 7989.91          4
        30       7115 COVEY                BRANDON                 7984.36          5
        40       5493 KOFFLER              DENNIS                  7998.95          1
        40       4123 MURRAH               MARY                    7997.23          2
        40       8103 MORGAN               ASHLEY                   7995.9          3
        40       6171 WINSTEAD             DOROTHY                 7993.35          4
        40       7098 FERRELL              HECTOR                  7987.78          5
        50       9860 ORMAN                GUADALUPE               7993.79          1
        50       7446 KURLAND              GARY                    7989.36          2
        50       8470 THOMAS               JAMES                   7989.11          3
        50       9251 SOLANO               JULIE                   7988.24          4
        50       2769 MILTON               DEBORAH                 7986.29          5
        60       2899 FARMER               FRANK                    7981.1          1
        60       5021 COCHRAN              GARY                     7978.5          2
        60       6607 MOORE                DONALD                   7977.7          3
        60       4013 BURMEISTER           HATTIE                  7973.04          4
        60       1472 LADD                 JEFFREY                  7952.5          5

30 rows selected.

Elapsed: 00:00:00.05
SQL>

MySQL now:

mysql>select *
    -> from (select a.DEPTNO, a.EMPNO, a.LASTNAME, a.FIRSTNAME, a.SAL,
    ->              (select 1 + count(*)
    ->               from EMPLOYEES b
    ->               where b.DEPTNO = a.DEPTNO
    ->                 and b.SAL > a.SAL) RANK
    ->       from EMPLOYEES as a) as x
    -> where x.RANK <= 5
    -> order by x.DEPTNO, x.RANK;
+--------+-------+------------+-----------+---------+------+
| DEPTNO | EMPNO | LASTNAME   | FIRSTNAME | SAL     | RANK |
+--------+-------+------------+-----------+---------+------+
|     10 |  4751 | WOLFORD    | KATHY     | 7997.34 |    1 |
|     10 | 10517 | HARRIS     | GARLAND   | 7993.74 |    2 |
  [snip]
|     60 |  4013 | BURMEISTER | HATTIE    | 7973.04 |    4 |
|     60 |  1472 | LADD       | JEFFREY   |  7952.5 |    5 |
+--------+-------+------------+-----------+---------+------+
30 rows in set (5 min 29.12 sec)

Exactly the same result (<blush></blush>except for the slightly embarrassing line under the table.) Let's take a closer look at what we are doing. Our subquery that computes the rank is correlated: that means that it refers to the current row of EMPLOYEES, through both the DEPTNO and the SAL columns. We need them to compute how many employees in the same department earn a bigger salary than the current employee. In other words, we must fire the query for each of the 10,000 rows we inspect. Now, when I created my table, I defined EMPNO as the primary key, and created an additional index on LASTNAME, since I expected it to be a frequent search criterion. Neither DEPTNO nor SAL are indexed, in either database. Indexing DEPTNO, given the distribution, is pointless, because it isn't selective enough. Ten thousand times a full scan over a ten thousand row table: one hundred million rows inspected in all. Somewhat more visible than fourteen scans of a fourteen row table...

Let me be upfront about this, I dislike adding indexes. People often underestimate their impact on inserts, deletes, and (sometimes) updates, which can be very high. I don't want to fix a query and break the nightly batch program. There are also cases when I couldn't index. Suppose that instead of having departments, we have stockmarkets. Instead of having employees, we have securities, and instead of ranking salaries, we want to rank the difference between the current price and yesterday's closing price, as a percentage. There is a high update rate, and even if we wanted to, MySQL doesn't allow you to index a computed column. What would we do? Never mind, this isn't a real life assignment, just an ONLamp article. Let's just try to save face and index the two columns that we need to correlate the main query and the subquery.

mysql> create index tempidx on EMPLOYEES(DEPTNO, SAL);
Query OK, 10000 rows affected (0.14 sec)
Records: 10000  Duplicates: 0  Warnings: 0

When we run the query again, here is the bottom line:

30 rows in set (26.48 sec)

We have avoided having to commit hara-kiri (just) but compared to Oracle, it's still far from glorious. Shall we have to kowtow to the mighty Oracle? Let's not throw in the towel yet.

Getting the Top Values Efficiently, Take One

When we run into a performance problem, there is no such thing as RTFMing. We are trying to get with MySQL the same results as SQL extensions that Oracle has, and MySQL hasn't. Instead of using an otherwise squeaky clean SQL construct such as the correlated subquery in the select list, couldn't we find extensions that MySQL has and Oracle hasn't to help us? Perhaps it would be wise to restate the problem. We want to compare the employee in the current row to the employees in the same department. How would we do it manually? I don't know about you, but I don't think that I would compute for each employee in turn how many people make more in the same department. I'd rather establish a full list of employees in the same department, and sort them by salary (which is probably what Oracle does behind the scene); and I'd reuse the same list for each employee in the department instead of recomputing the list each time. What we need is a sorted list.

There is precisely a function that builds a list, GROUP_CONCAT, and interestingly it is a function that Oracle natively lacks (even if it is possible to create such a function as a user-written function). The MySQL documentation provides the full syntax as:

GROUP_CONCAT([DISTINCT] expr [,expr ...]
             [ORDER BY {unsigned_integer | col_name | expr}
                 [ASC | DESC] [,col_name ...]]
             [SEPARATOR str_val])

Have you noticed the order by passed as a parameter? Doesn't it ring a bell? There is of course a catch: the resulting string has a limited length (although adjustable through the group_concat_max_len variable), 1024 characters by default. I shall return to this topic in a moment, but since for the time being I am only interested by the first five items in the list (since I want the people who have the top five salaries) the default maximum length is much more than I need.

I now have found how to construct my sorted list. But how shall I determine the position of the employee described by the current row inside this list? We have another function (which once again Oracle lacks) that fits our requirements, FIND_IN_SET(str,strlist). You should take note that in this function, the separator used in strlist is required to be a comma, which means that we must use a comma (the default) as the separator in our call to GROUP_CONCAT(). So, if whatever we need to store in our list may contain a comma, we must make use of REPLACE() to substitute this comma for an innocuous character whenever we add an item to the list, or search an item in the list.

Properly equipped with these two functions, I can now write and run my query (after having dropped the index on DEPTNO and SAL to be in the same context as Oracle):

mysql> select *
    -> from (select e.DEPTNO,
    ->              e.EMPNO,
    ->              e.LASTNAME,
    ->              e.FIRSTNAME,
    ->              e.SAL,
    ->              find_in_set(e.SAL, x.SALLIST) RANK
    ->       from EMPLOYEES as e,
    ->            (select DEPTNO, group_concat(SAL
    ->                                         order by SAL desc) SALLIST
    ->             from EMPLOYEES
    ->             group by DEPTNO) as x
    ->       where e.DEPTNO = x.DEPTNO) as z
    -> where RANK <= 5
    ->   and RANK > 0
    -> order by DEPTNO, RANK;

A word about the list we build: we haved simply concatenated the salaries by decreasing order, letting MySQL implicitly convert float values to character strings. Since we have about 1,600 employees per department, the default length of 1024 characters is grossly insufficient to build a complete list. However, as I have said previously, we don't need the complete list, we just need the first five values. Nevertheless, we must anticipate that salaries who are not in the top 120 or whereabout will fail to be stored, and therefore will not be found by the FIND_IN_SET() function (which will return 0.) Hence the necessity for an additional condition that RANK must be strictly positive. We get the same result as usual:

+--------+-------+------------+-----------+---------+------+
| DEPTNO | EMPNO | LASTNAME   | FIRSTNAME | SAL     | RANK |
+--------+-------+------------+-----------+---------+------+
|     10 |  4751 | WOLFORD    | KATHY     | 7997.34 |    1 |
|     10 | 10517 | HARRIS     | GARLAND   | 7993.74 |    2 |
    [snip]
|     60 |  4013 | BURMEISTER | HATTIE    | 7973.04 |    4 |
|     60 |  1472 | LADD       | JEFFREY   |  7952.5 |    5 |
+--------+-------+------------+-----------+---------+------+
30 rows in set, 1 warning (0.30 sec)

Same result, except that this time the performance, without being quite as excellent as Oracle's, is very decent. We get a warning, in punishment of our sin of trying to build lists that are more than 1K in length, but otherwise we get precisely what we want, with an SQL usage that is both relatively elegant and efficient. We have in the GROUP_CONCAT()/FIND_IN_SET() combination a kind of MySQL crypto-analytic function that enables us to answer almost all the "what are the top (or bottom) n" questions, as long as n isn't too large, and can be used to emulate several Oracle analytic functions. For instance, Oracle can handle both rank() and dense_rank(); the former focuses on what is ranked, and the latter on how it is ranked. Let's suppose that Tom, Dick, and Harry have obtained the following scores at a game:

Tom

   825

Dick

    825

Harry

  750

Harry's rank() is 3, because rank() focuses on what is ranked, the people, and there are two people who have a better score than Harry. Both Tom and Dick have rank 1. However, dense_rank() focuses on how people are ranked, which is to say scores. And here, Harry's dense_rank() is 2, since there is only one score value higher than 750; "dense"  refers to the fact that there is no gap in rank numbers. If you want to emulate dense_rank(), all you need to do is to add a distinct to the expression in group_concat(). And if you want to emulate row_number(), that assigns a unique number within the group defined by the partition clause, all you need to do is build a list using unique identifiers. In our example, keeping in mind that we accept that we don't need more values than what the longest string returned by group_concat() is able to hold, we can emulate:

   row_number() over (partition by deptno
                      order by hiredate)

by using

    FIND_IN_SET(EMPNO, EMPNOLIST)
with EMPNOLIST built as
    GROUP_CONCAT(EMPNO ORDER BY HIREDATE) as EMPNOLIST

associated with the usual group by DEPTNO. A word of caution though, as with in our example with SUM(), any additional restrictive condition that Oracle would need only once will have to be added to the group by query as well with MySQL.

In my next article, we'll look at another way to do this same analytical query, as well as how to implement the LAG function and how to foresee values in queries.

Stéphane Faroult first discovered the relational model and the SQL language in 1983.


 

Return to ONLamp.com.

<!---->

Global site tag (gtag.js) - Google Analytics