Working on a test case with sysbench, I encountered this:
1 2 3 4 5 6 7 8 | mysql> select * from sbtest1 where id = round(rand()*10000, 0); +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ | id | k | c | pad | +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ | 179 | 499871 | 09833083632-34593445843-98203182724-77632394229-31240034691-22855093589-98577647071-95962909368-34814236148-76937610370 | 62233363025-41327474153-95482195752-11204169522-13131828192 | | 1606 | 502031 | 81212399253-12831141664-41940957498-63947990218-16408477860-15124776228-42269003436-07293216458-45216889819-75452278174 | 25423822623-32136209218-60113604068-17409951653-00581045257 | +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ 2 rows in set (0.30 sec) |
I was really surprised. First, and the most important, id is a primary key and the rand() function should produce just one value. How come it returns two rows? Second, why is the response time 0.30 sec? That seems really high for a primary key access.
Looking further:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | CREATE TABLE `sbtest1` ( `id` int(11) NOT NULL AUTO_INCREMENT, `k` int(11) NOT NULL DEFAULT '0', `c` char(120) NOT NULL DEFAULT '', `pad` char(60) NOT NULL DEFAULT '', PRIMARY KEY (`id`), KEY `k_1` (`k`) ) ENGINE=InnoDB AUTO_INCREMENT=1000001 DEFAULT CHARSET=latin1 mysql> explain select * from sbtest1 where id = round(rand()*10000, 0); +----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+ | 1 | SIMPLE | sbtest1 | NULL | ALL | NULL | NULL | NULL | NULL | 986400 | 10.00 | Using where | +----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+ |
So it is a primary key, but MySQL does not use an index, and it returns two rows. Is this a bug?
Deterministic vs nondeterministic functions
Turned out it is not a bug at all. It is pretty logical behavior from MySQL, but it is not what we would expect. First, why a full table scan? Well, rand() is nondeterministic function. That means we do not know what it will return ahead of time, and actually that is exactly the purpose of rand() – to return a random value. In this case, it is only logical to evaluate the function for each row, each time, and compare the results. i.e. in our case
- Read row 1, get the value of id, evaluate the value of RAND(), compare
- Proceed using the same algorithm with the remaining rows.
In other words, as the value of rand() is not known (not evaluated) beforehand, so we can’t use an index.
And in this case – rand() function – we have another interesting consequence. For larger tables with an auto_increment primary key, the probability of matching the rand() value and the auto_increment value is higher, so we can get multiple rows back. In fact, if we read the whole table from the beginning and keep comparing the auto_inc sequence with “the roll of the dice”, we can get many rows back.
That behavior is totally counter-intuitive. Nevertheless, to me, it’s also the only correct behavior.
We expect to have the rand() function evaluated before running the query. This can actually be achieved by assigning rand() to a variable:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | mysql> set @id=round(rand()*10000, 0); select @id; select * from sbtest1 where id = @id; Query OK, 0 rows affected (0.00 sec) +------+ | @id | +------+ | 6068 | +------+ 1 row in set (0.00 sec) +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ | id | k | c | pad | +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ | 6068 | 502782 | 84971025350-12845068791-61736600622-38249796467-85706778555-74134284808-24438972515-17848828748-86869270666-01547789681 | 17507194006-70651503059-23792945260-94159543806-65683812344 | +------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+ 1 row in set (0.00 sec) mysql> explain select * from sbtest1 where id = @id; +----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+ | 1 | SIMPLE | sbtest1 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL | +----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+ 1 row in set, 1 warning (0.01 sec) |
This would meet our expectations.
There are (at least) two bug reports filed, with very interesting discussion:
- rand() used in scalar functions returns multiple rows
- SELECT on PK with ROUND(RAND()) give wrong errors
Other databases
I wanted to see how it works in other SQL databases. In PostgreSQL, the behavior is exactly the same as MySQL:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | postgres=# select * from t2 where id = cast(random()*10000 as int); id | c ------+--------- 4093 | asdasda 9378 | asdasda (2 rows) postgres=# select * from t2 where id = cast(random()*10000 as int); id | c ------+--------- 5988 | asdasda 6674 | asdasda (2 rows) postgres=# explain select * from t2 where id = cast(random()*10000 as int); QUERY PLAN -------------------------------------------------------------------- Seq Scan on t2 (cost=0.00..159837.60 rows=1 width=12) Filter: (id = ((random() * '10000'::double precision))::integer) (2 rows) |
And SQLite seems different, evaluating the random() function beforehand:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int); 16239|asdsadasdsa sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int); 32910|asdsadasdsa sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int); 58658|asdsadasdsa sqlite> explain select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int); addr opcode p1 p2 p3 p4 p5 comment ---- ------------- ---- ---- ---- ------------- -- ------------- 0 Init 0 12 0 00 Start at 12 1 OpenRead 0 30182 0 2 00 root=30182 iDb=0; t2 2 Function0 0 0 3 random(0) 00 r[3]=func(r[0]) 3 Cast 3 69 0 00 affinity(r[3]) 4 Function0 0 3 2 abs(1) 01 r[2]=func(r[3]) 5 Divide 4 2 1 00 r[1]=r[2]/r[4] 6 Cast 1 68 0 00 affinity(r[1]) 7 SeekRowid 0 11 1 00 intkey=r[1]; pk 8 Copy 1 5 0 00 r[5]=r[1] 9 Column 0 1 6 00 r[6]=t2.c 10 ResultRow 5 2 0 00 output=r[5..6] 11 Halt 0 0 0 00 12 Transaction 0 0 2 0 01 usesStmtJournal=0 13 Int64 0 4 0 92233720368547 00 r[4]=92233720368547 14 Goto 0 1 0 00 |
Conclusion
Be careful when using MySQL nondeterministic functions in a “where” condition – rand() is the most interesting example – as their behavior may surprise you. Many people believe this to be a bug that should be fixed. Let me know in the comments: do you think it is a bug or not (and why)? I would also be interested to know how it works in other, non-opensource databases (Microsoft SQL Server, Oracle, etc)
PS: Finally, I’ve got a “clever” idea – what if I “trick” MySQL by using the deterministic keyword…
MySQL stored functions: deterministic vs not deterministic
So, I wanted to see how it works with MySQL stored functions if they are assigned “deterministic” and “not deterministic” keywords. First, I wanted to “trick” mysql and pass the deterministic to the stored function but use rand() inside. Ok, this is not what you really want to do!
1 2 3 4 5 6 7 8 9 | DELIMITER $$ CREATE FUNCTION myrand() RETURNS INT DETERMINISTIC BEGIN RETURN round(rand()*10000, 0); END$$ DELIMITER ; |
From MySQL manual about MySQL stored routines we can read:
Assessment of the nature of a routine is based on the “honesty” of the creator: MySQL does not check that a routine declared
DETERMINISTIC
is free of statements that produce nondeterministic results. However, misdeclaring a routine might affect results or affect performance. Declaring a nondeterministic routine asDETERMINISTIC
might lead to unexpected results by causing the optimizer to make incorrect execution plan choices. Declaring a deterministic routine asNONDETERMINISTIC
might diminish performance by causing available optimizations not to be used.
The result is interesting:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 | mysql> select myrand(); +----------+ | myrand() | +----------+ | 4202 | +----------+ 1 row in set (0.00 sec) mysql> select myrand(); +----------+ | myrand() | +----------+ | 7548 | +----------+ 1 row in set (0.00 sec) mysql> explain select * from t2 where id = myrand()\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: NULL partitions: NULL type: NULL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: NULL filtered: NULL Extra: Impossible WHERE noticed after reading const tables 1 row in set, 1 warning (0.00 sec) mysql> show warnings; +-------+------+--------------------------------------------------------------------------------+ | Level | Code | Message | +-------+------+--------------------------------------------------------------------------------+ | Note | 1003 | /* select#1 */ select '2745' AS `id`,'asasdas' AS `c` from `test`.`t2` where 0 | +-------+------+--------------------------------------------------------------------------------+ 1 row in set (0.00 sec) mysql> select * from t2 where id = 4202; +------+---------+ | id | c | +------+---------+ | 4202 | asasdas | +------+---------+ 1 row in set (0.00 sec) mysql> select * from t2 where id = 2745; +------+---------+ | id | c | +------+---------+ | 2745 | asasdas | +------+---------+ 1 row in set (0.00 sec) |
So MySQL optimizer detected the problem (somehow).
If I use the NOT DETERMINISTIC keyword, then MySQL works the same as when using the rand() function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | DELIMITER $$ CREATE FUNCTION myrand2() RETURNS INT NOT DETERMINISTIC BEGIN RETURN round(rand()*10000, 0); END$$ DELIMITER ; mysql> explain select * from t2 where id = myrand2()\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t2 partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 262208 filtered: 10.00 Extra: Using where 1 row in set, 1 warning (0.00 sec) |
—
Photo by dylan nolte on Unsplash
I think this is absolutely a bug. Standard rules of algebra would say that we evaluate expressions from the inside out, so we should be able to do the RAND(), then the ROUND(), and be left with a statement equivalent to SELECT * FROM table WHERE id = – in which case we should get back 0 or 1 rows.
The RAND() function is nondeterministic, yes, but it is never going to return more than one value – as opposed to some nondeterministic function that might return an unknown number of rows – so why can the optimizer / parser not be smart enough to figure that out? It used to be the case (and maybe it still is) that if you tried to do an EXPLAIN on a SELECT that contained a subquery, MySQL would actually run the subquery before giving you the optimizer’s plan – this seems like a much less complicated variant of that scenario.
Ernie, I think a next reply by Joe shows a good point, really non-deterministic function should be evaluated for each row. Altho it is not what we can expect
I say it’s not a bug. Imagine doing a simulation. Say you had a column containing the object’s (represented by the row) chance of success. Each time you run the query, you want back a random list of rows which “succeeded’ during the simulation. You’d write something like:
SELECT * FROM agents WHERE success_prob > rand()
If rand() evaluated before the the query executed (instead of row by row), then all of the rows that returned when rand() = .2 would also be returned when rand() = .1 (plus some additional rows). There would be no variation among which rows got selected… just among the quantity of rows selected .
The author correctly points out that a variable should be used to store a single random value before query exeution if you want a single random value to evaluate against all rows. But the WHERE clause evaluates for each row, so it makes sense that rand() should evaluate for each row separately if it is included in the WHERE clause.
You may also wish to check this HackerNews Discussion https://news.ycombinator.com/item?id=18617809
AFAICT, the described behavior is according to the SQL standard. In theory, a WHERE clause should be evaluated for all rows. A database system may optimize this and not inspect every row if it has some other mechanism to identify which rows will satisfy the WHERE clause (e.g., an index). Also, functions calls that are constant (i.e., deterministic and does not depend on input from the current row) may be evaluated once. In this case, the function is non-deterministic, so it needs to be evaluated on every row.
Yes, I agree here. What is interesting, it seems that MySQL and PostgreSQL have this (described) behavior, but SQLite and (it seems like) MS SQL server has different.