JSON Structure, The Next Generation (Performance, Scale, and Fun)
Now that you know the basic JSON functionality available to you, as covered in part one of this series, let's look deeper into performance, scale, and the oddities that may occur.
You can do almost anything with JSON that you could do in SQL with standard data types and objects, but you may run into specific problems and limitations. Here is an example. Let's say I want to get the top 10 rated movies of all time.
mysql> select json_column->>'$.title' as title, json_column->>'$.imdb_rating' as rating, json_column->>'$.imdb_id' as imdb_id from movies_json where json_column->>'$.imdb_rating' > 8 order by json_column->>'$.imdb_rating' desc limit 10;
ERROR 1038 (HY001): Out of sort memory, consider increasing server sort buffer size
In this case, the size required to sort the result set is much larger than the available sort buffer. Selecting this data from a normalized structure would not cause this (nor would generated columns).
mysql> select @@sort_buffer_size;
+--------------------+
| @@sort_buffer_size |
+--------------------+
|             262144 |
+--------------------+
1 row in set (0.00 sec)

mysql> set @@sort_buffer_size=26214400;
Query OK, 0 rows affected (0.00 sec)

mysql> select json_column->>'$.title' as title, json_column->>'$.imdb_rating' as rating, json_column->>'$.imdb_id' as imdb_id from movies_json where json_column->>'$.imdb_rating' > 8 order by json_column->>'$.imdb_rating' desc limit 10;
+----------------------------------------------------+--------+-----------+
| title                                              | rating | imdb_id   |
+----------------------------------------------------+--------+-----------+
| The Mountain II (2016)                             | 9.9    | tt5813916 |
| Toxic Temptation (2016)                            | 9.9    | tt4621978 |
| 1985 (2016)                                        | 9.9    | tt5932546 |
| Jag har din rygg (2015)                            | 9.8    | tt3689312 |
| My Head Hurts (2000)                               | 9.8    | tt1346290 |
| Boz: Colors and Shapes (Video 2006)                | 9.8    | tt0876256 |
| Foreclosed (Video 2010)                            | 9.8    | tt1648984 |
| Nocturne in Black (2016)                           | 9.8    | tt4536608 |
| Monrad & Rislund: Det store triumftog (Video 2004) | 9.8    | tt0425266 |
| Questione di Sguardi (2014)                        | 9.8    | tt4423586 |
+----------------------------------------------------+--------+-----------+
10 rows in set, 65535 warnings (0.78 sec)
So we solved the issue of not having enough space to sort the dataset, but did you notice the time? Almost one second. No website you are working on nowadays can succeed if every database query takes 0.5-1 second to return. Yes, you can put a cache in front of these queries, and you probably will. A better cache solves it all, right? Well, not really. In this case, you can speed things up drastically with a few small but important improvements. Before moving on, did you notice the warnings? Let's ignore those for a moment; they are caused by the null values in the rating column and are not impacting performance.
Let's look at generated columns. Generated columns allow you to create columns whose values are computed on the fly from other columns. (Note: you can also use functional indexes, which I will cover later.) In the case of JSON, we can pull values out of our document and store them in a read-only column that can be indexed (we all know indexing is good, right?). Let's do this! I am going to create a table with generated columns for several values, then load the data via an insert from our already loaded movies table.
mysql> create table movies_json_generated (
    -> ai_myid int AUTO_INCREMENT primary key,
    -> imdb_id varchar(255) generated always as (`json_column` ->> '$.imdb_id'),
    -> title varchar(255) generated always as (`json_column` ->> '$.title'),
    -> imdb_rating decimal(5,2) generated always as (`json_column` ->> '$.imdb_rating'),
    -> overview text generated always as (`json_column` ->> '$.overview'),
    -> director json generated always as (`json_column` ->> '$.director'),
    -> cast json generated always as (`json_column` ->> '$.cast'),
    -> json_column json
    -> ) engine = innodb;
Query OK, 0 rows affected (0.04 sec)

mysql> create unique index imdb_idx on movies_json_generated(imdb_id);
Query OK, 0 rows affected (0.04 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> insert into movies_json_generated (json_column) select json_column from movies_json;
ERROR 1366 (HY000): Incorrect decimal value: 'null' for column 'imdb_rating' at row 1

mysql> Drop Table movies_json_generated;
Query OK, 0 rows affected (0.04 sec)
I wanted to take a moment to dive into what I have seen as a common issue. As you can see, the table was created without a problem, but the insert of data from our JSON objects failed. The reason? I am storing the IMDB rating in a decimal(5,2) column, but the JSON extraction returns 'null' (quoted) as text. Type conversions when working with JSON in MySQL can be tricky. Remember those warnings above? They were type conversion warnings, easy to ignore because they did not appear to cause an immediate issue (but they did). Type conversions and character sets can be a real problem when accessing JSON data.
Allow me a slightly divergent rant: an inconsistent or rapidly changing JSON format/schema can make using standard database functions difficult, if not impossible.
I often hear people talk about structure and the lack of flexibility as a massive drawback of relational databases. JSON is flexible; people love flexibility. The issue is that flexibility comes at a cost, and you will pay that cost somewhere. If you validate and maintain a structure in your database, your code must fit within that structure, and changes to the structure may be slow (database migrations). If you use a flexible database schema, then you need to validate in your code. That validation may be simple now, but as you grow, the more iterations or changes your data goes through, the more sprawling the code that checks and validates it becomes. Whether you want to admit it or not, throwing unstructured data into the wasteland that is a document is a recipe for problems. When you store just a dump of JSON in your database, that data can be complex to access or manipulate without writing code that reimplements all of that validation.
In this case, the type conversion is pretty straightforward, and you can solve the issue in a couple of different ways. Ultimately, it is about ensuring the JSON value is converted correctly. I will use the json_value function, which handles the JSON null properly.
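To see the difference in isolation, here is a minimal standalone sketch. The behavior shown is how I understand MySQL 8.0 to handle JSON null (json_value requires MySQL 8.0.21 or later); verify on your version:

```sql
-- The ->> operator (shorthand for JSON_UNQUOTE(JSON_EXTRACT(...))) turns a
-- JSON null into the four-character string 'null', which is exactly what
-- broke the decimal insert above. json_value yields a proper SQL NULL.
SET @doc = '{"imdb_rating": null}';

SELECT JSON_UNQUOTE(JSON_EXTRACT(@doc, '$.imdb_rating')) AS via_unquote,  -- the text 'null'
       JSON_VALUE(@doc, '$.imdb_rating')                AS via_value;    -- SQL NULL
```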
mysql> create table movies_json_generated (
    -> ai_myid int AUTO_INCREMENT primary key,
    -> imdb_id varchar(255) generated always as (`json_column` ->> '$.imdb_id'),
    -> title varchar(255) generated always as (`json_column` ->> '$.title'),
    -> imdb_rating decimal(5,2) generated always as (json_value(json_column, '$.imdb_rating')),
    -> overview text generated always as (`json_column` ->> '$.overview'),
    -> director json generated always as (`json_column` ->> '$.director'),
    -> cast json generated always as (`json_column` ->> '$.cast'),
    -> json_column json
    -> ) engine = innodb;

mysql> create unique index imdb_idx on movies_json_generated(imdb_id);
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> insert into movies_json_generated (json_column) select json_column from movies_json;
Query OK, 375359 rows affected (40.26 sec)
Records: 375359 Duplicates: 0 Warnings: 0
Now let's compare searching for a movie, first by the imdb_id within the JSON document, and then using the indexed column in the generated table:
mysql> select json_column->>'$.title', json_column->>'$.imdb_rating' from movies_json where json_column->>'$.imdb_id'='tt2395427';
+--------------------------------+-------------------------------+
| json_column->>'$.title'        | json_column->>'$.imdb_rating' |
+--------------------------------+-------------------------------+
| Avengers: Age of Ultron (2015) | 7.5                           |
+--------------------------------+-------------------------------+
1 row in set (0.86 sec)

mysql> select title, imdb_rating from movies_json_generated where imdb_id='tt2395427';
+--------------------------------+-------------+
| title                          | imdb_rating |
+--------------------------------+-------------+
| Avengers: Age of Ultron (2015) | 7.50        |
+--------------------------------+-------------+
1 row in set (0.01 sec)
Great! Let's go back to our example of using the IMDB rating to get the top 10 movies. To make this faster, we first need to create an index on the generated column.
mysql> create index idx_rating on movies_json_generated (imdb_rating);
Query OK, 0 rows affected (1.45 sec)
Records: 0 Duplicates: 0 Warnings: 0
With that out of the way, let’s get the top 10 list:
mysql> select json_column->>'$.title' as title, json_column->>'$.imdb_rating' as rating, json_column->>'$.imdb_id' as imdb_id from movies_json_generated where imdb_rating > 8 order by imdb_rating desc limit 10;
+--------------------------------------------------+--------+-----------+
| title                                            | rating | imdb_id   |
+--------------------------------------------------+--------+-----------+
| Advent (IV) (2016)                               | 10.0   | tt6129028 |
| 311 Live: 3/11 Day 2006 (2006)                   | 10.0   | tt0872240 |
| Light Study (2013)                               | 10.0   | tt3130306 |
| Future Boyfriend (2016)                          | 10.0   | tt3955652 |
| Cory in the House: All Star Edition (Video 2007) | 10.0   | tt2402070 |
| 1985 (2016)                                      | 9.9    | tt5932546 |
| Toxic Temptation (2016)                          | 9.9    | tt4621978 |
| The Mountain II (2016)                           | 9.9    | tt5813916 |
| Questione di Sguardi (2014)                      | 9.8    | tt4423586 |
| Foreclosed (Video 2010)                          | 9.8    | tt1648984 |
+--------------------------------------------------+--------+-----------+
10 rows in set (0.01 sec)
A very nice drop, from 0.78 seconds to 0.01 seconds! But wait... why is the data different? Ahhh, glad you noticed! As discussed above, pulling data out of JSON often requires a type conversion. By default, values coming out of JSON are treated as text, not numbers, so the original query was sorting based on the ASCII value (oops). You can get the same results from the original query by forcing the type conversion:
mysql> select json_column->>'$.title' as title, json_column->>'$.imdb_rating' as rating, json_column->>'$.imdb_id' as imdb_id from movies_json_generated where json_column->>'$.imdb_rating' > 8 order by json_column->>'$.imdb_rating'*1 desc limit 10;
+--------------------------------------------------+--------+-----------+
| title                                            | rating | imdb_id   |
+--------------------------------------------------+--------+-----------+
| 311 Live: 3/11 Day 2006 (2006)                   | 10.0   | tt0872240 |
| Advent (IV) (2016)                               | 10.0   | tt6129028 |
| Cory in the House: All Star Edition (Video 2007) | 10.0   | tt2402070 |
| Light Study (2013)                               | 10.0   | tt3130306 |
| Future Boyfriend (2016)                          | 10.0   | tt3955652 |
| Toxic Temptation (2016)                          | 9.9    | tt4621978 |
| The Mountain II (2016)                           | 9.9    | tt5813916 |
| 1985 (2016)                                      | 9.9    | tt5932546 |
| Nocturne in Black (2016)                         | 9.8    | tt4536608 |
| My Head Hurts (2000)                             | 9.8    | tt1346290 |
+--------------------------------------------------+--------+-----------+
10 rows in set, 65535 warnings (0.89 sec)

Alternatively, you can use: cast(json_value(json_column,'$.imdb_rating') as float)
So not only can you significantly speed up performance by indexing generated columns, you can also ensure you get consistent types and the expected results. The documentation has a detailed section on ordering and grouping JSON values. Generated columns also improve most of the other queries we showed above.
Functional Indexes Without Generated Columns
Generated columns work well enough, but in MySQL 8.0.13 or later we can create indexes directly on the JSON functions we call regularly and forgo the generated columns altogether.
mysql> create index title_idx on movies_json ((json_value(json_column,'$.title')));
Query OK, 0 rows affected (2.90 sec)
Records: 0 Duplicates: 0 Warnings: 0
This works well enough if you need an exact match; if you use a LIKE or wildcard, the functional index won't be used. As you explore functional indexes with JSON, be mindful that there are character set and collation restrictions, as well as restrictions on which functions can be used, so your mileage will vary. I will avoid a deep dive here, but you can review the documentation here: https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-functional-key-parts and here: https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html#json-column-indirect-index. In my testing, the more complicated or nested the JSON, the more problems I ran into. For this reason, I stuck with generated columns for the examples above.
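As a quick illustration of that limitation, here is a hedged sketch of how you could verify whether the functional index created above gets used (exact plans will vary with your data and MySQL version):

```sql
-- Exact match on the same expression the index was built on: the optimizer
-- should be able to use title_idx.
EXPLAIN SELECT json_column->>'$.imdb_id'
FROM movies_json
WHERE json_value(json_column, '$.title') = 'Avengers: Endgame (2019)';

-- A LIKE/wildcard predicate: per the behavior described above, expect a full
-- table scan (type: ALL) because the functional index is not used.
EXPLAIN SELECT json_column->>'$.imdb_id'
FROM movies_json
WHERE json_value(json_column, '$.title') LIKE 'Avengers%';
```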
Multi-Valued Indexes
As of 8.0.17, MySQL also supports multi-valued indexes: a secondary index that lets you index an array within your JSON. Note that this does not appear to support arrays of characters at this time; you get the helpful error "This version of MySQL doesn't yet support 'CAST-ing data to array of char/binary BLOBs'". Multi-valued indexes can help with straightforward JSON designs, but as my JSON became more nested and complex, I again ran into problems.
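The post doesn't include a multi-valued index example, so here is a hedged sketch. It assumes a hypothetical numeric array in the document, e.g. a genre_ids field like {"genre_ids": [12, 18, 878]}; our cast array (objects and strings) would hit the char/binary limitation mentioned above:

```sql
-- Multi-valued index over a JSON numeric array (MySQL 8.0.17+).
-- The genre_ids field is made up for illustration; adjust the path to your data.
ALTER TABLE movies_json
  ADD INDEX genre_ids_idx ((CAST(json_column->'$.genre_ids' AS UNSIGNED ARRAY)));

-- Predicates using MEMBER OF, JSON_CONTAINS, or JSON_OVERLAPS can use the index:
SELECT json_column->>'$.title'
FROM movies_json
WHERE 878 MEMBER OF (json_column->'$.genre_ids');
```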
Performance Summary
This is super unscientific, but these query run times held over multiple iterations. Look at the speed differences:
Query | Access JSON Directly (seconds) | Generated Column (seconds)
Simple search via IMDB ID | 0.75 | 0.01
Search for Avengers-titled movies | 0.76 | 0.01
Updating a single value within the JSON, searching via IMDB ID or title | 0.80 | 0.01
Find top 10 movies of all time | 0.89 | 0.01
Characters played by Robert Downey Jr. in the Avengers movies | 0.74 | 0.01
More Performance Needed? Normalizing Data for Query Patterns
So far, we have done a lot of simple interactions and were able to speed up access to an acceptable level. But not everything fits within the available toolbox. When searching for the movies or ratings of a specific cast member (show me all the Avengers movies Robert Downey Jr. played in, and his characters), we used the index on the title generated column to reduce the number of JSON documents we had to fully process to get the character he played. See below:
mysql> select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns(
    -> V_name varchar(200) path '$.name',
    -> V_character varchar(200) path '$.character')
    -> ) t where t.V_name like 'Robert Downey Jr.%' and title like 'Avengers%';
+--------------------------------+-------------+-------------------+------------------------------------+
| title                          | imdb_rating | V_name            | V_character                        |
+--------------------------------+-------------+-------------------+------------------------------------+
| Avengers: Age of Ultron (2015) | 7.50        | Robert Downey Jr. | Tony Stark                         |
| Avengers: Endgame (2019)       | 9.00        | Robert Downey Jr. | Tony Stark / Iron Man              |
| Avengers: Infinity War (2018)  | NULL        | Robert Downey Jr. | Tony Stark / Iron Man              |
+--------------------------------+-------------+-------------------+------------------------------------+
3 rows in set (0.00 sec)

mysql> explain select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns( V_name varchar(200) path '$.name', V_character varchar(200) path '$.character') ) t where t.V_name like 'Robert Downey Jr.%' and title like 'Avengers%';
+----+-------------+-----------------------+------------+-------+---------------+-----------+---------+------+------+----------+----------------------------------------------------------+
| id | select_type | table                 | partitions | type  | possible_keys | key       | key_len | ref  | rows | filtered | Extra                                                    |
+----+-------------+-----------------------+------------+-------+---------------+-----------+---------+------+------+----------+----------------------------------------------------------+
|  1 | SIMPLE      | movies_json_generated | NULL       | range | title_idx     | title_idx | 1023    | NULL |    8 |   100.00 | Using where                                              |
|  1 | SIMPLE      | t                     | NULL       | ALL   | NULL          | NULL      | NULL    | NULL |    2 |    50.00 | Table function: json_table; Using temporary; Using where |
+----+-------------+-----------------------+------------+-------+---------------+-----------+---------+------+------+----------+----------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
But let's assume you need a list of all the characters he played in his career (the full result set is truncated below).
mysql> select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns(
    -> V_name varchar(200) path '$.name',
    -> V_character varchar(200) path '$.character')
    -> ) t where t.V_name like 'Robert Downey Jr.%';
+-----------------------------------------------------------------+-------------+-------------------+----------------------------------------------------------------+
| title                                                           | imdb_rating | V_name            | V_character                                                    |
+-----------------------------------------------------------------+-------------+-------------------+----------------------------------------------------------------+
| The 65th Annual Academy Awards (1993)                           | NULL        | Robert Downey Jr. | Himself - Presenter                                            |
| Sherlock Holmes: A Game of Shadows (2011)                       | 7.50        | Robert Downey Jr. | Sherlock Holmes                                                |
| Due Date (2010)                                                 | 6.60        | Robert Downey Jr. | Peter Highman                                                  |
| Eros (2004)                                                     | 6.00        | Robert Downey Jr. | Nick Penrose (segment "Equilibrium")                           |
| The EE British Academy Film Awards (2015)                       | 7.40        | Robert Downey Jr. | Himself - Tribute to Lord Attenborough                         |
| "Saturday Night Live" John Lithgow/Mr. Mister (TV Episode 1985) | NULL        | Robert Downey Jr. | Bruce Winston / Rudy Randolph III / Various (as Robert Downey) |
...
213 rows in set (7.14 sec)

mysql> explain select title, imdb_rating, t.* from movies_json_generated, json_table(json_column, '$.cast[*]' columns( V_name varchar(200) path '$.name', V_character varchar(200) path '$.character') ) t where t.V_name like 'Robert Downey Jr.%';
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------------+
| id | select_type | table                 | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra                                                    |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------------+
|  1 | SIMPLE      | movies_json_generated | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 358174 |   100.00 | NULL                                                     |
|  1 | SIMPLE      | t                     | NULL       | ALL  | NULL          | NULL | NULL    | NULL |      2 |    50.00 | Table function: json_table; Using temporary; Using where |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
Now our access pattern needs to start at a deeply nested element within the JSON document. Here you are faced with a few options, but almost all of them lead to creating and maintaining a new table to hold that level of data, whether via a trigger, code that breaks out the data when it is loaded, or a batch job (or caching this evil slow query).
I ended up creating the following “Database Schema” for it:
create table movies_normalized_meta (
  ai_myid int AUTO_INCREMENT primary key,
  imdb_id varchar(32),
  title varchar(255),
  imdb_rating decimal(5,2),
  json_column json
) engine = innodb;

create unique index imdb_id_idx on movies_normalized_meta (imdb_id);
create index rating_idx on movies_normalized_meta (imdb_rating);

create table movies_normalized_actors (
  ai_actor_id int auto_increment primary key,
  actor_id varchar(50),
  actor_name varchar(500)
) engine = innodb;

create index actor_id_idx on movies_normalized_actors (actor_id);
create index actor_name_idx on movies_normalized_actors (actor_name);

create table movies_normalized_cast (
  ai_actor_id int,
  ai_myid int,
  actor_character varchar(500)
) engine = innodb;

create index cast_id_idx on movies_normalized_cast (ai_actor_id,ai_myid);
create index cast_id2_idx on movies_normalized_cast (ai_myid);
create index cast_character_idx on movies_normalized_cast (actor_character);
create unique index u_cast_idx on movies_normalized_cast (ai_myid,ai_actor_id,actor_character);
When loading the JSON into MySQL, I added an actors table that has a row for each new unique actor, as well as a cast table that links the movie, the actor, and the name of the character they played in the movie (note that I could optimize the structure a bit, but that's for another day). This gives me a ton of flexibility in reporting and a major performance boost. Now, to get all the movies Robert Downey Jr. was in and the characters he played, I can run:
mysql> select title, imdb_rating, actor_character from movies_normalized_meta a, movies_normalized_cast b, movies_normalized_actors c where a.ai_myid=b.ai_myid and b.ai_actor_id = c.ai_actor_id and actor_name='Robert Downey Jr.';
+-----------------------------------------------------------------+-------------+----------------------------------------------------------------+
| title                                                           | imdb_rating | actor_character                                                |
+-----------------------------------------------------------------+-------------+----------------------------------------------------------------+
| The 65th Annual Academy Awards (1993)                           | NULL        | Himself - Presenter                                            |
| Sherlock Holmes: A Game of Shadows (2011)                       | 7.50        | Sherlock Holmes                                                |
| Due Date (2010)                                                 | 6.60        | Peter Highman                                                  |
| Eros (2004)                                                     | 6.00        | Nick Penrose (segment "Equilibrium")                           |
| Saturday Night Live in the '80s: Lost & Found (2005)            | 7.20        | Various (archive footage)                                      |
| "Saturday Night Live" John Lithgow/Mr. Mister (TV Episode 1985) | NULL        | Bruce Winston / Rudy Randolph III / Various (as Robert Downey) |
...
213 rows in set (0.01 sec)

mysql> explain select title, imdb_rating, actor_character from movies_normalized_meta a, movies_normalized_cast b, movies_normalized_actors c where a.ai_myid=b.ai_myid and b.ai_actor_id = c.ai_actor_id and actor_name='Robert Downey Jr.';
+----+-------------+-------+------------+--------+-------------------------------------+----------------+---------+-------------------------------+------+----------+-----------------------+
| id | select_type | table | partitions | type   | possible_keys                       | key            | key_len | ref                           | rows | filtered | Extra                 |
+----+-------------+-------+------------+--------+-------------------------------------+----------------+---------+-------------------------------+------+----------+-----------------------+
|  1 | SIMPLE      | c     | NULL       | ref    | PRIMARY,actor_name_idx              | actor_name_idx | 2003    | const                         |  213 |   100.00 | Using index           |
|  1 | SIMPLE      | b     | NULL       | ref    | u_cast_idx,cast_id_idx,cast_id2_idx | cast_id_idx    | 5       | movie_json_test.c.ai_actor_id |    2 |   100.00 | Using index condition |
|  1 | SIMPLE      | a     | NULL       | eq_ref | PRIMARY                             | PRIMARY        | 4       | movie_json_test.b.ai_myid     |    1 |   100.00 | NULL                  |
+----+-------------+-------+------------+--------+-------------------------------------+----------------+---------+-------------------------------+------+----------+-----------------------+
3 rows in set, 1 warning (0.00 sec)
Not only is this seven seconds faster than accessing the same data via the JSON functions, I can also use these tables to easily do things like find all the movies that had two or three specific actors together, play six degrees of separation, and so on: queries that would be a challenge if we only accessed the JSON directly.
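For completeness, here is a hedged sketch of how the load step might populate those normalized tables with json_table. The actual load code isn't shown in this post, and the '$.id' path for actor_id is an assumption about the document structure; adjust both to your data:

```sql
-- Pull each unique actor out of the cast arrays. The '$.id' path is assumed.
INSERT INTO movies_normalized_actors (actor_id, actor_name)
SELECT DISTINCT t.actor_id, t.actor_name
FROM movies_normalized_meta m,
     json_table(m.json_column, '$.cast[*]'
       COLUMNS (actor_id varchar(50) PATH '$.id',
                actor_name varchar(500) PATH '$.name')) t;

-- Link each movie to its actors and characters; INSERT IGNORE lets the
-- unique index u_cast_idx silently absorb duplicate cast entries.
INSERT IGNORE INTO movies_normalized_cast (ai_actor_id, ai_myid, actor_character)
SELECT a.ai_actor_id, m.ai_myid, t.actor_character
FROM movies_normalized_meta m,
     json_table(m.json_column, '$.cast[*]'
       COLUMNS (actor_name varchar(500) PATH '$.name',
                actor_character varchar(500) PATH '$.character')) t,
     movies_normalized_actors a
WHERE a.actor_name = t.actor_name;
```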
Take-Aways and Lessons Learned
First, think before you store your data. Understanding what you are storing, why you are storing it, and how you will access it is paramount. How you access and use your data has a profound impact on the optimal database setup, design, and usage. If you only access top-level properties in your document, a simple design relying on MySQL's built-in JSON functions may be totally fine. But if you want to dive deeper into the data and start pulling out subsets or correlating different documents, then to ensure performance and scalability you may end up reducing your flexibility and building (hold your breath) an actual database schema that is normalized and everything.
Second, MySQL's JSON functions get the job done for most basic use cases. You can get what you need, but you need to understand the implementation, including what is available and what is not. As I repeat these same exercises with PostgreSQL and MongoDB, you will see where some implementations are better than others.
Finally, don’t fear schemas! Seriously, structure is good. No matter how you access your data and use it, you will assume some structure (it all depends on where you want to enforce that structure).
Next up in this series, I will dive into JSON with PostgreSQL followed by JSON and MongoDB. Stay tuned!
Finally, please pay attention to this bug, https://bugs.mysql.com/bug.php?id=104325, when using multi-valued indexes with prepared statements.