JSON Structure, The Next Generation (Performance, Scale, and Fun)

Now you know the basic JSON functionality available to you, as covered in part one of this series. Let's look deeper into performance, scale, and oddities that may occur.

You can do almost anything you could do in SQL with standard data types and objects, but you may run into specific problems and limitations.  Here is an example of that.  Let’s say I want to get the top 10 rated movies of all time.
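A minimal sketch of that query, assuming a `movies` table with a JSON column `doc` that stores each film's document and an `imdbRating` field (the table, column, and path names are my assumptions about the dataset, not confirmed by the original):

```sql
-- Pull title and rating straight out of the JSON document and sort.
SELECT doc->>'$.title'      AS title,
       doc->>'$.imdbRating' AS rating
FROM   movies
ORDER  BY doc->>'$.imdbRating' DESC
LIMIT  10;
-- On a large table this can fail with:
-- ERROR 1038 (HY001): Out of sort memory, consider increasing server sort buffer size
```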

In this case, the size required to sort the data set is much larger than the available sort size.  Selecting this data from a normalized structure would not cause this (nor would generated columns).
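One way to work around the sort failure is to raise the session's sort buffer; a sketch (the 32MB value is illustrative, not a tuning recommendation):

```sql
-- Give this session more room to sort; resets when the session ends.
SET SESSION sort_buffer_size = 32 * 1024 * 1024;  -- 32MB
```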

So we solved the issue of not having enough space to sort the dataset, but did you notice the time?  Almost 1 second.  No website you are working on nowadays can succeed if all your database queries take 0.5-1 second to return.  Yes, you can put a cache in front of these queries, and you probably will.  A better cache solves it all, right?  Well, not really.  In this case, you can speed things up drastically with a few small but important improvements.  Before moving on, did you notice the warnings?  Let's ignore those for a moment; they are caused by the null values in the rating column, and they are not impacting performance.

Let's look at generated columns.  Generated columns allow you to create columns computed on the fly from other columns.  (You can also use functional indexes, which I will cover later.)  In the case of JSON, we can pull values out of our document and store them, read-only, in a column that can be indexed (we all know indexing is good, right!).  Let's do this!  I am going to create a table with generated columns for various fields, then load the data via an insert from our already loaded movies table.
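A sketch of what that table might look like; the table name, column names, and JSON paths are assumptions about the dataset:

```sql
-- Shadow frequently queried JSON fields in stored generated columns.
CREATE TABLE movies_gen (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  doc         JSON NOT NULL,
  imdb_id     VARCHAR(12)  GENERATED ALWAYS AS (doc->>'$.imdb.id')    STORED,
  title       VARCHAR(255) GENERATED ALWAYS AS (doc->>'$.title')      STORED,
  imdb_rating DECIMAL(5,2) GENERATED ALWAYS AS (doc->>'$.imdbRating') STORED
);

-- Only the JSON column is inserted; the generated columns fill themselves.
-- This is the step where the quoted 'null' rating trips the DECIMAL conversion.
INSERT INTO movies_gen (doc) SELECT doc FROM movies;
```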

I want to take a moment to dive into what I have seen as a common issue. As you can see, the table was created without a problem, but the insert failed when loading data from our JSON objects.  The reason?  I am storing the IMDB rating in a decimal(5,2) column, but the JSON document contains ‘null’ (quoted) as text.  Type conversions when working with JSON and MySQL can be a bit tricky.  Remember those warnings above?  They were type conversion warnings, easy to ignore since they did not appear to cause an immediate issue (but they did).  Type conversions and character sets can both be a problem when accessing JSON data.

Allow me a slight rant:  an inconsistent or rapidly changing JSON format/schema in your application can make using standard database functions difficult, if not impossible.

I often hear people talk about structure and the lack of flexibility as a massive drawback for relational databases. JSON is flexible; people love flexibility. The issue is that flexibility comes at a cost, and you will pay that cost somewhere.  If you validate and maintain a structure in your database, your code must fit that structure, and changes to the structure may be slow (database migrations).  If you use a flexible database schema, then you need to validate in your code.  That validation may be simple now, but the more your data iterates and changes as you grow, the more sprawling the code to check and validate it becomes. Whether you want to admit it or not, throwing unstructured data into the wasteland that is a document is a recipe for problems. When you store just a dump of JSON in your database, that data is sometimes complex to access or manipulate without writing code and having access to that validation.

In this case, the type conversion is pretty straightforward, and you can solve this issue in a couple of different ways. Ultimately it is about ensuring the JSON value is converted correctly. I will use the JSON_VALUE function to unquote the null.
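A sketch of the idea, assuming MySQL 8.0.21+ (where JSON_VALUE is available) and the same hypothetical `$.imdbRating` path:

```sql
-- Without NULL ON ERROR, the quoted text "null" breaks the DECIMAL
-- conversion; with it, the value simply becomes SQL NULL.
SELECT doc->>'$.imdbRating' AS raw_value,
       JSON_VALUE(doc, '$.imdbRating'
                  RETURNING DECIMAL(5,2) NULL ON ERROR) AS clean_value
FROM   movies
LIMIT  5;
```

The same JSON_VALUE expression can be used as the generated column's definition so the insert no longer fails.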

Now let's compare searching for a movie, first using the imdb_id from the JSON document and then using the indexed column on the generated-column table:
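The two queries might look like this (paths, table names, and the sample IMDB id are illustrative assumptions):

```sql
-- Full scan, extracting the id from every JSON document:
SELECT doc->>'$.title'
FROM   movies
WHERE  doc->>'$.imdb.id' = 'tt4154796';

-- Indexed generated column on the shadow table:
SELECT title
FROM   movies_gen
WHERE  imdb_id = 'tt4154796';
```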

Great!  Let’s go back to our example using IMDB rating to get the top 10 movies.  To make this faster, we will need to first create an index on the generated column.
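Something along these lines, assuming the generated-column table and rating column sketched earlier:

```sql
-- Index the generated rating column so the ORDER BY can use it.
ALTER TABLE movies_gen ADD INDEX idx_imdb_rating (imdb_rating);
```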

With that out of the way, let’s get the top 10 list:
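Assuming the indexed generated column from the previous step:

```sql
SELECT title, imdb_rating
FROM   movies_gen
ORDER  BY imdb_rating DESC
LIMIT  10;
```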

A very nice drop, from 0.78 seconds to 0.01 seconds!  But wait… why is the data different?  Ahhh, glad you noticed!  As discussed above, pulling data out of JSON often requires some type of conversion.  By default, values coming out of JSON are treated as text, not numbers, so it was sorting based on the ASCII value (oops).  You can get the correct results by forcing the type conversion:
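A sketch of the forced conversion against the raw JSON (same hypothetical paths as before):

```sql
-- CAST makes the sort numeric instead of lexicographic.
SELECT doc->>'$.title' AS title,
       CAST(doc->>'$.imdbRating' AS DECIMAL(5,2)) AS rating
FROM   movies
ORDER  BY rating DESC
LIMIT  10;
```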

So not only can you significantly speed up performance by indexing generated columns, you can also ensure that you are getting consistent types and the expected results.  The documentation has a detailed section on ordering and grouping JSON values. Generated columns also improve most of the other queries we showed above.

Functional Indexes Without Generated Columns

Generated columns work well enough, but for our queries (in MySQL 8.0.13 or later), we can create indexes directly on the JSON expressions we call regularly and forgo the generated columns altogether.
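A sketch of a functional index on the extracted id; note that for string values the CAST length and a binary collation matter, and the query must use the identical expression to hit the index (names and paths are assumptions):

```sql
-- Functional index on the JSON expression itself (MySQL 8.0.13+).
ALTER TABLE movies ADD INDEX idx_imdb_id (
  (CAST(doc->>'$.imdb.id' AS CHAR(12)) COLLATE utf8mb4_bin)
);

-- Only an exact match on the same expression can use this index:
SELECT doc->>'$.title'
FROM   movies
WHERE  CAST(doc->>'$.imdb.id' AS CHAR(12)) COLLATE utf8mb4_bin = 'tt4154796';
```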

This works well enough if you need an exact match; if you need a LIKE or wildcard search, the functional index won't be used. As you explore functional indexes with JSON, be mindful that there are character set and collation restrictions, as well as restrictions on which functions can be used, so your mileage will vary.  I will avoid a deep dive here, but you can review the documentation at https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-functional-key-parts and https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html#json-column-indirect-index.  In my testing, the more complicated or nested the JSON was, the more problems I ran into.  For this reason, I stuck with generated columns for the examples above.

Multi-Valued Indexes

As of 8.0.17, MySQL also supports multi-valued indexes, a type of secondary index that allows you to index an array within your JSON.  (This does not appear to support arrays of characters at this time; you get the helpful error “This version of MySQL doesn’t yet support ‘CAST-ing data to array of char/binary BLOBs’”.)  This can help with straightforward JSON designs, but as my JSON became more nested and complex, I ran into problems.  Also be aware of bug #104325 (https://bugs.mysql.com/bug.php?id=104325), which affects multi-valued indexes used with prepared statements.
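A sketch using a numeric array (since character arrays hit the error above); the `genreIds` field is hypothetical, purely for illustration:

```sql
-- Multi-valued index over a JSON array (MySQL 8.0.17+).
-- Assumes documents contain something like "genreIds": [28, 12, 878].
ALTER TABLE movies ADD INDEX idx_genre_ids (
  (CAST(doc->'$.genreIds' AS UNSIGNED ARRAY))
);

-- MEMBER OF (and JSON_CONTAINS / JSON_OVERLAPS) can use the index:
SELECT doc->>'$.title'
FROM   movies
WHERE  28 MEMBER OF (doc->'$.genreIds');
```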

Performance Summary

Super unscientific, but these query run times hold over multiple iterations. Look at the differences speed-wise:

| Query | Access JSON Directly (seconds) | Generated Column (seconds) |
|---|---|---|
| Simple search via IMDB ID | 0.75 | 0.01 |
| Search for Avengers-titled movies | 0.76 | <0.01 |
| Update a single value within the JSON, searching via IMDB ID or title | 0.80 | <0.01 |
| Find top 10 movies of all time | 0.89 | 0.01 |
| Characters played by Robert Downey Jr. in the Avengers movies | 0.74 | <0.01 |

More Performance Needed? Normalizing Data for Query Patterns

So far, we have done a lot of simple interactions and were able to speed up access to an acceptable level.  But not everything fits within the available toolbox.  When searching for movies or ratings for a specific cast member (show me all the Avengers movies Robert Downey Jr. played in and the characters), we used the index on the title generated column to reduce the number of JSON documents we had to fully process to get the character he played.  See below:
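A sketch of that query using JSON_TABLE to explode the cast array; the `$.cast[*]` layout and field names are assumptions about the dataset:

```sql
-- Narrow rows via the indexed title column, then expand the cast array.
SELECT m.title, c.character_name
FROM   movies_gen m,
       JSON_TABLE(m.doc, '$.cast[*]'
         COLUMNS (
           actor_name     VARCHAR(255) PATH '$.name',
           character_name VARCHAR(255) PATH '$.character'
         )) AS c
WHERE  m.title LIKE 'Avengers%'
AND    c.actor_name = 'Robert Downey Jr.';
```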

But let's assume you needed to get a list of all the characters he played in his career (I will truncate the full result set).
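The same JSON_TABLE approach, but with no title filter to narrow the rows, every document's cast array must be expanded, which is what makes this query so slow (same assumed document layout as above):

```sql
-- Full scan: every document's cast array gets expanded and filtered.
SELECT m.title, c.character_name
FROM   movies_gen m,
       JSON_TABLE(m.doc, '$.cast[*]'
         COLUMNS (
           actor_name     VARCHAR(255) PATH '$.name',
           character_name VARCHAR(255) PATH '$.character'
         )) AS c
WHERE  c.actor_name = 'Robert Downey Jr.';
```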

Now our access pattern needs to start at a deeply nested element within the JSON document.  Here you are faced with a few options, but almost all of them lead to creating and maintaining a new table to get at that level of data, whether via a trigger, code that breaks out the data as it is loaded, or a batch job (or caching this evil slow query).

I ended up creating the following “Database Schema” for it:
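One possible shape for that schema (the table and column names here are my guesses, not the author's actual DDL):

```sql
-- One row per unique actor.
CREATE TABLE actors (
  actor_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  actor_name VARCHAR(255) NOT NULL,
  UNIQUE KEY uk_actor_name (actor_name)
);

-- One row per (movie, actor), with the character played.
CREATE TABLE movies_cast (
  movie_id       INT UNSIGNED NOT NULL,
  actor_id       INT UNSIGNED NOT NULL,
  character_name VARCHAR(255),
  PRIMARY KEY (movie_id, actor_id),
  KEY idx_actor (actor_id)
);
```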

When loading the JSON into MySQL, I added an actor table that has a row for each unique actor, as well as a cast table that holds the movie, the actor, and the name of the character they played in that movie (note I could optimize the structure a bit, but that's for another day).  This gives me a ton of flexibility in reporting and a major performance boost.  Now, to get all the movies Robert Downey Jr. was in and the characters he played, I can do it via:
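A sketch of the query against the normalized tables (assuming the hypothetical `actors`/`movies_cast` names sketched above and the generated-column movies table):

```sql
-- Simple indexed joins replace the per-document array expansion.
SELECT m.title, mc.character_name
FROM   actors a
JOIN   movies_cast mc ON mc.actor_id = a.actor_id
JOIN   movies_gen  m  ON m.id = mc.movie_id
WHERE  a.actor_name = 'Robert Downey Jr.';
```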

Not only is this 7 seconds faster than accessing the same data via the JSON functions, I can also use these tables to easily do things like find all the movies that had two or three particular actors together, play six degrees of separation, etc., all of which would be a challenge when accessing only the JSON directly.

Take-Aways and Lessons Learned

First, think before you store your data.  Understanding what you are storing, why you are storing it, and how you will access it is paramount. How you will access and use your data has a profound impact on the optimal database setup, design, and usage.  If you only access top-level properties in your document, a simple design relying only on MySQL's built-in JSON functions may be totally OK.  But as you dive deeper into the data and start pulling out subsets of data or correlating different documents, you may, to ensure performance and scalability, end up reducing your flexibility and building (hold your breath) an actual database schema that is normalized and everything.

Second, MySQL's JSON functions get the job done for most basic use cases.  You can get what you need, but you need to understand the implementation: what is available and what is not.  As I repeat these same processes with PostgreSQL and MongoDB, you will see where some functions are better than others.

Finally,  don’t fear schemas!  Seriously, structure is good.  No matter how you access your data and use it, you will assume some structure (it all depends on where you want to enforce that structure).  

Next up in this series, I will dive into JSON with PostgreSQL followed by JSON and MongoDB. Stay tuned!
