Tags and FullText indexes in MySQLAs a principal architect at Percona, one of my duties is to tune MySQL database servers for our customers. The tuning effort looks at every aspect of the database service like the operating system, the MySQL configuration, the schema, the queries, etc. We have well-defined processes to tune the operating system and the MySQL configuration. However, tuning the schema and the queries using it can be anywhere from trivial to extremely challenging.

The challenge with the schema and the queries is mostly with the indexes. The most common types of indexes are based on b-trees or hash lists. InnoDB doesn’t support hash indexes, a bummer for equality conditions. B-tree indexes are more general-purpose, they are decent for equality conditions and very good for range conditions. They are however quite heavy and very dependent on the defined order of columns. They are also poor for double range conditions.

Double range conditions can be handled by the GIS R-tree type index, a topic I’d like to explore in a future post. The focus of this post is FullText indexes but not for their primary use case like text documents. I want to show you how this type of index can be used to efficiently solve performance problems of a type of schema I’ll refer to as “Tag-based schema”. It is surprising how little attention fulltext indexes are receiving, I did a quick search on this blog and our latest post was from 2018.

What is a Tag-based schema?

At least from a high-level view, a database table is used to map the properties of an object or an entity to columns. The object can be a user or a relation in a social network database or a transaction in a credit card processing database. Some objects have few, very well-defined properties but others have many dozens if not hundreds of properties.  A table with a hundred columns that may need indexes is impractical and very difficult to tune. Let’s have a look at a few examples.

Example: House rental

Let’s imagine you are designing the database for a house rental application. You’ll need a table to describe the houses you are offering for rent. In such a context, houses are objects with many properties describing their basic attributes, amenities, and surroundings. Let’s try to list the main ones:

Attributes: Number of rooms, number of bathrooms, private bathrooms,  full kitchen, double beds, twin beds, parking spaces, meeting/workspace space, rating (stars), price ranges, region, etc.

Amenities: Air conditioning, swimming pool, jacuzzi, fireplace, pool table, wifi, home theater, cable tv, nice view, quiet (or not), non-smoking (or not),  etc.

Surroundings: Beach, hiking, walking, fitness, skiing, restaurants, Bars, Spectacles, Theaters, parks, distance from the city center, distance from the airport, etc.

Example: Tree classification

A very different example is a database listing species of trees and how to identify them based on their general characteristics, their bark, leaves, seeds, etc.

Bark: Smooth, paper, peeling, ridges, scale, etc.

Leaves: SimpleSym, simpleAsym, multi-opposite, multi-alternate, multi-whorled, Needle flat, Needle round, lobed, toothed, etc

Seeds: Fruits, Nuts, Samaras, Acorns, Cones, Pods, etc.

These types of objects are difficult to map. Other than the wide approach (many columns) we discussed above, the other traditional approach is the entity value schema. The entity value schema uses one row per attribute, instead of being wide, we can say it is long. The entity value schema is flexible but also presents a number of tuning challenges. Let’s explore an alternative using fulltext indexes and tags. In this alternative, you store the attributes word elements (tags), concatenating them using a delimiter into a TEXT column.

Tags with a fulltext index

First, we need to define a dictionary table for the attributes/tags:

The table contains all the tags defined. A description is a human-readable version of the tag,  GroupTags allows to set exclusive groups (only one value possible). The above table is a simple example but it can easily be expanded.

Here’s a sample tag list for a house or apartment rental site. Keep in mind I am not specialized at all in that type of business, I could easily be missing critical aspects.  More realistically, there could be many hundreds of tags.

We will store our properties to rent in this table:

In order to help manipulate the tags in a consistent way, we’ll create four handy SQL functions, hasTag, AddTag, AddTags, and removeTag.  The SQL code for these functions is available on my GitHub.  With these functions, let’s add a few properties.

Now, we can use the fulltext index on tags to search the table. Let’s say you want to find out which properties have a Jacuzzi:

Do you also need to have a wifi connection?

Then you realize a swimming pool could replace the Jacuzzi:

The weakness of this approach is that some attributes include other ones.  For example, if you want a Property with at least three bedrooms, you’ll need to specify the additional attributes to include higher counts of bedrooms:

Other implementations

My above example is fairly simple. As mentioned previously, there are two other implementations I can think of: wide and entity.  Let’s look at what would be the equivalent queries of these alternate implementations.

Wide (many columns):

Entity (vertical):

Both of these implementations have their own tuning challenges.  The wide implementation is easy to deploy but hard to maintain and nearly impossible to index properly given the large number of columns that can be part of the where clause. You can easily expand the Entity implementation but then, the PropertyAttributes table will grow quite large. You must also have one column for every supported data type.

Conclusion

In this post, we used fulltext indexes to solve issues with schema mapping objects with a large number of attributes. This approach, using text tags to map the attributes is quite efficient and elegant. It is much easier to maintain than a wide approach mapping attributes to columns. It is also more efficient than the entity value schema where every attribute is a row from a table.

Fulltext indexes are less common than the b-tree indexes around which InnoDB is built. With the GIS R-tree indexes, there are among the less used index types of MySQL. Yet, they can be useful out of their intended initial scope, as we demonstrated in this post.

In a future post, I’d like to explore the use of R-tree indexes to solve another common SQL problem, the double range condition. If you are interested in unorthodox usages of database features, stay posted.

Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments