Archiving MySQL and MongoDB DataThis post discusses archiving MySQL and MongoDB data and determining what, when and how to archive data.

Many people store infrequently used data. This data is taking up storage space and might make your database slower than it could be. Archiving data can be a huge benefit, both regarding the performance impact and storage savings.

Why archive MySQL and MongoDB Data?

One of the reasons for archiving data is freeing up space on your database volumes. You can store archived data on slower, less expensive storage devices, and current data on the faster database drives. Archiving old data makes backups and restores run faster since they need to process less data. Last, but by no means least, archiving data has the benefit of making your queries perform more efficiently since they do not need to process through old data.

What do you archive?

That is the big question. Archiving too much is just as detrimental as not archiving enough (or at all). As you’ll see, finding this balance requires foresight and planning. Fortunately, you can tweak your archiving scheme to make it work better as time goes by,

Some people feel that keeping all the data in their database, even if they don’t access that data frequently, is the best way to go. If you are lucky enough to have vast quantities of storage, and a database that is performing well, keeping all of the data in your database might be a good idea. Even with lots of storage, archiving off some data that you don’t use regularly might have advantages. We all know someone whose desk is piled with stacks of paper. When they need something, they tell us that they know where everything is. Even if they can find the requested item, they need to work through the piles of stuff to locate it. They also have to decide where to put new items so that they can be easily found. In your database, this equates to slower queries and potentially slower writes. Clearing out some of the less frequently accessed data will have a beneficial effect overall.

At the other end of the spectrum are the people who want to archive in a manner that is too aggressive. This means that any requests for data must access the archive location This might be slower and more burdensome, causing the queries to run slowly. In addition, new data written into the database will have to go through an archive process fairly quickly, which might slow down the database. This is the person who puts each and every item they own into storage. It makes for a clean home, but it’s tough to find many of the items that you own. In our database, this means that most queries are run against archived data, and archive processes are frequently running. This too can slow down performance overall.

The best archiving solution is one that meets both the needs of efficient use of storage and efficiency of queries and inserts. You want to be able to write new data quickly, access frequently used data promptly, and still be able to get the information that might not often be used. There is no simple answer here: each company will have different needs and requirements. For some companies, regulations might govern how long data must be stored. With these sorts of requirements in place, you should look to place data that isn’t accessed often on a storage medium that is lower in cost (and often slower in performance). It is still there, but it is not crowding out the more commonly used data. Other companies might query or manipulate data shortly after it is loaded into the database, and they might be able to archive more often.

When do you archive?

This is another big consideration. Do you archive data daily, weekly, monthly, annually or on some other schedule? The basic answer is that it doesn’t matter what the schedule is. It matters that there is some sort of schedule, and that archiving is happening as expected. Keeping to a schedule allows everyone to know that the data is being archived as expected, and will avoid any “gee, we completely forgot about doing that” issues from arising.

Frequent archiving (daily or weekly) is good when you have high data volumes and normally need to access only the latest data in your queries. Think of stock data. Queries to pull trade volumes and pricing over a short time period are more frequent than queries that would analyze a stock’s performance over time. Therefore, archiving old data can be helpful since it keeps the frequently accessed table’s data easily accessible, but still accommodates the need to get at data for longer time spans. With high data volume, you might need to archive often so that one archive process can complete before another is started.

Less frequent archiving might be used when you have longer term projects or if you find your current database is performing reasonably well. In these cases, archiving monthly, quarterly, or annually might make sense. This is like cleaning out your garage or attic. You might do it, but you probably don’t do it every week. The amount of stuff being stored, along with the space to store it in, might determine how often you do this type of cleanup.

How do you go about archiving MySQL and MongoDB data?

There are lots of possibilities here as well. If well planned, it can be an easy implementation. But like many things, figuring this out is usually done once things have gotten a little out of control.

You can archive data using a standard backup, moving it to another table in the database, exporting to a flat file, or moving it to another database altogether. The end goal is a cleaner production environment that still allows access to the archived data if it is needed. The method for performing the archive determines the method used to bring that data back to a state in which it can be queried. One of the considerations must be how much time you are willing and able to invest in making that data available again.

  1. You can use your standard backup method to create and manage your archive, but this is a solution that is cumbersome and prone to error. You can perform a backup and then delete the unwanted data from your table(s). Now, the deleted data is only stored in your backup and must be restored in order to be queried. You should restore to another database for this purpose so that you keep your production environment clean. With this option, you also have to consider the methods for recovering space used by deleted files. This opens to the possibility of someone restoring to the original database, which can cause a host of problems. With MongoDB, there is an optional –archive option that moves the data to an archive location that you specify. MongoDB version 3.2 added this option.
  2. Another possibility is to move the data to another MySQL table or MongoDB collection in the existing database (i.e., moving from the transactions table to transactions_archived). This is a fast and efficient way to backup the data, and it allows for easy querying since the data still resides in the database. Of course, this assumes that you have enough storage space to accommodate the active and the archive tables/collections.
  3. You can also export the data to be archived to a flat file and then delete it from the original table or collection. This is workable if the data needs to be kept available but is unlikely to be regularly needed. (It will need to be imported in order to query it.) This method also comes with all the caveats about needing to delete and recover the space of the archived records, issues with importing into the original database (and ruining all the good archiving work you’ve done, and the possibility of deleting the flat file.
  4. Alternatively, you can move the data to another database. This too can be an effective method for archiving data, and can also allow that data to be made available to others for query analysis. Once again, all warnings about recovering the space apply, but the good thing here is that the data does not need to be restored to be queried. It is simply queried through the other database.

pt-archiver

Another option for archiving MySQL data is a tool like pt-archiver. pt-archiver is a component of the Percona Toolkit that nibbles old data from a source table and moves it to a target table. The target can be in the current or an archive database. It is designed for use with an up-and-running database. It has minimal impact on the overall performance of the database. It is part of the Percona Toolkit, so it is available as an open source download. It has the benefit of slowly working to move the data and is always running. This allows it to archive data regularly and cleanly. One warning is that it does delete the data from the source table, so you should test it before running it in production. pt-archiver works with MySQL data only. It is also important to note that removing large quantities of data might cause InnoDB fragmentation. Running OPTIMIZE TABLE to recover the space resolves this. As of version 5.7.4, this is no longer a locking action.

So now what?

Unless you are in the enviable position where archiving MySQL and MongoDB data isn’t an issue, the first step is to come up with an archiving scheme. This will likely involve many different people since there can be an impact across the entire organization. Determine what can and should be archived, and then determine how best to archive the data. Document the process and test it before putting it into production. In the end, your database and your users will thank you.