For this blog post, I’ve decided to try ClickHouse: an open source column-oriented database management system developed by Yandex (it currently powers Yandex.Metrica, the world’s second-largest web analytics platform).

In my previous set of posts, I tested Apache Spark for big data analysis and used Wikipedia page statistics as a data source. Here I use the same dataset, Wikipedia Page Counts, which allows me to compare ClickHouse’s performance directly to Spark’s.

ClickHouse

I’ve spent some time testing ClickHouse on a relatively large volume of data (1.2TB uncompressed). Here is a list of ClickHouse advantages and disadvantages that I observed:

ClickHouse advantages

  • Parallel processing for a single query (utilizing multiple cores)
  • Distributed processing on multiple servers
  • Very fast scans (see benchmarks below) that can be used for real-time queries
  • Column storage is great for working with “wide” / “denormalized” tables (many columns)
  • Good compression
  • SQL support (with limitations)
  • Good set of functions, including support for approximated calculations
  • Different storage engines (disk storage format)
  • Great for structured log/event data as well as time series data (the MergeTree engine requires a date field)
  • Index support (primary key only, not all storage engines)
  • Nice command line interface with user-friendly progress bar and formatting

Here is a full list of ClickHouse features

ClickHouse disadvantages

  • No real delete/update support, and no transactions (same as Spark and most of the big data systems)
  • No secondary keys (same as Spark and most of the big data systems)
  • Own protocol (no MySQL protocol support)
  • Limited SQL support, and the joins implementation is different. If you are migrating from MySQL or Spark, you will probably have to re-write all queries with joins.
  • No window functions

Full list of ClickHouse limitations

Group by: in-memory vs. on-disk

Running out of memory is one of the potential problems you may encounter when working with large datasets in ClickHouse:

By default, ClickHouse limits the amount of memory used for group by (it uses a hash table for group by). This is easily fixed – if you have free memory, increase the max_memory_usage setting:
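For example (the value below is just an illustration; pick a limit that matches the free memory you actually have):

    SET max_memory_usage = 32000000000;  -- allow up to ~32GB of RAM per query (illustrative value)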

If you don’t have that much memory available, ClickHouse can “spill” data to disk by setting max_bytes_before_external_group_by:
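For example (again, the numbers are illustrative; note the ~2x relationship explained below):

    -- start spilling GROUP BY state to disk once it reaches ~16GB (illustrative value)
    SET max_bytes_before_external_group_by = 16000000000;
    -- and keep max_memory_usage at roughly 2x of that
    SET max_memory_usage = 32000000000;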

According to the documentation, if you need to use max_bytes_before_external_group_by, it is recommended to set max_memory_usage to ~2x the value of max_bytes_before_external_group_by.

(The reason is that the aggregation is performed in two stages: (1) reading the data and building intermediate data, and (2) merging the intermediate data. The spill to disk can only happen during the first stage. If no spill happens, ClickHouse may need the same amount of RAM for both stages.)

Benchmarks: ClickHouse vs. Spark

Both ClickHouse and Spark can be distributed. However, for the purpose of this test I’ve run a single node for both ClickHouse and Spark. The results are quite impressive.

Benchmark summary

 Size / compression           Spark v. 2.0.2                ClickHouse
 Data storage format          Parquet, compressed: snappy   Internal storage, compressed
 Size (uncompressed: 1.2TB)   395G                          212G


 Test                          Spark v. 2.0.2            ClickHouse   Diff
 Query 1: count (warm)         7.37 sec (no disk IO)     6.61 sec     ~same
 Query 2: simple group (warm)  792.55 sec (no disk IO)   37.45 sec    21x better
 Query 3: complex group by     2522.9 sec                398.55 sec   6.3x better

 

ClickHouse vs. MySQL

I wanted to see how ClickHouse compared to MySQL. Obviously, we can’t compare some workloads. For example:

  • Storing terabytes of data and querying (“crunching” would be a better word here) the data without an index. In MySQL, it would take weeks (or even months) just to load the data and build the indexes. That workload is much more suitable for ClickHouse or Spark.
  • Real-time updates / OLTP. ClickHouse does not support real-time updates / deletes.

Big data systems don’t usually give us real-time query responses: systems based on map/reduce (i.e., Hive on top of HDFS) are just too slow for real-time queries, as it takes a long time to initialize the map/reduce job and send the code to all nodes.

Potentially, you can use ClickHouse for real-time queries. It does not support secondary indexes, however. This means it will probably scan lots of rows, but it can do it very quickly.

To do this test, I’m using data from the Percona Monitoring and Management (PMM) system. The table I’m using has 150 columns, so it is a good fit for column storage. Its size in MySQL is ~250GB.
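The original size listing isn’t reproduced here; one way to check it in MySQL is via information_schema (the schema and table names below are placeholders, not the actual PMM table):

    SELECT table_name,
           round((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
      FROM information_schema.tables
     WHERE table_schema = 'pmm' AND table_name = 'query_class_metrics';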

Scanning the whole table is significantly faster in ClickHouse. Retrieving just ten rows by key is faster in MySQL (especially from memory).
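To illustrate the two access patterns (the table and column names here are placeholders, not the exact queries I ran):

    -- full scan + aggregation: ClickHouse is much faster here
    SELECT count(*) FROM query_class_metrics;

    -- fetching a handful of rows by primary key: MySQL/InnoDB wins here,
    -- especially when the pages are already in the buffer pool
    SELECT * FROM query_class_metrics WHERE query_class_id = 12345 LIMIT 10;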

But what if we only need to scan a limited number of rows and do a group by? In this case, ClickHouse may be faster. Here is an example (a real query used to create sparklines):

MySQL
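The exact production query isn’t preserved in this copy of the post; a representative sketch of this kind of sparkline query (placeholder table and column names, bucketing one day of data into hourly slots) looks like this:

    SELECT (3600 * floor(unix_timestamp(start_ts) / 3600)) AS slot,
           sum(query_count) AS cnt,
           sum(query_time_sum) AS query_time
      FROM query_class_metrics
     WHERE query_class_id = 7
       AND start_ts BETWEEN '2017-01-01 00:00:00' AND '2017-01-02 00:00:00'
     GROUP BY slot
     ORDER BY slot;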

It is relatively fast.

ClickHouse (some functions are different, so we will have to rewrite the query):
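A sketch of the ClickHouse version (same placeholder names; note the different functions, e.g. intDiv() and toUnixTimestamp() instead of MySQL’s floor() and unix_timestamp()):

    SELECT (3600 * intDiv(toUnixTimestamp(start_ts), 3600)) AS slot,
           sum(query_count) AS cnt,
           sum(query_time_sum) AS query_time
      FROM query_class_metrics
     WHERE query_class_id = 7
       AND start_ts BETWEEN toDateTime('2017-01-01 00:00:00') AND toDateTime('2017-01-02 00:00:00')
     GROUP BY slot
     ORDER BY slot;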

As we can see, even though ClickHouse scans more rows (270K vs. 11K – over 20x more), the ClickHouse query executes much faster (0.01 seconds, compared to 0.10 seconds in MySQL). The column store format helps a lot here, as MySQL has to read all 150 columns (stored inside InnoDB pages) while ClickHouse only needs to read seven columns.

Wikipedia trending article of the month

Inspired by the article about finding trending topics using Google Books n-grams data, I decided to implement the same algorithm on top of the Wikipedia page visit statistics data. My goal here is to find the “article trending this month,” i.e., the article with significantly more visits this month compared to the previous month. As I was implementing the algorithm, I came across another ClickHouse limitation: the join syntax is restricted. In ClickHouse, joins can only be written with the “using” keyword, which means the fields being joined need to have the same name on both sides. If the field names differ, we have to use a subquery.

Below is an example.

First, create a temporary table to aggregate the visits per month per page:
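A sketch of this step, assuming the page-counts table is named wikistat with columns dt (date), project, path and hits (the exact table definition from my data load isn’t reproduced here):

    CREATE TABLE wikistat_by_month ENGINE = Memory AS
    SELECT path,
           toStartOfMonth(dt) AS mon,
           sum(hits) AS tot_hits
      FROM wikistat
     WHERE project = 'en'
     GROUP BY path, mon;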

Second, calculate the actual list:
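And a sketch of the final query, comparing one month against the previous one (the months and the 10,000-hit floor are illustrative; note that the join is written as two subqueries combined with USING, per the limitation described above):

    SELECT path,
           hits_this_month,
           hits_prev_month,
           round(hits_this_month / hits_prev_month, 2) AS growth
    FROM
    (
        SELECT path, tot_hits AS hits_this_month
        FROM wikistat_by_month
        WHERE mon = toDate('2008-05-01')
    )
    ALL INNER JOIN
    (
        SELECT path, tot_hits AS hits_prev_month
        FROM wikistat_by_month
        WHERE mon = toDate('2008-04-01') AND tot_hits > 10000
    )
    USING (path)
    ORDER BY growth DESC
    LIMIT 10;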

The response time is really good, considering the amount of data ClickHouse needed to scan (the first query scanned 2.57 TB of data).

Conclusion

The ClickHouse column-oriented database looks promising for data analytics, as well as for storing and processing structural event data and time series data. ClickHouse can be ~10x faster than Spark for some workloads.

Appendix: Benchmark details

Hardware

  • CPU: 24 x Intel(R) Xeon(R) CPU L5639 @ 2.13GHz (physical = 2, cores = 12, virtual = 24, hyperthreading = yes)
  • Disk: 2 consumer-grade SSDs in software RAID 0 (mdraid)

Query 1

ClickHouse:
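The original query text isn’t preserved in this copy of the post; a representative sketch against the hypothetical wikistat table described above:

    SELECT count(*) FROM wikistat;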

Spark:
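On the Spark side, the equivalent Spark SQL, assuming the Parquet data is registered as a view with the same name:

    SELECT count(*) FROM wikistat;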

Query 2

ClickHouse:
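A sketch of a simple group by over the same hypothetical table (the year is illustrative):

    SELECT toMonth(dt) AS mon,
           count(*) AS cnt
      FROM wikistat
     WHERE toYear(dt) = 2008
     GROUP BY mon
     ORDER BY mon;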

Spark:
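The equivalent Spark SQL sketch (note the different date functions):

    SELECT month(dt) AS mon,
           count(*) AS cnt
      FROM wikistat
     WHERE year(dt) = 2008
     GROUP BY month(dt)
     ORDER BY mon;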

Query 3

ClickHouse:
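A sketch of a heavier group by with more aggregation (column names are hypothetical, as above):

    SELECT path,
           count(*) AS cnt,
           sum(hits) AS tot_hits
      FROM wikistat
     WHERE project = 'en'
     GROUP BY path
     ORDER BY tot_hits DESC
     LIMIT 100;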

Spark:
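And the equivalent Spark SQL sketch:

    SELECT path,
           count(*) AS cnt,
           sum(hits) AS tot_hits
      FROM wikistat
     WHERE project = 'en'
     GROUP BY path
     ORDER BY tot_hits DESC
     LIMIT 100;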

 
