Compaction

Topic: Internals

Audience: Scylla and Apache Cassandra developers

In Scylla, as in Apache Cassandra, SSTables are immutable. Mutations (adding new data, or overwriting or deleting old data) are first inserted into an in-memory table, and this memory table is periodically dumped to disk, each time to a new SSTable. No SSTable is ever modified after writing.

After many writes, the system has many separate SSTables. They might contain outdated data. For example, different SSTables might contain both an old value and a new value for the same cell, or an old value for a cell later deleted. That is fine, as the timestamps on each value make it possible to decide which value is the most recent. However, the obsolete values waste disk space. SSTable buildup also slows down reads: different SSTables can hold different columns of the same row, so a query might need to read from multiple SSTables to compose its result.

To free up disk space and speed up reads, Scylla must perform compaction. Compaction is the process of reading several SSTables and writing one new SSTable containing the merged, most recent information.
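
To make this merge rule concrete, here is a minimal Python sketch. It assumes each SSTable is modeled as a dict mapping a cell (partition key and column name) to a (timestamp, value) pair; tombstones and TTLs are ignored for brevity, and the names are illustrative, not Scylla’s actual API.

    def compact(sstables):
        """Merge several SSTables into one, keeping only the newest
        write (the one with the highest timestamp) for each cell."""
        merged = {}
        for sstable in sstables:
            for cell, (ts, value) in sstable.items():
                # The timestamp decides which value is most recent.
                if cell not in merged or ts > merged[cell][0]:
                    merged[cell] = (ts, value)
        return merged

In the real system the inputs and the output are sorted files streamed from and to disk, not in-memory dicts, but the winner-by-timestamp rule is the same.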

By default, Scylla starts a compaction task whenever a new SSTable is written. This results in lower total storage use.

Only one compaction operation on the same table can run at a time, so a compaction may be postponed while another compaction is ongoing. Compaction will not start when the system is already shutting down.

Compaction strategy

Scylla implements the same three compaction strategies as Apache Cassandra.

Size-Tiered Compaction is Apache Cassandra’s oldest and still default compaction strategy. Compaction is triggered when the system has enough (four by default) similarly sized SSTables. These are merged together to form one larger SSTable. Later, when several large SSTables have accumulated, they will be merged to form one even-larger SSTable, and so on.

This means that the system has several size tiers (small SSTables, large SSTables, even-larger SSTables), and each tier holds roughly the same number of files. When one tier is full, the system merges all its tables into one table in the next tier.
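
A minimal Python sketch of this triggering logic; the 2x “similar size” bound and the function name are assumptions for illustration (the real strategy uses tunable bucket boundaries):

    def size_tiered_buckets(sizes, min_threshold=4):
        """Group SSTable sizes (in bytes) into tiers of similarly
        sized tables; return the tiers that are ready to compact."""
        buckets = []
        for size in sorted(sizes):
            # Join the current tier if within 2x of its smallest
            # member; otherwise start a new, larger tier.
            if buckets and size <= 2 * buckets[-1][0]:
                buckets[-1].append(size)
            else:
                buckets.append([size])
        return [b for b in buckets if len(b) >= min_threshold]

For example, size_tiered_buckets([1, 1, 1, 1, 100]) reports the four small tables as one compaction candidate and leaves the lone large table alone.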

Choose the SizeTieredCompactionStrategy if rows are written once and never modified (or written a few times and then not modified again). In that case, each row will eventually end up being written as a whole to one compacted SSTable, and reads are efficient. But continuously modifying existing rows will result in each row being split across several SSTables, making reads slow. This doesn’t happen in leveled compaction (see below).

There are two other disadvantages of size-tiered compaction, even if the workload is write-mostly.

  1. Obsolete data (overwritten or deleted columns) in a very large SSTable stays behind, wasting a lot of space for a long time, until it is finally merged.
  2. Compaction requires a lot of temporary space: in the worst case, we need to merge all existing SSTables into one, so half the disk needs to be free to write the output file; only afterwards can we delete the old SSTables.

Leveled Compaction was introduced in Apache Cassandra 1.0. With leveled compaction, instead of potentially huge SSTables, the system uses small, fixed-size (by default 160 MB) SSTables divided into different “levels”. The technique works as follows:

  • New SSTables (dumped from memtables) are created in “Level 0”.
  • The other levels (all except level 0) are each a run of SSTables, of exponentially increasing size: “Level 1” is a run of 10 SSTables (of 160 MB each), “Level 2” is a run of 100 SSTables (160 MB each), etc.
  • A run of SSTables is LSM terminology (see http://en.wikipedia.org/wiki/Log-structured_merge-tree) for a set of SSTables with non-overlapping key ranges. In other words, each SSTable has a range (between its first key and its last key - remember that the keys are sorted), and these ranges are disjoint between the different SSTables in a single level.
  • A run can be thought of as a split-up huge SSTable. The benefit of a run is that while a huge SSTable must be rewritten as a whole on modification, in a run we can modify only parts of it (individual SSTables) while keeping the disjoint key requirement. This is what leveled compaction does.
  • When we have enough (4 by default) SSTables in Level 0, we compact them with all 10 SSTables in Level 1. This compaction works as follows: we read the 4 SSTables in level 0 and the 10 in level 1 in parallel, and write new SSTables for level 1 (replacing the 10 old ones we’ve compacted). We don’t create one large SSTable - rather, we write one SSTable and when we reach the size limit (160 MB), we start a new SSTable. Because the merge happens on sorted keys, the new SSTables we generate are a run, with non-overlapping key ranges.
  • Now, after the compaction of level 0 into level 1, it is possible that we have more than the desired number of SSTables (10) in level 1. In that case, we pick one excess SSTable from level 1 and compact it into level 2:
    • We take one SSTable from level 1 (this SSTable will be deleted after the compaction).
    • We look at this SSTable’s key range and find all SSTables in level 2 which overlap with it (the sketch after this list shows this overlap test).
    • Typically, there are about 12 of these (the level 1 SSTable spans roughly 1/10th of the keys, while each level 2 SSTable spans roughly 1/100th of the keys, so 10 level-2 SSTables will overlap the level-1 SSTable’s range, plus two more on the edges).
    • As before, we compact the one SSTable from level 1 and the 12 SSTables from level 2 and replace all of those with new SSTables in level 2.
  • After this compaction of level 1 into level 2, we may again have excess SSTables in level 2, so we merge them into level 3, and so on.
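
The following minimal Python sketch shows the heart of one such step: finding the SSTables in the next level that overlap a candidate’s key range. It models an SSTable as a (first_key, last_key) pair, an assumption made purely for illustration.

    def overlapping(candidate, next_level):
        """All SSTables in the next level whose key ranges overlap the
        candidate's range; these must be rewritten together with it."""
        first, last = candidate
        return [s for s in next_level if not (s[1] < first or s[0] > last)]

    # A level-1 SSTable spans roughly ten times the key range of a
    # level-2 SSTable, so with partial overlaps at both edges it
    # typically touches about 12 of them:
    level2 = [(k, k + 9) for k in range(90, 220, 10)]
    print(len(overlapping((95, 202), level2)))  # -> 12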

With the leveled compaction strategy, SSTable reads are efficient. The large number of small SSTables doesn’t mean we need to look up a key in that many SSTables, because we know the SSTables in each level have disjoint ranges, so we only need to look in one SSTable per level. In the typical case, we just need to read one SSTable.
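
A minimal sketch of this lookup, assuming each level (other than level 0) is kept as a list of (first_key, last_key, sstable) entries sorted by first key; because the ranges within a level are disjoint, a binary search finds the single candidate per level:

    import bisect

    def sstable_for_key(level, key):
        """Return the only SSTable in this level that can contain the
        key, or None; ranges within a level never overlap."""
        firsts = [first for first, _, _ in level]
        i = bisect.bisect_right(firsts, key) - 1
        if i >= 0 and level[i][1] >= key:
            return level[i][2]
        return None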

The other factors making this compaction strategy efficient are that at most 10% of space will be wasted by obsolete rows, and only enough space for ~10x the small SSTable size needs to be reserved for temporary use by compaction.

The downside of this method is roughly twice as much I/O on writes, so it is not as good for workloads that mostly write new data.

Date-Tiered Compaction was the last compaction strategy added to Apache Cassandra (in version 2.1), designed for time series data. A time series uses the time of a data point as the clustering key. In a time-series use case, we see some common features:

  1. Clustering key and write time are correlated.
  2. Data is added in time order, with only a few out-of-order writes, typically displaced by just a few seconds.
  3. Data is only deleted through TTL or by deleting an entire partition.
  4. The rate at which data is written is nearly constant.
  5. A query on a time series is usually a range query on a given partition; the most common query is of the form “values from the last hour/day/week”.

Because of the above assumptions, the system can easily know which SSTables (before compaction) contain data relevant to a particular requested time range. However, both Size-Tiered and Leveled compaction destroy this neat ordering, because they usually merge old and new data in the same output SSTable, making it no longer possible to rule out a whole “old” SSTable from a search for “new” data.

The date-tiered compaction strategy first sorts the SSTables by time, then compacts adjacent (time-wise) SSTables. The result is a set of SSTables whose sizes increase exponentially as they grow older. For example, at some point we can have the last minute of data in one SSTable (by default, base_time_seconds = 60), another minute before that in another SSTable, then the 4 minutes before that in one SSTable, then the 4 minutes before that, then an SSTable of the 16 minutes before that, and so on. This structure can easily be maintained by compaction, very similarly to what we did in size-tiered compaction: when we have 4 (the default value for min_threshold) small (one-minute) consecutive SSTables, we compact them into one 4-minute SSTable. When we have 4 of those bigger SSTables one after another (time-wise), we merge them into a 16-minute SSTable, and so on.
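
The following Python sketch is a simplification of this bucketing, not Cassandra’s exact target-window computation. It models an SSTable as a (min_timestamp, max_timestamp) pair in seconds, and puts a table in generation g if its time span fits within base_time_seconds * min_threshold**g.

    def date_tiered_candidates(sstables, base_time_seconds=60, min_threshold=4):
        """Group SSTables by generation; any generation holding
        min_threshold tables is merged into one table of the next
        (min_threshold-times-longer) generation."""
        def generation(s):
            span, g = max(s[1] - s[0], 1), 0
            while span > base_time_seconds * min_threshold ** g:
                g += 1
            return g
        groups = {}
        for s in sorted(sstables):               # oldest first
            groups.setdefault(generation(s), []).append(s)
        return [g[:min_threshold] for g in groups.values()
                if len(g) >= min_threshold]

Four one-minute tables land in generation 0 and are merged into one four-minute table, which then belongs to generation 1 - matching the 1-minute / 4-minute / 16-minute progression described above.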

Antique SSTables older than max_sstable_age_days (by default 365 days) are not compacted any more: compacting them would not be useful for most queries, would be very slow, and would require huge amounts of temporary disk space.
