Compaction

Compaction nodetool commands

The nodetool utility provides a number of commands related to compaction:

compactionstats
Statistics about current and pending compactions.
compactionhistory
Lists details about recently completed compactions.

Common options

There are a number of common options for all the compaction strategies:

enabled (default: true)
Whether minor compactions should run. Note that compactions can also be stopped and started at runtime, per node, with ‘nodetool disableautocompaction’ and ‘nodetool enableautocompaction’.
tombstone_threshold (default: 0.2)
How large a fraction of the sstable must be tombstones before we consider doing a single-sstable compaction of that sstable.
tombstone_compaction_interval (default: 86400s (1 day))
Since it might not be possible to drop any tombstones when doing a single-sstable compaction, we need to make sure that the same sstable is not constantly being recompacted - this option states how often we should retry a given sstable.
min_threshold (default: 4)
Lower limit on the number of sstables before a compaction is triggered. Not used for LeveledCompactionStrategy.
max_threshold (default: 32)
Upper limit on the number of sstables before a compaction is triggered. Not used for LeveledCompactionStrategy.

Further, see the section on each strategy for specific additional options.
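
As a rough sketch of how these options are set per table (the keyspace and table name ks.events are placeholders, and the values shown are simply the defaults listed above), the common options go into the table’s compaction map in CQL:

    ALTER TABLE ks.events WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',  -- a class is always required
        'enabled': 'true',
        'tombstone_threshold': '0.2',
        'tombstone_compaction_interval': '86400',
        'min_threshold': '4',
        'max_threshold': '32'
    };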

Size Tiered Compaction Strategy

The basic idea of SizeTieredCompactionStrategy (STCS) is to merge SSTables of approximately the same size. All SSTables are put in different buckets depending on their size. An SSTable is added to a bucket if its size is within bucket_low and bucket_high times the current average size of the SSTables already in the bucket. This will create several buckets, and the most interesting of those buckets will be compacted. The most interesting one is decided by figuring out which bucket’s SSTables take the most reads.

Major compaction

When running a major compaction with STCS, you will end up with two SSTables per data directory (one for repaired data and one for unrepaired data). There is also an option (-s) to do a major compaction that splits the output into several SSTables; the sizes of these SSTables are approximately 50%, 25%, 12.5%, … of the total size.

STCS options

min_sstable_size (default: 50MB)
SSTables smaller than this are put in the same bucket.
bucket_low (default: 0.5)
How much smaller than the average size of a bucket an SSTable can be before it is excluded from the bucket. That is, if bucket_low * avg_bucket_size < sstable_size (and the bucket_high condition holds, see below), then the SSTable is added to the bucket.
bucket_high (default: 1.5)
How much bigger than the average size of a bucket an SSTable can be before it is excluded from the bucket. That is, if sstable_size < bucket_high * avg_bucket_size (and the bucket_low condition holds, see above), then the SSTable is added to the bucket.
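
For illustration only, a sketch with a placeholder table and the default values above (note that min_sstable_size is specified in bytes, so the 50MB default is written as 52428800):

    ALTER TABLE ks.events WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'min_sstable_size': '52428800',  -- 50MB expressed in bytes
        'bucket_low': '0.5',
        'bucket_high': '1.5'
    };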

Defragmentation

Defragmentation is done when many SSTables are touched during a read. The result of the read is put into the memtable so that the next read will not have to touch as many SSTables. This can cause writes on a read-only cluster.

Leveled Compaction Strategy

The idea of LeveledCompactionStrategy (LCS) is that all SSTables are put into different levels where we guarantee that no overlapping SSTables are in the same level. By overlapping we mean that the token range (first to last token) of one SSTable never overlaps with that of another SSTable in the same level. This means that for a SELECT we will only have to look for the partition key in a single SSTable per level. Each level is 10x the size of the previous one and each SSTable is 160MB by default. L0 is where SSTables are streamed/flushed - no overlap guarantees are given here.

When picking compaction candidates we have to make sure that the compaction does not create overlap in the target level. This is done by always including all overlapping SSTables in the next level. For example, if we select an SSTable in L3, we need to guarantee that we pick all overlapping SSTables in L4 and make sure that no currently ongoing compactions will create overlap if we start that compaction. We can start many parallel compactions in a level if we guarantee that we won’t create overlap. For L0 -> L1 compactions we almost always need to include all L1 SSTables since most L0 SSTables cover the full range. We also can’t compact all L0 SSTables with all L1 SSTables in a single compaction since that could use too much memory.

When deciding which level to compact, LCS checks the higher levels first (with LCS, a “higher” level is one with a higher number, L0 being the lowest one), and if a level is behind, a compaction will be started in that level.

Major compaction

It is possible to do a major compaction with LCS - it will currently start by filling out L1 and then, once L1 is full, continue with L2 and so on. This is suboptimal and will be changed to create all the SSTables in a high level instead.

Bootstrapping

During bootstrap, SSTables are streamed from other nodes. The level of the remote SSTable is kept to avoid many compactions after the bootstrap is done. During bootstrap the new node also takes writes while it is streaming the data from a remote node - these writes are flushed to L0 like all other writes, and to avoid those SSTables blocking the remote SSTables from going to the correct level, we only do STCS in L0 until the bootstrap is done.

STCS in L0

If LCS accumulates very many L0 SSTables, reads are going to hit all (or most) of the L0 SSTables, since they are likely to be overlapping. To remedy this more quickly, LCS does STCS compactions in L0 if there are more than 32 SSTables there. This should improve read performance more quickly than letting LCS do its L0 -> L1 compactions. If you keep getting too many SSTables in L0, it is likely that LCS is not the best fit for your workload and STCS could work out better.

Starved SSTables

If a node ends up with a leveling where there are a few very high level SSTables that are not getting compacted, they might make it impossible for lower levels to drop tombstones etc. For example, if there are SSTables in L6 but there is only enough data to actually fill L4 on the node, the leftover SSTables in L6 will get starved and not compacted. This can happen if a user changes sstable_size_in_mb from 5MB to 160MB, for example. To avoid this, LCS tries to include those starved high level SSTables in other compactions if there have been 25 compaction rounds where the highest level has not been involved.

LCS options

sstable_size_in_mb (default: 160MB)
The target compressed (if using compression) SSTable size - the SSTables can end up being larger if there are very large partitions on the node.
fanout_size (default: 10)
The target size of each level is fanout_size times that of the previous level. You can reduce the space amplification by tuning this option.
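
A minimal sketch, again with a placeholder table name and the default values above (fanout_size is only available in versions that list it as an option):

    ALTER TABLE ks.events WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': '160',
        'fanout_size': '10'
    };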

Time Window Compaction Strategy

TimeWindowCompactionStrategy (TWCS) is designed specifically for workloads where it’s beneficial to have data on disk grouped by the timestamp of the data, a common goal when the workload is time-series in nature or when all data is written with a TTL. In an expiring/TTL workload, the contents of an entire SSTable likely expire at approximately the same time, allowing them to be dropped completely, and space reclaimed much more reliably than when using SizeTieredCompactionStrategy or LeveledCompactionStrategy. The basic concept is that TimeWindowCompactionStrategy will create one SSTable per time window, where a window is simply calculated as the combination of two primary options:

compaction_window_unit (default: DAYS)
A Java TimeUnit (MINUTES, HOURS, or DAYS).
compaction_window_size (default: 1)
The number of units that make up a window.

Taken together, the operator can specify windows of virtually any size, and TimeWindowCompactionStrategy will work to create a single SSTable for writes within that window. For efficiency during writing, the newest window will be compacted using SizeTieredCompactionStrategy.

Ideally, operators should select a compaction_window_unit and compaction_window_size pair that produces approximately 20-30 windows - if writing with a 90-day TTL, for example, a 3-day window would be a reasonable choice ('compaction_window_unit': 'DAYS', 'compaction_window_size': 3).
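
That example could look roughly as follows in CQL (ks.sensor_data is a placeholder table name, and default_time_to_live is included only to make the 90-day TTL explicit):

    ALTER TABLE ks.sensor_data
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '3'
    }
    AND default_time_to_live = 7776000;  -- 90 days in seconds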

TimeWindowCompactionStrategy Operational Concerns

The primary motivation for TWCS is to separate data on disk by timestamp and to allow fully expired SSTables to drop more efficiently. One potential way this optimal behavior can be subverted is if data is written to SSTables out of order, with new data and old data in the same SSTable. Out of order data can appear in two ways:

  • If the user mixes old data and new data in the traditional write path, the data will be commingled in the memtables and flushed into the same SSTable, where it will remain commingled.
  • If the user’s read requests for old data cause read repairs that pull old data into the current memtable, that data will be commingled and flushed into the same SSTable.

While TWCS tries to minimize the impact of commingled data, users should attempt to avoid this behavior. Specifically, users should avoid queries that explicitly set the timestamp via the CQL USING TIMESTAMP clause. Additionally, users should run frequent repairs (which stream data in such a way that it does not become commingled), and disable background read repair by setting the table’s read_repair_chance and dclocal_read_repair_chance to 0.
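
As a sketch of that last point, on versions where these table-level read repair options exist (ks.sensor_data is again a placeholder):

    ALTER TABLE ks.sensor_data
    WITH read_repair_chance = 0.0
    AND dclocal_read_repair_chance = 0.0;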

Changing TimeWindowCompactionStrategy Options

Operators wishing to enable TimeWindowCompactionStrategy on existing data should consider running a major compaction first, placing all existing data into a single (old) window. Subsequent newer writes will then create typical SSTables as expected.

Operators wishing to change compaction_window_unit or compaction_window_size can do so, but may trigger additional compactions as adjacent windows are joined together. If the window size is decreased (for example, from 24 hours to 12 hours), the existing SSTables will not be modified - TWCS cannot split existing SSTables into multiple windows.
