Compaction Strategies

Scylla’s write path follows the well-known Log-Structured Merge (LSM) design for efficient writes that are immediately available for reads. Scylla is not the first project to use this method; one of the first projects to popularize the technique was the Lucene search engine, in 1999.

Scylla writes its updates to a memory table (MemTable), and when that becomes too big, it is flushed to a new file. This file is sorted to make it easy to search and later merge. This is why the tables are known as Sorted String Tables or SSTables.

[Figure: data writes flow from the MemTable to an SSTable]
MemTable
An in-memory data structure that serves both reads and writes. Once full, the MemTable is flushed to an SSTable.
SSTable
A concept borrowed from Google Bigtable, SSTables (Sorted String Tables) store a series of immutable rows, where each row is identified by its row key.
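
A toy sketch of this write path (the class, file format, and flush threshold below are illustrative, not Scylla’s actual internals):

    import json

    class MemTable:
        """Toy in-memory write buffer that flushes to a sorted file when full."""

        def __init__(self, flush_threshold=4):
            self.entries = {}               # key -> value; the newest write wins
            self.flush_threshold = flush_threshold
            self.flushed = 0

        def write(self, key, value):
            self.entries[key] = value
            if len(self.entries) >= self.flush_threshold:
                self.flush()

        def flush(self):
            # Entries are written in sorted key order -- hence "Sorted String Table".
            path = f"sstable-{self.flushed}.jsonl"
            with open(path, "w") as f:
                for key in sorted(self.entries):
                    f.write(json.dumps([key, self.entries[key]]) + "\n")
            self.flushed += 1
            self.entries.clear()            # start fresh after the flush

    mt = MemTable()
    for i in (3, 1, 4, 2):                  # the fourth write triggers a flush
        mt.write(f"key-{i}", f"value-{i}")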

In time, two major problems start to appear. First, data written to one SSTable and later modified or deleted in another wastes space, because both versions remain in the system. Second, when data is split across many SSTables, read requests are processed more slowly, as many SSTables need to be read. Scylla mitigates the second problem by using a bloom filter and other techniques to avoid reading from SSTables that do not include the desired partition. However, as the number of SSTables grows, so inevitably does the number of disk reads required by each read query.

For these reasons, as soon as enough SSTables have accumulated, Scylla performs a compaction.

Compaction merges several SSTables into one new SSTable which contains only the live data from the input SSTables. Merging several sorted files to get a sorted result is an efficient process, and this is the main reason why SSTables are kept sorted.
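
As an illustration, a compaction can be modeled as a heap-based k-way merge that keeps only the newest version of each key. The (key, timestamp, value) layout below is a simplification; real SSTables carry per-cell timestamps and tombstones:

    import heapq

    def compact(sstables):
        """Merge sorted runs of (key, timestamp, value) entries, keeping only
        the newest version of each key."""
        result = []
        for key, ts, value in heapq.merge(*sstables):   # cheap because inputs are sorted
            if result and result[-1][0] == key:
                if ts > result[-1][1]:                  # same key again: keep the newer version
                    result[-1] = (key, ts, value)
            else:
                result.append((key, ts, value))
        return result

    old = [("a", 1, "x"), ("b", 1, "y")]
    new = [("a", 2, "x2"), ("c", 2, "z")]
    print(compact([old, new]))   # [('a', 2, 'x2'), ('b', 1, 'y'), ('c', 2, 'z')]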

By default, Scylla starts a compaction task whenever a new SSTable is written. Compaction lowers the total storage used and prevents SSTables from accumulating unchecked. Compaction will not start while the system is shutting down, and any compaction in progress is stopped on shutdown.

Compaction
The process of reading several SSTables, comparing their data and timestamps, and then writing one SSTable containing the merged, most recent information.
Compaction Strategy
A compaction strategy is what determines which of the SSTables will be compacted, and when.

Scylla implements the following compaction strategies, each described below:

  • Size-tiered Compaction Strategy (STCS)
  • Leveled Compaction Strategy (LCS)
  • Time-window Compaction Strategy (TWCS)
  • Date-tiered Compaction Strategy (DTCS)

Amplification

A main aim of compaction is to reduce amplification. Amplification causes bottlenecks and poor performance. There are three types of amplification:

Read Amplification (RA)
Excessive read requests that require reading from many SSTables. Read amplification is measured as the number of disk reads per query. High RA occurs when many pages have to be read in order to answer a query.
Write Amplification (WA)
Excessive rewriting of the same data during compaction. Write amplification is measured as the ratio of bytes written to storage versus bytes written to the database. High WA occurs when more bytes per second are written to storage than are actually written to the database.
Space Amplification (SA)
Excessive disk space usage, requiring the disk to be larger than a perfectly-compacted representation of the data (i.e., all the data in one single SSTable). Space amplification is measured as the ratio of the size of the database files on disk to the actual data size. High SA occurs when the disk space used exceeds the size of the data.
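
For example, write and space amplification follow directly from these definitions (the byte counts below are invented for illustration):

    # Hypothetical measurements for one workload.
    bytes_written_to_db = 10 * 2**30        # 10 GiB of user writes
    bytes_written_to_storage = 35 * 2**30   # flushes plus repeated compaction rewrites
    db_files_on_disk = 14 * 2**30           # total SSTable footprint on disk
    actual_data_size = 10 * 2**30           # perfectly-compacted data size

    write_amplification = bytes_written_to_storage / bytes_written_to_db   # 3.5
    space_amplification = db_files_on_disk / actual_data_size              # 1.4
    print(f"WA = {write_amplification:.1f}, SA = {space_amplification:.1f}")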

Size-tiered Compaction Strategy (STCS)

Size-tiered compaction is triggered when the system detects that there are enough (four by default) similarly sized SSTables. Once triggered, these SSTables are merged into one larger SSTable. As time progresses and several large SSTables have accumulated, they are merged to form one even larger SSTable, and so on.

This means that the system has several size tiers (small SSTables, large SSTables, even larger SSTables), and each tier contains roughly the same number of files. When one tier is full (its threshold has been reached), the system merges all of its tables into one SSTable of roughly the size of the tables in the next tier.

[Figure: size-tiered compaction]

The min_threshold parameter sets how many similarly sized SSTables must accumulate in a tier before they are compacted. If rows are written once and never modified (or written a few times and then not modified again), each row eventually ends up written as a whole into one compacted SSTable.
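
A simplified sketch of this trigger logic (the similarity rule below is a stand-in for the strategy’s real bucketing options, such as bucket_high and bucket_low):

    def pick_compaction_candidates(sstable_sizes, min_threshold=4, bucket_high=1.5):
        """Group SSTables of similar size into tiers; any tier holding at least
        min_threshold tables is a compaction candidate (simplified STCS)."""
        tiers = []                          # each tier is a list of SSTable sizes
        for size in sorted(sstable_sizes):
            if tiers and size <= (sum(tiers[-1]) / len(tiers[-1])) * bucket_high:
                tiers[-1].append(size)      # similar in size to the last tier
            else:
                tiers.append([size])        # too big: start a new tier
        return [tier for tier in tiers if len(tier) >= min_threshold]

    sizes_mb = [10, 11, 9, 10, 160, 150, 1600]
    print(pick_compaction_candidates(sizes_mb))   # [[9, 10, 10, 11]]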

Size-tiered compaction benefits

This is a popular strategy for LSM workloads. It results in a low and logarithmic (in size of data) number of SSTables, and the same data is copied during compaction a fairly low number of times. Use the table in Which strategy is best to determine if this is the right strategy for your needs.

Size-tiered compaction disadvantages

This strategy has the following drawbacks (particularly with writes):

  • Continuously modifying existing rows results in each row being split across several SSTables, making reads slow; this does not happen in leveled compaction.
  • Obsolete data (overwritten or deleted columns) in a very large SSTable remains for a long time, wasting space until the SSTable is finally merged.
  • Compaction requires a lot of temporary space, as the new, larger SSTable is written before the duplicates are purged. In the worst case, up to half the disk space needs to be kept empty to allow this to happen.


Leveled Compaction Strategy (LCS)

Leveled compaction uses small, fixed-size (by default 160 MB) SSTables divided into different levels. Each level is a run of SSTables (see A run of SSTables below).

[Figure: leveled compaction]

The compaction method works as follows:

  1. New SSTables (created from MemTables) are created in Level 0. All other levels are each a run of SSTables, of exponentially increasing size as follows:
    • Level 1 is a run of 10 SSTables (160 MB each table * 10)
    • Level 2 is a run of 100 SSTables (160 MB each table * 100), etc.
  2. When there are enough SSTables in Level 0, they are compacted with the 10 SSTables in Level 1. This compaction works as follows:
    • Read in parallel the 4 SSTables in Level 0 and the 10 SSTables in Level 1.
    • Write new SSTables for Level 1 (replacing the 10 old tables which were compacted).
    • Instead of creating one large SSTable (which would defeat the use of small, fixed-size tables), several tables are written: one SSTable is created, and when it reaches the size limit (160 MB), a new one is started. Because the data is merged in sorted key order, this generates a run (see A run of SSTables) with non-overlapping key ranges.
  3. If after the compaction from Level 0 into Level 1, there are more than 10 SSTables in Level 1, the excess SSTables from Level 1 are compacted and put into Level 2 as follows:
    • Take one SSTable from Level 1 (this SSTable will be deleted after the compaction)
    • Look at this SSTable’s key range, and find all SSTables in Level 2 which overlap with it. Typically, there are about 12 of these (the Level 1 SSTable spans roughly 1/10th of the keys, while each Level 2 SSTable spans roughly 1/100th of the keys, so about 10 Level 2 SSTables overlap the Level 1 SSTable’s range, plus up to two more at the edges); see the sketch after this list.
    • As before, compact the 1 SSTable from Level 1 and the 12 SSTables from Level 2 and create new SSTables in Level 2.
    • If after this compaction of Level 1 into Level 2, there are excess SSTables in Level 2 (as Level 2 can only take 100 tables), merge them into Level 3.
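
The overlap lookup in step 3 is an interval-intersection test. Here is a toy sketch with integer keys (real SSTables track partition-key ranges; the names and numbers are illustrative):

    def overlapping_level2_tables(l1_table, level2_tables):
        """Find all Level 2 SSTables whose key ranges overlap a Level 1 SSTable.
        Tables are (first_key, last_key) pairs; two ranges overlap unless one
        ends before the other begins."""
        first, last = l1_table
        return [t for t in level2_tables if not (t[1] < first or last < t[0])]

    # 100 Level 2 tables, each spanning ~1/100th of the key space: (0,9), (10,19), ...
    level2 = [(i * 10, i * 10 + 9) for i in range(100)]
    l1 = (95, 204)   # a Level 1 table spanning ~1/10th of the key space
    print(len(overlapping_level2_tables(l1, level2)))   # 12: ten full tables plus one partial at each edge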

A run of SSTables

A run is a log-structured merge (LSM) term for a large SSTable split into several smaller SSTables. In other words, a run is a collection of SSTables with non-overlapping key ranges. The benefit of a run is that when data changes, only the small individual SSTables that are affected are modified, rather than one huge SSTable.

Leveled Compaction benefits

With the leveled compaction strategy, the following benefits are noteworthy:

  • SSTable reads are efficient. The large number of small SSTables doesn’t mean we need to look up a key in that many SSTables, because the SSTables in each level have disjoint key ranges, so we only need to look in one SSTable per level (see the sketch below). In the typical case, only one SSTable needs to be read.
  • The other factors making this compaction strategy efficient are that at most 10% of space will be wasted by obsolete rows, and only enough space for ~10x the small SSTable size needs to be reserved for temporary use by compaction.
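
Because the key ranges within a level are disjoint, the single candidate SSTable for a key can be found with a binary search rather than probing every table. A toy sketch, with each table represented as a (first_key, last_key) pair (names are illustrative):

    import bisect

    def find_candidate(run, key):
        """Return the only SSTable in a run that could contain 'key', or None.
        'run' is a list of (first_key, last_key) pairs sorted by first_key."""
        starts = [first for first, _ in run]
        i = bisect.bisect_right(starts, key) - 1
        if i >= 0 and run[i][0] <= key <= run[i][1]:
            return run[i]
        return None                     # the key falls in a gap between tables

    level1 = [("a", "f"), ("g", "m"), ("n", "z")]
    print(find_candidate(level1, "k"))  # ('g', 'm')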

Use the table in Which strategy is best to determine if this is the right strategy for your needs.

Leveled Compaction disadvantages

The downside of this method is that there is twice as much I/O on writes, so it is not as good for workloads that mostly write new data.

Only one compaction operation can run on a given table at a time, so a compaction may be postponed if one is already in progress. Because the SSTables are small, this is rarely an issue.


Time-window Compaction Strategy (TWCS)

Time-window compaction strategy was introduced in Cassandra 3.0.8 for time-series data as a replacement for Date-tiered Compaction Strategy (DTCS). Time-Window Compaction Strategy compacts SSTables within each time window using Size-tiered Compaction Strategy (STCS). SSTables from different time windows are never compacted together.

The strategy works as follows:

  • A time window is defined. This window is configured by its time parameter (one day, for example).
  • SSTables created within the time window are compacted using Size-tiered Compaction Strategy (STCS).
  • Once the time window ends, all SSTables created during that window are compacted into one SSTable.
  • The final resulting SSTable is never compacted with other time-windows’ SSTables.

For example, if the time window is one day, then at the end of each day the SSTables accumulated during that day are compacted into one SSTable.
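
A minimal sketch of the windowing rule, assuming a one-day window (in practice the window is set with the TWCS options compaction_window_unit and compaction_window_size; the function below is illustrative):

    from datetime import datetime, timezone

    WINDOW_SECONDS = 24 * 60 * 60          # one-day window, as in the example

    def window_for(write_time):
        """Map a write timestamp to the start of its time window. SSTables in
        the same window may be compacted together; SSTables in different
        windows never are."""
        epoch = int(write_time.timestamp())
        return epoch - (epoch % WINDOW_SECONDS)

    t1 = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    t2 = datetime(2024, 1, 1, 23, 0, tzinfo=timezone.utc)
    t3 = datetime(2024, 1, 2, 1, 0, tzinfo=timezone.utc)
    print(window_for(t1) == window_for(t2))   # True: same day, same window
    print(window_for(t1) == window_for(t3))   # False: next day, new window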

Use the table in Which strategy is best to determine if this is the right strategy for your needs.

Time-window Compaction benefits

  • Keeps entries according to a time range, making searches for data within a given range easy to do, resulting in better read performance
  • Improves over DTCS by reducing the number of huge compactions
  • Allows you to expire an entire SSTable at once (using a TTL) as the data is already organized within a time frame

Time-window Compaction disadvantages

  • Time-window compaction is only ideal for time-series workloads


Date-tiered Compaction Strategy (DTCS)

Date-tiered compaction was introduced in Cassandra 2.1 and is designed exclusively for time-series data. This strategy is not recommended; it has been replaced by Time-window Compaction Strategy (TWCS).

Date-tiered compaction strategy works as follows:

  • First, it sorts the SSTables by time, and then compacts adjacent (time-wise) SSTables together.
  • This results in SSTables whose sizes increase exponentially as they grow older.

For example, at some point we can have the last minute of data in one SSTable (by default, base_time_seconds = 60), another minute before that in another SSTable, then the 4 minutes before that in one SSTable, then the 4 minutes before that, then an SSTable of the 16 minutes before that, and so on. This structure can easily be maintained by compaction, very similar to size-tiered compaction. When there are 4 (the default value for min_threshold) small (one-minute) consecutive SSTables, they are compacted into one 4-minute SSTable. When there are 4 of the bigger SSTables one after another (time-wise), they are merged into a 16-minute SSTable, and so on.
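
The tier spans in this example follow directly from the two parameters; a quick illustration:

    BASE_TIME_SECONDS = 60     # smallest window (default base_time_seconds)
    MIN_THRESHOLD = 4          # SSTables merged per tier (default min_threshold)

    # Each tier covers MIN_THRESHOLD times the span of the previous one:
    # 1 minute, 4 minutes, 16 minutes, 64 minutes, ...
    for tier in range(4):
        span = BASE_TIME_SECONDS * MIN_THRESHOLD ** tier
        print(f"tier {tier}: SSTables spanning {span // 60} minute(s)")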

SSTables older than max_sstable_age_days (by default, 365 days) are not compacted, as such compactions would not be useful for most queries, would be very slow, and would require huge amounts of temporary disk space.

Which strategy is best

Not every compaction strategy works well for every workload type, and the more mixed your workload, the harder it is to pick the right strategy. The following table shows what to expect from each strategy for each workload, allowing you to make a more informed decision. Keep in mind that the best choice in our testing may not be the best choice for your environment; you may have to experiment to find which strategy works best for you.

If you are unsure about the meaning of each type of amplification see the definitions above.

Compaction Strategy Matrix

Workload / Compaction Strategy    Size-tiered   Leveled   Time-Window   Comments
Write-only                        ✓             ✓         ✗             [1] and [2]
Overwrite                         ✗             ✓         ✗             [3] and [4]
Read-mostly, with few updates     ✗             ✓         ✗             [5]
Read-mostly, with many updates    ✓             ✗         ✗             [6]
Time series                       ✗             ✗         ✓             [7] and [8]

[1] Size-tiered compaction with a write-only load uses approximately 2x the peak space (SA).

[2] Leveled compaction with a write-only load writes the data twice (WA).

[3] Size-tiered compaction with an overwrite load incurs SA.

[4] Leveled compaction with an overwrite load incurs WA.

[5] Size-tiered compaction with a read-mostly load and few updates incurs SA.

[6] Leveled compaction with a read-mostly load and many updates incurs excessive WA.

[7] Size-tiered compaction with a time-series workload incurs SA, RA, and WA.

[8] Leveled compaction with a time-series workload incurs SA and WA.