SSTable Compression

Chunked Compression of Data File

SSTables compression allows to optionally compress the SSTables Data File, which is the biggest part of the SSTable (the other parts, like the index, cannot be compressed).

Because random-access read from the data file is important, Apache Cassandra implements chunked compression: The uncompressed file is divided into chunks of a configurable fixed size (usually 64 KB), and each of these chunks is compressed separately and written to the compressed data file, followed by a 4-byte checksum of the compressed chunk. The compressed chunks have different lengths, so we need to remember their offsets so we can seek to an arbitrary chunk containing our desired uncompressed data. This list of offsets is stored in a separate Compression Info File, described below.

The Compression Info File


The Compression Info File only exists if the data file is compressed. The Compression Info File specifies the compression parameters that the decompressor needs to know (such as the compression algorithm and chunk size), and the list of offsets of the compressed chunks in the compressed file:

struct short_string {
    be16 length;
    char value[length];
struct option {
    short_string key;
    short_string value;
struct compression_info {
    short_string compressor_name;
    be32 options_count;
    option options[options_count];
    be32 chunk_length;
    be64 data_length;
    be32 chunk_count;
    be64 chunk_offsets[chunk_count];

The compressor_name can be one of three strings supported by Apache Cassandra: “LZ4Compressor”, “SnappyCompressor” or “DeflateCompressor”. Cassandra defaults to LZ4Compressor, but a user could choose any one of the three.

The options may contain additional options for the decompressor, But usually no options are listed, and only one option exists: “crc_check_chance”, whose value is a floating-point string which defaults (if the option doesn’t appear) to “1.0”, and determines the probability that we verify the checksum of a compressed chunk we read.

chunk_length is the length of the chunks into which the original uncompressed data is divided before compression. The decompressor needs to know this chunk size as well, so when given a desired byte offset in the original uncompressed file, it can determine which chunk it needs to uncompress. The chunk length defaults to 65536 bytes, but can be any power of two.

data_length is the total length of the uncompressed data.

chunks_offsets is the list of offsets of the compressed chunks inside the compressed file. To read data in position p in the uncompressed file, we calculates p / chunk_length is the uncompressed chunk number. The compressed version of this check is now begins in chunk_offsets[p / chunk_length]. The compressed chunk ends 4 bytes before the beginning of the next chunk (because, as explained above, every compressed chunk is followed by a 4-byte checksum).

The Compressed Data File

As explained above, the compressed data file is a sequence of compressed chunked, each a compressed version of a fixed-sized (chunk_length from the Compression Info File) uncompressed chunk. Each compressed chunk is followed by a be32 (big-endian 4-byte integer) Adler32 checksum of the compressed data, which can be used to verify the data hasn’t been corrupted

Scylla Architecture


© 2016, The Apache Software Foundation.

Apache®, Apache Cassandra®, Cassandra®, the Apache feather logo and the Apache Cassandra® Eye logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.