Scylla Secondary Indexes

Secondary indexes will be available as an experimental feature in Scylla 2.1. To enable experimental features in Scylla, add the line experimental: true to scylla.yaml.

The data model in Scylla partitions data between cluster nodes using a partition key, which is defined in the database schema. This is an efficient way to look up rows because you can find the node hosting the row by hashing the partition key.

However, this also means that finding a row by a non-partition-key column requires a full table scan, which is inefficient.

Secondary indexes are a mechanism in Scylla that allows efficient searches on non-partition-key columns by creating an index. In effect, they are indexes created on columns other than the primary (partition) key. (Note that you will be able to create an index on a clustering key in later versions.)

Secondary indexes provide the following advantages:

  1. Secondary Indexes are (mostly) transparent to your application. Queries have access to all the columns in the table, and you can add or remove indexes on the fly without changing the application.
  2. You can use the value of the indexed column to find the corresponding index table row in the cluster, so reads are scalable.
  3. Updates can be more efficient with secondary indexes than with materialized views because only changes to the primary key and the indexed column cause an update in the index view.

What’s more, the size of an index is proportional to the size of the indexed data. As data in Scylla is distributed to multiple nodes, it’s impractical to store the whole index on a single node, as it limits the size of the index to the capacity of a single node, not the capacity of the whole cluster.

For this reason, secondary indexes in Scylla are global rather than local. With global indexing, a materialized view is created for each index. This materialized view has the indexed column as its partition key and the primary key (partition key and clustering keys) of the indexed row as its clustering keys.
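
To make this layout concrete, the view behind an index can be sketched as an ordinary materialized view. This is only an illustration under assumptions: the keyspace ks, the table users, its columns, the replication settings, and the view name users_by_email are all hypothetical, and the view that Scylla creates for an index is managed internally and may be named and laid out differently.

```cql
-- Hypothetical keyspace and base table; all names are assumptions.
CREATE KEYSPACE IF NOT EXISTS ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS ks.users (
    user_id uuid PRIMARY KEY,
    email   text,
    country text
);

-- Roughly what a global index on email corresponds to:
-- the indexed column (email) is the partition key, and the base table's
-- primary key (user_id) follows as a clustering key.
CREATE MATERIALIZED VIEW IF NOT EXISTS ks.users_by_email AS
    SELECT email, user_id
    FROM ks.users
    WHERE email IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (email, user_id);
```

Because the view is partitioned by email, a lookup by email value hashes to a single partition, in the same way a lookup by user_id does on the base table.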

Because the index is global, you can use the value of the indexed column to find the corresponding index table row in the cluster, so reads are scalable. Note, however, that with this approach writes are slower than with local indexing because of the overhead required to keep the index view up to date.

How Secondary Index Queries Work

Scylla breaks indexed queries into two parts:

  1. a query on the index table to retrieve partition keys for the indexed table, and
  2. a query to the indexed table using the retrieved partition keys.

[Figure: ../../_images/secondary_index.png]

In the example above:

  1. The query arrives on node 7, which acts as the coordinator for the query.
  2. The node notices that the query is on an indexed column and issues a read to the index table on node 2, which holds the index table row for “user@example.com”.
  3. This read returns a set of user IDs, which are then used to retrieve the contents of the indexed table.
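
Written out by hand, the two reads the coordinator performs might look like the sketch below. It assumes a hypothetical base table ks.users keyed by user_id and a hypothetical index view ks.users_by_email partitioned by email; the UUIDs are placeholders. Scylla issues both reads internally, so an application never writes them itself.

```cql
-- Step 1: read the index view (hypothetical name) by the indexed value
-- to collect the primary keys of the matching base-table rows.
SELECT user_id FROM ks.users_by_email WHERE email = 'user@example.com';

-- Step 2: read the base table using the keys returned by step 1.
-- The UUIDs below are placeholders for those keys.
SELECT * FROM ks.users
WHERE user_id IN (123e4567-e89b-12d3-a456-426614174000,
                  123e4567-e89b-12d3-a456-426614174001);
```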

Usage

To enable secondary indexes in Scylla version 2.1, add experimental: true to scylla.yaml, then stop and restart the node. You can also start a Docker node using the --experimental 1 option.

Example

Given the following schema:
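
A minimal sketch of such a schema is shown here. Only the table name ks.users and the email and country columns come from the rest of this example; the user_id key column and the replication settings are assumptions chosen for illustration.

```cql
-- Replication settings are an assumption suitable for a single test node.
CREATE KEYSPACE IF NOT EXISTS ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- user_id is a hypothetical partition key; email and country are the
-- columns indexed later in this example.
CREATE TABLE IF NOT EXISTS ks.users (
    user_id uuid PRIMARY KEY,
    email   text,
    country text
);
```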

Let’s populate it with some test data:
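
For instance, assuming the hypothetical user_id, email, and country columns, a few rows could be inserted as follows. Apart from beassebyv@house.gov, which is used in the queries in this example, the email addresses and all country values are made up.

```cql
-- uuid() generates a random key for each row.
INSERT INTO ks.users (user_id, email, country)
    VALUES (uuid(), 'beassebyv@house.gov', 'US');
INSERT INTO ks.users (user_id, email, country)
    VALUES (uuid(), 'alice@example.com', 'FR');
INSERT INTO ks.users (user_id, email, country)
    VALUES (uuid(), 'bob@example.com', 'US');
```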

Note that if we try to filter on a column (in the WHERE clause) that isn’t part of the primary key of a Scylla table, we’ll see that this is not permitted. For example:

SELECT * FROM ks.users WHERE email = 'beassebyv@house.gov';

will result in an error.

Secondary indexes are designed to allow efficient querying of non-partition-key columns. We can create an index for each of our email and country columns by executing the following CQL statements:
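
One way to write these statements uses the standard CQL CREATE INDEX syntax. The index names are omitted here, so the database chooses them.

```cql
-- One index per column we want to query by.
CREATE INDEX ON ks.users (email);
CREATE INDEX ON ks.users (country);
```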

We can now query the indexed columns as if they were partition keys:
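
With both indexes in place, queries along these lines should be accepted. The country value 'US' is a made-up example value.

```cql
-- Look up a user by the indexed email column.
SELECT * FROM ks.users WHERE email = 'beassebyv@house.gov';

-- Look up all users with a given value in the indexed country column.
SELECT * FROM ks.users WHERE country = 'US';
```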

Note that you can use the DESCRIBE command to see the whole schema for the ks.users table, including created indexes and views:
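
In cqlsh, this could look like the following. The output is not shown because it depends on the schema and on the index names in your cluster.

```cql
-- cqlsh command; prints the table definition together with its indexes.
DESCRIBE TABLE ks.users;
```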