Scylla Ring Architecture - Overview

Scylla is a database that scales out and up. Scylla adopted much of its distributed scale-out design from the Apache Cassandra project (which adopted distribution concepts from Amazon Dynamo and data modeling concepts from Google BigTable).

In the world of big data, a single node cannot hold the entire dataset and thus, a cluster of nodes is needed.

A Scylla cluster is a collection of nodes, or Scylla instances, visualized as a ring. All of the nodes should be homogenous using a shared-nothing approach. This article describes the design that determines how data is distributed among the cluster members.

A Scylla keyspace is a collection of tables with attributes that define how data is replicated on nodes. A keyspace is analogous to a database in SQL. When a new keyspace is created, the user sets a numerical attribute, the replication factor (RF), that defines how data is replicated on nodes. For example, an RF of 2 means a given token or token range will be stored on 2 nodes (or replicated on one additional node). We will use an RF value of 2 in our examples.

A table is a standard collection of columns and rows, as defined by a schema. Subsequently, when a table is created, using CQL (Cassandra Query Language) within a keyspace, a primary key is defined out of a subset of the table’s columns.

The table in the diagram below can thus be defined in CQL as follows:

CREATE TABLE users (
     ID int,
     NAME text,
     ADDRESS text,
     PHONE text,
     PHONE_2 text,
     PRIMARY KEY (ID)
);

In a CQL table definition, the primary key clause specifies, at a minimum, a single column partition key and may also specify clustering key columns. The primary key uniquely identifies each partition/row combination in the table, while the clustering keys dictate how the data (rows) are sorted within a given partition. For more information, see our CQL documentation.

A row is a container for columns associated with a primary key. In other words, a primary key represents one or more columns needed to fetch data from a CQL table.

../../_images/ring-architecture-1.png

A token is a value in a range, used to identify both nodes and partitions. The partition key is the unique identifier for a partition, and represented as a token which is hashed from the primary key.

A partition is a subset of data that is stored on a node and replicated across nodes. There are two ways to consider a partition. In CQL, a partition appears as a group of sorted rows, and is the unit of access for queried data, given that most queries access a single partition. On the physical layer, a partition is a unit of data stored on a node and is identified by a partition key.

In the diagram above, there are 3 partitions shown, with the partition keys of 101, 102, and 103.

A partition key is the primary means of looking up a set of rows that comprise a partition. A partition key serves to identify the node in the cluster that stores a given partition, as well as to distribute data across nodes in a cluster.

The partitioner, or partition hash function, using a partition key, determines where data is stored on a given node in the cluster. It does this by computing a token for each partition key. By default, the partition key is hashed using the Murmur3 hashing function.

../../_images/ring-architecture-2.png

The hashed output of the partition key determines its placement within the cluster.

../../_images/ring-architecture-3.png

The figure above illustrates an example 0-1200 token range divided evenly amongst a three node cluster.

Scylla, by default, uses the Murmur3 partitioner. With the MurmurHash3 function, the 64-bit hash values (produced for the partition key) range from From to To. This explains why there are also negative values in our nodetool ring output below.

../../_images/ring-architecture-4.png

In the drawing above, each number represents a token range. With a replication factor of 2, we see that each node holds one range from the previous node, and one range from the next node.

Note, however, that Scylla exclusively uses a Vnode-oriented architecture. A Vnode represents a contiguous range of tokens owned by a single Scylla node. A physical node may be assigned multiple, non-contiguous Vnodes.

Scylla’s implementation of a Vnode oriented architecture provides several advantages. First of all, rebalancing a cluster is no longer required when adding or removing nodes. Secondly, as rebuilding can stream data from all available nodes (instead of just the nodes where data would reside on a one-token-per-node setup), Scylla can rebuild faster.

../../_images/ring-architecture-5.png

The proportion of Vnodes assigned to each node in a cluster is configurable in the num_tokens setting of scylla.yaml; the default is 256.

You can use the nodetool command to describe different aspects of your nodes, and the token ranges they store. For example,

$ nodetool ring <keyspace>

Outputs all tokens of a node, and displays the token ring information. It produces output as follows for a single datacenter:

Datacenter: datacenter1
=======================
Address     Rack        Status State   Load            Owns                Token
                                                                         9156964624790153490
172.17.0.2  rack1       Up     Normal  110.52 KB       66.28%            -9162506483786753398
172.17.0.3  rack1       Up     Normal  127.32 KB       66.69%            -9154241136797732852
172.17.0.4  rack1       Up     Normal  118.32 KB       67.04%            -9144708790311363712
172.17.0.4  rack1       Up     Normal  118.32 KB       67.04%            -9132191441817644689
172.17.0.3  rack1       Up     Normal  127.32 KB       66.69%            -9080806731732761568
172.17.0.3  rack1       Up     Normal  127.32 KB       66.69%            -9017721528639019717
...

Here we see that, for each token, it shows the address of the node, which rack it is on, the status (Up or Down), the state, the load, and the token. The Owns column shows the percentage of the ring (the keyspace) actually handled by that node.

$ nodetool describering <keyspace>

Shows the token ranges of a given keyspace. That output, on a three node cluster, looks like this:

Schema Version:082bce63-be30-3e6b-9858-4fb243ce409c
TokenRange:
TokenRange(start_token:9143256562457711404, end_token:9156964624790153490, endpoints:[172.17.0.4], rpc_endpoints:[172.17.0.4], endpoint_details:[EndpointDetails(host:172.17.0.4, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:9081892821497200625, end_token:9111351650740630104, endpoints:[172.17.0.4], rpc_endpoints:[172.17.0.4], endpoint_details:[EndpointDetails(host:172.17.0.4, datacenter:datacenter1, rack:rack1)])
...

We can also get information on our cluster with

$ nodetool describecluster

Cluster Information:
   Name: Test Cluster
   Snitch: org.apache.cassandra.locator.SimpleSnitch
   Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
   Schema versions:
      082bce63-be30-3e6b-9858-4fb243ce409c: [172.17.0.2, 172.17.0.3, 172.17.0.4]

Terms for Review

Keyspace

A collection of tables with attributes which define how data is replicated on nodes.

Table

A collection of columns fetched by row. Columns are ordered by Clustering Key.

Node

A single installed instance of Scylla.

Nodetool

A simple command-line interface or administering a Scylla node. A nodetool command can display a given node’s exposed operations and attributes. Scylla’s nodetool contains a subset of these operations.

Primary Key

In a CQL table definition, the primary key clause specifies the partition key and optional clustering key. These keys uniquely identify each partition and row within a partition.

Partition Key

The unique identifier for a partition, a partition key may be hashed from the first column in the primary key. A partition key may also be hashed from a set of columns, often referred to as a compound primary key. A partition key determines which virtual node gets the first partition replica.

Clustering Key

A single or multi-column clustering key determines a row’s uniqueness and sort order on disk within a partition.

Partitioner

A hash function for computing which data is stored on which node in the cluster. The partitioner takes a partition key as an input, and returns a ring token as an output.

By default Scylla uses the 64 bit Murmurhash3 function and this hash range is numerically represented as an unsigned 64bit integer, From to To.

Token

A value in a range, used to identify both nodes and partitions. Each node in a Scylla cluster is given an (initial) token, which defines the end of the range a node handles.

Token Range

The total range of potential unique identifiers supported by the partitioner.

By default, each Scylla node in the cluster handles 256 token ranges. Each token range corresponds to a Vnode. Each range of hashes in turn is a segment of the total range of a given hash function.

Virtual Node (Vnode)

A range of tokens owned by a single Scylla node. Scylla nodes are configurable and support a set of Vnodes - or ‘Virtual Nodes’.

In legacy token selection, a node owns one token (or token range) per node. With Vnodes, a node can own many tokens or token ranges; within a cluster, these may be selected randomly from a non-contiguous set.

In a Vnode configuration, each token falls within a specific token range which in turn is represented as a vnode. Each vnode is then allocated to a physical node in the cluster.

Cluster

One or multiple Scylla nodes, acting in concert, which own a single contiguous token range. State is communicated between nodes in the cluster via the Gossip protocol.