Data Modelling
Last updated
Last updated
In Scylla, as opposed to relational databases, the data model is based around the queries and not just around the domain entities. When creating the data model, we take into account both the conceptual data model and the application workflow: which queries will be performed by which users and how often.
Even data distribution: data should be evenly spread across the cluster so that every node holds roughly the same amount of data. Scylla determines which node should store the data based on hashing the partition key. Therefore, choosing a suitable partition key is crucial. More on this later on.
To minimize the number of partitions accessed in a read query: To make reads faster, we’d ideally have all the data required in a read query stored in a single Table. Although it’s fine to duplicate data across tables, in terms of performance, it’s better if the data needed for a read query is in one table.
Avoiding data duplication: To get efficient reads, we sometimes have to duplicate data.
Minimizing the number of writes: writes in Scylla DB aren’t free, but they are very efficient and “cheap.” Scylla DB is optimized for high write throughput. Reads, while still very fast, are usually more expensive than writes and are harder to fine-tune.
We’d usually be ready to increase the number of writes to increase read efficiency. Keep in mind that the number of tables also affects consistency.
ScyllaDB data modeling is query-based. That is, we think of the application workflow and the queries early on in the data model process
A Keyspace is a top-level container that stores tables
A Table is how ScyllaDB stores data and can be thought of as a set of rows and columns
The Primary Key is composed of the Partition Key and Clustering Key
One of our goals in data modeling is even data distribution. For that, we need to select a partition key correctly
Selecting the Primary Key is very important and has a huge impact on query performance
composite primary key = partition key + clustering key
The Partition Key is responsible for data distribution across the nodes. It determines which node will store a given row. It can be one or more columns. Without this, query wont execute
The Clustering Key is responsible for sorting the rows within the partition. It can be zero or more column
To query without the partition key, cluster would have to perform full scan which is super expensive.
To run this query : (wont work)
we partition and clustering key
This also eliminates the problem of storing just one value for a primary key. Partition Key can be more than one column.
If we have multiple partition keys
this query wont work, cuz we need both the partition keys
----------------------------
it is possible to define (do this):
If there is more than one column in the Clustering Key (pet_name and heart_rate in the example above), the order of these columns defines the clustering order. For a given partition, all the rows are physically ordered inside ScyllaDB by the clustering order. This order determines what select query you can efficiently run on this partition.
In this example, the ordering is first by pet_name and then by heart_rate.
In addition to the Partition Key columns, a query may include the Clustering Key. If it does include the Clustering Key columns they must be used in the same order as they were defined.
It fails, as pet_name
comes before heart_rate
sin the clustering key.
By default, sorting is based on the natural (ASC) order of the clustering columns.
In order to get query in DESC order, we would need to define the table in the same way
We don't need partitions size to grow too big, so we can break them by using composite partition key. For example
In this case, we would be able to query by pet_chip, for a given day, without having large partitions. The partition size will be limited by the day. Every day, a new partition will be created for each pet.
1
Query-based: Application -> Data -> Model
Entity-based: Data -> Model -> Application
4
Denormalization
Support for foreign-keys, Joins
5
CAP Theorem, Eventual Consistency
ACID Guarantee
6
Distributed Architecture
Mostly single point of failure