How to Install Cassandra Across Multiple Data Centers
Apache Cassandra is a distributed database designed for low-latency replication across geographically remote data centers. It allows users to define how many copies of the data to store in each data center, which in turn determines the level of resiliency. This guide provides a brief introduction to Cassandra and demonstrates how to install and configure the software. It also explains how to define a keyspace and add tables and data to Cassandra.
What is Cassandra?
Cassandra is an open source NoSQL database originally built by Facebook. NoSQL databases are an alternative to traditional relational database management systems (RDBMS). They model data in a simple and flexible style without the rigid tabular constraints of RDBMS applications. NoSQL applications are generally faster than RDBMS systems, with higher throughput. However, they often duplicate data and offer weaker consistency guarantees than RDBMS applications. NoSQL databases are designed for distributed deployments, where data and user sessions are spread across multiple servers. They use a horizontal scaling strategy, increasing the capacity of the system through the addition of new servers.
Due to its architecture and design, Cassandra minimizes latency and provides true real-time capabilities. Low latency is critical, as even small delays can dramatically decrease client satisfaction. Poor performance can also interfere with important tasks such as transaction processing and fraud detection. Cassandra is highly scalable, robust, and responsive. It quickly replicates data and continues to provide good performance even when handling a large number of user requests.
Cassandra is designed to run on multiple machines in a cluster-based architecture. The cluster can be topologically conceptualized as a ring of individual nodes. Although the nodes are independent of each other, they are also interconnected and share data. Using a hash function, the Cassandra replication process distributes records equally among the different nodes in the cluster. The cluster does not have a master node. Any system can handle read and write requests, and individual node failure does not affect the rest of the cluster. Cassandra can detect and correct data inconsistencies and automatically manage node additions and removals.
The nodes within a Cassandra cluster can be geographically distributed across multiple data centers and around the world. This brings the database closer to the end user, reducing routing delays. Users in Europe can access a data center in London, while American customers use the Chicago data center. Because both data centers are part of the same cluster, the entries are automatically integrated, reconciled, and replicated.
Each Cassandra cluster includes one or more data centers, and each data center contains multiple racks. A rack can include multiple servers, each with its own IP address, that typically share a common failure domain such as a power source or network switch. Within a data center, racks might be physically separated, but they are generally in the same vicinity. Some architectures define a logical data center, consisting of two or more data centers or a single geographically dispersed data center.
Data Management in Cassandra
Cassandra uses a lightweight variant of SQL called the Cassandra Query Language (CQL). Like most NoSQL languages, CQL is simpler and easier to use than SQL. The Cassandra CQL shell can be accessed using the cqlsh command. The base-level CQL object is a keyspace, which groups related data together. A keyspace is similar to a database in an RDBMS application. It defines the replication details for all of its tables. A Cassandra table differs somewhat from its RDBMS equivalent. It stores a list of key-value pairs (which can be nested many levels deep) and is typically not normalized.
Cassandra requires a different approach to data modeling compared to an RDBMS application. To efficiently model tables and columns, follow these principles, illustrated in the sketch after this list:
- Map each Cassandra table to one or more queries.
- Inside the table, add all columns required to address the queries.
- If necessary, include the same information inside different tables, even though this duplicates data. Rapid query processing takes precedence over data normalization.
- In every table, include a primary key to uniquely identify the entry. This key enables replication, allowing Cassandra to partition the data across multiple nodes.
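As a brief illustration of these principles, consider two hypothetical tables. The shop keyspace and all table and column names here are invented for this example and are not part of the schema built later in this guide. Both tables store the same order data, but each is keyed for a different query:

CREATE TABLE IF NOT EXISTS shop.orders_by_user (
    userid text,
    order_id text,
    order_total int,
    PRIMARY KEY (userid, order_id)
);

-- The same data again, deliberately duplicated so date-based
-- queries can be served from a single partition
CREATE TABLE IF NOT EXISTS shop.orders_by_date (
    order_date date,
    order_id text,
    userid text,
    order_total int,
    PRIMARY KEY (order_date, order_id)
);

The first table answers "show all orders for a user" and the second answers "show all orders placed on a date". Each query reads from a single partition, at the cost of writing every order twice.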
Cassandra is a complex application with many options and parameters. It is important to have a basic understanding of its architecture before using it in production. For more information, review the Cassandra Documentation. Consult the Data Modelling Guidelines before proceeding with the database design.
Advantages and Disadvantages of Cassandra
In addition to data distribution and replication, there are many advantages and use cases for Cassandra:
- It can gracefully handle traffic spikes.
- It features highly flexible data management.
- It is an excellent choice whenever redundancy and enhanced reliability are required.
- It greatly reduces latency across different geographical regions.
- It can assist with disaster recovery and unexpected outages of an entire data center.
- It is well suited to write-heavy workloads such as analytics and log collection.
- It efficiently handles sparse data, in which not all columns appear in all records.
Cassandra is usually not the best choice for small or little-used data sets, or if replication and high availability are not important. In these cases, it introduces unnecessary complexity and overhead. Cassandra does not support table joins, so it is not a good choice for heavily normalized data. Finally, it can be complex and difficult to learn, and it might require considerable tuning before being put into production.
Before You Begin
If you have not already done so, create a Linode account and Compute Instance. See our Getting Started with Linode and Creating a Compute Instance guides.
Follow our Setting Up and Securing a Compute Instance guide to update your system. You may also wish to set the timezone, configure your hostname, create a limited user account, and harden SSH access.
Each data center should have at least two nodes. Cassandra recommends at least 4GB of memory for each node.
This guide is written for a non-root user. Commands that require elevated privileges are prefixed with sudo. If you are not familiar with the sudo command, see the Users and Groups guide.

How to Install Cassandra
This guide is intended for Ubuntu 22.04 LTS users but is generally applicable for other Ubuntu releases and Linux distributions. All nodes in the cluster should use the same software release to avoid unexpected interoperability issues.
To install Cassandra, follow these steps. Unless otherwise specified, execute the following commands on every node in the Cassandra cluster.
Ensure the system is up to date. Reboot the system if necessary:
sudo apt update -y && sudo apt upgrade -y
Cassandra requires the use of a Java runtime. There are several different versions of Java to choose from, including OpenJDK and Oracle Java. This guide uses OpenJDK 11. Install OpenJDK using apt:

sudo apt install default-jre
Use the java -version command to confirm that OpenJDK 11 is installed:

java -version

openjdk version "11.0.19" 2023-04-18

Optional: Cassandra does not require the Java compiler or JDK, but many administrators choose to install it anyway. Optionally install the default JDK using the following command:

sudo apt install default-jdk
Cassandra supports several installation methods. This guide uses apt to install Cassandra. First, add the Cassandra repository to the list of packages. The following example adds the package for release 4.1. To install a different release, replace 41x with the actual major and minor release numbers.

Note: A Cassandra Docker image and a binary file installation are also available. For information on these approaches, see the Cassandra Install Documentation.

echo "deb https://debian.cassandra.apache.org 41x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
deb https://debian.cassandra.apache.org 41x main
Add the repository GPG keys to the list of trusted keys:
curl https://downloads.apache.org/cassandra/KEYS | sudo tee /etc/apt/trusted.gpg.d/cassandra.asc
Update the list of packages:
sudo apt-get update
Use apt to install the application:

sudo apt-get install cassandra
Confirm that the status of the Cassandra service is active:

sudo systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
     Loaded: loaded (/etc/init.d/cassandra; generated)
     Active: active (running) since Wed 2023-06-21 11:43:53 EDT; 41s ago
Confirm that the cqlsh command connects to the database and displays the cqlsh prompt.

Note: Cassandra takes about a minute to fully initialize. Before it is ready, it rejects any connection attempts.

cqlsh
[cqlsh 6.1.0 | Cassandra 4.1.2 | CQL spec 3.4.6 | Native protocol v5]
Use HELP for help.
cqlsh>
Use the exit command to quit the CQL shell:

exit
Repeat the steps in this section for each node in the cluster.
How to Configure Cassandra to Run in Multiple Data Centers
After all nodes in the cluster are operational, they can be configured together into the same group. For each node, determine its data center and rack name. Each rack name must be unique within the data center.
A Cassandra node derives its configuration from the cassandra.yaml file. Although this file is quite extensive, only a few attributes are required to add a node to a cluster. For more information on the different configuration options, see the cassandra.yaml configuration guide.
To fully configure a cluster, follow these steps.
Configure the ufw firewall on each node to allow SSH connections, open ports 7000, 9042, and 9160, and activate the firewall:

sudo ufw allow OpenSSH
sudo ufw allow 7000/tcp
sudo ufw allow 9042/tcp
sudo ufw allow 9160/tcp
sudo ufw enable
Rules updated
Rules updated (v6)
Rules updated
Rules updated (v6)
Rules updated
Rules updated (v6)
Rules updated
Rules updated (v6)
Command may disrupt existing ssh connections. Proceed with operation (y|n)? y
Firewall is active and enabled on system startup
Optional: For extra security, only allow connections from the other nodes in the cluster. The format for these commands is:

sudo ufw allow OpenSSH
sudo ufw allow from remote-IP to local-IP proto tcp port 7000
sudo ufw allow from remote-IP to local-IP proto tcp port 9042
sudo ufw allow from remote-IP to local-IP proto tcp port 9160
sudo ufw enable
Replace remote-IP with the IP address of one of the other nodes, and local-IP with the IP address of the current node. Add an entry for each of the other nodes in the cluster, changing remote-IP to the actual IP address of the remote node. Unfortunately, this can become cumbersome for large clusters, and it is easy to accidentally omit a connection.
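As a sketch, assume the current node is 192.0.2.10 and the only other node is 192.0.2.20 (both are documentation-range example addresses, not values from this guide). Using the canonical argument order from the ufw man page, the restricted rules might look like:

sudo ufw allow OpenSSH
sudo ufw allow proto tcp from 192.0.2.20 to 192.0.2.10 port 7000
sudo ufw allow proto tcp from 192.0.2.20 to 192.0.2.10 port 9042
sudo ufw allow proto tcp from 192.0.2.20 to 192.0.2.10 port 9160
sudo ufw enable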
Confirm the configuration:
sudo ufw status
Status: active

To                         Action      From
--                         ------      ----
OpenSSH                    ALLOW       Anywhere
9160/tcp                   ALLOW       Anywhere
7000/tcp                   ALLOW       Anywhere
9042/tcp                   ALLOW       Anywhere
OpenSSH (v6)               ALLOW       Anywhere (v6)
9160/tcp (v6)              ALLOW       Anywhere (v6)
7000/tcp (v6)              ALLOW       Anywhere (v6)
9042/tcp (v6)              ALLOW       Anywhere (v6)
Shut down all nodes in the cluster to avoid data corruption or connection problems. If one of the nodes is currently used in production, this action should be performed within a maintenance window.
sudo systemctl stop cassandra
Delete the application test data. This avoids any unnecessary data replication.
sudo rm -rf /var/lib/cassandra/*
Repeat the above steps for each node in the cluster.
Determine the architecture for the cluster.
- Choose a name for the entire cluster.
- Divide the nodes into data centers based on their proximity to each other.
- Decide on a meaningful name for each data center.
- Within each data center, supply a rack name for each system.

Finally, determine the seed order within each data center. Seed nodes manage cluster discovery and activation. Choose at least two seeds for the cluster, one of which should be the primary seed.
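As a concrete example, the configuration samples in the remainder of this guide assume a plan similar to the following, with NODE1_IP through NODE4_IP standing in for the actual node addresses:

- Cluster name: Main Cluster
- Data center london: node1 (NODE1_IP, primary seed, rack1) and node2 (NODE2_IP, secondary seed, rack2)
- Data center singapore: node3 (NODE3_IP, primary seed, rack1) and node4 (NODE4_IP, secondary seed, rack2)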
Note: For more information about seeds, see the Cassandra FAQ.

On the first node, edit the main Cassandra YAML file:
sudo nano /etc/cassandra/cassandra.yaml
Make the following changes:
- For the cluster_name attribute, enter the cluster name. This entry must be the same for every node in the cluster.
- Inside the parameters record in the seed_provider attribute, add a comma-separated list of seeds to the seeds variable. Enter the primary seed for the local data center first, followed by the other seed nodes in the data center. Then append any seeds from the other data centers. For example, data center dc1 might have the primary seed node1 and the secondary seed node2. Another data center in the cluster has node3 and node4 as its seeds. For this cluster, the value of seeds should be node1_ip, node2_ip, node3_ip, node4_ip.
- The listen_address field must contain the IP address of the system. For additional security, use the private IP address, if one is configured.
- The rpc_address can be changed to the 127.0.0.1 loopback address. If the server hostname is configured, it can be left as localhost.
- Set the endpoint_snitch field to GossipingPropertyFileSnitch.
The following file sample provides a template for the file changes. Follow the system architecture defined earlier and replace placeholder values (such as NODE1_IP) with the IP addresses corresponding to the associated node. Leave the remainder of the file unchanged.

File: /etc/cassandra/cassandra.yaml

cluster_name: 'Main Cluster'
...
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "NODE1_IP, NODE2_IP, NODE3_IP, NODE4_IP"
...
listen_address: NODE1_IP
...
rpc_address: 127.0.0.1
...
endpoint_snitch: GossipingPropertyFileSnitch
Note: Ensure start_native_transport is set to true and native_transport_port is 9042. Depending on the Cassandra release, these values might already be set correctly.

When done, press CTRL+X, followed by Y then Enter to save the file and exit nano.
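After saving, a quick grep (an optional convenience, not part of the original procedure) can spot-check the edits without reopening the editor:

grep -E '^(cluster_name|listen_address|rpc_address|endpoint_snitch):' /etc/cassandra/cassandra.yaml
grep 'seeds:' /etc/cassandra/cassandra.yaml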
On the same node, edit the /etc/cassandra/cassandra-rackdc.properties file:

sudo nano /etc/cassandra/cassandra-rackdc.properties
Specify the data center and rack name of the system. The following example configures a node inside the london data center with the rack name rack1:

File: /etc/cassandra/cassandra-rackdc.properties

dc=london
rack=rack1
Configure cassandra.yaml and cassandra-rackdc.properties on the remaining nodes in the first data center.

- Start with the cassandra.yaml changes. Each system must have the same value for the cluster_name attribute. seeds must have the same value on each node in the same data center. The node addresses must appear in the same order each time.
- Set listen_address to the IP address of the system.
- For cassandra-rackdc.properties, dc remains the same for all nodes in the data center. But each system must have its own unique rack name.
The following example illustrates how to configure the two files on a second node in the london data center.

File: /etc/cassandra/cassandra.yaml

cluster_name: 'Main Cluster'
...
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "NODE1_IP, NODE2_IP, NODE3_IP, NODE4_IP"
...
listen_address: NODE2_IP
...
rpc_address: 127.0.0.1
...
endpoint_snitch: GossipingPropertyFileSnitch

File: /etc/cassandra/cassandra-rackdc.properties

dc=london
rack=rack2
Configure each additional node in the first data center in this manner, changing the value of listen_address as required.

Now configure the nodes in the second data center.
- In /etc/cassandra/cassandra.yaml, the cluster_name must be the same for all nodes in all data centers.
- The seeds attribute must list the seed nodes for the local data center first, then the seeds for the remote centers.
- Change listen_address to the system IP address.
- In /etc/cassandra/cassandra-rackdc.properties, change the value of dc to the name of the second data center.
- Ensure each rack is unique within the center.
The following example applies to a node in the singapore data center.

File: /etc/cassandra/cassandra.yaml

cluster_name: 'Main Cluster'
...
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "NODE3_IP, NODE4_IP, NODE1_IP, NODE2_IP"
...
listen_address: NODE3_IP
...
rpc_address: 127.0.0.1
...
endpoint_snitch: GossipingPropertyFileSnitch

File: /etc/cassandra/cassandra-rackdc.properties

dc=singapore
rack=rack1
How to Activate a Cassandra Cluster
Nodes must be brought online in a certain order when activating the Cassandra cluster. Follow these steps to properly activate the cluster.
Restart Cassandra on the primary seed in one of the data centers. This is the node listed first in seeds.

sudo systemctl start cassandra
Ensure the Cassandra service is active:

sudo systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
     Loaded: loaded (/etc/init.d/cassandra; generated)
     Active: active (running) since Wed 2023-06-21 14:05:57 EDT; 19s ago
Restart the primary seed nodes in all remaining data centers. Wait for cassandra to become active.

Restart all remaining nodes in the cluster, then wait two or three minutes to allow all systems to synchronize.
Confirm the status of the cluster:
sudo nodetool status
Each node appears in the output. The Status/State of each node should be UN, which stands for Up and Normal:

Datacenter: london
==================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.5   132.68 KiB  16      47.9%             e6905cf5-5a97-447a-b57f-f22f9613510e  rack1
UN  192.168.1.15  25.56 KiB   16      51.6%             672f85de-3eee-4971-b981-f6dd2c844f52  rack2

Datacenter: singapore
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.2.10  132.7 KiB   16      49.5%             c8a9accb-7df7-41ed-8062-7eba46faaa10  rack2
UN  192.168.2.20  137.83 KiB  16      51.0%             8dd52e5b-4fcb-463f-9c2a-b71158663385  rack1
Note: If a node is not Up and Normal, ensure the cassandra service is stable. Verify the details in cassandra-rackdc.properties and confirm the node is part of the correct data center. The rack name must be unique within the data center. After changing any configuration files, stop and restart Cassandra.
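As an additional, optional check, nodetool can also confirm that all nodes agree on the cluster name and schema version. A divergent schema version usually indicates a node that has not fully joined:

sudo nodetool describecluster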
How to Add Tables and Data to Cassandra
Cassandra uses the CQL language to alter database contents. Cassandra can read data from a file, or users can add entries manually using the cqlsh utility. Create a keyspace before adding any tables or data. A keyspace defines the replication style, indicating how many times to replicate the data. Database tables are only meaningful within the context of their parent keyspace. For more information on CQL, consult the Cassandra CQL documentation.
To add a keyspace, table, and data to Cassandra, follow these steps.
Enter the CQL shell on one of the nodes.
cqlsh
Create a keyspace using the CREATE KEYSPACE statement and define the replication procedure.

- For a cluster containing multiple data centers, use NetworkTopologyStrategy for the class.
- Specify a replication factor for each data center in the cluster. This indicates how many copies of the data to store in the data center.
- The syntax for the statement is CREATE KEYSPACE IF NOT EXISTS keyspacename WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2, 'datacenter2' : 2 };.
The following example saves two copies of each table entry in the store keyspace to the london data center, and two to the singapore center.

CREATE KEYSPACE IF NOT EXISTS store WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'london' : 2, 'singapore' : 2 };
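To inspect the replication settings just defined, the cqlsh DESC command prints the full keyspace definition, including the per-data-center replication factors:

DESC KEYSPACE store;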
Confirm the keyspace is successfully created:
desc keyspaces;
The new store keyspace is listed alongside any existing keyspaces:

store   system_auth         system_schema  system_views
system  system_distributed  system_traces  system_virtual_schema
Add a table to the keyspace. The table name is indicated using the syntax keyspacename.tablename. The definition specifies the schema of the columns, including the names and data types. Each table in a Cassandra keyspace must have a primary key. This key is used to partition the table entries.

CREATE TABLE IF NOT EXISTS store.shopping_cart (
    userid text PRIMARY KEY,
    item_count int
);
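To review the resulting schema, DESC also works on individual tables:

DESC TABLE store.shopping_cart;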
Use the INSERT command to add an entry to the table.

INSERT INTO store.shopping_cart (userid, item_count) VALUES ('59', 12);
Use the SELECT * FROM command to view all data in the table.

SELECT * FROM store.shopping_cart;
 userid | item_count
--------+------------
     59 |         12

(1 rows)
To confirm the data has been replicated correctly, access the CQL shell on another node. Run the same SELECT command and ensure the same data is displayed. Validate all data centers to ensure Cassandra correctly adheres to the keyspace replication factor.

Note: Cassandra creates the minimum number of copies to satisfy the keyspace requirements. If the replication factor for a data center is two, only two nodes in the center receive the table data. Depending on the network latency, it might take a second or so for data to appear in a distant data center. This is especially likely for batch loads of data from a file.

SELECT * FROM store.shopping_cart;

 userid | item_count
--------+------------
     59 |         12

(1 rows)
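Optionally, the validation read can be made stricter. The cqlsh CONSISTENCY command sets the consistency level for subsequent queries in the session; LOCAL_QUORUM, shown below as one reasonable choice, requires a majority of the replicas in the local data center to acknowledge the read:

CONSISTENCY LOCAL_QUORUM;
SELECT * FROM store.shopping_cart;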
Conclusion
Apache Cassandra is a distributed database designed for low latency, high throughput, and high redundancy. It can replicate data across different data centers, with each center containing locally proximate nodes. Cassandra can be installed using either apt or Docker, and configured using several YAML and text files. To add data to Cassandra, first define a keyspace to indicate how to replicate data within the cluster. Then add tables to the keyspace and add data to the tables. For more information on Cassandra, see the Cassandra Documentation.