Ceph Internals and Integration with OpenStack

Ceph is a software-defined storage system. It is open source and provides unified storage that is highly scalable and has no single point of failure. Because it is open, scalable and distributed, Ceph is becoming one of the best storage solutions for cloud computing: every component in Ceph is scalable, it runs on readily available commodity hardware, and it is largely self-managing. Ceph provides block, object and file storage, so users can access storage in whichever form they need. Internally, Ceph stores all data as objects. Object storage is independent of hardware and platform, and objects are not tied to a physical path, which makes them location independent. This is what enables Ceph to scale to the exabyte level.

In a cloud environment, storage must scale up and down easily and at low cost, and it must integrate easily with the other components of OpenStack. Traditional storage systems are too expensive to scale this way and cannot provide what cloud technologies are looking for: a storage system that meets both current and future storage needs. OpenStack has two storage components, Swift and Cinder. Swift provides object storage and Cinder provides block storage to VMs. Ceph can provide both object and block storage to OpenStack, giving it a unified, low-cost and scalable storage solution.

CEPH STORAGE ARCHITECTURE

[Figure: Ceph architecture]

The diagram shows the architecture of the Ceph storage system. It is made up of several software daemons, each with its own function. These daemons are independent of one another. We will look at each component in detail.


RADOS stands for Reliable Autonomic Distributed Object Store. RADOS is the base layer for all operations in a Ceph cluster and is responsible for all of Ceph's features. When a write request is made to the Ceph storage cluster, the CRUSH algorithm calculates where the data should be written. The RADOS layer uses this information to store the data on the appropriate OSDs, distributing it across the nodes in the cluster. RADOS replicates objects and stores the copies across different failure zones, providing data reliability according to the configured replication factor. RADOS stores data in the form of objects in pools. The following figure shows the components of RADOS, which consists of OSDs and monitors.



An OSD (Object Storage Daemon) is the data storage component of a Ceph cluster. OSDs store data in the form of objects on physical disk drives, and a Ceph cluster has many of them. Each object has one primary copy and several secondary copies, which are scattered across cluster nodes to make them highly available and fault tolerant. The OSDs holding the secondary copies are under the control of the OSD holding the primary copy. When a disk fails, a secondary OSD is promoted to primary, and new secondary copies are created during the recovery operation.

When a client wants to store or retrieve data, it requests the cluster map from the monitor nodes and then interacts directly with the OSDs, which makes data transactions fast. OSDs also send heartbeats to each other. An OSD checks the heartbeat of its neighbouring OSDs every 6 seconds. If a neighbouring OSD does not show a heartbeat within 20 seconds, it is considered down and reported to the Ceph monitor node. The monitor acknowledges that an OSD is down after receiving 3 such notifications from the failed OSD's neighbours. It then issues an updated cluster map marking that OSD as down, the secondary OSD becomes the primary, and the remaining OSDs re-replicate the data that was on the failed OSD.
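The failure-detection rule described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Monitor` class and `check_peers` function are hypothetical names made up for illustration, not Ceph code, and the constants simply restate the 20-second and 3-notification figures from the text.

```python
from dataclasses import dataclass, field

HEARTBEAT_GRACE = 20    # seconds of silence before a peer is reported (from the text)
MIN_DOWN_REPORTS = 3    # notifications the monitor waits for (from the text)


@dataclass
class Monitor:
    """Toy monitor: counts failure notifications and publishes new map epochs."""
    epoch: int = 1
    down: set = field(default_factory=set)
    reports: dict = field(default_factory=dict)   # OSD id -> notifications received

    def report_failure(self, target: int) -> None:
        if target in self.down:
            return
        self.reports[target] = self.reports.get(target, 0) + 1
        if self.reports[target] >= MIN_DOWN_REPORTS:
            self.down.add(target)
            self.epoch += 1   # an updated cluster map is issued


def check_peers(last_seen: dict, now: float, monitor: Monitor) -> None:
    """Run by an OSD on each heartbeat round: report peers silent past the grace period."""
    for peer, seen_at in last_seen.items():
        if now - seen_at > HEARTBEAT_GRACE:
            monitor.report_failure(peer)


monitor = Monitor()
last_seen = {4: 0.0, 5: 24.0}       # OSD 4 went quiet at t=0, OSD 5 is healthy
for now in (24.0, 30.0, 36.0):      # three heartbeat rounds, 6 seconds apart
    check_peers(last_seen, now, monitor)
print(monitor.down, monitor.epoch)  # {4} 2
```

Only after the third notification does the toy monitor mark OSD 4 down and bump the map epoch, which is the point at which clients and OSDs would fetch the new map.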

OSDs are configured on nodes whose file systems support journalling. The journal can be placed on a separate, faster disk to increase the performance of the Ceph cluster. The following figure shows the components of an OSD.



The Ceph monitor monitors the health of the entire Ceph cluster and maintains the master copy of the cluster map. It does not serve data to clients. Instead, clients and cluster nodes contact a monitor from time to time to get a recent copy of the cluster map. Monitor nodes also store cluster logs. The number of monitors in a cluster should be odd, so that a quorum can be maintained and a split-brain situation is prevented. The recommended number of monitors is three, and the minimum requirement is one. The cluster map consists of the monitor, OSD, PG, CRUSH and MDS maps.

Monitor map: holds information about the monitor nodes, including the cluster ID and each monitor's hostname, IP address and port number. It also records when the map was last changed.

OSD map: stores information about the OSDs, such as their count, state, weight and host information. It also holds pool information such as pool names, pool IDs, pool types and placement groups.

PG map: keeps information such as placement group IDs, object counts, the state of OSDs, time stamps and placement group versions.

CRUSH map: holds information about the storage devices in the cluster.

As mentioned earlier, monitors do not store data or serve it to clients. The function of a monitor node is to maintain the cluster map and provide it to the other nodes in the cluster and to clients.
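The advice to run an odd number of monitors follows from majority-quorum arithmetic, which a short Python sketch makes concrete. The function names here are illustrative only and are not part of any Ceph API.

```python
def quorum_size(monitors: int) -> int:
    """Smallest majority of the monitor set."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """Monitors that can fail while a majority still remains."""
    return monitors - quorum_size(monitors)


for n in range(1, 6):
    print(n, quorum_size(n), tolerated_failures(n))
```

Going from three monitors to four raises the quorum from two to three but still tolerates only one failure, so the extra even-numbered monitor adds cost without adding resilience.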


The CRUSH algorithm defines how a Ceph storage cluster works. CRUSH stands for Controlled Replication Under Scalable Hashing.

Before discussing the CRUSH algorithm, let's look at the traditional way of storing data. Traditional storage relies on a metadata lookup table. Metadata is data about data; it records information such as where the data is located. When new data is added, the metadata table is updated first, and only then is the data actually written to disk. When data is retrieved, the metadata table is searched to find the location of the file. This lookup slows down both storage and retrieval, and as the amount of data grows it becomes a bottleneck. Also, if the metadata is lost, all the data is lost with it. Ceph has no such metadata lookup mechanism, which improves the performance of a Ceph cluster. Instead, the CRUSH algorithm computes the location of an object, and it does so only when needed.

Now let us look at the CRUSH lookup mechanism.

All data is stored as objects in a Ceph cluster. For a read or write operation, the client first contacts a Ceph monitor to get a copy of the cluster map, which tells the client the state of the cluster. The data is then converted into objects, each identified by a pool name and an object ID. A pool contains a number of placement groups. The object ID is hashed and combined with the number of placement groups in the pool to determine the placement group in which the object will be stored. CRUSH then maps that placement group to a set of OSDs, the first of which is the primary OSD. With the OSD ID in hand, the client contacts the primary OSD directly to store or retrieve data. Once the data has been written to the primary OSD, that OSD works out the secondary OSDs for the placement group and copies the object to them. This ensures data replication and high availability. The following diagram shows the CRUSH lookup in detail.
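The idea of computing a location instead of looking it up can be illustrated with a simplified Python sketch. It is not the real CRUSH algorithm: it ignores the CRUSH hierarchy, weights and failure domains, uses SHA-256 as a stand-in for Ceph's own hash, and picks OSDs by a plain highest-score ranking. All function names are hypothetical.

```python
import hashlib


def stable_hash(*parts) -> int:
    """Deterministic 64-bit hash of the given parts (stand-in for Ceph's own hash)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    """Hash the object name and reduce it modulo the pool's placement group count."""
    return f"{pool_id}.{stable_hash(object_name) % pg_num:x}"


def pg_to_osds(pg_id: str, up_osds: list, size: int) -> list:
    """Rank OSDs by a per-PG score; the first is the primary, the rest hold replicas."""
    ranked = sorted(up_osds, key=lambda osd: stable_hash(pg_id, osd), reverse=True)
    return ranked[:size]


def locate(pool_id: int, object_name: str, pg_num: int, up_osds: list, size: int = 3):
    """Return (placement group id, ordered list of OSDs) for an object."""
    pg_id = object_to_pg(pool_id, object_name, pg_num)
    return pg_id, pg_to_osds(pg_id, up_osds, size)


osds = list(range(8))
pg, acting = locate(1, "vm-disk-0001", 128, osds)
print(pg, acting)

# If the primary disappears from the map, the old secondary is ranked first.
survivors = [osd for osd in osds if osd != acting[0]]
print(locate(1, "vm-disk-0001", 128, survivors))
```

Because every client runs the same calculation on the same cluster map, they all arrive at the same OSDs without consulting a central table.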


Other important concepts in Ceph are pools and placement groups.

A placement group (PG) is a logical collection of objects that are mapped to the same set of OSDs. Each placement group is replicated and distributed across more than one OSD in the cluster. PGs make it easy to manage huge amounts of replicated data: without them, each object would have to be managed individually rather than as part of a group, which would hurt the performance and speed of the system. The number of placement groups in a pool should be decided when configuring the Ceph cluster. The recommended number is 50 to 100 placement groups per OSD. The formula to calculate the number of placement groups in a cluster is

             (OSDs * 100)
Total PGs =  ------------
              pool size

Here, pool size is the replication size of the pool. The result obtained from the above equation is then rounded up to the nearest power of two.

A Ceph cluster stores multiple copies of each PG across various OSDs, and the OSDs that share a PG peer with one another to agree on its state. The group of OSDs responsible for a particular PG is known as its acting set. An acting set consists of a primary OSD, a secondary OSD, a tertiary OSD and so on. If the primary OSD goes down, the secondary OSD is promoted to primary. An OSD can be the primary for some PGs and a secondary for others. Thus PGs make it easy to manage operations in a cluster.
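The placement group formula given earlier, including the rounding step, can be worked through with a small Python helper. The function name and the example cluster sizes are made up for illustration.

```python
def total_pgs(osds: int, pool_size: int, pgs_per_osd: int = 100) -> int:
    """(OSDs * 100) / pool size, rounded up to the next power of two."""
    raw = (osds * pgs_per_osd) / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power


print(total_pgs(osds=9, pool_size=3))     # 300 rounds up to 512
print(total_pgs(osds=160, pool_size=3))   # 5333.3 rounds up to 8192
```

A value that is already a power of two is left unchanged by the rounding loop.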

A pool is a logical partition for storing objects. A pool has many placement groups, and the placement groups are mapped to OSDs. Pools provide several features to the Ceph cluster. A faster pool can be created on SSDs and used as a cache pool to increase the read speed of the cluster. Pools also support snapshots: we can take a snapshot of a pool and restore it when required. Access controls can also be set on a pool to provide security.

Together, the components discussed above make Ceph a reliable and efficient storage system. When a client writes data into a Ceph cluster, the data is written to the primary OSD. The primary OSD then copies the data to the secondary and tertiary OSDs, based on the replication level. After receiving acknowledgements from the secondary and tertiary OSDs, the primary OSD sends an acknowledgement to the client that the data has been written to the cluster. Ceph manages data consistently in this way and serves clients from these replicas in the event of failures.
The following figure shows a Ceph storage cluster. Pool A has a replication size of three, and Pool B and Pool C have a replication size of two. Objects inside the placement groups are mapped to OSDs based on their replication factor.
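The write path described above (primary first, then the replicas, then the acknowledgement to the client) can be sketched as a toy Python model. The `OSD` class and `client_write` function are hypothetical and keep objects in memory; real OSDs do this over the network with journalling and recovery.

```python
class OSD:
    """Toy OSD that keeps objects in a dictionary."""

    def __init__(self, osd_id: int, up: bool = True):
        self.osd_id = osd_id
        self.up = up
        self.store = {}

    def write(self, name: str, data: bytes) -> bool:
        """Store the object and return an acknowledgement; a down OSD cannot acknowledge."""
        if not self.up:
            return False
        self.store[name] = data
        return True


def client_write(acting_set: list, name: str, data: bytes) -> bool:
    """Write to the primary, which replicates and acknowledges only after all replicas do."""
    primary, *replicas = acting_set
    if not primary.write(name, data):
        return False
    return all([replica.write(name, data) for replica in replicas])


acting = [OSD(0), OSD(1), OSD(2)]   # replication size of three
acked = client_write(acting, "obj-1", b"hello")
print(acked, [sorted(osd.store) for osd in acting])
```

In this sketch the client sees success only when every OSD in the acting set holds the object, which mirrors the acknowledgement order in the text.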


Now we will look at Ceph block storage and Ceph object storage, since OpenStack uses both of these Ceph components.


Ceph block storage is known as the RADOS Block Device (RBD). The Linux kernel includes a Ceph RBD driver, and RBD is also supported by QEMU and KVM. Ceph RBD can provide block device storage to hypervisor nodes and virtual machines. Clients map Ceph block devices through librados, and RADOS then stores and distributes the block data across the cluster. librados is a C library with a rich API that allows other applications to work with RADOS. After a Ceph block device is mapped to a client, it can be formatted with a file system and mounted, or used as a raw partition. OpenStack components such as Cinder and Glance use block devices: Cinder is the block storage service for OpenStack and Glance is its image service, and Ceph provides block devices for both. Ceph block storage has many advantages. It supports snapshots and cloning, which enables OpenStack to spin up hundreds of VMs in a short time. When data is written to a Ceph RBD, it is mapped into objects, which are replicated and stored across the cluster. The following figure shows the components of the Ceph storage cluster.

[Figure: Ceph RBD]
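How a write to a block device turns into writes to objects can be illustrated with a short Python sketch. It assumes the simplest layout, in which the image is cut into consecutive fixed-size objects of 4 MiB (RBD's default object size) with no additional striping. The function name is made up for illustration.

```python
OBJECT_SIZE = 4 * 1024 * 1024   # 4 MiB, assumed object size


def extents(offset: int, length: int, object_size: int = OBJECT_SIZE):
    """Yield (object_number, offset_in_object, byte_count) for a byte range of the image."""
    end = offset + length
    while offset < end:
        obj_no, obj_off = divmod(offset, object_size)
        count = min(object_size - obj_off, end - offset)
        yield obj_no, obj_off, count
        offset += count


MIB = 1024 * 1024
# A 6 MiB write starting 3 MiB into the image touches three objects.
for obj_no, obj_off, count in extents(3 * MIB, 6 * MIB):
    print(obj_no, obj_off // MIB, count // MIB)
```

Each of the resulting objects is then placed and replicated independently, which is how a single block device ends up spread across the cluster.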


The Ceph object gateway, also known as the RADOS gateway, provides object storage on top of the Ceph cluster. It acts as a proxy that converts HTTP requests to RADOS requests and vice versa. The radosgw daemon interacts with the cluster through the librgw library and librados. The Ceph object store supports three interfaces: an Amazon S3 compatible interface, an OpenStack Swift compatible interface, and an admin API, which is an HTTP RESTful API to access the Ceph cluster. Ceph object storage can be used by the object service of OpenStack, because Ceph provides a Swift-compatible API through which the object storage feature of Ceph can be accessed.
The following diagram shows the various interfaces of the Ceph object gateway.

[Figure: Ceph object gateway]


Cloud platforms like OpenStack require a storage system that is reliable, scalable, unified and distributed, and Ceph provides such a storage back end. Ceph integrates easily with OpenStack components like Cinder, Glance, Nova and Keystone. It provides low-cost storage, which helps keep the cost of an OpenStack deployment down. Another advantage is that Ceph is a unified storage solution: it provides file, object and block storage for OpenStack. Ceph block storage has capabilities such as thin provisioning, snapshots and cloning, which help to spin up VMs quickly and make backing up and cloning VMs easy. The copy-on-write mechanism of Ceph helps OpenStack spin up many instances at once, and Ceph can provide persistent boot volumes for OpenStack instances. Ceph also provides APIs for the Swift and S3 storage interfaces.

OpenStack has a modular architecture in which each module has a unique task, and some of these modules require a reliable storage back end. OpenStack services such as Cinder, Glance and Swift integrate with Ceph. Keystone can be integrated if we need S3-compatible object storage with a Ceph back end. Booting from a Ceph volume is handled through the Nova integration. The following diagram shows Ceph integration with OpenStack.



The Ceph block device is integrated into OpenStack through libvirt, which configures the QEMU interface to the librbd library. To use the Ceph block device in OpenStack, install QEMU, libvirt and OpenStack on the node. The OpenStack node must also be a Ceph client, so the Ceph client packages need to be installed on it as well. The following diagram shows the interaction of the Ceph block device with OpenStack.


OpenStack provides images for VMs through Glance and volumes through Cinder, and Ceph RBD integrates with both services. Glance stores its images in Ceph RBD, and Cinder uses Ceph block devices to boot VMs from those images.

Now let us look at integrating OpenStack and Ceph. As mentioned earlier, the first step is to install the Ceph client packages on the OpenStack nodes and make sure these nodes can access the Ceph cluster. Next, create pools in Ceph for OpenStack to use, one for Cinder volumes and one for Glance images.


# ceph osd pool create volumes 128

# ceph osd pool create images 128

The next step is to create Ceph users for these pools and then create keyring files for those users. Add the keyrings for client.cinder and client.glance to the appropriate nodes. The libvirt process needs access to Ceph when attaching or detaching a block device from Cinder. To allow this, the secret key of client.cinder is added to libvirt: create a temporary copy of the key on the nodes running nova-compute, add the key to libvirt on those nodes, and then remove the temporary copy. The following steps show this.

Generate a UUID using the command uuidgen. Then create a secret file with the following information.

cat > secret.xml <<EOF
<secret ephemeral='no' private='no'>
  <uuid>{UUID}</uuid>
  <usage type='ceph'>
    <name>client.cinder secret</name>
  </usage>
</secret>
EOF

Define the secret and keep the generated secret value safe using the following command.

sudo virsh secret-define --file secret.xml


Now we must configure Openstack to use Ceph.

Configuring Glance

OpenStack Glance can support multiple storage back ends. Edit the /etc/glance/glance-api.conf file and add the following variables within the [DEFAULT] section.





Then restart the glance api service.

Configuring OpenStack Cinder

To create, attach and delete volumes, we need to add the following lines to the Cinder configuration file.






Configuring OpenStack Nova

In order to boot OpenStack instances from volumes, we configure an ephemeral back end for Nova. Edit the Nova configuration file, nova.conf, and add the following lines in the nova.virt.libvirt.imagebackend section:




Add the following lines in the nova.virt.libvirt.volume section:


rbd_secret_uuid= UUID

After this, reload all the OpenStack services that were configured for Ceph.

Integrating the RADOS gateway with OpenStack

Integrating the RADOS gateway with Keystone, the OpenStack identity service, sets up the gateway so that the Ceph cluster automatically accepts and authorizes Keystone users. On the Ceph admin node, edit the ceph.conf file to include the following.

rgw keystone url = {keystone server url:keystone server admin port}
rgw keystone admin token = {keystone admin token}
rgw keystone accepted roles = {accepted user roles}
rgw keystone token cache size = {number of tokens to cache}
rgw keystone revocation interval = {number of seconds before checking revoked tickets}
rgw s3 auth use keystone = true
nss db path = {path to nss db}

This tells radosgw to accept Keystone as the authority for users. If a user that Keystone authorizes does not exist on the Ceph object gateway, the gateway creates the user automatically. A Ceph object gateway user is mapped to a tenant in Keystone. A user may have roles on more than one tenant, and the gateway accepts or rejects a user request according to the roles configured in rgw keystone accepted roles.
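The accept-or-reject decision described above can be modelled in a few lines of Python. This is a hypothetical stand-in for illustration, not radosgw code: the `ToyGateway` class, its `handle` method and the role and tenant names are all made up, and the sketch assumes a request is accepted when at least one of the user's roles is in the accepted list.

```python
class ToyGateway:
    """Hypothetical stand-in for radosgw's Keystone role check."""

    def __init__(self, accepted_roles):
        self.accepted_roles = set(accepted_roles)   # mirrors "rgw keystone accepted roles"
        self.users = {}                             # Keystone tenant -> gateway user

    def handle(self, tenant: str, roles: list) -> bool:
        """Accept the request if any of the user's roles on this tenant is accepted."""
        if self.accepted_roles.isdisjoint(roles):
            return False
        # The first accepted request for a tenant creates the matching gateway user.
        self.users.setdefault(tenant, {"tenant": tenant})
        return True


gateway = ToyGateway(accepted_roles=["admin", "Member"])
print(gateway.handle("project-a", ["Member"]))    # True
print(gateway.handle("project-b", ["auditor"]))   # False
print(sorted(gateway.users))                      # ['project-a']
```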

With the methods described above, we can integrate OpenStack with Ceph. Every cloud platform requires a robust, reliable, scalable and all-in-one storage solution that satisfies all of its workload requirements. In this blog we discussed Ceph and its components and how seamlessly they integrate with OpenStack.



Posted January 8, 2016 by John Mathew
