Anatomy of Ceph Storage — Solution that fits all pockets

Vishal Raj
6 min read · Jan 5, 2021


What is it all about?

Do you have these obvious questions in mind: “What hardware should I select for my Ceph storage cluster?” or “Why should Ceph be the persistent storage backbone of my Kubernetes cluster?”… and so on. You are looking in the right place. Let’s explore the anatomy of Ceph together.

What is Ceph?

From commodity to enterprise-grade hardware, Ceph is a storage solution that fits all pockets. Ceph runs on commodity hardware, oh yeah, everyone knows that by now. It is designed to build multi-petabyte storage clusters while providing enterprise-ready features: no single point of failure, scaling to exabytes, self-managing and self-healing (which saves operational cost), and commodity hardware with no vendor lock-in (which saves capital investment).

Components involved

Let’s dive into the broader view of a Ceph cluster.

  • Ceph Monitors (MON) are responsible for forming cluster quorums. All the cluster nodes report to monitor nodes and share information about every change in their state.
  • Ceph Object Store Devices (OSD) daemons store data and handle data replication, recovery, backfilling, and rebalancing. They also provide some cluster state information to Ceph Monitors by checking other Ceph OSD daemons with a heartbeat mechanism. A Ceph storage cluster configured to keep three replicas of every object requires a minimum of three Ceph OSD daemons, two of which need to be operational to successfully process write requests. A Ceph OSD daemon roughly corresponds to a file system on a physical hard disk drive. Ceph clients can interact with OSDs directly (a minimal librados sketch follows this list).
  • Ceph Manager (MGR) keeps track of runtime metrics and the current state of the Ceph cluster, including storage utilization, current performance metrics, and system load.
  • Ceph Metadata Servers (MDS) store metadata on behalf of the Ceph File System (i.e., Ceph Block Devices and Ceph Object Storage do not use MDS). Ceph Metadata Servers allow POSIX file system users to execute basic commands (like ls, find, etc.) without placing an enormous burden on the Ceph Storage Cluster.
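
To make these roles concrete, here is a minimal sketch using the librados Python binding (the rados module from python3-rados). It assumes /etc/ceph/ceph.conf and a client keyring are already in place and that a pool called “mypool” exists; the object name is just an example. The connect() call talks to the Monitors to fetch the cluster map, while the object read/write goes straight to the OSDs.

    # Minimal sketch (assumptions: python3-rados installed, /etc/ceph/ceph.conf
    # and a keyring present, and a pool named "mypool" already created).
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()                                # contacts the Monitors, pulls the cluster map
    print("fsid:", cluster.get_fsid())
    print("stats:", cluster.get_cluster_stats())     # used/available space, object count

    ioctx = cluster.open_ioctx('mypool')             # I/O context bound to one pool
    ioctx.write_full('hello-object', b'hello ceph')  # the client writes to the primary OSD directly
    print(ioctx.read('hello-object'))

    ioctx.close()
    cluster.shutdown()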

Ceph: who would help me with FS storage?

You might be wondering where the Ceph Metadata Server (MDS) fits in. It is mostly relevant to the filesystem side of things in general, including journals and the like.

BlueStore interacts with a block device. Data is directly written to the raw block device and all metadata operations are managed by RocksDB.

RocksDB is an embedded high-performance key-value store that excels on flash storage. RocksDB can’t write to the raw disk device directly; it needs an underlying filesystem to store its persistent data, and this is where BlueFS comes in. RocksDB uses a write-ahead log (WAL) as a transaction log on persistent storage. Unlike FileStore, where all writes went through the journal first, in BlueStore we have two different data paths for writes: one where data is written directly to the block device, and one where we use deferred writes. With deferred writes, data gets written to the WAL device and is later asynchronously flushed to disk.

BlueFS is a filesystem developed with the minimal feature set RocksDB needs to store its SST files.
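
To make the two data paths a bit more tangible, here is a purely conceptual Python sketch (not Ceph code). The size threshold loosely mirrors the bluestore_prefer_deferred_size option; the helper callbacks and the 32 KiB value are illustrative assumptions, not Ceph defaults.

    # Conceptual sketch only -- not actual Ceph code. Roughly mirrors the idea
    # behind bluestore_prefer_deferred_size; threshold value and callbacks are
    # made up for illustration.
    PREFER_DEFERRED_SIZE = 32 * 1024   # illustrative threshold; real defaults vary per device class

    def handle_write(data: bytes, write_direct, append_to_wal):
        if len(data) >= PREFER_DEFERRED_SIZE:
            # Large write: data goes straight to the raw block device,
            # only the metadata update is committed through RocksDB.
            write_direct(data)
        else:
            # Small (over)write: deferred path. The data rides along in the
            # RocksDB WAL transaction and is flushed to the block device
            # asynchronously later.
            append_to_wal(data)

    handle_write(b'x' * 4096,
                 write_direct=lambda d: print("direct write:", len(d), "bytes"),
                 append_to_wal=lambda d: print("deferred via WAL:", len(d), "bytes"))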

10k view: Overall architecture

  • Reliable Autonomic Distributed Object Stores (RADOS) are at the core of Ceph storage clusters. This layer makes sure that stored data always remains consistent and performs data replication, failure detection, and recovery among others.
  • CephFS (Ceph File System) provides distributed POSIX NAS storage.
  • LIBRADOS provides direct access to RADOS with libraries for most programming languages, including C, C++, Java, Python, Ruby, and PHP.
  • RBD offers a Ceph block storage device that mounts like a physical storage drive for use by both physical and virtual systems (with a Linux kernel driver, KVM/QEMU storage backend, or user-space libraries). A short librbd sketch follows this list.
  • RADOSGW is a bucket-based object storage gateway service with S3-compatible and OpenStack Swift-compatible RESTful interfaces.
  • Placement groups. Ceph maps objects to placement groups (PGs). PGs are shards or fragments of a logical object pool that are composed of a group of Ceph OSD daemons that are in a peering relationship. Placement groups provide a means of creating replication or erasure coding groups of coarser granularity than on a per object basis. A larger number of placement groups (e.g., 200 per OSD or more) leads to better balancing.
  • CRUSH ruleset. The CRUSH algorithm provides controlled, scalable, and declustered placement of replicated or erasure-coded data within Ceph and determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly, rather than through a centralized server or broker. By determining a method of storing and retrieving data by algorithm, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to scalability.
  • Pools. A Ceph storage cluster stores data objects in logical dynamic partitions called pools. Pools can be created for particular data types, such as for block devices, object gateways, or simply to separate user groups. The Ceph pool configuration dictates the number of object replicas and the number of placement groups (PGs) in the pool. Ceph storage pools can be either replicated or erasure coded, as appropriate for the application and cost model. Additionally, pools can “take root” at any position in the CRUSH hierarchy, allowing placement on groups of servers with differing performance characteristics — allowing storage to be optimized for different workloads.
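
Since LIBRADOS ships Python bindings, the block layer is easy to poke at from code as well. Here is a minimal sketch with the rbd module (python3-rbd); it assumes a pool named “rbd” exists, and the image name and size are arbitrary examples.

    # Minimal RBD sketch (assumptions: python3-rbd installed, a pool named
    # "rbd" exists; image name and size are arbitrary examples).
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    rbd_inst = rbd.RBD()
    rbd_inst.create(ioctx, 'demo-image', 4 * 1024**3)    # 4 GiB thin-provisioned image

    with rbd.Image(ioctx, 'demo-image') as image:
        image.write(b'hello block device', 0)            # write at offset 0
        print(image.size(), image.read(0, 18))           # read the 18 bytes back

    ioctx.close()
    cluster.shutdown()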

Networking Involved

For a highly performant and fault-tolerant storage cluster, the network architecture is as important as the nodes running the Monitors and OSD daemons. The deployed network architecture must therefore have the capacity to handle the expected client bandwidth.

  • The public network is used by Ceph clients to read from and write to Ceph OSD nodes (see the illustrative ceph.conf snippet after this list).
  • The cluster network enables each Ceph OSD Daemon to check the heartbeat of other Ceph OSD Daemons, send status reports to monitors, replicate objects, rebalance the cluster and backfill and recover when system components fail.
  • The Ceph provisioning network is the OAM link for managing the VMs or physical machines that Ceph is provisioned on.
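
For reference, the split between the public and cluster networks normally lands in ceph.conf. The public_network and cluster_network option names are standard; the subnets below are placeholder assumptions, not a recommendation.

    # Illustrative only: the option names are standard ceph.conf settings,
    # the subnets are placeholders.
    CEPH_CONF_SNIPPET = """
    [global]
    public_network  = 10.0.10.0/24   # client <-> MON/OSD traffic
    cluster_network = 10.0.20.0/24   # OSD <-> OSD replication, backfill, recovery
    """
    print(CEPH_CONF_SNIPPET)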

Ooof..!!

Ooof..!! So much to read, losing interest already, haaa? But believe me, in this world of Kubernetes operators like Rook, which can provision everything for you in one click, reading a blog post is worth more than reading huge documentation just to understand the architecture.

Rook

Rook is an open source orchestrator that makes it easy to deploy distributed storage systems on top of Kubernetes.

Rook turns distributed storage software into self-scaling, self-managing, and self-healing storage services. It does this by automating deployment, bootstrapping, provisioning, scaling, upgrading, configuration, migration, disaster recovery, monitoring, and resource management.
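
As a taste of that “one click”, here is a sketch that creates a Rook CephCluster custom resource with the official kubernetes Python client. It assumes the Rook operator and its CRDs are already installed in the rook-ceph namespace; the spec is trimmed down and the image tag is only an example, so check the Rook documentation for the version you deploy.

    # Sketch: create a Rook CephCluster CR via the kubernetes Python client.
    # Assumes the Rook operator and CRDs are already installed in the
    # rook-ceph namespace; spec fields follow the Rook v1 CRD but are trimmed down.
    from kubernetes import client, config

    config.load_kube_config()              # or load_incluster_config() inside a pod
    api = client.CustomObjectsApi()

    ceph_cluster = {
        "apiVersion": "ceph.rook.io/v1",
        "kind": "CephCluster",
        "metadata": {"name": "rook-ceph", "namespace": "rook-ceph"},
        "spec": {
            "cephVersion": {"image": "quay.io/ceph/ceph:v16"},   # image tag is an example
            "dataDirHostPath": "/var/lib/rook",
            "mon": {"count": 3},
            "storage": {"useAllNodes": True, "useAllDevices": True},
        },
    }

    api.create_namespaced_custom_object(
        group="ceph.rook.io", version="v1", namespace="rook-ceph",
        plural="cephclusters", body=ceph_cluster,
    )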

Validating the statement “Solution that fits all pockets”

Three widely used storage types (block storage, object storage, and filesystem storage) map in Ceph to RBD, RADOSGW, and CephFS respectively… plus Kubernetes support… bingo, we now have a solution that fits all pockets.
