Distributed Storage Systems
Introduction
Before Learning
The following lists some background knowledge that, in my view, you should have beforehand.
- Preliminaries
- Principles of Computer Organization: memory and computer architecture (e.g. the von Neumann architecture)
- Principles of Database
- Computer Networks: we will touch on network-attached storage
- Operating Systems: especially the Linux file system and memory scheduling
Brief History of DSS
- DAS (Direct-Attached Storage): storage attached directly to the server's internal bus via cables (e.g. SCSI), one storage device per host >> not shared, poor scalability
- FAS (Fabric-Attached Storage):
- NAS (Network-Attached Storage): shared network storage providing data storage service >> I/O bottleneck
- SAN (Storage Area Network): data accessed through dedicated Fibre Channel (optical fiber) switches, with protocols such as iSCSI and FC
Issues: 1. performance is bound to the controller, so scalability falls off at PB-level data volumes; 2. migrating data from old devices is time-consuming.
Distributed Storage
Def. Distributed storage spreads data across multiple storage servers, which together form one virtual storage device.
Benefits
- improve system reliability
- accelerate I/O efficiency
- better scalability
- support hierarchical storage
Metrics to measure storage performance
- Bandwidth: the theoretical maximum data transfer rate
- Delay (latency): the time to complete a single operation
- Throughput: the actual amount of data transferred per second
- IOPS (Input/Output Operations Per Second)
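These metrics are related: for a given I/O size, throughput is roughly IOPS multiplied by the block size. A minimal sketch of that arithmetic (the numbers are illustrative, not measurements of any real device):

```python
def throughput_mb_s(iops: float, block_size_kb: float) -> float:
    """Approximate throughput in MB/s as IOPS x I/O size."""
    return iops * block_size_kb / 1024

# A 4 KB random workload at 100,000 IOPS moves ~390 MB/s,
# while a 1 MB sequential workload needs only 400 IOPS for a similar rate.
print(throughput_mb_s(100_000, 4))   # 390.625
print(throughput_mb_s(400, 1024))    # 400.0
```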
Measurement Tools
- FIO (Flexible I/O Tester): simulates I/O workloads for different scenarios, supporting random read/write, asynchronous I/O, etc.
Software Defined Systems
- Stems from the Google File System (Google, 2003)
- Core idea: use software alone to organize commodity x86 machines into a distributed storage system
- Separate metadata and data, where the metadata service is the core component that manages a large number of chunk servers (nodes)
Two SDS Architectures: Centralized vs. Distributed
Similarity
- Both aim to distribute data evenly across all disks
- of course, a distributed system spreads its disks across multiple nodes (chunk servers), while a centralized one keeps them in one place
Difference
| | Centralized Storage | Distributed Storage |
|---|---|---|
| Protocol | block, file | block, file, object |
| Scenarios | database, virtualization | database, virtualization, big data, large amounts of unstructured data |
| Scalability | controllers: 2-16, SSDs: 1-12 | up to 1024 nodes |
| Performance | low latency | high concurrency |
| Reliability | RAID + BBU (battery backup unit) | multiple replicas |
| Placement algorithm | striping / pseudo-random | pseudo-random (CRUSH algorithm) |
| Applications (instances) | traditional SAN/NAS arrays | Google GFS, HDFS, Ceph |
Some of the concepts mentioned above:
Block storage, file storage and object storage
| | Block storage | File storage |
|---|---|---|
| Principle | divide the disk into blocks, format them into a LUN (logical unit number), and mount it to the server OS as a disk | store data in a hierarchical directory tree, accessed by file path |
| Protocols | iSCSI, FC (Fibre Channel) | NFS (Network File System) |
| Pros | low latency, high IOPS | simultaneous access by multiple users |
- Object storage: combines the advantages of block and file storage
- uses key-value pairs for management, making data easy to locate and operate on (see the sketch after this subsection)
- has no directory hierarchy, resulting in high retrieval performance
Key difference: they serve different clients
Block storage: systems that read and write at the block-device level
File storage: humans
Object storage: other software
Recall DAS, NAS, and SAN: DAS and SAN use block storage, while NAS uses file storage.
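To make the key-value point concrete, here is a toy sketch of a flat object-store interface: every object is addressed by a single key and there is no directory tree to traverse. The class and method names are hypothetical, not any real product's API.

```python
class ToyObjectStore:
    """A flat key-value object store: no directory hierarchy, one lookup per key."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}   # key -> object data

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data               # create or overwrite in one step

    def get(self, key: str) -> bytes:
        return self._objects[key]               # direct lookup, no path resolution


store = ToyObjectStore()
store.put("photos/2024/cat.jpg", b"...")        # the "/" is just part of the key
print(store.get("photos/2024/cat.jpg"))
```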
Strip algorithm & CRUSH algorithm
Striping: divide data into strips and store them on different disks (sketched below). The data appears contiguous logically, but physically it is not.
- Concept
- Strip length: describes how many blocks are stored in one strip
- Strip depth: describes how many strips are stored in one disk
- Advantage:
- high performance: parallel read/write
- high capacity: data can be stored in multiple disks
- high reliability: applying RAID to protect data
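A minimal sketch of striping under the terms above: the data is cut into fixed-size strip units and written round-robin across the disks, so logically contiguous data lands on physically different disks. The function and parameters are illustrative only.

```python
def stripe(data: bytes, num_disks: int, unit_size: int) -> list[list[bytes]]:
    """Split data into unit_size strip units and assign them round-robin to disks."""
    disks: list[list[bytes]] = [[] for _ in range(num_disks)]
    units = [data[i:i + unit_size] for i in range(0, len(data), unit_size)]
    for idx, unit in enumerate(units):
        disks[idx % num_disks].append(unit)     # unit i goes to disk i mod N
    return disks


layout = stripe(b"ABCDEFGHIJKL", num_disks=3, unit_size=2)
# disk 0: [b"AB", b"GH"], disk 1: [b"CD", b"IJ"], disk 2: [b"EF", b"KL"]
print(layout)
```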
CRUSH(Controlled Replication Under Scalable Hashing): a pseudo-random algorithm to locate data in distributed storage system.
Since it is commonly used in Ceph, we will study it in detail in the Ceph section below.
Network Protocols of Distributed Block Storage
| Protocol | Description |
|---|---|
| iSCSI (Internet Small Computer System Interface) | SCSI is a protocol (and an interface) connecting a host and external devices; iSCSI carries SCSI commands at the application layer, encapsulating them over Ethernet and transferring them via TCP. iSCSI follows a client-server architecture. |
| NVMe-oF (NVMe over Fabrics) | Extension of the NVMe protocol (an intra-server protocol for SSDs) over the network |
| RDMA (Remote Direct Memory Access) | A technology that transfers data from the memory of one computer to the memory of another without involving either computer's CPU, cache, or operating system |
| RDMA + NVMe-oF | A popular combination that uses RDMA as the transport to accelerate NVMe over Fabrics |
Storage Cluster
- Cluster: a group of servers working together to provide a service, handling heavy traffic, high concurrency, and massive amounts of data.
- Two sub-classes: load-balancing clusters and distributed clusters
- they differ in how the multiple servers cooperate
Load Balance
- DNS Load Balance: configure multiple IPs for one domain name; the DNS server returns one of them (randomly, or by policy)
- HTTP Load Balance: compute the real web-server address from the HTTP request, write it into the response header, and return it to the browser (host)
- IP Load Balance: modify the target IP address at the network layer
- Link Load Balance: modify the target MAC address at the data link layer
- Hybrid Load Balance
Load Balance Algorithms
| Algorithm | Description |
|---|---|
| Round Robin | distribute requests evenly across all servers in rotation |
| Least Connections | send each request to the server with the fewest active connections |
| IP Hash | hash the client IP address to determine which server receives the request |
| Random | pick a server at random for each request |
| Weighted | any of the above, with per-server weights |
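A hedged sketch of two algorithms from the table, Round Robin and IP Hash; the backend addresses are hypothetical, and a real load balancer would also track server health and weights.

```python
import hashlib
from itertools import cycle

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical backends

# Round Robin: hand out servers in a fixed rotation.
_rotation = cycle(SERVERS)

def round_robin() -> str:
    return next(_rotation)

# IP Hash: the same client IP always maps to the same server,
# which preserves session affinity without shared state.
def ip_hash(client_ip: str) -> str:
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

print([round_robin() for _ in range(4)])   # cycles through the three servers
print(ip_hash("192.168.1.42"))             # deterministic per client
```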
Distributed Cluster
- Def. Multiple servers, each with a different role, cooperate to provide a service.
- Applications:
- Distributed applications & services
- Deployment of distributed static resources
- Distributed data & storage
- Distributed computing
High Availability
- Def. A system that can continue to provide services even when some of its components fail.
Distributed Systems
Ceph
- Development began in 2004
- `RADOS`: Reliable Autonomic Distributed Object Store
- `LibRados`: a library for accessing RADOS (with bindings for C/C++/Python)
- three APIs built on `LibRados`:
  - `RadosGW`: object storage API, compatible with S3 and OpenStack Swift
  - `LibRBD`: block storage API, compatible with iSCSI
  - `CephFS` (served by the `MDS`): file system API, compatible with NFS and CIFS
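For a feel of the object-storage path, here is a minimal sketch using the python-rados bindings for LibRados; the configuration path and pool name are assumptions about your setup, and the snippet presumes a reachable Ceph cluster with python-rados installed.

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # assumed config path
cluster.connect()
ioctx = cluster.open_ioctx("mypool")                    # assumed pool name
try:
    ioctx.write_full("hello-object", b"Hello RADOS")    # write one whole object
    print(ioctx.read("hello-object"))                   # read it back
finally:
    ioctx.close()
    cluster.shutdown()
```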
Core Components(Services)
- `Monitor`: monitors cluster health and stores the cluster maps and configuration (e.g. OSD map, CRUSH map)
- `Client`: provides APIs to access RADOS and balances load across nodes
- `MDS` (Metadata Server): manages file metadata and directories
- `OSD` (Object Storage Daemon): stores data, handles read/write requests, replicates data, recovers data
Allocation of Resources: CRUSH
- data is divided into objects
- each object is hashed into one Placement Group (PG), i.e., one PG contains multiple objects
- the CRUSH algorithm maps PGs to OSDs; with multiple copies (assume three replicas below), the write path is (a simplified sketch follows this list):
  - the client's PG is mapped to the primary OSD
  - the primary OSD finds the other two OSDs via the CRUSH algorithm
  - the primary OSD sends the data to the other two OSDs
  - the other two OSDs store the data and return `ACK` to the primary OSD
  - the primary OSD writes the data to its local disk and returns `ACK` to the client
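The object-to-PG step is essentially a hash of the object name modulo the PG count; the PG-to-OSD step is what CRUSH does using the cluster map. The sketch below is a deliberately simplified stand-in: real CRUSH walks a weighted hierarchy of buckets (hosts, racks, etc.) rather than flat hashing, so treat the names and constants as illustrative.

```python
import hashlib

PG_NUM = 128             # placement groups in the pool (illustrative)
OSDS = list(range(12))   # twelve OSD ids (illustrative)
REPLICAS = 3

def object_to_pg(obj_name: str) -> int:
    """Step 1: hash the object name into one of PG_NUM placement groups."""
    return int(hashlib.md5(obj_name.encode()).hexdigest(), 16) % PG_NUM

def pg_to_osds(pg_id: int) -> list[int]:
    """Step 2 (stand-in for CRUSH): deterministically pick REPLICAS distinct OSDs.

    Real CRUSH walks the cluster map's weighted bucket hierarchy instead of
    flat hashing, which is what lets it respect failure domains.
    """
    picked: list[int] = []
    attempt = 0
    while len(picked) < REPLICAS:
        h = int(hashlib.md5(f"{pg_id}-{attempt}".encode()).hexdigest(), 16)
        osd = OSDS[h % len(OSDS)]
        if osd not in picked:
            picked.append(osd)
        attempt += 1
    return picked   # picked[0] plays the role of the primary OSD

pg = object_to_pg("my-object")
print(pg, pg_to_osds(pg))   # the same object name always maps to the same OSDs
```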
Pros and Cons
- Pros:
- High performance: parallel read/write, low latency
- Decentralized: no single point of failure, no bottleneck
- Strong consistency: `ACK` is returned only after all replicas are stored, which suits workloads where reads greatly outnumber writes
- Cons
- Poor expandability: adding new OSDs requires rebalancing existing data, which is time-consuming
GFS (Google File System)
- Components
- Master Server: manages metadata, namespace, and access control
- Chunk Server: stores data chunks (64 MB by default), handles read/write requests
- Client: provides APIs to access GFS, caches metadata

- For data reads/writes, GFS separates the data flow from the control flow via a lease mechanism (sketched below)
- Data flow: the client pushes data to each chunk server
- Control flow: the client contacts the primary replica, which decides how the write is applied across the chunk servers. Unless the primary replica is down, the client does not need to contact the master again.
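On the client side, locating data is simple arithmetic: a byte offset maps to a chunk index at the fixed 64 MB chunk size, and the master is consulted only for chunk handles and replica locations, which the client caches. A hedged sketch of that offset-to-chunk mapping (function names are hypothetical):

```python
CHUNK_SIZE = 64 * 1024 * 1024   # GFS default chunk size: 64 MB

def chunk_index(offset: int) -> int:
    """Which chunk a byte offset falls into."""
    return offset // CHUNK_SIZE

def chunk_range(offset: int, length: int) -> range:
    """Which chunks a read/write spanning [offset, offset + length) touches."""
    return range(chunk_index(offset), chunk_index(offset + length - 1) + 1)

# Reading 10 MB starting at byte offset 60 MB crosses a chunk boundary,
# so it touches chunks 0 and 1.
print(list(chunk_range(60 * 1024 * 1024, 10 * 1024 * 1024)))   # [0, 1]
```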
Pros and Cons
- Pros:
- suitable for large files, high throughput
- high performance: metadata cached at the client, parallel read/write (with read-ahead)
- high reliability: master state backed up via checkpoints, chunk replication
- Cons:
- single point of failure: the master server