Highly Available, distributed, eventually consistent object store
Symmetric structure, no SPoF, scale-out without downtime
Runs on unreliable hardware; different hardware can have different configuration parameters
Flat address space, no directory hierarchy
Configurable number of replicas, placed as far apart as possible
RESTful HTTP API to store and retrieve data; every object has a URL
Objects can have extensive metadata, which can be indexed and searched
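A minimal sketch of that API using python-requests (the endpoint, account, container, and token below are illustrative, not from the source):

import requests

url = 'http://swift.example.com:8080/v1/AUTH_test/photos/cat.jpg'
headers = {'X-Auth-Token': 'AUTH_tk_example'}  # token from the auth service

# store an object at its URL
with open('cat.jpg', 'rb') as f:
    requests.put(url, data=f, headers=headers)

# retrieve it, and attach custom metadata
resp = requests.get(url, headers=headers)
print(resp.status_code)
requests.post(url, headers={**headers, 'X-Object-Meta-Album': 'pets'})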
Why Object Storage
File:
basic metadata - hard to manage
large tree directory structure
Object:
rich metadata - enables eDiscovery and business intelligence
container/object
scale out when needed - rapidly growing amount of unstructured data
Differences to Ceph Object Store
Ceph:
also provides block and file storage
chooses consistency over availability
Swift:
supports Multi Datacenter installation
more flexible middleware that can be plugged into the request pipeline
Process
Entry point for new requests is the `Application.__call__()` in `swift/proxy/server.py`.
A different controller (account, container, or object) is chosen depending on the request
The controller looks up the appropriate ring to find the location of the data (see the sketch below)
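A sketch of that lookup; swift.common.ring.Ring and get_nodes() are Swift's real ring API, while the paths and names here are illustrative:

from swift.common.ring import Ring

object_ring = Ring('/etc/swift/object.ring.gz')
# returns the partition plus one node dict per replica
partition, nodes = object_ring.get_nodes('AUTH_test', 'photos', 'cat.jpg')
for node in nodes:
    print(node['ip'], node['port'], node['device'])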
Partition
The ring maps partitions to devices; each device holds one or more partitions
Number of partitions = 2^part_power
Adding or removing devices does not change the number of partitions, only the mapping between partitions and devices
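A two-line illustration of that point: the partition count depends only on part_power, so device changes never resize the ring, they only remap it.

part_power = 20
num_partitions = 2 ** part_power  # 1,048,576, unchanged by adding/removing devices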
The Ring Data
The Ring Data structure consists of three top level fields:
List of Devices: info about each device in the cluster
Partition Shift Value: used when hashing to map a hash to a partition
Partition Assignment List: partitions assigned to devices, a list of lists, one per replica (sketched below)
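An illustrative, simplified shape of those three fields in Python (the real ring stores the assignment table as compact arrays, with slightly different field names):

ring_data = {
    'devs': [  # List of Devices
        {'id': 0, 'zone': 1, 'weight': 100.0,
         'ip': '10.0.0.1', 'port': 6000, 'device': 'sdb'},
        # ... one dict per device in the cluster
    ],
    'part_shift': 32 - 20,  # Partition Shift Value for part_power = 20
    'replica2part2dev': [   # Partition Assignment List: one list per replica,
        [0, 3, 1],          # indexed by partition, holding a device id
        [1, 0, 2],
        [2, 1, 3],
    ],
}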
Consistent Hashing
hash(path) = md5(path + per-cluster suffix)
top 4 bytes -> top part_power bits -> the partition
part = struct.unpack_from('>I', md5(b'/account/container/object').digest())[0] \
    >> self._part_shift
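A self-contained version of the computation above (the suffix value and part_power are deployment-specific; these are illustrative):

import struct
from hashlib import md5

part_power = 20
part_shift = 32 - part_power
suffix = 'per-cluster-secret'

digest = md5(('/account/container/object' + suffix).encode()).digest()
part = struct.unpack_from('>I', digest)[0] >> part_shift
print(part)  # an integer in [0, 2**part_power)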
Storing
Building the Ring
Calculates the number of partitions to assign to each device based on the weight of the device.
For example, with a partition power of 20, the ring has 1,048,576 partitions. One thousand devices of equal weight will each want 1,048.576 partitions assigned to them; the desired count can be fractional (see the sketch after this list).
Assign each partition to the devices sorted by "most wanted", while keeping replicas as far away as possible
Recalculates the desired partitions that each device wants
Gather partitions for reassignment, including:
from removed devices
any that can be spread out for better durability
random partitions from devices that have more than they need
Reassign the gathered partitions to devices by the same method
No partition will be moved twice within a configurable amount of time
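A sketch of that desired-count calculation under the example's assumptions (equal weights, replicas ignored; the real builder also factors in the replica count):

part_power = 20
partitions = 2 ** part_power  # 1,048,576
devices = [{'id': i, 'weight': 1.0} for i in range(1000)]

total_weight = sum(d['weight'] for d in devices)
for d in devices:
    # fractional desired count: 1,048.576 for each equal-weight device
    d['parts_wanted'] = partitions * d['weight'] / total_weight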
Replicator
To keep the system in a consistent state in the face of temporary error conditions like network outages or drive failures
Compare local data with remote copies to ensure they all contain an up-to-date version
If the state is not consistent:
For object replication, updating is just rsyncing files to the peer (see the sketch below)
Account and container replication push missing records over HTTP or rsync whole database files
Also ensures that deleted data is removed from the system
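A conceptual sketch of the object-replication idea, not Swift's actual replicator code: compare a content hash with each peer's and rsync the files over when they differ. All names here are illustrative:

import subprocess

def sync_if_stale(local_dir, local_hash, peer_hash, peer_host, peer_dir):
    # the hashes stand in for Swift's cached per-suffix digests
    if local_hash != peer_hash:
        subprocess.check_call(
            ['rsync', '-a', local_dir + '/', f'{peer_host}:{peer_dir}/'])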
Updater
Occasionally, container or account data cannot be updated immediately, e.g. during failures or periods of high load
If an update fails, it is queued locally and the updater processes the failed updates later
For example:
Suppose a new object is put into the system while the container server is under load
The object will be immediately available for reads as soon as the proxy server responds to the client with success
But the container server may not have updated the object listing, so the update is queued locally for later processing (sketched below)
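A conceptual sketch of that queuing step (the file layout and format here are illustrative; Swift's real async-pending files differ):

import json, os, time

def queue_container_update(async_dir, account, container, obj, headers):
    # written by the object server when the container server is unreachable;
    # the updater daemon scans this directory and replays entries later
    os.makedirs(async_dir, exist_ok=True)
    entry = {'account': account, 'container': container, 'obj': obj,
             'headers': headers, 'queued_at': time.time()}
    path = os.path.join(async_dir, '%.5f.pending' % time.time())
    with open(path, 'w') as f:
        json.dump(entry, f)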
Auditor
Walks the local server checking the integrity of objects, containers, and accounts
If corruption is found, the file is isolated (quarantined); see the sketch below
The replicator then replaces the bad file with a copy from another replica
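A conceptual sketch of an object audit (the quarantine path and metadata source are illustrative):

import hashlib, os, shutil

def audit_object(path, expected_etag, quarantine_dir):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(64 * 1024), b''):
            h.update(chunk)
    if h.hexdigest() != expected_etag:
        os.makedirs(quarantine_dir, exist_ok=True)
        shutil.move(path, quarantine_dir)  # replicator restores a good copy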
New Features
Multiple object rings for different storage policies, such as different numbers of replicas, already implemented in v2.0 (sketched below)
Erasure codes, trading CPU for reduced storage usage, in progress
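A sketch of how multiple object rings back storage policies; Ring and the object/object-1 ring naming follow Swift's conventions, while the paths and example policies are illustrative:

from swift.common.ring import Ring

rings = {
    0: Ring('/etc/swift', ring_name='object'),    # default policy, e.g. 3 replicas
    1: Ring('/etc/swift', ring_name='object-1'),  # e.g. a 2-replica policy
}

def ring_for_policy(policy_index):
    return rings[policy_index]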
Directory Structure
Swift Start
/bin/swift-init -> Manager -> Server -> wsgi/daemon
# Instantiate Manager, the class that manages each server class
# and will initialize every service
manager = Manager(servers, run_dir=options.run_dir)
try:
    status = manager.run_command(command, **options.__dict__)
except UnknownCommandError:  # excerpt from bin/swift-init, error handling abbreviated
    status = 1
On process startup, each server calls its __init__() method to set up internal state from its config file and get ready to handle requests
Process
Client sends a request to Swift via the bind_port set in the proxy-server.conf file
When a request arrives, it flows through the WSGI “pipeline”, passing through any middleware, to the server’s __call__ method (see the sketch below)
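A minimal sketch of the WSGI pipeline idea (illustrative; Swift assembles its real pipeline from the paste.deploy config):

def app(environ, start_response):
    # stands in for the proxy server's __call__
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'object data\n']

class LoggingMiddleware:
    # stands in for any middleware configured in the pipeline
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        print(environ['REQUEST_METHOD'], environ.get('PATH_INFO'))
        return self.app(environ, start_response)

pipeline = LoggingMiddleware(app)  # request -> middleware -> app.__call__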