OpenStack Swift

Intro

by Cong Peng / ddxgz.github.io/slides/swift.html / July 2014

Swift


Highly available, distributed, eventually consistent object store

  • Symmetric architecture, no single point of failure (SPoF), scales out without downtime
  • Tolerates unreliable hardware; nodes can use different hardware with different configuration parameters
  • Flat address space, no directory hierarchy
  • Configurable number of replicas, placed as far apart as possible
  • RESTful HTTP API to store and retrieve data; every object has a URL
  • Objects can have extensive metadata, which can be indexed and searched

Why Object Storage


  • File:
    • basic metadata - hard to manage
    • large tree dir structure
  • Object:
    • rich metadata - enables eDiscovery and business intelligence
    • container/object
    • scales out when needed - suits rapidly growing amounts of unstructured data

Differences from Ceph Object Store


  • Ceph:
    • also provides block and file storage
    • chooses consistency over availability
  • Swift:
    • supports Multi Datacenter installation
    • more flexible middleware that can be plugged in

Process

  • Entry point for new requests is the `Application.__call__()` in `swift/proxy/server.py`.
  • A different controller is chosen depending on the request
  • The controller looks up the relevant ring to find the location of the data

Partition

  • Partitions are mapped to devices; each device holds one or more partitions
  • Number of partitions = 2^part_power
  • Adding or removing devices does not change the number of partitions, only the mapping between partitions and devices

The Ring Data

The Ring Data structure consists of three top level fields:

  • List of Devices: info about the devices in the cluster
  • Partition Shift Value: used when hashing to turn a hash into a partition number
  • Partition Assignment List: the partitions assigned to devices, a list of lists

Consistent Hashing

hash(path) = md5(path + per-cluster suffix)

top 4 bytes -> top part_power bits -> the partition


# md5 is hashlib.md5; unpack the first 4 bytes of the raw digest
key = md5(b'/account/container/object').digest()
part = struct.unpack_from('>I', key)[0] >> self._part_shift
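A self-contained sketch of the same calculation follows. The partition power of 20 and the suffix value are assumptions for illustration; `get_part` is a hypothetical helper, not a Swift function.

```python
import struct
from hashlib import md5

PART_POWER = 20                # assumed for illustration
PART_SHIFT = 32 - PART_POWER   # bits to drop from the 32-bit prefix
HASH_SUFFIX = b"changeme"      # hypothetical per-cluster suffix


def get_part(path: str) -> int:
    """Map an object path to a partition number."""
    digest = md5(path.encode("utf-8") + HASH_SUFFIX).digest()
    # Read the first 4 bytes as a big-endian unsigned int,
    # then keep only the top PART_POWER bits.
    return struct.unpack_from(">I", digest)[0] >> PART_SHIFT


part = get_part("/account/container/object")
```

The same path always hashes to the same partition, and the result is always below 2^part_power.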
					    

Storing

Building the Ring

  1. Calculates the number of partitions to assign to each device based on the device's weight. For example, with a partition power of 20 the ring has 1,048,576 partitions, so one thousand devices of equal weight will each want 1,048.576 partitions.
  2. Assigns each partition to the device that currently wants the most partitions ("most wanted" first), while keeping replicas as far apart as possible
  3. Create the builder file (partition power 20, 3 replicas, min_part_hours 1) and add devices:
    swift-ring-builder object.builder create 20 3 1
    swift-ring-builder object.builder add r1z1-127.0.0.1:6010/sdb1 1
    swift-ring-builder object.builder add r1z2-127.0.0.1:6020/sdb2 1
    swift-ring-builder object.builder add r1z3-127.0.0.1:6030/sdb3 1
    swift-ring-builder object.builder add r1z4-127.0.0.1:6040/sdb4 1
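The weight calculation in step 1 can be sketched as below. `parts_wanted` is a hypothetical helper; it defaults to a single replica so that the numbers match the example in step 1, which is a simplifying assumption.

```python
def parts_wanted(weights, part_power, replicas=1):
    """Return the (fractional) number of partitions each device wants."""
    total_parts = 2 ** part_power * replicas
    total_weight = sum(weights.values())
    return {dev: total_parts * w / total_weight for dev, w in weights.items()}


# One thousand devices of equal weight, partition power 20.
wanted = parts_wanted({"d%d" % i: 1.0 for i in range(1000)}, part_power=20)

# Unequal weights: the heavier device wants proportionally more partitions.
skewed = parts_wanted({"small": 1.0, "big": 3.0}, part_power=2)
```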
    

Rebuilding the Ring

  1. Recalculates the number of partitions that each device wants
  2. Gathers partitions for reassignment, including:
    • partitions from removed devices
    • any that can be spread out for better durability
    • random partitions from devices that have more than they need
  3. Reassigns the gathered partitions to devices using a method similar to the initial build
  4. No partition is moved twice within a configurable amount of time (min_part_hours)
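A much simplified sketch of the gather and reassign steps is shown here. The helpers `gather` and `reassign` are hypothetical, and the sketch assumes one replica and ignores both replica dispersion and the time limit on moving a partition.

```python
def gather(assignment, wanted):
    """Collect partitions from removed devices and from devices over target.

    assignment: dict device -> list of partition ids currently held
    wanted:     dict device -> target partition count; devices missing
                from `wanted` are treated as removed.
    """
    gathered = []
    for dev in list(assignment):
        target = wanted.get(dev, 0)
        while len(assignment[dev]) > target:
            gathered.append(assignment[dev].pop())
        if dev not in wanted:
            del assignment[dev]
    return gathered


def reassign(assignment, wanted, gathered):
    """Give each gathered partition to the device furthest below its target."""
    for part in gathered:
        dev = max(wanted, key=lambda d: wanted[d] - len(assignment.get(d, [])))
        assignment.setdefault(dev, []).append(part)


# Two devices hold 8 partitions; a third device "c" is added.
assignment = {"a": [0, 1, 2, 3], "b": [4, 5, 6, 7]}
wanted = {"a": 3, "b": 3, "c": 2}
moved = gather(assignment, wanted)
reassign(assignment, wanted, moved)
```

Only two of the eight partitions move, which is the point of keeping the partition count fixed and changing just the mapping.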

Replicator

Keeps the system in a consistent state in the face of temporary error conditions such as network outages or drive failures

  1. Compares local data with each remote copy to ensure they all contain the latest version
  2. If the state is not consistent:
    • For object replication, updating is just rsyncing files to the peer
    • Account and container replication push missing records over HTTP or rsync whole database files

Also ensures that deleted data is removed from the system

Updater

Occasionally, container or account data cannot be updated immediately during failures or high load

  • If an update fails, it is queued locally and the updater processes it later.
  • For example:
    • Suppose a new object is put into the system while the container server is under load
    • The object is available for reads as soon as the proxy server responds to the client with success
    • The container server has not yet updated the object listing, so the update is queued and applied later.
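The example above can be sketched with a toy container server and a local queue. All names here (`ContainerServer`, `put_object`, `run_updater`) are hypothetical, and an in-memory deque stands in for the on-disk queue of pending updates.

```python
from collections import deque


class ContainerServer:
    """Toy container server that can be marked as overloaded."""

    def __init__(self):
        self.listing = set()
        self.overloaded = False

    def update(self, obj_name):
        if self.overloaded:
            raise ConnectionError("container server busy")
        self.listing.add(obj_name)


def put_object(name, container, pending):
    """Store an object; queue the container update locally if it fails."""
    try:
        container.update(name)
    except ConnectionError:
        pending.append(name)
    return 201  # the client sees success either way


def run_updater(container, pending):
    """Retry queued updates; anything that still fails stays queued."""
    for _ in range(len(pending)):
        name = pending.popleft()
        try:
            container.update(name)
        except ConnectionError:
            pending.append(name)


container, pending = ContainerServer(), deque()
container.overloaded = True
status = put_object("photo.jpg", container, pending)  # update gets queued
container.overloaded = False
run_updater(container, pending)                        # listing catches up
```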

Auditor


  • Walks through the local server, checking the integrity of objects, containers, and accounts

  • If corruption is found, the file is quarantined

  • The replicator then replaces the bad file from another replica
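An integrity check of this kind can be sketched as below, assuming the expected MD5 of each object is stored alongside it. The function `audit` and the directory name `quarantined` are illustrative choices, not Swift's actual layout.

```python
import hashlib
import os
import shutil
import tempfile


def audit(obj_path, expected_etag, quarantine_dir):
    """Return True if the file matches its stored MD5; otherwise quarantine it."""
    with open(obj_path, "rb") as f:
        actual = hashlib.md5(f.read()).hexdigest()
    if actual == expected_etag:
        return True
    os.makedirs(quarantine_dir, exist_ok=True)
    shutil.move(obj_path, os.path.join(quarantine_dir, os.path.basename(obj_path)))
    return False


root = tempfile.mkdtemp()
quarantine = os.path.join(root, "quarantined")
path = os.path.join(root, "obj.data")
with open(path, "wb") as f:
    f.write(b"hello")
etag = hashlib.md5(b"hello").hexdigest()

ok_before = audit(path, etag, quarantine)  # file is intact
with open(path, "wb") as f:
    f.write(b"hellp")                      # simulate bit rot
ok_after = audit(path, etag, quarantine)   # file is moved aside
```

Once the file is moved out of its normal location, the replicator sees it as missing and restores it from another replica.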

New Features


  • Multiple object rings for different storage policies, such as different numbers of replicas, already implemented in v2.0

  • Erasure codes, which save storage space at the cost of CPU, in progress

Directory Structure

Swift Start


/bin/swift-init -> Manager -> Server -> wsgi/daemon

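# Instantiate Manager, the class that manages the server classes
# and initializes every service (excerpt from bin/swift-init)
manager = Manager(servers, run_dir=options.run_dir)
try:
    status = manager.run_command(command, **options.__dict__)
except UnknownCommandError:
    parser.print_help()
    print('ERROR: unknown command, %s' % command)
    status = 1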
					    
  • On process startup, each server's __init__() method sets up internal state from the config file and gets the server ready to handle requests


Process


  • The client sends a request to Swift on the bind_port set in the proxy-server.conf file
  • When a request comes to the server, it flows through the WSGI “pipeline” through any middleware to the server’s __call__ method

Filter


  • Filters are the middleware in the WSGI “pipeline”; each request passes through them before it reaches the server’s __call__ method
  • Configured in the pipeline setting of proxy-server.conf
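A minimal sketch of one filter and a hand-built pipeline follows. `PathRecorder` and `proxy_app` are hypothetical; `filter_factory(global_conf, **local_conf)` follows the paste.deploy convention, and the pipeline is assembled by hand here rather than loaded from a config file.

```python
class PathRecorder:
    """Hypothetical middleware: records each request path, then calls the next app."""

    def __init__(self, app, conf):
        self.app = app
        self.seen = conf.setdefault("seen", [])

    def __call__(self, environ, start_response):
        self.seen.append(environ.get("PATH_INFO", ""))
        return self.app(environ, start_response)


def filter_factory(global_conf, **local_conf):
    """paste.deploy-style factory: returns a function that wraps the next app."""
    conf = dict(global_conf, **local_conf)

    def make_filter(app):
        return PathRecorder(app, conf)
    return make_filter


def proxy_app(environ, start_response):
    """Stand-in for the server at the end of the pipeline."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


# Build the pipeline by hand: request -> PathRecorder -> proxy_app.
pipeline = filter_factory({})(proxy_app)
statuses = []
body = pipeline({"PATH_INFO": "/v1/AUTH_demo/container/object"},
                lambda status, headers: statuses.append(status))
```

Because every filter has the same callable shape as the server it wraps, filters can be stacked in any order the pipeline lists them.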



API

  • RESTful:
    present credentials to the auth service -> the auth service returns a token and a URL for the account
    -> use token to access data in the account

    Get a token via the Keystone v3 API using curl with a JSON body:
    curl -i -H "Content-type:application/json" -d '
    {"auth":{
      "identity":{
       "methods":["password"],
       "password":{
        "user":{
         "name":"demo",
         "domain":{"id":"default"},
         "password":"thisIsPassword"}}},
      "scope": {
        "project": {
          "name": "demo",
          "domain": { "id": "default" }
        }}}}' \
      http://127.0.0.1:5000/v3/auth/tokens
  • Get an object from a container:
    curl -i $publicURL/container/object -X GET -H "X-Auth-Token: $token"
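The same GET can be prepared from Python with only the standard library. This is a sketch: `build_get_request` is a hypothetical helper, the endpoint and token are placeholders, and the request is built but not sent.

```python
import urllib.request


def build_get_request(public_url, container, obj, token):
    """Build (but do not send) a GET request for one object."""
    url = "%s/%s/%s" % (public_url.rstrip("/"), container, obj)
    return urllib.request.Request(url, headers={"X-Auth-Token": token}, method="GET")


# Placeholder endpoint and token, for illustration only.
req = build_get_request("http://127.0.0.1:8080/v1/AUTH_demo",
                        "container", "object", "tok123")
# To actually fetch the object: urllib.request.urlopen(req).read()
```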

API

  • Python-Swiftclient
    Upload an object to a container via auth v2:
    swift -V 2.0 -A http://****:5000/v2.0 -U demo:demo -K 888888 upload container object
  • jclouds: multi-cloud open source library for Java and Clojure
    • supports 30 cloud stacks including Amazon, Azure, OpenStack
  • ...

Extending

  • Middleware - Python WSGI Middleware:
    • can be used to “wrap” the request and response of a WSGI application (i.e. proxy-server, account-server, container-server, object-server)
    • Swift uses middleware to add (sometimes optional) behaviors to its WSGI servers
  • ZeroVM - Computing on-disk:
    • integration with Swift has been implemented
    • pushes the application to the data instead of pulling the data to the application
  • ...

Installation


  • As a component installed with OpenStack via packstack or devstack

  • Swift All In One (SAIO) for development:
    official instructions for setting up a VM that emulates a four-node Swift cluster

  • Standalone Swift Cluster: flexible ways of deployment
    official instructions for a multiple-server Swift installation on Ubuntu