Ceph Internals & Data Processing Capabilities - SUSECON

Apr 15, 2018

Page 1

Ceph Internals & Data Processing Capabilities

Joao Eduardo Luis, Senior Software Engineer
jluis@suse.com / joao@suse.de

Page 2

OVERALL ARCHITECTURE

RGW: web services gateway for object storage, compatible with S3 and Swift

LIBRADOS: client library allowing apps to access RADOS (C, C++, Java, Python, Ruby, PHP)

RADOS: software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

RBD: reliable, fully-distributed block device with cloud platform integration

CEPHFS: distributed file system with POSIX semantics and scale-out metadata management

[Figure labels: APP, HOST/VM, CLIENT]


Page 4

RADOS CLUSTER

[Figure: an APPLICATION connected to the RADOS CLUSTER; monitors are shown as M]

Page 5

RADOS COMPONENTS

OSDs
‒ Smart storage
‒ Resilient, distributed, self-healing, etc.
‒ Hundreds to thousands

Monitors (M)
‒ Keep track of cluster state
‒ Always consistent, or otherwise...
‒ 3, 5, 7, ...
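The odd monitor counts listed above follow from majority voting. The sketch below assumes the monitors need a strict majority to agree on cluster state; the function names are illustrative and not part of any Ceph API.

```cpp
#include <cstddef>

// Smallest number of monitors that forms a strict majority of `monitors`.
constexpr std::size_t quorum_size(std::size_t monitors) {
    return monitors / 2 + 1;
}

// Number of monitors that can fail while a majority is still reachable.
constexpr std::size_t tolerated_failures(std::size_t monitors) {
    return monitors == 0 ? 0 : monitors - quorum_size(monitors);
}
```

Under that assumption, going from 3 to 4 monitors raises the quorum size without raising the number of failures that can be tolerated, which is why even counts are rarely worth deploying.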

Page 6

OBJECT STORAGE DAEMONS

[Figure: four storage stacks, each an OSD on a file system (FS) on a DISK, plus three monitors (M)]

Page 7

WHERE IS MY OBJECT?

[Figure: an APPLICATION holding an OBJECT, with question marks over where in the cluster it belongs; monitors are shown as M]

Page 8

METADATA SERVER?

[Figure: APPLICATION, monitors (M), and two numbered steps (1, 2)]

Page 9

CALCULATED PLACEMENT?

[Figure: APPLICATION, monitors (M), and storage nodes labeled by name range (A-G, H-N, O-T, U-Z); an object named "F" is marked with question marks]
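A minimal sketch of the name-range idea in this figure: every client computes the target node from the object name alone, so no lookup service is needed. The function name and the choice to key on the first letter are assumptions made for illustration.

```cpp
#include <cctype>
#include <string>

// Hypothetical static table mirroring the ranges in the figure.
// Returns the range label for an object name, or "" when the name is
// empty or does not start with a letter.
std::string node_for(const std::string& object_name) {
    if (object_name.empty()) return "";
    const unsigned char first = static_cast<unsigned char>(object_name[0]);
    if (!std::isalpha(first)) return "";
    const char c = static_cast<char>(std::toupper(first));
    if (c <= 'G') return "A-G";
    if (c <= 'N') return "H-N";
    if (c <= 'T') return "O-T";
    return "U-Z";
}
```

Calling node_for("F") selects the A-G node. A fixed table like this has no answer for nodes being added, removed, or failing, which is presumably the question the slide title raises.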

Page 10

CRUSH

[Figure: OBJECTS are grouped into PLACEMENT GROUPS (PGs), which are spread across the RADOS CLUSTER]

Page 11

CRUSH

[Figure: a single OBJECT and its placement within the RADOS CLUSTER]

Page 12

CRUSH – Failure?

[Figure: a single OBJECT and the RADOS CLUSTER after a failure]
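The property at stake when a node fails can be shown with a much simpler technique than CRUSH. The sketch below uses rendezvous (highest-random-weight) hashing as a stand-in: it is not the CRUSH algorithm, which also takes the cluster hierarchy and device weights into account, and every name in it is hypothetical. It only demonstrates that placement can be recomputed by anyone from the list of live OSDs, and that removing one OSD moves only the data that lived on it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// splitmix64 finalizer, used here only as a deterministic mixing function.
inline std::uint64_t mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Pick up to `replicas` distinct OSD ids for a placement group by scoring
// every live OSD against the pg and keeping the highest-scoring ones.
std::vector<int> place(std::uint32_t pg, const std::vector<int>& live_osds,
                       std::size_t replicas) {
    std::vector<std::pair<std::uint64_t, int> > scored;
    for (int osd : live_osds) {
        const std::uint64_t key = (static_cast<std::uint64_t>(pg) << 32) |
                                  static_cast<std::uint32_t>(osd);
        scored.emplace_back(mix(key), osd);
    }
    // Sort by score, highest first.
    std::sort(scored.rbegin(), scored.rend());
    std::vector<int> chosen;
    for (std::size_t i = 0; i < scored.size() && i < replicas; ++i)
        chosen.push_back(scored[i].second);
    return chosen;
}
```

Because each score depends only on the (pg, osd) pair, a placement group that never used the failed OSD keeps exactly the same replicas, and one that did use it keeps its surviving replicas and gains a single new one.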

Page 13

DATA ORGANIZED INTO POOLS

[Figure: OBJECTS stored in a CLUSTER that is divided into POOLS (containing PGs): POOL A, POOL B, POOL C, POOL D]

Page 14

POOLS

REPLICATED POOL
‒ Full copies of stored objects
‒ Very high durability
‒ 3x (200% overhead)
‒ Quicker recovery

ERASURE CODED POOL
‒ One copy plus parity
‒ Cost-effective durability
‒ 1.5x (50% overhead)
‒ Expensive recovery

[Figure: in a CEPH STORAGE CLUSTER, the replicated pool stores an OBJECT as three COPY blocks; the erasure coded pool stores an OBJECT as chunks 1, 2, 3, 4 plus X and Y]
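The 3x and 1.5x figures can be reproduced with a little arithmetic. The sketch below reads the erasure coded figure's chunks 1 to 4 plus X and Y as k = 4 data chunks and m = 2 coding chunks; that reading, and the function names, are assumptions.

```cpp
// Raw bytes stored per byte of user data in an n-way replicated pool.
constexpr double replicated_ratio(unsigned copies) {
    return static_cast<double>(copies);
}

// Raw bytes stored per byte of user data with k data chunks and m coding
// chunks. k must be non-zero.
constexpr double erasure_ratio(unsigned k, unsigned m) {
    return static_cast<double>(k + m) / static_cast<double>(k);
}

// Extra space consumed on top of the user data, as a percentage.
constexpr double overhead_percent(double ratio) {
    return (ratio - 1.0) * 100.0;
}
```

With three copies the ratio is 3.0 (200% overhead); with k = 4 and m = 2 it is 6 / 4 = 1.5 (50% overhead), matching the numbers on the slide.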

Page 15

LIBRADOS


Page 17

EXAMPLE `HELLO WORLD!`

#include <rados/librados.hpp>

int main() {
  // Connect to cluster (return codes are not checked, as on the slide)
  librados::Rados rados;
  rados.init("admin");
  rados.conf_read_file(nullptr);  // read ceph.conf from the default locations
  rados.connect();

  // Create pool
  rados.pool_create("hello_pool");

  librados::IoCtx ctx;
  rados.ioctx_create("hello_pool", ctx);

  // Atomic (re)write
  librados::bufferlist data;
  data.append("hello world!");
  ctx.write_full("hello_object", data);

  librados::bufferlist attr;
  attr.append("1");
  ctx.setxattr("hello_object", "version", attr);

  rados.shutdown();
  return 0;
}

Page 18

LIBRADOS

[Figure: an APP linked with LIBRADOS, alongside three monitors (M)]

‒ object "foo" hashes to 0x2d87c31
‒ pool "bar" has id 2
‒ hash mod pg_num gives pg 2.c31
‒ CRUSH places the pg using the cluster state (hierarchy, osdmap)
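The "mod pg_num" step can be sketched as follows. The hash value 0x2d87c31 and pool id 2 are taken from the slide as given; pg_num = 4096 is an assumption, chosen only because it reproduces the slide's pg 2.c31. The hashing of the object name and the subsequent CRUSH step are not reproduced here, and pg_name is a hypothetical helper.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Format a placement group id as "<pool id>.<pg in hex>", the notation
// used on the slide. pg_num must be non-zero.
std::string pg_name(std::uint64_t pool_id, std::uint32_t object_hash,
                    std::uint32_t pg_num) {
    const std::uint32_t pg = object_hash % pg_num;  // the "mod pg_num" step
    char buf[48];
    std::snprintf(buf, sizeof(buf), "%llu.%x",
                  static_cast<unsigned long long>(pool_id),
                  static_cast<unsigned>(pg));
    return std::string(buf);
}
```

Calling pg_name(2, 0x2d87c31, 4096) yields "2.c31". Since every client holds the same cluster state, every client arrives at the same placement group without asking a central server.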

Page 19

COMPOUND OBJECT OPERATIONS

#include <rados/librados.hpp>

int main() {
  // Connect to cluster (return codes are not checked, as on the slide)
  librados::Rados rados;
  rados.init("admin");
  rados.conf_read_file(nullptr);  // read ceph.conf from the default locations
  rados.connect();

  // Create pool
  rados.pool_create("hello_pool");

  librados::IoCtx ctx;
  rados.ioctx_create("hello_pool", ctx);

  // Queue the write and the xattr update in a single operation
  librados::ObjectWriteOperation op;

  librados::bufferlist data;
  data.append("hello world!");
  op.write_full(data);

  librados::bufferlist attr;
  attr.append("1");
  op.setxattr("version", attr);

  // Apply both steps to the object together
  ctx.operate("hello_object", &op);

  rados.shutdown();
  return 0;
}

Page 20

RADOS OBJECT CLASSES

[Figure: an APP using LIBRADOS issues put_foo(data) and read("bar", data); the object class my_foo.so provides put_foo() and calc_bar()]
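The idea behind object classes, running a named method next to the data and returning only its result, can be modeled in a few lines. This is a toy in-process model, not the RADOS class interface: ToyStore and its members are hypothetical, and only the method names put_foo and calc_bar echo the figure.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy stand-in for a storage node: objects plus methods that run beside them.
class ToyStore {
public:
    typedef std::function<std::string(std::string& object,
                                      const std::string& input)> Method;

    void register_method(const std::string& cls, const std::string& name,
                         Method m) {
        methods_[std::make_pair(cls, name)] = std::move(m);
    }

    // Run a registered method against one object; only `output` travels
    // back to the caller. Returns false if the class/method is unknown.
    bool exec(const std::string& oid, const std::string& cls,
              const std::string& name, const std::string& input,
              std::string& output) {
        auto it = methods_.find(std::make_pair(cls, name));
        if (it == methods_.end()) return false;
        output = it->second(objects_[oid], input);
        return true;
    }

private:
    std::map<std::string, std::string> objects_;
    std::map<std::pair<std::string, std::string>, Method> methods_;
};
```

A caller would register put_foo and calc_bar under a class name such as "my_foo" and then invoke them by name through exec, never reading the object's bytes itself.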

Page 21

EXAMPLE

Server-side class (hello_hash_class)

int compute_md5(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  // Look up the object size, then read the whole object
  uint64_t size;
  int ret = cls_cxx_stat(hctx, &size, NULL);
  if (ret < 0)
    return ret;

  bufferlist data;
  ret = cls_cxx_read(hctx, 0, size, &data);
  if (ret < 0)
    return ret;

  // Hash on the storage node (Crypto++ MD5) and return only the digest
  byte digest[MD5::DIGESTSIZE];
  MD5().CalculateDigest(digest, (byte*)data.c_str(), data.length());

  out->append((const char*)digest, sizeof(digest));

  return 0;
}

librados client

bufferlist input, output;
ioctx.exec("hello_object", "hello_hash_class", "compute_md5", input, output);

Compound operations

ObjectReadOperation op;

uint64_t size;
time_t m_time;
op.stat(&size, &m_time, NULL);

bufferlist in, out;
op.exec("hello_hash_class", "compute_md5", in, &out, NULL);

// The operation is submitted through the IoCtx, not through op itself
int r = ioctx.operate("hello_object", &op, NULL);

Page 22

REAL APPLICATIONS

• Cooperative Locking

• Simple Object Reference Counting

• Image manipulation

• RADOS Block Device (RBD) & Gateway (RGW)

sources in src/cls/*

Page 23

DYNAMIC OBJECT CLASSES IN LUA

• Noah Watkins (UCSC / Red Hat)
‒ http://ceph.com/rados/dynamic-object-interfaces-with-lua/

local script = [[
function say_hello(input, output)
  output:append("Hello, ")
  if #input == 0 then
    output:append("world")
  else
    output:append(input:str())
  end
  output:append("!")
end
cls.register(say_hello)
]]

local ret, outdata = clslua.exec(ioctx, "oid", script, "say_hello", "")
print(outdata)

local ret, outdata = clslua.exec(ioctx, "oid", script, "say_hello", "John")
print(outdata)

Page 24

Thank you.

[email protected]@lists.ceph.com
#ceph / #ceph-devel @ OFTC
www.ceph.com

Page 26

Unpublished Work of SUSE LLC. All Rights Reserved.

This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer

This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.