CephFS: Today and Tomorrow
Greg Farnum
[email protected]
SC '15

Transcript
Page 1: CephFS: Today and Tomorrow

Greg Farnum
[email protected]
SC '15

Page 2: Architectural overview

Page 3: Ceph architecture

● RGW: A web services gateway for object storage, compatible with S3 and Swift

● LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

● RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

● RBD: A reliable, fully-distributed block device with cloud platform integration

● CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

(Diagram: APP sits atop RGW and LIBRADOS, HOST/VM atop RBD, and CLIENT atop CEPHFS)

Page 4: Components

(Diagram: a Linux host runs the CephFS client, which sends data to OSDs and metadata to MDSs; the Ceph server daemons shown are OSDs, monitors (M), and MDSs)

Page 5: Dynamic subtree placement

Page 6: Using CephFS Today

Page 7: rstats are cool

# ext4 reports dirs as 4K
$ ls -lhd /home/john/data
drwxrwxr-x. 2 john john 4.0K Jun 25 14:58 /home/john/data

# cephfs reports dir size from contents
$ ls -lhd /cephfs/mydata
drwxrwxr-x. 1 john john 16M Jun 25 14:57 /cephfs/mydata
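These recursive statistics are also exposed as virtual extended attributes, so they can be read directly; a quick sketch (paths are illustrative):

# Recursive bytes, file count, and subdirectory count for a tree
$ getfattr -n ceph.dir.rbytes /cephfs/mydata
$ getfattr -n ceph.dir.rfiles /cephfs/mydata
$ getfattr -n ceph.dir.rsubdirs /cephfs/mydata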

Page 8: Monitoring the MDS

$ ceph daemonperf mds.a
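daemonperf gives a live, top-style view; for the raw counters behind it, the same admin socket also answers perf dump (daemon name is illustrative):

# Dump all current performance counters as JSON
$ ceph daemon mds.a perf dump

# Describe the meaning/type of each counter
$ ceph daemon mds.a perf schema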

Page 9: MDS admin socket commands

● session ls: list client sessions

● session evict: forcibly tear down client session

● scrub_path: invoke scrub on particular tree

● flush_path: flush a tree from journal to backing store

● flush journal: flush everything from the journal

● force_readonly: put MDS into readonly mode

● osdmap barrier: block caps until this OSD map
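All of these are issued through the daemon's admin socket; a short sketch (daemon name and path are illustrative):

# Inspect sessions, then scrub and flush a particular subtree
$ ceph daemon mds.a session ls
$ ceph daemon mds.a scrub_path /some/dir
$ ceph daemon mds.a flush_path /some/dir

# Flush the entire journal to the backing store
$ ceph daemon mds.a flush journal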

Page 10: MDS health checks

● Detected on MDS, reported via mon (see the example below):
  ● Client failing to respond to cache pressure
  ● Client failing to release caps
  ● Journal trim held up
  ● ...more in future

● Mainly providing faster resolution of client-related issues that can otherwise stall metadata progress

● Aggregate alerts for many clients

● Future: aggregate alerts for one client across many MDSs
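A sketch of how one of these checks surfaces in overall cluster health (output is illustrative, not verbatim):

$ ceph health detail
HEALTH_WARN mds0: Client claystone failing to respond to cache pressure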

Page 11: OpTracker in MDS

● Provide visibility of ongoing requests, as the OSD does

$ ceph daemon mds.a dump_ops_in_flight
{ "ops": [
      { "description": "client_request(client.
        "initiated_at": "2015-03-10 22:26:17.4
        "age": 0.052026,
        "duration": 0.001098,
        "type_data": [
            "submit entry: journal_and_reply",
            "client.4119:21120",
...

Page 12: cephfs-journal-tool

● Disaster recovery for damaged journals (usage sketch below):
  ● journal inspect/import/export/reset
  ● header get/set
  ● event recover_dentries

● Works in parallel with new journal format, to make a journal glitch non-fatal (able to skip damaged regions)

● Allows rebuild of metadata that exists in journal but is lost on disk

● Companion cephfs-table-tool exists for resetting session/inode/snap tables as needed afterwards.
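A minimal disaster-recovery sketch with these tools (always export a backup of the journal first; arguments are illustrative):

# Save a copy of the journal before modifying anything
$ cephfs-journal-tool journal export backup.bin

# Recover dentries from the journal, then erase the damaged journal
$ cephfs-journal-tool event recover_dentries summary
$ cephfs-journal-tool journal reset

# Afterwards, reset the session table as needed
$ cephfs-table-tool all reset session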

Page 13: Full space handling

● Previously: a full (95%) RADOS cluster stalled clients writing, but allowed MDS (metadata) writes:
  ● Lots of metadata writes could continue, filling the cluster to 100%
  ● Deletions could deadlock if clients had dirty data flushes that stalled on deleting files

● Now: generate ENOSPC errors in the client, propagate into fclose/fsync as necessary. Filter ops on MDS to allow deletions but not other modifications.

● Bonus: I/O errors seen by the client are also propagated to fclose/fsync, where previously they weren't.

Page 14: Client management

● Client metadata
  ● Reported at startup to MDS
  ● Human or machine readable

● Stricter client eviction
  ● For misbehaving, not just dead clients
  ● Use OSD blacklisting

Page 15: Client management: metadata

● Metadata used to refer to clients by hostname in health messages

● Future: extend to environment-specific identifiers like HPC jobs, VMs, containers...

# ceph daemon mds.a session ls
...
    "client_metadata": {
        "ceph_sha1": "a19f92cf...",
        "ceph_version": "ceph version 0.93...",
        "entity_id": "admin",
        "hostname": "claystone",
        "mount_point": "\/home\/john\/mnt"
    }

Page 16: Client management: strict eviction

ceph osd blacklist add <client addr>
ceph daemon mds.<id> session evict
ceph daemon mds.<id> osdmap barrier

● Blacklisting clients from OSDs may be overkill in some cases if we know they are already really dead, or they held no dangerous caps.

● This is fiddly when multiple MDSs are in use: should wrap into a single global evict operation in future
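Putting the sequence together (the address and session id are purely illustrative; exact argument forms vary by release):

# Blacklist the client's address at the OSD level
$ ceph osd blacklist add 192.168.0.20:0/3029688590

# Tear down its MDS session, then barrier caps on the new OSD map
$ ceph daemon mds.a session evict 4119
$ ceph daemon mds.a osdmap barrier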

Page 17: FUSE client improvements

● Various fixes to cache trimming

● FUSE issues since Linux 3.18: lack of an explicit means to dirty cached dentries en masse (we need a better way than remounting!)

● flock is now implemented (requires FUSE >= 2.9 because of interruptible operations)

● Soft client-side quotas (stricter quota enforcement needs more infrastructure; see the sketch below)
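The soft quotas are set through virtual extended attributes on a directory; a minimal sketch (path and limits are illustrative):

# Limit a subtree to ~100 MB and 10,000 files
$ setfattr -n ceph.quota.max_bytes -v 100000000 /cephfs/mydata
$ setfattr -n ceph.quota.max_files -v 10000 /cephfs/mydata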

Page 18: Using CephFS Tomorrow

Page 19: Access control improvements (Merged)

● GSoC and Outreachy students
  ● NFS-esque root_squash
  ● Limit access by path prefix

● Combine path-limited access control with subtree mounts, and you have a good fit for container volumes (see the capability sketch below).
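A sketch of the kind of path-restricted identity this enables (names, pool, and exact cap syntax are illustrative of what landed around Jewel):

# Create a key that may only read/write the /volumes/foo subtree
$ ceph auth get-or-create client.foo \
    mon 'allow r' \
    osd 'allow rw pool=cephfs_data' \
    mds 'allow rw path=/volumes/foo'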

Page 20: Backward scrub & recovery (ongoing)

● New tool: cephfs-data-scan (basics exist; usage sketch below)

● Extract files from a CephFS data pool, and either hook them back into a damaged metadata pool (repair) or dump them out to a local filesystem.

● Best-effort approach, fault tolerant

● In the unlikely event of loss of CephFS availability, you can still extract essential data.

● Execute many workers in parallel for scanning large pools
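The invocation is a two-pass scan over the data pool (pool name is illustrative; each pass can be run by several workers in parallel):

# Pass 1: walk all objects to reconstruct file sizes and layouts
$ cephfs-data-scan scan_extents cephfs_data

# Pass 2: walk inode backtraces to relink files into the metadata pool
$ cephfs-data-scan scan_inodes cephfs_data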

Page 21: Forward Scrub (partly exists)

● Continuously scrub through metadata tree and validate:
  ● forward and backward links (dirs→files, file “backtraces”)
  ● files exist, are right size
  ● rstats match reality

Page 22: Jewel “stable CephFS”

● The Ceph community is declaring CephFS stable in Jewel

● That's limited:
  ● No snapshots
  ● Single active MDS
  ● We have no idea what workloads it will do well under

● But we will have working recovery tools!

Page 23: Test & QA

● teuthology test framework:
  ● Long running/thrashing tests
  ● Third party FS correctness tests
  ● Python functional tests

● We dogfood CephFS within the Ceph team
  ● Various kclient fixes discovered
  ● Motivation for new health monitoring metrics

● Third party testing is extremely valuable

Page 24: CephFS Future

Page 25: Snapshots in practice

[john@schist backups]$ touch history
[john@schist backups]$ cd .snap
[john@schist .snap]$ mkdir snap1
[john@schist .snap]$ cd ..
[john@schist backups]$ rm -f history
[john@schist backups]$ ls
[john@schist backups]$ ls .snap/snap1
history
# Deleted file still there in the snapshot!

Page 26: Dynamic subtree placement

Page 27: Functional testing

● Historic tests are “black box” client workloads: no validation of internal state.

● More invasive tests for exact behaviour, e.g.:
  ● Were RADOS objects really deleted after an rm?
  ● Does MDS wait for client reconnect after restart?
  ● Is a hardlinked inode relocated after an unlink?
  ● Are stats properly auto-repaired on errors?
  ● Rebuilding FS offline after disaster scenarios

● Fairly easy to write using the classes provided in ceph-qa-suite/tasks/cephfs

Page 28: Tips for early adopters

http://ceph.com/resources/mailing-list-irc/

http://tracker.ceph.com/projects/ceph/issues

http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/

● Does the most recent development release or kernel fix your issue?

● What is your configuration? MDS config, Ceph version, client version, kclient or fuse

● What is your workload?

● Can you reproduce with debug logging enabled?
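A sketch of raising debug verbosity on a live MDS while reproducing (levels and daemon name are illustrative; see the log-and-debug link above):

# Turn up MDS and messenger logging via the admin socket
$ ceph daemon mds.a config set debug_mds 20
$ ceph daemon mds.a config set debug_ms 1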

Page 29: Questions?