Peter Monks Director of Technology, Strategic Alliances

BP-3 Taking Your Bulk Content Ingestions to the Next Level

May 22, 2015

Learn about the Alfresco Bulk Filesystem Import Tool, a community developed extension to Alfresco that provides a high performance bulk import feature. Discover how different tuning parameters affect import performance, and learn how to determine the optimum configuration for your Alfresco environment.
Transcript
Page 1: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Peter Monks
Director of Technology, Strategic Alliances

Page 2: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Agenda
1.  Introduction to the Bulk Filesystem Import Tool
2.  Demo
3.  Performance analysis
    1.  Methodology
    2.  Results
    3.  Conclusions
4.  Roadmap for the Bulk Filesystem Import Tool
5.  Q&A
6.  Appendices

Page 3: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Introduction to the Bulk Filesystem Import Tool

(BFSIT for short)

Page 4: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Introduction to the BFSIT
BFSIT = Bulk Filesystem Import Tool

•  Primary use case: one-off content migration / ingestion
•  Provides high-performance import of content
•  A community maintained extension to Alfresco
•  Hosted on Google Code [1]
•  LGPL licensed
•  Widely adopted

Page 5: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Introduction to the BFSIT
Why not use…
•  Web UIs?
•  ACP Files?
•  CIFS, FTP, NFS, WebDAV, IMAP?
•  CMIS?

All of the above suffer from one or more of:
•  Content sent over network
•  External (out of process) orchestration
•  Content requires pre-/post-processing (e.g. ACP)
•  Chatty (e.g. CIFS, NFS)
•  Overly general (e.g. CMIS)

Page 6: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Introduction to the BFSIT
Solution
•  Import content from the Alfresco server
•  Load folders & content as they appear on disk
•  Content is imported in batches
•  The “unit of work” is the directory
   •  Each directory is imported in at least one batch
   •  More if lots of content
   •  Batches within a directory are processed serially

Page 7: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Introduction to the BFSIT
Usage:
1.  Initiated via a simple repo Web Script (can also be initiated via wget, curl, et al)
2.  Import runs in background
3.  Detailed status is displayed while in progress
    •  We’ll see that during the demo

Page 8: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Introduction to the BFSIT
Process details:
•  Place source directory on job queue & immediately return
•  Worker thread pulls a single directory off job queue, and:
   1.  Lists the contents of the directory
   2.  Groups entries into “importable items”
   3.  Filters importable items, based on admin-defined filtering rules
   4.  Subdivides list of importable items into batches
   5.  Imports batches, one at a time (serially)
   6.  Places all subdirectories onto the job queue
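The process above can be sketched in a few lines of Python. This is a simplified, single-threaded illustration and not the tool's actual code: `bulk_import`, `split_into_batches`, `import_batch` and `accept` are hypothetical names, every file is assumed to weigh 1 towards the batch weight, and "grouping" is reduced to one file per importable item.

```python
import os
from collections import deque
from typing import Callable, List


def split_into_batches(items: List[str], batch_weight: int) -> List[List[str]]:
    """Split items into consecutive batches of at most batch_weight items."""
    return [items[i:i + batch_weight] for i in range(0, len(items), batch_weight)]


def bulk_import(source_dir: str,
                import_batch: Callable[[str, List[str]], None],
                accept: Callable[[str], bool] = lambda path: True,
                batch_weight: int = 100) -> int:
    """Walk source_dir one directory at a time and return the number of batches imported."""
    queue = deque([source_dir])          # the job queue, seeded with the source directory
    batches_imported = 0
    while queue:
        directory = queue.popleft()      # a worker pulls a single directory off the queue
        # 1. List the contents of the directory.
        entries = sorted(os.scandir(directory), key=lambda e: e.name)
        # 2. Group entries into importable items (simplified: one file = one item).
        files = [e.path for e in entries if e.is_file()]
        subdirs = [e.path for e in entries if e.is_dir()]
        # 3. Filter importable items using the admin-defined rule.
        items = [path for path in files if accept(path)]
        # 4. Subdivide into batches; an empty directory still gets one (empty) batch.
        batches = split_into_batches(items, batch_weight) or [[]]
        # 5. Import the batches one at a time (serially).
        for batch in batches:
            import_batch(directory, batch)
            batches_imported += 1
        # 6. Place all subdirectories onto the job queue.
        queue.extend(subdirs)
    return batches_imported
```

In the real tool several worker threads drain the queue concurrently; the loop here only shows the order of the steps for a single worker.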

Page 9: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Introduction to the BFSIT
Process details (repeat of the previous slide, with steps 1 to 6 annotated as either “I/O Bound Phases” or “CPU Bound Phases”)

Page 10: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Demo

Page 11: BP-3 Taking Your Bulk Content Ingestions to the Next Level
Page 12: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Performance Analysis: Methodology

Page 13: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Goals and Test Plan
Goals:
•  Benchmark total time taken for bulk imports, using combination of:
   •  Machine environments
   •  Source content sets
   •  Alfresco repository configurations
   •  Bulk import tool configurations

Test Plan:
•  Parallel testing in 2 environments
•  Two runs per test per machine:
   1.  Import into fresh (empty) repository
   2.  Delete target folder then re-import (without restarting Alfresco)
   •  Record average of duration of each run
•  Modify only one configuration parameter at a time, resetting earlier modifications in between
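The two-run procedure could be timed with a small harness along these lines. It is only a sketch: `run_import` and the entries of `prepare_steps` are hypothetical callables standing in for "trigger the import and wait for it to finish" and "reset the repository state before the run".

```python
import time
from statistics import mean
from typing import Callable, Sequence


def average_duration(run_import: Callable[[], None],
                     prepare_steps: Sequence[Callable[[], None]]) -> float:
    """Run the import once per preparation step and return the mean duration in seconds."""
    durations = []
    for prepare in prepare_steps:
        prepare()                      # e.g. start from an empty repository, or delete the target folder
        start = time.perf_counter()
        run_import()                   # hypothetical: start the import and block until it completes
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

Passing two preparation steps (fresh repository, then delete-and-re-import) reproduces the "two runs, record the average" rule for one test on one machine.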

Page 14: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Environments
Environment 1
•  2009 model MacBook Pro
•  2.8GHz dual-core CPU
•  4GB RAM
•  Solid State Drive (Toshiba OEM)
•  64-bit Mac OS X Lion 10.7.1
•  MySQL 5.1
•  Apple JDK 1.6.0_26

Environment 2
•  2006 model ThinkPad T60
•  2.33GHz dual-core CPU
•  3GB RAM
•  Dual hard drives (Seagate, Hitachi)
   •  First used for source directory
   •  Second used for Alfresco repository
•  64-bit Ubuntu Natty Narwhal 11.04
•  MySQL 5.1
•  OpenJDK 1.6.0_22

NOTE 1: Neither of these environments is “production grade”!
NOTE 2: These environments are not directly comparable!

Page 15: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Content Sets

Name                         # Folders  # Files  Total Size  Notes
Typical                      38         4,640    1.44GB
Extreme File Size            1          9        4.41GB
Extreme File Volume          4          11,100   521.7KB
Extreme Directory Structure  1,021      0        0B          100 levels of nesting

Page 16: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Performance Analysis: Repository Tuning Results

Page 17: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Baseline
Notes:
•  Repository tuned as per Day Zero Config Guide
•  BFSIT has default configuration

Observations:
•  Environment 2 is significantly slower at creating cm:folder nodes
   •  Theory: creating cm:folder nodes is “seeky” (more on this later)

Page 18: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Disable User Quotas
Observations:
•  Quota calculation performance proportional to number of cm:content nodes
•  Quota calculation performance not affected by content size

Page 19: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Disable In-txn Indexing
Notes:
•  This configuration is not compatible with Share 3.x!

Observations:
•  Transactional indexing slows Alfresco down a lot, particularly in environment 2
   •  Theory: indexing is highly “seeky”

Page 20: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Disable Indexing Entirely
Notes:
•  This configuration is not compatible with Share 3.x!
•  This configuration functionally cripples Alfresco!

Observations:
•  Some contention between ingestions & indexing (even async)
   •  Theory: SOLR integration in 4.x should provide similar performance

Page 21: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Optimal Repository Configuration
Optimal repository configuration, without functionally crippling Alfresco, is:

•  Disable user quotas:

   system.usages.enabled=false

•  Disable in-transaction indexing:

   index.tracking.disableInTransactionIndexing=true
   alfresco.cluster.name=dummyCluster

   •  Indexing still occurs, just not synchronously in-transaction
   •  Incompatible with Share 3.x, but in-transaction indexing can be disabled temporarily during import, then re-enabled post-import

Page 22: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Optimal Repository Configuration – Results
Notes:
•  This configuration is not compatible with Share 3.x!

Observations:
•  Slower environment (2) benefits more than the faster environment (1)
•  Configuration can’t speed up import of large files
   •  Requires faster storage devices (e.g. RAID 10)

Page 23: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Average speedup of ~40%?

Page 24: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Performance Analysis: BFSIT Tuning Results

Page 25: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Worker Thread Pool Sizes
Notes:
•  Baseline is optimal repository configuration
•  Only the “Typical” content set was used for testing

Observations:
•  Multi-threading is mostly irrelevant
   •  Not surprising, given ingestion is I/O bound
•  Steady improvement in environment 1
   •  Theory: concurrent I/O support in SSD

Page 26: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Batch Weights
Observations:
•  Larger batches = better performance

…HOWEVER…

•  UI responsiveness got worse
   •  A classic trade-off
•  Ultimately, performance similar to baseline (batch weight = 100)

Page 27: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Optimal BFSIT Configuration
Optimal BFSIT configuration:

•  High thread count (mostly irrelevant):

   alfresco-bulk-filesystem-import.threadpool.size.core=48
   alfresco-bulk-filesystem-import.threadpool.size.max=48

•  More importantly, high batch weight:

   alfresco-bulk-filesystem-import.batch.weight=1000

   •  Impacts UI responsiveness
   •  Could reduce if needed, at little cost

Page 28: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Optimal BFSIT Configuration - Results
Observations:
•  Modest improvement over baseline
   •  Implies default BFSIT configuration is close to optimal

Page 29: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Average speedup of ~6.5%?

Page 30: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Rethinking the Problem

What if the BFSIT didn’t have to stream content into the repository at all?

What if the source content was already in the contentstore and only had to be “linked” into the repository?

Page 31: BP-3 Taking Your Bulk Content Ingestions to the Next Level

In-place Import
Notes:
•  Baseline is optimal repository configuration
•  Optimal repository & BFSIT configuration

Observations:
•  Improvement across the board
•  Best improvement is extreme file size case: 375X faster!

Page 32: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Average speedup of ~60%?

Page 33: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Performance Analysis: Conclusions

Page 34: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Conclusions
Results:
•  Minimum improvement of 6%
•  Average improvement of 60%
•  Maximum improvement of 99.7%
•  In absolute terms, saw performance of up to:
   •  16GB / sec
   •  120 nodes / sec

Recall this wasn’t on production hardware!
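The slides do not define "improvement". One reading that is consistent with the figures is "reduction in import duration": under that assumption, the 375X speedup reported for the extreme file size case works out to the 99.7% maximum quoted here. The helper names below are illustrative only.

```python
def improvement_from_speedup(factor: float) -> float:
    """Fraction of the original duration saved when a job runs `factor` times faster."""
    return 1.0 - 1.0 / factor


def speedup_from_improvement(improvement: float) -> float:
    """How many times faster a job runs when its duration shrinks by `improvement`."""
    return 1.0 / (1.0 - improvement)


print(f"{improvement_from_speedup(375):.1%}")   # 99.7%
```

Read the same way, the 6% minimum corresponds to a run that is only about 1.06 times faster.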

Page 35: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Conclusions
Developers:
•  Macro-optimization will always outperform micro-optimization!
•  Multi-threading is not a magic bullet! It’s only helpful if a given operation is CPU bound and can be parallelised.

Administrators:
•  Use the Day Zero Configuration Guide for every install you do!
•  Don’t assume superficially similar environments will perform similarly
•  For bulk ingestions Alfresco is (mostly) I/O bound
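The multi-threading point can be made concrete with Amdahl's law, which the talk does not cite but which captures the same idea: only the parallelisable share of the work benefits from more threads. The fractions below are illustrative, not measurements from these tests, and treating all I/O as strictly serial is a simplification (the SSD results suggest some I/O does overlap).

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales with workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical mostly I/O-bound import: only 10% of the time is parallelisable CPU work.
print(f"{amdahl_speedup(0.10, 48):.2f}")   # 1.11
# Hypothetical CPU-bound job for contrast: 90% of the time is parallelisable.
print(f"{amdahl_speedup(0.90, 48):.2f}")   # 8.42
```

With 48 threads the mostly I/O-bound case gains about 11%, which is in line with the observation that thread count was mostly irrelevant.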

Page 36: BP-3 Taking Your Bulk Content Ingestions to the Next Level

BFSIT Roadmap

Page 37: BP-3 Taking Your Bulk Content Ingestions to the Next Level

BFSIT Roadmap
Official roadmap is on the Google Code project’s wiki [2].

BFSIT v1.1 – Performance:
•  Issue #91: Optimization of directory analysis phase [complete].
•  Issue #8: Multi-threaded imports [complete].
•  Issue #86: In-place imports [complete].
•  Issue #77: Graphical display of throughput.
•  Issue #17: Test various different dimensions to see how they affect performance [complete – this talk!]

BFSIT v1.2 – Alfresco 4.0, Usability & Performance:
•  Issue #92: Test on Alfresco 4.0
•  Issue #26: Integrate into Share's administration console
•  Issue #94: Investigate use of Alfresco's BatchProcessor framework for the multi-threaded importer
•  Issue #96: Measure performance of alternative batching strategies
•  Issue #79: Reimplement the bulk filesystem import as a subsystem
•  Issue #62: Add support for cm:content properties

BFSIT v1.3+:
•  You tell me – I’m always keen to hear feedback!
•  The issues list [3] and mailing list [4] are great ways to start getting involved in the project

Page 38: BP-3 Taking Your Bulk Content Ingestions to the Next Level

References
[1] http://code.google.com/p/alfresco-bulk-filesystem-import/
[2] http://code.google.com/p/alfresco-bulk-filesystem-import/wiki/Roadmap
[3] http://code.google.com/p/alfresco-bulk-filesystem-import/issues/list
[4] http://groups.google.com/group/alfresco-bulk-filesystem-import

Also:
•  http://blogs.alfresco.com/wp/pmonks/2009/10/22/bulk-import-from-a-filesystem/
•  Sessions:
   •  BP-1 – Performance Tuning
   •  BP-6 – Repository Customization Best Practices
   •  BP-9 – Share Customization Best Practices

Page 39: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Questions?

Page 40: BP-3 Taking Your Bulk Content Ingestions to the Next Level

Appendix A – “Typical” Content Set Distributions