Transcript

GAIA Tech1 Data Repositories Meeting

Ingrid Bàrcena, HPC and Storage services manager

Ricard de la Vega, Portals and Repositories manager

GAIA Tech1 meeting

Madrid May 24 2011

Outline

1. What is CESCA?

2. CESCA services

▪ HPC and Storage

▪ Network

▪ University e-Administration

▪ Portals and Repositories

3. Digital Repositories

▪ Overview

▪ Two examples: DSpace and web archiving

▪ Long term preservation

4. CESCA and GAIA

▪ What is done

▪ What could be done

Centre de Supercomputació de Catalunya

▪ Patrons:

• Generalitat de Catalunya

• Fundació Catalana per a la Recerca i la Innovació

• Universitat de Barcelona

• Universitat Autònoma de Barcelona

• Universitat Politècnica de Catalunya

• Universitat Pompeu Fabra

• Universitat de Girona

• Universitat Rovira i Virgili

• Universitat de Lleida

• Universitat Oberta de Catalunya

• Universitat Ramon Llull

• Consell Superior d’Investigacions Científiques

▪ Public Consortium created in 1991

▪ ICTS since 2000

Our Services

HPC and Storage

19.48 Tflop/s peak performance

50 research projects (203 users)

Main areas:

• Materials Science (31%)

• Life Science (32%)

• Environmental Science (28%)

• Astronomy and Astrophysics (5%)

+ 3.5 HC used during 2010

+ 50 scientific applications available

Disk Library

NetApp FAS3170

150 TB

21 TB FC drives

126 TB SATA drives

6 Pharma Labs

10 Academic research groups

HPC Service

Storage Service

Tape Library

ADIC i2000

156 TB

6 LTO-4 drives

300 slots

NetBackup 6.5

2 Software Packages

Drug Design Service

Network services

+80 connected institutions

2 core nodes at 10 Gbps

Flexible bandwidth

Services: IPv6, multimedia, Remote Access Service, Voice over IP, Eduroam, Security...

21 institutions in Catalonia

40 countries

24 ISP and operators

Services: Multicast, IPv6, NTP Server, F root server (A and J, .com and .net coming soon)...

University e-Administration Projects

e-Register

• URV: production 02-01-11

• UdL: production 03-14-11

• Sadiel: 32,692 €

e-Vote

• Bid price: 405,000 €

• Awarded (03-18-10): Scytl, 345,000 €

• Production: 02-01-11

SCD (e-Identity and e-Signature)

• Available:

EC-UR and EC-URV

ER-CESCA, -URV, -UPC, -UdL, -UPF

• In development: ER-UdG, ER-UB, ER-UAB and ER-UVic

GPI

Improvements (02-03-11)

• Inteum Sentinel and Technology Publisher

• Office 2007; separate VMs per university; e-mail sending

Licence renewal: UB and UPC

Investment: 1,046.97 €

e-Archive

• Transfer agreement: 12-7-10

• Inst. ATLAS: 17,800 €

• Integr. Doc. Mgt: Award: IECI 51,920 € (02-12-11)

• Production: 06-01-11

Cluster: 15 BL460c G6 (2 x Intel Xeon E5530 QC); 480 GB; 4.3 TB; Citrix XenServer; 2 F5 BIG-IP 1600 load balancers; 110,487 €

Data layer

F5 BIG-IP load balancers

31-03-11

Portals and Repositories

Since 2001

18 universities

10,577 doctoral theses

www.tdx.cat

Since 2005

22 institutions

24,564 research papers, eprints…

www.recercat.cat

Since 2006

328 journals

129,235 articles

www.raco.cat

Since 2009

10 universities

1,814 learning objects

www.mdx.cat

Since 2006

39,587 websites crawled

118,039 versions crawled

249M files in 7.5 TB

www.padicat.cat

Since 2010

22 institutions

24,564 research papers, eprints…

www.recercat.cat

Since 2006

Turnkey development

Evolutionary maintenance

http://recyt.fecyt.es

Pilot 2009-10

420 websites crawled

790 versions crawled

http://recyt.fecyt.es

(restricted IP address)

Digital Repositories

▪ A repository captures, stores, indexes, preserves and distributes digital content.

▪ Data + Metadata

• Dublin Core (DC)

• METS, MODS, MARC21…

• VO?

• Astronomical?

▪ Main issues

• Access (search / browse)

• Preservation

• Interoperability

– Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), based on Dublin Core metadata
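As a rough illustration of the interoperability point, an OAI-PMH harvester sends an HTTP request carrying a `verb` and a `metadataPrefix`, then reads Dublin Core fields from the XML response. The sketch below builds such a request URL and parses an embedded sample response. The endpoint URL, the helper names and the record contents are all hypothetical, invented for illustration only.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH 2.0 responses carrying Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Return one dict per record: identifier, titles and creators."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "titles": [e.text for e in rec.iterfind(".//dc:title", NS)],
            "creators": [e.text for e in rec.iterfind(".//dc:creator", NS)],
        })
    return records


# Hypothetical sample response (not real repository data).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1234</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example thesis</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://repository.example.org/oai/request"))
    print(parse_records(SAMPLE))
```

A real harvester would also need to follow `resumptionToken` elements to page through large result sets, which this sketch leaves out.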

Repositories taxonomy

Towards a European e-Infrastructure for e-Science Digital Repositories. 7th e-Concentration Meeting, Brussels, 12-14th October, 2009

Repositories Hardware

▪ High availability

▪ Load balancing

▪ Easy scalability

▪ 24x7 monitoring

Diagram layers: Balancers, Services, Data, Storage Area Network (Disk, Tape)

Repositories Software

▪ For general purpose

• DSpace, EPrints, Fedora, Islandora…

• Implemented in

▪ For journal management

• Open Journal Systems (OJS)

• Implemented in

▪ For web archives preservation

• Heritrix, NutchWAX, WERA, Wayback, Webcurator…

• Implemented in

Example of a general-purpose repository (DSpace)

▪ For digital objects, like PDF, images, videos, data…

▪ Indexes metadata and PDF content for searching

Example of a web archive (PADICAT)

▪ PADICAT collects, processes and provides permanent access to the entire cultural, scientific and general output of Catalonia in digital format. It is the archive of Catalan websites.

Archive            PANDORA        UK ARCHIVE     IA             VEFSAFN        BNF            Kulturarw3     Netarchive
Scope              Australia      UK             World          Iceland        France         Sweden         Denmark
Begin              1996           2004           1996           2004           2002           1996           2005
Open access        ?              ?              ?              ? since 2009   ?              ?              ?
Search by URL      ?              ?              ?              ?              ?              ?              ?
Search by keyword  ?              ?              ?              ?              ?              ?              ?
Directory          ?              ?              ?              ?              ?              ?              ?
No. of websites    26,630         8,308          -              -              -              -              > 1.1 million
No. of crawls      60,276         32,618         150 billion    -              -              -              4.5 billion
Space              4.63 TB        7.59 TB        -              -              180 TB         -              155 TB
Date               16-12-2010     12-01-2011     13-12-2011     13-01-2011     13-01-2011     26-11-2010     08-2010

(? = yes/no mark not preserved in this transcript)

- Open Access

- Search by URL and keyword

- Catalogue and thematic directory

www.padicat.cat

Since 2006

- 39,587 websites crawled

- 118,039 versions crawled

- 249M files in 7.5 TB

Web archive software architecture

1. Harvest: HERITRIX writes ARC files

2. Index and search:

• ARCINDEXER builds the index for URL searching, browsed with WAYBACK

• HADOOP + NUTCHWAX build the index for keyword searching, browsed with WERA

3. Catalogue and browse: WEB CURATOR TOOL with the CATALOG DATABASE (crawl metadata)

PADICAT’s indexes

▪ Until now (< 100,000 website versions crawled)

• For search by URL (like Internet Archive)

– Index with ArcIndexer (~100 GB) + visualize with Wayback √

• For search by keyword

– Index with Hadoop+NutchWAX + visualize with WERA √

▪ Now (120,000 website versions crawled)

• Performance problems for keyword indexing

• Two solutions under evaluation:

– Index with a new version of NutchWAX + visualize with TNH (The New Hotness, from IA)

– Index with JB (James Brown, from IA) + visualize with TNH
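To make the difference between the two index types concrete, here is a toy in-memory sketch. It is not how ArcIndexer, NutchWAX or Wayback work internally; the function names and the sample captures are hypothetical. The URL index maps each URL to its capture timestamps, while the keyword index is an inverted index from words to the captures containing them.

```python
import re
from collections import defaultdict


def build_indexes(captures):
    """captures: iterable of (url, timestamp, text), one per crawled version."""
    url_index = defaultdict(list)     # url -> sorted list of capture timestamps
    keyword_index = defaultdict(set)  # word -> set of (url, timestamp)
    for url, timestamp, text in captures:
        url_index[url].append(timestamp)
        for word in set(re.findall(r"\w+", text.lower())):
            keyword_index[word].add((url, timestamp))
    for stamps in url_index.values():
        stamps.sort()
    return url_index, keyword_index


def search_keywords(keyword_index, query):
    """Return the captures that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    hits = [keyword_index.get(word, set()) for word in words]
    return set.intersection(*hits)


# Hypothetical captures, invented for illustration.
CAPTURES = [
    ("http://example.cat/", "20100301", "Welcome to the research page"),
    ("http://example.cat/", "20090105", "Page under construction"),
    ("http://example.org/news", "20100412", "Research and innovation"),
]

if __name__ == "__main__":
    url_index, keyword_index = build_indexes(CAPTURES)
    print(url_index["http://example.cat/"])
    print(search_keywords(keyword_index, "research page"))
```

The URL index holds one small entry per capture, whereas the keyword index grows with the amount of text crawled, which is one plausible reason keyword indexing is the part that runs into performance problems first.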

Long term preservation

▪ The e-infrastructure must ensure long-term access to the data, without failure.

▪ To succeed, the following must be taken into account:

• Replication (more than one copy)

• Media refresh

• Format migration

• Data integrity (checksums)

• Contingency and recovery plan

• Preservation plan

• ...
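The data integrity item in the list above can be sketched as a periodic fixity check: record a checksum for every file once, then recompute later and report anything missing or changed. This is a minimal sketch, not the mechanism of any particular repository product; the choice of SHA-256 and the function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return the paths that are missing or whose digest no longer matches."""
    root = Path(root)
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel_path)
    return damaged
```

A typical use would be to store the manifest away from the data it describes, run `verify` on a schedule, and restore any reported file from another replica.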

An example of long term preservation

The “preservation history” of TDX (doctoral theses)…

▪ 2001 – 80 GB, 8,000 access hits

• SW: ETDdb (+ MySQL, Glimpse…) from Virginia Tech

• HW: HP V2500 with 16 processors, 4 GB memory, 227 GB disk

• HW: StorageTek TimberWolf 9740 with 2.7 TB of 9840 tapes

Born in a supercomputer!

▪ Hardware migrations

• 2003 (cpu + disk)

– HP rp5430 with 2 processors, 704 GB memory

– HP EVA V.2 with 2.8 TB disk

• 2006 (cpu + tape)

– High availability HP cluster with 32 ProLiant DL360 nodes

– ADIC Scalar i2000 (from 9840 tapes to LTO-3 tapes)

• 2009 (disk)

– NetApp FAS3170 with 60 TB disk

▪ Software migrations

• 2010 – DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs

▪ Replication

• On disk - Online version (1)

• One backup on the tape library (2)

• Another backup in a fireproof cabinet (3)

• Another backup at a remote centre 50 km away (4)

• A dark copy in the MetaArchive Cooperative

– Private LOCKSS (Lots of Copies Keep Stuff Safe) Network

– 10 more copies around the world (14)

▪ Data Integrity

• Checksums on DSpace (online version)

• Checksums on LOCKSS (dark copies)
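Replication and checksums work together: when several copies exist, comparing their digests shows which copy has drifted. The sketch below uses a simple majority vote across replicas. It is a deliberately simplified illustration, not the actual LOCKSS polling protocol, and the replica names and digest values are hypothetical.

```python
from collections import Counter


def find_divergent_replicas(digests):
    """digests: dict mapping replica name -> checksum of the same object.

    Returns (majority_digest, sorted names of replicas that disagree).
    Raises ValueError when no digest is held by a strict majority.
    """
    if not digests:
        raise ValueError("no replicas given")
    counts = Counter(digests.values())
    winner, votes = counts.most_common(1)[0]
    if votes <= len(digests) / 2:
        raise ValueError("no majority among replicas")
    divergent = sorted(name for name, d in digests.items() if d != winner)
    return winner, divergent


if __name__ == "__main__":
    # Hypothetical digests for one object held in four places.
    replicas = {"online": "aa11", "tape": "aa11", "cabinet": "aa11", "remote": "ff00"}
    print(find_divergent_replicas(replicas))
```

With only two copies a disagreement cannot be resolved this way, which is one argument for keeping more than the minimum number of replicas.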

▪ 2011 – 300 GB, more than 3.5 million access hits

• SW: DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs

• HW: High availability HP cluster with 32 ProLiant DL360 nodes

• HW: NetApp FAS3170 with 60 TB disk

• HW: ADIC Scalar i2000

• SW: LOCKSS (+ Conspectus...)

• HW: HP DL380 (LOCKSS cache)

▪ xxxx – …

www.tdx.cat

GAIA at CESCA: what is done

Timeline diagram, 2000 to 2011. Labels: Data Processing, IDT/IDU, Storage, Database, GDASS/COG, Backup, Data processing, Database

GAIA and CESCA: what could be done

Preservation: dark copy, …

Data Repository

Large data transfer

Powerful searches and interoperability

Storage and Backup

Thank you!

Questions?

ibarcena@cesca.cat

rdelavega@cesca.cat
