
24-28 May 2004, HEPiX Spring Meeting, Edinburgh. Overview of Solaris issues at CERN. By Ignacio Reguero. Presented by Manuel Guijarro, CERN IT-PS-UI.


Overview of Solaris issues at CERN

By Ignacio Reguero. Presented by Manuel Guijarro

CERN IT-PS-UI

Agenda

• Solaris 9 Certification
  – Compilers
  – Open Software
• Quattor (EDG WP4) deployment
• Sun Blade Server tests
  – N1 Management
• LEMON Monitoring on Solaris
• Long term plans for Solaris Support

Solaris 9 Certification (1)

• Certification: formal test of all software used at CERN on the new system
  – In cooperation with software “owners”
  – Using the Refsol9 reference machine
• Not much visible OS change
  – We have been running the OS for about one year
• However, the ASIS and SUE legacy environment is being replaced with Quattor
  – Completely replacing the system management framework
  – Large undertaking
• The certification was foreseen for the beginning of 2004
  – Delayed because the amount of work required to implement Solaris-specific packages and components was underestimated
• It will be launched on 1st June

Solaris 9 Certification (2): Compilers

• Proposed to forum-solaris-certification list
• GCC
  – Default
    • 3.2.3
  – As gcc-alt
    • 2.95.2
    • 3.3.3
  – Open question: whether to make 3.3.3 the default rather than 3.2.3
• Sun Compilers
  – Default
    • Sun ONE Studio 8 (Sun C++ 5.5)
  – Other alternative Sun compilers available in AFS
    • Sun ONE Studio 7 (Sun C++ 5.4)
    • Sun WorkShop 6 update 2 (Sun C++ 5.3)
    • Sun WorkShop 6 update 1 (Sun C++ 5.2) (default in Solaris 8)

Solaris 9 Certification (3): Open Software

• More Open Source software is now distributed and supported by Sun: Perl, Bash, …
• Proposing change of Open Software policy
  – If possible, use software from Sun
    • Setting compatibility links in /usr/local
  – Build software ourselves only if special requirements exist. For instance:
    • DBI and other Perl modules
    • Version requirements, such as GCC
    • Mozilla 1.6 browser (instead of Netscape, recommended by Sun)
  – We will take into account compatibility with Linux when possible
    • Though this is a moving target
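
The "compatibility links in /usr/local" idea can be illustrated with a small sketch. It is not the CERN tooling: the function name, the tool list and the directory layout are assumptions made for the example, and it runs in a scratch directory rather than on the real /usr and /usr/local.

```python
import os
import tempfile

def link_compat(vendor_dir, compat_dir, names):
    """Create compat_dir/<name> -> vendor_dir/<name> symlinks; return the names linked."""
    os.makedirs(compat_dir, exist_ok=True)
    linked = []
    for name in names:
        target = os.path.join(vendor_dir, name)
        link = os.path.join(compat_dir, name)
        if not os.path.exists(target):
            continue  # vendor does not ship this tool: candidate for a local build
        if os.path.lexists(link):
            continue  # never overwrite something already in place
        os.symlink(target, link)
        linked.append(name)
    return linked

# Scratch tree standing in for /usr/bin (vendor) and /usr/local/bin (compat).
root = tempfile.mkdtemp()
vendor = os.path.join(root, "usr", "bin")
compat = os.path.join(root, "usr", "local", "bin")
os.makedirs(vendor)
for tool in ("perl", "bash"):  # pretend the vendor ships these two
    open(os.path.join(vendor, tool), "w").close()

result = link_compat(vendor, compat, ["perl", "bash", "mozilla"])
print(result)
```

Tools missing from the vendor directory (here the hypothetical "mozilla" entry) are simply skipped, which corresponds to the cases where the software still has to be built locally.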

Quattor deployment on Solaris (1)

• Quattor is the Fabric Management toolkit
  – Already manages over 2000 nodes of a Linux farm in the CERN Computing Centre
• Sun funded a visitor at CERN to implement Quattor on Solaris
  – Using Solaris packages rather than RPMs
  – This work has been presented to Sun, HPC Consortium, SC2003
• We plan to use Quattor to manage all Solaris systems from Solaris 9 onwards
  – Including desktop systems
• Behaviour with “unmanaged” software

Quattor deployment on Solaris (2): Summary for System administrators

– Central Configuration DataBase (CDB)
  • Stores all configuration information as well as software packages to be installed
    – Both applications and system
  • A cache manager provided for the client accessing the DB
    – Allows disconnected operation
    – To avoid dependency on the DB server or on the network
  • The configuration database is linked to the network installation server
    – The Jumpstart profile is to be generated from the database
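
A minimal sketch of the two ideas on this slide: a client-side cache that allows disconnected operation, and a Jumpstart profile rendered from the stored configuration. Nothing here is the Quattor API. The function names, the JSON cache format, the node name and the package names are hypothetical, and the profile keywords (install_type, system_type, partitioning, cluster, package) should be checked against the Solaris installation documentation before use.

```python
import json
import os
import tempfile

def fetch_profile(node, fetch, cache_dir):
    """Return the node profile from the server, falling back to the local cache."""
    path = os.path.join(cache_dir, node + ".json")
    try:
        profile = fetch(node)
        with open(path, "w") as f:
            json.dump(profile, f)  # refresh the cache after every successful fetch
        return profile
    except OSError:
        with open(path) as f:      # disconnected operation: last known profile
            return json.load(f)

def jumpstart_profile(profile):
    """Render a Jumpstart-style profile text from a configuration dictionary."""
    lines = [
        "install_type initial_install",
        "system_type " + profile["system_type"],
        "partitioning " + profile["partitioning"],
        "cluster " + profile["cluster"],
    ]
    lines += ["package %s add" % name for name in profile["packages"]]
    return "\n".join(lines) + "\n"

# Stand-ins for the configuration server (hypothetical data).
def server_up(node):
    return {"system_type": "standalone", "partitioning": "default",
            "cluster": "SUNWCuser", "packages": ["EXAMPLEpkg1", "EXAMPLEpkg2"]}

def server_down(node):
    raise ConnectionError("configuration server unreachable")

cache = tempfile.mkdtemp()
online = fetch_profile("sunnode01", server_up, cache)
offline = fetch_profile("sunnode01", server_down, cache)  # served from the cache
text = jumpstart_profile(offline)
print(text)
```

The second call succeeds even though the simulated server is down, which is the point of the cache: the client depends neither on the DB server nor on the network once a profile has been fetched.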

Quattor deployment on Solaris (3): Summary for System administrators

– Node Configuration Manager (NCM)
  • For configuration “components”
  • They have just two actions: “configure” and “unconfigure”
  • They access the Configuration DB through the cache manager
– SPMA software distributor (package level)
  • Replaces ASIS software distribution (file level)
  • For Linux it uses RPMs; for Solaris it is implemented with Solaris PKG
  • Allows packages to be installed from various SW repositories
  • Several protocols supported: HTTP, file system (AFS), FTP, etc.
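
The component model described above (one unit per configured service, with a "configure" and an "unconfigure" action) can be sketched as follows. This is not the real NCM interface: the class names, the configuration dictionary and the message-of-the-day example are assumptions chosen only to show the shape of the idea.

```python
import os
import tempfile
from abc import ABC, abstractmethod

class Component(ABC):
    """Hypothetical base class: every component offers exactly two actions."""

    @abstractmethod
    def configure(self, config):
        """Bring the managed service in line with the given configuration."""

    @abstractmethod
    def unconfigure(self, config):
        """Remove what configure() put in place."""

class MotdComponent(Component):
    """Example component managing a single message-of-the-day file."""

    def __init__(self, path):
        self.path = path

    def configure(self, config):
        with open(self.path, "w") as f:
            f.write(config["motd"] + "\n")

    def unconfigure(self, config):
        if os.path.exists(self.path):
            os.remove(self.path)

# The dictionary stands in for data read through the cache manager.
config = {"motd": "Managed by Quattor"}
path = os.path.join(tempfile.mkdtemp(), "motd")

component = MotdComponent(path)
component.configure(config)
with open(path) as f:
    configured_text = f.read()
component.unconfigure(config)
print(repr(configured_text), os.path.exists(path))
```

Keeping the interface this small means a dispatcher can treat all components uniformly, whatever service each one manages.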

Quattor deployment on Solaris (5): Ongoing work

• Implementation of Solaris NCM Components from existing SUE features
  – First priority for server configurations in the computing centre
    • Mostly done
    • Except LSF
  – Validating the whole lot
• Graphical User interface
  – For delegation
  – In machines outside of the computing centre
  – We had a “proof of concept” prototype
  – Now working on a more general interface
    • In close cooperation with Quattor project as it touches CDB Access Control

Sun Blade Server tests (1)

• Sun Blade server 1600
  – Packaged farm
  – Fits in 3 units of a 19” rack
  – SSC Controller with gigabit switch that manages up to 16 CPUs
    • Several Gigabit Ethernet external connections
    • VLAN with 16 Gigabit Ethernet interfaces
    • Protection against attacks through Packet Filter configuration
    • Console through Serial Port for each Blade
• 12 x 650MHz UltraSPARC-IIe
• 4 “Intel Compatible” CPUs
  – AMD Athlon XP-M 1.2GHz
• Other Specialized Blades supported on hardware level
  – SSL Encryptor
  – Load Balancer

Sun Blade Server 1600

Sun Blade Server 1600 system chassis

[Diagram labels: SSC0 (active) and SSC1 (standby), each with a Switch Fabric serving slots 0 to 15; External Switch; interfaces ce0 and ce1 with 137.138.x.x addresses; Blades 0 to 15]

Sun Blade Server tests (2)

• Implemented fully automated network installation (DHCP) using Jumpstart from SUNINST0

• Nodes have been used for development of Quattor on Solaris 9

• But main interest is to test N1 management

N1 Management (1)

• Sun N1 Provisioning server 3.0 Blade Edition
  – Automates configuration and deployment of different kinds of blades using system images
    • Assignment may vary according to a schedule or other input: dynamic management of clusters
  – Images can also be used to deliver applications and data
• Our interest is to compare N1 with EDG WP4 Quattor functionality
• Question: could N1 manage heterogeneous farms outside the Blade server scope?

N1 Management (2)

• Also case study proposed by SAS(DB) group
• Test case proposed: to use BS1600 machines for Oracle Application Server
  – To use the new version of OAS that will facilitate dynamic allocation of nodes

N1 Management (3)

• Long list of technical problems found
  – Had to install Upgrade 1 of the Sun N1 Provisioning server 3.0 Blade Edition to avoid bugs of the initial release
  – Requires at least one dedicated server, which was not foreseen
    • “Control Plane Server” + “Image Server”
    • External to the blade server
  – We have to go back to Solaris 8
  – Local Oracle 8 installation required
    • Even though a public Oracle DB service exists
    • Running Oracle 9
  – Precise model of Gigabit Ethernet card required
    • For VLAN support
    • Gigabit interface of V210 and V240 not supported
    • Had to acquire Syskonnect Gigabit card
  – Precise network switch models required
    • Fortunately not required for single SB1600 configuration

N1 Management (4)

• More technical problems
  – The N1 installation hung after updating the database parameters…
    • Resource layers subnets and VLANs, Control Center Application server, Blade system chassis
    • Normally, this install should take around one and a half hours
• Sun later told us that N1 installation has to be done by Sun Professional Services
  – You are supposed to pay
  – A large part of the documentation seems to be internal to Sun only

N1 Management (5)

• Conclusion for Sun N1 Provisioning server 3.0 Blade Edition
  – The product is complex and not well finished
  – You are only supposed to use it through the Sun Professional Services organisation
  – System images for system management are only interesting if you have a large number of identical nodes (we do not)
    • The Jumpstart model fits our needs better as HW/SW differences are solved by the Sun Installer
  – Confirmed that no support for nodes outside of Blade Server is foreseen
• Agreed with our DB colleagues that we are not interested in this product

N1 Management (6)

• On the other hand, attended a demo of Sun N1 Service Provisioning System 4.1
• Totally different product
  – Looks good technically
  – Works at a higher level, with a scope similar to Quattor
  – Nice GUI
  – Supports several types of package objects, including user-defined ones
  – Supports RH Linux and Windows as well as Solaris
• DB and we agreed on the interest of having a look, at least to compare to Quattor
• But “Sociological” Problems
  – Sun tries to sell it with per-node fees under a Professional Services model
  – Sun has not been able/willing to give us access to the product for over two months

N1 Management (7)

Currently
• Studying the implementation of the Oracle Application Server with Quattor packages and components
  – If we get the Sun product we will compare it with Quattor
  – Otherwise we will go ahead with Quattor

LEMON Monitoring on Solaris (1)

• Migrating from UIMON to Lemon MSA
  – Work done by Piotr Kolet (Fellow in IT/PS/UI)
• To align with IT/FIO developments
  – Use the Computing Centre Infrastructure
  – Achieve Linux and Solaris data integration
• Have to implement missing parts
  – Recovery Action
  – Solaris Specific metrics
• Targeting production by this summer

LEMON Monitoring on Solaris (2)

• Porting MSA to Solaris – already done
• Porting internal sensor – several bugs fixed
• Porting Linux metrics to Solaris – routines with strong OS dependencies
  – Several metrics still have to be rewritten or fixed
• Already sending data to central Oracle repository (metric numbers and names have to be the same for all platforms)
• Results can be viewed on Lemon Status Page (http://lemon-status.web.cern.ch)
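
The constraint that metric numbers and names must be identical on all platforms, while the sampling routines are OS dependent, might look like the sketch below. It is not Lemon sensor code: the metric number 1001, the metric name and the function names are invented for the illustration.

```python
import os
import sys

# Hypothetical metric number and name, shared by every platform.
METRICS = {1001: "LoadAvg1Min"}

def load_from_proc():
    """Linux-style sampling: read the 1-minute load from /proc/loadavg."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

def load_from_libc():
    """Portable Unix sampling via getloadavg(); no /proc/loadavg needed."""
    return os.getloadavg()[0]

# Only the sampling routine differs per platform; the metric id does not.
SAMPLERS = {
    "linux": {1001: load_from_proc},
    "other": {1001: load_from_libc},
}

def sample(metric_id):
    """Return a record whose id and name are the same on every platform."""
    key = "linux" if sys.platform.startswith("linux") else "other"
    value = SAMPLERS[key][metric_id]()
    return {"id": metric_id, "name": METRICS[metric_id], "value": value}

record = sample(1001)
print(record)
```

Because the repository only ever sees the shared id and name, Linux and Solaris samples can be stored and displayed together.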

LEMON Monitoring on Solaris (3)

• Recovery actions framework (to be done)
  – Part of CmDaemon framework
• Subset of UIMON features need to be implemented
  – Notification granularity
  – Active and monitoring time customizing
  – Smart Recovery Action launch (specific number of times, execution timeout, avoiding concurrently running)
• Recovery decision made based on CMDaemon correlation unit
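
The three "smart launch" requirements listed above (a limited number of runs, an execution timeout, and no concurrent runs of the same action) can be sketched in a few lines. This is an illustration only, not the CmDaemon framework: the class name, the status strings and the trivial command are assumptions.

```python
import subprocess
import sys
import threading

class RecoveryAction:
    """Hypothetical launcher: bounded runs, execution timeout, no overlap."""

    def __init__(self, command, max_runs, timeout):
        self.command = command
        self.max_runs = max_runs
        self.timeout = timeout
        self.runs = 0
        self._lock = threading.Lock()

    def launch(self):
        """Return 'ok', 'failed', 'timeout', 'busy' or 'exhausted'."""
        if not self._lock.acquire(blocking=False):
            return "busy"  # a previous launch of this action is still running
        try:
            if self.runs >= self.max_runs:
                return "exhausted"  # launched the allowed number of times
            self.runs += 1
            try:
                done = subprocess.run(self.command, timeout=self.timeout)
            except subprocess.TimeoutExpired:
                return "timeout"  # subprocess.run kills the child on timeout
            return "ok" if done.returncode == 0 else "failed"
        finally:
            self._lock.release()

# A command that trivially succeeds stands in for a real recovery script.
action = RecoveryAction([sys.executable, "-c", "pass"], max_runs=2, timeout=5)
results = [action.launch() for _ in range(3)]
print(results)
```

The third launch is refused because the action has already run the configured two times; a correlation unit would decide when, if ever, to reset that counter.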

Long term plans for Solaris Support (1)

• Up to now second platform for LHC physics
  – For validation purposes only
  – SUNDEV facility for physics development
• Total population of 663 active nodes
  – Data from network database

Long term plans for Solaris Support (2)

• Current Main Users
  – Accelerator Sector (including LHC construction) (60+ nodes)
  – AS (AIS) + Oracle DB servers in general (70+ nodes)
  – CMS (150+ nodes)
  – AFS (60+ nodes)
  – CAE + PH/MIC (Electronics development) (130+ nodes)
  – Network monitoring (Spectrum SW (Nick Trikoupis)) (4 nodes)
  – Remedy (2 nodes)
  – EST Survey Group (8 nodes)
  – (Old) Mail Servers + Listbox (4 nodes)
  – SUNDEV (Physics) (10 nodes)
  – SUNPARC (Engineering) (8 nodes)
  – LICMAN (License Servers) (8 nodes)
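
As a quick cross-check of the figures, the per-group counts can be summed and compared with the 663 active nodes quoted from the network database. Counts marked "+" on the slide are treated as lower bounds, so the result is only a minimum; the shortened dictionary keys are labels made up for this tally.

```python
# Node counts as listed on the slide; "N+" entries are taken as N (lower bound).
groups = {
    "Accelerator Sector": 60,
    "AS (AIS) + Oracle DB": 70,
    "CMS": 150,
    "AFS": 60,
    "CAE + PH/MIC": 130,
    "Network monitoring": 4,
    "Remedy": 2,
    "EST Survey Group": 8,
    "Mail Servers + Listbox": 4,
    "SUNDEV": 10,
    "SUNPARC": 8,
    "LICMAN": 8,
}

total_active = 663               # total population from the network database
listed = sum(groups.values())    # nodes accounted for by the listed groups
unlisted = total_active - listed # upper bound on nodes outside these groups

print(listed, unlisted)
```

The listed groups therefore account for at least 514 of the 663 active nodes, leaving at most 149 nodes outside the main user groups.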

Long term plans for Solaris Support (3)

• Critical services, including most DB servers and electronics design, are being run on Solaris
• However, most physics is done on Linux PCs
  – And it seems that the interest of the physics community in Solaris is diminishing
    • Problems with C++ support
    • No interesting Sun desktops
    • Uncertain future of the company
  – The fashionable platform is Apple Mac
    • Nice Laptops
• In the IT POW action item on “SUNDEV Reduction”
  – Ongoing discussion with Les Robertson (LCG)

Long term plans for Solaris Support (4)

• A likely scenario coming out of the discussion would be
• A downsizing of SUNDEV
  – Using a reduced number of nodes or smaller nodes
  – Recycling the current nodes for DB serving
• A downsizing of Solaris Support
  – By defining a Service Level Agreement with a more precise scope
  – For instance
    • Only support installation server and Quattor automated management
    • Regular calls to be handled exclusively by the desktop contract and/or directly by Sun
  – In order to free one FTE

SUNDEV

Questions?

Unix Infrastructure section: http://cern.ch/product-support/UI

[email protected]@cern.ch