Security aspects of the WLCG infrastructure: clients and services Maarten Litmaath CERN
Jan 18, 2018
Outline
• How it all should work
• Proxies
• Incoherence
• Security model examples
• Banning
• Argus
• Site authorization
• Pilot jobs
• Virtual machines and clouds
• Data security
• Other services
• SSO, identity providers
• Vulnerability aspects
HEPiX 2009-10-29, LBNL 2
This list probably is incomplete…
How it all should work (1)
• Users and services have digital certificates signed by trusted certificate authorities (CAs)
  – Certificate lifetime usually is 1 year
• Users are members of virtual organizations (VOs)
  – WLCG: alice, atlas, cms, lhcb, dteam, ops, …
  – Users need to re-sign AUP every year
  – Sites decide which VOs to support at which QoS
• Services are rarely made members of a VO
  – It would be desirable to some extent
    • A service could prove that it is trusted by the VO
    • Now: rely on information system + filtering
How it all should work (2)
• Users create short-lived proxies for grid access
• Long-lived proxies are only found on MyProxy servers
• Proxies are delegated to services as needed
  – Some services can retrieve or renew proxies via MyProxy
• Services interpret proxies consistently
  – The same criteria are used by different services
  – User jobs and data are protected as needed
• Services log security-related information consistently
• Users can easily be banned as needed
Where we want to be
Where we are
Proxies (1)
• Plain grid proxy
  – Usage: grid-proxy-init
  – Mapping can only be based on the DN
  – DNs in grid-mapfile harvested from VOMS servers
    • Different subsets can be mapped differently
• VOMS proxy
  – Usage:
    • voms-proxy-init --voms vo
    • voms-proxy-init --voms vo:/vo/group
    • voms-proxy-init --voms vo:/vo/group/Role=role
  – Plain grid proxy + set of attributes signed by VOMS server
  – Attributes: groups and/or roles
  – Mapping can be based on attributes and/or the DN
    • Attributes usually preferred
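The attribute-first mapping described on this slide can be sketched as follows. This is a minimal illustration and not the code of any grid service: the tables `FQAN_MAP` and `DN_MAP`, the account names and the example DN are all hypothetical.

```python
from typing import List, Optional

# Hypothetical mapping tables; real sites keep these in their own config files.
FQAN_MAP = {
    "/atlas/Role=production": "atlasprd",
    "/atlas": "atlas",
}
DN_MAP = {
    "/DC=ch/DC=cern/OU=Users/CN=Some User": "atlas001",
}


def map_proxy(dn: str, fqans: List[str]) -> Optional[str]:
    """Return the local account for a proxy, or None if nothing matches."""
    # VOMS attributes are tried first, in the order the proxy lists them.
    for fqan in fqans:
        if fqan in FQAN_MAP:
            return FQAN_MAP[fqan]
    # A plain grid proxy carries no attributes, so only the DN can be used.
    return DN_MAP.get(dn)
```

A VOMS proxy with a matching attribute is mapped by that attribute; a plain proxy from the same user falls back to the DN entry, which is how two services can end up treating the same person differently.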
Proxies (2)
• Proxy lifetime should be “short”
  – Cf. AFS/Kerberos token lifetime
  – Default 12 hours, 24 hours probably OK
  – Current practice: LHC experiments use multi-day proxies to avoid potential problems with proxy renewal
    • CMS use 8-day proxies!
• Long job needs proxy to be renewed before it expires
• Long-lived proxies can be stored on a MyProxy server
  – Trusted services can retrieve or renew short-lived proxies
• MyProxy server currently is a single point of failure
  – RFE: upload proxies to multiple servers, try all of them for downloading proxies as needed
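The "try all of them" part of the RFE could look roughly like the sketch below. It does not use any MyProxy client API: the `fetch` callable is a hypothetical stand-in for whatever actually retrieves a proxy from one server, and the host names in the usage note are placeholders.

```python
from typing import Callable, Iterable, Optional


def fetch_from_any(servers: Iterable[str],
                   fetch: Callable[[str], bytes]) -> bytes:
    """Try each server in turn and return the first proxy obtained."""
    last_error: Optional[Exception] = None
    for server in servers:
        try:
            return fetch(server)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc      # remember the failure and try the next server
    # Every server failed (or the list was empty).
    raise RuntimeError("no MyProxy server could be reached") from last_error
```

For example, `fetch_from_any(["myproxy1.example.org", "myproxy2.example.org"], fetch)` would succeed as long as one of the two hosts answers, which removes the single point of failure on the download side.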
Incoherence
• Different services treat proxies differently
  – Libraries
  – Mapping
    • Plain proxies
    • VOMS proxies
  – Logging
  – Banning
    • Not possible on certain services!
  – Testing/debugging/forensics tools
    • Available for some scenarios on some services
• Try finding two gLite services with the same security model!
  – OSG, ARC?
Security model examples
• LCG Computing Element
  – VOMS mapping with fallback on plain proxy mapping
• CREAM Computing Element
  – VOMS only
• OSG Computing Element
  – GUMS: VOMS, DN
• Disk Pool Manager
  – Virtual IDs
  – VOMS mapping and plain proxy mapping
• dCache
  – gPlazma: GUMS, vo-role-map, …
• Workload Management System
  – VOMS authZ by 2 different libraries: GridSite, LCMAPS
    • But Condor-G engine only looks at the DN!
Banning
• OSG have SAZ and GUMS, ARC have Charon
• EGEE/gLite: LCAS library and SCAS/Argus services have banning plugins
  – Easy to ban a DN
  – LCG-CE, CREAM-CE, WMS
• DPM/LFC virtual ID table will get banning flags
  – Currently only plain proxies can be fully banned
    • By mapping them to non-existent accounts/VOs
  – VOMS proxies can be banned only from creating new files
• Argus should make this consistent and easy
  – Also can import a grid-wide ban list
Argus
• Argus is the long-term gLite authorization framework
• It should give all gLite services a consistent authZ model
• It allows for authZ decisions to be taken centrally per site
  – A single place to pull the plug
• It can import remote policies
  – Regional, national, project-based, …
  – Give priority to local/national/… users
  – Banning of DNs, e.g. grid-wide
• Policies can affect QoS for DNs or VOMS attributes
  – Preferences
  – Banning
• Argus will be introduced gradually
  – It can coexist with legacy services
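The idea of one central decision point that combines local and imported policies can be illustrated with a small sketch. It is not the Argus policy format or API: the `Policy` class, its fields and the "a ban from any source wins" rule are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Policy:
    """One policy source, e.g. the site itself or an imported grid-wide feed."""
    name: str
    banned_dns: Set[str] = field(default_factory=set)
    permitted_fqans: Set[str] = field(default_factory=set)


def authorize(dn: str, fqans: List[str], policies: List[Policy]) -> bool:
    """Deny if any policy bans the DN; otherwise permit on a matching attribute."""
    # A ban from any source (local or remote) takes precedence over permits.
    if any(dn in p.banned_dns for p in policies):
        return False
    # Without a ban, one matching VOMS attribute in any policy is enough.
    return any(f in p.permitted_fqans for p in policies for f in fqans)
```

Because every service at the site would ask the same function, adding a DN to one ban list is the "single place to pull the plug".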
Site authorization
• EGEE
  – SCAS
    • Released to production early July for glexec on the WN
    • Only deployed on the few sites that helped debugging glexec and its use by ATLAS and LHCb
  – Argus
    • In certification
• OSG
  – GUMS
  – SAZ
• ARC
  – Charon
  – Argus support foreseen
Pilot jobs (1)
• A pilot job checks and prepares the worker node environment for a real job, i.e. a task that it downloads from a central task queue
  – Late binding leads to good efficiency
• A multi-user pilot job can pick up a task from any user in the VO
• The task should run with its own associated proxy
  – Access services, store data etc. with the correct identity
• It should run under an account corresponding to that proxy
  – Separate users as the CE head node would have done
  – Protect the pilot proxy against malicious payloads
• A setuid root utility is needed to switch to the correct identity
  – Like “sudo” or Apache “suexec”: gLExec
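The late-binding flow of a multi-user pilot can be sketched as below. None of this is taken from an actual pilot framework: `Task`, `node_ok` and `run_as` are hypothetical names, and `run_as` merely stands in for an identity-switching utility such as gLExec without reproducing its interface.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Optional


@dataclass
class Task:
    owner_dn: str   # identity the payload must run as
    command: str    # what to run


def run_pilot(queue: Deque[Task],
              node_ok: Callable[[], bool],
              run_as: Callable[[str, str], int]) -> Optional[int]:
    """Check the node, then pull one task and run it under its owner's identity."""
    if not node_ok():
        return None          # broken node: no real job is wasted on it
    if not queue:
        return None          # nothing to do
    task = queue.popleft()   # late binding: the task is chosen only now
    # The payload runs as its owner, not as the pilot, so it cannot
    # reach the pilot proxy or another user's files.
    return run_as(task.owner_dn, task.command)
```

The task is taken from the queue only after the node check has passed, which is where the efficiency gain of late binding comes from.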
Pilot jobs (2)
• Each experiment has a pilot job framework
  – ALICE: AliEn
  – ATLAS: PanDA
  – CMS: glideinWMS, only used on OSG
  – LHCb: DIRAC
• All examined by GDB Pilot Job Frameworks Review group
• Current usage
  – Production managers run VO workload for many/all users
  – Individual users may be able to run their own jobs
• Foreseen usage
  – Pilot jobs use glexec to run payload under user account
• Problem: we have no production experience with glexec and there is little time left before the LHC starts
Virtual machines and clouds
• Running each job in its own VM is desirable
  – Reduce security interference between jobs
    • Shared software area and shared services remain
  – Local files left behind can be cleaned up completely
  – Implemented at some sites and becoming more popular
• Shared SW area not needed when SW included in the image
  – Avoids Trojan horses and bottleneck
• Complete images also are a natural fit for clouds
• Some sites are experimenting with clouds
Data security (1)
• Fine-grained security policies for data access are possible in principle
• In practice there are only 2 levels of security today
  – Production managers are responsible for the vast majority of a VO’s data volume (99%)
  – Only they have write access to specific resources used in managing production data
    • Reserved sub-trees in the catalog name space
    • Reserved disk pools and tape access
  – All the remaining resources are group-writable
    • By default writable for the whole VO!
    • Different groups in a VO can be shielded from each other
      – If they are mapped differently
      – This may require site admin intervention
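The two-level model above amounts to a very small write-access rule, sketched here. The catalog prefix in `RESERVED_PREFIXES` and the function name `may_write` are hypothetical; a real storage element enforces this through account mapping and ACLs rather than a path check.

```python
# Hypothetical reserved sub-tree of the catalog name space.
RESERVED_PREFIXES = ("/grid/vo/production/",)


def may_write(path: str, is_production_manager: bool, is_vo_member: bool) -> bool:
    """Two levels only: reserved areas for production managers, the rest for the VO."""
    if path.startswith(RESERVED_PREFIXES):
        # Level 1: production data, writable by production managers only.
        return is_production_manager
    # Level 2: everything else is writable by any member of the VO.
    return is_vo_member
```

The sketch makes the weakness visible: outside the reserved prefixes, nothing distinguishes one VO member from another unless groups are mapped differently.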
Data security (2)
• BeStMan
  – Classic grid-mapfile, GUMS
• CASTOR
  – Classic grid-mapfile, insecure RFIO!!
• dCache
  – gPlazma supports GUMS, vo-role-mapfile, …
• DPM, LFC
  – Maps to virtual UIDs and GIDs (defined in DB)
  – Native VOMS support, fallback on classic grid-mapfile
    • Lcgdm-mapfile to determine the VO for a plain grid proxy
    • Grid-mapfile is needed by DPM GridFTP server
• StoRM
  – Native VOMS support
  – Uses just-in-time ACLs to give access to data on cluster FS
Other services
• Information system
  – Insecure LDAP
    • Anyone can search for vulnerable hosts
    • Information can be corrupted (DNS spoofing, MITM attack)
  – Any site can claim it supports any VO
    • The VO can configure a filter to get rid of unwanted sites or run a private, static information system
      – Filters currently work only for Computing and Storage Elements
• Monitoring
  – When secure, often viewable for any DN from a trusted CA
• Accounting
  – Secure
  – Privacy
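The VO-side filter mentioned under the information system can be pictured as follows. This is only a sketch: the function `filter_sites`, the shape of the `published` dictionary and the site names in the usage note are assumptions, not the format of any real information system.

```python
from typing import Dict, List, Set


def filter_sites(published: Dict[str, Set[str]],
                 vo: str,
                 trusted: Set[str]) -> List[str]:
    """Keep sites that claim to support the VO and are also on the VO's own list."""
    return sorted(
        site
        for site, vos in published.items()
        if vo in vos and site in trusted   # the claim alone is not enough
    )
```

With `published = {"site-a": {"atlas"}, "site-b": {"atlas"}}` and `trusted = {"site-a"}`, only `site-a` survives, even though `site-b` also advertises support for the VO.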
SSO, identity providers
• SSO for services is popular
• Identity providers
  – Kerberos
  – Shibboleth
  – …
• Why should grid usage be excluded?
• SSO identity can be translated into grid identity
  – FNAL Kerberos CA, SLCS
  – SWITCH SLCS
  – …
Vulnerability aspects
• EGEE Grid Security Vulnerability Group has >70 open issues
  – The vast majority of them are deemed low risk … for now
• A complete list of domains involved in WLCG could be used to configure service firewalls accordingly
  – Outbound client connections might also be constrained
• Jobs/payloads should be signed by the user proxy
  – Close the door to “easy” injection of rogue jobs
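The domain-list idea for firewalls reduces to a suffix check on host names, sketched below. The function name `host_allowed` is hypothetical and the domains in the usage note are only examples; an actual firewall would work on resolved addresses rather than names.

```python
from typing import Iterable


def host_allowed(host: str, domains: Iterable[str]) -> bool:
    """True if the host is one of the listed domains or a sub-domain of one."""
    host = host.lower().rstrip(".")   # normalise case and a trailing root dot
    # Require a full label boundary so "evilcern.ch" does not match "cern.ch".
    return any(host == d or host.endswith("." + d) for d in domains)
```

For instance, `host_allowed("ce01.cern.ch", ["cern.ch"])` is true, while `host_allowed("evilcern.ch", ["cern.ch"])` is false.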
Conclusions
• Security aspects of WLCG clients and services show a forest of libraries, configurations and features
  – A lot of legacy
• More consistency and simplicity are highly desirable
• Some important functionalities only implemented partially
  – Banning
  – Site-wide policies
  – Data protection
• There are steady improvements and road maps
  – To get us out of the woods…