The Condor DB Group Report
Jiansheng Huang, Ameet Kini, Shrinivas Lakshmikant, Erik Paulson, Christine Reilly, Eric Robinson, Srinath Shankar, David DeWitt, Jeff Naughton
Jan 07, 2016
2
Overview
General overview of group projects (Naughton).
Quill (Paulson).
3
Condor DB Group
Overall task: Focus on data management aspects of Condor
Deliver prototypes of useful technology
Explore, develop and evaluate technology that may be useful to Condor down the road.
4
Projects other than Quill
Provenance in a Condor System.
Statistical mining of log data to evaluate system health.
Interaction of user data placement, caching, and workflow job scheduling.
Job-machine matching in DB context.
Condor functionality based on App-Server technology.
Recency and consistency in captured data.
5
Provenance and Condor
Christine Reilly ([email protected])
Provenance: information on how data was produced.
Observation: for each user job, Condor can record:
Which version of program(s) was used;
Which version of data was used;
When it was produced;
What system it ran on (hardware, software).
Questions:
How much information should we gather?
How much burden should we place on the system designer, application programmer, or both?
6
Debugging through log mining
Srinivas Lakshmikant ([email protected])
Idea:
Record “events,” logically associated with entities.
E.g., job entities start, get scheduled, run, terminate.
Find which entities have infrequent events.
Find which entities lack frequent events.
Can you use this to detect problems?
Early results suggest yes: finds and pinpoints problems that might not be found otherwise.
How can you increase the accuracy and efficiency over naïve approaches?
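The two "find" steps above can be sketched as a naive baseline. This is only an illustration, assuming events arrive as (entity, event name) pairs; the function name, the 0.2 and 0.8 thresholds, and the sample log are hypothetical and not taken from the project.

```python
from collections import defaultdict

def flag_anomalies(events, rare=0.2, common=0.8):
    """Naive log mining over (entity_id, event_name) pairs.

    Returns {entity_id: {"unusual": [...], "missing": [...]}} for entities
    that have an infrequent event or lack a frequent one.
    """
    seen = defaultdict(set)              # entity -> event names observed
    for entity, event in events:
        seen[entity].add(event)

    n = len(seen)
    support = defaultdict(int)           # event name -> number of entities with it
    for names in seen.values():
        for name in names:
            support[name] += 1

    rare_events = {e for e, c in support.items() if c / n <= rare}
    common_events = {e for e, c in support.items() if c / n >= common}

    report = {}
    for entity, names in seen.items():
        unusual = sorted(names & rare_events)
        missing = sorted(common_events - names)
        if unusual or missing:
            report[entity] = {"unusual": unusual, "missing": missing}
    return report

# Hypothetical log: ten jobs; job 7 never terminates, job 9 is held.
log = []
for job in range(10):
    log += [(job, "start"), (job, "scheduled"), (job, "run")]
    if job != 7:
        log.append((job, "terminate"))
log.append((9, "held"))

report = flag_anomalies(log)
```

Here job 7 is flagged for lacking the frequent "terminate" event and job 9 for having the infrequent "held" event. Fixed thresholds like these are exactly the kind of naive choice the accuracy and efficiency question is about.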
7
Caching, Scheduling, Workflow
Srinath Shankar ([email protected])
Idea:
Cache input files and intermediate files on disks of pool machines;
Record where these files are cached;
Schedule tasks in a workflow to minimize data fetches/moves.
Result: potentially much greater throughput.
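One way to picture the idea is a greedy placement that sends each task to the machine already holding most of its inputs. This is a minimal sketch under simplifying assumptions (no load balancing, no disk limits, tasks given in a valid workflow order); the function, machine names, and file names are hypothetical.

```python
def schedule(tasks, cache):
    """Greedy placement: run each task where the fewest inputs must be fetched.

    tasks: list of (name, inputs, outputs) in a valid workflow order.
    cache: dict machine -> set of file names cached on that machine's disk.
    Returns (placement, fetches); cache is updated in place.
    """
    placement, fetches = {}, 0
    for name, inputs, outputs in tasks:
        # Pick the machine missing the fewest inputs (ties go to the first name).
        machine = min(sorted(cache), key=lambda m: len(set(inputs) - cache[m]))
        missing = set(inputs) - cache[machine]
        fetches += len(missing)
        # Fetched inputs and produced intermediate files stay cached there.
        cache[machine] |= missing | set(outputs)
        placement[name] = machine
    return placement, fetches

cache = {"m1": {"a.dat"}, "m2": {"b.dat"}}
tasks = [
    ("t1", ["a.dat"], ["a.out"]),
    ("t2", ["b.dat"], ["b.out"]),
    ("t3", ["a.out", "a.dat"], ["final"]),
]
placement, fetches = schedule(tasks, cache)
```

In this toy workflow every task lands on a machine that already caches its inputs, so no file is fetched. A real scheduler would also have to weigh machine availability against data locality.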
8
Job Matching in a DBMS
Ameet Kini ([email protected])
Idea: matching looks a lot like a DBMS join.
If machine and job data are already stored in a DBMS, can we or should we use the DBMS to do the matching?
Answer: early results are promising but this is a non-trivial problem.
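To illustrate why matching resembles a join, the sketch below stores a few jobs and machines in an in-memory SQLite database and expresses the match as a join predicate. The table and column names are hypothetical, and fixed requirement columns are a deliberate simplification: the sketch sidesteps per-job requirement expressions, which a real matcher would need to handle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (job_id TEXT, req_memory INTEGER, req_arch TEXT);
    CREATE TABLE machines (name TEXT, memory INTEGER, arch TEXT);
    INSERT INTO jobs VALUES
        ('23.2', 512, 'INTEL'), ('23.3', 4096, 'INTEL'), ('13.2', 256, 'SPARC');
    INSERT INTO machines VALUES
        ('north', 1024, 'INTEL'), ('south', 8192, 'INTEL');
""")

# The join predicate plays the role of the job's requirements.
matches = conn.execute("""
    SELECT j.job_id, m.name
    FROM jobs j JOIN machines m
      ON m.arch = j.req_arch AND m.memory >= j.req_memory
    ORDER BY j.job_id, m.name
""").fetchall()
```

The result lists every candidate (job, machine) pair; choosing one machine per job and ranking the candidates are left out of this sketch.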
9
Recency of Quill Data
Jiansheng Huang ([email protected])
Problem: daemons report in at uncontrollable and unpredictable times.
Result: out of date and inconsistent data set.
Can we provide the user with a concise characterization of the recency of the sources relevant to a user query?
Note: surprisingly non-trivial to define what we mean by “relevant” in this setting.
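As a rough illustration of what a "concise characterization" might look like, the sketch below summarizes the age of the last report from each source. It assumes the set of relevant sources is already known, which is the part the slide flags as hard; the function name, field names, and timestamps are hypothetical.

```python
def recency_summary(last_report, relevant, now):
    """Summarize how stale the sources behind a query result may be.

    last_report: dict source -> time of its last report (seconds).
    relevant: sources assumed relevant to the query.
    """
    ages = {s: now - last_report[s] for s in relevant}
    stalest = max(ages, key=ages.get)
    return {
        "sources": len(ages),
        "freshest_age": min(ages.values()),
        "stalest_age": ages[stalest],
        "stalest_source": stalest,
    }

last_report = {"north": 990, "south": 400, "east": 950}
summary = recency_summary(last_report, ["north", "south"], now=1000)
```

A user reading this summary would learn that one of the two sources behind the answer last reported 600 seconds ago, so the result may be inconsistent with the current state.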
10
App. Servers and Condor
Eric Robinson ([email protected])
Idea: application servers provide a lot of technology that appears useful in a Condor setting.
Approach: build prototype of some Condor functionality using these tools, evaluate the approach.
11
Moving on…
Further questions on these projects?
Best bet is to contact the student listed on each slide.
On to the Quill portion of the talk.
12
The Condor Quill
“Give me a condor's quill! Give me Vesuvius' crater for an ink stand. Friends, hold my arms! For in the mere act of penning my thoughts of this Leviathan, they weary me. . . To produce a mighty book you must choose a mighty theme.”
-Melville, Moby Dick
The Quill Developers
13
What is Quill?
A non-invasive method of storing a read-only version of the Condor operational data in a relational database.
14
Quill: In pictures
[Figure, two panels. “Without Quill”: SchedD, job queue transaction log (job_queue.log), Disk. “With Quill”: SchedD, job queue transaction log (job_queue.log), QuillD, DBMS.]
15
Quill: Where we’ve been
First shipped in 6.7.11 (Sept 05)
Now “over the fence” – Condor Team is driving the 6.8 version
Response from users very helpful!
Lessons learned:
Passive collection good
DBMSes are full of surprises
16
Quill: Where we’d like to be
Shared databases
Better job data
Data from non-job sources
More than just PostgreSQL DBMS
Examples of usage
17
Quill in Condor 6.9.3
Development effort mostly complete
Previous bullet points addressed
Migration path for historical job data
Out of the box changes for Quill users:
Horizontal and vertical schema for active jobs
Jobs from multiple schedds in one database
By default, no new historical data stored
18
Example tables
Horizontal Job Table
ScheddName         Cluster  Proc  Owner     JobStatus  JobPrio  Universe
north.cs.wisc.edu  23       2     epaulson  IDLE       10       Vanilla
north.cs.wisc.edu  23       3     epaulson  IDLE       10       Vanilla
south.cs.wisc.edu  13       2     jhuang    RUN        5        Grid
north.cs.wisc.edu  13       2     miron     HELD       30       Standard

Vertical Job Table
ScheddName         Cluster  Proc  Attr    Value
north.cs.wisc.edu  23       2     WantIO  TRUE
north.cs.wisc.edu  23       2     Group   Database
north.cs.wisc.edu  23       3     Group   Condor
south.cs.wisc.edu  13       2     Group   Condor
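To show how the two layouts work together, the sketch below loads the example rows from this slide into an in-memory SQLite database and joins them on (ScheddName, Cluster, Proc). The table names jobs_horizontal and jobs_vertical are hypothetical and are not claimed to be Quill's actual schema; SQLite stands in for the real DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs_horizontal (scheddname TEXT, cluster INTEGER, proc INTEGER,
        owner TEXT, jobstatus TEXT, jobprio INTEGER, universe TEXT);
    CREATE TABLE jobs_vertical (scheddname TEXT, cluster INTEGER, proc INTEGER,
        attr TEXT, value TEXT);
""")
conn.executemany("INSERT INTO jobs_horizontal VALUES (?,?,?,?,?,?,?)", [
    ("north.cs.wisc.edu", 23, 2, "epaulson", "IDLE", 10, "Vanilla"),
    ("north.cs.wisc.edu", 23, 3, "epaulson", "IDLE", 10, "Vanilla"),
    ("south.cs.wisc.edu", 13, 2, "jhuang", "RUN", 5, "Grid"),
    ("north.cs.wisc.edu", 13, 2, "miron", "HELD", 30, "Standard"),
])
conn.executemany("INSERT INTO jobs_vertical VALUES (?,?,?,?,?)", [
    ("north.cs.wisc.edu", 23, 2, "WantIO", "TRUE"),
    ("north.cs.wisc.edu", 23, 2, "Group", "Database"),
    ("north.cs.wisc.edu", 23, 3, "Group", "Condor"),
    ("south.cs.wisc.edu", 13, 2, "Group", "Condor"),
])

# Common attributes come from fixed columns; rarer ones are looked up by name.
rows = conn.execute("""
    SELECT h.owner, h.cluster, h.proc, v.value
    FROM jobs_horizontal h JOIN jobs_vertical v
      ON h.scheddname = v.scheddname AND h.cluster = v.cluster AND h.proc = v.proc
    WHERE v.attr = 'Group'
    ORDER BY h.scheddname, h.cluster, h.proc
""").fetchall()
```

The horizontal table gives typed columns for attributes every job has, while the vertical table can hold any attribute without a schema change, at the cost of a join.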
19
More job information
The lifecycle of the job would be nice to have
Events like those in the “user log”
But, need more info than what’s in the job queue
Passive data collection works
20
Quill 6.9.3 diagram
[Figure: SchedD, job queue log, event log (new), QuillD, DBMS, Disk.]
The schedd writes events to the new “Event” log; the Quill daemon passively picks up the events and inserts them into the database.
For the schedd, the event log contains userlog events and job history events.
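Passive pickup of this kind can be sketched as a reader that remembers its file offset and only consumes complete lines, so the writer is never touched. This is an illustration only: the function name, the line-oriented log format, and the sample events are assumptions, not the actual Quill implementation or event log format.

```python
import os
import tempfile

def read_new_events(path, offset):
    """Return (complete lines appended since offset, new offset)."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n") + 1      # ignore a partially written last line
    lines = data[:end].decode().splitlines()
    return lines, offset + end

# Demo with a temporary file standing in for the event log.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "event.log")
    with open(path, "ab") as log:
        log.write(b"job 23.2 submitted\njob 23.2 exec")   # last line incomplete
    first, offset = read_new_events(path, 0)
    with open(path, "ab") as log:
        log.write(b"uting\n")
    second, offset = read_new_events(path, offset)
```

The first call returns only the finished line; the half-written one is picked up on the second call once the writer completes it. A daemon would then insert each returned line into the database and persist the offset.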
21
Examples
“Show me all the jobs that exited with a segfault that at some point ran on this machine”
“When my jobs get preempted, how long until they get matched again?”
“What is the average runtime for jobs for each different type of input file?” (SQL “GROUP BY”)
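The last question maps directly onto an aggregate query. The sketch below runs it against an in-memory SQLite table; the table finished_jobs, its columns, and the sample rows are hypothetical and only stand in for whatever job data the database holds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE finished_jobs (job_id TEXT, input_type TEXT, runtime_s REAL)"
)
conn.executemany("INSERT INTO finished_jobs VALUES (?,?,?)", [
    ("1.0", "fasta", 100.0),
    ("1.1", "fasta", 300.0),
    ("2.0", "csv", 50.0),
])

# One output row per input file type, with the mean runtime of its jobs.
avg_runtime = conn.execute("""
    SELECT input_type, AVG(runtime_s)
    FROM finished_jobs
    GROUP BY input_type
    ORDER BY input_type
""").fetchall()
```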
22
Collecting non-job information
[Figure: SchedD, StartD, Negotiator, event log (new), QuillD, DBMS, Disk.]
23
New information stored
StartD: Machine status
Negotiator: Matches made
Starter/Shadow: Files transferred
Collector: “Submitter” ads
All daemons: Generic Events, daemon ads
24
The DBMSD
New daemon responsible for database housekeeping
Only one needed per DBMS
Purges old data
Three classes, independent thresholds
Resource: Machine classads
Run: matches, job log events
Job: condor_history information
Estimates size of database
“Soft quota”, warn when exceeded
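The housekeeping policy described above can be sketched as follows. The three class names come from the slide; the threshold values, the row layout, and the function names are hypothetical and only illustrate independent per-class purging plus a quota that warns rather than blocks.

```python
# Hypothetical retention thresholds in days, one per data class.
PURGE_AFTER_DAYS = {"resource": 7, "run": 30, "job": 180}

def purge(rows, now_day, thresholds=PURGE_AFTER_DAYS):
    """Keep rows no older than their class threshold.

    rows: list of (data_class, day_inserted, size_bytes).
    """
    return [r for r in rows if now_day - r[1] <= thresholds[r[0]]]

def over_soft_quota(rows, quota_bytes):
    """Soft quota: only report; nothing is deleted or rejected."""
    return sum(size for _, _, size in rows) > quota_bytes

rows = [
    ("resource", 90, 100),   # 10 days old: past the resource threshold
    ("resource", 99, 100),   # 1 day old: kept
    ("run", 60, 200),        # 40 days old: past the run threshold
    ("job", 1, 400),         # 99 days old: kept, job data is retained longest
]
kept = purge(rows, now_day=100)
warn = over_soft_quota(kept, quota_bytes=450)
```

After the purge, the remaining 500 bytes still exceed the 450-byte quota, so the sketch would warn without removing anything further.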
25
Multiple DBMS systems
Oracle supported
Appears to need less maintenance
A nearly unified schema
Main difference is large text fields
Same binaries, DBMS type selectable via configuration file
26
Example Usage
PHP web front end
Good enough for some people
Or, use as the basis for your own system
BoF on Thursday at 11:00am
We’ll use the web front end to explain the information Quill now stores