Gray & Reuter: Resource Manager
Resource Managers
        Mon        Tue           Wed          Thur             Fri
9:00    Overview   TP mons       Log          Files & Buffers  B-tree
11:00   Faults     Lock Theory   ResMgr       COM+             Access Paths
1:30    Tolerance  Lock Techniq  CICS & Inet  Corba            Groupware
3:30    T Models   Queues        Adv TM       Replication      Benchmark
7:00    Party      Workflow      Cyberbrick   Party
Jim Gray, Microsoft, Gray @ Microsoft.com
Resource managers
– provide ACID objects (transactional objects)
– Use log manager to record changes
– Use transaction manager to coordinate multi-RM changes
– Use communication manager to make transactional RPCs
[Figure: two systems, each with a Transaction Manager, a Log Manager, a Communication Manager, and Resource Managers holding Objects in Volatile Storage and Durable Storage, plus a Log.]
Whirlwind Tour: the Application Verbs
TRID      Begin_Work(context *);        /* begin a transaction */
Boolean   Commit_Work(context *);       /* commit the transaction */
void      Abort_Work(void);             /* rollback to savepoint zero */
savepoint Save_Work(context *);         /* establish a savepoint */
savepoint Rollback_Work(savepoint);     /* return to savept (savept 0 = abort) */
Boolean   Prepare_Work(context *);      /* put transaction in prepared state */
context   Read_Context(void);           /* return current savepoint context */
TRID      Chain_Work(context *);        /* end current and start next trans */
TRID      My_Trid(void);                /* return current transaction identifier */
TRID      Leave_Transaction(void);      /* set process trid null, return current id */
Boolean   Resume_Transaction(TRID);     /* set process trid to desired trid */
A Partial Rollback
Begin, Action, Action, Save, Action, Save, Action, Action, Action, Save, Action, Rollback
A Persistent Transaction Surviving A System Restart
Begin, Action, Action, Action, Save, Action, Restart, Action, Save, Action, Commit
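The partial rollback sequence above can be illustrated with a toy model in C. This is only a sketch under stated assumptions: the names `toy_tx`, `toy_begin`, `toy_action`, `toy_save`, and `toy_rollback` are hypothetical and are not the Begin_Work/Save_Work/Rollback_Work verbs themselves, and "actions" are plain integers rather than logged updates.

```c
#include <assert.h>

#define MAX_ACTIONS    32
#define MAX_SAVEPOINTS 8

/* Hypothetical toy transaction: an action is just an integer. */
struct toy_tx {
    int actions[MAX_ACTIONS];
    int n_actions;
    int save_at[MAX_SAVEPOINTS]; /* save_at[s] = action count when savepoint s was taken */
    int n_saves;
};

/* Begin: no actions yet; savepoint 0 marks the empty state. */
void toy_begin(struct toy_tx *t)
{
    t->n_actions = 0;
    t->save_at[0] = 0;
    t->n_saves = 1;
}

void toy_action(struct toy_tx *t, int a)
{
    assert(t->n_actions < MAX_ACTIONS);
    t->actions[t->n_actions++] = a;
}

/* Establish a savepoint and return its number. */
int toy_save(struct toy_tx *t)
{
    assert(t->n_saves < MAX_SAVEPOINTS);
    t->save_at[t->n_saves] = t->n_actions;
    return t->n_saves++;
}

/* Return to savepoint sp: drop later actions and later savepoints.
   Rolling back to savepoint 0 discards everything (an abort). */
void toy_rollback(struct toy_tx *t, int sp)
{
    assert(sp >= 0 && sp < t->n_saves);
    t->n_actions = t->save_at[sp];
    t->n_saves = sp + 1;
}
```

In this model a rollback only truncates the action list; a real resource manager would undo each discarded action from the log.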
Whirlwind Tour: the TRID Flow
Call graph: who calls whom.
TRIDs flow on all such calls.
Application is typically root.
RM can be an application (use a transactional RM to store state)
[Figure: call graph rooted at the Application (labels: Application, Application Servers, Transaction Application Servers, Resource Managers).]
Whirlwind tour: Normal (no failure) Transaction Execution
[Figure: interfaces of a resource manager.
 TP monitor: administrative functions and callbacks to install, start, and schedule a resource manager.
 Application: invocation and response; callbacks (depends on application).
 Transaction Manager functions: Identify, SaveWork, RollbackWork, Join, StatusTransaction, Leave, Resume.
 Transaction Manager callbacks: Save, Prepare, Commit, UNDO, REDO, Checkpoint.]
WW tour: The Resource manager view
Boolean Savepoint(LSN *);       /* invoked at tran Save_Work(). Returns RM vote */
Boolean Prepare(LSN *);         /* invoked at phase_1. Return vote on commit */
void    Commit();               /* called at commit phase 2 */
void    Abort();                /* called at failed commit phase 2 or abort */
void    UNDO(LSN);              /* Undo the log record with this LSN */
void    REDO(LSN);              /* Redo the log record with this LSN */
Boolean UNDO_Savepoint(LSN);    /* Vote TRUE if can return to savepoint */
void    REDO_Savepoint(LSN);    /* Redo a savepoint. */
declare cursor for transaction_log
    select rmid, lsn                        /* a cursor on the transaction's log */
    from log                                /* it returns the resource manager name */
    where trid = :trid                      /* and record id (log sequence number) */
    descending lsn;                         /* and returns records in LIFO order */

void transaction_undo(TRID trid)            /* Undo the specified transaction. */
{ int sqlcode;                              /* event variables set by sql */
  open cursor transaction_log;              /* open an sql cursor on the trans log */
  while (TRUE)                              /* scan trans log backwards & undo each */
    {                                       /* fetch the next most recent log rec */
      fetch transaction_log into :rmid, :lsn;
      if (sqlcode != 0) break;              /* if no more, trans is undone, end loop */
      rmid.undo(lsn);                       /* tell RM to undo that record */
    }
  close cursor transaction_log;             /* Undo scan is complete, close cursor */
};                                          /* return to caller */
• If the UNDO is to a savepoint, the UNDO scan stops at the desired savepoint.
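The backward undo scan, including the stop at a savepoint, can be sketched in plain C over an in-memory log. This is a minimal illustration, not the book's interface: `struct log_rec`, `undo_to`, `undo_fn`, and `trace_undo` are hypothetical names, the log is assumed to be an array sorted by ascending LSN with LSNs starting at 1, and a `stop_lsn` of 0 stands for a full abort.

```c
#include <assert.h>
#include <stddef.h>

enum rec_kind { REC_UPDATE, REC_SAVEPOINT };

/* Hypothetical in-memory log record. */
struct log_rec {
    long lsn;
    int  trid;
    int  rmid;
    enum rec_kind kind;
};

/* Stand-in for "rmid.undo(lsn)": the caller supplies the RM dispatch. */
typedef void (*undo_fn)(int rmid, long lsn, void *arg);

/* Scan the log backwards and undo every update of `trid` whose LSN is
   greater than stop_lsn. Returns the number of records undone. */
int undo_to(const struct log_rec *log, size_t n, int trid,
            long stop_lsn, undo_fn undo, void *arg)
{
    int undone = 0;
    for (size_t i = n; i-- > 0; ) {
        if (log[i].trid != trid)
            continue;                 /* record of another transaction */
        if (log[i].lsn <= stop_lsn)
            break;                    /* reached the savepoint: stop the scan */
        if (log[i].kind == REC_UPDATE) {
            undo(log[i].rmid, log[i].lsn, arg);
            undone++;
        }
    }
    return undone;
}

/* Example callback: remembers the order in which LSNs were undone. */
struct undo_trace { long lsn[16]; int n; };

void trace_undo(int rmid, long lsn, void *arg)
{
    struct undo_trace *t = arg;
    (void)rmid;
    if (t->n < 16)
        t->lsn[t->n++] = lsn;
}
```

Passing the LSN of a savepoint record as `stop_lsn` gives a partial rollback; passing 0 undoes the whole transaction in LIFO order.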
Resource Manager Concepts: Restart REDO Protocol
Note: REDO forwards, UNDO backwards
void log_redo(void)
{ declare cursor for the_log                /* declare cursor from log start forward */
      select rmid, lsn                      /* gets RM id and log record id (lsn) */
      from log                              /* of all log records. */
      ascending lsn;                        /* in FIFO order */
  open cursor the_log;                      /* open an sql cursor on the log table */
  while (TRUE)                              /* Scan log forward & redo each record. */
    { fetch the_log into :rmid, :lsn;       /* fetch the next log record */
      if (sqlcode != 0) break;              /* if no more, then all redone, end loop */
      rmid.redo(lsn);                       /* tell RM to redo that record */
    }
  close cursor the_log;                     /* Redo scan complete, close cursor */
};                                          /* return to caller */
Idempotence
F(F(X)) == F(X): Needed in case restart fails (and restarts)
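A small C sketch may help show why idempotence matters when restart itself can fail and run again. It contrasts a redo that blindly re-applies an increment with one guarded by a page LSN. The names `counter_page`, `redo_naive`, and `redo_guarded` are hypothetical, and the LSN guard is one common way to obtain idempotence, assumed here for illustration.

```c
#include <assert.h>

/* Hypothetical page holding a counter and the LSN of the last
   log record applied to it. */
struct counter_page {
    long lsn;
    int  counter;
};

/* Not idempotent: running restart twice applies the increment twice. */
void redo_naive(struct counter_page *p)
{
    p->counter += 1;
}

/* Idempotent: the increment is applied only if the page has not
   already seen this log record, so F(F(X)) == F(X). */
void redo_guarded(struct counter_page *p, long rec_lsn)
{
    assert(rec_lsn > 0);
    if (p->lsn < rec_lsn) {
        p->counter += 1;
        p->lsn = rec_lsn;
    }
}
```

Applying `redo_guarded` twice with the same LSN leaves the counter at 1, while `redo_naive` applied twice leaves it at 2.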
Physical:
Keep old and new value of container (page, file,...)
Pro: Simple
Allows recovery of physical object (e.g. broken page)
Con: Generates LOTS of log data
Logical:
Keep call params such that you can compute F(x) and F^-1(x)
Pro: Sounds simple
Compact log.
Con: Doesn't work (wrong failure model).
Operations do not fail cleanly.
Sample Physical LOG RECORD
Ordinary sequential insert is OK.
Update of sorted (B-tree) page:
update LSN
update page space map
update pointer to record
insert record at correct spot (move 1/2 the others)
Essentially writes whole page (old and new).
16KB log records for 100-byte updates.
struct compressed_log_record_for_page_update
{ int      opcode;                  /* opcode will say compressed page update */
  filename fname;                   /* name of file that was updated */
  long     pageno;                  /* page that was updated */
  long     offset;                  /* offset within page that was updated */
  long     length;                  /* length of field that was updated */
  char     old_value[length];       /* old value of field */
  char     new_value[length];       /* new value of field */
};
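The compressed record above keeps only the changed byte range, with both its old and new contents. A minimal C sketch of building such a record and applying it in either direction follows. The names `phys_rec`, `phys_rec_make`, `phys_redo`, and `phys_undo` are hypothetical, the opcode, file name, and page number fields are omitted, and the two variable-length values are stored back to back in one flexible array member, since standard C does not allow two variable-length arrays in a struct.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical compressed physical log record for one byte range. */
struct phys_rec {
    long offset;             /* offset within page that was updated */
    long length;             /* length of field that was updated */
    unsigned char bytes[];   /* old value (length bytes), then new value */
};

/* Capture the old bytes from the page and the new bytes from the caller.
   Returns NULL if allocation fails; the caller frees the record. */
struct phys_rec *phys_rec_make(const unsigned char *page, long offset,
                               const unsigned char *new_value, long length)
{
    assert(offset >= 0 && length >= 0);
    struct phys_rec *r = malloc(sizeof *r + 2 * (size_t)length);
    if (r == NULL)
        return NULL;
    r->offset = offset;
    r->length = length;
    memcpy(r->bytes, page + offset, (size_t)length);
    memcpy(r->bytes + length, new_value, (size_t)length);
    return r;
}

/* REDO: install the new value. Repeating it changes nothing further. */
void phys_redo(unsigned char *page, const struct phys_rec *r)
{
    memcpy(page + r->offset, r->bytes + r->length, (size_t)r->length);
}

/* UNDO: restore the old value. */
void phys_undo(unsigned char *page, const struct phys_rec *r)
{
    memcpy(page + r->offset, r->bytes, (size_t)r->length);
}
```

Because both operations overwrite the range with a fixed value, each is idempotent by construction, which is the simplicity the physical scheme offers in exchange for log volume.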
Sample Logical LOG RECORD
Very compact.
Implies page update(s) for record (may be many pages long).
Implies index updates (may be many indices on base table)
struct logical_log_record_for_insert
{ int      opcode;                  /* opcode will say insert */
  filename fname;                   /* name of file that was updated */
  long     length;                  /* length of record that was inserted */
  char     record[length];          /* value of record */
};
The trouble with Logical Logging
Logical logging needs to start UNDO/REDO with an action-consistent state.
No half completed operations.
for example: insert (table, record)
ALL or NONE of the indices should be updated when logical UNDO/REDO is invoked.
Problem:
Failure model is Page & Message action consistency
(Lampson /Sturgis model of Chapter 3).
Actions can fail due to:
Logic: e.g. duplicate key.
Limit: ran out of space
Contention: deadlock
Media: broken page or session
System: computer failure/restart
Making Logical Logging Work: Shadows
Keep old copy of each page
Reset page to old copy at abort (no undo log)
Discard old copy at commit.
Handles all online failures due to:
Logic: e.g. duplicate key.
Limit: ran out of space
Contention: deadlock
Problem: forces page locking, only one updater per page.
What about restart?
Need to atomically write out all changed pages.
Making Logical Logging Work: Shadows
Perform same shadow trick at disc level.
Keep shadow copy of old pages.
Write out new pages.
In one careful write, write out new page root.
Makes update atomic
[Figure: A Shadow Update. Old and New versions each have a Free Space Bit Map and a Directory over the Data pages A, B, C.]
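The shadow trick can be sketched as an in-memory toy in C: new page images go to fresh slots, a second page map records them, and a single root flip makes the whole update visible at once. Everything here is a hypothetical illustration (`shadow_store`, `shadow_begin`, `shadow_write`, `shadow_commit`, `shadow_read`), the "disc" is an array, and the allocator never reuses slots, so the free space bit map is not modelled.

```c
#include <assert.h>
#include <string.h>

#define NPAGES 3
#define NSLOTS 8

/* Hypothetical shadow-paged store. */
struct shadow_store {
    char slot[NSLOTS][8];   /* "disc" slots holding page images */
    int  map[2][NPAGES];    /* two page maps: logical page -> slot */
    int  root;              /* which map is current */
    int  next_free;         /* naive allocator: slots are never reused */
};

void shadow_init(struct shadow_store *s)
{
    memset(s, 0, sizeof *s);
    for (int p = 0; p < NPAGES; p++)
        s->map[0][p] = p;
    s->next_free = NPAGES;
}

/* Start an update: the new map begins as a copy of the current one. */
void shadow_begin(struct shadow_store *s)
{
    memcpy(s->map[1 - s->root], s->map[s->root], sizeof s->map[0]);
}

/* Write a new page image to a fresh slot; the old copy stays intact.
   Returns 0 on success, -1 if no slot is left. */
int shadow_write(struct shadow_store *s, int page, const char *data)
{
    assert(page >= 0 && page < NPAGES);
    if (s->next_free >= NSLOTS)
        return -1;
    int slot = s->next_free++;
    strncpy(s->slot[slot], data, sizeof s->slot[slot] - 1);
    s->slot[slot][sizeof s->slot[slot] - 1] = '\0';
    s->map[1 - s->root][page] = slot;
    return 0;
}

/* The one careful write: switching the root makes the update atomic. */
void shadow_commit(struct shadow_store *s)
{
    s->root = 1 - s->root;
}

/* Readers see only the committed map. */
const char *shadow_read(const struct shadow_store *s, int page)
{
    return s->slot[s->map[s->root][page]];
}
```

Abort needs no undo in this model: the uncommitted map is simply abandoned and overwritten by the next `shadow_begin`.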
Shadows
Pro: Simple
Not such a bad deal with non-volatile ram
Con: page locking
extra space
extra overhead (for page maps)
extra IO
declusters sequential data
Compromise: Physio-Logical Logging
Physio-Logical Logging
    Physical to a "page" (physical container)
    Logical within a "page".
Keep old and new value of container (page, file,...)
Pro: Simple
    Allows recovery of physical object (e.g. broken page)
Con: Generates LOTS of log data
Logical vs Physio-logical Logging
Insert record r into table A (Table A has Index B and Index C)
Logical log record:
    insert, A, r
Physiological log records:
    insert, A, page 508, r
    insert, B, page 72, s
    insert, C, page 94, t
Note: physical log records would be bigger for sorted pages.
Physiological Logging Rules
Complex operations are a sequence of simple operations on pages and messages.
Each operation is constructed as a mini-transaction:
    lock the object in exclusive mode
    transform the object
    generate an UNDO-REDO log record
    record log LSN in object
    unlock the object.
Action Consistent Object: When object semaphore free, no ops in progress.
Log-Consistency: contains log records of all complete page/msg actions.
Physiological Logging Rules
Online Operation - Only Need the Fix Rule
Each operation is structured as a mini-transaction.
Each operation generates an UNDO record.
No page operation fails with the semaphore set.
(exception handler must clean up state and UNFIX any pages).
Then Rollback can be physical to a page/session/container and logical within page/session/container.
Physiological Logging Rules
Restart Operation - Need WAL and F@C
Need Page-Action consistent disc state.
    Pages are action consistent.
    Committed actions can be redone from log.
    Uncommitted actions can be undone from log.
WAL: Write Ahead Log
    Write undo/redo log records before overwriting disc page
    Only write action-consistent pages
Force-Log-At-Commit
    Make transaction log records durable at commit.
Physiological Logging Rules
WAL and F@C
WAL: Write Ahead Log
write page:
    get page semaphore
    copy page
    give page semaphore         /* avoids holding semaphore during IO */
    Force_log(Page(LSN))        /* WAL logic, probably already flushed */
    Write copy to disc.
WAL gives idempotence and testability.
Force-Log-At-Commit
At commit phase 1:
    Force_log(transaction.max_lsn)
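The two rules can be exercised in a tiny C simulation that tracks only LSNs: the newest volatile and durable log records, and the volatile and persistent versions of a single page. This is a sketch under assumptions, not the book's code: `wal_sim` and the `sim_*` functions are hypothetical names, there is exactly one page, and "forcing" the log is modelled by advancing a number.

```c
#include <assert.h>

/* Hypothetical LSN bookkeeping for one page and one log. */
struct wal_sim {
    long vl_lsn;   /* newest volatile log record */
    long dl_lsn;   /* newest durable log record */
    long vv_lsn;   /* LSN of the volatile page version */
    long pv_lsn;   /* LSN of the persistent page version */
};

/* An update writes a log record and stamps the page with its LSN. */
long sim_update(struct wal_sim *w)
{
    w->vl_lsn++;
    w->vv_lsn = w->vl_lsn;
    return w->vl_lsn;
}

/* Make the log durable up to lsn (a no-op if already flushed). */
void sim_force_log(struct wal_sim *w, long lsn)
{
    assert(lsn <= w->vl_lsn);
    if (lsn > w->dl_lsn)
        w->dl_lsn = lsn;
}

/* WAL: force the log up to the page LSN before overwriting the disc page. */
void sim_write_page(struct wal_sim *w)
{
    long copy_lsn = w->vv_lsn;      /* LSN of the copied page image */
    sim_force_log(w, copy_lsn);
    w->pv_lsn = copy_lsn;
}

/* F@C: force the transaction's log records at commit. */
void sim_commit(struct wal_sim *w, long max_lsn)
{
    sim_force_log(w, max_lsn);
}

/* Persistent page never ahead of the durable log; durable log never
   ahead of the volatile log; the single page carries the newest LSN. */
int sim_invariants_hold(const struct wal_sim *w)
{
    return w->pv_lsn <= w->dl_lsn
        && w->dl_lsn <= w->vl_lsn
        && w->vv_lsn == w->vl_lsn;
}
```

If `sim_write_page` skipped the force, `pv_lsn` could exceed `dl_lsn`, which is exactly the state from which restart could not undo an uncommitted change.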
WAL & F@C in Pictures
[Figure: Volatile Page Versions (VVlsn) and Volatile Log Records (VLlsn) above Persistent Page Versions (PVlsn) and Durable Log Records (DLlsn), drawn against Time.]
online:  VVlsn = VLlsn
         DLlsn <= VVlsn
         PVlsn <= DLlsn
Commit:  commit_lsn <= DLlsn
At restart all volatile memory is reset and must be reconstructed from persistent memory.
restart: PVlsn <= DLlsn
         commit_lsn <= DLlsn
FIX, WAL and F@C assure these assertions
The One Bit Resource Manager
Manages an array of transactional bits (the free space bit map).
i = get_bit(); /* gets a free bit and sets it */
give_bit(i); /* returns a free bit (when transaction commits) */
The Bitmap and Its Log Records
The Data Structure
struct {                            /* layout of the one-bit RM data structure */
    LSN        lsn;                 /* page LSN for WAL protocol */
    xsemaphore sem;                 /* semaphore regulates access to the page */
    Boolean    bit[BITS];           /* page.bit[i] = TRUE => bit[i] is free */
} page;                             /* allocates the page structure */
The Log Records
struct {                            /* log record format for the one-bit RM */
    int     index;                  /* index of bit that was updated */
    Boolean value;                  /* new value of bit[index] */
} log_rec;                          /* log record used by the one-bit RM */
const int rec_size = sizeof(log_rec);   /* size of the log record body. */
Page and Log Consistency for 1-Bit RM
Data is dirty if it reflects an uncommitted transaction update. Otherwise, data is clean.
Page Consistency:
• No clean free bit has been given to any transaction.
• Every clean busy bit was given to exactly one transaction.
• Dirty bits are locked in X mode by updating transactions.
• The page.lsn reflects most recent log record for page.
Log Consistency:
• Log contains a record for every completed mini-transaction update to the page.
give_bit()
get_bit() & give_bit(i) temporarily violate page consistency.
Mini-transaction holds semaphore while violating consistency.
Makes page & log mutually consistent before releasing sem.
=> each mini-transaction observes a consistent page state.

void give_bit(int i)                                /* free a bit */
{ if (LOCK_GRANTED == lock(i,LOCK_X,LOCK_LONG,0))   /* Lock bit */
    { Xsem_get(&page.sem);                          /* get page sem */
      page.bit[i] = TRUE;                           /* free the bit */
      log_rec.index = i;                            /* generate log rec */
      log_rec.value = TRUE;                         /* saying bit is free */
      page.lsn = log_insert(log_rec,rec_size);      /* write log rec & update lsn */
      Xsem_give(&page.sem);                         /* page consistent */
    }
  else                                              /* if lock failed, caller doesn't own bit, */
    Abort_Work();                                   /* in that case abort caller's trans */
  return;
};
get_bit()
int get_bit(void)                                   /* allocates a bit and returns its index */
{ int i;                                            /* loop variable */
  Xsem_get(&page.sem);                              /* get the page semaphore */
  for (i = 0; i < BITS; i++)                        /* loop looking for a free bit */
    { if (page.bit[i])                              /* if bit is free, may be dirty (so locked) */
        { if (LOCK_GRANTED == lock(i,LOCK_X,LOCK_LONG,0))   /* lock bit */
            { page.bit[i] = FALSE;                  /* got lock on it, so it was free */
              log_rec.value = FALSE;                /* generate log rec describing update */
              log_rec.index = i;
              page.lsn = log_insert(log_rec,rec_size);  /* write log rec & update lsn */
              Xsem_give(&page.sem);                 /* page now consistent, give up sem */
              return i;                             /* return to caller */
            }
        };                                          /* else lock bounce so bit dirty */
    };                                              /* try next free bit */
  Xsem_give(&page.sem);                             /* if no free bits, give up semaphore */
  Abort_Work();                                     /* abort transaction */
  return -1;                                        /* returns -1 if no bits are available. */
};
Compensation Logging
Undo may generate a log record recording undo step
Makes Page LSN monotonic
Similar technique was used for Communication Manager (session sequence number was monotonic)
[Figure: a log record takes the Old State to the New State; UNDO writes a compensation log record that takes the New State to the Logical Old State.]
1-bit RM UNDO Callback
void undo(LSN lsn)                                  /* undo a one-bit RM operation */
{ int i;                                            /* bit index */
  Boolean value;                                    /* old bit value from log rec to be undone */
  log_rec_header header;                            /* buffer to hold log record header */
  rec_size = log_read_lsn(lsn,header,0,log_rec,big);    /* read log rec */
  Xsem_get(&page.sem);                              /* get the page semaphore */
  i = log_rec.index;                                /* get bit index from log record */
  value = ! log_rec.value;                          /* get complement of new bit value */
  page.bit[i] = value;                              /* update bit to old value */
  log_rec.value = value;                            /* make a compensation log record */
  page.lsn = log_insert(log_rec,rec_size);          /* log it and bump page lsn */
  Xsem_give(&page.sem);                             /* free the page semaphore */
  return;
}
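To see the compensation record and the monotonic page LSN in action, here is a stand-alone toy version of the one-bit RM in C. It is a sketch under assumptions: the names `toy_rm`, `toy_log_insert`, `toy_get_bit`, and `toy_undo` are hypothetical, the log is a small in-memory array indexed by LSN, and locks and semaphores are left out entirely, so it is single-threaded only.

```c
#include <assert.h>
#include <stdbool.h>

#define BITS    8
#define LOG_MAX 64

struct bit_rec {
    int  index;   /* index of bit that was updated */
    bool value;   /* new value of bit[index] */
};

/* Hypothetical toy RM: one page of bits plus an in-memory log. */
struct toy_rm {
    long page_lsn;
    bool bit[BITS];                    /* true => bit is free */
    struct bit_rec log[LOG_MAX + 1];   /* log[lsn]; LSNs start at 1 */
    long last_lsn;
};

void toy_init(struct toy_rm *rm)
{
    rm->page_lsn = 0;
    rm->last_lsn = 0;
    for (int i = 0; i < BITS; i++)
        rm->bit[i] = true;
}

/* Append a record and return its LSN. */
long toy_log_insert(struct toy_rm *rm, int index, bool value)
{
    assert(rm->last_lsn < LOG_MAX);
    rm->last_lsn++;
    rm->log[rm->last_lsn].index = index;
    rm->log[rm->last_lsn].value = value;
    return rm->last_lsn;
}

/* Allocate the first free bit; returns -1 if none is free. */
int toy_get_bit(struct toy_rm *rm)
{
    for (int i = 0; i < BITS; i++) {
        if (rm->bit[i]) {
            rm->bit[i] = false;
            rm->page_lsn = toy_log_insert(rm, i, false);
            return i;
        }
    }
    return -1;
}

/* Undo the record at lsn by writing a compensation record, so the
   page LSN moves forward instead of backward. */
void toy_undo(struct toy_rm *rm, long lsn)
{
    assert(lsn >= 1 && lsn <= rm->last_lsn);
    int  i   = rm->log[lsn].index;
    bool old = !rm->log[lsn].value;    /* complement of the logged new value */
    rm->bit[i] = old;
    rm->page_lsn = toy_log_insert(rm, i, old);
}
```

After `toy_get_bit` followed by `toy_undo`, the bit is free again but the page LSN has advanced to the compensation record's LSN.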
1-bit RM Checkpoint Callback
LSN checkpoint(LSN * low_water)                     /* copy 1-page RM state to persistent store */
{ Xsem_get(&page.sem);                              /* get the page semaphore */
  *low_water = log_flush(page.lsn);                 /* WAL force up to page lsn, and */
                                                    /* set low water mark */
  write(file,page,0,sizeof(page));                  /* write page to persistent memory */
  Xsem_give(&page.sem);                             /* give page semaphore */
  return NULLlsn;                                   /* return checkpoint lsn (none needed) */
}
1-bit RM REDO Callback
void redo(LSN lsn)                                  /* redo a free space operation */
{ int i;                                            /* bit index */
  Boolean value;                                    /* new bit value from log rec to be redone */
  log_rec_header header;                            /* buffer to hold log record header */
  rec_size = log_read_lsn(lsn,header,0,log_rec,big);    /* read log record */
  i = log_rec.index;                                /* Get bit index */
  lock(i,LOCK_X,LOCK_LONG,0);                       /* get lock on the bit (often not needed) */
  Xsem_get(&page.sem);                              /* get the page semaphore */
  if (page.lsn < lsn)                               /* if bit version older than log record */
    { value = log_rec.value;                        /* then redo the op. get new bit value */
      page.bit[i] = value;                          /* apply new bit value to bit */
      page.lsn = lsn;                               /* advance the page lsn */
    }
  Xsem_give(&page.sem);                             /* free the page semaphore */
  return;
};
1-bit RM Noise Callbacks
Boolean prepare(LSN * lsn)                          /* 1-bit RM has no phase 1 work */
{ *lsn = NULLlsn; return TRUE; };

Boolean savepoint(LSN * lsn)                        /* no work to do at savepoint */
{ *lsn = NULLlsn; return TRUE; };

void UNDO_savepoint(LSN lsn)                        /* rollback work or abort transaction */
{ if (savepoint == 0)                               /* if at savepoint zero (abort) */
    unlock_class(LOCK_LONG, TRUE, MyRMID());        /* release all locks */
};
Summary
Model: Complex actions are a page/message action sequence.
LSN: Each page carries an LSN and a semaphore.
ReadFix: Read actions get semaphore in shared mode.
WriteFix: Update actions get semaphore in exclusive mode,
    generate one or more log records covering the page,
    advance the page LSN to match highest LSN,
    give semaphore.
WAL: log_flush(page.LSN) before overwriting persistent page.
F@C: force all log records up to the commit LSN at commit.
Compensation Logging: Invalidate undone log record with a compensating log record.
Idempotence via LSN: page LSN makes REDO idempotent.
Two Phase Commit
Getting two or more logs to agree
Getting two or more RMs to agree
Atomically and Durably
Even in case one of them fails and restarts.
The TM phases
    Prepare. Invoke each joined RM asking for its vote.
    Decide. If all vote yes, durably write commit log record.
    Commit. Invoke each joined RM, telling it commit decision.
    Complete. Write commit completion when all RM ACK.
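The four TM phases can be sketched as a single C function driving an array of stub resource managers. This is an illustrative toy, not the transaction manager's real interface: `struct rm_stub`, `tm_commit`, and the counters are hypothetical, votes are fixed flags, logging is reduced to one boolean, and failures, timeouts, and restart are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum tx_outcome { TX_COMMITTED, TX_ABORTED };

/* Hypothetical RM stub: a fixed vote plus counters of the callbacks
   it has received. */
struct rm_stub {
    bool vote_yes;
    int  prepared;
    int  committed;
    int  aborted;
};

/* Run commit over the n joined RMs. *commit_record_forced reports
   whether the TM reached the point of writing its commit record. */
enum tx_outcome tm_commit(struct rm_stub *rms, int n,
                          bool *commit_record_forced)
{
    assert(n >= 0);
    *commit_record_forced = false;

    /* Prepare: ask each joined RM for its vote. */
    for (int i = 0; i < n; i++) {
        rms[i].prepared++;
        if (!rms[i].vote_yes) {
            /* Any "no" vote aborts the transaction at every RM. */
            for (int j = 0; j < n; j++)
                rms[j].aborted++;
            return TX_ABORTED;
        }
    }

    /* Decide: all voted yes, so the commit record is durably written. */
    *commit_record_forced = true;

    /* Commit: tell each joined RM the decision. */
    for (int i = 0; i < n; i++)
        rms[i].committed++;

    /* Complete: with every RM acknowledged, a completion record would
       be written here (lazily); the toy does not model it. */
    return TX_COMMITTED;
}
```

The key property visible even in the toy is that the commit record is written only after every vote is in, and no RM hears "commit" before that record exists.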
Centralized Case of Two Phase Commit
Each participant (TM & RM) goes through a sequence of states
These generate log records
[State diagram over the states Null, Active, Prepared, Committing, Committed, Aborting, Aborted.]
Examples
Committed                   Aborted
begin                       begin
DO rm1                      DO rm1
DO rm2                      DO rm2
DO rm2                      DO rm2
prepare rm2 {locks}         UNDO rm2
commit {rm1, rm2}           UNDO rm2
complete                    UNDO rm1
                            UNDO begin {rm1, rm2}
                            complete
Transitions in Case of Restart
[State diagram over the states Null, Active, Prepared, Committing, Committed, Aborting, Aborted.]
Active state not persistent, others are persistent
For both TM and RM.
Log records make them persistent (redo)
TM tries to drive states to the right. (to committed, aborted)
Successful two phase commit
Message/Call flow from TM to each RM joined to transaction
If TM and RM share the same log, the RM FORCE can piggyback on the TM FORCE
One IO to commit a transaction (less if commit is grouped)
[Figure: message flow between Coordinator and Participant. Each side passes through the states Active, Prepared, Committing, Committed.]
Coordinator: Local Prepare (lazy); send Prepare; on "yes", Write Commit Record In Log (force); send Commit; Local Commit Work (lazy); on Ack, Write Completion Record In Log (lazy).
Participant: on Prepare, Local Prepare, Write Prepare Record In Log (force), reply yes; on Commit, Local Commit Work, Write Completion Record In Log (lazy), Ack when durable.
Abort Two Phase Commit
If RM sends "NO" or no response (timeout), TM starts abort.
Calls UNDO of each trans log record
May stop at a savepoint.
At begin_trans it calls ABORT() callback of each joined RM
Distributed two phase commit
Tracking joined TMs -- the communications manager helps
Much as TRPC helps in the local case.
Root TM owes a Prepare/Commit/Abort message to each joined TM.
Joined TM does "local" commit.
[Figure: a call carrying (trid, data) travels over a session from caller to callee. On the first use of a trid, the caller's Communications Manager tells Transaction Manager A that the trid is outgoing to B, and the callee's Communications Manager tells its Transaction Manager that the trid is incoming from A.]
Full Transaction State Diagram
Next section explains how these states are implemented.
[State diagram: null (= save point 0), begun (= save point 1), save point n active, persistent save point n, prepared, committing, committed, aborting, aborted. The states are grouped into Volatile States, Persistent States, and Durable States, and into live states and complete states.]
Summary of Resource Manager Concepts
DO/UNDO/REDO
Idempotent, Testable, Real operations
Logical vs Physical logging
Shadows to make logical logging work
Physiological logging