The Enron and W3C Collections
Tamer Elsayed and Douglas W. Oard
ICAIL 2007, DESI Workshop, June 4th, 2007
University of Maryland
Variants of Email Search
Collection (rows) by searcher (columns):
Collection     Participant          Non-participant
Personal       My own emails        Shneiderman’s, Postel’s
Organization   Help desks           White House, Enron
Public         Online communities   Usenet news, W3C
The (Extended) Enron Collection
Rich multimodal data: emails, phone calls, databases
“Public” version of Enron collection (CMU)
  150 sets of rescued Outlook email folders
  517,431 emails, 52% duplicates, 133,581 unique addresses
  Subset annotated w/ genre, speech act, mentioned calls, …
Extended Enron email collection (Aspen Systems)
  Attachments, additional email (later release, redaction)
Phone calls from/to Enron traders (Snohomish PUD)
  Transcribed subset from 52 DVDs of recorded audio
  Recovered from scanned transcripts using OCR
  93 annotated with date, time, participants, mentioned names, mentioned emails, mentioned meetings, ...
Relational databases (Aspen Systems)
Cross-References
[Figure: cross-references between email and phone calls]
Phone Call Transcripts
Message-ID: <24-20010126-19435570-20020114-R>
Message-Type: PhoneCall
Date: Fri, 26 Jan 2001 19:43:55 -0600 (CST)
From: [email protected]
Parties: [email protected], [email protected]
Subject: Snohornish deal, Houston Chronicle Article, Bonuses e-mail, Houston Chronicle Article, Deal, email to Jane King
Subject-TimePos: 145, 313, 713, 775, 920, 1018
InCallNames: Christian, Ken Lay, Greg, Chris Foster, Stewie, Stewie, Mike, Mike, Laverado, Mike, Kim, Shari, Greg, Forney, Stewie, Jane King, Shari
InCallNames-TimePos: 42, 81, 90, 95, 96, 143, 146, 190, 262, 266, 522, 580, 780, 1007, 1018, 1038, 1067
Keywords: CDWR, email, email
Keywords-TimePos: 55, 689, 1038
X-From: Stack, Shari <>
X-To: Wolfe, Greg <>
X-Parties: Stack, Shari <>, Wolfe, Greg <>
X-AudioFile: 24-20010126-19435570-20020114-R.wav
X-TranscriptFile: 24-20010126-19435570-20020114-R.txt
SHARI STACK: Hey.
GREG WOLFE: All right, let me get my fax machine workin'. Uh - [laughs]
SHARI: [laughs] She's like, it was so easy, I could make you a lot of money [laughs]. She's like, he said it so desperate. She goes I hate to laugh at people, but - [laughs]
GREG: Did you, um, did you, ah, ah tell her about the, ah, that voice mail?
SHARI: Yeah, I said - I said Greg [inaudible] he's got the - they got a mob connection [langhs] - his friend threw away the business card after the meeting.[both laughing]
SHARI: But, my God - my God, and so anyway, have you talked to Chnstian about this 'cause Christian apparently talked to him twice today.
GREG: Oh, he sent a - Christian sent an e-mail shortly after, you know, that, and said we're not doin' business with this guy.
SHARI: [laughs]
GREG: Ah, so I still don't understand why this guy's trying to get in the middle of us and CDWR and I guess -
SHARI: [laughs]
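The header of the transcript above pairs each list field (Subject, InCallNames, Keywords) with a matching -TimePos field. As a minimal sketch, assuming the transcript files follow the RFC 822-style layout shown and that each -TimePos entry aligns one-to-one with its list, Python's standard email parser can read them. The helper name timed_items is hypothetical, the address fields are left out of the sample, and the unit of the TimePos values is not stated in the source.

```python
from email import message_from_string

# Header values copied verbatim from the sample transcript (OCR spellings kept);
# the From/Parties address fields are omitted here.
RAW = """\
Message-ID: <24-20010126-19435570-20020114-R>
Message-Type: PhoneCall
Date: Fri, 26 Jan 2001 19:43:55 -0600 (CST)
Subject: Snohornish deal, Houston Chronicle Article, Bonuses e-mail, Houston Chronicle Article, Deal, email to Jane King
Subject-TimePos: 145, 313, 713, 775, 920, 1018
InCallNames: Christian, Ken Lay, Greg, Chris Foster, Stewie, Stewie, Mike, Mike, Laverado, Mike, Kim, Shari, Greg, Forney, Stewie, Jane King, Shari
InCallNames-TimePos: 42, 81, 90, 95, 96, 143, 146, 190, 262, 266, 522, 580, 780, 1007, 1018, 1038, 1067
Keywords: CDWR, email, email
Keywords-TimePos: 55, 689, 1038

SHARI STACK: Hey.
"""


def timed_items(msg, field):
    """Pair each comma-separated value of `field` with its `<field>-TimePos` entry.

    Hypothetical helper: assumes both lists have the same length and order.
    """
    values = [v.strip() for v in msg[field].split(",")]
    positions = [int(p) for p in msg[field + "-TimePos"].split(",")]
    if len(values) != len(positions):
        raise ValueError(f"{field}: {len(values)} values but {len(positions)} positions")
    return list(zip(values, positions))


msg = message_from_string(RAW)
names = timed_items(msg, "InCallNames")
keywords = timed_items(msg, "Keywords")
transcript = msg.get_payload()  # everything after the blank line

print(msg["Message-Type"], len(names), keywords)
```

Keeping each mention with its position makes it possible to jump from a name or keyword to the corresponding point in the call, whatever unit the collection uses.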
Typical Enron Email
Annotated parts: Message Header; Message Body (Salutation, Main Body, Signature Block); Quoted Text (Quoted Header, Quoted Main Body, Quoted Signature)
Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: '[email protected]@ENRON'

Hi Shari
Hope all is well.
Count me in for the group present.
See ya next week if not earlier
Liza
Elizabeth Sager
713-853-6349

-----Original Message-----
From: [email protected]@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]
Cc: [email protected]
Subject: Shhhh.... it's a SURPRISE !

Please call me (713) 207-5233
Thanks!
Shari
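The sample message combines a new reply with a quoted original introduced by an "-----Original Message-----" line. A minimal sketch of separating the two, assuming that Outlook-style separator and a blank line between the quoted header and the quoted body, might look like the following. The function names split_reply and split_quoted_header are hypothetical, the sample body is abridged, and this is not the segmentation method used to annotate the collection.

```python
import re

# Outlook-style separator seen in the sample message; other mail clients
# quote differently, so this pattern is an assumption, not a general rule.
ORIGINAL_MESSAGE = re.compile(
    r"^-{5}[ \t]*Original Message[ \t]*-{5}[ \t]*$", re.MULTILINE
)


def split_reply(body):
    """Return (new_text, quoted_text); quoted_text is '' if no separator is found."""
    match = ORIGINAL_MESSAGE.search(body)
    if match is None:
        return body.strip(), ""
    return body[:match.start()].strip(), body[match.end():].strip()


def split_quoted_header(quoted):
    """Split quoted text into its 'Name: value' header lines and the quoted body."""
    header_text, _, quoted_body = quoted.partition("\n\n")
    headers = {}
    for line in header_text.splitlines():
        name, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            headers[name.strip()] = value.strip()
    return headers, quoted_body.strip()


# Abridged from the sample message; address fields are left out.
BODY = """\
Hi Shari

Hope all is well.
Count me in for the group present.

Liza

-----Original Message-----
Sent: Monday, July 30, 2001 2:24 PM
Subject: Shhhh.... it's a SURPRISE !

Please call me (713) 207-5233

Thanks!
Shari
"""

new_text, quoted_text = split_reply(BODY)
quoted_headers, quoted_body = split_quoted_header(quoted_text)
print(quoted_headers["Subject"])
```

Finer segments such as the salutation and signature block would need further rules or a trained classifier, which this sketch does not attempt.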
Research Problems (Enron)
Threading
Email Classification
Social Network Analysis
Mention Resolution
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <[email protected]>
To: Suzanne Adams <[email protected]>
Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the call will be too late for him.
Who is that “Sheila”?
Rich Evidence about Identity
[Figure: email addresses linked to the name variants observed with them, e.g. "susan m scott", "susan scott", "scott susan", "m scott", "susan", "sue", "sscott", "sscott5"]
66,715 models
82,084 addr-name
3,151 addr-nickname
19,708 addr-addr
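One way to picture the address-to-name evidence is as an index from each address to the name variants seen with it. The sketch below is only an illustration under stated assumptions: it covers addr-name evidence alone (not nicknames or addr-addr links), it ranks candidates by a simple count of matching name variants, the class IdentityIndex is hypothetical, and the example.com addresses are invented because the real addresses are not available here. It is not the resolution method behind the figures above.

```python
from collections import defaultdict


class IdentityIndex:
    """Hypothetical toy index: address -> set of normalized name variants."""

    def __init__(self):
        self.names_by_address = defaultdict(set)

    def add(self, address, name):
        # Normalize case and whitespace so "Susan  Scott" and "susan scott" merge.
        self.names_by_address[address].add(" ".join(name.lower().split()))

    def candidates(self, mention):
        """Rank addresses by how many of their name variants share a token with the mention."""
        tokens = set(mention.lower().split())
        scores = {}
        for address, names in self.names_by_address.items():
            score = sum(1 for name in names if tokens & set(name.split()))
            if score:
                scores[address] = score
        # Highest score first; ties broken alphabetically for a stable order.
        return sorted(scores, key=lambda a: (-scores[a], a))


index = IdentityIndex()
# Name variants taken from the slide; addresses are invented placeholders.
index.add("sscott5@example.com", "Susan M Scott")
index.add("sscott5@example.com", "susan scott")
index.add("sscott5@example.com", "sue")
index.add("sscott@example.com", "Scott Susan")
index.add("kmann@example.com", "Kay Mann")

print(index.candidates("susan"))
```

A mention such as "Sheila" that matches no stored variant returns an empty list, which is where nickname and social-context evidence would have to come in.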
Test Collection of Mention Resolution
                                               Candidates
Collection      Emails   Identities  Queries   Min.  Avg.   Max.
Sager            1,628          627       51      1     4     11
Shapiro            974          855       49      1     8     21
Enron-subset    54,018       27,340       78      1   152    489
Enron-all      248,451      123,783       78      3   518  1,785
Evaluation
Task: named mention → ranked list of people
Measures
  Mean Reciprocal Rank
  Success @ K
    Success @ 1
  Confidence-based scoring
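Mean Reciprocal Rank and Success @ K both score a ranked list against the single correct person for a mention. The following is a minimal sketch using the standard definitions of the two measures; the function names and the toy rankings are hypothetical, and confidence-based scoring is not covered because the slides do not define it.

```python
def reciprocal_rank(ranked, correct):
    """1/rank of the correct answer (ranks start at 1), or 0.0 if it is absent."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_list, correct_answer) pairs."""
    return sum(reciprocal_rank(ranked, correct) for ranked, correct in runs) / len(runs)


def success_at_k(runs, k):
    """Fraction of queries whose correct answer appears in the top k."""
    return sum(1 for ranked, correct in runs if correct in ranked[:k]) / len(runs)


# Toy data: three mention queries, each with a ranked list and the true identity.
runs = [
    (["a", "b", "c"], "a"),  # correct at rank 1
    (["b", "a", "c"], "a"),  # correct at rank 2
    (["b", "c", "a"], "d"),  # correct answer never returned
]

print(mean_reciprocal_rank(runs))  # (1 + 1/2 + 0) / 3
print(success_at_k(runs, 1), success_at_k(runs, 2))
```

Success @ 1 is the special case k = 1, i.e. the fraction of mentions resolved correctly by the top-ranked candidate.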
Limitations (Mention Resolution)
Small number of queries
Only resolved by Enron employees
  Much easier
  Most of the participants are outsiders
Measures focus only on accuracy
Identity-Content Interplay
[Figure: Search for People; Search for Content; Social Context; Topical Context]
W3C Collection
Set of mailing lists
  Public, not private
  Topically oriented
~175,000 emails
Introduced at TREC 2005
50 topics (x 2 years)
Relevance judgments available for ad-hoc retrieval
Research Problems (W3C)
Expert Finding
  Topic → ranked list of experts
Known-item Retrieval
  Query → ranked list of emails
Discussion Search (i.e., ad-hoc retrieval)
  Pro/con retrieval
  Query → ranked list of emails
Topic Type Analysis
Find categories amenable to pro/con classification (TREC 2005 Enterprise Track)
[Figure: bar chart of "Number of Topics in Categories" (axis 0 to 30), by category]
A: Comparisons, usefulness, relationships
B: Methods, tips, solutions
C: Discuss an issue
D: Problems, impacts
E: Definitions, functionality
F: Reasons, design rationales
Limitations (Pro/Con Retrieval)
Not private/personal communication
Mailing lists: receivers are hidden
Topical categories are unbalanced
Developed by researchers, NOT users
Related Projects
Others working with CMU’s Enron emails
  Berkeley, CMU, U Mass, SIAM Workshop
University of Southern California ISI/ICT
  eArchivarius, Postel collection (Anton Leuski)
Georgia Tech Research Institute
  PERPOS, Presidential records (Bill Underwood)
Conclusion
Two email test collections
  Public
  Hundreds of thousands of emails
  Annotated emails and transcripts
  Tasks and ground truth
Need for “real” user needs
Development of evaluation measures for utility
For More Information
Joint Institute for Knowledge Discovery http://www.umiacs.umd.edu/jikd
Running System