How to conduct high-quality research and write good papers
Haixun Wang
Microsoft Research Asia
May 27, 2015
What is research?
1. Solve a problem using existing methods. Write a README.txt. (low innovation, little impact)
2. Improve existing solutions to an existing problem. Write a tech report. (low innovation, little impact)
3. Create a new solution to an existing problem. Write a paper. (high innovation, low impact)
4. Identify a new problem. Generalize the solution. Write a paper. (high innovation, high impact)
Research and Engineering
• New Solutions Useful Solutions
How innovative are you?
• Why, if the Chinese had come to know so much about earthquakes so early on in their immensely long history, were they never able to minimize the effects of the world’s contortions — to at least the degree that America has?
• Why did they leave the West to become leaders in the field, and leave themselves to become mired, time and again, in the kind of tragic events that we are witnessing this week?
• It is a cruel irony that the children who died during the earthquake in Dujiangyan (都江堰), China, knew all too well that their country once led the world in the knowledge of the planet’s seismicity.
• There had been any number of Chinese Euclids and Archimedes but there was never to be a Chinese Newton or Galileo.
• Until this week Dujiangyan was a place of which China could be proud; today its wreckage stands as a tragic monument to a culture that turned its back on its remarkable and glittering history (of innovation).
• In almost every area of technology the Chinese were once supreme, without competition. And yet, in the 16th century China’s innovative energies inexplicably withered away, and modern science became the virtual monopoly of the West.
How to train your innovation?
Read, Read, Read
Malcolm Gladwell
Editor, New Yorker
10,000 hours of success
Excellence requires a minimum level of practice.
10,000 hours is the magic number
(3 hours per day for 10 years)
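As a quick sanity check of the arithmetic on this slide, here is a small sketch. The 365-day year and the variable names are assumptions made for illustration, not part of the talk.

```python
# Rough check of the "3 hours per day for 10 years" figure.
# Assumption: practice happens every day of a 365-day year.
HOURS_PER_DAY = 3
DAYS_PER_YEAR = 365
YEARS = 10

total_hours = HOURS_PER_DAY * DAYS_PER_YEAR * YEARS
print(total_hours)  # 10950, slightly above the 10,000-hour mark
```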
By the time Bill Gates dropped out of Harvard, he had been programming nonstop for seven years, which was way past 10,000 hours.
In the last 10 years, I spent more than 3 hours watching TV every day. How come I didn’t achieve anything?
Nicholas Carr, Atlantic Monthly, July 2008
Independent thinking
• The downfall of deep reading/thinking
• The Internet is rewiring our brains, forcibly adapting us to tolerate only bite-sized summations and simplified blips at the expense of deeper thought
• We risk turning into ‘pancake people’, spread wide and thin as we connect with that vast network of information accessed by the mere touch of a button.
How to train your creativity?
Write, Write, Write!
Research = Writing + Rewriting
• Turn your idea into writing before implementing it.
• Hard to write it down? Because you don’t understand the problem (or your idea).
– Writing forces us to be clear, focused
– Writing crystallises what we don’t understand
• Writing opens the way to dialogue with others: reality check, critique, and collaboration.
Research = Writing + Rewriting
• The process of writing and rewriting is the process of
– developing your idea
– generalizing your problem/solution
• After many rounds of rewriting, your problem (idea) may be totally different from the problem (idea) you started with
– more interesting and challenging
• It’s not a waste of time. It’s how you should spend your time when you do research.
How to find a topic?
The Theory of Flying Pigs
In Reality
– Pigs do not have to fly.
[ABSTRACT] In this paper, we identify the importance for pigs to fly. We show that many challenging tasks can be modeled by flying pigs. Thus, solving the flying pig problem benefits a large variety of applications.
[ABSTRACT] In this paper, we extend the pioneering work of flying pigs [1]. Our improvement enables pigs to fly higher.
[ABSTRACT] Recently, the flying pig problem has attracted significant attention [1, 2]. However, pigs in previous works all fly very slowly. In this paper, we introduce a technique so that pigs can fly an order of magnitude faster.
and soon we have many papers …
What topic to work on?
• The choices you make will define your career
• No real problems at hand
– Get a proceedings volume. Read it from the first page.
– Ask senior people what they are working on.
– Make it go faster/higher
• Find real problems, use real data
Is this topic meaningful?
• Convince yourself
– an issue of research ethics
• Talk to your colleagues
– Hey! I have a crazy idea
– Convince them
• Talk to/Read from people not in your field
– mathematicians, physicists, biologists, …
Database research as an example
• Databases have been one of the most successful fields in CS in terms of applications and industrial value!
• However, is there anything left for substantial database research?
– Relational database theory, a closing world?
– Too many index structures already?
Example: Data Model
• From: RDBMS
– Normalization is one of the cornerstones of RDBMS
– Theoretical results and practical applications
• To: XML
– Storage model: still an open problem
– Hybrid databases, native XML support
Example: Logic Databases
• Logic databases were a hot topic in the ’80s and early ’90s
– models, semantics, magic sets, …
– many results have since been incorporated into RDBMS
– is Logic Database dead?
• Rejuvenated by semantic query processing
– ontology, description logics
Broadening the Scope
• Concern (VLDB Endowment meeting, ’98):
– The area of database research may lose the pivotal role it now plays among information system technologies
• Keep DB research current and relevant
– We should maintain a watch on trends and future directions in the general area of information management
• Can a traditionally non-DB/KDD research problem be treated using DB/KDD methods?
Writing techniques
• Overcome the language barrier
• Paper structure and content
The Language Barrier
• One must first know the rules to break them
Some General Tips
• Choose the right word/phrase
• Use the active voice
• A picture is worth 10,000 words
• Use a fair amount of formalization
• The divide-and-conquer approach
• Keep it simple and stupid
Choose the right word/phrase
• Chicken without sexual life
• Husband and wife’s lung slice
• Bean curd made by a pockmarked woman
Use the active voice
• Ten Yuan will be paid for every
one-time towel you use.
Use the active voice
NO                                                    YES
It can be seen that...                                We can see that...
34 tests were run                                     We ran 34 tests
These properties were thought desirable               We wanted to retain these properties
It might be thought that this would be a type error   You might think this would be a type error
The passive voice is “respectable” but it DEADENS your paper. Avoid it at all costs.
“We” = you and the reader
“We” = the authors
“You” = the reader
Slide borrowed from Simon Peyton Jones
Be Specific
NO!  We describe the WizWoz system. It is really cool.
YES! We give the syntax and semantics of a language that supports concurrent processes (Section 3). Its innovative features are...
NO!  We study its properties
YES! We prove that the type system is sound, and that type checking is decidable (Section 4)
NO!  We have used WizWoz in practice
YES! We have built a GUI toolkit in WizWoz, and used it to implement a text editor (Section 5). The result is half the length of the Java version.
From Simon Peyton Jones
Structure (conference paper)
• Title (1000 readers)
• Abstract (4 sentences, 100 readers)
• Introduction (1 page, 100 readers)
• The problem (1 page, 10 readers)
• My idea (2 pages, 10 readers)
• The details (5 pages, 3 readers)
• Related work (1-2 pages, 10 readers)
• Conclusions and further work (0.5 pages)
Slide borrowed from Simon Peyton Jones
An Attractive Abstract Counts
• Abstract is for people to skim through in one minute
– No technical details
– Plain English, easy to understand
– No assumption of DB/KDD background
– As short as possible
• What to write
– The problem, and why it is important and challenging
– Your technical thrust, progress and contributions
– Broader impact
• Write it last!
What Is a Good Introduction
• Starting from good stories
– Motivation – what is the problem and why is the problem important?
– 1-2 typical real-life applications
• Intuition and general ideas
– Intuition is most important!
– No technical details
– Understandable for a CS undergraduate
– Use clear, small examples
What Is a Good Introduction (2)
• Highlight major contributions
– Typical examples: identifying a new problem, novel solutions, a systematic performance study, …
– Only list the major ones; don’t overclaim
– Again, no technical details
– A road map of the rest of the paper
What’s the difference?
Hardcover: 1312 pages
Publisher: Wiley; 7th edition (June 20, 2001)
Language: English
ISBN-10: 0471381578
ISBN-13: 978-0471381570
Product Dimensions: 10.1 x 9.1 x 1.9 inches
Shipping Weight: 6.1 pounds
Pages: 378
Publication date: January 2004
ISBN: 7040137860
Barcode: 9787040137866
Writing a paper is like telling a story
• The goal of the title is to get the reader to read the abstract …
• The goal of the abstract is to get the reader to read the introduction …
• …
• You need a good setup … suspense … then you unfold your story slowly …
Goal: creating suspense
• Reader thinks “gosh, if they can really deliver this, that’d be exciting. I’d better read on”
Create Suspense
Many years later, as he faced the firing squad, Colonel Aureliano Buendía was to remember that distant afternoon when his father took him to discover ice.
One Hundred Years of Solitude, by Gabriel García Márquez
Keep it Simple and Stupid
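“The north wind blew hard all night.” (一夜北风紧)
Dream of the Red Chamber, by Cao Xueqin
“Though this line is plain and gives no hint of what follows, it is exactly how someone who knows how to write poetry would begin. It is not only good; it also leaves endless room for those who come after to continue.”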
An Example (SIGMOD’02)
Motivation Found!
Shifting Pattern {b,c,h,j,e}
Scaling Pattern {f,d,a,g,i}
Is It Meaningful?
        CH1I  CH1B  CH1D  CH2I  CH2B  …
VPS8     401   281   120   275   298
SSA1     401   292   109   580   238
SP07     228   290    48   285   224
EFB1     318   280    37   277   215  …
MDM10    538   272   266   277   236
CYS3     322   288    41   278   219
DEP1     317   272    40   273   232  …
NTG1     329   296    33   274   228
…          …     …
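A minimal sketch of how one might test the table above for a shifting pattern, assuming that a "shifting pattern" means the selected rows differ from one another by a constant offset on the selected columns. That definition, the function name `is_shifting_pattern`, and the choice of rows and columns are assumptions made for illustration; the values are copied from the table.

```python
# Expression values copied from the table above.
COLUMNS = ["CH1I", "CH1B", "CH1D", "CH2I", "CH2B"]
ROWS = {
    "VPS8": [401, 281, 120, 275, 298],
    "EFB1": [318, 280, 37, 277, 215],
    "CYS3": [322, 288, 41, 278, 219],
}

def is_shifting_pattern(rows, columns, selected):
    """Return True if every row differs from the first row by one constant
    offset across the selected columns (assumed definition of 'shifting')."""
    indices = [columns.index(name) for name in selected]
    names = list(rows)
    base = rows[names[0]]
    for name in names[1:]:
        # Collect the per-column offsets; a shifting pattern has exactly one.
        offsets = {rows[name][i] - base[i] for i in indices}
        if len(offsets) != 1:
            return False
    return True

print(is_shifting_pattern(ROWS, COLUMNS, ["CH1I", "CH1D", "CH2B"]))  # True
print(is_shifting_pattern(ROWS, COLUMNS, COLUMNS))                   # False
```

On the three selected columns the rows differ only by a constant (EFB1 is VPS8 minus 83, CYS3 is VPS8 minus 79), while no such regularity is visible across all five columns.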
Intuition Is the Most Important
• Example
– ensemble classifier for streams
• Why ensemble?
– A rigorous mathematical proof showing that the ensemble reduces classification variance
• Many benefits
– High accuracy, ease of use, best approach in many aspects
• Result:
– paper rejected
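The variance claim can be given some intuition with a small simulation. This is only a sketch under strong assumptions (independent base models whose outputs are a true score of 0.0 plus Gaussian noise; the function name and parameters are hypothetical), not the argument made in the rejected paper.

```python
import random
import statistics

def simulate(num_models, trials=2000, noise=1.0, seed=0):
    """Return (variance of a single model, variance of the averaged ensemble)."""
    rng = random.Random(seed)
    single, ensemble = [], []
    for _ in range(trials):
        # Each hypothetical model outputs the true score 0.0 plus independent noise.
        predictions = [rng.gauss(0.0, noise) for _ in range(num_models)]
        single.append(predictions[0])
        ensemble.append(sum(predictions) / num_models)
    return statistics.variance(single), statistics.variance(ensemble)

single_var, ensemble_var = simulate(num_models=10)
print(single_var, ensemble_var)
```

Under these assumptions the averaged output varies far less than any single model's output, which is the kind of picture an introduction can give before any proof appears.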
Optimal decision boundary
[Figure: decision boundaries at times t0, t1, t2. Labels: “t1 & t2: errors”, “t0 & t2: no errors!”, “t0 & t1 & t2: errors”]
How to Present Technical Details?
• The top-down approach
– First give an overview of the algorithm
– Present details of the major steps
• The bottom-up approach
– Start from the critical details
– Summarize the discussion and present the algorithm
• The hybrid approach
– Top-down to partition the global problem
– Bottom-up to present solutions to sub-problems
How to Present Examples?
• Occam’s razor (the principle of parsimony)
– “One should not increase, beyond what is necessary, the number of entities required to explain anything”
• Find the simplest example that can show all the points you want to show
– Some data in running examples can be highly skewed
– Only select data that can show critical ideas
Worksheet of Running Example
• Work out the complete running example
• Select the interesting and critical segments
• Present multiple small examples in the paper
– Only one running example if possible
– Preferably several paragraphs in one example
– Don’t give a long, exhaustive example
– Each example should focus on one point
How to Present Algorithms?
• Choose the appropriate level of abstraction
– Operations obvious – omit them
• Readers have general CS background
– Complicated operations – function description
• The WWH sequence
– Why do we need such an operation?
– What is the operation?
– How can the operation be done efficiently?
Keep Your Algorithm Short
• Long algorithms are hard to understand
• Multi-level expansion of algorithms
– Use functions or procedures
• Ideally, each algorithm is less than 20 lines
• Control the complexity
– Don’t use too many variables
– Use meaningful variable names
– Use plain text to explain
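As an illustration of multi-level expansion, the sketch below keeps the top-level procedure to a few lines and pushes the detail into named helpers. The task (finding the most frequent words) and all function names are hypothetical, chosen only to show the layout.

```python
from collections import Counter

def tokenize(text):
    """Split text into lowercase words, dropping surrounding punctuation."""
    words = (word.strip(".,;:!?\"'()").lower() for word in text.split())
    return [word for word in words if word]

def count_words(words):
    """Count how often each word occurs."""
    return Counter(words)

def top_words(text, limit):
    """Top-level algorithm: each step is one call to a named helper."""
    words = tokenize(text)
    counts = count_words(words)
    return [word for word, _ in counts.most_common(limit)]

print(top_words("Read, read, read! Write, write.", 2))  # ['read', 'write']
```

The top-level function reads like the overview a paper would give, and a reader who wants the details can expand one helper at a time.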
Performance Study Goals
• “Wisconsin wallpaper”
• Clearly say why you designed and conducted the experiments
– Effectiveness measures
– Efficiency measures
– Other considerations
How to Present Experimental Results?
• Experiment settings
• Performance study goals
• Selected experimental results
– Explanation
• Summary of performance study
How to Handle Related Work?
• If possible, talk about related work at the end of the paper.
– Do not interrupt the flow of your story
• Extensive collection of related work
– Don’t forget to look at the latest results
– Go beyond your field, if possible
• Give sufficient credit to others
– We are standing on the shoulders of giants
– Avoid emotional words
– Be precise in comparison
• Point out critical points
– Use examples if necessary
What Should Be in Discussion?
• Related issues
– Constraints in your method
– Drawbacks
• Possible extensions
– Point out the other problems that can be solved straightforwardly using the proposed method
– Broader impact
• Future work if you have a detailed plan
Writing Strong Conclusions
• Summarize the paper briefly.
– What is the problem solved
– Major technical contributions
– Major findings and results
• Future work if possible
Aiming high!
Major DB/KDD Conferences
• DB (in my opinion)
– 1st tier: SIGMOD, VLDB, ICDE
– 2nd tier: EDBT, ICDT, CIKM, ER, SSDBM
– Regional: DASFAA, WAIM, British DB Conf, Australian DB Conf, Brazilian DB Conf, DEXA, …
• KDD (in my opinion)
– Top: KDD
– 2nd tier: SIAM Data Mining (SDM), ICDM
– Regional: PAKDD, PKDD, …
– KDD papers can be sent to DB & ML conferences
Reviewers’ Comments
• The conference review process is necessarily imperfect
• The reviewers operate under strict time constraints, and the committee must make quick decisions.
• Some good papers will be rejected and some embarrassing papers will be accepted.
Thank you!
My Paper Got Accepted!
• Congratulations!
• Address reviewers’ comments in the final version
– Adopt good points
– Clarify and remove confusion
• Prepare a nice talk and/or poster
– Convey the general idea
– Use examples wherever possible
– Use as little symbolic text as possible
Recycle a Paper
• Before publication, a paper is likely to go through several rejections
– SIGMOD, VLDB, ICDE acceptance is around 10%-15%
– A conference with a 25+% acceptance ratio may not be good
• Aim at the next chance
Learn from the Reviews
• Do we aim at the right target?
– If 2/3 of the reviewers are laymen in your subject, think seriously about whether the forum is right
• Address technical issues
– Respond to reviewers’ comments by revising/enhancing the technical description and experiments
• Improve writing
– Confused reviewers? Clarify the issues
– Correct any linguistic problems pointed out
Why Journal Papers?
• Records archived
• Important for degree, promotion, election, …
Conference vs. Journal Papers
• Length
– Journal papers are often longer
• Objectives
– Conference papers mainly convey the ideas and results
– Journal papers systematically report and justify the research, and are more formal
From Conference Papers to Journal Papers
• A critical requirement: “major value added”
– 30% in some journals, e.g., TODS, TKDE
– But, how to count?
• Some “major values”
– More detailed/complete examples
– Complete formal results and proofs
– Further variations and extensions of the method
– Triviality should be avoided
Steps Towards Good Research
• Motivations and problems
– More important than the solutions
• Re-search
– Systematic development of solutions
• Writing a good paper
– Careful design
• Submissions
– Good luck!