
How to conduct high quality research and write good papers

Haixun Wang

Microsoft Research Asia

2

What is research?

1. Solve a problem using existing methods. Write a README.txt. (low innovation, little impact)

2. Improve existing solutions to an existing problem. Write a tech report. (low innovation, little impact)

3. Create a new solution to an existing problem. Write a paper. (high innovation, low impact)

4. Identify a new problem. Generalize the solution. Write a paper. (high innovation, high impact)

Research and Engineering

• New Solutions Useful Solutions

3

How innovative are you?

4

5

• Why, if the Chinese had come to know so much about earthquakes so early on in their immensely long history, were they never able to minimize the effects of the world’s contortions — to at least the degree that America has?

• Why did they leave the West to become leaders in the field, and leave themselves to become mired, time and again, in the kind of tragic events that we are witnessing this week?

• It is a cruel irony that the children who died during the earthquake in Dujiangyan (都江堰), China, knew all too well that their country once led the world in the knowledge of the planet’s seismicity.

6

• There had been any number of Chinese Euclids and Archimedes but there was never to be a Chinese Newton or Galileo.

• Until this week Dujiangyan was a place of which China could be proud; today its wreckage stands as a tragic monument to a culture that turned its back on its remarkable and glittering history (of innovation).

• In almost every area of technology the Chinese were once supreme, without competition. And yet, in the 16th century China’s innovative energies inexplicably withered away, and modern science became the virtual monopoly of the West.

How to train your innovation?

7

Read, Read, Read

8

9

Malcolm Gladwell, Editor, New Yorker

10

10,000 hours of success

Excellence requires a minimum level of practice.

10,000 hours is the magic number

(3 hours per day for 10 years)
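The arithmetic behind the magic number can be checked with a minimal sketch. The function names are hypothetical, and a 365-day practice year with no days off is an assumption, not something the slide states.

```python
def total_practice_hours(hours_per_day, years, days_per_year=365):
    """Total hours accumulated at a steady daily rate (assumes no days off)."""
    return hours_per_day * days_per_year * years


def hours_per_day_needed(target_hours, years, days_per_year=365):
    """Daily rate required to reach target_hours within the given years."""
    return target_hours / (days_per_year * years)


if __name__ == "__main__":
    # 3 hours per day for 10 years is a little over the 10,000-hour mark.
    print(total_practice_hours(3, 10))                # 10950
    # Passing 10,000 hours in 7 years takes close to 4 hours every day.
    print(round(hours_per_day_needed(10_000, 7), 2))  # 3.91
```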

11

By the time Bill Gates dropped out of Harvard, he had been programming nonstop for seven years, which was way past 10,000 hours.

12

In the last 10 years, I spent more than 3 hours watching TV every day. How come I didn’t achieve anything?

13

Nicholas Carr, Atlantic Monthly, July 2008

14

Independent thinking

• the downfall of deep reading/thinking

• The Internet is rewiring our brains, forcibly adapting us to tolerate only bite-sized summations and simplified blips at the expense of deeper thought

• we risk turning into ‘pancake people’—spread wide and thin as we connect with that vast network of information accessed by the mere touch of a button.

15

How to train your creativity?

Write, Write, Write!

16

Research = Writing + Rewriting

• Turn your idea into writing before implementing it.

• Hard to write it down? Because you don’t understand the problem (or your idea).

– Writing forces us to be clear, focused

– Writing crystallises what we don’t understand

• Writing opens the way to dialogue with others: reality check, critique, and collaboration.

17

Research = Writing + Rewriting

• The process of writing and rewriting is the process of

– developing your idea

– generalizing your problem/solution

• After many rounds of rewriting, your problem (idea) may be totally different from the problem (idea) you started with

– more interesting and challenging

• It’s not a waste of time. It’s how you should spend your time when you do research.

How to find a topic?

The Theory of Flying Pigs

18

In Reality

– Pigs do not have to fly.

[ABSTRACT] In this paper, we identify the importance for pigs to fly. We show that many challenging tasks can be modeled by flying pigs. Thus, solving the flying pig problem benefits a large variety of applications.

20

[ABSTRACT] In this paper, we extend the pioneering work of flying pigs [1]. Our improvement enables pigs to fly higher.

[ABSTRACT] Recently, the flying pig problem has attracted significant attention [1, 2]. However, the pigs in previous work all fly very slowly. In this paper, we introduce a technique so that pigs can fly an order of magnitude faster.

22

and soon we have many papers …

24

What topic to work on?

• The choices you make will define your career

• No real problems at hand

– Get a proceedings volume. Read from the first page.

– Ask senior people what they are working on.

– Make it go faster/higher

• Find real problems, use real data

25

Is this topic meaningful?

• Convince yourself

– an issue of research ethics

• Talk to your colleagues

– Hey! I have a crazy idea

– Convince them

• Talk to/Read from people not in your field

– mathematicians, physicists, biologists, …

26

Database research as an example

• Databases have been one of the most successful fields in CS in terms of applications and industrial value!

• However, is there anything left for substantial database research?

– Relational database theory, a closing world?

– Too many index structures already?

27

Example: Data Model

• From: RDBMS

– Normalization is one of the cornerstones of RDBMS

– Theoretical results and practical applications

• To: XML

– Storage model: still an open problem

– hybrid database, Native XML support

28

Example: Logic Databases

• Logic databases were a hot topic in the ’80s and early ’90s

– models, semantics, magic sets, …

– many results have since been incorporated into RDBMS

– is Logic Database dead?

• Rejuvenated by semantic query processing

– ontology, description logics

29

Broadening the Scope

• Concern (VLDB Endowment meeting, ’98):

– The area of database research may lose the pivotal role it now plays among information system technologies

• Keep DB research current and relevant

– We should maintain a watch on trends and future directions in the general area of information management

• Can a traditionally non-DB/KDD research problem be treated using DB/KDD methods?

31

Writing techniques

• Overcome language barrier

• Paper structure and content

32

The Language Barrier

• One must first know the rules to break them

33

Some General Tips

• Choose the right word/phrase

• Use the active voice

• A picture is worth 10,000 words

• Use a fair amount of formalization

• The divide-and-conquer approach

• Keep it simple and stupid

34

Choose the right word/phrase

• Chicken without sexual life

• Husband and wife’s lung slice

• Bean curd made by a pockmarked woman

35

Use the active voice

• Ten Yuan will be paid for every one-time towel you use.

36

Use the active voice

NO: It can be seen that...
YES: We can see that...

NO: 34 tests were run
YES: We ran 34 tests

NO: These properties were thought desirable
YES: We wanted to retain these properties

NO: It might be thought that this would be a type error
YES: You might think this would be a type error

The passive voice is “respectable” but it DEADENS your paper. Avoid it at all costs.

“We” = you and the reader

“We” = the authors

“You” = the reader

Slide borrowed from Simon Peyton Jones

37


38

Be Specific

NO: We describe the WizWoz system. It is really cool.
YES: We give the syntax and semantics of a language that supports concurrent processes (Section 3). Its innovative features are...

NO: We study its properties
YES: We prove that the type system is sound, and that type checking is decidable (Section 4)

NO: We have used WizWoz in practice
YES: We have built a GUI toolkit in WizWoz, and used it to implement a text editor (Section 5). The result is half the length of the Java version.

From Simon Peyton Jones

39

Structure (conference paper)

• Title (1000 readers)

• Abstract (4 sentences, 100 readers)

• Introduction (1 page, 100 readers)

• The problem (1 page, 10 readers)

• My idea (2 pages, 10 readers)

• The details (5 pages, 3 readers)

• Related work (1-2 pages, 10 readers)

• Conclusions and further work (0.5 pages)

Slide borrowed from Simon Peyton Jones

40

An Attractive Abstract Counts

• Abstract is for people to skim through in one minute

– No technical details

– Plain English, easy to understand

– No assumption of DB/KDD background

– As short as possible

• What to write

– The problem, and why it is important and challenging

– Your technical thrust, progress and contributions

– Broader impact

• Write it last!

41

What Is a Good Introduction

• Starting from good stories

– Motivation: what is the problem and why is the problem important?

– 1-2 typical real-life applications

• Intuition and general ideas

– Intuition is most important!

– No technical details

– Understandable for a CS undergraduate

– Use clear, small examples

42

What Is a Good Introduction (2)

• Highlight major contributions

– Typical examples: identifying a new problem, novel solutions, a systematic performance study, …

– Only list the major ones, don’t overclaim

– Again, no technical details

– A road map of the rest of the paper

What’s the difference?

43

Hardcover: 1312 pages

Publisher: Wiley; 7th edition (June 20, 2001)

Language: English

ISBN-10: 0471381578

ISBN-13: 978-0471381570

Product Dimensions: 10.1 x 9.1 x 1.9 inches

Shipping Weight: 6.1 pounds

Pages: 378

Publication date: January 2004

ISBN: 7040137860

Barcode: 9787040137866

44

Writing a paper is like telling a story

• The goal of the title is to get the reader to read the abstract …

• The goal of the abstract is to get the reader to read the introduction …

• …

• You need a good setup … some suspense … then you unfold your story slowly …

45

Goal: creating suspense

• Reader thinks “gosh, if they can really deliver this, that’d be exciting. I’d better read on”

46

Create Suspense

Many years later, as he faced the firing squad, Colonel Aureliano Buendia was to remember that distant afternoon when his father took him to discover ice.

One Hundred Years of Solitude, by Gabriel García Márquez

47

Keep it Simple and Stupid

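“All night long the north wind blew hard.” (一夜北风紧)

Dream of the Red Chamber, by Cao Xueqin

“This line may be plain, and it gives no hint of what follows, but that is exactly how someone who knows poetry opens a poem. It is not only good; it also leaves endless room for those who come after.”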

48

An Example (SIGMOD’02)

49

Motivation Found!

Shifting Pattern {b,c,h,j,e}

Scaling Pattern {f,d,a,g,i}
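The slide only names the two pattern types. As a rough illustration, assume a shifting pattern means two rows differ by a constant offset on the chosen columns, and a scaling pattern means they differ by a constant factor. The function names and the tolerance parameter below are hypothetical, and this sketch is not the algorithm from the SIGMOD'02 paper.

```python
def is_shifting_pattern(x, y, tol=1e-9):
    """True if y[i] - x[i] is (nearly) the same constant for every i."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    diffs = [b - a for a, b in zip(x, y)]
    return max(diffs) - min(diffs) <= tol


def is_scaling_pattern(x, y, tol=1e-9):
    """True if y[i] / x[i] is (nearly) the same constant for every i.

    Assumes every entry of x is non-zero.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    ratios = [b / a for a, b in zip(x, y)]
    return max(ratios) - min(ratios) <= tol


if __name__ == "__main__":
    base = [1, 5, 3]
    print(is_shifting_pattern(base, [11, 15, 13]))  # True: every value is +10
    print(is_scaling_pattern(base, [2, 10, 6]))     # True: every value is x2
    print(is_shifting_pattern(base, [2, 10, 6]))    # False: offsets differ
```

Two rows can look far apart under a distance measure and still match one of these patterns, which is the motivation the slide points at.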

50

Is It Meaningful?

        CH1I   CH1B   CH1D   CH2I   CH2B   …
VPS8     401    281    120    275    298
SSA1     401    292    109    580    238
SP07     228    290     48    285    224
EFB1     318    280     37    277    215   …
MDM10    538    272    266    277    236
CYS3     322    288     41    278    219
DEP1     317    272     40    273    232   …
NTG1     329    296     33    274    228
…        …      …

51

Intuition Is the Most Important

• Example

– ensemble classifier for streams

• Why ensemble?

– Rigorous mathematical proof which shows ensemble reduces classification variance

• Many benefits

– High accuracy, ease of use, best approach in many aspects

• Result:

– paper rejected
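For intuition about why an ensemble can help, here is a small sketch under a strong assumption: the member classifiers err independently and each is correct with the same probability p. This is a textbook majority-vote calculation, not the proof from the rejected paper, and the function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p, k):
    """Probability that a majority of k independent classifiers is correct.

    Each classifier is assumed to be correct with probability p,
    independently of the others. k must be odd so the vote cannot tie.
    """
    if k % 2 == 0:
        raise ValueError("use an odd k so the vote cannot tie")
    need = k // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


if __name__ == "__main__":
    for k in (1, 3, 5):
        # With p = 0.7 the vote gets more reliable as k grows.
        print(k, round(majority_vote_accuracy(0.7, k), 5))
```

A three-line calculation like this often tells the reader more than a page of proof, which is the point of the slide.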

52

Optimal decision boundary

[Figure: optimal decision boundary at time steps t0, t1, t2. Annotations: t1 & t2: errors; t0 & t2: no errors!; t0 & t1 & t2: errors]

53

How to Present Technical Details?

• The top-down approach

– First give an overview of the algorithm

– Present details of the major steps

• The bottom-up approach

– Start from the critical details

– Summarize the discussion and present the algorithm

• The hybrid approach

– Top-down to partition the global problem

– Bottom-up to present solutions to sub-problems

54

How to Present Examples?

• Occam’s razor (the principle of parsimony)

– “One should not increase, beyond what is necessary, the number of entities required to explain anything”

• Find the simplest example that can show all the points you want to show

– Some data in running examples can be highly skewed

– Only select data that can show critical ideas

55

Worksheet of Running Example

• Work out the complete running example

• Select the interesting and critical segments

• Present multiple small examples in the paper

– Only one running example if possible

– Preferably several paragraphs in one example

– Don’t give a long, exhaustive example

– Each example should focus on one point

56

How to Present Algorithms?

• Choose the appropriate abstraction level

– Operations obvious – omit them

• Readers have general CS background

– Complicated operations – function description

• The WWH sequence

– Why do we need such an operation?

– What is the operation?

– How can the operation be done efficiently?

57

Keep Your Algorithm Short

• Long algorithms are hard to understand

• Multi-level expansion of algorithms

– Use functions or procedures

• Ideally, each algorithm is fewer than 20 lines

• Control the complexity

– Don’t use too many variables

– Use meaningful variable names

– Use plain text to explain
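The multi-level expansion above can be sketched as follows. The task (finding the k most frequent words) and every function name are hypothetical, chosen only to show a short top-level algorithm that defers detail to small, well-named helpers.

```python
from collections import Counter


def tokenize(text):
    """Split text into lowercase, whitespace-delimited words."""
    return text.lower().split()


def count_words(words):
    """Count how often each word occurs."""
    return Counter(words)


def select_top(counts, k):
    """Return the k most frequent words, breaking ties alphabetically."""
    return sorted(counts, key=lambda w: (-counts[w], w))[:k]


def top_k_words(text, k):
    """Top-level algorithm: three steps, each expanded in its own helper."""
    words = tokenize(text)
    counts = count_words(words)
    return select_top(counts, k)


if __name__ == "__main__":
    print(top_k_words("a b a c b a", 2))  # ['a', 'b']
```

The reader can follow top_k_words in one glance and look at a helper only when a step needs explaining.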

58

Performance Study Goals

• “Wisconsin wallpaper”

• Clearly say why you design and conduct the experiments

– Effectiveness measures

– Efficiency measures

– Other considerations

59

How to Present Experimental Results?

• Experiment settings

• Performance study goals

• Selected experimental results

– Explanation

• Summary of performance study

60

How to Handle Related Work?

• If possible, talk about related work at the end of the paper.

– Do not interrupt the flow of your story

• Extensive collection of related work

– Don’t forget to look at the latest results

– Go beyond your field, if possible

• Give sufficient credit to others

– We are standing on the shoulders of giants

– Avoid emotional words

– Be precise in comparison

• Point out critical points

– Use examples if necessary

61

What Should Be in Discussion?

• Related issues

– Constraints in your method

– Drawbacks

• Possible extensions

– Point out the other problems that can be solved straightforwardly using the proposed method

– Broader impact

• Future work if you have a detailed plan

62

Writing Strong Conclusions

• Summarize the paper briefly.

– What is the problem solved

– Major technical contributions

– Major findings and results

• Future work if possible

63

Aiming high!

Major DB/KDD Conferences

• DB (in my opinion)

– 1st tier: SIGMOD, VLDB, ICDE

– 2nd tier: EDBT, ICDT, CIKM, ER, SSDBM

– Regional: DASFAA, WAIM, British DB Conf, Australian DB Conf, Brazilian DB Conf, DEXA, …

• KDD (in my opinion)

– Top: KDD

– 2nd tier: SIAM DM (SDM), ICDM

– Regional: PAKDD, PKDD, …

– KDD papers can be sent to DB & ML conferences

64

Reviewers’ Comments

65

Reviewers’ Comments

• The conference review process is necessarily imperfect

• The reviewers operate under strict time constraints, and the committee must make quick decisions.

• Some good papers will be rejected and some embarrassing papers will be accepted.

66

Thank you!

67

My Paper Got Accepted!

• Congratulations!

• Address reviewers’ comments in the final version

– Adopt good points

– Clarify and remove sources of confusion

• Prepare a nice talk and/or poster

– Convey the general idea

– Use examples wherever possible

– Use as little symbolic text as possible

68

Recycle a Paper

• Before publication, a paper is likely to go through several rejections

– SIGMOD, VLDB, ICDE acceptance is around 10%-15%

– A conference with 25+% acceptance ratio may not be good

• Aim at the next chance

69

Learn from the Reviews

• Do we aim at the right target?

– If 2/3 of the reviewers are laymen in your subject, seriously reconsider the venue

• Address technical issues

– Respond to reviewers’ comments by revising/enhancing the technical description and experiments

• Improve writing

– Confused reviewers? Clarify the issues

– Correct any linguistic problems pointed out

70

Why Journal Papers?

• Records archived

• Important for degree, promotion, election, …

71

Conference vs. Journal Papers

• Length

– Journal papers are often longer

• Objectives

– Conference papers mainly convey the ideas and results

– Journal papers systematically report and justify the research, and are more formal

72

From Conference Papers to Journal Papers

• A critical requirement: “major value added”

– 30% in some journals, e.g., TODS, TKDE

– But, how to count?

• Some “major values”

– More detailed/complete examples

– Complete formal results and proofs

– Further variations and extensions of the method

– Triviality should be avoided

73

Steps Towards Good Research

• Motivations and problems

– More important than the solutions

• Re-search

– Systematic development of solutions

• Writing a good paper

– Careful design

• Submissions

– Good luck!