Relevance and Quality of Health Information on the Web Tim Tang DCS Seminar October, 2005.

Relevance and Quality of Health Information on the Web

Tim Tang

DCS Seminar

October, 2005

2

Outline

• Motivation and aims
• Experiments & Results
– Domain specific vs. general search
– A Quality Focused Crawler

• Conclusion & Future work

3

Why health information on the Web?

• Internet is a free medium
• High user demand for health information
• Health information of various quality
• Incorrect health advice is dangerous

4

Problems

• Normal definition of relevance: Topical relevance
• Normal way to search: Word-matching

Q: Are these applicable to health information?

A: Not completely. We also need quality, i.e., the usefulness of the information

5

Problem: Quality of health info

The varying quality of health information in search results

6

Wrong advice

7

Dangerous information

8

Dangerous information

9

Dangerous Information

10

Problem: Commercial sites

Health information for commercial purposes

11

Commercial promotion

12

Problem: Types of search engine

The difference between domain-specific search and general-purpose search.

13

Querying BPS

14

Querying Google: Irrelevant information

15

Problem of domain-specific portals

Domain-specific portals may be good, but …

They often require intensive effort to build and maintain (discussed further in Experiment 2)

16

Aims

• To analyse the relative performance of domain specific and general purpose search engines

• To discover how to provide effective domain specific search, particularly in the health domain

• To automate the quality assessment of medical web sites

17

Two experiments

• First: Compare search results for health info between general and domain specific engines

• Second: Build and evaluate a Quality focused crawler for a health topic

18

The First Experiment

A Comparison of the relative performance of general purpose search engines and domain-specific search engines

In Journal of Information Retrieval ‘05 – Special Issue

with Nick Craswell, Dave Hawking, Kathy Griffiths and Helen Christensen

19

Domain specific vs. General engines

• General search engines: Google, Yahoo, MSN search, …

• Domain specific: Search service for scientific papers, search service for health, or a topic in the health domain.

• A depression portal: BluePages (http://bluepages.anu.edu.au)

20

BluePages Search (BPS)

21

BPS result list

22

Engines

– Google
– GoogleD (Google with “depression”)
– BPS
– 4sites (4 high quality depression sites)
– HealthFinder (HF): A health portal search named Health Finder
– HealthFinderD (HFD): HF with depression

23

Queries

• 101 queries about depression:
– 50 treatment queries suggested by domain experts
– 51 non-treatment queries collected from 2 query logs: domain-specific query log and general query log.
• Examples:
– Treatment queries: acupuncture, antidepressant, chocolate
– Non-treatment queries: depression symptoms, clinical depression

24

Experiment details

• Run the 101 queries on the 6 engines.
• For each query, top 10 results from each engine are collected.
• All results were judged by research assistants: degrees of relevance, recommendation of advice
• Relevance and quality for all engines were then compared

25

Results

Engine    Relevance   Quality
GoogleD   0.407       78
BPS       0.319       127
4sites    0.225       143
Google    0.195       28
HFS       0.0756      0

26

Findings

• Google is not good in either relevance or quality
• GoogleD can retrieve more relevant pages, but fewer high quality pages.
• 4sites and BPS provide good quality but have poor coverage.

It’s important to have a domain-specific portal which provides both high quality and high coverage. How to improve coverage?

27

Experiment 2

Building a high quality domain-specific portal using focused crawling techniques

In CIKM ’05

With Dave Hawking, Nick Craswell, Kathy Griffiths

28

A Quality Focused Crawler

• Why?
– The first experiment shows: Quality can be achieved using domain specific portals
– The current method for building such a portal is expensive.
– Focused crawling may be a good way to build a health portal with high coverage, while reducing human effort.

29

The problems of BPS

• Domain experts spent two weeks manually judging health sites to decide what to include.

• Only 207 Web sites are included, so a lot of useful web pages are left out.

• Tedious maintenance process: Web pages change, cease to exist, new pages appear, etc.

• Also, the first experiment shows: High quality but quite low coverage.

30

Focused Crawling (FC)

• Designed to selectively fetch content relevant to a specified topic of interest using the Web’s hyperlink structure.

• Examples of topics: sport, health, cancer, or scientific papers, etc.

31

FC Process

[Diagram: URL Frontier -> (dequeue) -> Download -> Link extractor -> {URLs, link info} -> Classifier -> {URLs, scores} -> (enqueue) -> URL Frontier]

Link info = anchor text, URL, source page’s content, and so on.
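
The loop in the diagram can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the crawler used in the experiments: `fetch`, `extract_links` and `score_link` are hypothetical callables supplied by the caller, and the frontier is a simple priority queue built on the standard `heapq` module.

```python
import heapq
from typing import Callable, Iterable, List, Set, Tuple


def focused_crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], str],  # hypothetical: URL -> page text
    extract_links: Callable[[str, str], List[Tuple[str, str]]],  # hypothetical: (URL, page) -> [(target URL, link info)]
    score_link: Callable[[str, str], float],  # hypothetical classifier: (URL, link info) -> score
    max_pages: int = 100,
) -> List[str]:
    """Crawl in order of predicted score; return URLs in download order."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier: List[Tuple[float, str]] = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen: Set[str] = {url for _, url in frontier}
    downloaded: List[str] = []

    while frontier and len(downloaded) < max_pages:
        _, url = heapq.heappop(frontier)                 # dequeue
        page = fetch(url)                                # download
        downloaded.append(url)
        for target, info in extract_links(url, page):    # link extractor
            if target in seen:
                continue
            seen.add(target)
            score = score_link(target, info)             # classifier
            heapq.heappush(frontier, (-score, target))   # enqueue
    return downloaded
```

A breadth-first baseline would replace the priority queue with a FIFO queue and ignore the scores.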

32

FC: simple example

• Crawling pages about psychotherapy

33

Relevance prediction

• anchor text: text appearing in a hyperlink
• text around the link: 50 bytes before and after the link
• URL words: parse the URL address

34

Relevance Indicators

• URL: http://www.depression.com/psychotherapy.html

=> URL words: depression, com, psychotherapy

• Anchor text: psychotherapy

• Text around the link:

– 50 bytes before: section, learn

– 50 bytes after: talk, therapy, standard, treatment
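
The indicators above could be extracted along the following lines. This is a sketch, not the feature extractor used in the study: the stop list `URL_STOPWORDS` is an assumption chosen so that the example URL yields the words shown on the slide, and the context window is measured in characters, which equals bytes only for ASCII text.

```python
import re
from typing import Dict, List

# Assumption: URL tokens that carry no topical signal are dropped.
URL_STOPWORDS = {"http", "https", "www", "html", "htm"}


def url_words(url: str) -> List[str]:
    """Split a URL on non-alphanumeric characters and drop boilerplate tokens."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return [t for t in tokens if t and t not in URL_STOPWORDS]


def link_features(text: str, anchor_start: int, anchor_end: int, window: int = 50) -> Dict[str, str]:
    """Return the anchor text plus `window` characters on either side of it.

    `text` is assumed to be page text with markup already stripped, with the
    anchor occupying text[anchor_start:anchor_end].
    """
    return {
        "anchor": text[anchor_start:anchor_end],
        "before": text[max(0, anchor_start - window):anchor_start],
        "after": text[anchor_end:anchor_end + window],
    }


print(url_words("http://www.depression.com/psychotherapy.html"))
# ['depression', 'com', 'psychotherapy']
```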

35

Methods

• Machine learning approach: Train and test classifiers on relevant and irrelevant URLs using the features discussed above.

• Evaluated different learning algorithms: k-nearest neighbor, Naïve Bayes, C4.5, Perceptron.

• Result: The C4.5 decision tree was the best at predicting relevance.

• The same method was applied to predict quality, but it was not successful.
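
To make the learning step concrete, here is a small pure-Python perceptron over link feature words. The slides report that C4.5 performed best; the perceptron is used here only because it is one of the evaluated algorithms that fits in a few lines. The training examples are hypothetical and exist purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Example = Tuple[List[str], int]  # (link feature words, label: 1 relevant, 0 irrelevant)


def predict(weights: Dict[str, float], words: List[str]) -> int:
    """Return 1 (relevant) when the bias plus summed word weights is positive."""
    total = weights.get("<bias>", 0.0) + sum(weights.get(w, 0.0) for w in set(words))
    return 1 if total > 0 else 0


def train_perceptron(examples: Iterable[Example], epochs: int = 10) -> Dict[str, float]:
    """Learn one weight per word (plus a bias) with the classic perceptron rule."""
    weights: Dict[str, float] = defaultdict(float)
    data = list(examples)
    for _ in range(epochs):
        for words, label in data:
            if predict(weights, words) != label:
                step = 1.0 if label == 1 else -1.0
                weights["<bias>"] += step
                for w in set(words):
                    weights[w] += step
    return weights


# Hypothetical training data: URL words and anchor text for four links.
training = [
    (["depression", "com", "psychotherapy"], 1),
    (["depression", "treatment", "antidepressant"], 1),
    (["cheap", "flights", "com"], 0),
    (["football", "scores", "news"], 0),
]
model = train_perceptron(training)
```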

36

Quality prediction

• Using evidence-based medicine, and

• Using Relevance Feedback (RF) technique

37

Evidence-based Medicine

• Interventions that are supported by a systematic review of the evidence as effective.

• Examples of effective treatments for depression:
– Antidepressants
– ECT (electroconvulsive therapy)
– Exercise
– Cognitive behavioral therapy

• These treatments were divided into single and 2-word terms.

38

Relevance Feedback

• Well-known IR approach of query by examples.
• Basic idea: Do an initial query, get feedback from users about what documents are relevant, then add words from relevant documents to the query.

• Goal: Add terms to the query in order to get more relevant results.

39

RF Algorithm

• Identify the N top-ranked documents
• Identify all terms from these documents
• Select the terms with highest weights
• Merge these terms with the original query
• Identify the new top-ranked documents for the new query

(Usually, 20 terms are added in total)
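
The first four steps can be sketched as a query expansion function. The slides do not state how terms are weighted, so the weighting below (term frequency in the feedback documents multiplied by a smoothed inverse document frequency) is an assumption, as are the function and parameter names. The final step, re-running the search with the expanded query, is left to whatever search engine is in use.

```python
import math
from collections import Counter
from typing import List, Sequence


def expand_query(
    query: List[str],
    ranked_docs: Sequence[List[str]],  # tokenised documents, best first
    collection: Sequence[List[str]],   # all tokenised documents, for document frequencies
    top_n: int = 10,
    num_terms: int = 20,
) -> List[str]:
    """Add the highest-weighted terms of the top-ranked documents to the query."""
    feedback = ranked_docs[:top_n]
    tf = Counter(term for doc in feedback for term in doc)
    df = Counter(term for doc in collection for term in set(doc))
    n_docs = len(collection)
    # Assumed weighting: frequency in feedback documents times a smoothed IDF.
    weights = {t: tf[t] * math.log((n_docs + 1) / (df[t] + 0.5)) for t in tf}
    # Sort by descending weight, breaking ties alphabetically for determinism.
    best = sorted(weights, key=lambda t: (-weights[t], t))
    new_terms = [t for t in best if t not in query][:num_terms]
    return query + new_terms
```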

40

Our Modified RF approach

• Not for relevance, but for quality
• Not only single terms, but also phrases
• Generate a list of single terms and 2-word phrases and their associated weights
• Select the top weighted terms and phrases
• Cut off the list at the lowest-ranked term that appears in the evidence-based treatment list
• 20 phrases and 29 single words form a ‘quality query’
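
The cut-off rule can be illustrated with a short sketch. The weights are those listed in the "depression" term table on the following slide; the extra term "contact" and the exact contents of `treatments` are assumptions added so the cut-off has a visible effect.

```python
from typing import Dict, List, Set, Tuple


def quality_query(weights: Dict[str, float], treatment_terms: Set[str]) -> List[Tuple[str, float]]:
    """Keep ranked terms down to the lowest-ranked one found in `treatment_terms`."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    cutoff = -1
    for position, (term, _) in enumerate(ranked):
        if term.lower() in treatment_terms:
            cutoff = position  # remember the lowest-ranked match so far
    return ranked[: cutoff + 1]


# Weights copied from the term table; "contact" is a hypothetical
# low-weight term that falls below the cut-off.
weights = {
    "Depression": 13.3, "Health": 6.9, "Treatment": 5.7, "Mental": 5.4,
    "patient": 3.3, "Medication": 3.0, "ECT": 2.4, "antidepressants": 1.9,
    "Mental health": 1.2, "Cognitive therapy": 0.84, "contact": 0.1,
}
# Assumed evidence-based treatment terms, lower-cased for matching.
treatments = {"antidepressants", "ect", "exercise", "cognitive therapy"}
query = quality_query(weights, treatments)
```

With these inputs the list ends at "Cognitive therapy", so "contact" is excluded from the quality query.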

41

Terms representing the topic “depression”

Term                Weight
Depression          13.3
Health              6.9
Treatment           5.7
Mental              5.4
patient             3.3
Medication          3
ECT                 2.4
antidepressants     1.9
Mental health       1.2
Cognitive therapy   0.84

42

Predicting Quality

• For downloaded pages, quality score (QScore) is computed using a modification of the BM25 formula, taking into account term weights.

• Quality of a page is then predicted based on the quality of all downloaded pages linking to it.

(Assumption: Good pages are usually inter-connected)

• Predicted quality score of a page with n downloaded source pages:

PScore = (Σ QScore) / n
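
The averaging step is small enough to show directly. The sketch below assumes the QScore of each downloaded page is already available in a dictionary; the modified BM25 computation itself is not reproduced because the slides do not give its details. Returning `None` for a page with no downloaded source pages is a choice made for this example.

```python
from typing import Dict, Iterable, Optional


def predicted_quality(source_pages: Iterable[str], qscore: Dict[str, float]) -> Optional[float]:
    """Average the QScore of the downloaded pages that link to a target page."""
    scores = [qscore[page] for page in source_pages if page in qscore]
    if not scores:
        return None  # no downloaded source page, so no prediction
    return sum(scores) / len(scores)
```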

43

Combining relevance and quality

• Need to have a way of balancing relevance and quality
• Quality and relevance score combination is new
• Our method uses a product of the two scores
• Other ways to combine these scores will be explored in future work
• A quality focused crawler relies on this combined score to order the crawl queue
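
A minimal sketch of ordering the crawl queue by the product of the two scores is given below. The function names and the tuple layout of `candidates` are assumptions made for this example.

```python
import heapq
from typing import List, Tuple


def combined_score(relevance: float, quality: float) -> float:
    """Product of the two scores, as described on the slide."""
    return relevance * quality


def order_queue(candidates: List[Tuple[str, float, float]]) -> List[str]:
    """Return URLs best-first; each candidate is (url, relevance, quality)."""
    # heapq is a min-heap, so the combined score is negated.
    heap = [(-combined_score(r, q), url) for url, r, q in candidates]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

With a product, a URL must score reasonably on both relevance and quality to reach the front of the queue; a high score on one cannot fully compensate for a near-zero score on the other.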

44

The Three Crawlers

• A Web crawler (spider):
– A program which browses the WWW in a methodical, automated manner
– Usually used by a search engine to index web pages to provide fast searches.
• We built three crawlers:
– The Breadth-first crawler: Traverses the link graph in a FIFO fashion (serves as baseline for comparison)
– The Relevance crawler: For relevance purpose, ordering the crawl queue using the C4.5 decision tree
– The Quality crawler: For both relevance and quality, ordering the crawl queue using the combination of the C4.5 decision tree and RF techniques.

45

Results

46

Relevance

47

Relevance Results

• The relevance and quality crawls each stabilised after 3000 pages, at 80% and 88% relevance respectively.

• The BF crawl continued to degrade over time, falling to 40% relevance at 10,000 pages.

• The quality crawler outperformed the relevance crawler due to the incorporation of the RF quality scores.

48

Quality

49

High quality pages

AAQ = Above Average Quality: top 25%

50

Low quality pages

BAQ = Below Average Quality: bottom 25%

51

Quality Results

• The quality crawler performed significantly better than the relevance crawler. (50% better towards the end of the crawl)

• All the crawls did well in crawling high quality pages. The quality crawler performed very well, with more than 50% of its pages being high quality.

• Only about 5% of the pages in the quality crawl come from low quality sites, while the BF crawl has about 3 times as many.

52

Findings

• Topical-relevance could be well predicted using link anchor context.

• Link anchor context could not be used to predict quality.

• The relevance feedback technique proved useful in quality prediction.

53

Overall Conclusions

• Domain-specific search engines could offer better quality of results than general search engines.

• The current way to build a domain-specific portal is expensive. We have successfully used focused crawling techniques, a relevance decision tree and the relevance feedback technique to build high-quality portals cheaply.

54

Future work

• So far we have only experimented with one health topic. Our plan is to repeat the same experiments with another topic, and generalise the technique to another domain.

• Other ways of combining relevance and quality should be explored.

• Experiments to compare our quality crawl with other health portals are necessary.

• How to remove spam from the crawl is another important step.
