Relevance and Quality of Health Information on the Web Tim Tang DCS Seminar October, 2005
Dec 21, 2015
Relevance and Quality of Health Information on the Web
Tim Tang
DCS Seminar
October, 2005
2
Outline
• Motivation and aims
• Experiments & Results
– Domain specific vs. general search
– A Quality Focused Crawler
• Conclusion & Future work
3
Why health information on the Web?
• Internet is a free medium
• High user demand for health information
• Health information of various quality
• Incorrect health advice is dangerous
4
Problems
• Normal definition of relevance: Topical relevance
• Normal way to search: Word-matching
Q: Are these applicable to health information?
A: Not completely. We also need quality, i.e. the usefulness of the information.
5
Problem: Quality of health info
The varying quality of health information in search results
6
Wrong advice
7
Dangerous information
8
Dangerous information
9
Dangerous Information
10
Problem: Commercial sites
Health information for commercial purposes
11
Commercial promotion
12
Problem: Types of search engine
The difference between domain-specific search and general-purpose search.
13
Querying BPS
14
Querying Google: Irrelevant information
15
Problem of domain-specific portals
Domain-specific portals may be good, but …
They often require intensive effort to build and maintain (discussed further in Experiment 2)
16
Aims
• To analyse the relative performance of domain specific and general purpose search engines
• To discover how to provide effective domain specific search, particularly in the health domain
• To automate the quality assessment of medical web sites
17
Two experiments
• First: Compare search results for health info between general and domain specific engines
• Second: Build and evaluate a Quality focused crawler for a health topic
18
The First Experiment
A comparison of the relative performance of general purpose search engines and domain-specific search engines
In Journal of Information Retrieval ‘05 – Special Issue
with Nick Craswell, Dave Hawking, Kathy Griffiths and Helen Christensen
19
Domain specific vs. General engines
• General search engines: Google, Yahoo, MSN search, …
• Domain specific: Search service for scientific papers, search service for health, or a topic in the health domain.
• A depression portal: BluePages (http://bluepages.anu.edu.au)
20
BluePages Search (BPS)
21
BPS result list
22
Engines
– Google
– GoogleD (Google with “depression”)
– BPS
– 4sites (4 high quality depression sites)
– HealthFinder (HF): A health portal search named Health Finder
– HealthFinderD (HFD): HF with depression
23
Queries
• 101 queries about depression:
– 50 treatment queries suggested by domain experts
– 51 non-treatment queries collected from 2 query logs: a domain-specific query log and a general query log.
• Examples:
– Treatment queries: acupuncture, antidepressant, chocolate
– Non-treatment queries: depression symptoms, clinical depression
24
Experiment details
• The 101 queries were run on the 6 engines.
• For each query, the top 10 results from each engine were collected.
• All results were judged by research assistants for degree of relevance and for the advice they recommend.
• Relevance and quality for all engines were then compared.
25
Results
Engine    Relevance  Quality
GoogleD   0.407      78
BPS       0.319      127
4sites    0.225      143
Google    0.195      28
HFS       0.0756     0
26
Findings
• Google is not good at either relevance or quality.
• GoogleD can retrieve more relevant pages, but fewer high quality pages.
• 4sites and BPS provide good quality but have poor coverage.
It’s important to have a domain-specific portal which provides both high quality and high coverage. How to improve coverage?
27
Experiment 2
Building a high quality domain-specific portal using focused crawling techniques
In CIKM ’05
With Dave Hawking, Nick Craswell, Kathy Griffiths
28
A Quality Focused Crawler
• Why?
– The first experiment shows: Quality can be achieved using domain specific portals
– The current method for building such a portal is expensive.
– Focused crawling may be a good way to build a health portal with high coverage, while reducing human effort.
29
The problems of BPS
• Domain experts spent two weeks manually judging health sites to decide what to include.
• Only 207 Web sites are included, so a lot of useful web pages are left out.
• Tedious maintenance process: Web pages change, cease to exist, new pages appear, etc.
• Also, the first experiment shows: High quality but quite low coverage.
30
Focused Crawling (FC)
• Designed to selectively fetch content relevant to a specified topic of interest using the Web’s hyperlink structure.
• Examples of topics: sport, health, cancer, or scientific papers, etc.
31
FC Process
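[Diagram: a crawl loop. URLs are dequeued from the URL Frontier and downloaded; the link extractor passes {URLs, link info} to the classifier; the classifier enqueues {URLs, scores} back onto the URL Frontier.]
Link info = anchor text, URL, source page’s content, and so on.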
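The crawl loop on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the crawler from the talk: the in-memory `WEB` dictionary stands in for downloading pages, and `score_link` is a hypothetical stand-in for the classifier that scores only the anchor text.

```python
import heapq

# Toy "web": page URL -> list of (target URL, anchor text).
# Assumption: this replaces real downloading and link extraction.
WEB = {
    "seed": [("a", "depression treatment"), ("b", "football scores")],
    "a": [("c", "psychotherapy for depression")],
    "b": [("d", "cricket")],
    "c": [],
    "d": [],
}

TOPIC_WORDS = {"depression", "psychotherapy", "treatment"}


def score_link(anchor_text):
    """Hypothetical classifier: fraction of anchor words that are topic words."""
    words = anchor_text.lower().split()
    return sum(w in TOPIC_WORDS for w in words) / len(words) if words else 0.0


def focused_crawl(seed, max_pages):
    """Visit up to max_pages pages, always taking the best-scored URL next."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)           # dequeue
        visited.append(url)                        # "download"
        for target, anchor in WEB.get(url, []):    # link extractor
            if target not in seen:
                seen.add(target)
                # classifier score decides the position in the frontier (enqueue)
                heapq.heappush(frontier, (-score_link(anchor), target))
    return visited
```

With this toy graph, the on-topic pages `a` and `c` are fetched before the off-topic pages `b` and `d`, which is the behaviour a focused crawler aims for.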
32
FC: simple example
• Crawling pages about psychotherapy
33
Relevance prediction
• anchor text: text appearing in a hyperlink
• text around the link: 50 bytes before and after the link
• URL words: parse the URL address
34
Relevance Indicators
• URL: http://www.depression.com/psychotherapy.html
=> URL words: depression, com, psychotherapy
• Anchor text: psychotherapy
• Text around the link:
– 50 bytes before: section, learn
– 50 bytes after: talk, therapy, standard, treatment
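The three indicators can be extracted with a small amount of string handling. The sketch below is illustrative only: the function names and the `URL_STOPWORDS` list are assumptions chosen so that the example URL yields the words shown on the slide, and the window is measured in characters rather than bytes for simplicity.

```python
import re

# Assumed stoplist: tokens dropped so that the slide's example URL
# produces "depression, com, psychotherapy".
URL_STOPWORDS = {"http", "https", "www", "html", "htm"}


def url_words(url):
    """Split a URL on non-alphanumeric characters and drop boilerplate tokens."""
    tokens = re.split(r"[^A-Za-z0-9]+", url.lower())
    return [t for t in tokens if t and t not in URL_STOPWORDS]


def link_context(page_text, anchor_start, anchor_end, window=50):
    """Return (before, anchor, after) for the anchor at the given offsets.

    `before` and `after` hold up to `window` characters on each side.
    """
    before = page_text[max(0, anchor_start - window):anchor_start]
    anchor = page_text[anchor_start:anchor_end]
    after = page_text[anchor_end:anchor_end + window]
    return before, anchor, after
```

For example, `url_words("http://www.depression.com/psychotherapy.html")` gives `["depression", "com", "psychotherapy"]`, matching the slide.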
35
Methods
• Machine learning approach: Train and test classifiers on relevant and irrelevant URLs using the features described above.
• Evaluated different learning algorithms: k-nearest neighbor, Naïve Bayes, C4.5, Perceptron.
• Result: The C4.5 decision tree was the best at predicting relevance.
• The same method was applied to predict quality, but it was not successful.
36
Quality prediction
• Using evidence-based medicine, and
• Using Relevance Feedback (RF) technique
37
Evidence-based Medicine
• Interventions that a systematic review of the evidence supports as effective.
• Examples of effective treatments for depression:
– Antidepressants
– ECT (electroconvulsive therapy)
– Exercise
– Cognitive behavioral therapy
• These treatments were divided into single and 2-word terms.
38
Relevance Feedback
• Well-known IR approach of query by example.
• Basic idea: Do an initial query, get feedback from users about what documents are relevant, then add words from the relevant documents to the query.
• Goal: Add terms to the query in order to get more relevant results.
39
RF Algorithm
• Identify the N top-ranked documents
• Identify all terms from these documents
• Select the terms with highest weights
• Merge these terms with the original query
• Identify the new top-ranked documents for the new query
(Usually, 20 terms are added in total)
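The term-selection steps of this algorithm can be sketched as follows. The slide does not say how terms are weighted, so this sketch assumes a deliberately simple weight: the number of top-ranked documents that contain the term. The function name and parameters are hypothetical, and the final retrieval step (re-running the expanded query) is left out.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, n_docs=3, n_terms=20):
    """Add the highest-weighted terms from the top documents to the query.

    Assumed weight: how many of the top n_docs documents contain the term.
    """
    top_docs = ranked_docs[:n_docs]            # the N top-ranked documents
    weights = Counter()
    for doc in top_docs:                       # all terms from these documents
        weights.update(set(doc.lower().split()))
    candidates = [(t, w) for t, w in weights.items() if t not in query_terms]
    # highest weight first; ties broken alphabetically for a stable result
    candidates.sort(key=lambda tw: (-tw[1], tw[0]))
    added = [t for t, _ in candidates[:n_terms]]
    return list(query_terms) + added           # merged with the original query
```

A real system would use a proper term-weighting scheme; the document-count weight here only keeps the example short.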
40
Our Modified RF approach
• Not for relevance, but for quality
• Not only single terms, but also phrases
• Generate a list of single terms and 2-word phrases and their associated weights
• Select the top weighted terms and phrases
• Cut the list off at the lowest-ranked term that appears in the evidence-based treatment list
• 20 phrases and 29 single words form a ‘quality query’
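The cut-off rule can be illustrated with a short sketch. The function name is hypothetical, the weights in the usage example are a handful taken from the term table in this talk, and the evidence list is a shortened, lower-cased version of the treatments named earlier.

```python
def quality_query(weighted_terms, evidence_terms):
    """Keep ranked terms down to the lowest-ranked one found in evidence_terms.

    weighted_terms: dict mapping term or phrase -> weight.
    evidence_terms: set of lower-cased evidence-based treatment terms.
    """
    ranked = sorted(weighted_terms.items(), key=lambda tw: -tw[1])
    # index of the lowest-ranked term that is an evidence-based treatment
    last = max(
        (i for i, (term, _) in enumerate(ranked) if term.lower() in evidence_terms),
        default=-1,
    )
    return ranked[:last + 1]


terms = {
    "depression": 13.3,
    "health": 6.9,
    "ect": 2.4,
    "antidepressants": 1.9,
    "mental health": 1.2,
}
evidence = {"antidepressants", "ect", "exercise"}
selected = quality_query(terms, evidence)
```

Here `antidepressants` is the lowest-ranked evidence-based term, so `mental health` falls below the cut-off and is dropped.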
41
Terms representing the topic “depression”
Term               Weight
Depression         13.3
Health             6.9
Treatment          5.7
Mental             5.4
patient            3.3
Medication         3
ECT                2.4
antidepressants    1.9
Mental health      1.2
Cognitive therapy  0.84
42
Predicting Quality
• For downloaded pages, quality score (QScore) is computed using a modification of the BM25 formula, taking into account term weights.
• Quality of a page is then predicted based on the quality of all downloaded pages linking to it.
(Assumption: Good pages are usually inter-connected)
• Predicted quality score of a page with n downloaded source pages:
PScore = ΣQScore/n
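As a worked example of the PScore formula, the sketch below averages the QScores of the downloaded pages that link to each target. The function name and data layout are assumptions; the QScore values are taken as given, since the modified BM25 formula is not spelled out here.

```python
def predicted_scores(qscore, links):
    """Compute PScore for every link target with a downloaded source page.

    qscore: dict mapping downloaded page -> QScore.
    links:  iterable of (source, target) pairs.
    Sources without a QScore (not yet downloaded) are ignored.
    """
    incoming = {}
    for source, target in links:
        if source in qscore:
            incoming.setdefault(target, []).append(qscore[source])
    # PScore = sum of source QScores / number of downloaded sources
    return {target: sum(scores) / len(scores) for target, scores in incoming.items()}


qscore = {"p1": 2.0, "p2": 4.0}
links = [("p1", "x"), ("p2", "x"), ("p2", "y"), ("p3", "y")]
pscores = predicted_scores(qscore, links)
```

Page `x` has two downloaded sources and gets (2.0 + 4.0) / 2 = 3.0; page `y` has one downloaded source (`p3` has no QScore) and gets 4.0.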
43
Combining relevance and quality
• Need to have a way of balancing relevance and quality
• Quality and relevance score combination is new
• Our method uses a product of the two scores
• Other ways to combine these scores will be explored in future work
• A quality focused crawler relies on this combined score to order the crawl queue
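The product combination and the resulting queue order can be sketched briefly. The function names and the example scores are made up for illustration; only the use of a product comes from the slide.

```python
def combined_score(relevance, quality):
    """Combine the two scores with a product, as described on the slide."""
    return relevance * quality


def order_queue(candidates):
    """Sort (url, relevance, predicted quality) triples, best combined score first."""
    ranked = sorted(candidates, key=lambda c: -combined_score(c[1], c[2]))
    return [url for url, _, _ in ranked]


# Hypothetical scores for three candidate URLs.
queue = order_queue([("a", 0.9, 0.2), ("b", 0.6, 0.6), ("c", 0.3, 0.9)])
```

With a product, a URL that is weak on either score is pushed down: `a` is the most relevant candidate but ends up last because its predicted quality is low.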
44
The Three Crawlers
• A Web crawler (spider):
– A program which browses the WWW in a methodical, automated manner
– Usually used by a search engine to index web pages to provide fast searches.
• We built three crawlers:
– The Breadth-first crawler: Traverses the link graph in a FIFO fashion (serves as baseline for comparison)
– The Relevance crawler: Targets relevance, ordering the crawl queue using the C4.5 decision tree
– The Quality crawler: Targets both relevance and quality, ordering the crawl queue using the combination of the C4.5 decision tree and RF techniques.
45
Results
46
Relevance
47
Relevance Results
• The relevance and quality crawls each stabilised after 3000 pages, at 80% and 88% relevance respectively.
• The BF crawl continued to degrade over time, dropping to 40% relevance at 10,000 pages.
• The quality crawler outperformed the relevance crawler due to the incorporation of the RF quality scores.
48
Quality
49
High quality pages
AAQ = Above Average Quality: top 25%
50
Low quality pages
BAQ = Below Average Quality: bottom 25%
51
Quality Results
• The quality crawler performed significantly better than the relevance crawler. (50% better towards the end of the crawl)
• All the crawls did well in crawling high quality pages. The quality crawler performed very well, with more than 50% of its pages being high quality.
• Only about 5% of the quality crawl's pages come from low quality sites, while the BF crawl has about 3 times as many.
52
Findings
• Topical-relevance could be well predicted using link anchor context.
• Link anchor context could not be used to predict quality.
• The relevance feedback technique proved useful for quality prediction.
53
Overall Conclusions
• Domain-specific search engines could offer better quality of results than general search engines.
• The current way to build a domain-specific portal is expensive. We have successfully combined focused crawling, a relevance decision tree and relevance feedback to build high-quality portals cheaply.
54
Future work
• So far we have experimented with only one health topic. Our plan is to repeat the same experiments with another topic, and generalise the technique to another domain.
• Other ways of combining relevance and quality should be explored.
• Experiments to compare our quality crawl with other health portals are necessary.
• Removing spam from the crawl is another important step.