
News Topic Discovery Based On Key Phrase Identification Algorithm

ICOWOBAS 2011, Surabaya, September 21st-23rd 2011

Indra Kharisma

Airlangga University

Information Systems Study Group

Department of Mathematics


Why Do People Read Newspapers?

• For information and interpretation of public affairs
  - The "serious" side of newspaper coverage

• To provide a framework for daily living
  - Lists of events and advertisements published on a daily basis

• For respite
  - The relaxation or entertainment value of the daily newspaper

• For social prestige
  - Many readers felt the newspaper was important not just because it gave them information, but because it enabled them to appear more informed at social gatherings.

• For social contact
  - Human interest stories, personal advice columns, gossip pieces and their variety provided much more than respite from daily routine


How to read news online

• Latest news

• Popular news

• News walking

• By category

• By topic


Read News By Topic?


News Topic Identification

• Manual

• Automatic topic discovery
  - Unsupervised Topic Discovery [3], which uses the TF-IDF measure to select topic labels
  - Key Entities and Significant Events Extraction [4], which uses clustering to extract significant events from key entities

• This research explores extracting news topics using the key phrase identification algorithm.


What is Key Phrase Identification?

• Key phrase identification is a technique for improving the effectiveness of searching the World Wide Web for documents relevant to a given topic of interest. The algorithm was proposed by Shamim Khan and Sebastian Khor [5].


Functions Similar to Key Phrase Identification


How Does Key Phrase Identification Work?

• The algorithm is based on the following definition: a phrase is a sequence of up to four contiguous words which does not contain any verbs or verb phrases and does not start or end with a stop-word (e.g. a, the, and) [5].

• Based on that definition, the Key Phrase Identification algorithm identifies words that can form parts of a noun phrase.

• It uses five sets of words for tagging purposes:
  - a set of common verbs
  - a set consisting of a mixture of adverbs, prepositions, conjunctions, pronouns, and the two adjectives able and unable
  - a set of words and parts of words that identify numbers or indicate positional ranking
  - a set of words that indicate that the preceding word is likely to be a noun
  - a set of words that are used as indicators of the likelihood of the following word being a noun

• The Key Phrase Identification rules are then executed to acquire the key phrases.
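The phrase definition above can be illustrated with a minimal Python sketch. It is not the algorithm of [5]: the five tagging sets are reduced to two tiny, hypothetical word sets (COMMON_VERBS and STOP_WORDS), the noun-indicator rules are left out, sentence boundaries are ignored, and the lower bound of two words per phrase (min_len) is an assumption made only to keep the output short.

```python
import re

# Hypothetical, deliberately tiny word sets; the real algorithm [5] uses larger ones.
COMMON_VERBS = {"is", "are", "was", "said", "has", "have", "accused", "make"}
STOP_WORDS = {"a", "an", "the", "and", "of", "on", "to", "in", "for", "about", "not"}
MAX_PHRASE_LEN = 4  # "a sequence of up to four contiguous words"


def candidate_phrases(text, min_len=2):
    """Return contiguous word sequences that satisfy the phrase definition."""
    words = re.findall(r"[a-z]+", text.lower())
    phrases = []
    for start in range(len(words)):
        for length in range(min_len, MAX_PHRASE_LEN + 1):
            chunk = words[start:start + length]
            if len(chunk) < length:
                break  # ran past the end of the text
            if any(word in COMMON_VERBS for word in chunk):
                break  # longer chunks from this start would contain the verb too
            if chunk[0] in STOP_WORDS or chunk[-1] in STOP_WORDS:
                continue  # a phrase may not start or end with a stop-word
            phrases.append(" ".join(chunk))
    return phrases


if __name__ == "__main__":
    print(candidate_phrases("Singapore is not serious about an extradition treaty"))
```

Counting how often each candidate occurs in a document is then enough to rank the candidates.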


Key Phrase Identification Example

• Keyword: News Topic Discovery

• Suggested key phrases:
  - news story
  - topic discovery
  - english news articles


Key Phrase Identification → Topic Discovery Adjustment


Example: Topic Discovery Using KPI

• Title:

Singapore not serious about an extradition treaty: Marzuki

• News:

Indonesian House of Representatives speaker Marzuki Alie accused Singapore of downplaying the need for “pure” talks on an extradition treaty between the two neighbors. “Singapore has, at all times, never seriously responded to our need to make a pure extradition treaty. Our two countries are supposed to agree on an extradition treaty without relating it to other matters,” Marzuki said Friday in Jakarta, as quoted by tempointeraktif.com. Marzuki was apparently complaining about Singapore’s reported request to be allowed to conduct military training within Indonesian territory in exchange for the extradition treaty. Marzuki said the two issues were “not correlated”. The non-existence of an extradition treaty between Indonesia and Singapore has allowed the escape of many Indonesian fugitives who have fled to Singapore – an issue that has recently drawn the attention of the Indonesian in the aftermath of the escape of two prominent graft fugitives.

• Topic:

extradition treaty


Dataset

• Online version of The Jakarta Post
  - The Jakarta Post focuses on local Indonesian news, which makes analysis of the news content more feasible
  - The key phrase identification algorithm works for English news

• 1 July 2011 to 10 July 2011

• 318 news articles were obtained for the dataset

• The average number of sentences per article is 28.7390
  - The minimum number of sentences is 5
  - The maximum is 221 sentences


Topic Discovery Strategy

1. Direct implementation of the Key Phrase Identification algorithm on news content

2. News classification based on similarity with a threshold

3. Refinement of news classification based on similarity with a threshold (produces 3 alternative methods)


Direct Implementation of the Key Phrase Identification Algorithm on News Content

1. Find key phrases (a phrase is considered a key phrase, i.e. a topic, if it occurs at least 3 times). With this requirement, not every news article will yield a topic.

2. If a topic is found in a specific news article, assign it as the topic of that article.

3. Find more news articles (for instance 5) that are similar to the article, and assign the topic to those selected similar articles.
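Step 1 above amounts to counting phrases and keeping those that reach the occurrence threshold. A minimal Python sketch, assuming the candidate phrases of one article have already been extracted into a list (the function name and the sample phrases are hypothetical):

```python
from collections import Counter

MIN_OCCURRENCES = 3  # threshold stated in step 1


def topics_from_phrases(phrases, min_occurrences=MIN_OCCURRENCES):
    """Return (phrase, count) pairs for phrases that reach the threshold."""
    counts = Counter(phrases)
    return [(phrase, n) for phrase, n in counts.most_common() if n >= min_occurrences]


if __name__ == "__main__":
    # Hypothetical candidate phrases extracted from a single article.
    phrases = ["extradition treaty"] * 4 + ["military training"] + ["graft fugitives"] * 2
    print(topics_from_phrases(phrases))
    # An article whose phrases never reach the threshold yields no topic at all.
    print(topics_from_phrases(["military training", "graft fugitives"]))
```

The second call shows why, under this requirement, some articles end up without a topic.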


Evaluation


News Classification Based on Similarity with a Threshold

1. News articles are paired, and a similarity value is calculated for each pair.

2. Based on the similarity values, the pairs are sorted and divided into two parts by a limit value. The limit value is an assumption that depends on the similarity algorithm: it is the estimated boundary between closely related pairs and unrelated pairs. The unrelated pairs are cut off. In this scenario, 0.31 is assumed to be the limit value for cutting off the unrelated pairs.

3. From the remaining similarity list (above the cut-off limit), news articles are grouped according to the links formed by the similar pairs. For example, suppose the list contains the pairs (2,3), (3,5), and (1,4), where the brackets indicate that the two articles are similar according to the criterion of step 2. The groups formed by linkage are then (2,3,5) and (1,4).

4. Since topics are named after a series of events, only groups with at least 3 news articles are analyzed. Each such group is analyzed using the key phrase identification algorithm, which yields many topics (each with its number of occurrences). The topics are then combined and their occurrences summed, and the topic with the highest total becomes the topic name of the news group.
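Steps 2 to 4 above can be sketched as a cut-off followed by grouping through linked pairs. The Python sketch below assumes the similarity values have already been computed by some similarity measure, which is not specified here; the scores in the usage example are made up, and the union-find structure is one possible way to form the linkage, not necessarily the one used in this research.

```python
SIMILARITY_LIMIT = 0.31  # cut-off value assumed in step 2
MIN_GROUP_SIZE = 3       # step 4: only groups with at least 3 articles are analyzed


def group_by_linkage(scored_pairs, limit=SIMILARITY_LIMIT):
    """Group article ids that are linked by pairs scoring above the limit."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score > limit:  # pairs at or below the limit are cut off
            parent[find(a)] = find(b)

    groups = {}
    for article in list(parent):
        groups.setdefault(find(article), set()).add(article)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


if __name__ == "__main__":
    # Pairs from step 3 with hypothetical scores, plus one pair below the limit.
    pairs = [(2, 3, 0.62), (3, 5, 0.45), (1, 4, 0.38), (5, 6, 0.12)]
    groups = group_by_linkage(pairs)
    print(groups)
    print([g for g in groups if len(g) >= MIN_GROUP_SIZE])
```

With these inputs the pairs (2,3) and (3,5) link into one group of three articles, while (1,4) forms a group that is too small to be analyzed under step 4.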


Evaluation


Refinement Strategy

1. The news content is analyzed with each of the titles that exist in the group. The resulting topics are summed according to their occurrences.

2. The news content is analyzed with each of the titles that exist in the group. In each analysis every topic is treated equally: its occurrence is counted as 1 per analysis step, so the maximum number of occurrences of a topic equals the number of news articles in the group.

3. The topic candidate list is obtained from the refinement strategy of alternative 2. The candidates are then scored by their term frequency in the news content, and the most frequent topic is used as the topic of the news group.


Topic Discovery Alternative Method 1

• The news content is analyzed with each of the titles that exist in the group. The resulting topics are summed according to their occurrences.
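A minimal Python sketch of this summation, assuming each analysis step has already produced a mapping from topic to occurrence count (the function name and the sample counts are hypothetical):

```python
from collections import Counter


def sum_topic_counts(analyses):
    """Sum topic occurrence counts over all analyses of one news group."""
    total = Counter()
    for counts in analyses:
        total.update(counts)  # adds the counts of each topic
    return total


if __name__ == "__main__":
    # One dict per analysis step; the numbers are made up for illustration.
    analyses = [
        {"extradition treaty": 4, "graft fugitives": 1},
        {"extradition treaty": 2, "military training": 3},
        {"graft fugitives": 2},
    ]
    total = sum_topic_counts(analyses)
    print(total.most_common(1))
```

The topic with the highest summed count becomes the group topic.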


Topic Discovery Alternative Method 2

• The news content is analyzed with each of the titles that exist in the group. In each analysis every topic is treated equally: its occurrence is counted as 1 per analysis step, so the maximum number of occurrences of a topic equals the number of news articles in the group.
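A minimal Python sketch of this counting rule, assuming each analysis step has produced a mapping from topic to occurrence count (the function name and the sample counts are hypothetical). Each topic contributes at most 1 per analysis, regardless of how often it occurred.

```python
from collections import Counter


def count_topic_presence(analyses):
    """Count in how many analyses each topic appears (at most 1 per analysis)."""
    total = Counter()
    for counts in analyses:
        total.update({topic for topic, n in counts.items() if n > 0})
    return total


if __name__ == "__main__":
    analyses = [
        {"extradition treaty": 4, "graft fugitives": 1},
        {"extradition treaty": 2, "military training": 3},
        {"graft fugitives": 2},
    ]
    presence = count_topic_presence(analyses)
    print(presence)
```

Because the counts are capped by the number of analyses, several topics can easily share the highest value, so a single group may end up with more than one top-ranked topic.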


Topic Discovery Alternative Method 3

• The topic candidate list is obtained from the refinement strategy of alternative 2. The candidates are then scored by their term frequency in the news content, and the most frequent topic is used as the topic of the news group.
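A minimal Python sketch of this re-ranking, assuming the candidate list is already available. Term frequency is approximated here by plain case-insensitive substring counting over the group's article texts, which is a simplification chosen for the example; the function name and the sample texts are hypothetical.

```python
def rank_by_term_frequency(candidates, articles):
    """Rank candidate topics by how often they occur in the group's articles."""
    text = " ".join(articles).lower()
    frequency = {topic: text.count(topic.lower()) for topic in candidates}
    return sorted(frequency.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    candidates = ["extradition treaty", "graft fugitives"]
    articles = [
        "An extradition treaty was discussed. The extradition treaty stalled.",
        "Two graft fugitives fled; the extradition treaty is missing.",
    ]
    ranked = rank_by_term_frequency(candidates, articles)
    print(ranked[0])
```

The first entry of the ranked list is the topic assigned to the news group.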


Evaluation Result

• Based on the implementation and evaluation that have been carried out, direct implementation of the Key Phrase Identification algorithm on news content cannot be used to determine topics because it leads to imperfect results.

• News classification based on similarity with a threshold also did not produce the expected topics, because not all of the news groups that were formed produced a topic.

• The refinement processes applied to news classification based on similarity with a threshold were undertaken to obtain good topic candidates. However, refinement alternative method 2 did not give good results because it generated too many topics in one news group. On the other hand, alternative methods 1 and 3 provide fairly good results in determining the topics of the news groups.

• All topics that were found and assigned to the news groups are closely linked to the news. It cannot be denied, however, that the topics are less appropriate than topics created manually by humans. The better-suited candidate topics were actually found by the key phrase identification algorithm, but they did not emerge as the chosen topic of the news group.


Discussion

• News grouping needs to be done first to eliminate the possibility of a mismatch between the topics and the news they are assigned to. It is worth noting that not every news article needs to have a topic, and that a topic is assigned to a group of news articles that cover similar events.

• The key phrase identification algorithm, which actually serves as a tool for proposing queries to a search engine, can be modified for use as a topic detection method. Of the usage scenarios that were implemented and evaluated, alternative methods 1 and 3 of the refinement of news classification based on similarity with a threshold can be used as methods for topic discovery.

• This method is able to identify candidate topics in the news, but unfortunately it has a tendency not to choose the most relevant candidates as the topic, even though the chosen topic is closely related to the news group.


References

[1] Hrvoje Bacan, Igor S. Pandzic, and Darko Gulija, "Automated News Item Categorization," in Proceedings of JSAI 2005 Workshop on Conversational Informatics, Kitakyushu, Japan, 2005, pp. 57-62.

[2] José Iria, Fabio Ciravegna, and João Magalhães, "Web News Categorization using a Cross-Media Document Graph," in Proceedings of the 2009 ACM International Conference on Image and Video Retrieval, 2009.

[3] Sreenivasa Sista, Richard Schwartz, Timothy R. Leek, and John Makhoul, "An Algorithm for Unsupervised Topic Discovery from Broadcast News Stories," in Proceedings of the Second International Conference on Human Language Technology Research, 2002, p. 110.

[4] Mingrong Liu, Yicen Liu, Liang Xiang, Xing Chen, and Qing Yang, "Extracting Key Entities and Significant Events from Online Daily News," in IDEAL '08 Proceedings of the 9th International Conference on Intelligent Data Engineering and Automated Learning, 2008, pp. 201-209.

[5] M. Shamim Khan and Sebastian Khor, "Enhanced Web Document Retrieval Using Automatic Query Expansion," Journal of the American Society for Information Science and Technology, vol. 55, no. 1, pp. 29–40, January 2004.