Kettunen, Honkela, Lindén, Kauppinen, Pääkkönen & Kervinen 2014
Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods
Kimmo Kettunen (1), Timo Honkela (1, 2), Krister Lindén (2), Pekka Kauppinen (2), Tuula Pääkkönen (1) & Jukka Kervinen (1)
IFLA Pre-Conference, Geneva, Switzerland, 13 August 2014
Presented by Timo Honkela
Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods
This presentation was given by Timo Honkela (National Library of Finland and University of Helsinki) at the IFLA 2014 Pre-Conference "Digital Transformation and the Changing Role of News Media in the 21st Century" in Geneva, Switzerland, on August 13, 2014. The presentation consists of three main parts: (1) Background, (2) OCR result analysis and correction, and (3) Potential directions for future research in socio-cultural text mining of newspaper collections. The published paper covers parts 1 and 2. The abstract of the paper is provided below.
Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods
Kimmo Kettunen, Timo Honkela, Krister Lindén, Pekka Kauppinen, Tuula Pääkkönen and Jukka Kervinen
Abstract
In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the outcome of the OCR process is limited, especially with regard to the oldest parts of the collection. Approaches such as crowdsourcing have been used in this field to improve the quality of the texts, but in this case the volume of the materials makes it impossible to manually edit any substantial proportion of the texts. Therefore, we experiment with quality evaluation and improvement methods based on corpus statistics, language technology and machine learning in order to find ways to automate the analysis and improvement process. The final objective is to reach a clear reduction in the human effort needed in the post-processing of the texts. We present quantitative evaluations of the current quality of the corpus, describe challenges related to texts written in a morphologically complex language, and describe two different approaches to achieve quality improvements.
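As a rough illustration of what a corpus-statistics quality estimate can look like, the sketch below computes the share of tokens that a word list recognizes. It is not the method of the paper: the toy lexicon, the sample sentences and the function names are all hypothetical, and a realistic setup for a morphologically complex language such as Finnish would need a morphological analyzer rather than a plain word list.

```python
import re

# Hypothetical toy lexicon (an assumption for this sketch only).
# A plain word list cannot cover the inflected forms of Finnish.
LEXICON = {"sanomalehti", "on", "vanha", "ja", "kirja"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def recognition_rate(text, lexicon):
    """Return the share of tokens found in the lexicon (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return known / len(tokens)


# Invented examples: a clean sentence and the same sentence with
# simulated OCR character errors in three of its seven words.
clean = "Sanomalehti on vanha ja kirja on vanha"
noisy = "Sanomalehtl on vanba ia kirja on vanha"

print(recognition_rate(clean, LEXICON))
print(recognition_rate(noisy, LEXICON))
```

A lower rate on a page suggests more OCR damage, although unrecognized tokens may also be proper names, rare inflected forms or historical spellings, so the figure is only an approximate indicator.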
● The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001, 2005).
● This collection contains approximately 1.95 million pages in Finnish and Swedish.
● According to Legal Deposit law, the National Library of Finland receives a copy of each newspaper and magazine published in Finland.
● Despite recent developments in OCR software, challenges remain, as some of the material is very old, with
– varying paper and print quality,
– varying number of columns and layout patterns,
– different languages (mainly Finnish and Swedish but also French, German, etc.), and
Kimmo Koskenniemi (2013): Johdatus kieliteknologiaan, sen merkitykseen ja sovelluksiin (Introduction to language technology, its significance and applications)
Timo Honkela, Juha Raitio, Krista Lagus, Ilari T. Nieminen, Nina Honkela, and Mika Pantzar: Subjects on objects in contexts: Using GICA method to quantify epistemological subjectivity (IJCNN 2012)