CorpusStudio web application
Erwin R. Komen
Meertens Instituut // Radboud University Nijmegen // SIL-International

1. Background
• Existing software:
  • CorpusStudio (Windows)
  • Cesax (Windows)
  • Successfully used in linguistic research
• Web application version?
  • Central location for corpora (latest version)
  • Platform independent: MacOS/Linux/Windows
  • Fast parallel processing

2. Formats
• FoLiA XML
  • Dutch: Nederlab, CGN, Sonar/Lassy
• TEI-Psdx XML
  • English historical + SLA
  • Caucasian: Chechen, Lak, Lezgi
  • Old Welsh
  • Dutch
• Additional formats
  • Convert via Cesax (Alpino, Negra, …)
  • Add a handler to CorpusStudio

4. Defining queries
• Definition editor
  • Constants
  • Functions (XQuery)
• Query editor
  • Subcategorization (XQuery)
• Constructor editor
  • Execution order
  • Options (examples, output, complement)
• Result database feature editor
  • Features are calculated by XQuery user functions

6. Availability
• CorpusStudio sources (build your own version)
  • https://github.com/ErwinKomen
• CLARIN-NL access
  • http://www.clarin.nl/node/2095

7. References
Boag, Scott, Don Chamberlin, Mary F. Fernández, Daniela Florescu, Jonathan Robie, and Jérôme Siméon. 2010. XQuery 1.0: An XML Query Language (Second Edition). W3C Recommendation. <http://www.w3.org/XML/Query>.
van Gompel, Maarten, and Martin Reynaert. 2013. FoLiA: A practical XML format for linguistic annotation - a descriptive and comparative study. Computational Linguistics in the Netherlands Journal 3:63-81.
Komen, Erwin R. 2013. Corpus databases with feature pre-calculation. In Sandra Kübler, Petya Osenova & Martin Volk (eds), Proceedings of the Twelfth Workshop on Treebanks and Linguistic Theories (TLT12), 85-96. Sofia, Bulgaria: The Institute of Information and Communication Technologies, Bulgarian Academy of Sciences.
[Figure: architecture of the CorpusStudio web application. Diagram labels: User information; Project information; Meta Data Editor; Definition Editor; Query Editor; Constructor Editor; Input Selector; Definitions; Queries; Corpus Research Project (.crpx); Search service: crpp; Query Executor; Database Creator; Output Monitor; Status; Documents (.xml); Results (.xml); Corpus Research Database (.xml); Table Viewer; Result Viewer; Corpus Viewer; Database feature editor; Result Grouping; Standard grouping (.json); Grouping Viewer; Result database; Result dbase Viewer; Result dbase Editor. Data is exchanged as xml and json.]

3. Corpus Research Projects
• All information for one research project
  • Meta information (author, dates, goal)
  • Input (language, corpus, filter)
  • All definition and query files used
  • Execution order
  • Optional: result database features
• Exchange
  • Upload/download
  • Compatible with Windows CorpusStudio

[Figure: CorpusStudio components. Labels: Meta Data Editor; Definition Editor; Input Selector; Query Editor; Constructor Editor; Output Monitor; Query Executor; Result Viewer; Corpus Viewer; Database feature editor.]

5. Future
• Grouping editor
  • Group output over meta-data categories
  • User-definable (XQuery)
• Query/project wizard
  • Tabular input of principal components
  • Relations, names, feature calculations
• Result database editor
  • View and edit result database records
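
The idea behind "Group output over meta-data categories" can be pictured with a small Python sketch. This is an illustration only, not CorpusStudio code: the poster states that groupings are to be user-definable in XQuery, and the record layout, the field names (`period`, `genre`) and the function `group_hits` are all hypothetical.

```python
from collections import Counter


def group_hits(results, category):
    """Count query hits per value of one metadata category.

    `results` is a list of dicts, one per hit (hypothetical layout).
    Hits that lack the category are counted under "(unknown)".
    """
    counts = Counter()
    for record in results:
        counts[record.get(category, "(unknown)")] += 1
    return dict(counts)


# Made-up result records, for illustration only.
results = [
    {"text": "doc1", "period": "OE", "genre": "prose"},
    {"text": "doc2", "period": "OE", "genre": "verse"},
    {"text": "doc3", "period": "ME"},
]

if __name__ == "__main__":
    print(group_hits(results, "period"))
    print(group_hits(results, "genre"))
```

Grouping by a single category keeps the sketch short; a real grouping definition could combine several metadata fields or compute the group key from them.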