Database Systems Research Group, Heidelberg University
April 22, 2020
Software Practicals, Summer Semester 2020
Slides Online
The slides are available on our webpage: https://dbs.ifi.uni-heidelberg.de/teaching/current/
Organization
Outline
● Overview of topics (today)
  ○ send application for a topic by Monday, April 27, 1 pm
  ○ assignment of topics by April 29
● First milestone (mid/end May)
  ○ prototype/part of software
  ○ summary of research (literature and related systems/tools)
  ○ further milestones in agreement with supervisor
● End of practical (mid/end July)
  ○ code in local Gitlab
  ○ report / documentation as local Wiki document
  ○ presentation / demo of practical and software (10-12 minutes)
Organizational issues
● Application
  ○ by email directly to supervisor
  ○ brief list of relevant courses / prior knowledge / application area (“Anwendungsgebiet”)
  ○ schedule and milestones for the practical
  ○ group work is not possible
  ○ application is binding (don’t apply if you don’t want to do the practical)
● Deadlines
  ○ presentation: planned for last week of July 2020
  ○ report & Gitlab upload: end of August 2020
  ○ no extension possible
  ○ not finished = failed (grade 5.0)
Assessment
● Credit points (Leistungspunkte)
  ○ Beginners Practical (IAP, 2+4 ECTS) [Bachelor students]
    ■ workload: 180 h (~1½ days/week)
  ○ Advanced Practical (IFP, 8 ECTS)
    ■ workload: 240 h (~2 days/week)
● Grading based on
  ○ code (readability, structure, functionality)
  ○ documentation (README, comments)
  ○ commitment and self-reliance
  ○ cool ideas!!
● IMPORTANT
  ○ talk to / communicate with your advisor
Supervisors
● Michael Gertz (MG)
● Satya Almasian (SA)
● Dennis Aumiller (DA)
● Philip Hausner (PH)
Project Topics
Overview of Topics
1. Implement Citation Extraction in spaCy, BP/AP, (Aumiller)
2. Outline Generation for Wikipedia Articles, AP, (Aumiller/Almasian)
3. Analysis of RNV Delays, BP/AP, (Aumiller/Hausner)
4. Time-dependent analysis of COVID-19 case development, BP/AP, (Hausner)
5. Time-dependent Political Twitter Analysis, AP, (Hausner)
6. Annotating Numerical Relations in News Articles, AP, (Almasian)
7. Numerical Word Co-occurrence Networks (extension), BP/AP, (Almasian)
8. YouTube Video Comment Extractor and Exploration, AP, (Gertz)
9. Extraction and Management of Bundestag Documents, BP/AP, (Gertz)
BP/AP: Implement Citation Extraction in spaCy (DA)
Given:
1. Rule-based extraction algorithm by Openlegaldata.io
2. Dataset of ~1,000 manually annotated references
Tasks:
• Transfer functionality to spaCy’s rule-based entity extractor
• Publish package that makes this easily usable in spaCy
Subtasks:
• Create detailed flow-chart of existing RegEx coverage
Languages / Tools:
• Python; spaCy; RegEx
AP: Outline Generation for Wikipedia Articles (DA/SA)
Given:
1. Cleaned dataset of articles from Wikipedia
2. Paper by Zhang et al. [1]
Tasks:
• Implement efficient data loader
• Try to reproduce training results from the paper
• Implement alternative scoring (Rand score, etc.)
Subtasks:
• Learn details about implementation and investigate improvements
• Investigate evaluation metrics
Languages / Tools:
• Python; PyTorch; Neural Networks (!!)
BP/AP: Analysis of RNV Delays (DA/PH)
Given:
1. Start.Info API (RNV API) [1]
2. Previous outside project: RNV Monitor [2]
Tasks:
• Crawl all data (not just delays)
• Broader analysis of delays (daytime, line, etc.)
• Create time-dependent geographical heat map
Subtasks:
• Compare results to RNV Monitor dump
• Create suitable database schema
Languages / Tools:
• Python; REST API; SQL
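The "broader analysis of delays (daytime, line, etc.)" task could start from a simple grouping step. The following is only a minimal sketch: the record fields (`line`, `timestamp`, `delay_s`) and the sample values are hypothetical and do not reflect the actual response format of the Start.Info API.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical delay records; the real API fields and values will differ.
records = [
    {"line": "5", "timestamp": "2020-04-22T08:05:00", "delay_s": 120},
    {"line": "5", "timestamp": "2020-04-22T08:40:00", "delay_s": 60},
    {"line": "5", "timestamp": "2020-04-22T17:10:00", "delay_s": 300},
    {"line": "21", "timestamp": "2020-04-22T08:15:00", "delay_s": 0},
]


def mean_delay_by_line_and_hour(rows):
    """Return {(line, hour of day): mean delay in seconds}."""
    groups = defaultdict(list)
    for row in rows:
        hour = datetime.fromisoformat(row["timestamp"]).hour
        groups[(row["line"], hour)].append(row["delay_s"])
    return {key: mean(values) for key, values in groups.items()}


result = mean_delay_by_line_and_hour(records)
```

In the practical itself, the same grouping would more likely be expressed as a SQL `GROUP BY` over the crawled data.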
BP/AP: Time-dependent Analysis of COVID-19 (PH)
Given:
1. Public data set for Germany [1]
2. Reference work from RKI [2]
Tasks:
• Crawl data set
• Identify locations with high increase of case numbers
• Create time-dependent geographical heat map
Subtasks:
• Create suitable database schema
• Structure in time-dependent fashion
Languages / Tools:
• Python; JavaScript (vis.js); REST API; SQL
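One possible reading of "locations with high increase of case numbers" is relative growth of the cumulative count over a fixed window. The sketch below assumes that reading; the location names, the numbers, the window length, and the threshold are all made up for illustration and are not taken from the public data set.

```python
# Hypothetical cumulative case counts per location, one value per day.
cumulative = {
    "district_a": [100, 110, 121, 133],
    "district_b": [50, 51, 52, 53],
}


def relative_increase(series, window=3):
    """Relative growth of the cumulative count over the last `window` days."""
    start, end = series[-window - 1], series[-1]
    if start == 0:
        return 0.0 if end == 0 else float("inf")
    return (end - start) / start


def fastest_growing(data, window=3, threshold=0.2):
    """Locations whose relative growth reaches the (assumed) threshold."""
    growth = {loc: relative_increase(s, window) for loc, s in data.items()}
    flagged = [loc for loc, g in growth.items() if g >= threshold]
    return sorted(flagged, key=lambda loc: -growth[loc])


hotspots = fastest_growing(cumulative)
```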
BP/AP: Time-dependent Political Twitter Analysis (PH)
Given:
1. Twitter data
Tasks:
• Structure information around creation dates of Twitter posts
• Identify important topics for certain dates
• Take into account all terms or only hashtags
Subtasks:
• Investigate different weighting schemes
Languages / Tools:
• Python; SQL
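As one example of a weighting scheme, hashtags could be scored per creation date with a TF-IDF-style weight, treating each date as a document. This is a sketch under that assumption, not a prescribed method; the sample posts and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical (creation date, text) pairs standing in for real Twitter data.
tweets = [
    ("2020-04-20", "Debate on #budget continues #politics"),
    ("2020-04-20", "New #budget proposal announced"),
    ("2020-04-21", "Parliament votes today #election #politics"),
    ("2020-04-21", "#election results expected tonight"),
]


def hashtags(text):
    """Extract lower-cased hashtags without the leading '#'."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]


def top_terms_per_date(rows, k=1):
    """Rank hashtags per date by tf * log(number of dates / date frequency)."""
    per_date = defaultdict(Counter)
    for date, text in rows:
        per_date[date].update(hashtags(text))
    n_dates = len(per_date)
    # Date frequency: on how many dates does a hashtag occur at all?
    df = Counter(tag for counts in per_date.values() for tag in counts)
    ranking = {}
    for date, counts in per_date.items():
        scores = {tag: tf * math.log(n_dates / df[tag]) for tag, tf in counts.items()}
        ranking[date] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return ranking


top = top_terms_per_date(tweets)
```

Replacing `hashtags` with a tokenizer over all terms would cover the "all terms or only hashtags" comparison.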
AP: Annotating Numerical Relations in News Articles (SA)
Given:
1. Corpus of economic news articles
Tasks:
• Extract high-confidence relations that contain numerical information from news articles
• Apply Named Entity Disambiguation to the entities and numbers
• Save the annotated dataset in MongoDB
Subtasks:
• Get familiar with OpenIE for information extraction
• Use AIDA for Named Entity Disambiguation
• Detect quantities with Illinois Quantifier
Languages / Tools:
• Python; MongoDB; basic knowledge of Java is also recommended
BP/AP: Numerical Word Co-occurrence Networks (SA)
Given:
1. English Wikipedia corpus
Tasks:
• Improve an existing pipeline that builds a word co-occurrence graph from sentences containing numerical information
• Enhance the NER (using MetaMap from UMLS)
• Enhance the numerical extractor (using Illinois Quantifier)
Subtasks:
• Explore the distribution of the numerical values with respect to the surrounding words to extract valid ranges
Languages / Tools:
• Python; scikit-learn; basic knowledge of Java is also recommended
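To illustrate the general idea of a co-occurrence graph restricted to sentences with numerical information, the sketch below counts word pairs only in sentences that contain a number. It is not the existing pipeline; the sample sentences, the regular expressions, and the function name are assumptions for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical sentences standing in for the Wikipedia corpus.
sentences = [
    "The bridge is 120 meters long",
    "The bridge was built by engineers",
    "The tower is 300 meters tall",
]

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def cooccurrence_counts(texts):
    """Count unordered word pairs within sentences that contain a number."""
    counts = Counter()
    for text in texts:
        if not NUMBER.search(text):
            continue  # skip sentences without numerical information
        tokens = sorted({tok.lower() for tok in re.findall(r"[A-Za-z]+", text)})
        counts.update(combinations(tokens, 2))
    return counts


edges = cooccurrence_counts(sentences)
```

Each key of `edges` is an alphabetically ordered word pair, and its value is the edge weight in the graph.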
AP: YouTube Comment Extractor/Exploration (MG)
Given:
1. Existing pipeline to extract comments from YouTube
2. Comprehensive documentation of the data
Tasks:
• Implement Web-based dashboard to view comment statistics
• Provide Web-based search interface on comments
Subtasks:
• Port pipeline to Elasticsearch
• Decide which features to realize in dashboard
• Develop search methods for comments
Languages / Tools:
• Python; Elasticsearch
AP: Bundestag Documents (MG)
Given:
1. Printed papers (Drucksachen) and plenary protocols (Plenarprotokolle) [DIPBT]
Tasks:
• (Adaptive) crawler for Drucksachen
• Store the documents in Solr (structured)
• Faceted search on documents via a Web frontend
Subtasks:
• Data model for documents
• Model for faceted search
Languages / Tools:
• Python; Solr