A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles
Besnik Fetahu¹, Stefan Dietze¹, Bernardo Pereira Nunes², Marco Antonio Casanova², Davide Taibi³, Wolfgang Nejdl¹
¹ L3S Research Center, Leibniz Universität Hannover
² Department of Informatics - PUC-Rio
³ Institute for Educational Technologies, CNR
May 29, 2014
The increasing adoption of Linked Data principles has led to an abundance of datasets on the Web. However, take-up and reuse is hindered by the lack of descriptive information about the nature of the data, such as their topic coverage, dynamics or evolution. To address this issue, we propose an approach for creating linked dataset profiles. A profile consists of structured dataset metadata describing topics and their relevance. Profiles are generated through the configuration of techniques for resource sampling from datasets, topic extraction from reference datasets and their ranking based on graphical models. To enable a good trade-off between scalability and accuracy of generated profiles, appropriate parameters are determined experimentally. Our evaluation considers topic profiles for all accessible datasets from the Linked Open Data cloud. The results show that our approach generates accurate profiles even with comparably small sample sizes (10%) and outperforms established topic modelling approaches.
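As a rough illustration of the sampling step mentioned in the abstract, the sketch below draws a 10% sample of a dataset's resources. Uniform random sampling, the function name `sample_resources`, and the `fraction` and `seed` parameters are assumptions made for this example; the slides list several resource sampling approaches without detailing them here.

```python
import random

def sample_resources(resource_uris, fraction=0.10, seed=0):
    """Return a uniform random sample of resource identifiers.

    Assumption: resources are sampled uniformly without replacement.
    The sample holds at least one resource for any non-empty dataset.
    """
    resources = list(resource_uris)
    if not resources:
        return []
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    size = max(1, round(len(resources) * fraction))
    return rng.sample(resources, size)
```

Calling `sample_resources(uris)` on a list of 100 resource URIs would return 10 of them; only the sampled resources would then be passed on to topic extraction.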
1 Introduction
2 Problem and Motivation
3 Approach
    Resource Instance and Type Extraction
    Resource Sampling Approaches
    Constructing profiles: Dataset-topic graph
    Topic Ranking Approaches
4 Experimental Setup
    Baselines
5 Evaluation Results
    Efficiency of Dataset Profiling
    Scalability of Dataset Profiling
6 Conclusions
Introduction
• Increasing amount of Web Data
• Data heterogeneity: representation, language, quality and domains
• Sparsely connected datasets
• Lack of descriptive metadata about datasets
• Exhaustive techniques for data analysis
• Efficiency heavily dependent on information need
• Ease of access and representation of datasets
Baselines
• tf-idf: Consider resources as documents. For each dataset, extract the top {50, 100, 150, 200} terms.
• LDA: Consider datasets as documents (see footnote 4). Extract the top-weighted topic terms. For every dataset, extract the top {50, 100, 150, 200} terms, with the number of topics set to {10, 20, 30, 40, 50}.
Footnote 4: In this case it does not matter whether datasets are considered at the resource level or aggregated.
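The tf-idf baseline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: whitespace tokenisation, the tf and idf weighting variants, summing weights over a dataset's resources, and the name `top_tfidf_terms` are all assumptions made for the example.

```python
import math
from collections import Counter

def top_tfidf_terms(dataset_resources, k):
    """Return, per dataset, the k terms with the highest summed tf-idf weight.

    dataset_resources maps a dataset name to a list of resource texts.
    Each resource is treated as one document, as in the tf-idf baseline.
    """
    # Assumption: lower-cased whitespace tokenisation.
    docs = {name: [text.lower().split() for text in texts]
            for name, texts in dataset_resources.items()}
    all_docs = [doc for resource_docs in docs.values() for doc in resource_docs]
    n_docs = len(all_docs)

    # Document frequency: number of resources containing each term.
    df = Counter(term for doc in all_docs for term in set(doc))

    result = {}
    for name, resource_docs in docs.items():
        scores = Counter()
        for doc in resource_docs:
            for term, count in Counter(doc).items():
                tf = count / len(doc)
                idf = math.log(n_docs / df[term])
                scores[term] += tf * idf
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[name] = [term for term, _ in ranked[:k]]
    return result
```

Running the function once per value of `k` in {50, 100, 150, 200} would produce the term lists that the baseline compares against the generated profiles.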
Efficiency of Dataset Profiling
[Figure: Profiling accuracy for all topic ranking approaches. x-axis: NDCG rank (1 to 1000); y-axis: NDCG ranking score (0 to 0.4).]
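For readers unfamiliar with the metric on the plot's axes, the sketch below computes NDCG at a given rank. It uses one common formulation (linear gain with a log2 rank discount); whether the evaluation uses exactly this variant is an assumption, and the names `dcg` and `ndcg_at` are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: the gain at 1-based rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg_at(relevances, k):
    """NDCG at rank k for a ranked list of graded relevance scores.

    The ideal ordering is obtained by sorting the same scores in
    descending order; a list with no relevant items scores 0.0.
    """
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(relevances[:k]) / ideal_dcg
```

Here `relevances` would hold the relevance grades of a dataset's ranked topics in ranked order, and `k` corresponds to the NDCG rank on the x-axis.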