Information Retrieval: Vector Space Model
Claes Neuefeind, Fabian Steeg
3 December 2009
Text-Engineering I - Information-Retrieval - Winter semester 2009/2010 - Informationsverarbeitung - Universität zu Köln
Seminar topics
- Boolean retrieval model (IIR 1)
- Data structures (IIR 2)
- Tolerant retrieval (IIR 3)
- Vector space model (IIR 6)
- Evaluation (IIR 8)
- Web retrieval (IIR 19-21)
Review: Boolean retrieval
- Find all documents that contain the query term(s)
  - 'All or nothing'
  - Good for experts and applications, less good for users
- Extensions:
  - Positional index (phrases, proximity)
  - Permuterm or k-gram index (fuzzy matching, corrections)
- Ranking?
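The 'all or nothing' behaviour can be sketched as follows. This is a minimal illustration, not the seminar's implementation: the function names `build_index` and `boolean_and` and the three sample documents are assumptions made for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def boolean_and(index, terms):
    """Return the IDs of documents that contain every query term."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    if not postings:
        return []
    # A document either contains all terms or is dropped: no ranking.
    return sorted(set.intersection(*postings))


# Hypothetical sample collection.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_index(docs)
print(boolean_and(index, ["home", "july"]))
```

The result is an unordered set of matches. Nothing in it says which of the matching documents fits the query best, which is the gap that ranking addresses.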
Ranking
- Basic idea:
  - Rate term/document pairs with a 'score' that reflects the relevance of the term to the document
- Approaches:
  - Parameters and zones
  - Term weighting
Parameters and zones
Term weighting
Vector space model
VSM vs. Boolean
References
Parameters
- Using metadata:
  - Structured information about the document
  - Controlled vocabulary
- A plain inverted index is not sufficient
- Extension:
  - Add the parameters to the index
    → mapping of documents to fields
Document zones
- Document zones containing free text
  Figure: www.informationretrieval.org
- Extended index: zones as attributes of terms
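One way to read 'zones as attributes of terms' is that each posting records the zones in which the term occurs. The sketch below assumes that reading. The names `build_zone_index` and `search_zone`, the zone names `title` and `body`, and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_zone_index(docs):
    """Build term -> {doc_id: set of zones in which the term occurs}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            for term in text.lower().split():
                index[term][doc_id].add(zone)
    return index


def search_zone(index, term, zone):
    """Return the IDs of documents that contain `term` inside `zone`."""
    postings = index.get(term.lower(), {})
    return sorted(doc_id for doc_id, zones in postings.items() if zone in zones)


# Hypothetical documents, each split into two free-text zones.
docs = {
    1: {"title": "vector space model",
        "body": "documents as vectors in a term space"},
    2: {"title": "boolean retrieval",
        "body": "the vector space model ranks documents"},
}
index = build_zone_index(docs)
print(search_zone(index, "vector", "title"))
print(search_zone(index, "vector", "body"))
```

Keeping the zone on the posting, rather than building one dictionary entry per term and zone, keeps the vocabulary small and lets a single lookup answer queries on any zone.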
Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n'lized: document weights after cosine normalization; product: the product of final query weight and final document weight
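The columns named in the key can be computed step by step, as in the sketch below. It assumes the usual definitions, tf-wght = 1 + log10(tf) and idf = log10(N/df), with idf applied only on the query side and cosine normalization only on the document side, as the key suggests. The collection size, document frequencies, and term counts are illustrative assumptions; they are not taken from the table on the slide.

```python
import math


def log_tf(tf_raw):
    """tf-wght: logarithmically weighted term frequency."""
    return 1.0 + math.log10(tf_raw) if tf_raw > 0 else 0.0


def idf(n_docs, df):
    """idf: inverse document frequency."""
    return math.log10(n_docs / df)


def cosine_normalize(weights):
    """n'lized: divide every weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: (w / length if length else 0.0) for t, w in weights.items()}


# Assumed values for illustration only.
N_DOCS = 1_000_000
DF = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
query_tf = {"best": 1, "car": 1, "insurance": 1}   # tf-raw in the query
doc_tf = {"auto": 1, "car": 1, "insurance": 2}     # tf-raw in the document

# Query side: weight = tf-wght * idf.
query_weight = {t: log_tf(tf) * idf(N_DOCS, DF[t]) for t, tf in query_tf.items()}
# Document side: weight = tf-wght, followed by cosine normalization.
doc_weight = {t: log_tf(tf) for t, tf in doc_tf.items()}
doc_normalized = cosine_normalize(doc_weight)

# product: query weight times normalized document weight, per term.
products = {t: query_weight[t] * doc_normalized.get(t, 0.0) for t in query_weight}
score = sum(products.values())
print(round(score, 2))
```

With these assumed numbers the score comes to about 3.07. The term 'best' contributes nothing because it does not occur in the document, and 'insurance' contributes most because it is both rare in the collection and frequent in the document.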
Summary: the vector space model
- Advantages:
  - Compact representation of document properties
  - Numerical representation
  - Similarity measures yield graded similarities
    → ranking of documents relative to the query
- Problems:
  - 'Bag of words'
  - Wildcards / fuzzy matching
  - Dimensionality / sparseness
  - Polysemy / homonymy
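Both the graded similarities and the 'bag of words' problem show up in a few lines of code. The sketch below uses raw term counts as vector components to keep it short; the sentences and the helper name `cosine_similarity` are assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Word order is discarded when a text becomes a vector.
d1 = Counter("john is quicker than mary".split())
d2 = Counter("mary is quicker than john".split())
d3 = Counter("mary likes john".split())

print(d1 == d2)                    # same bag of words
print(cosine_similarity(d1, d2))   # maximal similarity
print(cosine_similarity(d1, d3))   # partial overlap, graded similarity
```

The first two sentences say opposite things but map to the same vector, so no similarity measure on these vectors can tell them apart. The third sentence shares only some terms and receives a value strictly between 0 and 1, which is what makes ranking possible.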
VSM vs. Boolean model
- VSM:
  - Accumulated evidence: a higher term frequency raises the score
  - Suited only to free-text queries
- Boolean model:
  - Selective evidence
  - True if the term weight is > 0
- Combination:
  - Implicit AND
  - Additional operators for refined queries
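The contrast between accumulated and selective evidence, and the implicit-AND combination, can be sketched as follows. The document weights are made-up numbers and the function names are assumptions for this illustration, not part of any library.

```python
def boolean_match(doc_weights, query_terms):
    """Selective evidence: every query term must have a weight above zero."""
    return all(doc_weights.get(term, 0.0) > 0 for term in query_terms)


def vsm_score(doc_weights, query_terms):
    """Accumulated evidence: add up the weights of the query terms."""
    return sum(doc_weights.get(term, 0.0) for term in query_terms)


def search_implicit_and(docs, query_terms):
    """Keep only documents matching all terms, then rank them by score."""
    hits = [
        (vsm_score(weights, query_terms), doc_id)
        for doc_id, weights in docs.items()
        if boolean_match(weights, query_terms)
    ]
    # Highest score first; ties are broken by document ID.
    return [doc_id for _, doc_id in sorted(hits, key=lambda h: (-h[0], h[1]))]


# Hypothetical term weights per document.
docs = {
    "d1": {"vector": 0.9, "space": 0.1},
    "d2": {"vector": 0.4, "space": 0.5, "model": 0.3},
    "d3": {"vector": 0.8},
}
print(search_implicit_and(docs, ["vector", "space"]))
```

Here `d3` has a respectable score for 'vector' alone, but it is filtered out because 'space' has no weight in it. The two remaining documents are then ordered by their accumulated weights.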
VSM and wildcards
- No direct querying possible
- Index structures are not compatible (matrix/tree)
- Can be combined via a k-gram index and 'query expansion':
  - Fetch matching terms from the k-gram index
  - Build the query vector from them
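The two expansion steps can be sketched like this. The example assumes bigrams (k = 2), `$` as the word-boundary marker, patterns that use only `*` as a wildcard, and a toy vocabulary; all function names are hypothetical. Candidates from the k-gram index are post-filtered against the pattern because sharing all k-grams does not guarantee a match.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def kgrams(term, k=2):
    """Return the k-grams of a term padded with '$' boundary markers."""
    padded = f"${term}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def build_kgram_index(vocabulary, k=2):
    """Map each k-gram to the set of vocabulary terms that contain it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index


def expand_wildcard(pattern, index, k=2):
    """Step 1: fetch the vocabulary terms that match a '*' pattern."""
    grams = set()
    for piece in f"${pattern}$".split("*"):
        grams |= {piece[i:i + k] for i in range(len(piece) - k + 1)}
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set().union(*index.values())
    # Post-filter: drop candidates that share the k-grams but do not match.
    return sorted(t for t in candidates if fnmatchcase(t, pattern))


def query_vector(terms):
    """Step 2: build a simple query vector with weight 1.0 per term."""
    return {term: 1.0 for term in terms}


vocabulary = ["red", "reduce", "retired", "retrieval", "model"]
index = build_kgram_index(vocabulary)
expanded = expand_wildcard("red*", index)
print(expanded)
print(query_vector(expanded))
```

For the pattern `red*`, the term 'retired' contains all three query bigrams (`$r`, `re`, `ed`) and is only rejected by the post-filter. The resulting query vector can then be scored against document vectors like any free-text query.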
VSM and phrase queries
- The VSM is not suited to position-dependent search
- Multi-word queries always also activate the axes of the individual terms
- Can be combined via 'query parsing'
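One possible form of 'query parsing' is a staged strategy: try the whole query as a phrase first and fall back to a free-text search over the individual terms if too few documents match. The slide does not spell out a procedure, so the sketch below is an assumption. The names `phrase_match` and `parse_and_search`, the `min_hits` threshold, and the count-based fallback score are all hypothetical.

```python
def phrase_match(tokens, phrase):
    """True if the tokens of `phrase` occur consecutively in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def parse_and_search(docs, query, min_hits=1):
    """Run the query as a phrase; fall back to single terms if needed."""
    phrase = query.lower().split()
    if not phrase:
        return "terms", []
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Stage 1: position-dependent phrase search.
    hits = sorted(d for d, toks in tokenized.items() if phrase_match(toks, phrase))
    if len(hits) >= min_hits:
        return "phrase", hits

    # Stage 2: free-text search, scored by occurrences of the query terms.
    scores = {
        d: sum(toks.count(term) for term in set(phrase))
        for d, toks in tokenized.items()
    }
    ranked = sorted((d for d, s in scores.items() if s > 0),
                    key=lambda d: (-scores[d], d))
    return "terms", ranked


# Hypothetical documents.
docs = {
    1: "the vector space model",
    2: "space for a vector",
    3: "boolean model",
}
print(parse_and_search(docs, "vector space"))
print(parse_and_search(docs, "space vector"))
```

The second query finds no phrase match and falls back to the individual terms, where documents 1 and 2 score the same. That is the position-independence the slide describes: in the vector space, both word orders activate the same axes.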
What comes next?
- Evaluation (IIR 8)
- Web retrieval (IIR 19-21)
References
Luhn, H. P. (1957). A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4):309–317.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Further reading: [Manning et al., 2008], chapter 6 (see www.informationretrieval.org)