IBE110: HTML document processing concepts and searching on the Web 2015 Judith A. Molka-Danielsen.

Document Processing

Hypertext Processing: In the 1990s we saw the development of internetworks and ubiquitous windowed interfaces.

Tim Berners-Lee at CERN, the European Organization for Nuclear Research, created HTML and the URL (Uniform Resource Locator) naming scheme. A simple, standardized form of markup, based on SGML, could then be used to describe documents, and the naming scheme allowed for the universal identification of documents.

So documents could be viewed in graphical format, and large collections could be linked across multiple internets. This is hypertext processing.

Properties of Documents

Syntax - can express structure, presentation style, semantics, and external actions. It can be implicit in the contents of a document or expressed in a language.

Structure - a structural element like a section can have a formatting style associated with it that tells how the elements relate to each other within the document.

Presentation Style - how the document is displayed or printed. It can be embedded in the document, as in TeX, or expressed through macros, as in LaTeX. Or it can be defined separately, as CSS does for HTML documents. Presentation style can be determined by the author (in applications or languages) or the reader (Web browser).

Semantics - the meaning within a language, can be associated with use.

Characteristics continued...

Metadata - information about the organization of the data, that is, data about the data, such as author, publication date, and subject codes.

What is Markup?

• Markup is everything in a document that is not content. Typesetters used procedural markup to give instructions for how a document should look (for example, 16 pt bold Helvetica).

• Word processing software like Microsoft Word uses procedural markup. Each program has a specific set of markup codes. The codes apply to a single physical way of presenting information, such as a printed page. They do not define the appearance on other media like CD-ROM or the Internet.

• Descriptive markup, or generic markup, describes the structure of the document rather than its appearance. Content is separate from style. You can publish on all media using the same set of structure instructions.

SGML

SGML (Standard Generalized Markup Language, ISO 8879, 1986) specifies a standard method for describing the structure of a document. Structural elements are, for example, title, chapter, and paragraph. It is an extensible meta language. It can support an infinite variety of document structures, such as information bulletins, technical manuals, parts catalogs, design specifications, reports, letters, and memos.

The Document Type Definition (DTD) describes the structure of the document, much as a database schema describes a database. The DTD provides a framework of elements (chapters, headers) and specifies rules for the relationships between elements, e.g. a chapter header must come after the start of a chapter. A document instance is a document whose content is tagged in conformance with a DTD. A DTD can be applied throughout a whole organization.

SGML continued

SGML uses tagging to identify the content's position within a DTD structure, so we insert tags around the content. Elements can be nested. A parser program verifies that a document follows the rules of a DTD, that is, that the document is structurally correct.
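The parser's structural check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses Python's XML parser rather than a real SGML parser, and the `ALLOWED_CHILDREN` table, the element names, and the `check_structure` function are hypothetical stand-ins for a DTD and a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model standing in for a DTD: each element name maps
# to the child element names it may contain.
ALLOWED_CHILDREN = {
    "book": {"title", "chapter"},
    "chapter": {"header", "paragraph"},
    "title": set(),
    "header": set(),
    "paragraph": set(),
}


def check_structure(element, errors=None):
    """Collect structural rule violations for an element tree, recursively."""
    if errors is None:
        errors = []
    allowed = ALLOWED_CHILDREN.get(element.tag)
    if allowed is None:
        errors.append(f"unknown element <{element.tag}>")
        return errors
    children = list(element)
    for child in children:
        if child.tag not in allowed:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        check_structure(child, errors)
    # Example ordering rule: a chapter must begin with its header.
    if element.tag == "chapter" and (not children or children[0].tag != "header"):
        errors.append("<chapter> must start with <header>")
    return errors


# Two made-up document instances: one conforming, one not.
GOOD = "<book><title>T</title><chapter><header>H</header><paragraph>P</paragraph></chapter></book>"
BAD = "<book><chapter><paragraph>P</paragraph><header>H</header></chapter><memo/></book>"

good_errors = check_structure(ET.fromstring(GOOD))
bad_errors = check_structure(ET.fromstring(BAD))
print(good_errors)
print(bad_errors)
```

The first document produces no errors. The second is reported for a chapter that does not start with its header and for an element the content model does not know.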

Documents can be ported to different formats for different output media (printer, screen, CD-ROM, speaker, TV).

Style is usually handled separately by style sheets, like Cascading Style Sheets (CSS).

HTML (first version in 1992) is a tagging language for the World Wide Web, used for text formatting and linking documents. It adopts the syntax of SGML and is an application of SGML described by a particular DTD. HTML is not an extensible language: authors cannot add their own tags. HTML supports style sheets written in the CSS language to define the look and layout (color, font, page layout) of text and other material on web pages.

HTML can embed scripts written in languages such as JavaScript which affect the behavior of HTML web pages.

The World Wide Web Consortium (W3C), maintainer of both the HTML and the CSS standards, has encouraged the use of CSS over explicit presentational HTML since 1997.

HTML5 – a cross-platform version aimed at mobile applications, with support for more media and file types. Work started in 2008, and W3C published it as a Recommendation in October 2014. http://www.w3schools.com/html/html5_intro.asp

Potential of HTML5: http://learningcircuits.blogspot.no/2011/12/what-do-we-mean-when-we-say-html5.html

Element reference list: http://www.w3schools.com/tags/

Positive comments on HTML

HTML uses tags to separate content (text) from format (structure, appearance).

It lets amateurs control markup (good and bad).

HTML tags were used for appearance formatting, but little attention was paid to content structuring.

Negative comments on HTML

HTML did not offer enough custom control over the WYSIWYG environment.

Things looked different in different browsers (reader interpreted, not author interpreted).

Navigating through hypertext requires user memory.

Designing hypertext (document collections) for easy searching is hard to do.

Comments on CSS

Cascading Style Sheets helped HTML by freeing tags like <font> and <b> from carrying format information; that information goes into the style sheet instead.

It lets tags like <header> carry structure information.

CSS is a styling tool that can work with other markup languages like XML.

Current version is CSS3.

Comments on CSS (continued)

The Document

Formatting: structure, appearance

Content: information, data

Structure – HTML does this a little bit.

Appearance – or presentation. Before, HTML did this with tags like <b>, but now all appearance control should be taken out of HTML documents and put in CSS or XSL files.

XML

XML (Extensible Markup Language, XML 1.0, 1998) is also a meta language in that it describes other languages. There is no predefined list of elements.

Elements are specified using a DTD or Schema. Style sheets (XSL) can also be used to specify the output format of each element.

XML is based on SGML but is a subset of it and is considered easier to program. Most current browsers can display XML.

More on XML in a later lecture.

How do search engines work?

They create directories in different ways:

Human powered directories: In 2001, Yahoo depended on humans for its listings. You submitted a description of your site, and a search looked for matches in the submitted descriptions. Changing your web page had no effect on your listing. You could get reviewed by others if you were a good site.

Crawler based search engines: Most search engines are of this type. They create listings automatically. Indexes change periodically when the crawler is run again. The crawlers must (1) crawl through web pages, (2) make an index, and (3) rank the results.

Hybrid search engines: use both humans and crawlers to produce directories or listings.
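The three crawler steps (crawl, index, rank) can be illustrated with a toy example. This is a minimal sketch, not how any particular search engine works: the `PAGES` dictionary is a made-up in-memory "web", the stop word list is arbitrary, and ranking by raw word counts is an assumption chosen for simplicity.

```python
import re
from collections import Counter, defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.html": ("smart data for smart search", ["b.html", "c.html"]),
    "b.html": ("data about data is metadata", ["c.html"]),
    "c.html": ("search engines rank pages", []),
}
STOP_WORDS = {"for", "about", "is"}  # arbitrary list for this example


def crawl(start):
    """Step 1: follow links from a start page and collect reachable pages."""
    seen, queue = [], [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.append(url)
        queue.extend(PAGES[url][1])
    return seen


def build_index(urls):
    """Step 2: map each word to the pages containing it, with counts."""
    index = defaultdict(Counter)
    for url in urls:
        for word in re.findall(r"[a-z]+", PAGES[url][0].lower()):
            if word not in STOP_WORDS:
                index[word][url] += 1
    return index


def search(index, query):
    """Step 3: rank pages by how often the query words occur in them."""
    scores = Counter()
    for word in query.lower().split():
        scores.update(index.get(word, {}))
    return [url for url, _ in scores.most_common()]


index = build_index(crawl("a.html"))
results = search(index, "smart data")
print(results)
```

Here `a.html` ranks above `b.html` for the query smart data because it contains three occurrences of the query words against two.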

Search engine features

Crawling features: deep crawling, frame support, image maps, robots, meta tags, link popularity, paid inclusion

Indexing features: full body text, stop words, meta descriptions, meta keywords, ALT text, comments, stemming

Ranking features: meta tags boost ranking, link popularity boosts ranking, direct hit boosts ranking.

Spam features: meta refresh (target pages take visitors automatically to other pages in a web site), invisible text (text is the same color as the background), tiny text.
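Several of the indexing features listed above (meta descriptions, meta keywords, ALT text) live in specific HTML attributes. The sketch below shows one way an indexer might pull them out, using Python's standard `html.parser` module. The class name `IndexableTextParser` and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class IndexableTextParser(HTMLParser):
    """Collect meta description/keywords and image ALT text from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") in ("description", "keywords"):
            self.meta[attributes["name"]] = attributes.get("content", "")
        elif tag == "img" and attributes.get("alt"):
            self.alt_texts.append(attributes["alt"])


# Made-up page used only for this example.
SAMPLE_PAGE = """
<html><head>
<meta name="description" content="Lecture notes on markup">
<meta name="keywords" content="HTML, SGML, XML">
</head><body>
<img src="logo.png" alt="Course logo">
</body></html>
"""

parser = IndexableTextParser()
parser.feed(SAMPLE_PAGE)
print(parser.meta)
print(parser.alt_texts)
```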

Meaning of full text search types

Keyword search - Accepts a list of words as criteria and matches a document that contains any of the words. E.g. a keyword search for smart data matches a document that contains either smart or data.

Boolean search - Accepts a Boolean expression that states rules for the presence or absence of words in a document. Matches a document in which the required words are present and the forbidden words are absent. For example, a Boolean search for smart & data matches only documents that contain both smart and data.

Phrase search - Accepts a list of words as criteria and matches a document that contains the words in the stated order as a complete phrase. For example, a phrase search for smart data matches only documents that contain the complete phrase smart data.

Proximity search - Accepts a list of words as criteria and matches a document that contains the words in any order in close proximity. For example, a proximity search for smart data would match a document that contains the phrase the data is smart.

Fuzzy search - Also known as pattern search. Tunes one of the previous search strategies by matching slight variations on the words in the criteria list. For example, a fuzzy phrase search for smart data could match a document that contains the variant phrase a smart datum.

Ranking - Also known as weighting. In a fuzzy search, determines the relevance of the document based on the similarity of the match to the criteria. Documents with a higher ranking appear earlier in the result list. For example, a fuzzy phrase search for smart data would rank a document containing the exact phrase smart data higher than a document containing the variant phrase a smart datum.

Stop Words - Also known as noise words. Words that should be ignored in matches, such as a, the, some, and other articles and prepositions. For example, if in and the are stop words, a phrase search for smart data would match a document that contains the phrase smart in the data.

Synonyms - Also known as a thesaurus. Words that are equivalent for the documents in the repository. For example, if smart and intelligent are synonyms, a phrase search for smart data would match a document that contains the phrase intelligent data.
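The search types defined above can be compared side by side in a short Python sketch. It is a simplification under stated assumptions: documents are plain strings, the function names, the stop word list, and the proximity window of four words are all hypothetical choices, and the fuzzy match relies on the similarity cutoff of the standard `difflib` module rather than on any particular engine's rules.

```python
import difflib
import re

STOP_WORDS = {"a", "the", "in", "is"}  # arbitrary list for this example


def tokenize(text, drop_stop_words=False):
    """Lowercase a text and split it into words, optionally dropping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words


def keyword_match(query, doc):
    """Keyword search: the document contains ANY query word."""
    return bool(set(tokenize(query)) & set(tokenize(doc)))


def boolean_and_match(query, doc):
    """Boolean AND search: the document contains ALL query words."""
    return set(tokenize(query)) <= set(tokenize(doc))


def phrase_match(query, doc, drop_stop_words=False):
    """Phrase search: the query words appear consecutively and in order."""
    q = tokenize(query, drop_stop_words)
    d = tokenize(doc, drop_stop_words)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


def proximity_match(query, doc, window=4):
    """Proximity search: all query words occur, in any order, within `window` words."""
    q = set(tokenize(query))
    d = tokenize(doc)
    return any(q <= set(d[i:i + window]) for i in range(len(d)))


def fuzzy_word_match(word, doc, cutoff=0.6):
    """Fuzzy search: some document word is a close variant of `word`."""
    return bool(difflib.get_close_matches(word, tokenize(doc), n=1, cutoff=cutoff))


print(keyword_match("smart data", "big data"))
print(boolean_and_match("smart data", "big data"))
print(phrase_match("smart data", "smart in the data", drop_stop_words=True))
print(proximity_match("smart data", "the data is smart"))
print(fuzzy_word_match("data", "a smart datum"))
```

The examples reuse the smart data phrases from the definitions: the keyword search matches big data while the Boolean AND search does not, and the phrase search matches smart in the data only once stop words are dropped.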

Other search engines besides Google

Examples of search engines:

AltaVista (Now Yahoo!)

HotBot

NorthernLight

Excite

Search Engine User Interface

Many search engines have advanced features that the general searcher does not know how to use. The most commonly used features are quotation marks and capitalization. (Show example case study in class.) Important issues:

Query Interface: different by engine. In AltaVista (Yahoo!) a sequence of words is a logical union. In HotBot it is an intersection.

Interface for complex queries: Boolean, phrase, proximity, wild cards, filtering, special qualifications via date, language, url, title, internet domain, file types.

Response Interface: 10 entries per page. Entry contains information on: url, size, date indexed, some text.

Return options: the number of pages returned, maybe sorting by url or date.

Crawling the Web

A ranking algorithm like PageRank can be used to rank the relevancy of documents in a hit set. This algorithm can be used to decide which page to visit next by web crawler programs. Crawlers can traverse up to 10 million Web pages per day.

Traversal approaches:

Breadth first is to look at all pages linked from the current page, then at all pages linked from those pages, and so on.

Depth first is to follow the first link on the page and successive pages and return up recursively. This is a narrow but deep search.
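The two traversal orders are easy to compare on a small link graph. The sketch below is illustrative only: `LINKS` is a hypothetical site map, and a real crawler would fetch pages over the network instead of reading a dictionary.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to, in on-page order.
LINKS = {
    "home": ["news", "about"],
    "news": ["story1", "story2"],
    "about": ["staff"],
    "story1": [],
    "story2": [],
    "staff": [],
}


def breadth_first(start):
    """Visit all pages one link away, then two links away, and so on."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def depth_first(start, seen=None):
    """Follow the first link as deep as possible, then back up recursively."""
    if seen is None:
        seen = set()
    seen.add(start)
    order = [start]
    for target in LINKS.get(start, []):
        if target not in seen:
            order.extend(depth_first(target, seen))
    return order


print(breadth_first("home"))
print(depth_first("home"))
```

Breadth first reaches both top-level sections (news, about) before any story page, while depth first finishes everything under news before it visits about.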

Crawlers can use much bandwidth. Priorities and restrictions might be set on their use.

Crawlers are also referred to as Spiders.

Ranking: how is it used by search engines

Criteria: location, frequency, meta tags, number of web pages indexed, spamming controls, "off the page" link analysis, click-through ratings

The difference between the Web and DBMS ranking is that the Web ranking can use hyperlink information. It can use the number of links coming into a site, or the number of outward pointing links to other sites.

Authorities are pages that have many links pointing to them. They are likely to be good sources of information on the searched topic. The number of inward reference links can indicate the popularity of the site, and perhaps this is reflective of the quality of the information there.

Hubs are pages that have many links outgoing to other servers. They point to pages with similar or related information. Better authority pages come from incoming edges from good hubs. Better hub pages come from outgoing edges to good authorities.
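The mutual reinforcement between hubs and authorities described above is the idea behind the algorithm commonly known as HITS. The following is a minimal sketch of that iteration, assuming a tiny hypothetical link graph (`LINKS`) and a fixed number of iterations; the function names and the graph are illustrative, not taken from any search engine.

```python
import math

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    "hub3": ["auth1"],
    "auth1": [],
    "auth2": [],
}


def normalize(scores):
    """Scale scores to unit length so they stay bounded between iterations."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=20):
    """Return (hub, authority) scores after a fixed number of update rounds."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = normalize(
            {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        )
        # Hub score: sum of the authority scores of pages linked to.
        hub = normalize({p: sum(auth[t] for t in links.get(p, [])) for p in pages})
    return hub, auth


hub, auth = hits(LINKS)
print(sorted(auth, key=auth.get, reverse=True)[:2])
print(sorted(hub, key=hub.get, reverse=True)[:2])
```

In this graph `auth1` ends up with the highest authority score because all three hubs point to it, and `hub1` and `hub2` score higher as hubs than `hub3` because they point to both authorities.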

How users can improve searching on the web

Given the User Problems

The user does not understand the meaning of searching.

The user does not know the rules (case, stemming) used by the search engine and gets unexpected answers.

Users have problems with Boolean logic.

Users find the engines slow and the answer sets too large, not very relevant, and not up to date.

Techniques for Users to improve information retrieval

Start with a relevant page and use the keywords from that page.

Use authors' personal Web pages.

Pages on the topic already contain relevant references and links.

Use web directories to select a category for a starting point.

Use search engines to improve the query formulation on a relevant set of answers.

On Web Query Languages: Structured searches (using SQL-type queries) only work on domains where the data is structured.

Size of the Web

http://www.netcraft.com/Survey/ (Netcraft survey)
