From Memex to Google in 120 minutes Rivka Taub Amit Levin
Dec 19, 2015
“As We May Think”
By Vannevar Bush
A Paper that talks about the Future
Vannevar Bush: Biography
Vannevar Bush (1890-1974)
* Was born in Massachusetts
* Studied engineering at Tufts College
* Earned his bachelor’s and master’s degrees in 1913
* Earned his doctorate in engineering in 1917
* In 1919, Bush joined MIT’s electrical engineering department
and stayed there for 25 years
* Completed the differential analyzer in 1931
* During the 1930s, worked on technology for document retrieval
and information organization (used microfilm)
* In 1938, designed and built the microfilm rapid selector,
rumored to have been used for cryptanalysis during WWII
* Was the planner and chairman of a committee that brought
together government, military, business and scientists (NDRC)
* Supervised the Manhattan Project, which developed the first
atomic bomb
* In reply to President Roosevelt’s request for post-war direction,
published the articles “As We May Think” (1945) and “Science,
the Endless Frontier” (1945)
* Served as the chairman of the MIT Corporation
* Continued pushing for analog computers, even as digital computers
rose to prominence
Bush’s Vision
Organizing information:
by science, for science
The Record - Technological Predictions
Storage:
* Improved microfilm
Acquisition:
* Dry photography
* Dictation technology
* Head-mounted camera
Technological Predictions - The Record
Retrieval:
* Microfilm rapid selector
Calculation and Automation:
* Machines will manipulate and analyze data
* Calculation of “advanced math” and logical thought
Microfilm Rapid Selector
* Microfilm storage was popular
during the 1920s and 1930s
* The problem: Selecting documents
* Option: punched cards. But they are too
slow, and they retrieve only the address of the
document, not the document itself
* Goal: a system that combines
documents and index
The Memex
“A memex is a device in which an individual stores all
his books, records, and communications, and which is
mechanized so that it may be consulted with
exceeding speed and flexibility. It is an enlarged
intimate supplement to his memory” (“As We May Think”, 1945)
The Memex - Features
* Storage on microfilm
* Workstation for stored documents and for projection
* Option to add new images
* Option to add personal comments to a document
* Retrieval by document and code
So, What’s New?
* Associative annotation and selection: “trails”
* Imitation of the human brain
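As a rough modern analogy only (Bush's design relied on microfilm and mechanical coding, not software), a trail can be pictured as a named, ordered sequence of documents that a user links by association. The class and method names below are hypothetical and chosen for illustration; the bow example echoes the one Bush uses in the essay.

```python
from collections import defaultdict


class Memex:
    """Toy model of associative trails; all names here are illustrative."""

    def __init__(self):
        self.documents = {}              # doc_id -> text
        self.trails = defaultdict(list)  # trail name -> ordered doc_ids

    def store(self, doc_id, text):
        self.documents[doc_id] = text

    def link(self, trail, doc_id):
        """Append a stored document to a named trail."""
        if doc_id not in self.documents:
            raise KeyError(doc_id)
        self.trails[trail].append(doc_id)

    def follow(self, trail):
        """Return the documents of a trail in the order they were linked."""
        return [self.documents[d] for d in self.trails[trail]]

    def trails_through(self, doc_id):
        """Return the names of all trails that pass through a document."""
        return sorted(t for t, docs in self.trails.items() if doc_id in docs)


memex = Memex()
memex.store("bow-1", "The Turkish short bow")
memex.store("bow-2", "The English long bow")
memex.store("phys-1", "Elasticity of materials")
for doc in ("bow-1", "phys-1", "bow-2"):
    memex.link("bows", doc)
memex.link("physics", "phys-1")
print(memex.follow("bows"))
print(memex.trails_through("phys-1"))
```

The same document can sit on several trails, which is what distinguishes this from a single hierarchical index.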
From Memex to Hypertext
“The 1987 Hypertext conference: The influence of
Bush’s essay ‘As We May Think’ on the emerging field
of hypertext was widely acknowledged” (“From Memex
to Hypertext”, Nyce & Kahn, 1991)
“To a large part we have MEMEXes on our desks today… a web browser with an editor gives quite a good substitute for a MEMEX.” (Berners-Lee, talk at the Bush symposium, MIT, 1995)
BUT…
* Emanuel Goldberg’s statistical machine: a microfilm
selector. A US patent was issued in 1931.
* Paul Otlet, 1934: “Traité de Documentation”.
Described a workstation for scholars that enables them to read,
write, and select documents. Scholars can connect
documents. Coined the term ‘link’.
The Memex - Critique
* Trails are artificial, not an objective measure
* Every user has his own Memex; no networking
* Bush predicted the effect of the record on
laboratory research, law, and business accounting,
not on the “ordinary person”
Internet and WWW
The Birth of the Internet and the WWW
* 1969: The Advanced Research Projects Agency
(ARPA) prepared a plan for the United States to
maintain control over its missiles and bombers after a
nuclear attack. Through this work the Internet was
born.
* Almost 20 years after the birth of the Internet, the
World Wide Web was born to allow the public
exchange of information on a global basis. It was built
on the backbone of the Internet
A Brief History of Search Engines
* WWWW (1993): Indexed titles and URLs. Listed results in the order it found them.
* Excite (1993): Used statistical analysis of word relationships to make searching more efficient.
* Yahoo (1994): A collection of favorite websites that became a searchable directory. It provided a description with each URL.
* WebCrawler (1994): Indexed entire web pages. Was bought in 1997 by Excite.
* Lycos (1994): Provided ranked relevance retrieval and prefix matching.
* Alta Vista (1995): Had nearly unlimited bandwidth (for that time), allowed natural language queries and advanced searching techniques, and allowed users to add or delete their own URL within 24 hours.
“The Anatomy of a Large-Scale Hypertextual Web Search Engine”
By S. Brin and L. Page
Google
* Was born at Stanford University
* Was launched in 1998
* Main goal: high-quality search
Quality = Relevance
Obstacles
Web:
* Scalability of the web and a growing number of
queries
* No control over what is published on the web:
a heterogeneous collection
Search Engines:
* Textual search returns many ‘junk results’ (e.g., a
search engine that does not return itself among the
top 10 results)
* Commercial search engines: loss of relevance
* Spam
How Google Achieves Quality Search
It makes use of hypertextual information. In
particular, it utilizes:
1. The link structure of the web, to calculate a quality ranking for each web page (PageRank)
2. Anchor text, associated with the page it points to: improves search results and makes it possible to return results that are not text-based (e.g., images)
3. Other features, such as proximity and visual presentation details (e.g., font size)
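To make the first point concrete, here is a minimal sketch of the PageRank iteration as given in the Brin and Page paper, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A, C(T) is the number of links going out of T, and d is a damping factor (the paper suggests 0.85). The toy graph, the fixed iteration count, and the handling of pages without outgoing links are assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for source, targets in links.items():
            if not targets:
                continue  # assumption: pages with no out-links pass on no rank
            share = d * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: D links to C, but nothing links to D.
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C, which collects links from three pages, ends up with the highest rank, while D, which nobody links to, keeps only the baseline (1 - d).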
Google’s Architecture
Major functions:
1. Crawling
2. Indexing
3. Ranking
4. Searching
URL Server:
* Sends lists of URLs to crawlers
Crawler:
* Downloads web pages
Store Server:
* Compresses & stores web pages into the repository
Indexer:
* Reads the repository & uncompresses the documents
* Parses the documents
* Creates the forward index
* Parses out the links
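The indexing step can be sketched roughly as follows. This is a simplified illustration, not Google's actual data layout: the repository is modeled as a dictionary of zlib-compressed pages, the tokenizer is a plain regular expression, and the function and variable names are hypothetical.

```python
import re
import zlib


def build_forward_index(repository):
    """Return (lexicon, forward_index) for a repository of compressed pages.

    repository maps docID -> zlib-compressed page text (a stand-in for the
    real repository format).
    """
    lexicon = {}        # word -> wordID
    forward_index = {}  # docID -> {wordID: [positions of the word]}
    for doc_id, blob in repository.items():
        text = zlib.decompress(blob).decode("utf-8")
        hits = {}
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.setdefault(word_id, []).append(position)
        forward_index[doc_id] = hits
    return lexicon, forward_index


repository = {
    1: zlib.compress("Memex trails link documents".encode("utf-8")),
    2: zlib.compress("Search engines index documents and links".encode("utf-8")),
}
lexicon, forward_index = build_forward_index(repository)
print(forward_index[1])
```

The forward index is keyed by document, so it answers "which words occur in this page"; answering "which pages contain this word" needs the inverted form.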
URL Resolver:
* Converts relative URLs from the anchors file to absolute URLs and then to docIDs
* Generates a database of links
* Puts the anchor text into the forward index
Sorter:
* Generates the inverted index
Searcher:
* Answers queries
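The link-resolution and sorting steps above can be sketched as follows. The record layout of the anchors file, the docID assignment scheme (dense IDs starting at 0), and the function names are assumptions for illustration; only `urljoin` is a real standard-library call.

```python
from collections import defaultdict
from urllib.parse import urljoin


def resolve_links(anchors, doc_ids):
    """Turn (source URL, href, anchor text) records into docID pairs.

    doc_ids maps URL -> docID and is assumed to hold dense IDs 0..n-1;
    URLs not seen before receive the next free ID.
    """
    links = []                       # (source docID, target docID)
    anchor_text = defaultdict(list)  # target docID -> anchor texts
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)  # relative -> absolute
        source_id = doc_ids.setdefault(source_url, len(doc_ids))
        target_id = doc_ids.setdefault(target_url, len(doc_ids))
        links.append((source_id, target_id))
        anchor_text[target_id].append(text)
    return links, dict(anchor_text)


def invert(forward_index):
    """Sorter step: docID -> wordIDs becomes wordID -> sorted docIDs."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word_id in set(forward_index[doc_id]):
            inverted[word_id].append(doc_id)
    return dict(inverted)


anchors = [
    ("http://example.com/a/index.html", "../b/page.html", "page b"),
    ("http://example.com/b/page.html", "/a/index.html", "home"),
    ("http://example.com/a/index.html", "http://example.org/", "elsewhere"),
]
doc_ids = {}
links, anchor_text = resolve_links(anchors, doc_ids)
inverted = invert({0: [10, 11], 1: [11], 2: [10, 12]})
print(links)
print(inverted)
```

Because the anchor text is stored under the target's docID, a page can be described, and later found, by words that never appear on the page itself.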
Crawling the Web
Searching the Web
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for
every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
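The eight steps can be sketched in simplified form as follows. This is an illustration under stated assumptions: barrels are modeled as dictionaries from wordID to a sorted list of docIDs, the ranking function is a hypothetical stand-in passed in by the caller, and both barrels are always scanned in full, with no early cutoff.

```python
def intersect(doclists):
    """Yield docIDs present in every sorted doclist (step 4)."""
    positions = [0] * len(doclists)
    while all(p < len(dl) for p, dl in zip(positions, doclists)):
        current = [dl[p] for p, dl in zip(positions, doclists)]
        highest = max(current)
        if all(c == highest for c in current):
            yield highest
            positions = [p + 1 for p in positions]
        else:
            # Advance every doclist that is behind the highest candidate.
            positions = [p + 1 if c < highest else p
                         for p, c in zip(positions, current)]


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    # Steps 1-2: parse the query and convert words into wordIDs.
    word_ids = []
    for word in query.lower().split():
        if word not in lexicon:
            return []  # a word without a wordID cannot match any document
        word_ids.append(lexicon[word])
    if not word_ids:
        return []
    # Steps 3-7: scan the short barrels first, then the full barrels.
    matches = {}
    for barrel in (short_barrel, full_barrel):
        doclists = [barrel.get(w, []) for w in word_ids]
        for doc_id in intersect(doclists):
            if doc_id not in matches:
                matches[doc_id] = rank(doc_id, word_ids)  # step 5
    # Step 8: sort the matches by rank and return the top k.
    return sorted(matches, key=lambda d: (-matches[d], d))[:k]


lexicon = {"memex": 0, "trails": 1}
short_barrel = {0: [2, 5], 1: [5]}
full_barrel = {0: [1, 2, 5, 7], 1: [1, 5, 7, 9]}
scores = {1: 0.2, 5: 0.9, 7: 0.4}  # made-up ranks for the toy documents


def toy_rank(doc_id, word_ids):
    return scores[doc_id]


top = search("memex trails", lexicon, short_barrel, full_barrel, toy_rank, k=2)
print(top)
```

Keeping the doclists sorted by docID is what allows the intersection to run as a single merge pass over the lists.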
The Ranker
* Uses hit lists, anchor text hits and PageRank
* Types of hits: title, anchor, URL, plain text small font…
Vectors:
* Type-weight vector, indexed by hit type, for single-word queries
* Type-prox-weight vector, for multiple-word queries
* Count-weight vector
* IR score is the dot product of the count-weight and the
type-weight vectors
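For a single-word query, the dot product can be sketched as follows. The hit types follow the slide, but the numeric type-weights and the capping rule used for count-weights are invented for illustration; the paper only states that count-weights grow linearly at first and then taper off.

```python
from collections import Counter

HIT_TYPES = ("title", "anchor", "url", "plain")
# Assumed weights; the real values are not given in the source.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}


def ir_score(hit_list, cap=5):
    """Dot product of the count-weight and type-weight vectors.

    hit_list holds the hit type of every occurrence of the query word in
    one document. Counts above `cap` add nothing (assumed tapering rule).
    """
    counts = Counter(hit_list)
    count_weights = [min(counts[t], cap) for t in HIT_TYPES]
    type_weights = [TYPE_WEIGHTS[t] for t in HIT_TYPES]
    return sum(c * w for c, w in zip(count_weights, type_weights))


hits = ["title", "anchor"] + ["plain"] * 12
print(ir_score(hits))
```

Capping the counts means that repeating a word many times in plain text cannot outweigh a single hit in a title or in anchor text.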
What we saw so far:
Bush: Memex, Hypertext, Goldberg, Otlet
Google: Goal, Obstacles, How to achieve
quality, Architecture