In Search Of... Ian Barber @ianbarber http://phpir.com [email protected] integrating site search Friday, 29 October 2010
May 17, 2015
In Search Of...
Ian Barber@ianbarber
http://[email protected]
integrating site search
Friday, 29 October 2010
2
How Search WorksIntegrating SearchImproving Results
Using SearchSearch Performance
Questions
Friday, 29 October 2010
3Friday, 29 October 2010
4
Index
DocumentDocumentDocumentDocumentAnalyser
Query Parser
QueryQueryQueryQuery
ResultResultResultResult
Friday, 29 October 2010
5
With AT&T’s help, the F.B.I Miami-Dade office had recovered $1.1 million from O’Healy’s Ponzi scheme, 10-15% more than expected.
Tokenisation
“”
Friday, 29 October 2010
6
PHP Tokenisation
function tokenise($string) { $string = strtolower($string); preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE); return $matches[0];}
Friday, 29 October 2010
7
Document Term PairsDocument ID Term
1 the 1 best1 of1 the ... ...
204 and 204 what204 would
Friday, 29 October 2010
8
Inverted IndexTerm Documents
best 1 (4, 16), 4 (422), 129 (344) ...
what 24 (50, 98), 75 (33, 208) ...
would 99 (32, 599), 201 (344) ..
... ...
Friday, 29 October 2010
9
Boolean Query MergeQuery: Best Western Hotel
Result: Document 298
best 1 4 129 298 305 338western 4 95 194 204 298 305
hotel 2 40 200 298 355 402working 4 298 305
Friday, 29 October 2010
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus egestas non. Quisque eu purus ut lacus egestas dapibus. Integer in velit id est dictum bibendum in id mi.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh.
Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus
Friday, 29 October 2010
11
TF-IDF
function getWeight($docID, $term, $total) { $tf = count($term[$docID]); $idf = log($total / count($term), 2); return $tf * $idf;}
Friday, 29 October 2010
12
Document Vector
socket what heavy steel ...
Doc 1 0.02 0.3 0.001 0 ...
Doc 2 0 0 0 0 ...
Doc 3 0.001 0.2 0 0 ...
Doc 4 0 0 0.002 0.003 ...
Friday, 29 October 2010
best 23 42 179 246 333 703
weight 0.008 0.002 0.023 0.039 0.014 0.001
western 42 88 120 179 246 798
weight 0.003 0.004 0.023 0.001 0.034 0.004
1 - 246: 0.0732 - 179: 0.0243 - 120: 0.023
Ranked Query Merge
13Friday, 29 October 2010
14
PHP Similarityfunction score($queryString, $index) { $query = tokenize($queryString); $matches = array(); foreach($query as $qterm) { $postings = $index[$qterm]; foreach($postings as $id => $posting) { $matches[$id] += $posting['score']; } } return arsort($matches);}
Friday, 29 October 2010
15
Integrating SearchFriday, 29 October 2010
16
CREATE TABLE example ( id INT(11) NOT NULL auto_increment, title VARCHAR(255), content TEXT, PRIMARY KEY(id), FULLTEXT(title,content)) Engine=MyISAM;
INSERT INTO example (title, content) VALUES ('Mikko & Bacon','Mikko loves bacon'),('Marcello & Bacon','Marcello hates bacon'),('Jo & Sausages','Johanna loves sausages'),('Hollywood & Garlic','Lorenzo hates garlic'),('James & Cheddar','James is keen on cheeses');
MySQL Full Text Search
Friday, 29 October 2010
17
MySQL FTI QuerySELECT * FROM example WHERE MATCH(title,content) AGAINST('loves bacon');
+----+------------------+------------------------+| id | title | content |+----+------------------+------------------------+| 1 | Mikko & Bacon | Mikko loves bacon | | 2 | Marcello & Bacon | Marcello hates bacon | | 3 | Jo & Sausages | Johanna loves sausages | +----+------------------+------------------------+3 rows in set (0.00 sec)
Friday, 29 October 2010
18
Sphinx http://www.sphinxsearch.com
Friday, 29 October 2010
19
Sphinx Configurationsource posts{ type = mysql sql_host = localhost sql_user = user sql_pass = password sql_db = search
sql_query = \ SELECT id, title, content FROM example; sql_attr_multi = uint tag from query; \ SELECT example_id, tag_id FROM tags;}
Friday, 29 October 2010
20
index posts{ source = posts path = /var/data/sphinx/example morphology = stem_en
min_word_len = 3 min_prefix_len = 3 min_infix_len = 0 enable_star = 1}
Friday, 29 October 2010
21
Stemming
happeninghappenedhappens
http://tartarus.org/~martin/PorterStemmer
- happen- happen- happen
Friday, 29 October 2010
22
Command Line Searchingindexer --config /etc/sphinx.conf --allsearch --config /etc/sphinx.conf love bacon
displaying matches:1. document=1, weight=3, tag=(1,2)! id=1! title=Mikko & Bacon! content=Mikko loves baconwords:1. 'love': 2 documents, 2 hits2. 'bacon': 2 documents, 4 hits
searchd --config /etc/sphinx.conf
Friday, 29 October 2010
23
Sphinx From PHP
$cl = new SphinxClient();$cl->SetServer('localhost', 3312);$cl->SetMatchMode(SPH_MATCH_ANY);
$result = $cl->Query('bac*');$docIDs = array_keys($result["matches"]);
$cl->SetFilter('tag', array(1));$result = $cl->Query('bac*');$docIDs = array_keys($result["matches"]);
Friday, 29 October 2010
24
Swish-E . http://swish-e.org
pecl install swish-beta
Friday, 29 October 2010
25
Filesystem Index With Swish-E
IndexDir /var/data/documentsIndexFile fs-swish-e.indexIndexOnly .doc .docx .pdfFuzzyIndexingMode Stemming_en1
FileFilter .pdf /usr/local/bin/swish_filter.plFileFilter .doc /usr/local/bin/swish_filter.pl
fs-swish-e.conf
/usr/local/bin/swish-e -S fs -c fs-swish-e.conf
Friday, 29 October 2010
26
Crawling Content
IndexDir /usr/local/lib/swish-e/spider.plIndexFile www-swish-e.indexSwishProgParameters default http://phpir.com/
FuzzyIndexingMode Stemming_en1DefaultContents HTML
www-swish-e.conf
/usr/local/bin/swish-e -S prog -c www-swish-e.conf
Friday, 29 October 2010
27
Swish-E With Multiple Indices$swish = new Swish( 'www-swish-e.index fs-swish-e.index');$search = $swish->prepare();
$queryStr = 'search string goes here';$result = $search->execute($queryStr);$total = $result->hits;
while($r = $result->nextResult()) { echo $r->swishdocpath; // url}
Friday, 29 October 2010
28
Lucene
Friday, 29 October 2010
29
$index = Zend_Search_Lucene::create('idx');foreach($documents as $title => $content) { $doc = new Zend_Search_Lucene_Document(); $doc->addField( Zend_Search_Lucene_Field::Text( 'title', $title)); $doc->addField( Zend_Search_Lucene_Field::UnStored( 'content', $content)); $index->addDocument($doc);}
Build Index
Friday, 29 October 2010
30
$results = $index->find('loves bacon');foreach($results as $result) { echo $result->score, " "; echo $result->title, "\n";} Output: 0.81656279309067 Mikko and Bacon0.24800278854758 Marcello & Bacon
Query Zend Search Lucene
Friday, 29 October 2010
31
$file = file_get_contents($url);
$doc = Zend_Search_Lucene_Document_Html:: loadHTML($file);
$doc->addField( Zend_Search_Lucene_Field::Text( 'url', $url);$index->addDocument($doc)
Index HTML
Friday, 29 October 2010
32
Solr http://lucene.apache.org/solr/
Friday, 29 October 2010
33
Solr Search Index$options = array( 'hostname' => 'localhost', 'port' => 8983 );
$client = new SolrClient($options);$doc = new SolrInputDocument();$doc->addField('id', $id);$doc->addField('cat', $category);$doc->addField('title', $title);$doc->addField('text', $text);$response = $client->addDocument($doc);$client->commit();
Friday, 29 October 2010
34
Solr Search Client$client = new SolrClient($options);
$query = new SolrQuery('bacon');$response = $client->query($query);$r = $response->getResponse();
foreach($r['response']['docs'] as $d) { echo $d->title[0] . "\n";}
Friday, 29 October 2010
35
Xapian .
http://xapian.org
Friday, 29 October 2010
36
Xapian In PHP$db = new XapianWritableDatabase( 'idx', Xapian::DB_CREATE_OR_OPEN);$i = new XapianTermGenerator();$i->set_stemmer(new XapianStem("english"));
$doc = new XapianDocument();$doc->set_data($content);$doc->add_value(1, $title);
$i->set_document($doc);$i->index_text($content);$db->add_document($doc);
Friday, 29 October 2010
37
Xapian Search In PHP
$database = new XapianDatabase('idx');$enquire = new XapianEnquire($database);$qp = new XapianQueryParser();$qp->set_stemmer(new XapianStem("english"));$qp->set_database($database);$qp->set_stemming_strategy( XapianQueryParser::STEM_SOME);$query = $qp->parse_query($queryString);
$enquire->set_query($query);
Friday, 29 October 2010
38
$matches = $enquire->get_mset(0, 10);
$i = $matches->begin();while(!$i->equals($matches->end())) { $n = $i->get_rank() + 1; $data = $i->get_document()->get_data(); $title = $i->get_document()->get_value(1); $score = $i->get_percent(); $i->next();}
Friday, 29 October 2010
39
Improving Results
Friday, 29 October 2010
40
Anchor Text
Friday, 29 October 2010
41
$p = file_get_contents('http://phpir.com');
libxml_use_internal_errors(true);$dom = DomDocument::loadHTML($p);$links = $dom->getElementsByTagName('a');
foreach($links as $link) { $href = $link->getAttribute('href'); $text = $link->nodeValue;}
Parse Anchor Text
Friday, 29 October 2010
42
1
2
3
Zone WeightingFriday, 29 October 2010
43
ZSL Zone Weighting
$doc = new Zend_Search_Lucene_Document();
$tfield = Zend_Search_Lucene_Field::Text ('title', $title);$tfield->boost = 1.3;$doc->addField($tfield);
$doc->addField( Zend_Search_Lucene_Field::UnStored ('content', $content));
$index->addDocument($doc);
Friday, 29 October 2010
44
Document Authority
Friday, 29 October 2010
45
Document Weights in ZSL$doc = new Zend_Search_Lucene_Document();$doc->addField( Zend_Search_Lucene_Field::Text ('title', $title));$doc->addField( Zend_Search_Lucene_Field::UnStored ('content', $content));
$doc->boost = 1 + ($numComments / 100);
$index->addDocument($doc);
Friday, 29 October 2010
46
Using Search
Friday, 29 October 2010
47
Summaries & Highlighting
Friday, 29 October 2010
48
Sphinx Extract & Highlight$cl = new SphinxClient();$cl->SetServer( "localhost", 3312 );$q = 'bacon';$r = $cl->Query($q);foreach ($r["matches"] as $doc => $info) { $text[$doc] = getTextFromDB($doc);}
$e = $cl->BuildExcerpts($text, 'posts', $q);foreach($extracts as $extract) { echo $extract;}
Friday, 29 October 2010
Friday, 29 October 2010
50
Xapian Spelling Correction$indexer = new XapianTermGenerator();$indexer->set_database($database);$indexer->set_flags( XapianTermGenerator::FLAG_SPELLING);
Indexer
$queryString = "strreplace or str_cmp";$q = new XapianQueryParser();$q->set_database($database);$query = $q->parse_query($queryString, XapianQueryParser::FLAG_SPELLING_CORRECTION);echo "Did you mean: " . $q->get_corrected_query_string() . "\n";
Searcher
Friday, 29 October 2010
51
Spelling Correction Output php xapsearch.php
Did you mean: str_replace or strcmp
4644 results found for “strreplace or str_cmp”:1: 2% docid=572 [phpdocs/html/cc.license.html]2: 2% docid=7169 [phpdocs/html/imagick.constants.html]3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html]4: 2% docid=6132 [phpdocs/html/function.swf-posround.html]
Friday, 29 October 2010
52
Results Sorting
Friday, 29 October 2010
53
Sorting in ZSL
$q = Zend_Search_Lucene_Search_QueryParser:: parse('search string');
$results = $index->find($q, 'title');foreach($results as $result) { echo '<h3>', $result->title, "</h3>\n"; $doc = getDocumentFromDB($result->did); echo $q->htmlFragmentHighlightMatches($doc);}
Friday, 29 October 2010
54
Faceted Search
Friday, 29 October 2010
55
Faceted Search In Solr$client = new SolrClient($options);$query = new SolrQuery('bacon');$response = $client->query($query);$query->setFacet(true);$query->addFacetField('cat');$r = $response->getResponse();$f = $r['facet_counts']['facet_fields'];foreach($f['cat'] as $facet => $count) { echo $facet . " " . $count . "\n";}
Friday, 29 October 2010
56
More Like This
Friday, 29 October 2010
57
More Like This$rset = new XapianRset();$rset->add_document(5959); // str_replace$e = $enquire->get_eset(40, $rset);
$t = $e->begin();for($t; !$t->equals($e->end()); $t->next()){ $qs[] = new XapianQuery($t->get_term(), intval($t->get_weight()));}
$query = new XapianQuery( XapianQuery::OP_OR, $qs);
Friday, 29 October 2010
58
More Like This Example php xapsim.php
1656 results found:1: 100% docid=5959 [phpdocs/html/function.str-replace.html]2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html]3: 24% docid=5328 [phpdocs/html/function.preg-replace.html]4: 18% docid=5958 [phpdocs/html/function.str-repeat.html]
Friday, 29 October 2010
59
Search Performance
Friday, 29 October 2010
60
Index Updates
Docs
Main
New
Delta Delta Main
Query
Delta Main
Main
DocsDocsDocs
Friday, 29 October 2010
61
Search Speed$index = Zend_Search_Lucene::open('index');$index->optimize();
indexer --merge main delta --rotate
Zend Search Lucene
Sphinx
$client = new SolrClient($options);$client->optimize();
Solr
xapian-compact xapindex xapindex2Xapian
Friday, 29 October 2010
62
Distributing Search
Index
Application
Index Index
DocumentDocumentDocumentDocument
Friday, 29 October 2010
63
Large Scale Search
http://www.nutch.org
http://hadoop.apache.org
Friday, 29 October 2010
64
Image CreditsTitle http://www.flickr.com/photos/generated/2084287794/What Do You Want http://www.flickr.com/photos/the_justified_sinner/
2498066986/You Are Here http://www.flickr.com/photos/alecvuijlsteke/2692475420/Integrating Search http://www.flickr.com/photos/squeaks2569/3700355684/Sphinx http://www.flickr.com/photos/generated/2084287794/Lucene http://www.flickr.com/photos/mypanda/7731447/Swish-e http://www.flickr.com/photos/ryan_fung/2239687100/Solr http://www.flickr.com/photos/m-j-s/2724756177/Xapian http://www.flickr.com/photos/olibac/3522056495/Using Search http://www.flickr.com/photos/eneas/175027945/Improving Search http://www.flickr.com/photos/x-ray_delta_one/3928200642/Search Performance http://www.flickr.com/photos/maisonbisson/1634408/Large Scale Search http://www.flickr.com/photos/zedzap/3663508847/
Friday, 29 October 2010
Questions?
65Friday, 29 October 2010