Jun 26, 2015
Data Analysis @ Daum
Paul Kim ([email protected])
DevOn 2012
Search Quality
Search Quality = Satisfaction / Cost
How?
HTTP://WWW.GOOGLE.COM/ONCEUPONATIME/TECHNOLOGY/PIGEONRANK.HTML
Understanding Users with Logs
BIG DATA!
Data Analysis Process with Hadoop
? -> HADOOP -> FEATURES -> TOOLS -> !
TOOLS: SAS, R, WEKA, ETC
2 QUAD-CORES, 8GB RAM, 4TB HDD X 60 NODES
4 QUAD-CORES, 16GB RAM, 4TB HDD X 30 NODES
For example,
"The secret to cooking delicious ramyeon"
Most Viewed Posts
Mission: Reflect satisfying search experiences in the ranking
Target Data: Half a year of search logs (about 40TB)
Features:
Query - Collection Relationship
Query - Document - Session Relationship
Session - Query Relationship
Session - Document Relationship
GROUP-BY JOB X 4
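Each feature family listed above comes out of a group-by over the search logs. The following is a minimal single-machine sketch of one such job in the map/reduce style, not the deck's actual Hadoop code. The tab-delimited layout (`session_id`, `query`, `doc_id`, `clicked`) and the sample rows are assumptions made for illustration; the deck does not show the real log schema.

```python
from collections import defaultdict

# Hypothetical log layout (not from the deck):
# session_id <TAB> query <TAB> doc_id <TAB> clicked (0 or 1)
SAMPLE_LOG = [
    "s1\tramyeon recipe\td1\t1",
    "s1\tramyeon recipe\td2\t0",
    "s2\tramyeon recipe\td1\t1",
    "s3\tramyeon recipe\td2\t1",
    "s3\tsea story\td9\t0",
]


def map_record(line):
    """Map step: emit ((query, doc_id), (impressions, clicks)) for one log line."""
    _session_id, query, doc_id, clicked = line.split("\t")
    return (query, doc_id), (1, int(clicked))


def group_by(lines):
    """Reduce step: sum impressions and clicks per (query, doc_id) key."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        key, (impressions, clicks) = map_record(line)
        totals[key][0] += impressions
        totals[key][1] += clicks
    return {key: tuple(value) for key, value in totals.items()}


# Query - Document feature: (impressions, clicks) per (query, doc_id) pair.
query_doc_features = group_by(SAMPLE_LOG)
```

Changing the key returned by `map_record` (for example to `session_id` and `query`) would give the other relationship features with the same reduce step.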
Most Viewed Posts
Modeling: Linear Regression with SAS
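The deck fits its model in SAS and does not show the model specification. As an illustration only, here is ordinary least squares for a single feature written in plain Python. The feature and target values are invented, and the real model presumably used several of the log features at once.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: one log-derived feature and a satisfaction score.
feature = [0.0, 1.0, 2.0, 3.0]
satisfaction = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_simple_linear_regression(feature, satisfaction)
predicted = slope * 4.0 + intercept  # score predicted for an unseen feature value
```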
Batch Process
HADOOP -> FEATURES -> MODEL -> ENGINE
LESS THAN 2 HOURS
Sea Story (바다 이야기)
SEARCH SPAM INDEX
Mission: Understand the impact of spam on search users
Data:
Search Log : Text with Delimiter
Post Filtered Documents : JSON Format
Operation Deleted Documents : XML Format
Task:
Query - Session - Doc. 1 - Doc. 2 - Doc. 3 - Doc. 4
Click? Type? (Ham, Spam, OP Del.)
OUTER JOIN
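The join above combines three differently formatted sources so that each logged document carries a click flag and a type. The sketch below shows one way this could look on a single machine, as a left outer join that keeps every log row and defaults unmatched documents to ham. The delimiter, JSON keys, XML tag names, and the precedence of "OP Del." over "Spam" are all assumptions; the deck does not specify them.

```python
import json
import xml.etree.ElementTree as ET

# All layouts below are hypothetical stand-ins for the three sources.
SEARCH_LOG = "q1|s1|d1|1\nq1|s1|d2|0\nq1|s1|d3|1\n"  # query|session|doc_id|clicked
POST_FILTERED_JSON = '[{"doc_id": "d2"}]'            # documents filtered as spam
OP_DELETED_XML = "<deleted><doc id='d3'/></deleted>"  # documents deleted by operators


def load_spam_ids(text):
    """Read spam document ids from the JSON source."""
    return {item["doc_id"] for item in json.loads(text)}


def load_deleted_ids(text):
    """Read operator-deleted document ids from the XML source."""
    return {node.get("id") for node in ET.fromstring(text).iter("doc")}


def label_log(log_text, spam_ids, deleted_ids):
    """Left outer join: keep every log row; unmatched documents default to 'ham'."""
    rows = []
    for line in log_text.splitlines():
        query, session, doc_id, clicked = line.split("|")
        if doc_id in deleted_ids:
            doc_type = "op_del"
        elif doc_id in spam_ids:
            doc_type = "spam"
        else:
            doc_type = "ham"
        rows.append((query, session, doc_id, clicked == "1", doc_type))
    return rows


labeled = label_log(
    SEARCH_LOG,
    load_spam_ids(POST_FILTERED_JSON),
    load_deleted_ids(OP_DELETED_XML),
)
```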
Result Sample
BLOG CLASSIFICATION
Mission: Cluster bad blogs through unsupervised learning
Data: 30 days of blog documents
Task: Blog - Document’s Feature Analysis with Fixed Interval
Modeling: Kohonen’s SOM (Self-Organizing Map) with R
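The deck trains its map in R and does not show the configuration. To illustrate the idea, here is a minimal one-dimensional self-organizing map in plain Python. The map size, learning-rate schedule, neighborhood radius, and the two-feature sample data are assumptions chosen for brevity, not values from the deck.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x (squared Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda j: sum((wk - xk) ** 2 for wk, xk in zip(weights[j], x)),
    )


def train_som(data, n_units, epochs=50, lr0=0.5, seed=0):
    """Train a 1-D SOM: units sit on a line, neighbors of the winner move too."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = n_units / 2
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)  # neighborhood shrinks over time
        for x in data:
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Gaussian neighborhood: units near the winner move more.
                influence = math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * influence * (x[k] - w[k])
    return weights


# Hypothetical blog feature vectors, scaled to [0, 1].
blogs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
som_weights = train_som(blogs, n_units=4)
assignments = [best_matching_unit(som_weights, b) for b in blogs]
```

After training, `assignments` gives the map unit each blog falls on; blogs sharing a unit, or sitting on neighboring units, can be inspected together as a cluster.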
WHAT ELSE?
Topic Analysis with PLSA
Query Chain Filtering
Reprocessing with Hadoop
In Conclusion,
ADVANTAGE OF HADOOP
ADVANTAGE
Low analysis cost!
No more sampling!
Low operation cost!
Programming language independent
Various support tools
DISADVANTAGE
A conceptual change is needed.
The project is under active development.
Version upgrades are not supported.
THANK YOU!