Topic Modeling of Freelance Job Postings to Monitor Web Service Abuse
Post on 03-Feb-2022
Do-kyum Kim, Marti Motoyama,
Geoffrey M. Voelker and Lawrence Saul
UCSD CSE

Web Service Abuse
• Many Web services are free/open access
  • To attract large numbers of users
  • To attract user-generated content
• But openness invites abuse:
  • Exploitation of free resources
    E.g. Sending spam from Web-based email accounts
  • Unsanctioned advertising channels
    E.g. Spamming links on blog comments

Crowdsourcing for Web Service Abuse
• Widely used:
  30% jobs on Freelancer.com [Motoyama et al., 2011]
  40% jobs on Mechanical Turk [Ipeirotis, 2010]
• Example posting on Freelancer.com
  Title: Open 10 blog accounts and write/publish 10 posts
  Desc.: I need someone to open a free email account (i.e yahoo, hotmail, gmail) Then use that email to open 20 free blog accounts (excluding blogger, word press, blog) This project is to open 20 free blog accounts (all different sites) than post a single blog post (20 in total) one each blog account …
  Keywords: Blog

Why Crowdsource Abuse Jobs?
• Cost Effective
  Workers come from low wage regions
• Agile
  Buyers can find technically skilled workers
• Scalable
  Freelancer.com has over one million workers

Example: Account Creation – Form Filling
• Scenario: Abuser wants to send spam via Web email
• Prerequisite: Bulk accounts on Gmail
  “i need gmail captcha entry agent immediately. 1000's new captcha entrys per week”

Example: Account Creation – Varying IP
• Problem: Google detects mass account creation
• Solution: Purchase IP proxy services

Example: Account Creation – Being Verified
• Problem: Google implements phone verification
• Solution: Buy telephone numbers

How to Identify Abuse Jobs?
• Previous work [Motoyama et al., 2011]:
  • Manually inspect 2k postings on Freelancer.com
  • Identify 22 job categories
  • Label 10k+ jobs for training an SVM classifier
• Can we use this approach in operation?
  No – too much manual labor.
• Challenges:
  • How to scale?
  • How to discover new job categories?

Our Approach: Topic Modeling
• Unsupervised vs. supervised
  Job categories are discovered automatically from raw posts
  No need for manual labeling
• Large-scale vs. small-scale
  Data-driven from 7 years of posts on Freelancer.com
  Categories identified from 355K (versus 2K) posts
• Principled vs. heuristic
  Topics are collections of co-occurring words
  Postings and users have distributions over topics

Background: Freelancer.com
• Freelancer.com
  • One of the largest and oldest freelancing sites
  • Over 2 million users from 200+ countries
  • Queryable by API
• How it works:
  1. Buyers/employers post jobs
  2. Workers bid on jobs
  3. Buyers select workers

Background: Data set
• Job/user data from 2004 to 2011:
  • 840 k job descriptions
  • 815 k user profiles
  • 12 million bids
[Figure: buyers (Buyer 1, Buyer 2, Buyer B) post projects such as "Open 10 blog account", "Get 1k likes on Facebook" and "Write articles on car"; workers (Worker 1, Worker 2, Worker S) bid on the posted projects.]

Topic Modeling
• Automatic, data-driven approach for analyzing large corpora of text
  1. Discovers hidden topics in corpus
  2. Represents each document as collection of topics
  3. Models each topic as a distribution over words
Example: articles from sports magazine, passed through a topic model
Topic | Top Frequent Words | Label
1 | nfl, quarterback, touchdown, … | Football
2 | driver, club, birdie, … | Golf
3 | mlb, hit, sox, … | Baseball
…

Latent Dirichlet Allocation (LDA)
• First properly Bayesian model for topic modeling
• Assumptions
  • Model each document as a mixture of topics
  • Model each topic as a distribution over words
  • Model each word as drawn from a particular topic

Main Parameters of LDA
• # Topics: K
• # Words in vocabulary: V
• Word distributions as topics: K × V matrix
Topic: Football
Word | Probability
nfl | 0.05
quarterback | 0.02
touchdown | 0.018
… | …
Topic: Golf
Word | Probability
driver | 0.021
club | 0.017
birdie | 0.014
… | …

Generative Process for LDA
• For each document in the corpus:
  1. Pick the topic proportions from a Dirichlet distribution.
  2. For each word in the document
     a) Pick a topic from the proportions in (1).
     b) Pick a word based on the topic in (2a).
• In our problem:
  Document = Job posting
  Topic = Job category

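The generative process above can be sketched in a few lines of Python. This is only an illustration of the sampling steps, not the authors' code: the toy topics, the Dirichlet parameter `alpha`, and the function names are all assumptions made for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, num_words, rng):
    """Generate one document following steps 1, 2a and 2b.

    topics: list of K dicts mapping word -> probability (one dict per topic)
    alpha:  Dirichlet parameter, one positive value per topic (assumed here)
    """
    # Step 1: pick the topic proportions for this document.
    proportions = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(num_words):
        # Step 2a: pick a topic from the proportions.
        k = rng.choices(range(len(topics)), weights=proportions)[0]
        # Step 2b: pick a word from that topic's word distribution.
        vocab = list(topics[k])
        weights = [topics[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return proportions, words

# Toy topics; words and probabilities are illustrative, not fitted values.
toy_topics = [
    {"blog": 0.5, "post": 0.3, "forum": 0.2},
    {"account": 0.5, "gmail": 0.3, "hotmail": 0.2},
]
rng = random.Random(0)
proportions, words = generate_document(toy_topics, [0.5, 0.5], 8, rng)
print(proportions, words)
```

Fitting LDA runs this process in reverse: the words are observed, and the topic proportions and word distributions are what has to be inferred.
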
Fitting the Model Parameters
• Given observed words, discover the topics that best explain the documents in the corpus
• But some calculations are intractable
• We use variational methods for approximation
• For details, see Blei et al., 2003.

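For reference, the intractable quantity can be written out in the notation of Blei et al. (2003). The symbols are assumptions relative to the slides, which do not show them: \alpha is the Dirichlet parameter, \beta the topic-word matrix, \theta the topic proportions of one document, z_n the topic of the n-th word w_n, and N the document length.

```latex
% Joint distribution of one document's latent and observed variables
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal likelihood of the observed words: integrate out \theta, sum out each z_n
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Posterior over the latent variables, whose denominator is the integral above
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)
  = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}
```

The integral has no closed form because \theta and \beta are coupled inside the product over words, which is why an approximation such as variational inference is needed.
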
LDA Workflow
• Input: Unlabeled docs
  E.g. "Open 10 blog accounts and write/publish 10 posts …"
• Params: # topics: K
• Output:
  • K Topics, e.g.
    blog 0.03, post 0.02, forum 0.01, …
    account 0.02, gmail 0.01, hotmail 0.01, …
    write 0.03, word 0.01, English 0.01, …
  • Docs annotated w/ topics
    E.g. "Open 10 blog accounts and write/publish 10 posts …" with topic proportions 0.4, 0.3, 0.3

Preprocessing
• Constructs document from project: title, description, keywords
• Lower-cases, splits at punctuation, removes stopwords, applies stemming
• Filters infrequent terms
• Filters buyers and workers in fewer than 20 projects

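A minimal sketch of the text-cleaning steps listed above, using only the Python standard library. The stopword list, the crude suffix-stripping function (a stand-in for a real stemmer), the field names, and the `min_count` threshold are all assumptions for illustration; the slides do not specify them. The filtering of buyers and workers by project count is not shown.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "i", "is", "in", "on"}

def crude_stem(token):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tokenize(project):
    """Build one document from title, description and keywords, then clean it."""
    text = " ".join([project["title"], project["description"], project["keywords"]])
    # Lower-case, then split on anything that is not a letter or digit.
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t and t not in STOPWORDS]

def filter_infrequent(docs, min_count):
    """Drop terms that occur fewer than min_count times across the corpus."""
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]

projects = [
    {"title": "Open 10 blog accounts",
     "description": "open free blog accounts and write posts",
     "keywords": "Blog"},
    {"title": "Write articles",
     "description": "writing blog posts on cars",
     "keywords": "Articles"},
]
docs = filter_infrequent([tokenize(p) for p in projects], min_count=2)
print(docs)
```

The stemmed forms in the topic tables of this talk (for example "articl" or "copyscap") are the kind of output a stemming step produces.
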
Term-document Matrix for LDA
• Wij: # word i in document j
• Terms (rows): 27k+
• Documents (columns): 355k+

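As a small illustration of the matrix described above, the sketch below counts how often each term occurs in each document. The function name and the toy documents are assumptions; at the scale quoted on the slide (27k+ terms by 355k+ documents) a dense list of lists like this would be impractical, and a sparse representation would be the likely choice.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, W) where W[i][j] counts occurrences of term i in document j."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {term: i for i, term in enumerate(vocab)}
    W = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            W[index[term]][j] = count
    return vocab, W

# Toy tokenized documents, for illustration only.
toy_docs = [["blog", "post", "blog"], ["account", "gmail", "post"]]
vocab, W = term_document_matrix(toy_docs)
for term, row in zip(vocab, W):
    print(term, row)
```
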
How many topics?
• LDA discovers a prespecified # of topics.
• How to choose this number?
  1. Train models with varying # of topics
  2. Measure likelihood of held-out data

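The selection procedure above can be sketched as follows. This is a simplified illustration, not the authors' evaluation code: it assumes each candidate model is already trained and that topic proportions for the held-out documents have already been inferred, and it floors unseen-word probabilities at a small constant to avoid taking the log of zero. All names and toy numbers are hypothetical.

```python
import math

def heldout_log_likelihood(docs, thetas, beta):
    """Per-word log-likelihood of held-out documents.

    docs:   list of token lists
    thetas: per-document topic proportions (assumed already inferred)
    beta:   list of K dicts mapping word -> probability under that topic
    """
    total, n_words = 0.0, 0
    for doc, theta in zip(docs, thetas):
        for word in doc:
            # p(word | doc) = sum over topics of theta[k] * beta[k][word]
            p = sum(t * topic.get(word, 1e-12) for t, topic in zip(theta, beta))
            total += math.log(p)
            n_words += 1
    return total / n_words

def choose_num_topics(scores):
    """Return the number of topics with the highest held-out score."""
    return max(scores, key=scores.get)

heldout = [["blog", "post"], ["gmail", "account"]]
# Two toy candidate models, keyed by their number of topics.
candidates = {
    1: ([[1.0], [1.0]],
        [{"blog": 0.25, "post": 0.25, "gmail": 0.25, "account": 0.25}]),
    2: ([[0.9, 0.1], [0.1, 0.9]],
        [{"blog": 0.5, "post": 0.5}, {"gmail": 0.5, "account": 0.5}]),
}
scores = {k: heldout_log_likelihood(heldout, thetas, beta)
          for k, (thetas, beta) in candidates.items()}
print(scores, choose_num_topics(scores))
```

Scoring on held-out rather than training documents matters because training likelihood tends to keep improving as topics are added, whereas held-out likelihood penalizes models that overfit.
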
Categories of Abuse
Label | Top Frequent Words | Ratio
SEO Content Generation1 | articl writer Articles write copyscap ‘Article Rewriting’ | 5%
SEO Content Generation2 | articl keyword rewrit copyscap word rewritten paragraph | 3%
SEO Whitehat | link pr site page anchor websit nofollow farm googl | 3%
CAPTCHA Solving | ‘Data Entry’ data entri team captcha Excel fast worker hr | 3%
Click/CPA/Leads/Signups1 | market sale traffic promot affili lead Marketing commiss | 2%
Ad Posts/Accounts | ad account craigslist post pva poster gmail cl ip proxi | 2%
SEO Unknown | seo keyword googl search rank engin SEO optim adword | 2%
Bulk Emailing | email list address excel mail newslett e-mail spreadsheet | 2%
OSN Linking | fan facebook member profil friend Facebook myspac | 2%
SEO Greyhat1 | blog post forum comment ‘Link Building’ thread phpbb | 2%
SEO Greyhat2 | submiss directori review social bookmark submit copi | 2%
Click/CPA/Leads/Signups2 | sign signup citi countri uk up usa canada travel adult | 1%
Word highlighting legend: Companies or countries; Worker methodologies

Annotated Job Postings
Topics: SEO Greyhat (0.276), Ad Posting / Account Creation (0.211), SEO Content Generation (0.169), Bulk Emailing (0.082)
False positive: reveals limitations of LDA
Title: Open 10 blog accounts and write/publish 10 posts
Desc.: I need someone to open a free email account (i.e yahoo, hotmail, gmail) Then use that email to open 20 free blog accounts (excluding blogger, word press, blog) This project is to open 20 free blog accounts (all different sites) than post a single blog post (20 in total) one each blog account. Each free blog account to be on a separate free blog service. …
Keyword: Blog

Word Trends Reveal Target Trends
Top 10 terms in “OSN Linking” topic, by year
Rank | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011
1 | member | myspac | profil | myspac | facebook | fan | fan
2 | group | friend | myspac | profil | friend | facebook | facebook
3 | friend | profil | friend | friend | twitter | account | page
4 | profil | member | member | member | profil | page | account
5 | event | account | bot | account | myspac | Facebook | Facebook
6 | myspac | peopl | group | facebook | account | friend | real
7 | invit | group | account | bot | follow | follow | follow
8 | bot | invit | facebook | event | group | real | twitter
9 | meet | bot | invit | group | fan | twitter | usa
10 | account | paid | event | invit | member | Social Networking | like

Manual vs. Automatic Discovery of Topics
How much of postings in each category are assigned to topics?
[Figure: manual categories versus automatically discovered topics]

Manual vs. Automatic Tracking of Trend
[Figure: two panels plotting normalized volume of projects (0 to 1) against time (Jan05 to Jan11).
 Panel 1: Class "SEO Content Generation" vs. Topic "SEO Content Generation 1, 2".
 Panel 2: Class "Verified Accounts, Account Registration and Ad Posting" vs. Topic "AdPosts/Accounts".]
Very close agreement

Correlation of Worker Topic Proportion
[Figure: Pearson correlation matrix]
Indicates mergeable topics

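The idea behind this slide can be sketched as follows: treat each topic as a column of per-worker proportions, compute the Pearson correlation between every pair of columns, and flag highly correlated pairs as candidates for merging. The toy proportions, the function names, and the 0.7 threshold are assumptions for illustration; the slides do not state a threshold.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def topic_correlation_matrix(worker_topics):
    """worker_topics[w][k] is worker w's proportion for topic k."""
    num_topics = len(worker_topics[0])
    columns = [[row[k] for row in worker_topics] for k in range(num_topics)]
    return [[pearson(columns[a], columns[b]) for b in range(num_topics)]
            for a in range(num_topics)]

def mergeable_pairs(corr, threshold):
    """List topic pairs (a, b), a < b, whose correlation reaches the threshold."""
    size = len(corr)
    return [(a, b) for a in range(size) for b in range(a + 1, size)
            if corr[a][b] >= threshold]

# Toy proportions for four workers over three topics, for illustration only.
worker_topics = [
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
    [0.4, 0.5, 0.1],
    [0.2, 0.1, 0.7],
]
corr = topic_correlation_matrix(worker_topics)
print(mergeable_pairs(corr, threshold=0.7))
```

In this toy data the first two topics rise and fall together across workers, so they are the only pair flagged; whether such a pair should really be merged remains a judgment call.
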
Summary
• Explored LDA to identify and monitor abuse job postings
• What LDA automates:
  • Discovery of topics as frequently co-occurring words
  • Labeling of individual postings by topics
  • Word-by-word annotations
• What remains:
  • Interpretation of topics
  • Merging/splitting of discovered categories

Future Directions
• Other applications of LDA to unstructured text
  • IRC channels
  • Underground Internet forums
• More sophisticated topic models
  • Author-reader topic models: incorporating buyers and workers
  • Dynamic topic models: tracking how topics evolve over time
  • Online LDA: continuous modeling of streaming projects