Identification of the authors of short messages portals on the Internet using the methods of mathematical linguistics. Postgraduate: Sukhoparov M.E. Supervisor: doctor of engineering science, Lebedev I.S. "St. Petersburg National Research University of Information Technologies, Mechanics and Optics" Department of "Secure Information Technology" Specialty 05.13.19 "Methods and systems of information protection, information security"

Dec 31, 2015

Dennis Carson
Transcript
Page 1: Identification of the authors of short messages portals on the Internet using the methods of mathematical linguistics. Postgraduate:Sukhoparov M.E. Supervisor:doctor.

Identification of the authors of short messages portals on the Internet using the methods of mathematical linguistics.

Postgraduate: Sukhoparov M.E.

Supervisor: doctor of engineering science, Lebedev I.S.

"St. Petersburg National Research University of Information Technologies, Mechanics and Optics"

Department of "Secure Information Technology"

Specialty 05.13.19 "Methods and systems of information protection, information security"

Page 2

Purpose and objectives

The goal: a study of methods for identifying users.

Objectives:
• study and development of a scientific and methodical system for identifying the authorship of textual information;
• creation of a program prototype based on the proposed approach;
• assessment of the performance and efficiency of the developed prototype implementation.

Page 3

Prospective directions of research

• The use of a naive Bayes classifier
• Analysis based on N-grams
• Analysis based on latent Dirichlet allocation
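
As a hedged illustration of the N-gram direction, the sketch below builds a frequency profile of character trigrams for one message. The slide does not say whether character or word N-grams are meant; the choice of character N-grams, the helper name `char_ngrams`, and the sample text are assumptions made only for this example.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a message (hypothetical helper)."""
    # Lowercase and collapse whitespace so formatting differences do not matter.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


# Hypothetical sample message, not taken from the study.
profile = char_ngrams("Hello, hello!", n=3)
```

Profiles of this kind could be compared between an unknown message and the known messages of each candidate author.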

Page 4

Architecture of the proposed software

[Diagram: entities Posts, Words, Words in Posts, Users, Topic, Vocabulary and Filters, linked with multiplicities 1 and *.]

Page 5

Naive Bayes classifier

Bayes' theorem:

$$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}$$

$P(c \mid d)$ - probability that document $d$ belongs to class $c$;

$P(d \mid c)$ - probability of finding document $d$ among all documents of class $c$;

$P(c)$ - unconditional probability of finding a document of class $c$ in the corpus of documents;

$P(d)$ - unconditional probability of document $d$ in the corpus of documents.
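
A small worked example may make the theorem concrete. Bayes' theorem in its usual form is P(c|d) = P(d|c) P(c) / P(d). The sketch below applies it to two candidate authors; all probabilities and the function name `posterior` are hypothetical and chosen only for illustration.

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' theorem: P(c|d) = P(d|c) * P(c) / P(d)."""
    if evidence <= 0:
        raise ValueError("evidence P(d) must be positive")
    return likelihood * prior / evidence


# Hypothetical values: two equally likely authors, A and B.
p_d_given_a, p_a = 0.25, 0.5
p_d_given_b, p_b = 0.125, 0.5

# P(d) by the law of total probability over both classes.
p_d = p_d_given_a * p_a + p_d_given_b * p_b

post_a = posterior(p_d_given_a, p_a, p_d)
post_b = posterior(p_d_given_b, p_b, p_d)
```

With these numbers author A is twice as probable as author B, and the two posteriors sum to one.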

Page 6

Naive Bayes classifier

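Maximum a posteriori estimation:

$$c_{map} = \arg\max_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)} = \arg\max_{c \in C} P(d \mid c)\, P(c) \approx \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

$P(d)$ is the same for every class and can be dropped; under the naive independence assumption, $P(d \mid c)$ is approximated by the product of the probabilities of the words $w_1, \dots, w_n$ of document $d$.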
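
The maximum a posteriori rule for a naive Bayes classifier picks the class that maximizes P(c) multiplied by the product of the word probabilities P(w_i|c). A minimal sketch follows; the function name `map_class`, the author labels, and every probability are hypothetical values, not data from the study.

```python
from math import prod


def map_class(words, priors, word_probs):
    """Return the class maximizing P(c) * prod_i P(w_i | c), plus all scores."""
    scores = {
        c: priors[c] * prod(word_probs[c][w] for w in words)
        for c in priors
    }
    return max(scores, key=scores.get), scores


# Hypothetical probabilities for two authors.
priors = {"author_a": 0.5, "author_b": 0.5}
word_probs = {
    "author_a": {"hello": 0.5, "world": 0.25},
    "author_b": {"hello": 0.25, "world": 0.25},
}
best, scores = map_class(["hello", "world"], priors, word_probs)
```

This version multiplies raw probabilities, so it is only suitable for very short documents.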

Page 7

Naive Bayes classifier

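The problem of arithmetic underflow: the product of many small probabilities is replaced by a sum of logarithms.

$$c_{map} = \arg\max_{c \in C} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]$$

Estimation of parameters of the Bayes model:
• $P(c) = \frac{D_c}{D}$, where $D_c$ - number of documents belonging to class $c$, $D$ - total number of documents in the training set;
• $P(w_i \mid c) = \frac{W_{ic}}{\sum_{i' \in V} W_{i'c}}$, where $W_{ic}$ - number of times the i-th word appears in the documents of class $c$, $V$ - dictionary of the set of documents (a list of all unique words).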
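
The estimates on this slide amount to counting: the class prior is the share of training documents in that class, and a word probability is the share of that word among all words of the class. The sketch below computes both and also shows why logarithms are used. Whitespace tokenization, the name `estimate_parameters`, and the toy training set are assumptions of this example.

```python
import math
from collections import Counter


def estimate_parameters(training):
    """Estimate class priors and per-class word probabilities by counting.

    `training` is a list of (author, text) pairs (hypothetical input format).
    """
    doc_counts = Counter()   # documents per class
    word_counts = {}         # word occurrences per class
    for author, text in training:
        doc_counts[author] += 1
        word_counts.setdefault(author, Counter()).update(text.lower().split())

    total_docs = sum(doc_counts.values())
    priors = {c: d_c / total_docs for c, d_c in doc_counts.items()}
    likelihoods = {}
    for c, counts in word_counts.items():
        class_total = sum(counts.values())
        likelihoods[c] = {w: n / class_total for w, n in counts.items()}
    return priors, likelihoods


# Hypothetical toy training set.
training = [("alice", "good day"), ("alice", "good night"), ("bob", "bad day")]
priors, likelihoods = estimate_parameters(training)

# Why logarithms: 100 factors of 1e-5 underflow to 0.0 as a product,
# while the equivalent sum of logarithms stays finite.
probs = [1e-5] * 100
product = math.prod(probs)
log_sum = sum(math.log(p) for p in probs)
```

Note that a word never seen for a class gets no entry at all here, which is the gap that smoothing is meant to close.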

Page 8

Naive Bayes classifier

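The problem of unknown words (additive Laplace smoothing):

$$P(w_i \mid c) = \frac{W_{ic} + 1}{\sum_{i' \in V} (W_{i'c} + 1)} = \frac{W_{ic} + 1}{|V| + \sum_{i' \in V} W_{i'c}}$$

The final form of the formula:

$$c_{map} = \arg\max_{c \in C} \left[ \log \frac{D_c}{D} + \sum_{i \in Q} \log \frac{W_{ic} + 1}{|V| + L_c} \right]$$

where $D_c$ - number of documents of class $c$, $D$ - total number of documents in the training set, $W_{ic}$ - number of times the i-th word appears in the documents of class $c$, $L_c = \sum_{i' \in V} W_{i'c}$ - total number of words in the documents of class $c$, $|V|$ - dictionary size, $Q$ - set of words of the classified document.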
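
Additive (Laplace) smoothing is the usual remedy for unknown words: one is added to every word count, so a word the author never used still receives a small non-zero probability. The sketch below illustrates this with hypothetical counts; the function name `smoothed_prob` and the sample vocabulary are assumptions of this example.

```python
from collections import Counter


def smoothed_prob(word, class_counts, vocab_size):
    """Laplace estimate: (count + 1) / (vocabulary size + words in class)."""
    class_total = sum(class_counts.values())
    # A Counter returns 0 for words that were never seen in this class.
    return (class_counts[word] + 1) / (vocab_size + class_total)


# Hypothetical counts for one author; the vocabulary also contains "bad",
# which this author never used.
counts = Counter({"good": 2, "day": 1, "night": 1})
vocabulary = {"good", "day", "night", "bad"}

p_seen = smoothed_prob("good", counts, len(vocabulary))
p_unseen = smoothed_prob("bad", counts, len(vocabulary))
total = sum(smoothed_prob(w, counts, len(vocabulary)) for w in vocabulary)
```

The smoothed probabilities still sum to one over the vocabulary, so the estimate remains a valid distribution.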

Page 9

Naive Bayes classifier

Statistics used in the classification stage:
• relative frequencies of the classes in the corpus of documents;
• total number of words in the documents of each class;
• relative frequencies of words within each class;
• dictionary size (number of unique words in the training set).

$D_c$ - number of documents belonging to class $c$;
$D$ - total number of documents in the training set;
$V$ - dictionary of the set of documents (a list of all unique words);
$L_c$ - total number of words in the documents of class $c$ in the training set;
$W_{ic}$ - number of times the i-th word appears in the documents of class $c$;
$Q$ - set of words of the classified document (including repeats).
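
The statistics listed above are enough to build a complete classifier. The following is a minimal sketch of a multinomial naive Bayes classifier with Laplace smoothing and log-probabilities, not the authors' implementation: the class name `NaiveBayesAuthorId`, the whitespace tokenizer, and the toy messages are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesAuthorId:
    """Hypothetical sketch: keeps only the counts needed at classification."""

    def __init__(self):
        self.doc_counts = Counter()              # documents per author
        self.word_counts = defaultdict(Counter)  # word occurrences per author
        self.vocab = set()                       # all unique training words

    def train(self, author, text):
        words = text.lower().split()
        self.doc_counts[author] += 1
        self.word_counts[author].update(words)
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()  # words of the message, with repeats
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for author, n_docs in self.doc_counts.items():
            counts = self.word_counts[author]
            class_total = sum(counts.values())
            # log prior + sum of Laplace-smoothed log word probabilities
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / (len(self.vocab) + class_total))
            if score > best_score:
                best, best_score = author, score
        return best


# Hypothetical toy messages.
model = NaiveBayesAuthorId()
model.train("alice", "good morning friends")
model.train("alice", "good evening friends")
model.train("bob", "bad weather again")
model.train("bob", "terrible weather today")

guess_1 = model.classify("good friends")
guess_2 = model.classify("weather today")
```

Storing raw counts rather than probabilities lets new training messages be added without recomputing anything.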

Page 10

Results

[Plot: $P(c \mid d)$ (vertical axis, 0.00 to 1.00) against the amount of training set (horizontal axis).]

Amount of training set    P(c|d)
75                        0.54
100                       0.64
125                       0.72
150                       0.76
175                       0.79
200                       0.81

Page 11

Conclusions

Implementing the proposed solutions will make it possible to identify the authors of short messages on Internet forums and blogs during various PR campaigns, in order to counter and monitor the formation and manipulation of public opinion and other manifestations of astroturfing.