Machine learning techniques applied to Twitter spammers detection

Tags: Electrical and Electronic Engineering, features, feature space, Twitter, ML, Machine Learning, Social Network, Learning Machine, Spam Detection, PAOLO GASTALDO, Support Vector Machine, spammers, experimental session, Experimental performance, Experimental results, experimental result, input features, Twitter Company, kernel methods, Random Forests
Content: Recent Advances in Electrical and Electronic Engineering
Machine Learning Techniques Applied to Twitter Spammers Detection
CLAUDIA MEDA, FEDERICA BISIO, PAOLO GASTALDO, RODOLFO ZUNINO
DITEN, University of Genoa
Via all'Opera Pia 11a, 16145, Genova, ITALY
{federica.bisio, claudia.meda}@edu.unige.it, {paolo.gastaldo, rodolfo.zunino}@unige.it
http://www.sealab.diten.unige.it
Abstract: - Every minute more than 320 new accounts are created on Twitter and more than 98,000 tweets are posted. Among the multitude of Twitter users, spammers and cybercriminals aim to pervade and strike legitimate users' accounts with a large amount of troublesome messages. Hence, the propagation of the social network opens new modalities for cyber-crime perpetration, while the spamming phenomenon exploits specific mechanisms of the messaging process. This research shows that Machine Learning (ML) may provide a powerful tool to support spammer detection on Twitter. The present paper compares the performance of three different ML algorithms in tackling this task. The experimental session involves a publicly available dataset.
Key-Words: - Social Network Security, Social Network Analysis, Twitter, Spam Detection, Machine Learning
1 Introduction
The total number of registered Twitter users exceeds 650 million, of which 255 million are monthly active users who post 500 million tweets every day. Thus, Twitter is more than a social network: it is an ecosystem where people can broadcast their activity, sharing their lives and opinions. However, spammers and cybercriminals aim to pervade and strike legitimate users' accounts with a large amount of troublesome messages. For example, the contents of tweets can be forged to redirect users to malicious websites or to advertising spam campaigns.

This paper addresses the development of methodologies that are able to analyse users' behaviour and the contents of tweets; the ultimate goal is the automated detection of Twitter spammers. The literature has already proved that Machine Learning (ML) can provide effective tools to tackle the spam recognition task [1, 2, 3]. In practice, ML techniques can support predictive systems that make decisions on unknown input samples. When dealing with the complex and non-linear mechanisms that characterize the phenomenon to be modeled, an explicit formalization of the input-output relationship is difficult to attain. This is the reason why ML technologies model the input-output function by a "learning from examples" approach.

The aim of the present research is to evaluate the performance of different ML paradigms in the specific task of Twitter spam detection. Three paradigms are analysed: Support Vector Machine (SVM) [4], which is an instance of kernel machines; Extreme Learning Machine (ELM) [5], which is an instance of feed-forward neural networks; Random Forests (RFs) [6], which is an instance of classification trees. In the proposed framework, the ML-based predictor receives as input a feature vector, which characterizes the incoming tweet according to a feature space. The predictor is then expected to yield as output the label to be assigned to the tweet (spammer / non-spammer).

The rest of the paper is organized as follows: Section 2 presents the Twitter environment and related works, Section 3 explains the general model for each proposed Twitter spam detector, Section 4 lists the experiments and evaluates the results, and Section 5 offers conclusions and ideas for future work.

2 Background and related works
A Twitter user communicates through messages of at most 140 characters, called tweets. Different types of actions can be performed. The "@" character allows the user to mention other people or to reply to other users. The "#" character allows the user to add special words, called hashtags.
ISBN: 978-960-474-399-5
The retweet option allows the user to re-post a tweet. Moreover, there are two types of friends on Twitter: followers and following. Let Bob be a user; then, a follower of Bob is any user that agrees to receive Bob's tweets. A following of Bob is a user that is followed by Bob. In principle, Twitter is based on a "network of trust," in which user profiles are public and the right to view a user's page is granted even to those who do not follow the user.

Since the number of spammers on this social network is quickly increasing, the Twitter Company has established seven behaviors [7] that define a spam user's account:
- Posting harmful links (including links to phishing or malware sites)
- Aggressive following behavior (mass following and mass un-following for attention)
- Abusing the "@" reply or "@" mention function to post unwanted messages to users
- Creating multiple accounts (either manually or using automated tools)
- Posting repeatedly about trending topics to try to grab attention
- Repeatedly posting duplicate updates
- Posting links with unrelated tweets

In recent years, a few studies have applied machine learning techniques to the analysis of Twitter data. In [2], an SVM classifier and a χ² feature selection algorithm have been exploited to detect spammers. In [8, 9], random forests have been used to detect spammers in real time. In [9], an unsupervised method for the automatic identification of spammers in a social network has been proposed; the framework exploited an integrated classification likelihood Bayesian information criterion and the expectation-maximization algorithm. The detection of malicious tweets in trending topics using a statistical analysis of language is addressed in [10]; four different classification algorithms are compared: Decision Tree, Naïve Bayes, logistic regression and support vector machines.
3 An ML-based approach to Twitter spam detection
An ML-based approach to Twitter spam detection requires one to set up a framework in which tweets are represented according to a feature space. Accordingly, each tweet is eventually represented as a pattern $x \in \mathbb{R}^Z$, where Z is the dimensionality of the feature space. An unknown function y = f(x) models the relationship between the input space and the category labels, i.e., {spammer, non-spammer}. The empirical learning of the function f(x) stems from a training procedure that uses a dataset, D, holding N patterns (samples); each pattern includes a data vector, $x \in \mathbb{R}^Z$, and its category label y. After training, the system processes data that do not belong to the training set and ascribes each test sample to a predicted category, $\hat{y}$ (Fig. 1). In the following, Section 3.1 will provide details about the adopted feature space. Sections 3.2, 3.3, and 3.4 will present the three ML algorithms that have been exploited to implement the predictor.

Fig. 1 - ML-based system

3.1 Feature Space
In general, a machine-learning predictor should be fed with patterns represented as feature vectors. This in turn requires one to pre-process digital text documents and to organize the information according to a given structure that can be directly interpreted by a machine learning system. Setting up an effective feature space represents a major challenge in the case of tweets, as the original text document cannot include more than 140 characters and the syntax rules are very peculiar. The present research adopted the feature space defined in [2], which includes 62 features. The attributes traced by the feature space are the following:
- Number of followers
- Number of following
- Number of replies
- Number of mentions
- Number of URLs
- Number of hashtags
- Number of spam words
- Number of characters in a tweet
- Number of words
- Number of numeric characters
- Number of tweets in a user account
- Number of tweets per day or week
- Time between posts
- Age of the user account

A standard PCA algorithm was used to reduce the dimensionality of the feature space at the predictor input. In particular, three dimensionalities were targeted: 20 features, 10 features, and 5 features.
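As a minimal sketch of this reduction step, the following Python code projects a feature matrix onto its first k principal components for k = 20, 10 and 5. The random matrix X, its size of 200 samples, and the function name pca_reduce are placeholders chosen for illustration; the paper does not specify the PCA implementation or any preprocessing (such as feature scaling) that was actually used.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the first k principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
# Placeholder data (assumption): 200 samples described by 62 features.
X = rng.normal(size=(200, 62))

# The three target dimensionalities mentioned in the text.
reduced = {k: pca_reduce(X, k) for k in (20, 10, 5)}
for k, Xk in reduced.items():
    print(k, Xk.shape)
```

Because the principal directions are ordered by explained variance, the 5-feature representation is simply the first five columns of the 20-feature one.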
3.2 Support Vector Machines
SVMs [4] belong to the family of kernel methods, i.e. those methods that exploit positive definite kernels to project data into a high-dimensional Hilbert space. These methods take advantage of the so-called "kernel trick": a kernel function $K(x_i, x_j)$ makes it possible to handle only inner products between pattern pairs $x_i$ and $x_j$, disregarding the specific mappings of individual patterns. In this way, complex and non-linear relations between data can be simplified if the kernel provides a suitable mapping.
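The kernel trick can be illustrated with a small numerical check. The sketch below uses a homogeneous polynomial kernel of degree 2 on 2-dimensional inputs, chosen only because its feature map is finite-dimensional and can be written explicitly; it is not claimed to be the kernel used in the paper's experiments, and the function names and input values are hypothetical.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def poly2_map(x):
    """Explicit feature map for 2-D inputs with phi(x) . phi(z) = (x . z)^2."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

# Hypothetical input patterns.
x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Inner product computed implicitly through the kernel ...
k_implicit = poly2_kernel(x, z)
# ... and explicitly in the mapped (3-dimensional) space.
k_explicit = float(np.dot(poly2_map(x), poly2_map(z)))

print(k_implicit, k_explicit)
```

Both quantities coincide, so an algorithm that only needs inner products never has to compute the mapping itself.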
The SVM training process requires one to solve the following optimization problem:

$$\min_{f} \; C \sum_{i=1}^{N} \left(1 - y_i f_i\right)_{+} + \frac{1}{2}\,\|f\|_{H}^{2} \qquad (1)$$

where N is the number of training patterns, $f_i = f(x_i)$ is the output of f for the i-th training pattern, $(\cdot)_{+} = \max(0, \cdot)$ indicates the hinge loss function [4], H indicates the Hilbert space, and C is a hyperparameter that regulates the trade-off between accuracy and complexity in the training process. The problem can be efficiently solved by using quadratic programming techniques.
The final SVM decision function $f_{SVM}(x)$ is expressed as a weighted sum of non-linear kernel basis functions:

$$f_{SVM}(x) = \sum_{i=1}^{N_{sv}} \alpha_i \, y_i \, K(x_i, x) + b \qquad (2)$$

where $N_{sv}$ is the number of support vectors, b is the bias term, and $\alpha_i$ are the coefficients computed by the training algorithm.
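The decision function (2) can be evaluated directly once the support vectors, their labels, the coefficients and the bias are known. The sketch below assumes a Gaussian kernel, which the Sigma parameter reported in Table 2 suggests but the text does not state explicitly; the parametrization exp(-||xi - x||^2 / (2 sigma^2)), the function names and all numerical values are hypothetical, and the coefficients are not the output of an actual training run.

```python
import numpy as np

def gaussian_kernel(xi, x, sigma):
    """Gaussian kernel with width sigma (assumed parametrization)."""
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, labels, alphas, b, sigma):
    """Evaluate f_SVM(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    total = 0.0
    for sv, y, a in zip(support_vectors, labels, alphas):
        total += a * y * gaussian_kernel(sv, x, sigma)
    return total + b

# Hypothetical model: one support vector per class.
support_vectors = np.array([[0.0, 0.0], [2.0, 2.0]])
labels = np.array([+1.0, -1.0])
alphas = np.array([1.0, 1.0])
b = 0.0
sigma = 1.0

score_near_pos = svm_decision(np.array([0.1, 0.0]), support_vectors, labels, alphas, b, sigma)
score_near_neg = svm_decision(np.array([2.0, 1.9]), support_vectors, labels, alphas, b, sigma)
print(score_near_pos, score_near_neg)
```

The predicted label is the sign of the score, so a test point close to the positive support vector receives a positive score and a point close to the negative one a negative score.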
3.3 Extreme Learning Machines
The ELM model [5] implements a single-hidden-layer feed-forward neural network (SLFN) with $N_h$ mapping hidden neurons. The neuron's response to the input vector, x, is implemented by any nonlinear piecewise continuous function $a(x, \zeta)$, where $\zeta$ denotes the set of parameters of the function. The overall output function is then expressed as

$$f(x) = \sum_{j=1}^{N_h} w_j \, h_j(x) \qquad (3)$$

where $w_j$ denotes the weight that connects the j-th neuron with the output, and $h_j(x) = a(x, \zeta_j)$. In ELM the parameters $\zeta_j$ are set randomly; thus the training process reduces to the adjustment of the output layer. As a major result, training ELMs is equivalent to solving a regularized least squares problem in a linear space. The vector of weights w is then obtained as:

$$w = (H^{t} H + \lambda I)^{-1} H^{t} y \qquad (4)$$

Here, H is an $N \times N_h$ matrix with $h_{ij} = h_j(x_i)$, $\lambda$ is the regularization parameter, I is the identity matrix, and y is the vector of training labels.
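The two steps of ELM training (a random hidden layer followed by the closed-form solution for the output weights) can be sketched as follows. The tanh activation, the Gaussian distribution of the random parameters, the toy dataset and the function names elm_train and elm_predict are assumptions made for illustration; the paper only requires a nonlinear piecewise continuous activation and does not describe these details.

```python
import numpy as np

def elm_train(X, y, n_hidden, lam, rng):
    """Train an ELM: random hidden layer, regularized least squares output."""
    A = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
    c = rng.normal(size=n_hidden)                # random biases, never trained
    H = np.tanh(X @ A + c)                       # N x N_h hidden-layer output matrix
    # w = (H^t H + lam I)^(-1) H^t y, solved without forming the inverse.
    w = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)
    return A, c, w

def elm_predict(X, A, c, w):
    """Predict labels in {-1, +1} from the sign of the network output."""
    return np.sign(np.tanh(X @ A + c) @ w)

rng = np.random.default_rng(0)
# Toy two-class problem (assumption): the label is the sign of the first feature.
X = rng.normal(size=(300, 5))
y = np.where(X[:, 0] > 0, 1.0, -1.0)

A, c, w = elm_train(X, y, n_hidden=50, lam=0.1, rng=rng)
train_acc = float(np.mean(elm_predict(X, A, c, w) == y))
print(w.shape, train_acc)
```

Only the output weights w are fitted, which is why training reduces to a single linear solve.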
3.4 Random Forests
The Random Forest algorithm [6] is a learning method that operates by constructing a fixed number of decision trees at training time. Random forests are a particular implementation of the bagging technique: samples are repeatedly drawn from the training set and a model is fitted to each set of drawn samples; the classification is then made by a majority voting scheme among all the models. In Random Forests, each model is a random tree. Besides, at each node involved in the construction of a tree, only a small subset of randomly selected input features is considered. Let N be the number of training samples, and Z the dimensionality of the feature space. Then, the number of input variables to be randomly selected, d, should satisfy the following constraint: d << Z. If T is the number of trees to be embedded in the RF, the training procedure can be outlined as follows:
for t = 1 to T
  I. sample n cases at random with replacement from the training set to create a subset of the data
  II. at each node of the tree
     a. d features are selected at random from the whole feature space
     b. the feature that provides the best split, according to some objective function, is used to do a binary split on that node

Two common choices for the setup of d are the following: $d = \sqrt{Z}$; $d = \log_2(Z + 1)$.

4 Experimental results

4.1 Experimental setup
The experimental session involved the publicly available database of tweets adopted in [2]. The dataset included 1065 users: 355 spammers and 710 legitimate users. A 10-fold cross-validation approach has been exploited to robustly estimate the generalization performance of the ML-based predictor. As a result, the experimental session involved 10 different runs: in each run, 9 folds composed the training set, while the remaining fold was used as the test set. As anticipated above, three different settings were adopted for the feature space: 20 features, 10 features, and 5 features.

4.2 Experimental performance
The experimental results are reported separately for each of the three machine learning algorithms, using 20, 10 and 5 features, together with the respective parameters. Tables 1, 2 and 3 show the experimental results of RF, SVM and ELM, respectively. Each table shows the confusion matrix obtained as the result of the experiments. The diagonal indicates the percentage of correctly classified spammers (S) and legitimate users (NS). The other three columns report the performance obtained by the ML algorithms in terms of precision (P), recall (R) and F-measure (F), where:

$$P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}, \qquad F = \frac{2PR}{P + R}$$

tp (true positives) is the number of correctly classified instances of a class; fp (false positives) is the number of misclassified instances of the same class; fn (false negatives) is the number of misclassified instances of the other class.
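As a worked check of these formulas, the sketch below recomputes P, R and F from a confusion matrix expressed in row percentages. It follows the definitions given in the text, reading fp as the instances of the class that are assigned to the other label and fn as the instances of the other class that are assigned to the class; note that many references define fp and fn the other way round. The input values are those reported in Table 1, panel B, with the test-set sizes listed in its parameters panel; the function name prf_from_rows is hypothetical.

```python
def prf_from_rows(row_pct, n_true, cls):
    """
    row_pct[t][p]: percentage of test samples of true class t predicted as p.
    n_true[t]: number of test samples whose true class is t.
    Returns (P, R, F) for class cls, using the definitions given in the text.
    """
    other = "NS" if cls == "S" else "S"
    tp = row_pct[cls][cls] / 100.0 * n_true[cls]
    fp = row_pct[cls][other] / 100.0 * n_true[cls]    # misclassified instances of the same class
    fn = row_pct[other][cls] / 100.0 * n_true[other]  # misclassified instances of the other class
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f

# Table 1, panel B (Random Forests, 20 features).
row_pct = {"S": {"S": 76.9, "NS": 23.1}, "NS": {"S": 2.90, "NS": 97.1}}
n_true = {"S": 121, "NS": 241}

p_s, r_s, f_s = prf_from_rows(row_pct, n_true, "S")
p_ns, r_ns, f_ns = prf_from_rows(row_pct, n_true, "NS")
print(round(p_s, 2), round(r_s, 2), round(f_s, 2))
print(round(p_ns, 2), round(r_ns, 2), round(f_ns, 2))
```

With these inputs the sketch gives approximately 0.77, 0.93, 0.84 for spammers and 0.97, 0.89, 0.93 for legitimate users, matching the P, R and F columns of Table 1, panel B.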
TABLE 1
EXPERIMENTAL RESULTS BY USING RANDOM FORESTS

A - PARAMETERS
T     NS Test   S Test
250   241       121

B - 20 FEATURES
            PREDICTED
TRUE    S       NS      P      R      F
S       76.9    23.1    0.77   0.93   0.84
NS      2.90    97.1    0.97   0.89   0.93

C - 10 FEATURES
            PREDICTED
TRUE    S       NS      P      R      F
S       75.2    24.8    0.75   0.92   0.82
NS      3.30    96.7    0.97   0.89   0.92

D - 5 FEATURES
            PREDICTED
TRUE    S       NS      P      R      F
S       77.7    22.3    0.78   0.88   0.83
NS      5.40    94.6    0.95   0.89   0.92
TABLE 2
EXPERIMENTAL RESULTS BY USING SUPPORT VECTOR MACHINES

A - PARAMETERS
Sigma   C     NS Test   S Test
10      10    241       121

B - 20 FEATURES
            PREDICTED
TRUE    S       NS      P      R      F
S       68.6    31.4    0.68   0.95   0.80
NS      1.66    98.34   0.98   0.86   0.92

C - 10 FEATURES
            PREDICTED
TRUE    S       NS      P      R      F
S       69.4    30.6    0.69   0.94   0.80
NS      2.07    97.9    0.98   0.86   0.92

D - 5 FEATURES
            PREDICTED
TRUE    S       NS      P      R      F
S       67.7    32.2    0.68   0.93   0.78
NS      2.49    97.5    0.97   0.86   0.91
TABLE 3
EXPERIMENTAL RESULTS BY USING EXTREME LEARNING MACHINES

A - PARAMETERS
Neurons   Lambda   NS Test   S Test
2000      0.1      241       121

B - 20 FEATURES
            PREDICTED
TRUE    S       NS      P      R      F
S       76.8    23.1    0.77   0.80   0.78
NS      9.54    90.5    0.90   0.88   0.89

C - 10 FEATURES
            PREDICTED
TRUE    S       NS      P      R      F
S       70.2    29.7    0.70   0.68   0.69
NS      16.1    83.8    0.84   0.85   0.84

D - 5 FEATURES
            PREDICTED
TRUE    S       NS      P      R      F
S       71.9    28.0    0.72   0.71   0.71
NS      15.0    85.0    0.85   0.86   0.85

The experimental results show the different detection performances of the algorithms. Table 1 shows a slight increase in spammer detection when the number of features is reduced from 20 to 5 (from 76.9% to 77.7%): it is possible to obtain satisfying results with Random Forests. Conversely, for Support Vector Machines and Extreme Learning Machines both the detection performance and the computational cost decrease as the number of features is reduced, as shown in Tables 2 and 3. Besides, the Random Forest algorithm obtains better
detection results than Extreme Learning Machine using 20 features.

5 Conclusion
The paper applied three machine learning algorithms to Twitter spammer detection and compared their performance, in order to identify the algorithm and the parameters that combine satisfactory detection results with good computational performance. Experimental results confirm the effectiveness of the Random Forest algorithm compared to Support Vector Machines and Extreme Learning Machines: the Random Forest performance increases as the number of features decreases, as opposed to the other two techniques. This behavior underlines the advantage of choosing few features in terms of both detection and computational cost.
References:
[1] F. Bisio, C. Meda, P. Gastaldo, R. Zunino, A Machine Learning approach for Twitter Spammers Detection, ICCST 2014 - 48th Annual IEEE International Carnahan Conference on Security Technology, Rome, Italy.
[2] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida, Detecting Spammers on Twitter, CEAS 2010 - Seventh annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, Redmond, Washington, US.
[3] F. Bisio, S. Decherchi, P. Gastaldo, R. Zunino, Semi-supervised machine learning approach for unknown malicious software detection, IEEE International Symposium on Innovation in Intelligence Systems and Applications Proceedings, 2014.
[4] V. Vapnik, Statistical Learning Theory, New York: John Wiley, 1998.
[5] G.B. Huang, H. Zhou, X. Ding, and R. Zhang, Extreme Learning Machine for Regression and Multiclass Classification, IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, vol. 42, no. 2, pp. 513-529, 2012.
[6] L. Breiman, Bagging predictors, Machine Learning 24(2), 123-140, 1996.
[7] Reporting Spam on Twitter, https://support.twitter.com/groups/56-policies-violations/topics/238-report-a-violation/articles/64986-reporting-spam-ontwitter
[8] G. Stringhini, C. Kruegel, G. Vigna, Detecting Spammers on Social Network, ACSAC 2010, Austin, Texas USA.
[9] M. Bouguessa, An Unsupervised Approach for Identifying Spammers in social networks, 23rd IEEE international conference on Tools with artificial intelligence, 2011.
[10] J. Martinez-Romo, L. Araujo, Detecting malicious tweets in trending topics using a statistical analysis of language, Expert Systems with Applications, 2013.

File: machine-learning-techniques-applied-to-twitter-spammers-detection.pdf
Published: Fri Nov 14 10:42:18 2014
Pages: 6
File size: 0.58 Mb

