Detecting spam users automatically with a neural network
Circuit Training
The training is conducted iteratively. During one pass (an epoch), the training script feeds the complete data set through the neural network once and computes the loss. The data is not used all at once in this process but, rather, divided into smaller portions (batches) for performance reasons.
In each subsequent epoch, the optimization process sets new parameters that incur a smaller loss. Ideally, the loss converges toward a specific minimum value. The network can then be regarded as trained, and the training script stores the calculated weightings and threshold values that produced the minimum loss.
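The epoch/batch scheme can be sketched with a toy model: a single weight fitted to a small data set. All names and values here are illustrative assumptions, not the article's actual training script.

```python
import numpy as np

# Toy "network": a single weight w that should approach the data mean.
np.random.seed(0)
data = np.arange(12, dtype=float)   # the complete data set
batch_size = 4                      # smaller portions for each update
w, eta = 0.0, 0.05                  # initial weight and learning rate

for epoch in range(20):             # one epoch = one full pass over the data
    np.random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad = np.mean(2 * (w - batch))   # gradient of the batch loss (w - x)^2
        w -= eta * grad                   # one parameter update per batch

print(round(w, 1))                  # approaches the data mean of 5.5
```

Updating once per batch rather than once per epoch is what makes the smaller portions pay off: the weight moves several times during each pass over the data.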
Figure 3 shows a simplified view of how the gradient descent method determines the minimum for an individual weighting w. The x-axis is the weighting, and the y-axis is the loss function's value with this weighting. With each iteration of the gradient method, the training script determines the derivative of the loss graph – and thus its slope – at the point of the current weight, then moves a step against the slope, scaled by the learning rate (eta).
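For a single weight, the update rule is just `w -= eta * dL/dw`. The following sketch applies it to a hypothetical one-dimensional loss with a known minimum (the loss function itself is an assumption for illustration):

```python
# Hypothetical loss L(w) = (w - 3)^2, with its minimum at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)   # derivative of the loss with respect to w

w = 0.0      # initial weight
eta = 0.1    # learning rate
for _ in range(100):
    w -= eta * gradient(w)   # step against the slope, scaled by eta

print(round(w, 4))           # converges toward the minimum at 3.0
```

If eta is too large the iterates overshoot and oscillate; if it is too small, convergence takes many more steps – the same trade-off the article returns to in the hyperparameter discussion.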
Interesting Properties
The fields of the data sets that are used as inputs for the network are known as properties (features). The neural network works with real numbers, which means names and IP addresses cannot be fed in directly as strings.
Our experience shows that spammers often use very cryptic usernames. We were able to derive the following properties to help identify spammers: the length, the number of hyphens, the number of numerals, the diversity of the characters, the number of vowels, the number of non-letters, and the occurrence of certain keywords (e.g., credits, 100mg, taler).
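Each of these properties reduces a username string to a real number the network can consume. A minimal sketch might look as follows; the exact definitions (e.g., measuring diversity as the share of distinct characters) and the keyword list are assumptions, not the authors' code:

```python
# Hypothetical feature extraction for a username, one number per property.
SPAM_KEYWORDS = ("credits", "100mg", "taler")
VOWELS = set("aeiou")

def username_features(name):
    lower = name.lower()
    return [
        len(name),                                    # length
        name.count("-"),                              # hyphens
        sum(c.isdigit() for c in name),               # numerals
        len(set(lower)) / max(len(name), 1),          # character diversity
        sum(c in VOWELS for c in lower),              # vowels
        sum(not c.isalpha() for c in name),           # non-letters
        int(any(k in lower for k in SPAM_KEYWORDS)),  # spam keywords present?
    ]

print(username_features("cheap-credits100"))
```

Applied to a batch of usernames, this yields one row of real numbers per user – exactly the shape of input the network expects.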
A geolocation database supplies the country and the ISP matching a particular IP address. The on-hand data reveals how often an ISP hosts spammers, how frequently a combination of a particular country of origin and chosen language appears for the website builder, and which countries transmit an especially large amount of spam.
The next step is to sort out properties that do not correlate strongly with the class and thus contribute little to the outcome. Discarding these properties pays off because a smaller network can be trained more quickly and needs fewer resources. You can use a correlation matrix to discern how well suited individual properties are for spam detection.
Listing 1 shows a Python script for setting up the correlation matrix. The script reads a CSV file with the data, computes the correlation matrix using the np.corrcoef() function, and finally generates a PNG file with a density plot of the matrix. The script ignores the first column (in the sample data, the username) during this process. If the CSV file contains other values that are not real numbers, you will have to modify the read_file() function accordingly. The class, which distinguishes spammers from legitimate website builders, is expected to be in the last column.
Listing 1
correlation.py
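The listing itself is not reproduced in this excerpt; a minimal sketch along the lines described might look like the following. The file name, the read_file() helper, and the plotting details are assumptions based on the description above, not the article's actual script.

```python
import csv

import matplotlib
matplotlib.use("Agg")            # render to a file, no display needed
import matplotlib.pyplot as plt
import numpy as np

def read_file(path):
    """Read the CSV data, skipping the first column (the username).

    All remaining columns must be real numbers; the class is expected
    to be in the last column.
    """
    rows = []
    with open(path) as f:
        for row in csv.reader(f):
            rows.append([float(v) for v in row[1:]])
    return np.array(rows)

def main():
    data = read_file("users.csv")
    corr = np.corrcoef(data, rowvar=False)  # one row/column per property
    plt.matshow(np.abs(corr))               # lighter field = stronger correlation
    plt.colorbar()
    plt.savefig("correlation.png")

if __name__ == "__main__":
    main()
```

Note `rowvar=False`: np.corrcoef() treats rows as variables by default, whereas here each CSV row is one sample and each column is one property.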
The density plot (Figure 4) gives an overview of which properties are particularly suitable. Each row and each column covers one property. The lighter the field, the higher the correlation between the row property and the column property. The last row and the last column reveal the correlation with the class. The lighter a field in the last row or column, the better suited the corresponding property is for the classification.
The correlation matrix also reveals whether two properties are excessively similar, in which case it is sufficient to keep just one of them. Properties 6 (the count of numerals in the username) and 10 (the number of non-letters) are an example: the white field indicates a strong relationship between these two variables. It is therefore sufficient to take property 6 into account, because property 10 provides no additional information.
Hyperparameters
The lion's share of the work with neural networks is determining the structure or configuration of the network with the aid of hyperparameters. Developers usually perform this process manually, training each network configuration individually and comparing the results until they arrive at a good configuration. Hyperparameters include the number and size of the layers, the activation functions for the layers, the number of epochs, the size of the data batches, the optimization process, and the learning rate.
TFLearn offers a variety of activation functions in the tflearn.activations package. Figure 5 depicts the most important of these functions. The simplest is the identity, or linear, activation function, which returns the input value unaltered. The sigmoid function is nonlinear and is therefore more interesting as an activation function than its linear equivalent. The function is bounded and only produces values between 0 and 1.
Tanh is comparable with sigmoid, except that it returns values between -1 and 1. A further activation function is known as a rectified linear unit, or ReLU. You can think of it as a linear function with a threshold value. ReLU converges very quickly and is the recommended function at this time; we also use it for our network.
Another important activation function goes by the name softmax. The softmax function creates a relationship between the value of the neuron and the values of other neurons in the layer. Its special characteristic is that all output values in this layer add up to 1. Users often use this function for the output layer in networks whose purpose is to classify. The network's output can then be interpreted as probabilities for the individual classes.
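The activation functions discussed above are small enough to sketch directly in NumPy (TFLearn's versions operate on tensors, but the math is the same):

```python
import numpy as np

def identity(x):
    return x                         # linear: returns the input unaltered

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # bounded to the range (0, 1)

def tanh(x):
    return np.tanh(x)                # like sigmoid, but bounded to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)        # linear above the threshold 0, zero below

def softmax(x):
    e = np.exp(x - np.max(x))        # shift inputs for numerical stability
    return e / e.sum()               # all outputs in the layer sum to 1

scores = np.array([2.0, 1.0, 0.1])   # hypothetical output-layer activations
print(softmax(scores))               # interpretable as class probabilities
```

The softmax line illustrates the classification use case: whatever raw scores the output layer produces, the normalized values sum to 1 and can be read as per-class probabilities.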
Along with the layers and their activation functions, you also pick an optimization process. As an alternative to the classic gradient method, developers often train neural networks with Adam, an algorithm that generally achieves good results. Adam also needs a learning rate. The preset value of 0.001 is a suitable starting point, although it can be reduced to achieve an even better outcome where possible. All the optimization techniques supplied with TFLearn reside in the tflearn.optimizers package.
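To make Adam's difference from plain gradient descent concrete, here is a simplified single-parameter sketch of one Adam update, using the algorithm's standard default coefficients. This is a didactic reduction, not TFLearn's implementation:

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w at iteration t (1-based)."""
    m = b1 * m + (1 - b1) * grad        # running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)           # bias correction for the warm-up phase
    v_hat = v / (1 - b2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize the toy loss L(w) = (w - 3)^2 with Adam and the preset rate 0.001.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * (w - 3.0), m, v, t)
print(round(w, 2))                      # close to the minimum at 3.0
```

Unlike the plain gradient step, Adam normalizes by the running gradient magnitude, so the effective step size stays near eta regardless of how steep the loss surface is.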
You can train your network and measure its accuracy by using the hyperparameters you have found, before proceeding to vary the parameters. You continue to repeat this process until you can no longer significantly increase the accuracy.
Figure 6 shows the loss during training in graph form. The graph indicates whether the learning rate is too high or too low. Ideally, the loss graph resembles a falling exponential curve (the blue graph). If the learning rate is too high, the loss initially drops quickly, but it may converge prematurely (red), which indicates that you have not yet found the optimum. If the learning rate is too low, the loss drops only very slowly, but you are more likely to actually find the global optimum (yellow). You can increase the batch size if the loss graph is too noisy.