Detecting spam users automatically with a neural network

Circuit Training

The training is conducted iteratively. During a pass (an epoch), the training script passes the complete data set through the neural network once and works out the loss. For performance reasons, the script does not process all the data at once but, rather, divides it into smaller portions (batches).
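The relationship between epochs and batches can be sketched in a few lines of NumPy. This is only an illustration: the helper iterate_batches(), the batch size of 4, and the random data are assumptions made for this sketch and do not come from the article's training script.

```python
import numpy as np

def iterate_batches(features, labels, batch_size, rng):
    """Yield shuffled (features, labels) batches that cover the data set once."""
    order = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 3))    # 10 made-up records with 3 properties each
labels = rng.integers(0, 2, size=10)   # made-up classes: 0 = legitimate, 1 = spammer

for epoch in range(2):
    # One epoch = one complete pass over the data, split into batches.
    batch_sizes = [len(x) for x, _ in iterate_batches(features, labels, 4, rng)]
    print(f"epoch {epoch}: batch sizes {batch_sizes}")   # batch sizes [4, 4, 2]
```

In a real training loop, the loss would be computed and the parameters updated once per batch rather than merely counting the records.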

In each subsequent epoch, the training script uses the optimization process to adjust the parameters so that the loss becomes smaller. Ideally, the loss will converge toward a specific minimum value. The network can then be regarded as trained, and the training script stores the calculated weightings and threshold values that produced the minimum loss.

Figure 3 shows a simplified view of how the gradient descent method determines the minimum for an individual weighting w. The x-axis is the weighting, and the y-axis is the loss function's value with this weighting. With each iteration of the gradient method, the training script determines the derivative of the loss curve, and thus its slope, at the point of the current weighting. It then moves a step in the direction of the falling slope; the learning rate (eta) determines the size of the step.
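For a single weighting, the update rule amounts to w = w - eta * dL/dw. The following minimal sketch applies it to a made-up quadratic loss with its minimum at w = 3; the loss function, the starting value, and the learning rate of 0.1 are assumptions chosen purely for illustration.

```python
def loss(w):
    # Made-up convex loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def loss_gradient(w):
    # Derivative of the loss above with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w, eta, steps):
    for _ in range(steps):
        # Step against the slope; eta scales the step size.
        w = w - eta * loss_gradient(w)
    return w

w_final = gradient_descent(w=0.0, eta=0.1, steps=100)
print(w_final, loss(w_final))
```

A real network repeats this step for every weighting and threshold value at once, and its loss surface is generally not a simple parabola.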

Figure 3: The gradient descent method determines a function's minimum in this simplified depiction.

Interesting Properties

The fields of the data sets, which are used as inputs for the network, are known as properties. The neural network works with real numbers, which means names and IP addresses cannot be added directly as properties in the form of strings.

Our experience shows that spammers often use very cryptic usernames. We were able to derive the following properties to help identify spammers: the length, the number of hyphens, the number of numerals, the variety of characters used, the number of vowels, the number of non-letters, and the occurrence of certain keywords (e.g., credits, 100mg, taler).
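A possible way to turn a username into such numeric properties is sketched below. The function name username_features() and the exact definitions are assumptions: in particular, the character-diversity property is interpreted here as the number of distinct characters, and the keyword property as a simple 0/1 flag.

```python
KEYWORDS = ("credits", "100mg", "taler")   # example keywords named in the text
VOWELS = set("aeiou")

def username_features(name):
    """Turn a username into a list of numeric properties."""
    lower = name.lower()
    return [
        len(name),                                # length
        name.count("-"),                          # number of hyphens
        sum(ch.isdigit() for ch in name),         # number of numerals
        len(set(lower)),                          # number of distinct characters
        sum(ch in VOWELS for ch in lower),        # number of vowels
        sum(not ch.isalpha() for ch in name),     # number of non-letters
        int(any(k in lower for k in KEYWORDS)),   # keyword present (0 or 1)
    ]

print(username_features("buy-credits-100mg"))   # [17, 2, 3, 15, 3, 5, 1]
```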

A geolocation database resolves the country and the ISP that match a particular IP address. The existing data reveals how often spam originates from a given ISP, how frequently a particular combination of country of origin and chosen language appears for the website builder, and which countries transmit an especially large amount of spam.

The next step is to sort out properties that do not correlate strongly with the class and thus contribute little to the outcome. Removing them pays off because a smaller network can be trained more quickly and needs fewer resources. You can use a correlation matrix to discern how well suited the properties are for spam detection.

Listing 1 shows a Python script for setting up the correlation matrix. The script reads a CSV file with the data, computes the correlation matrix using the np.corrcoef() function, and finally generates a PNG file with the density plot of the matrix. The script ignores the first column (the username in the sample data) during this process. If the CSV file contains other values that are not real numbers, you will have to modify the read_file() function accordingly. The script expects the class, which distinguishes spammers from legitimate website builders, to be in the last column.

Listing 1


The density plot (Figure 4) shows an overview of which properties are particularly suitable. Each row and each column represents a property. The lighter the field, the higher the correlation between the row property and the column property. The last row and the last column reveal the correlation with the class. For this reason, the lighter a field in the last row or column, the better suited the corresponding property is to the classification.
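The following minimal sketch shows how the correlation with the class can be read from the last row of the matrix that np.corrcoef() returns. It is not the article's Listing 1: the data is randomly generated, and the two made-up properties (digits and length) merely stand in for real columns.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
spam = rng.integers(0, 2, size=n)               # made-up class column
digits = spam * 3 + rng.normal(0, 1, size=n)    # property tied to the class
length = rng.normal(10, 2, size=n)              # property unrelated to the class
data = np.column_stack([digits, length, spam])  # class in the last column

corr = np.corrcoef(data, rowvar=False)          # rowvar=False: columns are variables
class_corr = corr[-1, :-1]                      # last row: correlation with the class
ranking = np.argsort(-np.abs(class_corr))       # strongest property first
print(corr.shape, class_corr, ranking)
```

Properties at the end of the ranking are the candidates for removal. The absolute value is used because a strongly negative correlation is just as informative as a strongly positive one.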

Figure 4: The correlation matrix density plot reveals the interdependencies among properties.

The correlation matrix also reveals whether two properties are so similar that it would be sufficient to keep only one of them. Properties 6 (the number of numerals in the username) and 10 (the number of non-letters) are an example of this. The white field indicates a strong relationship between these two variables. It is therefore sufficient to take property 6 into account, because property 10 provides no additional information.
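Such redundant pairs can also be found programmatically by scanning the off-diagonal entries of the matrix. In this sketch, the helper redundant_pairs(), the threshold of 0.9, and the generated columns are assumptions for illustration; a suitable threshold depends on the data.

```python
import numpy as np

def redundant_pairs(features, threshold=0.9):
    """Return (i, j, r) for property pairs whose |correlation| exceeds threshold."""
    corr = np.corrcoef(features, rowvar=False)
    pairs = []
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[1]):   # upper triangle only
            if abs(corr[i, j]) > threshold:
                pairs.append((i, j, float(corr[i, j])))
    return pairs

rng = np.random.default_rng(7)
numerals = rng.integers(0, 6, size=300).astype(float)
non_letters = numerals + rng.integers(0, 2, size=300)   # numerals plus an occasional hyphen
vowels = rng.integers(0, 8, size=300).astype(float)
features = np.column_stack([numerals, non_letters, vowels])

print(redundant_pairs(features))   # reports the pair (0, 1)
```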


The lion's share of work with neural networks is in determining the structure or configuration of the network with the aid of hyperparameters. Developers usually perfect this process manually by individually training each network configuration and comparing the results until ending up at a good configuration. Hyperparameters include the number and size of the layers, the activation functions for the layers, the number of epochs, the size of the data batches, the optimization process, and the learning rate.

TFLearn offers a variety of activation functions in the tflearn.activations package. Figure 5 depicts the most important of these functions. The simplest is the identity or linear activation function, which returns the input value unaltered. The sigmoid function is non-linear and so is more interesting as an activation function than its linear counterpart. The function is bounded and only produces values between 0 and 1.

Figure 5: The most important activation functions at a glance.

Tanh can be compared with sigmoid, except that it returns values between -1 and 1. A further activation function is known as a rectified linear unit, or ReLU. You can think of this as a linear function with a threshold value: it returns 0 for negative inputs and the input itself otherwise. Networks with ReLU typically converge very quickly, and it is the recommended function at this time. We also use it for our network.
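The four functions mentioned so far are easy to write down in NumPy, which makes their value ranges visible. This is a plain NumPy sketch for illustration and does not use the TFLearn implementations.

```python
import numpy as np

def linear(x):
    return x                           # identity: input returned unaltered

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # values between 0 and 1

def relu(x):
    return np.maximum(0.0, x)          # 0 for negative inputs, x otherwise

x = np.array([-2.0, 0.0, 2.0])
for name, fn in [("linear", linear), ("sigmoid", sigmoid),
                 ("tanh", np.tanh), ("relu", relu)]:
    print(name, np.round(fn(x), 3))
```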

Another important activation function goes by the name softmax. The softmax function relates the value of a neuron to the values of the other neurons in the same layer. Its special characteristic is that all output values in this layer add up to 1. Developers often use this function for the output layer in networks whose purpose is to classify. The network's output can then be interpreted as probabilities for the individual classes.
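A minimal NumPy sketch of softmax illustrates this property. The two input scores are made-up output-layer values for the classes spammer and legitimate; subtracting the maximum before exponentiating is a common trick for numerical stability and does not change the result.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)    # avoids overflow in exp for large inputs
    e = np.exp(shifted)
    return e / e.sum()         # normalize so the outputs add up to 1

scores = np.array([2.0, 0.5])  # made-up values: [spammer, legitimate]
probs = softmax(scores)
print(probs, probs.sum())
```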

Along with the layers and their activation functions, you also pick an optimization process. Developers often train neural networks with Adam, an algorithm that generally achieves good results, as an alternative to the classic gradient method. Adam also needs a learning rate. The preset value of 0.001 is a suitable learning rate to start with, although reducing it can sometimes achieve an even better outcome. All the optimization techniques supplied with TFLearn are included in the tflearn.optimizers package.
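To give an idea of what Adam does differently from the classic gradient method, the following sketch implements the update rule for a single weighting in plain Python. It is not TFLearn's implementation; the values for beta1, beta2, and epsilon are the defaults commonly quoted for Adam, and the quadratic loss with its minimum at w = 1 is an assumption for illustration.

```python
import math

def adam_minimize(grad, w, learning_rate=0.001, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=5000):
    m = 0.0   # running average of the gradient
    v = 0.0   # running average of the squared gradient
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w

# Made-up loss (w - 1)^2 with the gradient 2 * (w - 1).
w_adam = adam_minimize(lambda w: 2.0 * (w - 1.0), w=0.0)
print(w_adam)
```

Because the step is divided by the root of the averaged squared gradient, its size depends far less on the raw magnitude of the gradient than in the classic method.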

With a set of hyperparameters chosen, you train the network and measure its accuracy, then vary the hyperparameters and train again. You continue to repeat this process until you can no longer significantly increase the accuracy.
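This loop of training, measuring, and varying can be sketched as a small grid search. Everything in the example is an assumption for illustration: the model is a single sigmoid neuron rather than the article's network, the data is randomly generated, and only the learning rate and the number of epochs are varied.

```python
import itertools
import numpy as np

def train_and_score(x_train, y_train, x_val, y_val, learning_rate, epochs):
    """Train a one-neuron sigmoid classifier and return validation accuracy."""
    w = np.zeros(x_train.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x_train @ w + b)))
        w -= learning_rate * x_train.T @ (p - y_train) / len(y_train)
        b -= learning_rate * np.mean(p - y_train)
    pred = (x_val @ w + b) > 0
    return float(np.mean(pred == y_val))

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=400)
x = rng.normal(size=(400, 2)) + y[:, None] * 2.0   # made-up, partly overlapping classes
x_train, y_train, x_val, y_val = x[:300], y[:300], x[300:], y[300:]

results = {}
for lr, epochs in itertools.product([0.001, 0.01, 0.1], [10, 1000]):
    results[(lr, epochs)] = train_and_score(x_train, y_train, x_val, y_val, lr, epochs)

best = max(results, key=results.get)
print(best, results[best])
```

Measuring the accuracy on records that were not used for training, as done here with the last 100 records, avoids favoring a configuration that merely memorizes the training data.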

Figure 6 shows the loss during training in graph form. The graph indicates whether the learning rate is too high or too low. Ideally, the loss graph should resemble a falling exponential curve (the blue graph). If the learning rate is too high, the loss initially drops quickly but may converge prematurely (red), which indicates that you have not yet found the optimum. If the learning rate is too low, the loss drops only very slowly, although you are then more likely to actually find the global optimum (yellow). If the loss graph is too noisy, you can increase the batch size.
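Why a larger batch smooths the loss graph can be shown with a small experiment: the mean loss of a batch fluctuates less from batch to batch as the batch grows. The per-record losses below are randomly generated, and the helper batch_loss_spread() is a made-up name for this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
per_record_loss = rng.exponential(scale=1.0, size=10_000)   # made-up losses

def batch_loss_spread(losses, batch_size, rng, draws=500):
    """Standard deviation of the mean loss across randomly drawn batches."""
    means = [rng.choice(losses, size=batch_size, replace=False).mean()
             for _ in range(draws)]
    return float(np.std(means))

spreads = {size: batch_loss_spread(per_record_loss, size, rng)
           for size in (16, 64, 256)}
for size, spread in spreads.items():
    print(size, round(spread, 3))
```

The spread shrinks roughly with the square root of the batch size, so quadrupling the batch size about halves the noise.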

Figure 6: Loss during training at different learning rates, with the blue curve representing the ideal.
