Artificial intelligence detects mileage patterns

Staying Normal

The regression only works if the training data has previously been normalized to a constrained value range. If the script feeds the optimizer the unmodified Unix epoch seconds as the mileage dates, the algorithm goes haywire and produces increasingly nonsensical values, until it finally exceeds the limits of the hardware's floating-point math and sets all parameters to NaN (Not a Number).

Lines 31 to 37 in Listing 1 therefore normalize the training data by using pandas' min() and max() methods to find the minimum and maximum timestamps, then subtracting the minimum from all training values as an offset, and finally dividing by the difference between maximum and minimum.

This process normally results in training values between 0 and 1 (caution: if the minimum equals the maximum, the division by zero fails), which the optimizer can process far more efficiently.
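The normalization step can be sketched in a few lines of pandas. The sample timestamps and variable names below are illustrative; Listing 1 reads the real values from the CSV file:

```python
import pandas as pd

# Hypothetical epoch-second timestamps; Listing 1 reads these from a CSV file
dates = pd.Series([1486972800, 1487059200, 1487145600, 1487232000])

norm_off = dates.min()               # offset: the smallest timestamp
norm_mult = dates.max() - norm_off   # scale: the max-min difference

if norm_mult == 0:                   # guard: a constant series cannot be scaled
    raise ValueError("cannot normalize: min equals max")

X = (dates - norm_off) / norm_mult   # values now lie between 0 and 1
print(X.tolist())
```

The first value always maps to 0 and the last to 1; everything in between lands proportionally on the unit interval.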

With the learned parameters, it is now possible to reproduce historical values within the model's framework or predict the future. What mileage will the car have on June 2, 2019? The date has an epoch value of 1559516400, which the model has to normalize just as in the training case. The offset of 1486972800, found as norm_off in Figure 3, gets subtracted, and the input date is also divided by the scaling factor norm_mult of 7686000.

This results in an X value of 9.43, which is substituted into the formula

Y = X * W + b

to predict a mileage of 94,115 for June 2, 2019 – all assuming, of course, that the model is accurate (i.e., that the increase is indeed linear) and that the three months of training data are sufficient to determine the slope of the curve more or less accurately.
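The prediction step can be reproduced with the normalization constants from Figure 3. The values of W and b below are made up for illustration, because the article does not list the learned parameters:

```python
norm_off, norm_mult = 1486972800, 7686000   # normalization constants, Figure 3
epoch = 1559516400                          # June 2, 2019, as epoch seconds

# Normalize the input date exactly as in the training case
X = (epoch - norm_off) / norm_mult          # roughly 9.44

# W and b are whatever the optimizer learned; these values are hypothetical
W, b = 8000.0, 18600.0
Y = X * W + b                               # predicted mileage
print(X, Y)
```

The key point is that a prediction input must pass through the same offset and scaling as the training data, or the learned parameters are meaningless.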

Keeping Back Data

To ensure that the model not only simulates the training data but also predicts the real future, AI specialists often break down the available data into a training set and a test set. They train the model only with data from the training set; otherwise, there is a risk that it will mimic the training data perfectly, including any temporary outliers that never recur in production, causing the system to predict artifacts that are out of touch with reality.

If the test set remains untouched up to the end of the training runs and the model later also correctly predicts the test data, the AI system will most likely behave as expected later in a production environment.

Now, my 30-year-old HP-41CV pocket calculator could already determine the parameters W and b from a collection of X/Y values by assuming a linear relationship and running a linear regression. However, TensorFlow can do much more, because it also understands neural networks and decision trees, as well as more complex regression techniques.
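The least-squares line that the pocket calculator computes takes only a couple of lines with NumPy; the sample data below is invented for illustration:

```python
import numpy as np

# A handful of X/Y pairs with a roughly linear relationship
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([10.2, 19.8, 30.1, 39.9, 50.0])

# Degree-1 least-squares fit: returns slope W and intercept b
W, b = np.polyfit(x, y, 1)
print(W, b)   # slope close to 10, intercept close to 10
```

np.polyfit solves the same normal equations the calculator's built-in linear regression does; the fancier models mentioned above only become necessary when the relationship is not a straight line.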

No Simple Pattern

If you look at the daily mileage numbers closely, you will note that the increase is by no means precisely linear over time. Figure 4 shows the higher resolution mileage growth per day and illustrates that the rise is subject to huge fluctuations. For example, the car travels between 16 and 50 miles on most days, interrupted every so often by a pause of two consecutive days with no increase in mileage at all.

Figure 4: My car's daily mileage over the last three months.

A person simply looking at the graph in Figure 4 will immediately see that the car is driven less on weekends than on workdays. For an AI system to offer the same kind of intuitive performance, the programmer needs to take it by the hand and guide it in the right direction.

If the dates are, for example, stated in epoch seconds, as is common on Unix, the AI system will never in its lifetime find out that the weekend happens every seven days, with less driving as a result. A linear regression would only stretch the last few data points into the future; a polynomial regression would produce completely insane patterns in a mad bout of overfitting.

The learning algorithms are also bad at handling incomplete data. If there are no measured values for certain X values, for example, on days when the car was only parked in the garage, the conscientious teacher needs to fill them with meaningful values (e.g., with zeros). Also, you need to add what is known as "expert knowledge" in the discipline of machine learning: Because the weekday of the date values is known and will hopefully help the algorithm, a new CSV file (miles-per-day-wday.csv) simply provides the sequence number of the weekday (neural networks do not like strings, only numbers) for the daily mileage reading (Figure 5).
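Both steps, filling gaps with zeros and deriving a numeric weekday, are easy to sketch in pandas. The dates and mileage figures below are invented; the real data comes from miles-per-day-wday.csv:

```python
import pandas as pd

# Hypothetical daily readings; 2019-05-29 and 2019-05-31 are missing
# because the car stayed in the garage on those days
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-05-27", "2019-05-28", "2019-05-30"]),
    "miles": [23, 41, 17],
})

# Reindex to a complete daily range, filling missing days with zeros
full = df.set_index("date").reindex(
    pd.date_range("2019-05-27", "2019-05-31"), fill_value=0)

# Weekday as a number (Monday=0 ... Sunday=6): networks want numbers, not strings
full["wday"] = full.index.dayofweek
print(full)
```

After the reindex, every calendar day has a row, and the wday column supplies the expert knowledge that the raw epoch seconds hide.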

Figure 5: Dates expressed as weekdays, as a crutch for the neural network.

Listing 2 then uses the sklearn framework to construct a neural network and teaches it to guess the associated day of the week based on the mileage. To do so, it first reads the CSV file and builds the data frame X from the mileage numbers and the vector y from the associated weekday numbers.

Listing 2

neuro.py

 

The train_test_split() function splits the existing data into a training set and a test set, which the standard scaler normalizes in lines 19 to 22 because neural networks are extremely meticulous as far as the value range of the input values is concerned.

The multilayer perceptron of type MLPClassifier generated in lines 24 and 25 creates a neural network with two layers and stipulates that the training phase will run for 1,000 steps at most. Calling the fit() method then triggers the teach-in, during which the optimizer adjusts the internal neuron weights in a bout of supervised learning until the error between the value predicted from the training parameters and the expected value in y_train is minimized.
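The pipeline described above can be sketched end to end. The synthetic data, the hidden layer sizes, and the random seeds below are assumptions standing in for Listing 2's CSV input and hyperparameters:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for miles-per-day-wday.csv: weekdays see long commutes,
# weekends only short trips
rng = np.random.default_rng(0)
wday = rng.integers(0, 7, 200)                       # weekday numbers 0..6
miles = np.where(wday < 5, 40, 5) + rng.normal(0, 3, 200)
X = miles.reshape(-1, 1)                             # one feature: daily mileage

X_train, X_test, y_train, y_test = train_test_split(X, wday, random_state=1)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Two hidden layers, at most 1,000 training steps, as the article describes
clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000, random_state=1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

Because mileage alone cannot distinguish one workday from another, the test score stays modest, which matches the article's observation that precision left something to be desired.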

The results were not all that exciting in the experiment, in part because the predicted values varied greatly from call to call, and the precision left something to be desired; yet, the neural network predicted the weekday from a given mileage in most cases. Experimenting with different input parameters might lead to better results.

With TensorFlow and scikit-learn, curious users have two sophisticated frameworks for experimenting with AI applications at their disposal. Getting started is anything but child's play, because the literature [3] [4] on the latest features is still fairly recent and not very mature, and a number of works are still in the development stage. However, the matter is worth exploring, because this area of computer science undoubtedly has a bright future ahead of it.

Infos

  1. "Programming Snapshot – Driving Data" by Mike Schilli, Linux Pro Magazine, issue 202, September 2017, p. 50, http://www.linuxpromagazine.com/Issues/2017/202/Programming-Snapshot-Driving-Data
  2. Listings for this article: ftp://ftp.linux-magazine.com/pub/listings/linux-magazine.com/<issue no.>/
  3. Guido, Sarah, and Andreas C. Müller. Introduction to Machine Learning with Python. O'Reilly Media, 2016
  4. Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn and TensorFlow. O'Reilly Media, 2017

The Author

Mike Schilli works as a software engineer in the San Francisco Bay area of California. In his column, launched back in 1997, he focuses on short projects in Perl and various other languages. You can contact Mike at mailto:mschilli@perlmeister.com.
