Number Game » Linux Magazine

Getting started with the R data analysis language

Data Cleanup

Data cleanup examples are difficult to generalize, because the actions you need to take heavily depend on the individual dataset. But there are a number of fairly common actions. For example, you might need to rename cryptically labeled columns. The recommended approach is to first standardize the designations. Then change the column names with the colnames() command. Then pass in the index of the column whose name you want to change in square brackets. The index of a particular column can also be found automatically (Listing 5, first line). If you do not want to overwrite the column caption of the original mtcars dataset, first copy the data to a new data frame with df <- mtcars.

Listing 5

Data Cleanup

> colnames(mtcars)[colnames(mtcars) == 'cyl'] <- 'Zylinder'
> without.zeros <- na.omit(mtcars)
> without.duplicates <- unique( mtcars )

If the records have empty fields, this can lead to errors. That's why it is a good idea to resolve this potential worry at the start of the cleanup. Depending on how often empty fields occur, you can either fill them with estimated values (imputation) or delete them. The command from the second line of Listing 5 removes all lines that contain at least one zero (also NaN or NA).

Records also often contain duplicates. If the duplicate is the result of a technical error in data retrieval or in the source system, you should first try to correct this error. R provides an easy way to clean up the dataset and assign the results to a new, clean data frame with the unique() command (Listing 5, last line).

Predictive Modeling

In reality, there are a variety of prediction models with a wide range of parameters that provide better or worse results depending on the requirements and data. For an example, I'll use a dataset for irises (the flowers) – one of the best-known datasets for machine learning examples.

As an algorithm, I use a decision tree to predict the iris species – given certain properties, for example, the length (Petal.Length) and width (Petal.Width) of the calyx. To do this, I first need to load the data, which already exists in an R library (Listing 6, line 1).

Listing 6

Prediction with Iris Data

01 > data(iris)
02 > n <- nrow(iris)
03 > n_train <- round(.70 * n)
04 > set.seed(101)
05 > train_indicise <- sample(1:n, n_train)
06 > iris_train <- iris[train_indicise, ]
07 > iris_test <- iris[-train_indicise, ]
08 > install.packages("rpart ")
09 > install.packages("rpart.plot")
10 > library(rpart)
11 > library(rpart.plot)
12 > iris_model <- rpart(formula = Species ~.,data = iris_train, method = "class")
13 > rpart.plot(iris_model, type=4)

The next thing to do is to split the data into training and test data. The training data is used to train the model, whereas the test data checks the predictions and evaluates how well the model works. You would typically use about 70 percent of the data for training and the remaining 30 percent for testing. To do this, first determine the length of the record using the nrow() function and multiply the number by 0.7 (Listing 6, lines 2 and 3). Then randomly select an appropriate amount of data (line 5).

I have set a seed of 101 for the random value selection in the example (line 4). If you set the same value for the seed, you will see identical random values. Following this, split the data into iris_train for training and iris_test for validation (lines 6 and 7).

After splitting the data, you can train and evaluate the decision tree model. To do this, you need the rpart library. rpart.plot visualizes the decision tree (lines 8 to 11). Next, generate the decision tree based on the training data. When doing so, pass in the Species column in order to predict which iris species you are looking at (line 12).

One advantage of the decision tree is that it is relatively easy to see which parameters the model refers to. rpart.plot lets you visualize and read the parameters (line 13). Figure 5 shows that the iris species is setosa if the Petal.Length is greater than 2.5. If the Petal.Length exceeds 2.5 and the Petal.Width is less than 1.7, then the species is probably versicolor. Otherwise, the virginica species is the most likely.

Figure 5: Visualizing the decision tree model with the iris data.

The next step in the analysis process is to find out how accurate the results are. To do this, you need to feed the model data that it hasn't seen before. The previously created test data is used for this purpose. Then use predict() to generate predictions based on the test data using the iris_model model (Listing 7, line 1).

Listing 7

Accuracy Estimation

01 > iris_pred <- predict(object = iris_model, newdata = iris_test, type = "class")
02 > install.packages("caret")
03 > library(caret)
04 > confusionMatrix(data = iris_pred, reference = iris_test$Species)

There are a variety of metrics for determining the quality of the model. The best known of these metrics is the confusion matrix. To compute a confusion matrix, first install the caret library (lines 2 and 3), which will give you enough time for an extensive coffee break even on a fast computer. Then evaluate the iris_pred data (line 4).

The statistics show that the model operates with an accuracy of 93 percent. The next step would probably be to optimize the algorithm or find a different algorithm that offers greater accuracy.

You can now also imagine how this algorithm could be applied to other areas. For example, you could use environmental climate data (humidity, temperature, etc.) as the input, combine it with information on the type and number of defects in a machine, and use the decision tree to determine the conditions under which the machine is likely to fail.

Importing Data

If you want to analyze your own data now, you just need to import the data into R to get started. R lets you import data from different sources.

To import data from a CSV file, first pass the file name (including the path if needed) to the read.table() function and optionally specify whether the file contains column names. You can also specify the separator character for the fields in the lines (Listing 8, first line).

Listing 8

Data Import

> df <- read.table("meine_datei.csv", header = FALSE, sep = ",")
> my_daten <- read_excel("my_excel-file.xlsx")

If the data takes the form of an Excel spreadsheet, you can also import it directly. To do this, install the readxl library and use read_excel() (second line) to import the data.

« Previous 1 2 3 4 Next »

Buy this article as PDF

Express-Checkout as PDF

Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES

Print Issues

Digital Issues

SUBSCRIPTIONS

Print Subs

Digisubs

TABLET & SMARTPHONE APPS

US / Canada

UK / Australia

Getting started with the R data analysis language

Data Cleanup

Predictive Modeling

Importing Data

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

News

Endless OS 6 has Arrived

Fedora Asahi 40 Remix Available for Macs with Apple Silicon

Red Hat Adds New Deployment Option for Enterprise Linux Platforms

OSJH and LPI Release 2024 Open Source Pros Job Survey Results

Proton 9.0-1 Released to Improve Gaming with Steam

So Long Neofetch and Thanks for the Info

Ubuntu 24.04 Comes with a “Flaw"

Canonical Releases Ubuntu 24.04

Linux Servers Targeted by Akira Ransomware

TUXEDO Computers Unveils Linux Laptop Featuring AMD Ryzen CPU

Getting started with the R data analysis language

Data Cleanup

Predictive Modeling

Importing Data

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters Find Linux and Open Source Jobs Subscribe to our ADMIN Newsletters

Support Our Work

News

Tag Cloud

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters