Getting started with the R data analysis language
RStudio Scripts
A script is a plain text file in which you store the R code. You can open a script file in RStudio via the File menu.
RStudio has many built-in features that make working with scripts easier. First, you can run a single line of a script by clicking the Run button or pressing Ctrl+Enter. R then executes the line of code in which the cursor is located. If you highlight a complete section, R will execute all the highlighted code. Alternatively, you can run the entire script by clicking the Source button.
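Outside the RStudio interface, the base R function source() also runs a complete script file. The following is a minimal sketch; the temporary file and its two lines of content are made up for illustration.

```r
# Write a tiny two-line script to a temporary file (hypothetical content).
script_path <- tempfile(fileext = ".R")
writeLines(c("a <- 2", "b <- a * 21"), script_path)

# source() executes every line of the file in order.
source(script_path)

print(b)  # b was created by the sourced script
```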
Data Analysis
A typical process in data analysis involves a series of phases. The first step in any data science project is to gather the right data from various internal and external sources. In practice, this step is often underestimated, and problems then arise with data protection, security, or technical access to interfaces.
Data cleaning or data preparation is a critical step in data analysis. The data collected from various sources might be disorganized, incomplete, or incorrectly formatted. If the quality of the data is not good, the findings will not be of much use to you later on. Data preparation usually takes the most time in the data analysis process.
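As a rough illustration of what data preparation can look like in base R, the following sketch removes duplicate rows, trims stray whitespace, converts a placeholder string to a missing value, and drops incomplete rows. The data frame raw and the "n/a" placeholder are invented for this example; real data will need its own rules.

```r
# Hypothetical raw data: one duplicate row, a text placeholder for a
# missing value, and stray whitespace around a number.
raw <- data.frame(
  id     = c(1, 2, 2, 3, 4),
  weight = c("70.5", "82", "82", "n/a", " 64.2 "),
  stringsAsFactors = FALSE
)

clean <- unique(raw)                        # drop exact duplicate rows
clean$weight <- trimws(clean$weight)        # remove surrounding whitespace
clean$weight[clean$weight == "n/a"] <- NA   # mark the placeholder as missing
clean$weight <- as.numeric(clean$weight)    # convert text to numbers
clean <- clean[!is.na(clean$weight), ]      # keep only complete rows

print(clean)
```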
After cleaning up the data, you need to visualize the data for a better understanding. Visualization is usually followed by hypothesis testing. The objective is to identify patterns in the dataset and find important potential features through statistical analysis.
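A hypothesis test can be as short as one function call. The sketch below uses R's built-in mtcars dataset and a Welch two-sample t-test to ask whether mean fuel economy differs between automatic and manual transmissions; the choice of test and the 0.05 threshold are assumptions made for illustration, not a recommendation.

```r
# Welch two-sample t-test: does mean mpg differ between automatic (am = 0)
# and manual (am = 1) transmissions in the built-in mtcars data?
result <- t.test(mpg ~ am, data = mtcars)

print(result$p.value)

# Conventional (assumed) significance threshold of 0.05.
reject_null <- result$p.value < 0.05
print(reject_null)
```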
After you draw insights from the data, a further step typically follows: You will want to predict how the data will evolve in the future. Prediction models are used for this purpose. Historical data is divided into training and validation sets, and the model is trained with the training dataset. You then verify the trained model using the validation dataset and evaluate its accuracy and efficiency.
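The split into training and validation data could be sketched as follows. The example assumes the built-in mtcars dataset, a 75/25 split, a simple linear model, and the root mean squared error (RMSE) as the accuracy measure; all of these are illustrative choices.

```r
set.seed(42)  # make the random split reproducible

# Randomly assign 75 percent of the rows to the training set.
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.75 * n))
train <- mtcars[train_idx, ]
valid <- mtcars[-train_idx, ]

# Train on the training set only.
model <- lm(mpg ~ wt, data = train)

# Verify on the held-out validation set.
pred <- predict(model, newdata = valid)
rmse <- sqrt(mean((valid$mpg - pred)^2))
print(rmse)
```

A lower RMSE on the validation set means the predictions are closer to the observed values.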
Data Visualization
R has powerful graphics packages that help with data visualization. These tools produce graphics in a variety of formats, which can also be inserted into documents of popular office suites. The formats include bar charts, pie charts, histograms, kernel density charts, line charts, box plots, heat maps, and word clouds.
To quickly generate a couple of plots using the previously installed ggplot2 package, first create two vectors of equal length: a set of x-values and a set of y-values. Here, the y-values are simply the squares of the x-values. Finally, output the graph (Listing 2).
Listing 2
Sample Graph
> x <- c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1)
> y <- x^2
> qplot(x, y)
The scatter plot is one of the chart types commonly used in data analysis; you can create a scatter plot using the plot(x, y) function. You can pass in other parameters, such as main for the title, xlab for the x-axis label, and ylab for the y-axis label. Listing 3 uses a dataset supplied with R that was taken from the US magazine Motor Trend in 1974. It covers fuel consumption and 10 other aspects of 32 vehicle models, including the number of cylinders and the vehicle weight. Load the dataset by typing:
data(mtcars)
Listing 3
Vehicle Data Example
> plot(mtcars$wt, mtcars$mpg, main = "Scatter chart", xlab = "Weight (wt)", ylab = "Miles per gallon (mpg)", pch = 20, frame = FALSE)
> fit <- lm(mpg ~ wt, data=mtcars)
> abline(fit, col="red")
The command head(mtcars) then displays the first six rows.
Use the abline() function to add a regression line to the graph (Figure 3). To do this, lm() first calculates the linear regression of fuel economy (mpg) on weight (wt), which shows that there is a relationship. This is a negative correlation: The lighter a vehicle is, the farther it can travel on the same amount of gasoline. The graph alone does not quantify the strength of the relationship, but summary(fit) provides a variety of characteristic values for the fit. These include a fairly high R-squared value, a statistical measure of how close the data points are to the regression line.
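To pull individual values out of the fit instead of reading the full summary, something like the following sketch should work. It refits the same model so that it runs on its own; the variable names slope, r2, and r are chosen for this example.

```r
# Same model as in Listing 3.
fit <- lm(mpg ~ wt, data = mtcars)

# Slope of the regression line: change in mpg per unit of wt.
slope <- coef(fit)[["wt"]]

# R-squared as reported by summary(fit).
r2 <- summary(fit)$r.squared

# Pearson correlation between weight and fuel economy.
r <- cor(mtcars$wt, mtcars$mpg)

print(c(slope = slope, r_squared = r2, correlation = r))
```

For a regression with a single predictor, the R-squared value equals the square of the correlation coefficient, so the negative sign of the relationship shows up only in the slope and in r.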
Histograms visualize the distribution of a single variable. A histogram shows how often a certain measured value occurs or how many measured values fall within a certain interval. The qplot() command automatically creates a histogram if you pass in only one vector to plot. For example, after x <- c(1, 2, 2, 3, 3, 4, 4, 4), the call qplot(x) creates a simple histogram.
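The counts behind such a histogram can be checked with base R alone. This sketch tabulates the same vector and then computes histogram bins without drawing them; the explicit breaks, one bin per integer value, are an assumption made here so that the bins line up with the values.

```r
x <- c(1, 2, 2, 3, 3, 4, 4, 4)

# How often does each value occur?
freq <- table(x)
print(freq)

# The same counts as histogram bins, computed without plotting.
h <- hist(x, breaks = seq(0.5, 4.5, by = 1), plot = FALSE)
print(h$counts)
```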
The box plot, also known as a box-and-whisker plot, is another type of chart. A box plot is a standardized method of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. In addition, a box plot highlights outliers and reveals whether the data is distributed symmetrically and how tightly it clusters.
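The five-number summary and the outlier rule can be reproduced with two base R functions. The vector v below is made up for illustration and contains one obvious outlier.

```r
# Hypothetical sample with one extreme value.
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# Tukey's five-number summary: minimum, lower hinge, median,
# upper hinge, maximum.
five <- fivenum(v)
print(five)

# boxplot.stats() applies the usual 1.5 x IQR rule; $out lists the
# points a box plot would draw as outliers, $stats the whisker ends,
# hinges, and median.
stats <- boxplot.stats(v)
print(stats$out)
print(stats$stats)
```

Note that the upper whisker in stats$stats ends at the largest value that is not an outlier, so it can differ from the maximum reported by fivenum().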
In R you can generate a box plot, for example, with qplot(). The sample data from mtcars is a convenient way to try this. To use the cyl column as a category, the factor() command first needs to convert its values from numeric to categorical (Listing 4).
Listing 4
Box plots
> qplot(factor(cyl), mpg, data = mtcars, geom = "violin", color = factor(cyl), fill = factor(cyl))
Thanks to the violin plot that the geom="violin" parameter sets here (geom="boxplot" would draw a classic box plot instead), you can see at first glance that, for example, the vast majority of eight-cylinder engines can travel around 15 miles on a gallon of fuel, whereas the more frugal four-cylinder engines manage between 20 and 35 miles with the same amount (Figure 4).
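The figures read off the chart can be cross-checked numerically. As a sketch, the base R boxplot() function draws a classic box plot for the same grouping and returns the statistics it plotted; the axis labels and the variable names bp and medians are chosen for this example.

```r
# Classic box plot of mpg grouped by number of cylinders (base R graphics).
# The return value holds the plotted statistics for each group.
bp <- boxplot(mpg ~ factor(cyl), data = mtcars,
              xlab = "Cylinders", ylab = "Miles per gallon (mpg)")

# Row 3 of bp$stats holds the median of each group.
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)

# Observed range for the four-cylinder cars.
print(range(mtcars$mpg[mtcars$cyl == 4]))
```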