Workflow-based data analysis with KNIME
Analyze This!
They say data is "the new oil," but all that data you collect is only valuable if it leads to new insights. An open source analysis tool called KNIME lets you analyze data through graphical workflows – without the need for programming or complex spreadsheet manipulation.
Data analysts like to use flexible scripting languages such as R or Python that come with large ecosystems of libraries and extensions. But many users don't want to have to write and debug their own custom programs just to analyze data.
Visual workflows offer a different approach. You can use visual workflows to break down the analysis process into modular, sequential steps. Each step is symbolized by a graphic element called a node. Each node performs an action, which might be a calculation, a formatting function, or another step related to data analysis and manipulation. By linking the nodes on the screen, users can create workflows for complex investigations of the data – without producing any code.
Visual workflows are the central element of the KNIME Analytics Platform. In the KNIME environment, a workflow is a graph whose nodes represent a series of sequential steps for processing and analyzing the data. The user defines a pathway for the data by connecting the output of one node to the input of another node. A type system ensures that you can only connect compatible outputs and inputs. Real programming code is only necessary if you want to integrate KNIME with R or Python – or if you want to develop your own modules.
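To make the idea of typed connections concrete, the following is a minimal Python sketch of a workflow graph that refuses to link incompatible ports. It is an illustration only and not KNIME's implementation or API: the `Node` and `Workflow` classes, the port type names (`table`, `db_connection`), and the node names are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical workflow step with typed input and output ports."""
    name: str
    input_types: list = field(default_factory=list)
    output_types: list = field(default_factory=list)


class Workflow:
    """A directed graph; each edge joins an output port to an input port."""

    def __init__(self):
        self.connections = []

    def connect(self, source, out_port, target, in_port):
        out_type = source.output_types[out_port]
        in_type = target.input_types[in_port]
        # The type check: only ports of the same type may be linked.
        if out_type != in_type:
            raise TypeError(
                f"cannot connect {source.name} ({out_type}) "
                f"to {target.name} ({in_type})"
            )
        self.connections.append((source.name, out_port, target.name, in_port))


# Hypothetical nodes, named for illustration only.
reader = Node("CSV source", output_types=["table"])
row_filter = Node("Row filter", input_types=["table"], output_types=["table"])
db_query = Node("DB query", input_types=["db_connection"], output_types=["table"])

workflow = Workflow()
workflow.connect(reader, 0, row_filter, 0)      # table -> table: accepted

try:
    workflow.connect(reader, 0, db_query, 0)    # table -> db_connection
    rejected = False
except TypeError as err:
    rejected = True
    print(err)
```

In a graphical editor, the same check happens at the moment you try to draw the connection, so an invalid link never makes it into the workflow.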
KNIME, which is pronounced "nime," was originally known as the Konstanz Information Miner; it began in 2006 at the Department of Bioinformatics and Data Mining at the University of Konstanz under the direction of Prof. Michael Berthold. The basic idea was to make data analysis easily accessible and affordable for users from different disciplines. The developers, therefore, turned their creation into an open source project, paying particular attention to making the tool extensible and user-friendly.
KNIME Analytics Platform is written in Java and is based on Eclipse and Open Services Gateway Initiative (OSGi) technology. The latest version (version 3.5.1 at the time this article was written) is available at the project website [1], where you will also find other introductory materials, including blogs, videos, and sample workflows.
The KNIME Analytics Platform is open source; anyone can download and use it free of charge. Start the installer or unpack the archive, and KNIME is ready to go. However, you might wish to add some KNIME extensions, which are the source for many additional nodes. To add an extension, go to File | Install KNIME Extensions, check the list for the desired extension, and follow the instructions.
Looking Around
Figure 1 shows the KNIME user interface. The KNIME Explorer in the top-left corner of the workspace provides an overview of the workflows. Click the View menu and select Workflow Coach to access the Workflow Coach, which offers suggestions on building a workflow. The Node Repository (bottom left) lists the nodes of all installed extensions.
The nodes are the building blocks of any KNIME workflow, and the vast library of available nodes gives KNIME its versatility and power. KNIME nodes perform tasks such as:
- data access
- data manipulation
- visualization
- analytics
- reporting
- flow control
- scripting
- big data
The node description in the window on the right of the user interface gives the user the necessary documentation on the function and use of a node. In the middle is the workflow editor, where users string the nodes together to develop the actual workflow. During the course of developing a workflow, the user connects, configures, and executes the KNIME nodes individually and collectively (Figure 2).
Once a node has been executed, which is indicated by a green traffic light symbol below the node, you can display the resulting data as a table, bar chart, or other format. If the traffic light symbol is red, the node is not yet configured. Yellow means the node is ready for execution.
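The traffic light amounts to a small state machine. The sketch below models it in Python under stated assumptions: the `NodeState` enum, the `Node` class, and the file name `categories.csv` are hypothetical and only mirror the three states described above, not KNIME's internal design.

```python
from enum import Enum


class NodeState(Enum):
    NOT_CONFIGURED = "red"      # node still needs settings
    CONFIGURED = "yellow"       # ready for execution
    EXECUTED = "green"          # results are available


class Node:
    """A toy node that tracks its traffic light state."""

    def __init__(self, name):
        self.name = name
        self.state = NodeState.NOT_CONFIGURED
        self.settings = None

    def configure(self, **settings):
        self.settings = settings
        self.state = NodeState.CONFIGURED

    def execute(self):
        # A node that has not been configured cannot run.
        if self.state is NodeState.NOT_CONFIGURED:
            raise RuntimeError(f"{self.name} must be configured first")
        self.state = NodeState.EXECUTED


node = Node("File Reader")
lights = [node.state.value]
node.configure(path="categories.csv")   # hypothetical file name
lights.append(node.state.value)
node.execute()
lights.append(node.state.value)
print(lights)  # ['red', 'yellow', 'green']
```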
A node can have any number of input and output ports, including none. The shape and color of each port indicate the kind of data the node expects or produces there (a black triangle indicates a table).
Sample Scenario
The best way to get to know KNIME is to work through an example workflow (Figure 3). The editors of a fictitious online magazine would like to know more about their readers' preferences. They have gathered some data from a rating system that gives readers the opportunity to rate magazine articles with up to five stars. Each article is also classified in at least one of the five categories: Hardware, Software, Development, Security, and Internet.
The editors hope to draw conclusions from the ratings on reader preferences and to identify groups of readers with common interests. They then want to suggest further articles to each reader that match their interests.
The starting points for the analysis are a SQLite database with the article ratings (Table 1) and a CSV file with the classification of the categories for the articles. Both data sources are therefore available in a table format.
Table 1
Evaluation from an SQLite Database
Reader ID | Article ID | Evaluation |
---|---|---|
Reader 1 | Article 11 | 1 |
Reader 93 | Article 31 | 3 |
Reader 45 | Article 3 | 4 |
… | | |
Loading the Data
The first step of any data analysis is loading the required data (the red nodes in Figure 3). For KNIME, the data can come from a variety of sources, such as text files, documents, databases, or web services.
Once loaded in KNIME, it doesn't matter where the data came from, because the software always converts the data into an internal format. In this case, I need to load data from an SQLite database and a CSV file. To help you access databases, KNIME has several nodes to assist with data input, including the Database Reader node, which delivers the results of an SQL query as a KNIME table.
The node works with different database types and receives the connection information via its input port. You can place an SQLite Connector node in front of it; the connector is configured with the connection information and forwards it to other nodes.
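For readers who think in code, the connector-plus-reader pattern corresponds roughly to opening a connection and running a query. The following Python sketch uses the standard `sqlite3` module with an in-memory database; the table name `ratings` and the column names are assumptions chosen for illustration, and the three rows are the sample rows from Table 1.

```python
import sqlite3

# In-memory stand-in for the ratings database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (reader_id TEXT, article_id TEXT, evaluation INTEGER)"
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [
        ("Reader 1", "Article 11", 1),
        ("Reader 93", "Article 31", 3),
        ("Reader 45", "Article 3", 4),
    ],
)

# "Connector" step: the open connection. "Reader" step: a query whose
# result set becomes a table of column names plus rows.
cursor = conn.execute(
    "SELECT reader_id, article_id, evaluation "
    "FROM ratings WHERE evaluation >= 3 ORDER BY evaluation"
)
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
conn.close()

print(columns)
for row in rows:
    print(row)
```

Keeping the connection separate from the query mirrors the split between the two nodes: one connection can feed several queries.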
The easiest way for KNIME to load text files is to use the File Reader. This node automatically attempts to guess the file format, including column separators and row length, and displays a preview of the table in its configuration dialog.
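Format guessing of this kind can be approximated in a script with `csv.Sniffer` from the Python standard library. This is a sketch of the general technique, not of how the File Reader works internally; the column names and the article-to-category assignments in the sample are invented for illustration.

```python
import csv
import io

# Hypothetical excerpt of the category file; the assignments are made up.
sample = (
    "article;category\n"
    "Article 3;Security\n"
    "Article 11;Hardware\n"
    "Article 31;Software\n"
)

# Guess the column separator from a short list of candidates.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")

# Parse with the detected dialect, treating the first row as the header.
reader = csv.DictReader(io.StringIO(sample), dialect=dialect)
table = list(reader)

print(repr(dialect.delimiter))
for row in table[:2]:   # a small preview, like the one in the dialog
    print(row)
```

Restricting the candidate delimiters keeps the guess predictable on small samples, where a heuristic has little data to work with.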