Workflow-based data analysis with KNIME

Analyze This!

© Lead Image Photo by mari lezhava on Unsplash

Article from Issue 210/2018

They say data is "the new oil," but all that data you collect is only valuable if it leads to new insights. An open source analysis tool called KNIME lets you analyze data through graphical workflows – without the need for programming or complex spreadsheet manipulation.

Data analysts like to use flexible scripting languages such as R or Python that come with large ecosystems of libraries and extensions. But many users don't want to have to write and debug their own custom programs just to analyze data.

Visual workflows offer a different approach. You can use visual workflows to break down the analysis processes into modular, sequential steps. Each step is symbolized by a graphic element called a node. Each node performs an action, which might be a calculation, a formatting function, or another step related to data analysis and manipulation. By linking the nodes on the screen, users can create workflows for complex investigations of the data – without producing any code.

Visual workflows are the central element of the KNIME Analytics Platform. In the KNIME environment, a workflow is a graph whose nodes represent a series of sequential steps for processing and analyzing the data. The user defines a pathway for the data by connecting the output of one node to the input of another node. A type system ensures that you can only connect compatible outputs and inputs. Real programming code is only necessary if you want to integrate KNIME with R or Python – or if you want to develop your own modules.
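The idea of a graph with type-checked connections can be sketched in a few lines of Python. This is only an illustrative model, not KNIME's actual implementation: the Node and Workflow classes and the port type labels "db-connection" and "table" are assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A workflow step with typed input and output ports (hypothetical model)."""
    name: str
    inputs: list[str] = field(default_factory=list)   # port types the node expects
    outputs: list[str] = field(default_factory=list)  # port types the node produces


class Workflow:
    """Stores connections and rejects links between incompatible ports."""

    def __init__(self) -> None:
        self.edges: list[tuple[str, int, str, int]] = []

    def connect(self, src: Node, out_idx: int, dst: Node, in_idx: int) -> None:
        out_type = src.outputs[out_idx]
        in_type = dst.inputs[in_idx]
        if out_type != in_type:
            raise TypeError(
                f"cannot connect {src.name} ({out_type}) to {dst.name} ({in_type})"
            )
        self.edges.append((src.name, out_idx, dst.name, in_idx))


# Port type labels are assumptions chosen for this sketch.
connector = Node("SQLite Connector", outputs=["db-connection"])
reader = Node("Database Reader", inputs=["db-connection"], outputs=["table"])
file_reader = Node("File Reader", outputs=["table"])

wf = Workflow()
wf.connect(connector, 0, reader, 0)  # compatible: connection feeds connection

try:
    wf.connect(file_reader, 0, reader, 0)  # incompatible: table into connection
    rejected = False
except TypeError:
    rejected = True
```

In this sketch, the check happens at connection time, so a mismatched link never becomes part of the graph.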

KNIME, which is pronounced "nime," was originally known as the Konstanz Information Miner; it began in 2006 at the Department of Bioinformatics and Data Mining at the University of Konstanz under the direction of Prof. Michael Berthold. The basic idea was to make data analysis easily accessible and affordable for users from different disciplines. The developers, therefore, turned their creation into an open source project, paying particular attention to making the tool extensible and user-friendly.

KNIME Analytics Platform is written in Java and is based on Eclipse and OSGi (Open Services Gateway initiative) technology. The latest version (version 3.5.1 at the time this article was written) is available at the project website [1], where you will also find other introductory materials, including blogs, videos, and sample workflows.

The KNIME Analytics Platform is open source; anyone can download and use it free of charge. Start the installer or unpack the archive, and KNIME is ready to go. However, you might wish to add some KNIME extensions, which are the source for many additional nodes. To add an extension, go to File | Install KNIME Extensions, check the list for the desired extension, and follow the instructions.

Looking Around

Figure 1 shows the KNIME user interface. The KNIME Explorer in the top-left corner of the workspace provides an overview of the workflows. The Workflow Coach, which you can open from the View menu, offers suggestions on building a workflow. The Node Repository (bottom left) lists the nodes of all installed extensions.

Figure 1: The KNIME user interface, in which users compose workflows using drag & drop.

The nodes are the building blocks of any KNIME workflow, and the vast library of available nodes gives KNIME its versatility and power. KNIME nodes perform tasks such as:

  • data access
  • data manipulation
  • visualization
  • analytics
  • reporting
  • flow control
  • scripting
  • big data

The node description in the window on the right of the user interface gives the user the necessary documentation on the function and use of a node. In the middle is the workflow editor, where users string the nodes together to develop the actual workflow. During the course of developing a workflow, the user connects, configures, and executes the KNIME nodes individually and collectively (Figure 2).

Figure 2: Anatomy of a KNIME node.

Once a node has been executed, which is indicated by a green traffic light symbol below the node, you can display the resulting data as a table, bar chart, or other format. If the traffic light symbol is red, the node is not yet configured. Yellow means the node is ready for execution.
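The traffic light can be read as a small state machine. The following Python sketch models the three states described above; the class and method names are hypothetical and do not correspond to KNIME's API.

```python
from enum import Enum


class NodeState(Enum):
    NOT_CONFIGURED = "red"
    CONFIGURED = "yellow"
    EXECUTED = "green"


class SketchNode:
    """Hypothetical stand-in for a workflow node and its traffic light."""

    def __init__(self, name):
        self.name = name
        self.settings = None
        self.state = NodeState.NOT_CONFIGURED

    def configure(self, **settings):
        # Configuring a node makes it ready for execution.
        self.settings = settings
        self.state = NodeState.CONFIGURED

    def execute(self):
        # An unconfigured node cannot run.
        if self.state is NodeState.NOT_CONFIGURED:
            raise RuntimeError(f"{self.name} must be configured before execution")
        self.state = NodeState.EXECUTED


node = SketchNode("File Reader")
lights = [node.state.value]
node.configure(path="categories.csv")  # file name is an assumption
lights.append(node.state.value)
node.execute()
lights.append(node.state.value)
print(lights)  # ['red', 'yellow', 'green']
```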

A node can have any number of input and output ports, including none. The shape and color of a port indicate what kind of data the node expects or delivers at that port (a black triangle indicates a table).

Sample Scenario

The best way to get to know KNIME is to work through an example workflow (Figure 3). The editors of a fictitious online magazine would like to know more about their readers' preferences. They have gathered some data from a rating system that gives readers the opportunity to rate magazine articles with up to five stars. Each article is also classified in at least one of the five categories: Hardware, Software, Development, Security, and Internet.

Figure 3: The sample workflow, which analyzes reader preferences for a publisher.

The editors hope to draw conclusions from the ratings on reader preferences and to identify groups of readers with common interests. They then want to suggest further articles to each reader that match their interests.

The starting points for the analysis are an SQLite database with the article ratings (Table 1) and a CSV file that assigns the articles to categories. Both data sources are thus already available in a table format.

Table 1: Evaluation from an SQLite Database

Reader ID    Article ID    Evaluation
Reader 1     Article 11    1
Reader 93    Article 31    3
Reader 45    Article 3     4
Loading the Data

The first step of any data analysis is loading the required data (the red nodes in Figure 3). For KNIME, the data can come from a variety of sources, such as text files, documents, databases, or web services.

Once loaded in KNIME, it doesn't matter where the data came from, because the software always converts the data into an internal format. In this case, I need to load data from an SQLite database and a CSV file. KNIME offers several nodes for database access, including the Database Reader node, which delivers the results of an SQL query as a KNIME table.

The node works with different database types and receives the connection information via its input port. You can place an SQLite Connector node upstream, which is configured with the connection information and forwards it to other nodes.
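For readers who think in code, the same step (open a connection, run an SQL query, receive a table) might look roughly like the following Python sketch. It uses an in-memory database filled with the three rows from Table 1; the table name ratings, the column names, and the read_table helper are assumptions made for this example.

```python
import sqlite3

# In-memory stand-in for the magazine's ratings database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (reader_id TEXT, article_id TEXT, rating INTEGER)"
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [
        ("Reader 1", "Article 11", 1),
        ("Reader 93", "Article 31", 3),
        ("Reader 45", "Article 3", 4),
    ],
)


def read_table(connection, query):
    """Run a query and return the result as a list of row dictionaries."""
    cursor = connection.execute(query)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


table = read_table(
    conn,
    "SELECT reader_id, article_id, rating FROM ratings "
    "WHERE rating >= 3 ORDER BY rating",
)
conn.close()

for row in table:
    print(row)
```

The connection object plays the role of the connector, and read_table corresponds loosely to the reader node that turns a query result into a table.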

The easiest way for KNIME to load text files is to use the File Reader. This node automatically attempts to guess the file format, including column separators and row length, and displays a preview of the table in its configuration dialog.
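Format guessing of this kind can be approximated in Python with the standard library's csv.Sniffer. The sketch below is not how the File Reader node works internally; the sample content, including the assignment of articles to categories, is invented for illustration.

```python
import csv
import io

# Invented sample; the real CSV file's layout is not shown in the article.
sample = (
    "article_id;category\n"
    "Article 3;Security\n"
    "Article 11;Hardware\n"
    "Article 31;Software\n"
)

# Guess the column separator from a set of likely candidates.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")

# Parse the text with the guessed dialect; the first row is assumed to be a header.
rows = list(csv.DictReader(io.StringIO(sample), dialect=dialect))

print(repr(dialect.delimiter))
for row in rows:
    print(row)
```

As with the File Reader's preview, it is worth inspecting the parsed rows before relying on the guessed settings.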
