Saving and evaluating network paths in Neo4j

A Relationship Thing

Author(s):

The Neo4j graph database is much better suited than relational databases for storing and quickly querying nodes and their mutual relationships. If your circle of friends is not wide enough to warrant a graph-based application, you might just want to inventory your LAN.

Modeling structures like the social graph of Facebook, connections to friends and their acquaintances, or your follower structure on Twitter is surprisingly difficult with traditional databases. Trying to map a network path – easily represented with squiggles and arrows on a whiteboard – with a relational model inevitably leads to performance-hungry join statements, the natural enemy of responsive websites.

The Neo4j [1] graph database natively stores graph models and offers fantastic performance – as long as you don't overcook the complexity of the queries. Its generic storage model consists of nodes and relationships. Both can possess attributes; for example, a node that represents a person could contain a name field for storing the name or carry a relationship is_friends_with and its intensity (best_friend, casual_friend).

Cypher Query Language

The Neo4j query processor takes inquiries in the SQL-style Cypher language, rummages through the data located in the database, and quickly returns results that Cypher also filters and processes in SQL style (i.e., sort, group, etc.).

After you install the GPL-licensed Neo4j Community Server (there's also a commercial enterprise version), it listens on port 7474 for commands either received via REST or using the newer simple JSON processor. The client can be programmed in several dozen languages, including the CPAN REST::Neo4p module for Perl.

The Debian package offered on the Neo4j site [1] also includes a handy command shell: neo4j-sh. You can use it to run commands similar to the interactive MySQL client to insert new data into the model and extract stored information via Cypher queries.

Declaratively Powerful

Cypher is, like SQL, declarative: You can specify the results you are looking for, but you don't need to define procedural statements to describe how exactly to find them. Match statements define which data are of interest (e.g., "Find all data" or "find all relations of type is_friends_with) where clauses then reduce the number of matches; for example, the requesting user may only be interested in people who are 18 years or older.

Subsequent processing steps remodel, sort, or collate the data. Even running further match statements against the results list is permitted, as well as intermediate actions to generate new data on the fly.

The graph of the home network in Figure 1 is intended to illustrate some practical queries. Networks actually represent a popular task for Neo4j with nodes and relations. To determine whether a router can easily reach the open Internet via other nodes, the database often needs to find an open path from A to B via craftily connected nodes. This can cause a performance implosion on relational systems, but can often be tackled with ease using graph databases.

Figure 1: The components of a home network – this Perl column feeds them to a graph database for path analysis.

Hand-Reared

For example, to add the router named internal in Figure 1 to the database and assign it the LAN IP 192.168.2.1, you would just do this in the Neo4j shell:

neo4j-sh (?)$ CREATE (router {name:"internal", lan_ip:"192.168.2.1"});

After creating another new node named merger for the gateway relation between the internal router and its gateway, a Cypher query locates both nodes and defines the connection with Cypher's own ASCII art syntax:

neo4j-sh (?)$ MATCH (a), (b)
> WHERE a.name = "internal" and b.name ="merger"
> CREATE (a)-[r:gateway]->(b);

The match operation finds two nodes, which it assigns the aliases a and b. Because no other search pattern exists in the match clause, this applies to all the nodes in the database. However, the following WHERE clause restricts the results to two precisely named nodes, and the CREATE statement uses the syntax -[...]-> to draw an arrow with a name between the identified nodes, thus creating a relation of the gateway type.

Feeding Machine

Of course, it would be extremely tedious to type in the data manually for a large network. That's why the routers.yml file in Listing 1 [2] defines the key data for all routers in the readable YAML format. The links between the routers as relations in the graph later implicitly result from connecting the gateway addresses of one router with the LAN IP of the next one.

Listing 1

routers.yml

 

The script in Listing 2 grabs the Yaml records of all the routers from the routers.yml file. Starting in line 28, it iterates through the list and dumps it into the database as a collection of nodes  – thanks to the CPAN REST::Neo4p module. While doing so, it saves all the references to the node objects created in the %lans hash using the LAN IPs as the key.

Listing 2

router-setup

 

The script also stores the gateway IPs encountered in the @gateways array along with the routers that use these gateway IPs. Starting in line 46, a for loop iterates through @gateways, uses the %lans hash to discover the target object, and defines one gateway relation from the start to the destination router in the database using the method relate_to().

As soon as the script finishes, a glance at the browser interface provided through port 7474 shows that the data is properly stored in Neo4j (Figure 2).

Figure 2: The Neo4j browser interface on port 7474 renders Cypher queries graphically.

With a query such as MATCH (n) RETURN n, which returns all previously stored nodes, the database browser shows the nodes as numbered squiggles in graph mode and their relationships as labeled arrows that point from one node to the next in the graph.

When you click on a node, its attributes are displayed in a dialog box that pops up. In text mode, the boxes shown in Figure 2 with the attribute values of the nodes edge into the picture.

Gordian Knot

To be able to replace the network structure in the database without leftovers if changes are made to the Yaml data, Listing  2 starts by deleting all the previously defined nodes and relations in the graph. This step is not as easy as you might think, because to keep the data model intact, Neo4j refuses to delete nodes that still have relationships. The Cypher query for purging the database therefore does:

MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r

This matches all nodes n – as well as any relationships to another (anonymous) node emanating from them. The delete statement that follows then deletes the entry for the node, including it's outgoing relations.

To show all the routers stored in the database with the Neo4j shell, you would just run the query shown in Figure 3. If you simply had a RETURN n, instead of the return clause with three attribute values of interest (n.name, n.hardware_vendor, …), the query result would contain all the defined attributes, which could be hard to read. Instead, the query uses RETURN with specific field names to hide any values that are not of interest.

Figure 3: This query searches for and finds all the devices stored in the database.

Walking Paths

Of course, the advantages of a graph database are not found in menial services, such as finding nodes by their attributes, but in finding connection paths between nodes. It's actually quite simple to find out which network connections exist in the graph: The query in Figure 4 uses MATCH(n) to consider all the routers on the network. The relation description -[r:gateway*]-> matches one or more relations of the type gateway; the (m) in the query after the relation description stands for the terminating node in the found path.

Figure 4: This query shows all possible paths in the graph via gateway relations.

Because the Cypher query stores the results in the p variable with a prepended p=, a subsequent RETURN p would output all routing paths with all the routers considered by the query including their attributes. That would be a real mess of data. The return statement therefore restricts the output to the router name at the beginning of the route and uses collect(m.name) to concatenate the names of all subsequent routers on the path covered by the query.

The output in Figure 4 thus has two columns: The first contains the identified start router, and the second contains a list with the names of routers passed through on the way to the open Internet.

Saving Expenses

If the stored network is huge, running a query with an arbitrarily long chain of relations will usually take too long; limiting this to two or three levels saves a huge amount of time. What's more, if the database does not have to look forever for a suitable starting point for a route, your queries will run faster. For example, if the starting node for the search is already defined, a directive such as:

START n=node:router_index(name='guest')

can define the starting point for the match command. The prerequisite for this step is that the start router has been stored previously in a Neo4j index under the specified key (name).

Listing 2 does this in lines 35 and 36 with the add_entry() method. For simpler applications, the Neo4j Server offers an autoindex option; it determines which attributes to index automatically.

In Figure 5, the query first finds all wireless routers that have a wireless attribute set to 1. The example network has only one, although a more complex installation will probably have many more, increasing the number of routes found leading to the Internet. Using this information, you could, for example, periodically run automatic security checks that prevent a network packet from a wireless network reaching a protected internal network because of a configuration error.

Figure 5: The single wireless router is only connected to the modem via a path that does not use the internal network.

Two-Sided Anchors

The search pattern for the match statement can use more complex anchors. The query in Figure 6 searches the entire network for any router, n, that is used as a gateway on both sides. To do this, you need to compose a query with three router variables (here, m, n, and o) with a relation from left to right between the first two routers and another relation from right to left between the router at the end and the router in the middle. ASCII art makes this possible.

Figure 6: This query looks for two nodes in the graph that forward packets to the same gateway.

Neo4j tries to find such a constellation and outputs the results: The router merger matches the queried pattern. The two results have the same paths as equivalents with opposite orientation.

Finding the Edge

In the search to determine which devices on the network forward packets to the Internet, Listing 3 uses Perl to dispatch a Cypher query and anchors the match pattern to end at the router belonging to the DSL modem. The CPAN REST::Neo4p::query module runs the query and uses the fetch() method, which takes the results trickling in from the server in JSON format and returns them as paths. The latter consist of two lists, one with nodes and one with the intermediate relations.

Listing 3

router-search

 

The nodes() and relationships() methods dig the nodes and relations out of the path objects. The only task remaining for the for loop at the end is to extract the names of the devices by means of get_property() and output the sole matching result route by using arrows as symbols:

guest->merger->modem

Queries of this type happen quickly and are obviously easy to formulate.

As Figure 7 demonstrates, you can easily install the Neo4j server on Ubuntu and other Debian derivatives using the Debian repository server, as described on neo4j.org. Following this, the daemon starts automatically, as a call to curl http://localhost:7474 confirms by returning the server status in JSON format.

Figure 7: The apt-get package manager installing the Neo4j server on Ubuntu.

Installation

If you point a browser at the URL, you see the web interface mentioned above and can play with tutorials or feed in test data of your own. At the time of writing, the latest stable Neo4j version on Ubuntu 12.04 was 2.0.1. If you run the Neo4j server in a virtual machine or on a different host and would like to access it from some other location, you need to comment out the following line in the /etc/neo4j/neo4j-server.properties configuration file:

org.neo4j.server.webserver.address=0.0.0.0

Otherwise, the web server will block queries that do not originate from localhost.

Start Reading

The Neo4j website [1] mentions two books as recommended reading in addition to the man pages. The new Kindle book by Neo4j contributor Michael Hunger [3] (in German) offers a whirlwind tour of the Cypher syntax and presents some practical Neo4j examples. However, apart from the attached Cypher cheat sheet, it is not a reference work; it just scratches the surface here and there.

The second recommendation from the Neo4j website is a one-year-old O'Reilly book called Graph Databases [4], which  – despite the comprehensive-sounding title  – is devoted almost entirely to Neo4j and presents a variety of true Neo4j applications in detail. Even this book, however, lacks the carefully developed structure of a textbook. The true reference book for what is still a fairly recent subject has apparently not yet been written  – perhaps Neo4j in Action, which has been announced by Manning Publishing, will step into this gap.

Infos

  1. Neo4j: http://neo4j.org
  2. Listings for this article: ftp://www.linux-magazin.com/pub/listings/magazine/164
  3. Hunger, Michael. Neo4j 2.0 – Eine Graphdatenbank für Alle. Entwickler. 2014 (in German).
  4. Robinson, Ian, Jim Webber, and Emil Eifrem. Graph Databases. O'Reilly. 2013.

The Author

Mike Schilli works as a software engineer with Yahoo! in Sunnyvale, California. He can be contacted at mailto:mschilli@perlmeister.com. Mike's homepage can be found at http://perlmeister.com.