Working with the JSON data format
Data Dog
The JSON data format is a standard feature of today's Internet – and a common option for mobile and desktop apps – but many users still regard it as something of a mystery. We'll take a close look at the JSON format and some of the free tools you can use for reading and manipulating JSON data.
Our world of web applications and fast, interactive mobile devices calls for the free exchange of data in easily accessible forms. Standard formats promote interoperability and minimize development time. Open formats also make it easy to import data into other applications. Over the years, several popular alternatives have emerged. CSV, XML, and YAML are well known and easy to adapt to different applications (see the box entitled "Comparing Formats" and Listings 1-3 for examples). One format that is used extensively for web applications, mobile applications, and even some conventional desktop tools is JavaScript Object Notation (JSON) [1].
Comparing Formats
Using comma-separated values (CSV) ensures the data remains manageable for the most part, but the format is neither standardized nor particularly flexible. In a file like the one in Listing 1, neither the character set nor the separator between columns is fixed. In practical applications, spaces, tabs, hyphens, or semicolons are sometimes used instead of commas. Corresponding key data either has to be agreed upon or inferred from the file itself. Furthermore, the format does not allow nested records, arrays, or binary data.
The Extensible Markup Language (XML) is far more structured and flexible than CSV. A notable feature of XML is that each data value is enclosed in its field name, in the form <fieldname>value</fieldname> (Listing 2). In practice, it makes sense to choose field names that let you infer the contents. The order of the fields is usually variable within a level, and fields can be missing. One disadvantage of XML is that an XML file is significantly larger due to the need to continually repeat the field labels for each entry.
YAML is a recursive acronym for YAML Ain't Markup Language. The YAML specification describes a very compact way to serialize data. Hyphens and indentations using spaces serve as the basis for denoting fields. YAML borrows from XML, as well as from the format in which three programming languages (Python, Perl, and C) describe their data structures. Listing 3 shows the book inventory data as a YAML structure.
JSON [1] is based on JavaScript. The format is also very compact and flexible. In contrast to YAML, JSON explicitly identifies objects and their attributes, whereas in YAML, the assignment is derived from the context of the indentation depth.
Listing 1
CSV File
Stephen Fry; The Hippopotamus; 1994
Ian Rankin; Set In Darkness; 2009
Ken Follett; The Pillars of the Earth; 1989
Listing 2
XML File
<inventory>
  <book>
    <author>Stephen Fry</author>
    <title>The Hippopotamus</title>
    <publication>1994</publication>
  </book>
  <book>
    <author>Ian Rankin</author>
    <title>Set In Darkness</title>
    <publication>2009</publication>
  </book>
  <book>
    <author>Ken Follett</author>
    <title>The Pillars of the Earth</title>
    <publication>1989</publication>
  </book>
</inventory>
Listing 3
YAML File
---
book:
- author: Stephen Fry
  title: The Hippopotamus
  publication: '1994'
- author: Ian Rankin
  title: Set In Darkness
  publication: '2009'
- author: Ken Follett
  title: The Pillars of the Earth
  publication: '1989'
JSON is wildly popular as a tool for passing information between web apps – for instance, it is currently the de facto standard for REST services – yet for many users, the details of JSON format are shrouded in mystery. This article takes a close look at JSON and some of the tools available for reading, manipulating, and importing JSON data.
Understanding JSON
The notation of JSON is analogous to objects, records, or dictionaries – depending on what that structure is currently called in your favorite programming language. Even though JSON format is based on JavaScript, parsers exist in almost all programming languages. In addition to Awk and C/C++, you can integrate JSON with Fortran, Go, Lisp, Lua, Python, and Visual Basic.
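Because of those parsers, reading and writing JSON is rarely more than a one-liner. As a minimal sketch, using only Python's standard library json module and a record in the style of the book inventory used later in this article:

```python
import json

# A record in the style of the book inventory used in this article
record = '{"author": "Stephen Fry", "title": "The Hippopotamus", "publication": "1994"}'

# Parse JSON text into a Python dictionary
book = json.loads(record)
print(book["title"])  # The Hippopotamus

# Serialize the dictionary back into a JSON string
print(json.dumps(book, indent=2))
```

The loads()/dumps() pair is all most scripts ever need; everything else in this article builds on variations of these two calls.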
In everyday life, you will find the format in the data-sharing Jupyter Notebook app [2], in geographical specifications like GeoJSON [3] (Listing 4), and even in databases like MongoDB.
Listing 4
GeoJSON File
{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  },
  "properties": {
    "name": "Dinagat Islands"
  }
}
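To see how little code it takes to pick such a feature apart, the following Python sketch (standard library only) parses the GeoJSON from Listing 4 and extracts the coordinates:

```python
import json

# The GeoJSON feature from Listing 4, embedded as a string
geojson = '''{
  "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [125.6, 10.1]},
  "properties": {"name": "Dinagat Islands"}
}'''

feature = json.loads(geojson)
lon, lat = feature["geometry"]["coordinates"]
print(feature["properties"]["name"], lon, lat)  # Dinagat Islands 125.6 10.1
```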
Taking a closer look at the JSON data structure, you will see that it is in an easy-to-read, text-based format. Braces, brackets, colons, and commas separate the individual elements; the data can be nested as desired. This means, for example, that you can map lists, arrays, or objects. Table 1 summarizes the elementary data types that JSON supports.
Table 1
JSON Data Types
Data Type | Description
---|---
Strings | All Unicode characters except ", \, and control characters
Numbers | Decimal values, optionally with a fraction and an exponent (for example, 2e-3); hexadecimal notation such as 0x42 and a leading decimal point as in .2e-3 are not allowed
Boolean values | The logic values true and false
Arrays | Comma-separated, ordered lists of values; empty arrays are also allowed
Objects | Unordered collections of properties, notated as key-value pairs; objects without properties are also allowed
Null values | null
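A short Python sketch illustrates how these elementary types map onto a host language, and that nonstandard notations such as hexadecimal numbers are rejected by a conforming parser:

```python
import json

# One value for each elementary JSON type
sample = '{"s": "text", "n": 2e-3, "b": true, "a": [1, 2], "o": {"k": "v"}, "x": null}'
data = json.loads(sample)

print(type(data["s"]).__name__)  # str
print(type(data["n"]).__name__)  # float
print(type(data["a"]).__name__)  # list
print(data["x"] is None)         # True

# Hexadecimal notation is not part of the JSON standard
try:
    json.loads('{"n": 0x42}')
except json.JSONDecodeError as err:
    print("invalid JSON:", err.msg)
```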
Mark the individual structure levels with brackets and indentation for better readability (Listing 5). Pairs of curly braces ({ and }) each form a unit, or object. Square brackets ([ and ]) indicate fields (also known as arrays). Individual field elements follow the form of an enumeration and are separated by commas. Each element of an object consists of a key-value pair separated by a colon (:).
Listing 5
JSON File
{"book": [
  {
    "author": "Stephen Fry",
    "title": "The Hippopotamus",
    "publication": "1994"
  },
  {
    "author": "Ian Rankin",
    "title": "Set In Darkness",
    "publication": "2009"
  },
  {
    "author": "Ken Follett",
    "title": "The Pillars of the Earth",
    "publication": "1989"
  }
]}
JSON was originally created in the early 2000s to exchange data between web applications. JSON worked quite well in the web context (even though you cannot always parse it unambiguously). To structure data, JSON falls back on conventions familiar to anyone who has programmed in a C-based language (C, C++, C#, Java, JavaScript, Perl, Python, and others).
JSON is specified according to RFC 8259 [4] and ECMA-404; common extensions are JSONP (JSON with padding), JSONPP (JSON with padding and parameters), and JSONML, which combines XML and JSON together. The character set for all JSON formats is Unicode (UTF-8), which eliminates the character-set guessing game that you will be familiar with from CSV.
You can use JSON to exchange smaller volumes of data between applications in an agile way. However, if the transferred data volume increases (e.g., if you have millions of measurements from a sensor), JavaScript-based visualization libraries for Python, such as ipywidgets, Bokeh, and Plotly, often struggle. In the face of large data volumes, binary transport mechanisms are a better option for handling the load.
Tool Overview
Several command-line tools are available for parsing, processing, and outputting JSON data. Table 2 summarizes some of the available tools. All of these tools are available as packages for Debian GNU/Linux, Ubuntu, Devuan, Linux Mint, and macOS.
Table 2
JSON Tools
Tool | Language | Application
---|---|---
aeson-pretty [5] | Haskell | Output JSON in a readable way
jc [6] | Python | Convert output to JSON
jid [7] | Go | Interactively filter JSON
jo [8] | C | JSON output in the shell
jq [9] | C | Output and filter JSON in a readable way
Jshon [10] | C | Read and generate JSON
JSONLint [11] | PHP | Validate JSON data
Not all of the tools in Table 2 are intuitive, and some of them only show their full potential in a specific context. You will find more information on these tools in the various documents and cheat sheets available online [12].
Easily Readable JSON Output
When it comes to pretty printing, aeson-pretty, jc, jo, jq, and Jshon all have something to say. Some of the tools have a command-line parameter for pretty printing, for example, -p for jo. In Listing 6, cat and aeson-pretty work together for readable output via a pipe. Jq delivers the same results with the next call, but the output is in color (Figure 1):
$ jq . book-inventory.json
Listing 6
Printing with aeson-pretty
$ cat book-inventory.json | aeson-pretty
{
    "book": [
        {
            "publication": "1994",
            "author": "Stephen Fry",
            "title": "The Hippopotamus"
        },
        {
            "publication": "2009",
            "author": "Ian Rankin",
            "title": "Set In Darkness"
        },
        {
            "publication": "1989",
            "author": "Ken Follett",
            "title": "The Pillars of the Earth"
        }
    ]
}
$
The dot in the call to jq is not immediately obvious: It stands for the expression to be processed; in this case, it matches all objects in the JSON file. You can control colorizing of the output using two options, -C (--color-output) and -M (--monochrome-output).
Some users prefer compact output with as few (whitespace) characters as possible. Listing 7 shows aeson-pretty with the -c (short for --compact) option. This option reduces the number of characters in the output by 45 percent, from 428 to 236 bytes. Compared to Listing 5 and Listing 6, the results still convey the same information, but with only about half the amount of data.
Listing 7
Compact Output with aeson-pretty
$ cat book-inventory.json | aeson-pretty -c
{"book":[{"publication":"1994","author":"Stephen Fry","title":"The Hippopotamus"},
{"publication":"2009","author":"Ian Rankin","title":"Set In Darkness"},
{"publication":"1989","author":"Ken Follett","title":"The Pillars of the Earth"}]}
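The same pretty-versus-compact trade-off can be reproduced in Python with the standard json module's indent and separators parameters, in case you want to shrink JSON without calling an external tool:

```python
import json

book = {"author": "Ken Follett",
        "title": "The Pillars of the Earth",
        "publication": "1989"}

pretty = json.dumps(book, indent=4)                # readable, with newlines
compact = json.dumps(book, separators=(",", ":"))  # no superfluous spaces

print(len(pretty), "bytes pretty,", len(compact), "bytes compact")
print(compact)
```

Both strings parse back to the identical dictionary; only the amount of whitespace differs.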
Verifying JSON Data
The formatted output of a tool like jq or aeson-pretty usually helps to detect obvious errors in the JSON structure at a glance, but if you need a closer look, you can check your JSON data with JSONLint. The tool expects the file with the JSON data as a parameter. If everything is correct, it reports back that JSON is valid. If there is an error, it outputs the location that it identified as the error. Figure 2 shows JSONLint output for the book inventory – first with a correct JSON file, and then with a variant that is missing a comma as a separator.
JSONLint was written in PHP. If you are looking for an alternative, you may also be able to get by with jq. If jq fails to parse the JSON data, it returns an exit code greater than zero; otherwise, it returns zero. Listing 8 shows the output for a deliberate error.
Listing 8
Finding Errors in JSON Data with jq
$ cat book-inventory-broken.json | jq .
parse error: Expected separator between values at line 7, column 5
$ echo $?
4
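If you would rather run this check inside a program than in the shell, Python's json module reports the error location in much the same way as jq. The snippet below provokes a parse error with a deliberately missing comma:

```python
import json

# Deliberately broken: the comma after the author field is missing
broken = '{"author": "Stephen Fry" "title": "The Hippopotamus"}'

try:
    json.loads(broken)
except json.JSONDecodeError as err:
    print(f"parse error: {err.msg} at line {err.lineno}, column {err.colno}")
```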
Converting Output to JSON
All Unix/Linux tools have their own specific output format. With a clever combination of grep, sed, and awk, you can break down the output and create the format you need for further processing. This approach sounds simple, but it often feels like walking up the stairs backwards while balancing a crystal vase on your head.
On the other hand, if every Unix/Linux tool had a --json switch and used it to create output in JSON format, the output could be parsed in a standardized way. However, the world still has a long way to go before this happens, so a workaround is needed. Jc and jo can both read the output from a tool, convert the output, and send it back to standard output in JSON format.
The list of output formats that jc understands is quite long and includes the output from df, du, lsblk, crontab, netstat, and lsof. Figure 3 shows output from the route command, as processed by the jc web demo page [13]. On the demo page, you can select the desired Unix/Linux command or data format at the top and then copy the associated output into the input box. Click on Convert to JSON to create the output below – each entry is a JSON element. Use the Pretty Print checkbox to specify whether the output should be a compact one-liner or a prettied-up, longer version.
Building complicated JSON files yourself and counting parentheses – that was yesterday. Today, jo does it for you. Jo expects the key-value pairs as parameters and assembles a corresponding JSON output from them. Figure 4 shows the output for the two parameters magazine and issue.
Because jo receives the key-value pairs as parameters when called, variable content from the shell is no longer a problem. See Listing 9, which shows variables for today's date and home directory.
Listing 9
Adding Environment Variables
$ jo "time of day=$(date +%c)" home=$HOME
{"time of day":"Mon 12 Oct 2020 17:06:30 CEST","home":"/home/frank"}
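For comparison, the same record can be built in Python; the key containing a space and the environment lookup are no problem for a dictionary (the %c format mirrors date +%c from the jo example):

```python
import json
import os
from datetime import datetime

record = {
    "time of day": datetime.now().strftime("%c"),  # same format as date +%c
    "home": os.environ.get("HOME", ""),
}
print(json.dumps(record))
```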
Interactive Filters
Tools like jq, jid, and Jshon can filter the output if you only need part of the data. Earlier you learned that passing a . to jq outputs the entire dataset. With the appropriate call, you can filter the data and extract the author, title, and publication data from a JSON file with book publishing data. Using .book[], you first narrow down the search to the book list; then you filter all items with the publication key using a pipe (|).
Figure 5 shows the results from jqplay – a simple playground tool in the web browser whose contents are passed to jq for processing. If you change the filter or the output in the two input fields on the left, the output on the right adjusts.
Jid stands for JSON Interactive Digger. A call to cat book-inventory.json | jid lets you browse a JSON file interactively.
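A filter like the .book[] example above also translates directly into a list comprehension in Python, shown here on a shortened copy of the book inventory:

```python
import json

inventory = json.loads('''{"book": [
  {"author": "Stephen Fry", "title": "The Hippopotamus", "publication": "1994"},
  {"author": "Ian Rankin", "title": "Set In Darkness", "publication": "2009"}
]}''')

# Equivalent of the jq filter .book[] | .publication
years = [entry["publication"] for entry in inventory["book"]]
print(years)  # ['1994', '2009']
```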
We have not yet discovered any special programs with graphical interfaces for editing JSON data. Most text editors offer syntax highlighting and thus simplify editing. We were particularly impressed by the web-based JSON Viewer [14] editor, which offers a graph structure in addition to an object-based display. We fed JSON Viewer our book list for Figure 6.
Downstream Processing
In addition to the tools that output and format JSON data for the user, several tools support downstream processing of JSON as input for other programs (post-processing).
If you receive JSON data via an interface, it is good programming practice to sanity check the received data before further processing. The sanity check includes two stages:
- Syntactic correctness – is the spelling correct? Do all brackets (equal number of opening and closing brackets), commas, and quotation marks fit the bill?
- Correctness of data fields – does the received data structure match the data definition (JSON schema)?
For the first question, it is best to use JSONLint, which is described earlier in this article. For the second stage, you need the JSON schema that describes the data structure. You then compare this description with the received data.
On json-schema.org, you will find an overview of validators [15], sorted by the programming languages in which they were developed. For example, consider the validate-json tool implemented in PHP [16]. If you are more into Python, jsonschema [17] serves the same purpose. The call to the two tools is identical.
Defining the JSON Schema
Listing 10 shows the JSON schema with which you define the exact format of your data structure. The schema matches the book inventory used earlier in this article and is stored in the book-inventory-schema.json file in the local directory.
Listing 10
JSON Schema
{
  "$schema": "http://json-schema.org/draft/2019-09/schema",
  "title": "Book",
  "type": "object",
  "required": ["author", "title", "publication"],
  "properties": {
    "author": {
      "type": "string",
      "description": "The author's name"
    },
    "title": {
      "type": "string",
      "description": "The book's title"
    },
    "publication": {
      "type": "number",
      "minimum": 0
    },
    "tags": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
The schema definition references the JSON standard used (the draft from September 2019, in this case) in the second line. The definition contains a number of keywords. Table 3 explains these keywords in more detail; a complete list of all supported keywords is available at json-schema.org [18].
Table 3
JSON Keywords
Keyword | Description
---|---
$schema | Description of the schema specification
title | Title of the schema
type | Type of JSON data
properties | Properties of each value (key and values allowed for the field)
required | List of required properties
properties.type | Data type of an entry
properties.minimum | Minimum value of an entry
properties.maximum | Maximum value of an entry
properties.minLength | Minimum number of characters for an entry
properties.maxLength | Maximum number of characters for an entry
properties.pattern | Regular expression for a comparison with the value of an entry
The next task is to validate the records by checking whether they match the specified schema. Listing 11 shows a single record from the book inventory in readable format. The compact version of the record would contain all the brackets and fields in a single line.
Listing 11
JSON Record
{
  "author": "Stephen Fry",
  "title": "The Hippopotamus",
  "publication": 1994
}
The validate-json tool expects two parameters in the call: the dataset and the schema (Listing 12). If everything goes well, the call completes without any feedback (line 2); otherwise, validate-json grumbles (lines 4 and 5). To provoke the error message starting in line 4, we turned the numeric specification for the year of publication (1994) into a string ("1994"), which means that the data type in the dataset no longer matched the data type stored in the JSON schema. validate-json has every reason to complain.
Listing 12
Calling validate-json
01 $ validate-json record.json book-inventory-schema.json
02 $
03 $ validate-json record.json book-inventory-schema.json
04 JSON does not validate. Violations:
05 [publication] String value found, but a number is required
Some programming languages also offer suitable helper libraries. In Python, for example, you can use jsonschema, and for NodeJS, you can use the Express framework.
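As a sketch of the Python route: assuming the third-party jsonschema package is installed (pip install jsonschema), validating a record against an abbreviated version of the schema from Listing 10 looks like this:

```python
import json
import jsonschema  # third-party package: pip install jsonschema

# Abbreviated version of the schema from Listing 10
schema = {
    "type": "object",
    "required": ["author", "title", "publication"],
    "properties": {
        "author": {"type": "string"},
        "title": {"type": "string"},
        "publication": {"type": "number", "minimum": 0},
    },
}

good = json.loads('{"author": "Stephen Fry", "title": "The Hippopotamus", "publication": 1994}')
jsonschema.validate(instance=good, schema=schema)  # passes silently

bad = dict(good, publication="1994")  # wrong type: string instead of number
try:
    jsonschema.validate(instance=bad, schema=schema)
except jsonschema.ValidationError as err:
    print("JSON does not validate:", err.message)
```

Like validate-json, the library stays silent on success and raises a descriptive error when a field violates the schema.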
Processing JSON
The list of command-line tools and helpers that read, search on, and modify JSON data is quite extensive. We stopped counting after more than 20 entries (see Table 4 for a sample). Developer Ilya Sher maintains a useful, commented overview of options [19].
Table 4
Command-Line Tools
Tool | Application (selection)
---|---
faq, Xidel | Convert formats from and to JSON (BSON, Bencode, JSON, TOML, XML, YAML, etc.)
fx, gofx, jq, jid | Filter JSON data
jello | Filter JSON data with Python syntax
jtbl | Output to a table
Underscore | Processing via the command line
Jtbl, for example, takes JSON records and knits a pretty table from them. In Figure 7, you can see how this table looks for the book inventory. Each record is shown in a separate row. Jtbl can only cope with flat JSON structures. It cannot handle nesting so far.
Element-by-Element Access
A tool like jq can dig out individual elements from the JSON data stream using expressions, but this approach is cumbersome for highly nested data structures; it is far easier to use a path specification. Tools like JMESPath [20] (pronounced "James Path") and JSONPath [21] are similar to XPath for XML. JMESPath is available in Python, PHP, JavaScript, or even Lua; JSONPath is available in JavaScript, PHP, and Java.
These tools enable more complex expressions; Table 5 shows a selection. You read the expressions from left to right and name nodes or attributes in the order in which you want to work your way along the data structure. Two levels of nodes or attributes are separated by a period. Sets and patterns are specified in square brackets, for example, book[*] for all nodes of the book list. The specification book[?author == `Ken Follett`] selects all nodes from the dataset for which the attribute author has the value Ken Follett.
Table 5
Expressions in JMESPath
Expression | Meaning
---|---
book[*].title | All book titles
book[?author == `Ken Follett`].title | All book titles by author Ken Follett
book[?publication > `1990`] | All books published after 1990
Please note the correct quotes when formulating the expressions: You need to enclose values for comparison in backticks (`), regardless of whether they are strings or numeric values.
Listing 13 shows the three expressions from Table 5 in action in a Python script. We used the Python implementation of JMESPath here. Although the json library is part of Python's standard library, JMESPath is an extra that you must install before use, either via pip or the package manager that comes with your Linux distribution. The corresponding package for Debian GNU/Linux and Ubuntu goes by the name of python3-jmespath.
Listing 13
find-json-path.py
01 import jmespath
02 import json
03
04 expression1 = jmespath.compile('book[*].title')
05 expression2 = jmespath.compile('book[?author == `Ken Follett`].title')
06 expression3 = jmespath.compile('book[?publication > `1990`]')
07
08 with open("bookinventory.json") as jsonFile:
09     jsonData = json.load(jsonFile)
10
11 # book titles
12 print("Book title:")
13 bookTitles = expression1.search(jsonData)
14 for title in bookTitles:
15     print(title)
16
17 print(" ")
18
19 # all the books by Ken Follett
20 print("All books by Ken Follett:")
21 bookTitles = expression2.search(jsonData)
22 for title in bookTitles:
23     print(title)
24
25 print(" ")
26
27 # all books published later than 1990
28 print("All books published later than 1990:")
29 books = expression3.search(jsonData)
30 for item in books:
31     author = item["author"]
32     title = item["title"]
33     publication = item["publication"]
34     print("Author : %s" % author)
35     print("Title : %s" % title)
36     print("Published: %i" % publication)
37     print(" ")
After first loading the two Python libraries, jmespath and json, in the script (lines 1 and 2), three expressions or search patterns are defined as objects named expression1, expression2, and expression3. If you are familiar with Python's regular expression library re, you will already know the procedure.
Lines 8 and 9 read the book inventory as a JSON file and load() the contents of the file as a dictionary into the jsonData variable. Searches over the book inventory rely on the search() method of the search pattern object. For example, the call to expression2.search(jsonData) finds all book titles that belong to the author Ken Follett. search() returns a list of search hits that you can output one by one in a for loop. Figure 8 shows the output of the search matches for all three previously defined search paths.
JSON Libraries
If you prefer some other programming language instead of Python, you can still connect to JSON. Table 6 shows a selection of libraries and modules. If several implementations are available for a programming language, it is difficult to make a recommendation without knowing the volume and structure of the JSON data you wish to process. A benchmark test [22] will make you smarter about what best suits your case.
Table 6
Selection of JSON Libraries
Language | Libraries (Selection)
---|---
Go | encoding/json
Lisp | CL-JSON
Lua | json.lua
NodeJS | Express
Perl | JSON::Parse, JSON::PP, JSON::XS
PHP | JSON
Python | simplejson, hyperjson, json, jsonschema, orjson, RapidJSON, UltraJSON, pandas
Ruby | JSON
Tcl | json
Listing 14 shows how to access JSON objects in the Go programming language. After importing the two packages encoding/json and fmt, you create a Book data structure with three fields: Author, Title, and Publication. You access this data structure in the main() function by declaring a book variable of this type.
Listing 14
extract-json.go
package main

import (
    "encoding/json"
    "fmt"
)

type Book struct {
    Author      string
    Title       string
    Publication string
}

func main() {
    bookJson := `{"author": "Stephen Fry", "title": "The Hippopotamus", "publication": "1994"}`

    var book Book
    json.Unmarshal([]byte(bookJson), &book)

    fmt.Println("Author: ", book.Author)
    fmt.Println("Title: ", book.Title)
    fmt.Println("Publication: ", book.Publication)
}
The bookJson variable holds the record for a book. Using the Unmarshal() function from the json package, you parse the record and assign the contents to the fields of book. Then, using Println(), you output the contents of the fields. For more information on processing, see Soham Kamani's blog [23], which is definitely a worthwhile read.
Now save Listing 14 to the extract-json.go file and run the code. You'll see something like the output in Listing 15.
Listing 15
Output
$ go run extract-json.go
Author:  Stephen Fry
Title:  The Hippopotamus
Publication:  1994
Outlook
Using JSON is a good choice if the data conforms to the supported formats, the strings are not arbitrarily long, and you only need to implement the data exchange, while the documentation of the data is managed elsewhere. Another strength of JSON is that many languages can process it.
With large volumes of data, however, the file size can have a detrimental effect on the processing speed. In those cases, a different format such as Google's Protobuf [24] could offer an alternative. For more information on serialization formats and the practical handling of JSON, see the examples in the Jupyter tutorial [25].
Thank you
The authors would like to thank Gerold Rupprecht for his criticism and suggestions during the preparation of the article.
Infos
- JSON: https://www.json.org/
- Jupyter Notebook: https://jupyter.org/try
- GeoJSON: https://geojson.org
- RFC 8259: https://tools.ietf.org/html/rfc8259
- aeson-pretty: https://github.com/informatikr/aeson-pretty
- jc: https://github.com/kellyjonbrazil/jc
- jid: https://github.com/simeji/jid
- jo: https://github.com/jpmens/jo
- jq: https://github.com/stedolan/jq
- Jshon: http://kmkeen.com/jshon
- JSONLint: https://jsonlint.com
- jq cheat sheet: https://lzone.de/cheat-sheet/jq
- jc demo website: https://jc-web-demo.herokuapp.com
- JSON Viewer: https://jsonviewer.arianv.com
- JSON validators: https://json-schema.org/implementations.html#validators
- validate-json: https://github.com/justinrainbow/json-schema
- jsonschema: https://github.com/Julian/jsonschema
- Validation of JSON data: http://json-schema.org/draft/2019-09/json-schema-validation.html
- "List of JSON tools for command line": https://ilya-sher.org/2018/04/10/list-of-json-tools-for-command-line/
- JMESPath: https://jmespath.org
- JSONPath: https://code.google.com/archive/p/jsonpath/
- "Choosing a faster JSON library for Python": https://pythonspeed.com/articles/faster-json-library/
- "How to Parse JSON in Golang": https://www.sohamkamani.com/blog/2017/10/18/parsing-json-in-golang/
- Protobuf: https://developers.google.com/protocol-buffers
- Serialization formats/JSON (Jupyter Tutorial): https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/serialisation-formats/json.html