Essential software tools for the working scientist

The Scientist's Linux Toolbox

© Lead Image © KrishnaKumar Sivaraman, 123RF

Author(s): Lee Phillips

Linux and science are a natural fit. These are a handful of essential software packages both for getting work done and presenting it to others.

Although Linux still occupies a small niche on the desktop among the population at large, it is much more popular among scientists from all disciplines.

It's tempting to say that's just because scientists are smart! But it's easy for me to understand Linux's appeal for scientists when I remember the problems caused by the use of proprietary operating systems (OSs) and software in labs where I've worked.

For one particular piece of software, we had a site license that only permitted a certain number of people to use the program at a time. It would actually spy on the network and count up how many instances of the program were running. If I needed to run it and it refused, I had to run around the lab to find out if someone had just forgotten to exit the program.

People couldn't read each other's documents if they were made with the wrong version of Word. Programs would stop working after upgrading the OS, and, if they had been abandoned, you were out of luck without the source. Standard open source tools might not compile, because the OS vendor included outdated or even misnamed versions of standard libraries (Apple was notorious for this). Customizing the desktop was difficult and options were limited.

This is just the tip of the closed-source computing iceberg. When I was able to switch entirely to Linux, all these problems disappeared. I've been doing my work exclusively with Linux for years and could not imagine going back to the hostile world of proprietary software.

Another reason for the relative popularity of Linux among some scientists is that it is the OS of choice for such things as wiring together supercomputing clusters. There is a certain convenience in having a consistent environment shared between the remote compute resource and the box on your desk.

In the rest of this article, I survey some widely used free software. Except for some of the more specialized packages nearer the end of the article, I use all of this software myself and recommend it to the scientist switching to Linux who wants to get started with a set of powerful and reliable tools. (For those switching to Linux, see also the "Which Distribution?" box.)

Which Distribution?

All the software described here will work on any Linux distribution, so your best strategy is simply to use the distribution that is known to work well on your hardware. Since, as a scientist, you will probably wind up doing some serious calculations, you don't want to waste memory or cycles on inessentials. Because you will generally be working from the command line, regardless of your distribution, you should consider uninstalling or disabling any heavy desktop environment and replacing it with a lightweight window manager such as dwm [1].

Every piece of software I mention in this article is free and open source. All are available for Linux, most can also run on other free OSs, and some will even work on Apple and Windows machines.

Writing Papers

A scientist in any field will be writing papers, so this section is the most widely applicable. I recommend two nearly indispensable software packages.

The first is the TeX Live [2] distribution of LaTeX. This is a huge package that will install everything you need to typeset any kind of document, including a complete set of fonts, all the engines based on TeX (LuaTeX, pdfTeX, XeTeX, etc.), software for drawing diagrams, and much more. Do not install this from your distribution's package manager, because it will almost certainly be out of date.

It takes some time and effort to learn how to use TeX, but, especially if you are in a field where your papers will contain a lot of equations, it is really the only choice. Many journals accept TeX source, and some have their own templates that they require you to use. A convenient side effect is that you can use the same source to create a beautiful preprint.

Even if your papers never contain math, the LaTeX system is still a major convenience for the academic, because it handles references automatically, and it can generate bibliographies in any format.

The typical usage of the TeX system is to edit your source in the editor of your choice, embed the TeX markup, and process it through one of the engines (the modern choices are either LuaTeX or XeTeX) to create a PDF. However, you won't do it this way, because you will also install the second indispensable package: pandoc [3].

Pandoc is a "Swiss army knife" document conversion program. You write your papers in an extended version of Markdown, an intuitive markup that resembles the way people normally add emphasis and so on to text documents such as emails (e.g., *italic*). Pandoc can convert this markup to many formats, including HTML, as well as document formats such as ODT, DOCX, and TeX, which you can run through XeTeX or another TeX engine to produce a PDF. To include math or other elements that Markdown can't handle, just write them in TeX markup directly in your source, and pandoc will know how to handle it. Pandoc is extremely useful for the working scientist, because some publications require a format other than TeX, and because it allows you to write one source for your paper and automatically create versions for preprints, the web, presentation slides, and more. Pandoc is also extensible [4] with user filters.
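The one-source, many-outputs workflow described above is easy to script. The following is a minimal sketch in Python, not a recommended build system: the file name paper.md and the list of target formats are assumptions made for illustration. It relies only on pandoc's ability to infer the output format from the file extension, plus the --standalone and --pdf-engine options.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical source file and target formats, chosen only for illustration.
SOURCE = Path("paper.md")
TARGETS = {
    "html": ["--standalone"],          # complete HTML page, not a fragment
    "odt": [],
    "docx": [],
    "pdf": ["--pdf-engine=xelatex"],   # PDF by way of XeTeX
}


def pandoc_command(source, fmt, extra_args):
    """Build the pandoc command line that converts source to one format."""
    output = source.with_suffix("." + fmt)
    return ["pandoc", str(source), "-o", str(output), *extra_args]


def build_all(source=SOURCE, targets=TARGETS, dry_run=False):
    """Return every command; run them only if pandoc and the source exist."""
    commands = [pandoc_command(source, fmt, args) for fmt, args in targets.items()]
    can_run = not dry_run and shutil.which("pandoc") is not None and source.exists()
    if can_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    for cmd in build_all():
        print(" ".join(cmd))
```

Run from the directory that holds the Markdown source, the script prints one pandoc command per format and executes them when pandoc is installed. The PDF target additionally assumes that a TeX installation providing xelatex is on the path.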

Figure 1 is a screenshot from my laptop showing this article as I write it in the Vim editor, on the left. On the right are three transformations of the article: as a PDF, processed through pandoc to XeTeX (top); the HTML version, directly from pandoc and rendered in a web browser (middle); and an ODT file, again directly from pandoc, viewed in LibreOffice (bottom). I'll include an equation to make it interesting (I am not including the part of the article with the command to include the screenshot, to avoid the creation of a spacetime singularity):

Figure 1: Editing the text of this article (left) with three output formats created by pandoc (right) from the same source.

Making Graphs and Diagrams

There are many choices here; what you install and learn to use will be determined partly by your specialty. If you are a mathematician whose papers are full of things like commutative diagrams, you already have the best software for those purposes, because it comes with TeX Live: Learning how to use the TikZ drawing language, in which you can directly embed drawing instructions into the source for your paper, will be invaluable.

Many popular programming languages come with their own plotting systems, and using those may be a good choice if you are sure that you will always use the same language. However, if you want a portable solution that runs anywhere and is fast and stable when dealing with enormous datasets, consider gnuplot [5]. You will find a fairly recent version in your distribution's package manager, but for the very latest features, download and compile the source.

Gnuplot is an early open source program that predates Linux, but it is still actively developed, with new features [6] appearing regularly. It is the best choice for creating automated graphing pipelines that can work with simulations or data from any programming language or source. Gnuplot excels at automation because it is controlled through text scripts rather than with a GUI. It supports a wide range of output formats: PNG, SVG, dumb terminal, sixel, and many more, plus minimally interactive graphs for the web or using Qt, X11, and other GUI toolkits.
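Because gnuplot reads plain text scripts and plain text data, any program that can write a file can drive it. Here is a minimal sketch of such a pipeline in Python. The file names (signal.dat, signal.gp, signal.svg) and the damped sine data are invented for the example; a real pipeline would write simulation output instead.

```python
import math
import shutil
import subprocess
from pathlib import Path


def data_lines(n=200):
    """Return n rows (n >= 2) of 'x y' text for a damped sine on 0 <= x <= 10."""
    rows = []
    for i in range(n):
        x = 10.0 * i / (n - 1)
        y = math.exp(-0.3 * x) * math.sin(3.0 * x)
        rows.append(f"{x:.5f} {y:.5f}")
    return rows


def gnuplot_script(data_name, output_name):
    """Return a gnuplot script that plots column 2 against column 1 as SVG."""
    return "\n".join([
        "set terminal svg size 800,500",
        f"set output '{output_name}'",
        "set xlabel 'x'",
        "set ylabel 'y'",
        f"plot '{data_name}' using 1:2 with lines title 'damped sine'",
    ]) + "\n"


def make_plot(workdir="."):
    """Write data and script files, then call gnuplot if it is installed."""
    workdir = Path(workdir)
    (workdir / "signal.dat").write_text("\n".join(data_lines()) + "\n")
    script = workdir / "signal.gp"
    script.write_text(gnuplot_script("signal.dat", "signal.svg"))
    if shutil.which("gnuplot"):
        subprocess.run(["gnuplot", script.name], cwd=workdir, check=True)
    return script


if __name__ == "__main__":
    print(make_plot())
```

The same script file can be rerun by hand with gnuplot signal.gp, and switching the output format is a matter of changing the set terminal line.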

Gnuplot can create any type of visualization that a scientist might need, and the output is customizable to the last detail. Figure 2 is a screenshot from my laptop, showing a script in my editor on the left and the resulting graphs on the right.

Figure 2: Gnuplot can create all kinds of scientific visualizations. The script on the left created the two plots on the right.

Numerics

Linux has well-established, state-of-the-art compilers for C and Fortran: GCC and GFortran, respectively. They are both available in all package repositories. GCC is used to compile much system software, including the Linux kernel itself. GFortran is a capable compiler for Fortran simulation code and able to parallelize array expressions. For Fortran, there are many commercial compilers available as well and a surprising newcomer: the open source LFortran [7] compiler, which is based on LLVM and provides the user with an interactive REPL.

If you are involved with legacy simulation code, there is a good chance that you will find yourself using one of these tools. But if you are beginning a new project, my advice is to use neither C nor Fortran, but to head straight to the Julia website [8] and download this free, open source language for technical computing.

I wrote an article [9] about Julia about two years ago, introducing the syntax and use of the language. Since then, interest in this relatively new language has exploded among computational scientists in every field. This is due not only to its speed and ease of development, but to the ease [10] with which a scientist can mix and match different libraries to create new functionalities.

Up to here, I've treated subjects of general interest to most scientists. I'd like to turn now to a brief rundown of some software that is specific to several fields. There isn't space to survey the vast landscape of science, of course. But even if your field is not one of these few, this may give you an idea of the range and character of Linux tools focused on particular disciplines.

Physics

If you are a physicist interested in simulation, I repeat my recommendation to install Julia and explore the language as well as the rapidly growing ecosystem of physics and related packages. It is not by accident that interest in Julia is exploding among scientists from all fields. Any project that you create will benefit from the ease with which you can remix code from other packages; in some cases you will discover that you have to do much less work than you anticipated.

As a case study, here is every step needed to go from zero to a fluid dynamics simulation of mixing at the boundary between a heavy (cold) and a light (warmer) fluid (the Rayleigh-Taylor instability), not counting about two hours of reading documentation and playing around with the simulation software to learn how it works.

I'd heard good things about a Julia package for fluid simulation called Oceananigans.jl [11]. Getting libraries for Julia is a breeze, because it has a package manager not only built in, but integrated into the REPL. Just hit the ] key to enter the package sub-REPL, and type add Oceananigans. Julia will download whatever is needed from the Internet, including any dependencies required, and pre-compile the code.

For the rest, refer to Figure 3, which shows the entire Julia session. The final command creates the plot in Figure 4, which shows the temperature field.

Figure 3: A Julia REPL session in which a fluid dynamics simulation is set up and performed.

Figure 4: The output of the fluid simulation in Figure 3, showing the mixing of two fluids at different densities in a gravitational field.

In my career, I've had to work with simulation codes from various sources. I have never experienced getting a useful calculation done as easily as I was able to do with Oceananigans. It not only has a sophisticated interface, but it is remarkably fast, owing in part to Julia's ability to generate efficient machine code. On my very modest laptop, this calculation took on the order of 30 seconds. If you are in computational physics, or any branch of numerical science, I recommend experiencing the Julia ecosystem for yourself.

Mathematics

The symbolic mathematics program Maxima [12] is extremely useful. It is light, fast, and available from any distribution's package manager. Maxima is the open source successor to the venerable commercial program Macsyma. It is written in Lisp and can handle many areas of mathematics, such as basic algebra, calculus, trigonometry, differential equations, series, and much more.

Maxima uses gnuplot to draw graphs, but that's OK, because you've already followed my advice and installed it. It can print its output in the form of TeX math markup, which you can paste directly into your TeX documents (see the Writing Papers section).

Figure 5 shows a quick calculation in Maxima, a screenshot from my laptop. On the top is a terminal window, and below that a gnuplot graph. In the terminal, I've invoked Maxima (which starts instantly) and defined an ordinary differential equation using Maxima's syntax for derivative operators. Notice that the inputs and outputs are numbered, so you can use them in subsequent expressions. Maxima repeats the equation, but in a more visual form, using ASCII characters to approximate what the math is supposed to look like. In my second input, I'm asking for a solution using the ODE solver; Maxima has more than one solver for some types of equations, because a particular solver may not work on a particular problem. The reply is a solution made of Bessel functions, which is correct: the ODE I entered is the Bessel equation.

Figure 5: A Maxima session, showing the setting up, solving, and graphing of the solution to Bessel's differential equation.

The solution contains two free parameters, called %k1 and %k2, whose values can't be determined without more information, namely boundary conditions. My next input defines the solution sol by giving Maxima these boundary conditions. That input I terminated with a dollar sign rather than the usual semicolon, to suppress the somewhat voluminous output. Instead, I would like to see a graph of this solution, which I order up in the next input, inserting the rhs (right hand side) of sol as the function to be plotted. The plot accepts a range for the independent variable and pops up the gnuplot graph immediately.
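For readers who want the mathematics behind the session in standard notation, Bessel's equation and its general solution are written out below. The text does not state the order used in the session, so it is left as a general parameter ν here; the constants k_1 and k_2 stand in for Maxima's %k1 and %k2, and J and Y are the Bessel functions of the first and second kind.

```latex
\[
x^{2}\,\frac{d^{2}y}{dx^{2}} + x\,\frac{dy}{dx} + \left(x^{2} - \nu^{2}\right) y = 0,
\qquad
y(x) = k_{1}\,J_{\nu}(x) + k_{2}\,Y_{\nu}(x)
\]
```

Two boundary conditions fix k_1 and k_2, which is the extra information the solver needs before it can produce a single curve to plot.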

Maxima can do a lot and is efficient once you get familiar with its syntax. But if you're a mathematician who makes extensive use of computer assistance, especially if you travel in fields not served by Maxima, you may want to install SageMath [13]. You need to follow the link to download it, but make sure you have several gigabytes free before you do. SageMath is a huge system, packaging together about 100 pieces of mathematical software (including Maxima) in a giant bundle with a unified interface based on Python. SageMath can deal with some truly obscure subdisciplines and even has components for doing such things as solving Rubik's Cubes. Most people work with SageMath through its interactive, web-based notebook, which is similar to Jupyter [14], but it has a command-line interface as well.

Biology

With the advent of bioinformatics [15] as a major activity within biology, the computer as a tool is more central to this discipline than ever before.

Bioinformatics is a blending of computer science and biology usually concerned with dealing with DNA sequences or other sequences of molecular data: essentially strings of potentially millions of letters. This gives the computational problems in this field something of a linguistic character. Bioinformatics plays a big part in gene-based drug discovery, wildlife conservation, cancer treatment, forensic analysis, and much more.
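To give a flavor of what treating sequences as strings means in practice, here is a small sketch in Python. It is not EMBOSS code and does not use any EMBOSS interface; the function names are made up for this illustration, and it handles only the four plain DNA letters.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean(seq):
    """Remove all whitespace and convert the sequence to upper case."""
    return "".join(seq.split()).upper()


def reverse_complement(seq):
    """Return the complementary strand, read in the opposite direction."""
    return clean(seq).translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Return the fraction of bases that are G or C (0.0 for an empty string)."""
    s = clean(seq)
    if not s:
        return 0.0
    return (s.count("G") + s.count("C")) / len(s)


if __name__ == "__main__":
    print(reverse_complement("ATG CCG"))  # prints CGGCAT
    print(gc_content("ATGC"))             # prints 0.5
```

Real sequence files add ambiguity codes, headers, and lengths in the millions, which is where a dedicated suite earns its keep.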

EMBOSS [16] is an acronym for European Molecular Biology Open Software Suite. It is a major computational resource used by many biologists all over the world. EMBOSS packages 200 applications for sequence analysis, visualizing proteins, analyzing protein structure, providing transparent access to remotely hosted molecular sequences, and much more.

I should mention that a handful of these 200 programs are trivial utilities for doing things like removing whitespace from an ASCII file. Clearly EMBOSS attempts to be a complete environment providing anything even a computer-naive bioinformatician might need.

EMBOSS can be operated with graphical, web-based, or command-line interfaces. Figure 6 shows a screenshot of one of the available web interfaces to a utility that displays protein sequences graphically. SourceForge provides the interface as a demo, so the biologist interested in trying out the program, or even in using it for real work up to a point, can experiment on real data without having to download and install it.

Figure 6: An online demo for the EMBOSS bioinformatics system.

Even more than a comprehensive software suite for molecular biology and bioinformatics, EMBOSS provides a platform to allow computational biologists to release their own open source projects.

Happy Computing

Science is more open than ever before, and part of this new openness has been both influenced and facilitated by the free software movement, including Linux. The movement around open data helps greatly with trust and reproducibility; open journals are gradually replacing the expensive and, in some ways, counter-productive traditional publishing system; and the nearly universal practice around simulation code is now to open it to the public's eyes, often on GitHub, rather than keeping it locked away in the lab's computers, treated as a trade secret.

In this environment, a proprietary OS on a scientist's desk seems out of place. I hope this very compact survey of some of what's available convinces you that you give up nothing as a scientist by adopting Linux and gain a great deal.

Infos

  1. dwm: https://dwm.suckless.org/
  2. TeX Live: https://www.tug.org/texlive/
  3. pandoc: https://pandoc.org/
  4. "Technical Writing with Pandoc and Panflute," by Lee Phillips, Linux Journal, September 2017, https://lee-phillips.org/panflute-gnuplot/
  5. gnuplot: http://gnuplot.info/
  6. "New Features in Gnuplot 5.4," by Lee Phillips, LWN, July 22, 2020, https://lwn.net/Articles/826456/
  7. LFortran: https://lfortran.org/
  8. Julia: https://julialang.org/
  9. "Fast as Fortran, Easy as Python," by Lee Phillips, ADMIN, issue 50, 2019, pg. 14-19, https://www.admin-magazine.com/Archive/2019/50/Julia-Fast-as-Fortran-easy-as-Python
  10. Stefan Karpinski, "The Unreasonable Effectiveness of Multiple Dispatch," https://www.youtube.com/watch?v=kc9HwsxE1OY
  11. Oceananigans.jl: https://github.com/CliMA/Oceananigans.jl
  12. Maxima: http://maxima.sourceforge.net/
  13. SageMath: https://www.sagemath.org/
  14. "Jupyter: Notebooks for Education and Collaboration," by Lee Phillips, LWN, February 6, 2018, https://lwn.net/Articles/746386/
  15. Bioinformatics: https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/bioinformatics
  16. EMBOSS: http://emboss.open-bio.org/

The Author

Dr. Lee Phillips is a theoretical physicist and writer who has worked on projects for the Navy, NASA, and DOE on laser fusion, fluid flow, plasma physics, and scientific computation. He has written numerous popular science and computing articles and technical publications and is engaged in science education and outreach.