Simple web scraping with Bash
Ski Report
![© Photo by Nicolai Berntsen on Unsplash](/var/linux_magazin/storage/images/issues/2022/262/bash-web-scraping/photobynicolaiberntsenonunsplash_skiing.png/808660-1-eng-US/PhotobyNicolaiBerntsenonUnsplash_skiing.png1_medium.png)
© Photo by Nicolai Berntsen on Unsplash
With one line of Bash code, Pete scrapes the web and builds a desktop notification app to get the daily snow report.
While working on a small project recently, I was amazed by how much web scraping I could do with just one line of Bash. I used the text-based Lynx browser [1] and piped its output to a grep search. Figure 1 shows the one-line Bash example that scrapes the current snow depth from the Sunshine Village Snow Forecast web page.
In this article, I will introduce some techniques to easily scrape web pages, and then I will create a desktop notification script that provides the daily snow forecast.
The Lynx Text Browser
For my Bash web scraping, I started out by looking at command-line tools such as curl [2] combined with the html2text [3] utility. This technique works, but I found that the Lynx browser offers a one-step solution with slightly cleaner text output.
To install Lynx on Raspbian/Debian/Ubuntu, use:
sudo apt install lynx
The Lynx -dump option outputs a web page as text, with HTML tags, HTML encoding, and JavaScript removed. Figure 2 shows that a Lynx dump can greatly clean up the original web page and make searching considerably easier.
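As a rough illustration of what -dump does, the following sketch renders a small local HTML file instead of a live site. The page content and file names are made up for this example, and the -nolist option (which omits the list of link targets that Lynx appends to a dump) is an addition that the one-liner in Figure 1 may not use.

```shell
#!/bin/bash
# Build a tiny local HTML page so the example runs without network access.
# The page content is invented for illustration only.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
page="$workdir/page.html"

cat > "$page" <<'EOF'
<html><body>
<h1>Snow Report</h1>
<p>Top Lift: <b>150</b> cm <a href="https://example.com/">details</a></p>
<script>document.title = "ignored";</script>
</body></html>
EOF

if command -v lynx >/dev/null 2>&1; then
    # -dump renders the page as plain text; -nolist leaves out the list of
    # link targets that Lynx otherwise appends to the end of the dump.
    text=$(lynx -dump -nolist "$page")
    printf '%s\n' "$text" | grep "Top Lift"
else
    echo "lynx is not installed" >&2
fi
```

Replacing the local file with a URL gives the same kind of text output for a live page.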
Sometimes a simple grep search might be all that you need. However, in many cases some text manipulation is required. The good news is that Bash has a nice selection of line and string manipulation tools.
The example shown in Figure 3 uses line manipulation to find the current weather in Key West, Florida. A grep search is done on the string "As of", and the option -A 3 returns the matching line plus the three lines that follow it. You can remove the "As of" line with the tail command if required.
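The same grep and tail combination can be tried offline. In this sketch, the text in the dump variable is a hypothetical stand-in for the Lynx output of a weather page; the real page text will differ.

```shell
#!/bin/bash
# Hypothetical sample standing in for `lynx -dump` output of a weather page.
dump=$(cat <<'EOF'
Key West, FL
As of 9:15 am EST
79 F
Partly Cloudy
Day 82 F / Night 75 F
Radar
EOF
)

# The matching line plus the three lines after it
with_header=$(printf '%s\n' "$dump" | grep -A 3 "As of")

# Same search, but keep only the last three lines to drop the "As of" line
weather=$(printf '%s\n' "$dump" | grep -A 3 "As of" | tail -n 3)

printf '%s\n' "$weather"
```

Using tail -n 3 here works because grep returns exactly four lines; if the match sits near the end of the dump, fewer lines come back and the count would need adjusting.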
It's important to note that the text Lynx outputs may not match what you see on the web page, so some trial-and-error testing might be required.
Figure 4 uses string manipulation to find the new snow at Sunshine Ski Resort. The resort's web page uses JavaScript to show the new snow in either centimeters or inches, but the Lynx text output displays both values and their units.
To remove parts of a string variable, you can use %% to strip the longest trailing match of a pattern, which leaves the first part of the string, and # to strip the shortest leading match, which leaves the last part (as shown in Listing 1).
Listing 1
Extracting Parts of a String
    $ newsnow="5.2cm2.0"
    $ # get the part before 'cm'
    $ echo "${newsnow%%cm*}"
    5.2
    $ # get the part after 'cm'
    $ echo "${newsnow#*cm}"
    2.0
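The same operators can be chained to pull both values out of a complete line of scraped text. The sample line below is an assumption about what the grep search returns; check the actual Lynx output before relying on this exact format.

```shell
#!/bin/bash
# Assumed shape of the scraped line; the real Lynx output may differ.
line="New snow in Sunshine Village: 5.2cm2.0in"

# '##' strips the longest leading match of '*: ', leaving only the values
value="${line##*: }"

# '%%' strips the longest trailing match of 'cm*', leaving the centimeters
cm="${value%%cm*}"

# '#' strips the shortest leading match of '*cm', leaving the inches part
rest="${value#*cm}"

# '%' strips the shortest trailing match of 'in', leaving the bare number
inches="${rest%in}"

echo "New snow: ${cm} cm (${inches} in)"
```

The single-character forms (# and %) remove the shortest match, and the doubled forms (## and %%) remove the longest, which matters only when the pattern can match in more than one place.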
A Bash Web Scraping Project
To get excited before a family ski trip, I wanted to create a morning notification script that would show the new morning snow and the base snow.
To create the notification script (Listing 2), I used two passes with the Lynx utility. The first pass scrapes the new snow (shown in Figure 4), and the second pass gets the snow base (shown in Figure 1). The snow results are then passed as a string ($msg) to the notify-send utility [4], which posts the message to the workstation desktop (Figure 5). You can schedule this Bash script to run every morning using either cron or the at utility.
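For the cron route, a job typically starts without the desktop session's environment, so notify-send may not find the display or the session bus. The sketch below only builds and prints a crontab line; the script location (~/skitrip.sh), display :0, the 07:00 run time, and the /run/user/UID/bus socket path are all assumptions that depend on your system.

```shell
#!/bin/bash
# Assumptions: the script is saved as ~/skitrip.sh, the desktop session
# uses display :0, and the session bus socket lives under /run/user/<uid>.
script="$HOME/skitrip.sh"
uid=$(id -u)

# Run at 07:00 every day. The two variables in front of the command give
# notify-send the display and session bus it needs when started by cron.
entry="0 7 * * * DISPLAY=:0 DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/$uid/bus $script"

printf '%s\n' "$entry"
# To install it, paste the printed line into the editor opened by: crontab -e
```

The script itself also needs to be executable (chmod +x) for this entry to work.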
Listing 2
Bash Web Scraping Notification Script
    #!/bin/bash
    #
    # skitrip.sh - show the Sunshine ski conditions in a notification
    #
    theurl="https://www.snow-forecast.com/resorts/Sunshine/6day/mid"

    # Get the new snow depth
    thestr="New snow in Sunshine Village:"
    result=$(lynx -dump "$theurl" | grep "$thestr")
    newsnow="${result%%cm*} cm"

    # Get the base
    thestr="Top Lift:"
    base=$(lynx -dump "$theurl" | grep "$thestr")

    # Show the results in a desktop notification, with a 120-second timeout
    # (the -t option takes milliseconds)
    msg="$newsnow\n$base (base)"
    icon="$HOME/Downloads/mountain.png"
    notify-send -t 120000 -i "$icon" "Sunshine Ski Resort" "$msg"
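Listing 2 assumes that both searches succeed. One possible guard, sketched here with a hypothetical helper named find_line and made-up sample text, is to report a missing search string instead of silently showing an empty notification.

```shell
#!/bin/bash
# find_line PATTERN: print the first line of stdin that contains PATTERN.
# Returns non-zero and warns on stderr when nothing matches.
# (find_line is a hypothetical helper, not part of Listing 2.)
find_line() {
    local pattern=$1 line
    if ! line=$(grep -m 1 -F -- "$pattern"); then
        echo "warning: '$pattern' not found; the page may have changed" >&2
        return 1
    fi
    printf '%s\n' "$line"
}

# Made-up sample text standing in for `lynx -dump` output.
sample="Top Lift: 150 cm
Bottom Lift: 90 cm"

# A search string that exists
base=$(printf '%s\n' "$sample" | find_line "Top Lift:")
echo "found: $base"

# A search string that is missing; capture the failure status
status=0
missing=$(printf '%s\n' "$sample" | find_line "Mid Lift:" 2>/dev/null) || status=$?
echo "missing lookup exit status: $status"
```

In a real script, the dump would come from a single lynx -dump call stored in a variable, which also avoids fetching the page twice.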
Summary
Scraping web pages can be tricky, and the pages can change at any time. For this reason, it is always best to check whether an API is available before turning to web scraping.
Python with the Beautiful Soup library has been my go-to approach for web scraping, but it's nice to know that a simple Bash alternative is also available.