Ssscrape 1.0 Collects Dynamic Web Data

Feb 18, 2010

Mathias Huber

The Ssscrape tool screen-scrapes data from RSS and Atom feeds, blogs and podcasts. The open source software is now available in version 1.0.

Ssscrape tracks feeds and other collections of similar items for updates, then downloads new content and cleans it by converting HTML to plain text. It uses MySQL as its database. The tool can also gather statistics about feed activity and report errors. A scheduler takes care of the periodic checks, and a monitor displays the running activities.
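The track-and-clean cycle described above can be illustrated with a minimal sketch. This is not Ssscrape's own code: it assumes Python 3 and uses only the standard library, with an in-memory set standing in for the MySQL database. The names `html_to_text`, `new_items`, and `SAMPLE_FEED` are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample feed; the descriptions carry escaped HTML, as RSS often does.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><guid>a1</guid><title>First</title>
<description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</description></item>
<item><guid>a2</guid><title>Second</title>
<description>&lt;p&gt;More   text&lt;/p&gt;</description></item>
</channel></rss>"""


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Drop all tags and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def new_items(rss_xml, seen_guids):
    """Return (guid, title, plain_text) for items whose guid is not yet known.

    seen_guids is updated in place; it stands in for a persistent store.
    """
    fresh = []
    for item in ET.fromstring(rss_xml).iter("item"):
        guid = item.findtext("guid")
        if guid is None or guid in seen_guids:
            continue  # skip items without an ID or already processed
        seen_guids.add(guid)
        title = item.findtext("title", default="")
        text = html_to_text(item.findtext("description", default=""))
        fresh.append((guid, title, text))
    return fresh


if __name__ == "__main__":
    seen = set()
    for guid, title, text in new_items(SAMPLE_FEED, seen):
        print(guid, title, text)
    # A second pass over the same feed yields no new items.
    print(new_items(SAMPLE_FEED, seen))
```

Calling `new_items` again with the same `seen` set returns an empty list, which is the behavior a periodic scheduler would rely on to avoid reprocessing old entries.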

Ssscrape is a Web crawler, that is, a program that scrapes together information from the Web. The name is short for Syndicated and Semi-Structured Content Retrieval and Processing Environment. The scraper is written in Python, with Twisted used for network programming and Beautiful Soup used for parsing HTML/XML content that is not always standards-compliant.

Ssscrape was developed in the Information and Language Processing Systems (ILPS) department of the University of Amsterdam and is licensed under the LGPLv3. Ssscrape 1.0 requires Python 2.4 and is available for download as a tarball from the project page.

