Sorting and Searching
Doghouse – Algorithms and Books
A look at the history of computer memory and a classic algorithm text.
When I started programming in 1969, even mainframe computers from IBM might have had only 1MB of main memory, two or three disk drives measured in hundreds of megabytes, and multitrack tape drives to supplement that storage. Any parallelism in programming was typically aimed at keeping the data you needed flowing into main memory while simultaneously getting the processed data out to "stable storage," either to disk or directly to magnetic tape.
Some of the first IBM computers I programmed used an operating system called Multiprogramming with a Fixed number of Tasks (MFT), which had up to 16 "partitions" in memory, each of a fixed size set when the operating system was configured.
When a job was started by the operators, the Job Control Language (JCL) told the operating system how much main memory it would take, and the job was placed into a partition large enough to hold it. If the creator of the JCL specified too much memory, memory was wasted. Too small an amount and the program might not finish, or even start.
Later the operating system was changed to Multiprogramming with a Variable number of Tasks (MVT), which allowed the programmer (who typically was the one who generated the JCL) to specify the exact size of one of the 16 possible slots. The problem with MVT was that it could leave a series of small, unusable holes in the physical main memory of the system, and only when several small holes were joined together could a larger program run, a rather crude form of garbage collection.
If this sounds a bit complex, and even unbelievable, in this day of demand-paged virtual memory systems, I will throw in the additional information that the address space of these IBM mainframes was originally only 24 bits (instead of 32 or 64), meaning that any program could address a maximum of only 16 million bytes of main memory at a time.
Of course while these machines were incredibly slow, and logically small, by today's standards, we thought they were "magic" when we were programming them.
Still, we gave great thought to how much storage was needed for each piece of data and how much computation would be given to each calculation. Sometimes this worked against us in the long run, and some of these economies became pretty famous, such as the "Y2K" problem, where we thought two digits, rather than four, were enough for the year.
Many people will remember that as we moved toward the year 2000, panic started to erupt about what would happen near or at midnight of December 31, 1999, as we turned to the next century. Fortunately people started thinking about this early enough and working to negate the effects.
However, some years earlier, a series of books named The Art of Computer Programming, by Donald E. Knuth, was being published. Volume 3 of that work, Sorting and Searching, was first published in 1973, when I started programming IBM mainframes, and slightly before I started my master's degree in computer science.
The algorithms in the book carefully explained the different methods of sorting and searching and how they could take advantage of the levels of storage from disk to CPU registers and everything in between (main memory, cache, etc.). The book pointed out that with the 24-bit address space limitations of that day (often 16 bits or less on minicomputers), hash tables and hash searching techniques were often inefficient due to the number of collisions the search might encounter.
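To illustrate the collision problem, here is a minimal Python sketch of my own (not from Knuth's book): a chained hash table whose bucket count is kept tiny, the way tight address spaces once forced. By the pigeonhole principle, inserting 100 keys into 16 buckets guarantees at least 84 collisions.

```python
# Minimal chained hash table, deliberately small to show why
# memory-constrained tables of the era collided so often.
class SmallHashTable:
    def __init__(self, size):
        self.size = size
        self.buckets = [[] for _ in range(size)]
        self.collisions = 0

    def insert(self, key, value):
        bucket = self.buckets[hash(key) % self.size]
        if bucket:                    # bucket already occupied: a collision
            self.collisions += 1
        bucket.append((key, value))

    def lookup(self, key):
        # Every collision lengthens this linear scan of the bucket.
        for k, v in self.buckets[hash(key) % self.size]:
            if k == key:
                return v
        return None

# 100 keys into 16 buckets: at least 100 - 16 = 84 collisions.
table = SmallHashTable(16)
for i in range(100):
    table.insert("record-%d" % i, i)

print(table.lookup("record-42"))   # → 42
print(table.collisions)            # at least 84
```

Each collision turns an ideally constant-time lookup into a linear scan of a bucket, which is exactly why small tables eroded hashing's advantage.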
However, it is the combination of knowing the efficiency of the algorithm, the size of the sort, and the architecture of the machine that can avoid drastic consequences.
I once took a program that was supposed to sort 1,206 32-byte records on a PDP-11/70 computer (with a 64KB data address space) running the RSTS/E operating system; it took 10.5 hours to finish the sort. The program used a bubble sort, and if all the data could have been held in main memory that would not have been so bad, but, unfortunately, it used a virtual array that was implemented on disk. The program made over 13 million comparisons and 700,000 accesses to the disk.
Rewriting the program using a tree sort and a sort merge as defined in Knuth's book broke the 1,206 records into seven files of 173 records, each of which could reside and be sorted in main memory. Finally, I merged the seven sorted files together into the resultant output file. The entire time was reduced from 10.5 hours to three minutes.
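The same plan, sort runs that fit in memory and then merge them, can be sketched in a few lines of Python. This is my own in-memory stand-in (lists play the role of the seven disk files, and Python's built-in sort replaces the tree sort), not the original PDP-11 code:

```python
import heapq

def external_sort(records, run_size):
    """Sort a dataset 'too large for memory' by sorting run_size-record
    runs independently, then doing a k-way merge of the sorted runs."""
    runs = []
    for i in range(0, len(records), run_size):
        run = sorted(records[i:i + run_size])  # each run fits in memory
        runs.append(run)                       # in practice: a temp file on disk
    # heapq.merge streams the runs together, reading each sequentially,
    # which is what makes the disk-access pattern so much cheaper.
    return list(heapq.merge(*runs))

data = [5, 3, 8, 1, 9, 2, 7, 6, 4, 0]
print(external_sort(data, 3))   # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The win comes from the access pattern: each run is read and written sequentially once, instead of the bubble sort's millions of scattered comparisons against a disk-backed array.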
Knuth is not the only good author on the study of algorithms, but the depth of his analysis makes these books far more enduring than the editions of other popular computer books.