Sorting and Searching
Doghouse – Algorithms and Books
A look at the history of computer memory and a classic algorithm text.
When I started programming in 1969, even mainframe computers from IBM might have had only 1MB of main memory, two or three disk drives measured in hundreds of megabytes, and multitrack tape drives to supplement that storage. Most "parallelism" in programming amounted to keeping the data you needed flowing into main memory while also getting the processed data out to "stable storage," either on disk or directly to magnetic tape.
Some of the first IBM computers I programmed used an operating system called MFT (Multiprogramming with a Fixed number of Tasks), which had up to 16 "partitions" in memory, each of a fixed size set when the operating system was configured.
When a job was started by the operators, the Job Control Language (JCL) told the operating system how much main memory the job would take, and the job was slotted into a partition large enough to hold it. If the creator of the JCL specified too much memory, memory was wasted. Too small an amount, and the program might not finish, or even start.
Later the operating system was changed to Multiprogramming with a Variable number of Tasks (MVT), which allowed the programmer (who typically was the one who generated the JCL) to specify the exact amount of memory a job's region needed. The problem with MVT was that it could leave a series of small, unusable holes in the physical main memory of the system, and a larger program could run only when several small holes were joined together, a rather crude form of garbage collection.
If this sounds a bit complex, and even unbelievable, in the day of demand-paged virtual memory systems, I will throw in the additional information that the address space of these IBM mainframes was originally only 24 bits (instead of 32 or 64), meaning that any program could address a maximum of 16 million bytes of main memory at a time.
Of course, while these machines were incredibly slow and logically small by today's standards, we thought they were "magic" when we were programming them.
Still, we gave great thought to how much storage was needed for each piece of data and how much computation would be given to each calculation. Sometimes this worked against us in the long run, and some of these decisions became pretty famous, such as the "Y2K" problem, where we thought two digits, not four, were enough for the date.
Many people will remember that as we moved toward the year 2000, panic started to erupt about what would happen near or at midnight of December 31, 1999, as we turned to the next century. Fortunately people started thinking about this early enough and working to negate the effects.
However, some years earlier, a series of books named The Art of Computer Programming, by Donald E. Knuth, was being published. Volume 3 of that work, Sorting and Searching, was first published in 1973, when I started programming IBM mainframes, and slightly before I started my master's degree in computer science.
The algorithms in the book carefully explained the different methods of sorting and searching and how they could take advantage of the levels of storage, from disk to CPU registers and everything in between (main memory, cache, etc.). The book pointed out that with the 24-bit address space limitations of that day (often 16-bit or less in minicomputers), hash tables and hash searching techniques were often inefficient due to the number of collisions the search might have.
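The collision problem is easy to demonstrate: a hash table squeezed into a small address space has few buckets, so keys pile up and every lookup does extra probing work. Here is a minimal sketch (the keys and table sizes are invented for illustration; it relies on CPython hashing small integers to themselves, which keeps the counts deterministic):

```python
def collisions(keys, table_size):
    """Count insertions that land in an already occupied bucket."""
    occupied = set()
    count = 0
    for k in keys:
        slot = hash(k) % table_size   # CPython: hash(small int) == the int itself
        if slot in occupied:
            count += 1                # a collision: extra probing/chaining work
        occupied.add(slot)
    return count

keys = range(0, 5000, 7)              # 715 keys, chosen arbitrarily
print(collisions(keys, 512))          # cramped table: 203 collisions
print(collisions(keys, 8192))         # roomy table: 0 collisions
```

With only 512 buckets, 715 keys cannot avoid colliding; give the same keys 8,192 buckets and the collisions vanish, which is exactly the kind of space-for-time trade a small address space forbids.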
However, it is the combination of knowing the efficiency of the algorithm, the size of the data to be sorted, and the architecture of the machine that lets you avoid drastic consequences.
I once took on a program that was supposed to sort 1,206 32-byte records on a PDP-11/70 computer (with a 64KB data address space) running the RSTS/E operating system; the sort took 10.5 hours to finish. The program used a bubble sort, which would not have been so bad if all the data could have been held in main memory, but, unfortunately, it used a virtual array that was implemented on disk. The program made over 13 million comparisons and 700,000 accesses to the disk.
Rewriting the program using a tree sort and a sort/merge as defined in Knuth's book broke the 1,206 records into seven files of 173 records, each of which could reside and be sorted in main memory. Finally, I merged the seven sorted files together into the resulting output file. The entire time was reduced from 10.5 hours to three minutes.
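The rewrite described above is, in essence, an external merge sort: sort runs small enough to fit in memory, then merge the sorted runs into one output. A minimal sketch, with the run size of 173 echoing the numbers in the story (the lists here stand in for the on-disk files, and a library sort stands in for the tree sort):

```python
import heapq

def external_sort(records, run_size):
    """Sort more records than fit in 'memory' by sorting
    fixed-size runs, then merging the sorted runs."""
    # Phase 1: break the input into runs small enough to sort in memory.
    runs = []
    for i in range(0, len(records), run_size):
        runs.append(sorted(records[i:i + run_size]))  # in-memory sort of one run
    # Phase 2: k-way merge of the sorted runs; heapq.merge streams them
    # so only the front of each run needs to be "in memory" at once.
    return list(heapq.merge(*runs))

# 1,206 keys split into runs of 173, mirroring the seven-file split
data = list(range(1205, -1, -1))      # worst case for a bubble sort: reverse order
result = external_sort(data, 173)
print(result == sorted(data))         # True
```

The win comes from the access pattern: each record is read and written only a few times sequentially, instead of the millions of scattered disk accesses a bubble sort over a disk-backed array generates.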
Knuth is not the only good author on algorithms, but the depth of his analysis makes these books far more timeless than most popular computer books.