Tails of the maddog: Knowledge of Architecture Does Matter
Paw Prints: Writings of the maddog
People seemed to like my blog yesterday about how the knowledge of assembly and machine language improved my programs, or the programs of people around me.
Today I would like to show people how simply understanding a little about the architecture of the machine and operating system, even without knowing assembly language, can improve program performance. Likewise the study of algorithms and computer techniques.
When I was at Aetna Life an Casualty in 1975, some of the first computer kits were emerging. My boss, who had his Masters in Computer Science and taught compiler design and operating system architecture at the Hartford Graduate Center, bought one of those kits and was assembling it, even though he had no background in electrical engineering or electronics. Since I had a little of both electronics from high school and electrical engineering from university I was helping him from time to time, particularly when the circuit diagram did not seem to match the instructions which did not match the printed circuit board. I would go over to his house and we would puzzle it out, as usually two of the three matched (and the printed circuit board was usually one of the correct items). Our only tools for analysis were an ancient Vacuum Tube Voltmeter (VTVM) and a logic probe. [Editors' Note: about a hundred old Electrical Engineers just wistfully gave a sigh at the loving mention of a VTVM].
One night my boss had finished the project and invited me over to see him turn the computer on (and I am sure to be there in case it burst into flames). We broke open a couple of beers. His system stored and read data from an audio tape recorder, and he read in the combination editor, assembler, debugger into the machine. Then he typed in the supplied test assembly language program, and said “assemble”. The screen flickered for a fraction of a second and stopped. My boss cursed. He actually cursed quite a bit, but only when powers of authority such as his wife were not around.
I said “What is wrong”? He said that the program did not work, as the computer screen only flickered for a fraction of a second and stopped. I said:
“You have taught compiler design, which includes assembler design. How many instructions do you think would be needed to assemble that tiny program you typed in?” He thought a second and said “Maybe 100,000 instructions”. “And how fast is the processor?” I asked. “About 100,000 instructions a second,” he replied....and then the dawn of light hit him. As he stepped through memory using the built-in debugger, he saw the assembled binary, ready to be written out to the tape.
I told him he was too used to an IBM mainframe, shared by a hundred other people, having to read and write data to impossibly slow disks. “You really don't know how fast a computer can be...” Then I drank the rest of his beer.
After that, my boss treated me with new respect, both as a programmer and a drinker.
Later I was teaching at Hartford State Technical College, and we had a Digital Equipment Corporation (DEC) PDP-11/70 computer that ran a time-sharing system called RSTS-E. RSTS (pronounced RisTis) was a nice little system, programmed in a language called BASIC-PLUS. We had thirteen VT100 CRT terminals attached to it as well as a paper “DECwriter” center console terminal, 64KB of core (yes, core again!) memory and an RP06 disk drive than held about 180MB of data. We also had a magnetic tape drive to allow the students to buy small tapes and back up or store their programs long term, and a line printer. The CPU executed about 1,000,000 machine language instructions a second. It was a nice system for a small two-year technical college in 1977.
The computer was also used by the administration for holding grades (I know, I know, bad idea for security, but it was all we had) and doing other administrative programs.
The school had a union for the teachers, and the union had a policy that no teachers would write administrative code, so the school hired a former student to write some programs for them.
The student wrote one program and started to run it. When the program was running the system would come to a crawl and the disk drive would start to shake....for hours at a time. Ten and one-half hours, to be exact, and the students could hardly get their work done.
Finally I asked the programmer what he was doing during that time. He replied that he was sorting 1206 records, each of which was thirty-two bytes long. I mumbled that he could punch the numbers out onto cards, throw them up in the air and pick them up from the floor in order faster than his program was sorting them, but he insisted this was all that his program was doing.
After a while I decided to look at what he was doing. His so-called code will forever haunt me.
He was doing a bubble sort. It was the only sort he had ever learned, but he had not learned it very well. He never tested to see if the file was already in order. Nor did he ever test to see if he had made one pass and not swapped anything (a test to see if the array is in order). Nor did he realize that after one pass at least one element (the last) is now in its proper place and he never has to look at it again...allowing the working part of the array to get shorter and shorter as the sort progressed. He always compared every element, on every pass, all the way to the end....about 1,454,436 comparisons.
It gets worse, much worse, and here is where the architecture comes in....
The PDP-11/70 had a 64KB data address space. His 1206 thirty-two byte records took up slightly more space than that when all the rest of the data and buffers were added in. Therefore he had to use a facility in RSTS-E called a “virtual array”, which mapped an array onto a file in the system, and to the programmer would look just like any other array. The system had one 512 byte block where it would read the data in, allow the programmer to access the variables, then automatically write the array back out to the disk if the programmer was trying to access some other variable. If no data had changed in the array, it was not written back to the disk.
Unfortunately the programmer did not REALLY understand how this buffering worked, so as he went through his array swapping 32-byte elements he would access the variable on the next block (which was not in memory), read that block into memory, compare the variable on that block to the variable from the previous block which was no longer in memory, so that block had to be read back into memory to access that variable. Then if the variables were to be swapped he would write the variable to the block that was no longer in memory, causing it to be read back in, then change the variable on the previous block, which had to be read in to be updated (but first the previous block had to be written back out), then the other block had to be read back in, and the variable updated and written back out as he went on to read the rest of the data in the next block.
Did you follow that? It was why the disk tried to dance across the floor. Approximately 700,000 accesses to a 180 MB disk to sort about 39KB of data.
As I said, I could not correct the program, but I could get one of my students to do it. So I grabbed a bright freshman student who had done programming on a KIM-1 single board computer in BASIC and said:
“Kid, you are about to take a quantum leap in computing skills. I am about to teach you what a tree sort is, a sort/merge is, and what recursive programming can do.”
And I sat down and showed him how to read the file in and break it down into seven smaller files (RSTS-E could have eight files open at one time), then take each file one at a time and build a sort tree in memory, and write the file out in order. The final step was to open up seven input files and put the sorted records out into one final sorted output file.
Because we used recursion to build the tree, and later to follow it, each part of that program was only about three lines long. It was deceptively simple.
My student wrote the first part of the program, to create the seven intermediate files. Then he created the second part of the program, to build the tree in memory.
He called me in to watch it run. He typed the name of the command, and almost instantly the system said “Ready”. The student cursed (he was a good kid, but was sometimes a bit passionate....).
“What is wrong?” I asked. He said that the program did not work. “Show it to me,” I said, and he listed the program for me. BASIC-PLUS was an interpreter, so the listing and the data were still in memory. I looked at the listing and said “It looks like it worked to me, let's look at the data in memory”, and we stepped through the array in memory, which was (of course) built and sorted properly. The student cursed again, but this time in wonder and joy. I told him “You really don't know how fast a computer can be....”
He finished the program, and the act of sorting those 1206, thirty-two byte records dropped from 10.5 hours (630 minutes) to under three minutes, on the same hardware with the same OS. And when the program was running none of the students were noticeably impacted.
We could have made the program run faster, but a 210 times performance improvement was “good enough”, and my student and I had some beer to drink down at the local pub.
My final story for this blog entry is not mine, but a friend of mine who was instrumental on the port of Linux to the Alpha processor. David Mossberger did a lot of really good work on the Linux/Alpha port, and after the port was done David told me that he did a study of multiplying two really big arrays together.....REALLY BIG ARRAYS.
Now the average programmer would go to their algebra book, find the formula to multiply the arrays, create a couple of loops and probably get the solution in a time that would not be embarrassing. David was not the average programmer.
Instead David knew that the two arrays would be stored in memory the same way, but while the algorithm would be very kind with the first array, going from element to element, with most of the elements being found in the cache of the CPU, the second array would cause a cache miss on almost every element due to the size of the array.
So David took the time to invert the second array, then invert the intermediate answer, which gave the same final result. And the multiplication went forty times faster on the Alpha, because the second array (when inverted) had a minimum number of cache misses. This forty-time improvement included the time of inverting the second array and inverting the answer. As I said, they were REALLY BIG ARRAYS.
David reported that on an Intel system the speedup was only ten times, since the Intel processor did not have as much cache as the Alpha.
A lot of people, of course, do not have such large arrays to multiply, nor does one iteration of their program take more than a few seconds, but some people need to have an answer back in a shorter period of time (we call this “soft real time”, or in some cases “gaming”) and knowledge of the architecture may mean that you kill the monster before he kills you.
One final note on these examples, which are all true. My student put the synopsis of his work on his resume in the form “Made a program work 210 times faster on the same system.” and later informed me that nobody who interviewed him believed that was possible.
“It is because they really did not know how fast computers can be,” I replied.comments powered by Disqus
Customers can take a free test drive of SLES for HPC on the Azure Cloud
San Francisco-based chip company announces their first fully open source chip platform.
The whole distro gets rebuilt on glibc 2.3
Ubuntu Vendor tries to solve app packaging and distribution problem across distributions.
Founder of ownCloud launches the Nextcloud project.
Will The Machine change the way future programmers think about memory?
The new Torus distributed storage system is available under an open source license on GitHub
Juries decides Google’s use of Java APIs Was Fair Use
But if you are not using the latest Linux kernel, your system is insecure.
Home routers will give room for custom firmware but still comply with FCC rules