HPC in the cloud
Cloud computing is most definitely here, but does it have a role in HPC? We discuss changes in HPC that cloud computing could address effectively.
I'm not a big Bob Dylan fan, but when it comes to HPC, "The Times They Are a-Changin'." Some of this change is going to the cloud, not because it would be really cool, but rather because HPC has existing workloads that fit well into cloud computing and can save money over traditional solutions. Perhaps more importantly, I think HPC has evolved to include non-traditional workloads and is adapting to meet those workloads – in many cases using cloud computing to do so. I will explain by giving two examples.
1. Massively Concurrent Runs
At one HPC center that I'm familiar with, users periodically submit 25,000 to 30,000 jobs as part of a parameter sweep; that is, they run the same application with 25,000 to 30,000 different data sets. Often the application is a Matlab script, a Python script, an R script, a Perl script, or something else that is essentially serial (i.e., it runs well on a single core). The same script is run with thousands of different input files, resulting in the need to run thousands of jobs at the same time. These applications often run fairly quickly – perhaps in a couple of minutes – and often do not produce a great deal of data.
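A sweep like this is often expressed as a scheduler job array, in which every task runs the same script and uses its array index to pick an input file. The following is a minimal sketch, assuming the Slurm scheduler (which sets SLURM_ARRAY_TASK_ID for each array task); the file naming scheme and the run_case stand-in are hypothetical and not taken from the center described above.

```python
import os
from pathlib import Path


def input_for_task(task_id: int, input_dir: Path) -> Path:
    """Map a job-array index to one input file (hypothetical naming scheme)."""
    return input_dir / f"case_{task_id:05d}.dat"


def run_case(path: Path) -> str:
    """Stand-in for the real serial application; it only reports its input."""
    return f"processed {path.name}"


if __name__ == "__main__":
    # Assumption: Slurm sets SLURM_ARRAY_TASK_ID for each array task.
    # Default to 0 so the script also runs outside a scheduler.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print(run_case(input_for_task(task_id, Path("inputs"))))
```

Because every task is independent and serial, the scheduler can start as many of them at once as there are free cores.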
A closely related set of researchers study operating systems and security, during which they run different simulations with different inputs. For example, they might run 20,000 instances of an OS, primarily a kernel, and explore exploits against that OS. As with the previous set of researchers, the goal is to run a huge number of simulations as quickly as possible to find new ideas about how to protect an OS and kernel. The run times are not very long, but they must run the tests against a single OS. Consequently, they run thousands of jobs at the same time and then look through the results and continue with their research.
What is important to both sets of researchers is to have all of the jobs run at nearly the same time so they can examine the results and then either focus on a small subset of the data with more granular input data sets or try yet more data sets (broaden the search space). Either way, these users want more detail or more options to examine, and the result is the need for more cores. This same HPC center has users asking for 50,000 and 100,000 cores to run their applications. The coin of the realm for these researchers is core count and not per-core performance.
Another interesting aspect of these researchers is that they don't run these massive job sets all of the time. They create the input data sets, create the job array, and then run it. Once the jobs are done, however, it takes time to process the output, understand it, and determine the next step. What is important to these researchers is to have all of the results before doing this post-processing. If the jobs cannot all run at nearly the same time, the researcher has to wait days for them to finish before post-processing can take place.
Getting more efficiency from the hardware is not the problem because faster hardware will only improve the research time a little bit. Reducing the run time from 120 seconds to 100 seconds wouldn't really improve their research productivity. What improves their productivity is to have all of the jobs run at the same time.
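A rough back-of-the-envelope model makes the point. The sketch below assumes an idealized scheduler that runs jobs in waves, one job per core, with no queue wait or startup overhead; the 1,000-core cluster size is a hypothetical figure chosen for illustration, whereas the job count and run times come from the numbers above.

```python
import math


def sweep_wall_time(n_jobs: int, seconds_per_job: float, cores: int) -> float:
    """Idealized wall-clock time: jobs run in waves of `cores` at a time."""
    waves = math.ceil(n_jobs / cores)
    return waves * seconds_per_job


n_jobs = 30_000
# (cores, seconds per job): baseline, faster hardware, and more cores.
# The 1,000-core figure is an assumption, not a number from the article.
for cores, secs in [(1_000, 120), (1_000, 100), (30_000, 120)]:
    total = sweep_wall_time(n_jobs, secs, cores)
    print(f"{cores:>6} cores, {secs} s/job -> {total:.0f} s")
```

Under these assumptions, faster hardware trims the sweep from 3,600 to 3,000 seconds, whereas having one core per job brings it down to 120 seconds.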
I originally thought this scenario was confined to my experience with a particular HPC center, but I was wrong. I've spoken to several people, and they all have similar workload characteristics with varying sizes (several hundred to 50,000 cores). Although this might not describe your particular workload, a number of centers fit this scenario, and this number is growing rapidly.
2. Web Services
Another popular scenario in HPC centers that I've seen is the increasing need for hosting servers for classes or training, for websites (internal and external), and for other general research-related computing in which the applications are not parallel or might not even be "scientific." I heard one person refer to this as "Ash and Trash computing," probably because it refers to running non-traditional HPC workloads; however, it's becoming fairly common.
Consider an HPC center with training courses or classes that need access to a number of systems. A simple example is a class in parallel computing with 30 students. The students might need many cores per person for their course, although they wouldn't be pushing the performance of the systems; however, the data center will need a number of systems for the class. If each student needs 20 cores, that's 600 cores just for a single course.
The need for dedicated web servers for research is also increasing. The websites they host go beyond the classic personal websites. Researchers want, and need, to put their research on a website that allows them to share results, interact with other researchers, and show their research. An increasing number of web-based research tools are available, such as nanoHUB and Galaxy. I know of one HPC center that has close to 20 Galaxy servers, each tuned to a specific research project.
HPC centers are discovering that it makes much more sense to handle these non-traditional workloads themselves. The reasons are varied, but in general, HPC centers understand research better than the departments that worry about mail servers, databases, and ERP applications. These enterprise computing functions are critical to the overall center, but research and HPC require a different kind of service. Moreover, HPC centers can react much more rapidly to requests than the enterprise IT department.
Time for a Change
HPC is being asked to adapt to new roles when it comes to the needs of researchers. These needs include applications that require a tremendous number of cores but not a great deal of performance, as well as applications such as web servers, classroom and training support, and web-based applications and tools that are not traditional HPC applications. These workloads fit into the HPC world much better than they fit into the enterprise world.
These changes are everywhere. They may not be a large force, and they might not be as pervasive in your particular HPC center, but they are happening and they are growing more rapidly than traditional workloads. Consequently, I've started to refer to this new generation of computing as Research Computing. If you like, research computing is a superset of traditional HPC, or traditional HPC is a subset of research computing. I also like to think of research computing as adding components, techniques, and technology for solving problems that traditional HPC cannot or might not solve. One of these technologies is cloud computing.