The ARM architecture – yesterday, today, and tomorrow

ARM Story

The ARM architecture has shaped modern computer history, and with the rise of mobile computing, ARM is more important than ever. We take a look at the ARM architecture and where it might be heading.

In many ways, what commentators are calling the mobile revolution is also the ARM revolution. The versatile, inexpensive, and energy-efficient ARM architecture is suddenly in the foreground as Linux and other operating systems move from the clunky desktop systems of the past to tiny and agile mobile devices. Even gigantic high-performance systems and enterprise-grade servers are adopting ARM processors to take advantage of their low cost, low energy consumption, and low heat output.

What is ARM, and how did it get here? How is it different from the x86 chips that so many associate with personal computing? In this article, we take a close look at the ARM architecture, provide a glimpse at why it is so attractive to hardware vendors, and describe some new ARM innovations that might figure prominently in the next generation of computers.

A Little History

The beginnings of the ARM architecture date back to the early 1980s, when the British computer manufacturer Acorn was searching for a new processor for its next generation of computers. The 6502 processor used previously was not powerful enough, and the alternative architectures on the market seemed unsuitable.

This prompted a team under the direction of Steve Furber and Sophie Wilson to develop their own architecture – the Acorn RISC Machine, or ARM for short. In the mid-1980s, the first finished products were used in coprocessor cards for the Acorn BBC Micro; then in 1987, the first ARM-only machines hit the market in the form of the Acorn Archimedes, which was followed by other machines in the years to come.

RISC

The RISC (Reduced Instruction Set Computer) philosophy involves making the instruction set as simple as possible and avoiding giving each instruction many addressing variants. The add instruction on a CISC machine can use register and memory contents, directly or indirectly, for both its operands and its result, which leads to a large number of possible combinations.

RISC machines do without this diversity: Arithmetic and logical instructions operate only on registers, and access to memory is handled by special load/store instructions. This makes programs larger, but it simplifies the processor architecture significantly. The additional overhead of loading operands into registers is only apparent: A CISC machine also has to fetch the two memory words before adding; this just happens internally and is hidden from the programmer. As a consequence, RISC machines usually have more freely usable registers than comparable CISC machines.

For a fictional machine, the process for adding the memory contents of M1 and M2 and storing the result in M3 might look like Listing 1.

Listing 1

CISC and RISC Addition

;CISC Addition
ADD M3 M1 M2 ; M1 + M2 -> M3
;RISC Addition
LD R1 M1     ; M1 -> R1
LD R2 M2     ; M2 -> R2
ADD R3 R1 R2 ; R1 + R2 -> R3
ST M3 R3     ; R3 -> M3
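
For comparison, the same sequence might look like the following in real ARM assembly – a sketch that assumes registers R4, R5, and R6 already hold the addresses of M1, M2, and M3:

LDR R1, [R4]      ; load the word at address M1 into R1
LDR R2, [R5]      ; load the word at address M2 into R2
ADD R3, R1, R2    ; R1 + R2 -> R3
STR R3, [R6]      ; store the result at address M3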

In the late 1980s, Acorn's architecture attracted Apple's interest; the company was planning to use it in a new type of mobile device. To make this possible, Acorn spun off the development of its architecture into a new company, Advanced RISC Machines Ltd. In collaboration with Apple, it developed the sixth variant of the ARM processor, which was then used in Apple's pen-based Newton handheld in 1992.

In the following years, other licensees followed, while Acorn's computer business steadily lost importance, with the company finally going under around the turn of the millennium. Since then, ARM Ltd. has continued to develop the architecture with great success, leading to new versions of the architecture and adaptations of the processor cores to various manufacturing processes.

Cores versus Architecture

The ARM architecture version is designated ARMvX and currently extends from ARMv1 to ARMv8. The ARMv1 (which originated in 1985) to ARMv6 (used in the first iPhones and Androids) architectures were implemented in the ARM1 to ARM11 cores.

With the change to ARMv7, the Brits also changed the names of the corresponding cores. Since then, there have been three series of Cortex cores, whose short forms also describe their applications: Cortex Rx (real-time applications: low latency, predictability, protected memory), Cortex Mx (microcontrollers: low transistor count, predictability), and Cortex Ax (application processors: high performance with low power consumption, optimized for multitasking).

Depending on the intended use, the cores are implemented very differently. From simple in-order cores with short pipelines to out-of-order execution, branch prediction, and speculative execution in modern application processors, they cover almost the entire range of acceleration techniques that x86 CPUs also use.

Licensing Models

The hardware – that is, the specific processors with these cores – is not manufactured and sold by ARM Ltd. itself. Instead, the business model relies on sales of intellectual property to licensees [1] such as Samsung, Broadcom, Freescale, or Calxeda. ARM Ltd. uses two strategies: On the one hand, it licenses the blueprints of cores – what it refers to as IP cores. This allows licensees to integrate the core unchanged into their own system-on-chip (SoC) design. Examples of this approach are found in many smartphones and tablets. For example, Samsung's Exynos 5 Dual chipset [2], which is in the Google Nexus 10, contains two Cortex A15 cores. On the other hand, the architecture itself can be licensed, which makes it possible to develop your own processor cores that are compatible with the licensed version of the architecture. One example is the Krait cores [3] by Qualcomm, which power Google's Nexus 4.

ARM Architecture

Like the x86 architecture, the ARM architecture has been extended over time to meet new requirements. Each version of the architecture defines which of these extensions are mandatory and which are optional. The extensions might not be as varied as with the x86, but the scope is still large (Figure 1), which is why this article focuses on the hitherto most-used architectures: ARMv4 through ARMv7. ARMv8, which you'll learn about later in this article, differs significantly from its predecessors with an extension to 64 bits.

Figure 1: The evolution of instruction set extensions at a glance. The extensions are partly mandatory and partly optional.

32 Bits for Everything

The ARM architecture was designed from the beginning as a 32-bit architecture, which is expressed in particular in the 32-bit processing width and 32-bit address space (from ARMv3 onward; 26-bit addressing before this). An ARM core thus addresses a maximum of 4GB of memory, although most implementations actually use only a part of it. Only the Cortex A15 avoids this limit with a few tricks.

As with other processor architectures, ARM has several processor modes. Ordinary programs run in user mode, and system mode is reserved for privileged operating system code. Other modes handle exceptions and, starting with ARMv7, hardware-assisted virtualization. One special feature of the ARM architecture is that each mode has its own banked registers, which the processor switches in automatically during a mode change. ARM systems can thus handle interrupts very efficiently.

Most implementations also have an MMU (Memory Management Unit) for storage virtualization and memory protection, but some only have an MPU (Memory Protection Unit) for the implementation of memory protection. Some very simple microcontrollers do without both.

The main difference between x86 and ARM is that ARM is a RISC architecture (see the "RISC" box), whereas the x86 is a member of the CISC (Complex Instruction Set Computer) family. ARM leverages the RISC concept to the max; it is a load-store architecture with a relatively large number of registers and a very small number of commands.

In combination with the restricted addressing modes, this means that all commands can be encoded with exactly 32 bits (i.e., one word) and aligned with word boundaries. The instruction decoder can thus be designed very simply: All it has to do for a command is read a word from memory and then decode it.

With an x86, the overhead is far greater because instructions here are between 1 and 15 bytes long, depending on prefixes and instruction set extensions. The processor thus has to determine, from the start of an instruction, how long that instruction will be – alignment at word boundaries is not possible.

Current x86 implementations solve this problem by breaking down the complex x86 instructions into simple RISC instructions (known as micro-operations). These steps are not necessary for an ARM processor, which reduces both the hardware overhead and energy consumption.

Conditional Instructions

The mode-specific registers lead to the need for a large number of registers (about 40), but in any mode, only 16 registers (R0 to R15) can be addressed directly (Figure 2). The programmer can use registers R0 to R12 freely, whereas R13 acts as the stack pointer in most cases, R14 is used as the link register for storing the return address for procedure calls, and R15 is the program counter, which the processor can also access directly, just like any other register.

Figure 2: The numerous ARM registers and their available processor modes.
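
A minimal sketch of these register roles in action, using a hypothetical subroutine named double:

main    BL    double       ; call: the return address is written to R14 (LR)
        ...                ; execution continues here after the return

double  ADD   R0, R0, R0   ; R0 = 2 * R0
        MOV   PC, LR       ; return by copying the link register into the program counter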

The instruction set avoids unnecessary redundancy, providing standard instructions for arithmetic and logical operations, memory access, program flow control, exception handling, controlling the various modes, and accessing coprocessors. The instructions themselves are no different from those of other architectures, so in this article, we just highlight a few features.

Unlike most other architectures, in which only branch instructions allow execution to depend on conditions, almost any ARM instruction is conditional. To allow this to happen, each instruction word contains a 4-bit condition field that specifies which combination of the condition flags (negative, zero, carry, overflow) must be satisfied for the instruction to execute, allowing for very compact code that avoids jumps (Listing 2).

Listing 2

Conditional Execution

; Euclid's GCD algorithm with conditional execution - no branch inside the loop body
gcd   CMP   R0, R1        ; compare R0 and R1 and set the condition flags
      SUBGT R0, R0, R1    ; executed only if R0 > R1: R0 = R0 - R1
      SUBLT R1, R1, R0    ; executed only if R0 < R1: R1 = R1 - R0
      BNE   gcd           ; repeat until R0 == R1, which then holds the result

In addition to the load and store instructions that transfer a single memory word between memory and a register, load-store multiple instructions transfer a series of contiguous words in memory to or from a set of registers. This means that the processor can load a small array of variables into registers with one instruction. This approach also lends itself to very effective use of the stack, because multiple registers can be stored or read at once – which is of particular interest when programming interrupt handlers or context switching in an operating system, since the entire register set can be replaced with just two instructions.
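
Two short sketches illustrate this (the register choices are arbitrary):

LDMIA R0, {R1-R4}         ; load four consecutive memory words starting at [R0] into R1-R4
STMFD SP!, {R0-R12, LR}   ; push R0-R12 and the link register onto the stack in one instruction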

The Special Bit

Another special feature of the instruction set is the S bit, which serves several purposes. For one, it provides more granular control over the condition flags. Normally, a processor updates the condition flags after each instruction; for example, if the result of a computation is 0, it sets the zero flag. On ARM, this only happens if the S bit is set, so the condition flags can be preserved across intermediate computations.
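
A countdown loop is a typical example – a minimal sketch:

        MOV   R0, #10      ; initialize the loop counter
loop    ...                ; loop body: instructions without the S bit leave the flags untouched
        SUBS  R0, R0, #1   ; decrement and update the flags (S bit set)
        BNE   loop         ; repeat while the result is not zero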

The S bit can also be used to control the processor modes and their banked registers: In privileged modes, setting the S bit on a load or store multiple instruction gives access to the user mode registers. And if an instruction with the S bit set writes to the program counter, the processor automatically switches back to the previous mode. In combination with a load multiple, a programmer can thus implement a very elegant return from an interrupt.
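
A sketch of such a return, assuming the handler saved R0-R12 and the adjusted return address on entry:

LDMFD SP!, {R0-R12, PC}^   ; restore the registers and load the PC; the ^ (S bit)
                           ; also copies SPSR back to CPSR, restoring the previous mode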

Neon, Thumb, Jazelle

The ARM design supports easy extensibility through up to 16 coprocessors, which are controlled by special ARM coprocessor instructions and provide, for example, floating-point computations. If no coprocessor responds to an instruction of this kind, an exception is raised, which allows simple emulation in software. Other notable and commonly used extensions are the memory management unit and the Neon media processing unit.
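
Coprocessor access uses dedicated MRC/MCR instructions; a small sketch using CP15, the system control coprocessor:

MRC p15, 0, R0, c0, c0, 0   ; read the Main ID register from coprocessor 15 into R0
MOV R1, #0                  ; the register value is ignored for the following operation
MCR p15, 0, R1, c7, c5, 0   ; write to coprocessor 15: invalidate the entire instruction cache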

Besides the actual ARM instruction set, most ARM processors support up to three other instruction sets. First, there are now two versions of Thumb mode, which achieve higher code density through the use of 16-bit instructions.

Whereas the first version of Thumb accessed only half of the registers and had to switch back to the ARM instruction set to handle exceptions, its successor, Thumb 2, allows 16- and 32-bit instructions and removes most of these restrictions, achieving performance comparable to ARM mode despite the higher code density. Additionally, the Jazelle instruction set provides hardware acceleration for Java bytecode; however, ARM has reduced the scope of this support in recent versions [4].
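
Switching from ARM to Thumb code is done with a branch-and-exchange instruction; a sketch of the classic pattern (thumb_func is a hypothetical routine assembled as Thumb code):

LDR  R0, =thumb_func   ; load the address of the Thumb routine
ORR  R0, R0, #1        ; set bit 0 to request Thumb state (toolchains often set this automatically)
BX   R0                ; branch and switch instruction sets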

Multicore

When the clock speed is increased, energy consumption grows faster than computing speed; thus, energy efficiency decreases as clock speed increases, which is a problem for mobile devices in particular. The use of multiple cores partly solves the problem: They achieve the same number of computations per unit of time at a lower clock speed and therefore with better energy efficiency. Multicore CPUs are thus interesting for the ARM architecture.

A variety of solutions come both from ARM itself – for example, the ARM11 MPCore with up to four cores – and from architecture licensees. Two of the biggest challenges in multicore CPU design are cache coherence and interrupt distribution. For the ARM11, ARM offers only a single IP block that contains the cores, the cache coherency logic, and the interrupt distributor.

Starting with the Cortex A9, ARM has sold these components as separate IP blocks to give SoC designers more freedom of design. For holders of an architecture license, this freedom is even greater, but the details are generally documented sparsely in public, if at all.

Big-Little

Because both mobile and desktop devices rarely need their full computing power, power-saving mechanisms are used that x86 CPUs also employ in simpler form: The CPU clock speed can be lowered for better energy efficiency, and individual cores can be switched off when they are inactive. Because energy consumption matters even more for small mobile devices than for PCs and laptops, both ARM and the SoC manufacturers offer further options.

The Big-Little solution developed by ARM is based on the fact that ARM offers CPU cores in different performance classes that all support the same instruction set. A Big-Little design combines a block of cores with high processing power but lower energy efficiency with a block of cores that offer lower processing power but greater energy efficiency (Figure 3). The model assumes that most mobile devices rarely demand maximum performance from the processor; temporarily using only the slower but more energy-efficient cores therefore yields significant savings and extends battery life.

Figure 3: The curve schematically shows the energy consumption of Big-Little in relation to performance.

Energy Efficiency

Using such an energy-efficient design requires software support, and there are basically three approaches. The first approach uses the virtualization function of the system-on-chip: A hypervisor migrates all the computations from one block to the other as the load changes and switches off the unused block completely.

This approach works flawlessly on Android systems. But it does not always make sense to migrate all running programs from the smaller to the larger cores, and vice versa. In many cases, processes generate only a small load, or the processes are non-time-critical background processes that do not justify the higher energy consumption of the larger cores.

The second approach therefore migrates applications only between individual core pairs. Linux uses the existing frequency scaling infrastructure (keyword: cpufreq) by grouping the frequency ranges of the small and the large core into virtual frequency ranges. Low virtual frequencies are consequently mapped to the small core, and higher ones to the large core. The operating system then automatically migrates an application to the core in whose range the selected frequency lies.

This approach also has disadvantages. On the one hand – in high-load situations – the system does not use all the physically available cores; on the other hand, the approach (like the one presented earlier) only makes sense with an identical number of small and large cores.

In fact, processors do not always have an identical number of small and large cores; ARM's Big-Little development prototype, for instance, has three Cortex A7 cores and two Cortex A15 cores (Figure 4). In this case, a third approach is more useful: Make all the cores accessible at the same time, simply letting the Big-Little SoC work as an ordinary multicore processor.

Figure 4: Architecture of the Big-Little test prototype with three Cortex A7 cores and two Cortex A15 cores.

This approach initially seems much easier than the first two, but it hides a problem that is much harder to solve: When an operating system distributes its applications across all the cores, it assumes – in this case wrongly – that all cores have the same computing power. This can cause the system to assign an unimportant background application to a fast core and an important foreground application to a slow one. Support for such asymmetric processor architectures in operating systems is not a trivial problem, which is why there is currently no market-ready solution for Linux [5].

Plans for Big-Little

Various SoC manufacturers have announced implementations of the Big-Little architecture, but currently only Renesas Mobile and Samsung have concrete plans. Renesas Mobile intends to launch an SoC with two Cortex A7 and two Cortex A15 cores on the market this year, and Samsung has the Exynos 5 Octa chipset, which powers the new Galaxy S4 smartphone in some regions of the world. The Exynos 5 Octa is an eight-core processor with four Cortex A7 and four Cortex A15 cores; however, until the latter part of 2013, it could only use four cores at a time. Now, with "heterogeneous multiprocessing," the Octa can use all eight cores at once.

NVidia is taking a different approach with its Tegra 3 SoC, which features four Cortex A9 cores, and its Tegra 4, which has four Cortex A15 cores. In addition to its four main cores, each SoC has one extra Cortex A9 or Cortex A15 core, respectively. This core, dubbed the Companion Core by NVidia, is built from different transistor types that require significantly less energy per cycle.

At the same time, the Companion Core is limited to a maximum clock speed far below that of the other cores. In operation – in a style similar to the Big-Little approach – applications are only migrated to the Companion Core if just one main core is active and the load falls below a predefined threshold.

It is hard to say which power-saving mechanism is best, because this depends on factors such as the application scope, the architecture of the rest of the system, and efficient support by the operating system. The latter in particular is still in its infancy. All told, these mechanisms, combined with the energy-efficient architecture, mean that modern ARM SoCs are more energy efficient than most x86 processors of comparable performance.

The Future: 64 Bits

History tends to repeat itself: In the 1990s, the makers of large Unix servers ran up against the 4GB limit that a 32-bit machine can address directly. As a result, 64-bit architectures such as Sun's UltraSPARC were developed. Shortly after the turn of the millennium, x86 machines faced the same problem, which led to the development of x86-64.

When AMD launched x86-64, this was a real extension of the existing 32-bit architecture, which in turn was based on a 16-bit architecture – the older versions are all subsets of the newer ones. Although this approach enables easy recycling of existing knowledge and code, it rules out profound architectural changes.

ARM systems are now at the same point: For one thing, small devices are being equipped with more and more memory, which attracts applications that benefit from it; for another, the ARM architecture is becoming increasingly attractive for servers. For both servers and small devices, the 4GB limit is increasingly becoming an obstacle.

Newer ARMv7 cores, such as the Cortex A15, only partially work around the problem: The Large Physical Address Extension (LPAE) allows up to 1TB to be addressed physically (in a style similar to PAE in x86), but this does not affect the limitation to 4GB of virtual memory per thread. The radical solution is to switch to a 64-bit architecture.

In 2011, ARMv8 defined two architectures – Aarch64 and Aarch32 – with the corresponding A64 and A32 instruction sets (plus Thumb 2, as T32, in Aarch32). Aarch32 and A32 are backward compatible with ARMv7 (but not vice versa), whereas Aarch64 uses the completely new A64 instruction set. Thus, in contrast to x86, there was no need to rely on gaps in the old instruction set or to craft new commands with complicated prefix structures – ARM could design a clean instruction set.

For example, all A64 instructions are still encoded in exactly 32 bits, even though the number of registers has doubled: X0 through X29 are freely usable, X30 serves as the link register, and register number 31 reads as a hard-wired zero register (XZR) in most instructions. The program counter, which appears in the A32 instruction set as R15, is now a special register that can only be accessed via dedicated instructions.
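
A few A64 lines illustrate the new register names and the zero register – a minimal sketch:

ADD  X0, X1, X2     ; 64-bit addition with the new X registers
ADD  W0, W1, W2     ; the W names address the lower 32 bits of the same registers
SUBS XZR, X3, #0    ; discard the result into the zero register; only the flags remain
                    ; (the assembler alias for this is CMP X3, #0)
RET                 ; return: branches to the address in the link register X30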

The instructions themselves are essentially based on their A32 counterparts, but some architectural changes have made adjustments necessary. The biggest of these adjustments relates to the processor modes and the mode-specific registers: With the eight modes of the later A32 versions, only a small portion of the many registers is visible at any one time, and dealing with the various exception modes is awkward. In A64, ARM therefore introduces a greatly simplified model with four exception levels (Figure 5) that is reminiscent of the rings in the x86 architecture.

Figure 5: ARMv8 processor modes and their compatibility with 32-bit software. The exception levels are strongly reminiscent of the rings in x86.

The lowest level, EL0, is intended for applications (user mode), whereas EL1 is equivalent to the old system mode and executes the privileged parts of an operating system. EL2 is assigned to hypervisors, and EL3 is part of ARM's Trust Zone security concept, that is, it runs the security monitor (see the box "Encrypted ARM Processors"). Each of these levels has only three private registers: a link register for exceptions, the stack pointer, and the saved status register.

Encrypted ARM Processors

In addition to complex, virtualization-based security concepts such as Trust Zone (Figure 5), high-performance ARM processor versions have for some years supported fully encrypted booting. In practical terms, the processor draws on OTP (one-time programmable) registers that store a key used to encrypt or sign the bootloader. This approach ensures that the program you are running really does come from the developers or manufacturers.

Freescale [7] offers processors with these features under the Vybrid and i.MX28 brands. The LPC3143 from NXP [8] has been on the market even longer. The Picosafe [9] open source project offers a complete tool chain that supports a fully encrypted boot process, from the bootloader to the root file system. (Benedikt Sauter)

Conditional execution of all instructions no longer exists in A64. Although this allows for very elegant code, it greatly complicates the implementation of an out-of-order machine. This is also why load-store multiple instructions have had to give way to simpler instructions that only support loading or saving two registers.
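
In A64, a conditional select takes over the most common use of predicated instructions, and register pairs replace load-store multiple – a minimal sketch:

CMP  X0, X1                  ; set the flags
CSEL X2, X0, X1, GT          ; X2 = (X0 > X1) ? X0 : X1 - without a branch
STP  X29, X30, [SP, #-16]!   ; push the frame pointer and link register as a pair
LDP  X29, X30, [SP], #16     ; pop them again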

Minor changes are the result of practical experience with the 32-bit architecture. There are two virtual address spaces of 2^48 bytes (256TB) each – one for the applications, starting at address 0, and one for the operating system kernel, growing downward from the top of the 64-bit address space (2^64). Virtual addresses are mapped to physical addresses with four-level (4KB pages) or three-level (64KB pages) page tables, depending on the page size. Addressing itself continues to follow the load-store approach, but the addressing modes have been adjusted to simplify computations.

As in x86-64, where 32-bit applications run on a 64-bit system, ARMv8 also provides for compatibility. When taking an exception and when returning from one, the processor can switch between Aarch64 and Aarch32 and maps the registers of one state onto the other; 32-bit accesses only address the lower half of each register (Figure 5). Thus, it is possible to run Aarch32 applications on an Aarch64 operating system and to run guests with both architecture versions side by side on an Aarch64 hypervisor [6].

Linux and ARMv8

Although ARMv8 was introduced back in 2011 and the first IP cores are already available in the form of the Cortex A53 and Cortex A57, no implementations existed beyond simulators and FPGA-based prototypes until 2013. In September 2013, Apple introduced an ARMv8 chip – the Apple A7 SoC – with its new iPhone 5S. Other manufacturers have announced products that will probably launch later this year or early next year.

Unfazed by the lack of hardware, developers are already working very actively on software for the new architecture: Code for Aarch64 can easily be generated with current versions of GCC, and the Linux kernel has supported the architecture since version 3.7. When the first 64-bit ARM machines hit the market, Linux will be ready for them.

Infos

  1. ARM licensees: http://www.arm.com/products/processors/licensees.php
  2. Samsung's Exynos 5 Dual: http://www.samsung.com/global/business/semiconductor/minisite/Exynos/products5dual.html
  3. Qualcomm's Krait cores: http://www.qualcomm.com/snapdragon/processors#CPU
  4. Furber, Steve. ARM System-on-Chip Architecture, 2nd edition. Pearson, 2000
  5. Robin Randhawa's whitepaper on Big-Little: http://arm.com/files/downloads/System_Software_for_b.L_Systems_Randhawa.pdf
  6. "ARM Goes 64-bit" by David Kanter: http://www.realworldtech.com/arm64/
  7. Vybrid and Imx28 by Freescale: http://www.freescale.com
  8. LPC3143 by NXP: http://www.nxp.com
  9. Picosafe: http://www.picosafe.de

The Author

Jan Richling is a visiting professor for operating systems and embedded systems at the Technical University of Berlin (TU Berlin). Anselm Busse is a research assistant and PhD student in the field of communication and operating systems at the TU Berlin. Their research focuses in part on increasing the energy efficiency of many-core systems.