Caylent Catalysts™
AWS Graviton Migration Strategy
Evolve beyond clock speed and core count comparisons and realize real-world performance for modern cloud workloads.
Explore how CPUs execute code—examining elements such as clock cycles, pipelines, instruction decoding, and branch prediction—and how these architectural details influence cost, performance, and energy efficiency.
You may have heard that Arm is taking over the cloud and that AWS Graviton instances are often cheaper than the alternatives. But why is that the case? In this article, we will connect the dots between how a central processing unit (CPU) actually runs your code, from clocks and pipelines to decoding and branches, and how that translates into price, performance, and energy efficiency. Along the way, we will build just enough CPU intuition to make your cloud choices feel clear and practical instead of mysterious.
I like to kick things off with a bit of history. To really appreciate Arm and x86, it helps to first understand what a CPU is and how it works. If a modern CPU feels like magic, its origin story is even more fascinating.
The CPU is one of the most complex creations of humankind, and a full explanation would take more pages than anyone wants to scroll through. But long before computers became the polished machines we carry in our pockets, generations of brilliant and stubborn minds had to imagine them into existence.
In 1801, Joseph Marie Jacquard introduced his programmable loom. At first glance, it was just a clever way to weave patterns using punched cards. But hidden in those threads was a radical idea: machines could follow instructions. That little spark would echo for centuries.
A few decades later, Charles Babbage imagined the Analytical Engine. Picture a gigantic calculator made of gears and brass. He never finished building it, yet his design already contained the building blocks of a computer: memory, processing, input, and output. His collaborator Ada Lovelace took it further. She realized that such a machine could do more than crunch numbers. It could manipulate symbols and even create music. With that vision, she became the first person to truly understand the idea of software.
By the early twentieth century, the dream began to move from theory into practice. In 1936, Alan Turing described the Turing Machine, which defined what it really means to compute. A few years later, during the Second World War, he helped crack the Enigma code, proving that machines could solve problems with enormous real world consequences. At the same time, Konrad Zuse and the engineers behind ENIAC were building some of the first programmable electronic computers. By replacing clunky mechanical parts with vacuum tubes, they gave machines speed and power that no one had seen before.
And then came John von Neumann. If Turing was the father of the computer, von Neumann was its Einstein. His stored-program design, which placed both data and instructions in the same memory, became the model that almost every computer still follows today. Soon after, the invention of the transistor (1947) and the arrival of integrated circuits (1960s) shrank computers from machines that filled entire buildings to devices that could sit on a desk. That shift opened the door to personal computing.
Now, this sounds like a straight path from Jacquard to your laptop, but the truth is far from simple. Each breakthrough took decades of trial, error, and persistence. Many brilliant contributors never had their names written in history books. The computer is not the invention of a single genius, but the result of countless people across generations, each adding their own piece to the puzzle.
So in order to understand how a CPU works, and later what Arm and x86 really mean, let us start from the ground up. Imagine we have a very simple program that just wants to add two numbers. For this example, I chose C++, a language that many people love and many others love to hate. The following image shows the code we want to run.
Now that we have the code, the question is: how do we actually run it? There are a few steps involved, but to keep things simple, I will skip the technical details for now. We will dive deeper later. For the moment, think of it like this: computers are electrical machines, and at the lowest level, they represent everything as zeros and ones (off and on). The challenge is to translate our human readable code into those simple instructions.
That is where a compiler comes in. A compiler takes code written in a high level language and translates it into machine code, which is just patterns of zeros and ones. To make this process easier to see, I am using a tool called Godbolt. It takes C++ code and produces something called assembly. Assembly is like a halfway point: it is not as friendly as C++, but it is still readable by humans, and each instruction corresponds directly to machine code that the CPU understands.
When we compile the program, the assembly instructions are stored in random-access memory (RAM). The operating system then points the CPU to the memory address where the program lives and tells it to start executing. The instructions might say things like “load this value into a register” or “add these two numbers together.” Step by step, the CPU carries them out.
Inside the CPU, there is a special piece called the Arithmetic Logic Unit (ALU). This is the part that actually does the math. If the instruction says to add, the ALU adds. If the instruction says to subtract or divide, the ALU does that too. In short, your program in memory gets broken down into instructions, and the CPU runs them one at a time, doing real work like adding, subtracting, or comparing numbers.
Now that we understand the path from source to execution, we can dive deeper into the core terms and see how they come together inside a CPU.
The first thing to know is that a CPU is a clocked machine. The clock divides time into tiny slices called cycles. When you hear that a processor runs at 3 gigahertz, that means its clock ticks three billion times per second. On every tick, somewhere in the core, useful work can happen. Some processors can even perform multiple operations in a single cycle, which is why clock speed alone does not tell the whole performance story.
For clarity, I will separate the CPU into two broad areas that work together. While the front end is responsible for preparing instructions, the back end is where the ALUs live. An ALU is the part of the core that performs operations such as add, subtract, multiply, and compare. Modern cores have several ALUs so they can work on many operations at once. The back end organizes the operations, carries them out, and then finalizes the results in sequence to keep the program state consistent.
A key helper for the ALU is the register. A register is a tiny storage cell inside the CPU, much faster than cache or main memory. It holds the immediate values that the ALU needs to work with, such as operands for an addition or the result of a comparison. Because they are so few and so fast, efficient use of registers is crucial for performance.
Before the back end can do any work, the processor must turn program bytes into the internal operations that ALUs understand (micro-ops). That preparation happens in the front end. The front end fetches instruction bytes from memory and converts them into the core’s internal format. The key piece here is the Decoder, the hardware that translates instruction bits into micro operations.
It is funny to assume that machine code is the end of the translation journey. From the point of view of the processor those bytes are still a compact description rather than the work itself. The front end must decode them into internal micro operations that drive the execution units. In that sense, the hardware decoder acts like the final compiler stage, turning your program’s instructions into the micro ops the core actually executes.
Because useful work can happen somewhere in the core on every cycle, many processors can even execute multiple instructions per cycle. That is why a strong front end matters so much. With a 3 GHz clock, the decoder must keep a steady flow of ready work moving to the back end. The better the front end can feed the back end without stalls, the fewer cycles are wasted and the more performance you get from each second of compute.
AMD Zen 4 CPU architecture
That steady flow depends not only on decoding but also on how quickly the core can fetch the data and instructions themselves. This is where the cache hierarchy earns its keep. A cache is a small but very fast on-chip memory that holds recently used data and instructions so the core does not have to wait for main memory on every access. Caches are arranged in levels. L1 sits closest to the execution units and is private to a core. L2 is usually private as well and is larger but a little slower. L3 is the last level cache shared by many cores on the same chip. Data moves through these levels in fixed size blocks called cache lines that are typically 64 bytes.
You might wonder why CPUs do not simply use huge caches everywhere. The reason is that cache uses static random-access memory (SRAM), which requires several transistors per bit of storage, while main memory, dynamic random-access memory (DRAM), typically uses just one transistor and one capacitor per bit. SRAM is much faster and does not need the constant refreshing that DRAM does, but it is larger, hotter, and more expensive per bit. Engineers balance these tradeoffs carefully when designing a CPU.
Programs that touch nearby addresses in sequence benefit from spatial locality, and programs that reuse the same data soon after first use benefit from temporal locality. When a request misses in L1, the core tries L2, then L3, and only then goes to main memory. Each step down the hierarchy costs more cycles, which is why a well fed back end also needs a well fed cache.
The table below gives ballpark hit latencies and shows how the same delay looks in cycles and in time at a 3 GHz clock. Real numbers vary by microarchitecture and frequency, but the gaps are what matter.

Level | Typical hit latency | Time at 3 GHz
L1 cache | ~4 cycles | ~1.3 ns
L2 cache | ~12 cycles | ~4 ns
L3 cache | ~40 cycles | ~13 ns
Main memory (DRAM) | ~200+ cycles | ~70+ ns

If you’d like to dive deeper, I recommend this excellent article on the must-know numbers for every computer engineer.
If most of your working set fits in L1 and L2, the core can keep issuing useful work almost every cycle. A miss that falls into L3 already costs an order of magnitude more time, and a miss that spills to DRAM can cost two orders of magnitude more. This is why layout, batching, and access patterns often move performance more than any single instruction choice and why a strong front end only reaches its potential when the cache hierarchy keeps pace.
To keep the back end busy, the front end tries to predict what instructions you will need next. This is called branch prediction and speculative execution. A branch is a decision point in your program, such as an if statement or a loop. If the core waited to see the outcome before fetching more instructions, the pipeline would stall. Instead, the front end makes an educated guess about which path will be taken and begins fetching and decoding those instructions in advance. It learns from history, tracking which way a branch went last time and how often, and it also recognizes common patterns for function calls and returns. When the guess is right, the arithmetic units never run out of work, and the whole pipeline flows smoothly.
Speculation is not free, so the core has careful rules to keep your program correct. The work done from a guess is tentative until the branch outcome is known. If the guess was wrong, the core discards that speculative work, flushes the pipeline, and then fetches the correct path. That costs cycles, which is why strong prediction matters so much for performance. The better the front end is at guessing and the faster it can recover when it misses, the more continuously it can feed the back end, and the closer the core gets to executing multiple useful instructions on every cycle.
Arm is both a company and a family of processors defined by an instruction set architecture (ISA). An ISA is the software-visible contract between programs and a processor. It specifies the instructions that exist, the register names and sizes, how memory addressing works, and the rules for privileged operations. The two ISAs we care about here are AArch64 for modern Arm and x86_64 for today’s Intel and AMD servers.
People often describe Arm as RISC and x86 as CISC. These are design traditions rather than hard categories. Reduced instruction set computer (RISC) prizes simple, fixed-length instructions and a load-store model where arithmetic uses registers and only dedicated instructions touch memory. Complex instruction set computer (CISC) historically allowed more elaborate instructions. Modern x86 decoders translate those elaborate instructions into simpler internal operations before execution, so in practice, the line between RISC and CISC is blurry.
Historically, x86 dates back to Intel’s 8086 processor (1978) and has steadily accreted features and instructions, leading to implementations with ~1,000 base instructions and up to 4,000 with extensions. Arm, first developed in the 1980s for power-efficient embedded systems, has kept its ISA leaner (~300–400 instructions). This difference in philosophy, accumulation versus restraint, helps explain why the two architectures feel so different, even as their underlying execution engines converge in complexity.
OK, to see that difference in practice, let’s revisit the code from the beginning of this blog. I used Godbolt to generate the assembly output for both the Arm instruction set and for x86 (the version you already saw earlier). As you compare the two columns, you will notice that Arm uses one fixed-size word for every instruction, while x86 uses variable-length encodings that can span from a single byte to many bytes.
Side note: You might notice an instruction called XOR in the x86 output. This is one of the most important logical operations. It's fundamental to how we can add bits together. Even more interesting, XOR can be used to swap the values of two variables without needing a temporary one. If you’re curious, I highly recommend checking out this article on problem-solving with XOR.
It is fair to ask why there are so many instructions when addition and multiplication seem universal. When an ISA is designed, the architects look at the operations that real programs perform most often and add encodings that make those common cases easy to express and fast to execute, so the translation from source code to machine code stays compact and efficient.
This single difference in format and scope has real consequences for how a processor keeps its execution units busy. With Arm, the front end can slice instruction bytes neatly and decode several operations in parallel with little ambiguity, which makes it easier to feed the back end at a steady pace. With x86_64, the flexibility of variable-length encodings is powerful, but the front end must do more work to find instruction boundaries and modes every time it fetches new bytes, and the larger catalog of instructions adds more cases to handle. Modern x86 designs counter this with wide decoders and caches of already-decoded micro operations so they can still run at very high speed. Even so, the appeal of fixed-length decoding remains clear when you want predictable throughput and simple parallel decode.
There is a well-known story from the first Arm prototype. During testing, a board was wired incorrectly and the processor power pins were not connected. The engineers noticed that the chip was still running: it was being back-powered through the protection diodes and leakage paths of the I/O pins connected to other powered chips on the board.
The only reason this could happen is that the core drew so little current that those tiny leakage paths were enough to sustain it at the low test speeds. This anecdote, shared in several accounts of Arm’s early days, highlights just how aggressively small and efficient the first designs were.
When you rent compute from a cloud provider, you are not buying a particular chip. You are buying outcomes like throughput, latency, and the number at the bottom of the invoice. That is why the recent shift toward Arm based servers in the cloud matters. You get another way to buy those outcomes, often with better efficiency.
On Amazon Web Services, the Arm based servers are the AWS Graviton families. AWS Graviton is the name AWS gives to its in-house processors that implement the Arm instruction set architecture. Those processors power many Amazon EC2 instance families, and they now span three generations in wide production and a fourth that recently reached general availability for memory-heavy work.
So why should I move to Arm based chips?
The customer stories are starting to look less like experiments and more like standard practice.
These are different companies with different stacks, but the through line is that once the teams produced Arm builds of their services and container images, they found the move practical and the outcomes measurable.
If you want to get a quick read on your own estate, there is now a tool for that. The Graviton Savings Dashboard ingests your current usage and projects where you are likely to see savings if you migrate eligible workloads. It breaks findings down by service and gives a straightforward way to choose a few representative systems to benchmark side by side. Treat it as a map rather than a mandate. It helps you pick the first hills to climb.
When you translate pipeline theory and cache behavior into cloud decisions, the impact shows up directly in unit economics. If a service can handle more requests per core at the same or lower price, your compute cost per thousand requests drops, and so does the time to hit an SLO under load. That is why price performance and performance per watt matter. They improve the top line by letting you deliver more capacity within the same budget, and they improve the bottom line by reducing the compute you need for a given level of traffic.
In the end, this is less about Arm versus x86 and more about theory meeting reality. Under the hood, we find a clocked machine with a front end that fetches and decodes and a back end that executes and retires work, all held in check by caches and branch prediction. Fixed-length Arm instructions make the front end predictable, while x86 shows that smart microarchitecture can turn variable-length instructions into extremely fast execution. The label matters less than the design. What truly drives performance is how well the pipeline is fed, how often data is found in cache, and how confidently the core avoids stalls. That’s the mental model worth keeping when you look at benchmarks or profile your own code.
In the cloud, these mechanics translate into outcomes you can buy. AWS Graviton delivers excellent performance per watt and strong price performance, while the x86 families continue to offer absolute speed and a vast ecosystem. The good news is you don’t need to pick a side. Instead, build multi-architecture images, deploy the same service to both Arm and x86, and measure throughput, tail latency, and monthly cost under your real workload. Use savings tools to identify the best candidates first, then let data decide where each tier belongs. With a little CPU literacy and disciplined measurement, you gain the freedom to choose the right engine for every job—and the confidence that your decision will hold up when the invoice arrives.
At Caylent, we’ve helped hundreds of customers right-size their workloads and migrate to AWS Graviton. We were launch partners for the AWS Graviton Service Delivery Program, and we continue to partner closely with the AWS Graviton team to help our customers achieve best-in-class price/performance in the cloud. Our experts will work with you to review your data, assess application compatibility with Arm instances, determine the best adoption path, detail expected cost savings and performance improvements, and develop CI/CD pipelines that continuously optimize for future technologies.
Pedro is a Senior Software Developer at Caylent, with extensive experience delivering modernization projects and building high-throughput, cloud-native solutions. With a strong background in front-end development, he also brings solid expertise in distributed systems, clean architecture, and performance optimization. Pedro has worked across multiple modernization and digital transformation initiatives, helping enterprises evolve their applications for scalability and resilience. His technical skills span .NET, Node.js, Kubernetes, and AWS, along with a deep appreciation for the fundamentals of Computer Science. He uses VIM, btw, and outside of work, he enjoys playing piano.