Introduction
In this article, I aim to capture the magic in the craft known as reverse engineering. To grasp its essence, we must first delve into a broader understanding of computer functionality and the process through which developers create software. We will keep it high level, no worries.
What is Reverse Engineering?
Reverse engineering (RE or reversing in short) isn't confined solely to computers and software. In fact, humans engage in reverse engineering frequently, often without even recognizing it. Whenever you observe something and ponder its inner workings or how it was constructed—be it a device, a building, a film, a missile, or virtually anything else—that cognitive process embodies reverse engineering. In that moment, you're endeavoring to comprehend the functionality of something through thorough analysis. Whether it's an item engineered, designed, implemented, or even something naturally occurring (such as the human body or DNA), reversing it entails attempting to comprehend its precise workings.
Reverse engineering the human genome stands as one of humanity's most significant breakthroughs, requiring approximately 13 years to accomplish.
General examples of reversing
Typically, people don't engage in reverse engineering for the sake of it; there's usually a specific end goal driving the process. Here are a few general examples:
- In guitar or piano lessons - you listen to a song, dissect its components, and aim to replicate the melody yourself.
- After enjoying a delicious meal, you might try to deconstruct its flavors to identify the ingredients used.
- When faced with a water leak in your ceiling, your plumber must first trace it back to the right broken pipe before fixing it.
- Watching a movie prompts you to analyze various aspects such as the production design, character psychology, themes, budget, and more, as you delve into imagining the intricacies of the filmmaking process.
Analysis
Observe how I employ the term 'analyze' to encompass the entire process of acquiring something, dissecting it thoroughly (breaking it down as much as possible), and striving to comprehend it to a degree where you can recreate it independently. This endeavor may involve numerous experiments, guesswork, informed conjectures, empirical trials, and various other methods. Mastering this skill often requires considerable experience acquired over time. It stands in total opposition to forward engineering, which is akin to constructing something. Broadly speaking, we categorize this entire process as research. Within the realm of software, one manifestation of research takes the form of reverse engineering.
Here’s a hands-on quick reverse engineering challenge: dissect this composition, analyze it and uncover the title of the piece along with the artist’s name!
Blueprints
When we embark on the creation of something with a clear purpose, intent and vision, we undergo a thoughtful process of envisioning our desired outcome, considering its appearance, functionality, and gathering necessary specifications (even as basic as the color of a car we want to buy). When talking about grand achievements, this process can be quite extensive and may involve the collaboration of numerous engineers. The magnitude of the objective often correlates with the level of organization and documentation in the process (well, hopefully). This documentation essentially serves as the blueprint—a tangible guide that can be replicated, shared, and utilized by others to produce the exact same outcome. Whether it involves the construction of satellites, automobiles, banking systems, mobile devices, kitchen appliances, or cultivating a specific variety of cucumbers, the principle is to do so consistently and accurately.
As far as I'm concerned, a blueprint could equally be one of Mozart's compositions or the plans for building an airplane, while in the realm of computers we refer to this ‘blueprint’ as 'source code'. To clarify with a contrasting example: even if you were to obtain one of Spielberg's scripts, you still wouldn't be able to replicate the entire movie scene by scene, because a script isn’t a blueprint. It is a close relative, and obviously essential for producing the film, but it leaves far too much open to interpretation.
We will soon get back to this term source code.
RE in computer software
It becomes even more intriguing and dramatic when we delve into real-life scenarios where reverse engineering proves invaluable:
- You seek to comprehend the inner workings of a computer virus, enabling you to develop an antivirus program capable of completely eradicating it.
- Conversely, you endeavor to understand the functionality of antivirus software, enabling you to nefariously craft a virus that evades detection and steals passwords. Pretty neat, isn't it?
- You aspire to grasp the mechanics of a server or network, empowering you to infiltrate it—an act commonly known as hacking.
- You aim to unravel the workings of a game, allowing you the ability to implement cheats, such as increasing life counters or enhancing aim accuracy in shooter games like Doom.
- You wish to expand the capabilities of an existing software developed by someone else. For instance, in 2010, I introduced unofficial support for Hebrew on the iPhone.
- You want to understand the functionality of something to fix it. In 2007, I released a software patch addressing an urgent security vulnerability in Internet Explorer.
Hacking
One method of hacking systems involves reverse engineering, wherein researchers identify security vulnerabilities or software weaknesses, allowing them to manipulate and exploit these flaws to compromise a target device. Today, within cybersecurity, hackers are commonly divided into black hats and white hats. They can also take part in what we call 'red team' and 'blue team' exercises, like in war games. These teams engage in adversarial activities within large corporations, typically those with substantial budgets, simulating attacks and countermeasures to assess their effectiveness. As expected, the attackers usually emerge victorious in these simulations.
It's worth noting that certain hacking methods don't necessitate technical expertise or reverse engineering at all. For instance, you might be familiar with the practice of receiving an SMS text message containing a code when logging into a service. This code serves as a means of authentication. Now, picture a scenario where a hacker manipulates someone into revealing this PIN code received on their mobile device, thereby compromising their email account.
However, through technical means of hacking computers, one can clandestinely gather intelligence without raising any suspicion or relying on anyone to log into a system, like nation states do.
Source code
From blueprints to source code. Source code is the fundamental form of software: the text engineers write in order to develop computer programs. It is written in a programming language, whose strictly defined directives ensure computers interpret and execute it consistently across different platforms. Within engineering circles, we simply refer to it as "code," as the numbers and symbols within it carry specific meanings, akin to a code.
The term "source code" may seem peculiar upon reflection. Why don't we use similar terminology for other items, such as my grandmother's recipe for the world's best cake, and call it "source recipe"? The answer may seem apparent to those familiar with technical jargon, but it might not be immediately evident to non-technical individuals.
Recipe vs accurate instructions
How do machines achieve such precise execution of programs? We understand that mastering culinary skills, i.e. cooking, is challenging, and adhering to a recipe involves interpretation, experience, personal taste, and expertise. When someone instructs you to add half a glass of sugar, they may presume you understand the size of the glass, but nuances can be lost in translation.
However, such ambiguity is not accepted in computers, as relying on assumptions leads to unreliable outcomes.
If software were described with a recipe's level of detail, technology wouldn't have reached its current state. Precision is paramount in instructions, and the more precision you demand, the smaller each instruction has to be. That's why, when it comes to computers, the commands are really tiny: there’s no room for freedom!
Hence a recipe is not a ‘source’ in the sense of producing exactly the same result every time, and that's before we even get to the baking. When our grandmother gives us her recipe, we normally copy it down over the phone, and our copy will surely look and feel a bit different too.
CPUs
Hence, within computers, operations are carried out through the execution of machine code. Essentially, machine code comprises the smallest commands that a hardware processor processes. The CPU, or Central Processing Unit, serves as the core component in a computer, tasked with executing programs. These commands, known as instructions, are fundamentally super elementary. Think of basic arithmetic operations such as addition, subtraction, multiplication, and division, as well as tasks like copying numbers between locations and the ability to repeat certain instructions, analogous to adding four spoons of sugar to a cake.
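To make "super elementary" a bit more concrete, here is a minimal sketch of a pretend processor written in C. The instruction set (OP_SET, OP_ADD, OP_SUB, OP_JUMP_IF_NOT_ZERO, OP_HALT) and every name in it are invented for illustration and do not correspond to any real CPU. The tiny program it runs is the "four spoons of sugar" idea: one addition, repeated four times.

```c
/* A toy instruction set, invented for illustration; real CPUs differ. */
enum opcode { OP_SET, OP_ADD, OP_SUB, OP_JUMP_IF_NOT_ZERO, OP_HALT };

struct instruction {
    enum opcode op;
    int target; /* register index, or jump destination for the jump opcode */
    int value;  /* constant operand, or the register the jump opcode tests */
};

/* Executes a program one tiny instruction at a time and returns register 0. */
int run_toy_cpu(const struct instruction *program)
{
    int reg[4] = {0};
    int pc = 0; /* program counter: index of the next instruction */

    for (;;) {
        struct instruction in = program[pc];
        switch (in.op) {
        case OP_SET: reg[in.target] = in.value;  pc++; break;
        case OP_ADD: reg[in.target] += in.value; pc++; break;
        case OP_SUB: reg[in.target] -= in.value; pc++; break;
        case OP_JUMP_IF_NOT_ZERO:
            pc = (reg[in.value] != 0) ? in.target : pc + 1;
            break;
        case OP_HALT:
            return reg[0];
        }
    }
}

/* "Add four spoons of sugar": repeat a single addition four times. */
int add_four_spoons(void)
{
    static const struct instruction program[] = {
        { OP_SET, 0, 0 },              /* 0: sugar = 0                    */
        { OP_SET, 1, 4 },              /* 1: remaining = 4                */
        { OP_ADD, 0, 1 },              /* 2: sugar += 1                   */
        { OP_SUB, 1, 1 },              /* 3: remaining -= 1               */
        { OP_JUMP_IF_NOT_ZERO, 2, 1 }, /* 4: if remaining != 0, go to 2   */
        { OP_HALT, 0, 0 },             /* 5: stop, result is in register 0 */
    };
    return run_toy_cpu(program);
}
```

Calling add_four_spoons() returns 4. Notice that nothing in the program says "repeat four times"; the repetition only emerges from a counter, a subtraction, and a conditional jump.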
The reason I'm so passionate about computers is because of this very fact. Computers may seem simple, only able to execute tiny instructions, yet humans use them to run the world. The contrast between what you see on your screen right now while reading this sentence and how it all works is mind-blowing and awe-inspiring once you delve into it. Unfortunately many developers today don't fully grasp the inner workings. But that's okay because technology keeps evolving and modernizing, and they don't necessarily have to understand every detail.
The issue is that most engineers dislike this simplicity, because accomplishing real tasks with a computer often demands writing hundreds of thousands of lines of machine code, which is tedious and requires meticulous attention to detail.
Fortunately, thanks to the efforts of other engineers who developed abstraction layers, such as operating systems, this burden has been alleviated. An operating system is a specialized software designed to facilitate the reliable execution of applications on a particular device by employing its hardware components. For example, Android and iOS are widely used mobile operating systems tailored for running mobile applications.
BTW - speaking of processors - Nvidia ($NVDA) is one of the fastest-growing companies in history, with a market cap of 1.9 trillion dollars! It makes specialized hardware processors for graphics and AI, so that AI and graphics rendering can happen faster and humans can interact with computers in better ways, e.g. speaking with them in real time rather than only using a keyboard and mouse.
The Matrix
The most fascinating aspect is when you begin to grasp how computers, with their hardware processors executing tiny commands, ultimately come together to run entire games, the internet, and even control missiles - virtually everything imaginable. That's the essence of the matrix, in my view. Moreover, consider how code interacts with user input (keyboard, mouse, screen taps), creates visual displays on the screen (built pixel by pixel), and generates sounds and music, among other functionalities.
Compilers
The higher the level of a programming language, the fewer lines of code a software engineer needs to write to achieve more tasks. You've likely heard of popular languages such as C++, Java, or JavaScript.
Ultimately, it boils down to productivity. Consider the contrast between using punch cards in the 1960s and the capabilities available today, where engineers can swiftly instruct AI with plain English to do stuff. The focus is on accelerating the release of new versions, minimizing coding issues (bugs) to ensure software reliability and resilience, and continually extending functionality to adapt to evolving product, market, and user needs.
Using punch cards, where holes were literally punched into them, conducting even a small experiment took days. Given the time required to write and test code, engineers aimed for 100% accuracy to avoid wasting precious time and resources. These engineers were truly heroic. Consider that a punch card, akin to a floppy disk or its successor, the CD-ROM, contained only numbers and holes, literally representing code instructions and data. Consequently, programs were also relatively small in size. Remarkably, ancient engineers could interpret code simply by examining a punch card.
However, with programming languages (notice that a punch card wasn’t a language for describing things) we need a way to turn our code into something CPUs can execute. We need a tool to do that for us.
A compiler is a tool that takes high-level code and translates it into machine code, breaking it down into tiny instructions. These instructions can number in the millions, comparable to the DNA letters in a chromosome, where placement and repetition are what vary. Just as the same letters can compose different genomes, the same instructions can compose different programs.
The higher the level of the programming language, the simpler it is to write code. Programming languages offer numerous ways to speed up software production. For the sake of simplicity, software maintenance, capability expansion and collaboration, developers often assign names to elements in their source code, to make it readable and friendly. However, CPUs only understand elementary instructions, which makes these names useless to them. Consequently, many of the abstractions present in high-level programming languages are discarded when the code is translated into machine code, which is stored in binary files - perhaps you're familiar with the .exe file extension, short for 'executable'.
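To picture what "breaking it down into tiny instructions" means, here is a small sketch with the same computation written twice in C. The first function is how a developer would naturally write it. The second is lowered by hand into one small step per line, with jumps instead of a loop construct. It is my own illustration of the shape of compiled code, not the output of any real compiler, and all function and variable names are hypothetical.

```c
/* High-level version: descriptive names and a loop construct. */
int total_price(const int *item_prices, int item_count)
{
    int total = 0;
    for (int i = 0; i < item_count; i++) {
        total += item_prices[i];
    }
    return total;
}

/* Hand-lowered version: one tiny step per line, jumps instead of a loop,
   and register-like names, roughly the shape the logic takes after
   compilation. */
int total_price_lowered(const int *a, int b)
{
    int r0 = 0; /* running sum */
    int r1 = 0; /* current index */
    int r2;     /* scratch value */
check:
    if (r1 >= b) goto done; /* compare, then jump */
    r2 = a[r1];             /* copy a number from memory */
    r0 = r0 + r2;           /* add */
    r1 = r1 + 1;            /* add */
    goto check;             /* jump back: this is the "loop" */
done:
    return r0;
}
```

Both functions return the same result for the same input, for example 15 for the prices 3, 5 and 7. The second one is simply much harder on the eyes, which is exactly the trade-off a compiler makes on your behalf.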
Abstractions to the rescue
The bigger the task you pay a services company to do for you, the more details and operations are involved, and the more of them are concealed from you.
When I talked about this with friends, a common question was what "abstraction" means in computers. It struck me that people in different fields often have their own ways of thinking and approaching problems. Operating systems act as abstractions, hiding complex details from software developers, and certainly from users. Software developers can focus on bigger tasks without getting bogged down in boring, tedious low-level operations that don't help their business directly anyway. Remember the AI example? With a simple command, you can get an AI to plan a detailed NYC trip in seconds!
Here's a helpful metaphor: abstraction is like everything you can't directly see, but that you can still use to your advantage.
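As a rough sketch of what one such layer looks like in practice, the two C functions below print the same text at two different levels. The function names are hypothetical, and the lower-level one assumes a POSIX system such as Linux or macOS, where write() is the call that hands bytes to the operating system. Even that call is still an abstraction over drivers and hardware that we never see.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h> /* POSIX only (Linux, macOS): declares write() */

/* High level: the C library formats the text and manages buffering for us. */
int print_high_level(const char *text)
{
    int printed = printf("%s", text);
    fflush(stdout); /* push the buffered text out now, keeping output in order */
    return printed;
}

/* One layer down: hand the OS a file descriptor, a raw buffer, and a length. */
long print_low_level(const char *text)
{
    return (long)write(STDOUT_FILENO, text, strlen(text));
}
```

Both functions return the number of bytes they wrote. The point is less the code itself than how much the lower version has to spell out (which output, which bytes, how many) that the higher version quietly takes care of.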
And at last, what is (software) reverse engineering?
Reverse engineering is essentially the art and skill of examining, analyzing, and ultimately understanding how machine code functions and contributes to the overall operation of a complex software program. However, it poses significant challenges due to its inherent complexity. As computers and software evolve, they incorporate increasingly sophisticated layers of abstraction, making reverse engineering even more difficult. The proliferation of abstraction layers has simplified the process of creating applications, leading to a surge in startups and a reduction in the time and money required for application development. Unlike in the past, when software engineers were revered for their deep understanding of hardware and ability to write machine code directly, today's landscape is different. The necessity for such expertise has diminished.
The goal of reverse engineering is not to get back to the 'source code' form. We won’t be able to reach the ‘source’; that is impossible by definition, due to the nature of compilation and the distribution of software in binary form. Compilation loses high-level information that CPUs don’t need but that helps humans manage software code, so we will never be able to reconstruct the original source code exactly.
For example, while reverse engineering, we will be forever uncertain about how developers initially named elements within their source code, among other details.
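Here is a small sketch of that loss. The first function is what a developer might have written. The second is what a reverse engineer might end up with after reconstructing the same logic from the binary: identical behavior, but with placeholder names in the style analysis tools tend to generate. Both functions, the address in the name sub_401000, and the names a1, a2 and v1 are made up for illustration.

```c
/* What the developer wrote: the names carry the intent. */
int apply_discount(int price_cents, int discount_percent)
{
    int discount_cents = price_cents * discount_percent / 100;
    return price_cents - discount_cents;
}

/* What a reverse engineer might reconstruct: same behavior, but the
   original names are gone and only placeholders remain. */
int sub_401000(int a1, int a2)
{
    int v1 = a1 * a2 / 100;
    return a1 - v1;
}
```

Both return 1500 for the inputs 2000 and 25. Nothing in the second version tells you it is about prices or discounts; working that out, and renaming things accordingly, is the reverse engineer's job.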
But the real question is, do we need to get the real source code?
So this is a multi-phase answer:
- To distribute software, it suffices to distribute code solely in binary form, encompassing machine instructions exclusively. This happens when you download an app from the app-store…
- Open source code denotes that the actual source code is openly accessible for anyone to engage with, compile, and execute independently - an extensive subject with significant implications both commercially and philosophically. Not useful for non-techies, obviously.
- Furthermore, reversing binary code is a time-consuming endeavor; the larger the software, the greater the volume of code, making it increasingly challenging to reconstruct and approximate the source code.
Naturally, access to the original source code would be advantageous as it provides extensive documentation and preserves the names used by developers, significantly easing comprehension compared to scrutinizing millions of individual instructions. Not to mention, it’s in a high level programming language.
Ultimately, it boils down to assigning meaning to the code you analyze. Once meaning is attributed, it becomes simpler to delve deeper into understanding it and subsequently determine the course of action to take. For instance, once you know how a virus operates, it becomes easier to detect it and to keep it from evading detection.
Also, if you do have the source code, it means you can’t say you’re reverse engineering it, but you’re actually auditing it!
Quick C Example
After making an effort to avoid technical details as much as possible throughout this post, I'd like to share with you an example of an application written in the C programming language. It's a simple application that merely displays a message on the screen, reminiscent of the black console screens often depicted in movies from the 80s and in hacking scenes.
The following is a fully functional C program. While I won't go into specifics, it's worth noting that there are numerous abstractions utilized in this code, facilitated by the underlying operating system capable of executing the program and helping it achieve some tasks like message printing.
#include <stdio.h>
int main(void)
{
    // Ask the OS to print a line to the screen!
    printf("Hello, computers are fun!\n");
    return 0;
}
A compiler would take this C code and translate it into roughly the following machine code, shown here in its human-readable assembly form. For brevity, once again, we ignore many technical details here, but just as with DNA or musical notes, some entry-level knowledge is required to read it.
.LC0:
.string "Hello, computers are fun!\n"
main:
push rbp
mov rbp, rsp
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
pop rbp
ret
Observe the difficulty in deciphering this machine code, where a single meaningful line of C code has expanded into ten lines. And yet, this is just a basic application with limited functionality. In reality, complex applications consist of hundreds of thousands of lines of code, compounded by the inclusion of numerous abstraction layers, resulting in even more code.
For an expert reverse engineer, reading those lines, making sense of them, and reconstructing them back into C is really simple. But with real-life applications, the same work can take months or even years.
In a surprising twist, reverse engineering can also be a valuable tool for software developers. By examining the compiled version of their own code (yes, that can count as reverse engineering!), they can check for specific vulnerabilities. This is especially important for critical software, like secure terminal servers, where even minor compiler glitches can have disastrous consequences. Historical examples show how vulnerabilities introduced during compilation have allowed attackers to exploit systems.
Back to the DNA genome, which comprises only four letters (ACTG), one must understand how to interpret them as genes - components imbued with purpose and giving them meaning. With just these four letters, incredible living organisms such as humans, animals, and plants emerge. The same happens with machine code, insane, right?
To summarize
Reverse engineering is captivating. You can take any code, and try to learn and unravel its inner workings, and I find it genuinely enjoyable. Whether it's troubleshooting software bugs, enhancing functionality, engaging in software piracy, creating cheat-codes, hacking, or cracking software, the possibilities are diverse. Similar to any profession, proficiency in reverse engineering improves with practice. Engineering and reverse engineering are distinctly different pursuits. Reverse engineering hinges on speed, patience, and the ability to pinpoint where to direct your attention when seeking specific elements within the code. Mind that getting a full ‘source code’ is rarely interesting.
In 2021, I identified numerous software vulnerabilities in the Windows operating system and reported them to Microsoft, which got me awarded hundreds of thousands of dollars! :) No wonder I love reverse engineering. This project took me more than 6 months, full time.
Between 2012 and 2016, Ariel (Piiano’s CTO) and I ran NorthBit, a software research company, where reversing, debugging, analyzing and coding were our day-to-day practice, in order to come up with answers to our customers' highly diverse challenges: from dissecting malware that attacked our customers, removing it completely, and analyzing how they got attacked, to making large-scale software automation run on iPhones, creating a network-level attack detection system, and much more.
Now that you’re familiar with reverse engineering, there’s also a whole art form of anti reverse engineering. That’s to make sure hackers and software pirates have a hard time breaking specific software. When someone is capable of copying games, the vendor will lose money. Or when someone can copy a movie, Netflix will lose money too. And eventually it’s all a cat and mouse game between software engineers and reverse engineers.
The same is true for nation states attacking one another in the cyber world over the internet. WW3 has been here for a long time now!
It all begins with the cloud, where applications are accessible to everyone. Therefore, a user or an attacker makes no difference per se. Technically, encrypting all data at rest and in transit might seem like a comprehensive approach, but these methods are not enough anymore. For cloud hosted applications, data-at-rest encryption does not provide the coverage one might expect.