Software Reverse Engineering (SRE) is one of the most intellectually demanding disciplines in cybersecurity. It involves analyzing compiled software to understand how it operates without access to the source code. The goal is to determine how the software works, how it can fail, and, in some cases, how it is used to carry out attacks.
Reverse engineers analyze malware, audit software for vulnerabilities, investigate digital forensics cases, and verify critical systems where transparency is essential but source code is unavailable.
The dark side of this art is cracking, where SRE techniques are used to bypass software protections such as license checks, digital rights management, or copy-protection mechanisms.
With the rise of Large Language Models (LLMs), the flagship technology of modern Artificial Intelligence, analysts increasingly rely on AI assistants embedded into decompilers via plugins such as aiDAPal for IDA Pro or ReverserAI for Binary Ninja.
These tools promise to accelerate comprehension, but a fundamental question has remained unanswered:
Do LLMs actually help human analysts perform better in real reverse engineering workflows?
Our team of researchers from Arizona State University, the University of Padua, and EURECOM set out to answer this question with the first systematic, human-centered study of LLM-assisted Software Reverse Engineering.
We present our results in Decompiling the Synergy: An Empirical Study of Human-LLM Teaming in Software Reverse Engineering, recently accepted at the Network and Distributed System Security (NDSS) Symposium 2026, to be held in February 2026 in San Diego, California.
You can download the paper 👉 here
To support this study, we (credit to Zion Basque, the first author) developed DAILA, a research-oriented LLM plugin that integrates a superset of AI features offered by existing tools.
DAILA abstracts over the underlying decompiler and currently supports IDA Pro, Ghidra, Binary Ninja, and angr, while remaining model-agnostic via LiteLLM, enabling both closed models (e.g., GPT-4o, Claude) and open ones (e.g., LLaMA).
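Setting DAILA's actual internals aside, the model-agnostic pattern that LiteLLM enables can be sketched as follows. The prompt text and the `build_summary_request` helper are our own illustration, not DAILA's real code; LiteLLM's `completion()` is its real, provider-agnostic entry point.

```python
# Illustrative sketch (not DAILA's actual code): LiteLLM exposes a single
# completion() API across providers, so swapping models is a string change.
def build_summary_request(model: str, decompiled_func: str) -> dict:
    """Assemble a provider-agnostic chat request asking for a function summary."""
    return {
        "model": model,  # e.g. "gpt-4o", "claude-3-5-sonnet-20240620", "ollama/llama3"
        "messages": [
            {"role": "system",
             "content": "You are a reverse-engineering assistant. "
                        "Summarize what this decompiled function does."},
            {"role": "user", "content": decompiled_func},
        ],
    }

request = build_summary_request("gpt-4o", "int f(int x) { return x * 2 + 1; }")
# With litellm installed and an API key configured, the call would be:
#   from litellm import completion
#   resp = completion(**request)
#   print(resp.choices[0].message.content)
```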
🧪 The Study
To move beyond anecdotes, we designed a three-phase controlled human study combining fine-grained behavioral instrumentation with qualitative feedback.
Three Phases
1. **Pre-Study Survey (153 practitioners)**

   We surveyed a diverse population of SRE practitioners to understand how LLMs are already used in practice.
   - 68% reported LLMs as sometimes or often beneficial
   - 86% primarily used GPT-based models
   - The most common use cases were function summarization, renaming, and explaining known algorithms

2. **Experiment Design**

   We built a browser-based reverse-engineering platform integrating DAILA, exposing six AI features:
   - Function summarization
   - Function renaming
   - Variable renaming
   - Known algorithm identification
   - Vulnerability identification
   - Library documentation lookup
   - Plus a free-form chat interface

   We designed two realistic CTF-style challenges, carefully balanced in size, complexity, and difficulty. Each challenge contained realistic bugs (e.g., weak cryptography, path traversal) and representative program structure.

3. **Human Study**

   We recruited 48 participants (24 self-reported experts and 24 novices). Each participant solved one challenge with LLM assistance and one without, acting as their own control. In total, participants produced:
   - 109 hours of recorded reverse-engineering activity
   - 96 solution write-ups
   - 1,517 distinct LLM interactions
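For concreteness, a path traversal bug of the kind seeded into the challenges might look like the following. This snippet is our own illustration (including the `BASE_DIR` value), not code from the study's binaries:

```python
import os

BASE_DIR = "/srv/files"  # hypothetical document root

def serve_file_vulnerable(name: str) -> str:
    # BUG: user-controlled 'name' is joined without normalization, so
    # "../../etc/passwd" escapes BASE_DIR (classic path traversal).
    return os.path.join(BASE_DIR, name)

def serve_file_fixed(name: str) -> str:
    # Fix: resolve the path, then verify it still lies inside BASE_DIR.
    path = os.path.realpath(os.path.join(BASE_DIR, name))
    if not path.startswith(BASE_DIR + os.sep):
        raise ValueError("path traversal attempt blocked")
    return path
```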
📊 What We Found
1. AI as an Equalizer
LLMs dramatically narrow the expertise gap.
Novices using LLMs achieved a ~98% improvement in comprehension rate, reaching expert-level understanding speed.
“I would not have understood the binary half as well without the LLM.”
— study participant
This effect was consistent across both challenges and robust to different analysis metrics.
2. Experts: Redistribution, Not Acceleration
Experts did not experience a statistically significant increase in overall comprehension rate.
They did benefit selectively:
- Known algorithms (e.g., Base64, TEA, RLE) were triaged up to 2–3x faster
- Time saved there was often reinvested into analyzing custom or novel logic
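To illustrate why those triage wins are plausible (with our own toy example, not one of the study's functions): decompiled run-length encoding has such a distinctive shape that an LLM, like a seasoned analyst, can label it at a glance.

```python
def rle_decode(data: bytes) -> bytes:
    # Classic run-length decoding over (count, value) byte pairs.
    # Even stripped of names and types in decompiler output, this loop
    # shape is immediately recognizable as RLE.
    out = bytearray()
    for i in range(0, len(data) - 1, 2):
        count, value = data[i], data[i + 1]
        out.extend(bytes([value]) * count)
    return bytes(out)
```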
However, experts reported:
- Low trust in vulnerability-finding prompts
- Frequent need to double-check or discard AI output
- Occasional performance regressions due to hallucinations
3. More Artifacts ≠ Better Understanding
With LLM support, participants recovered ~66% more artifacts (comments, variable names, function names, inferred types).
Yet:
- LLM-generated artifacts did not correlate with improved understanding
- Manually created artifacts did
This suggests that the act of naming and structuring code is itself a cognitive process, one that automation can partially bypass (sometimes to the analyst’s detriment).
4. Hallucinations Are Rare, But Costly
Hallucinations occurred infrequently, but their impact was severe.
In about 20% of sessions, participants pursued false hypotheses (e.g., nonexistent buffer overflows) introduced by LLM suggestions.
In these cases, analysts spent up to 2× more time chasing bugs that were not there.
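As a hypothetical example of the pattern (not taken from the study's challenges): the parser below checks its bounds correctly, yet an LLM asked "is there a buffer overflow here?" may still confidently report one, and the analyst then burns time disproving it.

```python
def read_record(buf: bytes, offset: int) -> bytes:
    """Parse a length-prefixed record: 2-byte big-endian length, then payload."""
    if offset + 2 > len(buf):
        raise ValueError("truncated header")
    length = int.from_bytes(buf[offset:offset + 2], "big")
    # This bound check rules out any out-of-bounds read, but a
    # vulnerability-hunting prompt may hallucinate an "overflow" anyway.
    if offset + 2 + length > len(buf):
        raise ValueError("truncated payload")
    return buf[offset + 2:offset + 2 + length]
```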
Vulnerability-identification prompts were:
- The least trusted
- The most harmful to performance
- Negatively correlated with final understanding
5. Strategy Matters More Than Frequency
Top performers shared a common habit:
- They used the LLM early, during the first encounter with a function
- Typically via summarization, followed by manual renaming
This “first-visit strategy” led to a ~63% higher understanding rate than repeated or late LLM use.
In contrast:
- Repeated querying on the same function showed diminishing returns
- Heavy reliance on LLMs for large or complex functions degraded performance
💡 Key Takeaways
| Insight | What We Observed |
|---|---|
| LLMs close the novice-expert gap | Novices reach expert-level comprehension with AI |
| Experts remain irreplaceable | Domain knowledge is essential for validation |
| AI boosts quantity, not always quality | More artifacts ≠ deeper understanding |
| Best use: early summarization | Use AI at first contact with code |
| Worst use: vulnerability detection | False positives harm trust and performance |
In short: LLMs help you think faster, but only if you stay in charge.
⚖️ In Context: Comparing to Developer–AI Studies
Our findings echo the recent work “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” by Model Evaluation & Threat Research (METR), an independent AI evaluation organization. In that randomized controlled trial, experienced open-source developers were 19% slower when allowed to use AI tools, largely due to time spent validating, correcting, or undoing AI-generated output.
Our experts did not become slower in the same uniform way, but the underlying failure mode was similar. Crucially, expertise did not provide immunity. When an LLM confidently suggested a vulnerability that did not exist, experts frequently followed the lead, sometimes spending substantially more time on those functions than they would have without AI. The cost of these errors outweighed the occasional speedups elsewhere. In our data, the vulnerability-detection feature was the only AI capability negatively correlated with understanding, and experts were among those most affected.
Taken together, our results and METR’s point to the same uncomfortable conclusion:
current LLMs do not reliably accelerate expert cognition.
Instead, they introduce a new tax—verification, skepticism, and recovery from subtle errors.
For experts, this tax often cancels out, or even exceeds, any raw efficiency gains.
The implication is not that experts are replaceable, but that expert workflows are fragile when augmented with tools that speak fluently but reason shallowly. Until LLMs can support long-horizon reasoning and maintain semantic grounding, expert users may remain paradoxically among the least well-served by AI assistance.
🧠 Final Thoughts
Our study shows that the future of software reverse engineering is collaborative.
LLMs do not replace experts; they reshape the cognitive workflow. Used carefully, they accelerate understanding, expose hidden structure, and empower less experienced analysts to reason like experts.
Yet the same tools can mislead when over-trusted. Hallucinated vulnerabilities and confident but wrong explanations remind us that LLMs remain probabilistic assistants, not oracles.
The real question is no longer whether we should use AI in SRE, but how we design tools, workflows, and training that keep humans firmly in the loop.
As we conclude in the paper:
“LLMs are neither oracles nor impostors, but mirrors held to human insight: their reflections sharpen only in the steady gaze of critical thought and domain expertise.”
Authors
- Zion Leonahenahe Basque
- Samuele Doria
- Ananta Soneji
- Wil Gibbs
- Adam Doupé
- Yan Shoshitaishvili
- Eleonora Losiouk
- Ruoyu Wang
- Simone Aonzo