Symbolic Execution: Intermediate Representation
Systematic Comparison of Symbolic Execution Systems: Intermediate Representation and its Generation
Sebastian Poeplau and Aurélien Francillon
Proceedings of the 35th Annual Computer Security Applications Conference (ACSAC) 2019, San Juan, Puerto Rico
Symbolic execution has become a popular technique for software testing
and vulnerability detection. Most implementations transform the
program under analysis to some intermediate representation
(IR), which is then used as a basis for symbolic execution. There
is a multitude of available IRs, and even more approaches to transform
target programs into a respective IR.
When developing a symbolic execution engine, one needs to choose an
IR, but it is not clear which influence the IR generation process has
on the resulting system. What are the respective benefits for symbolic
execution of generating IR from source code versus lifting machine
code? Does the distinction even matter? What is the impact of not
using an IR, executing machine code directly? We feel that there is
little scientific evidence backing the answers to those questions.
Therefore, we first develop a methodology for systematic comparison of
different approaches to symbolic execution; we then use it to evaluate
the impact of the choice of IR and IR generation. We make our
comparison framework available to the community for future
research.
Introduction
On this site, we publish all code and data that we used in the study
referenced above, as promised in the paper. Feel free
to contact us if you have
any questions.
Code
To measure the size of the IR representations:
-
For the VEX IR generated by angr as well as for counting machine-code
instructions, we used a simple Python
script inspired by a script that Insu Yun kindly shared with us and
that was used in the evaluation of Qsym.
-
To count LLVM instructions in the IR that KLEE executes, we used
the InstCount
pass included in the LLVM framework. It requires a debug build of LLVM,
where it can be invoked as follows: "/debug/version/of/opt -stats
-analyze -instcount /path/to/code.bc".
-
For S2E, the situation is a little more tricky. Since it generates
LLVM bitcode on demand during analysis, there is no artifact that we can
evaluate statically. Moreover, the analysis only translates basic blocks
that are scheduled for symbolic execution. Therefore, we wrote a simple
angr script to extract a list of
all basic blocks (for consistency with the count obtained for VEX IR),
which we then fed to a tool that reuses S2E's lifter to generate
bitcode for all basic blocks (implemented on top
of s2e-tools; two
patches).
Let's move on to the changes we made in the four analyzed symbolic
execution engines. For now, we simply link our patches here. However, we
are in the process of submitting as many of our code changes as possible
to the respective upstream projects.
-
For angr, we added code to the main angr project in order
to count
the executed IR statements and
to measure
the time spent on symbolic execution (as opposed to concrete execution
with Unicorn). Moreover,
we patched claripy,
a dependency of angr, to log all queries that it sends to the SMT
solver. We processed the logged queries via awk '{ print }
/\(check-sat\)/ { print "(reset)" }' | sed 's/[+-]zero/0/g' in
order to avoid running into
a Z3 bug.
Finally, the script that we used to control the analysis is
available here.
-
For KLEE, we added the capability to handle just enough
of mmap
and munmap to support the
fixed-address allocations that the CGC programs use for the flag page.
Concolic execution can be enabled with "-only-seed -seed-out
/path/to/input.ktest". The number of executed instructions is already
part of the "run.stats" file that KLEE generates, and SMT queries can be
logged with the option "-use-query-log=all:smt2". In a post-processing
step, we ran awk '/(check-sat)/,/(exit)/ {if ($0 ~ /(check-sat)/)
print "(check-sat) (reset)"; next} {print}' on the logged queries
before passing them to Z3 for benchmarking.
-
Analyzing the CGC programs concolically in S2E required a bit more work.
First,
we fixed libs2ecore
to correctly update the timers for symbolic and concrete execution. Then
we added
a plugin
that terminates new execution states right after creation, so that we
see the queries that lead to the fork but do not spend CPU time on
pursuing the forked states (i.e., essentially the same behavior as KLEE
in seed-only mode). Lastly, we created a
custom S2E project that uses
the plugin; create symbolic links "crashing_input.xml" and "target"
pointing to the (original) PoV and the CGC binary, respectively, and
remove the line "--use-query-log=all:smt2" from "s2e-config.lua" to
disable query logging. The post-processing step for the queries is the
same as in the KLEE setup.
-
Qsym required a
small patch
to accept a timeout value in its run script, as well as two more patches
to count
the number of symbolically executed instructions and
to measure
the time spent in symbolic mode. We also made a
tiny modification
to log its SMT queries. We post-processed the log file "pin.log"
with awk '/\(set-info/, /\(check-sat\)/ {print} /\(check-sat\)/
{printf("(reset)\n")}' in order to extract the queries for Z3.
In all cases, we fed the SMT queries to z3 -T:1800 -smt2 for benchmarking.
Data
The CGC
programs published for the DARPA Cyber Grand Challenge are the basis
of our comparison. We used
the DECREE binaries
provided by Yan Shoshitaishvili for angr and S2E, and custom builds of
the Linux port by
Trail of Bits for KLEE and Qsym. While the former two systems simply
processed the unmodified binaries, the following instructions, based
on notes
by Brendan Dolan-Gavitt, let you produce test programs for KLEE and Qsym:
-
For Qsym, we found it easiest to build the challenge programs for 64-bit
machines using
a patch
on the build system. Note that building for a different architecture may
affect the exploitability of some of the challenges—this is not a
concern for us since our goal is not to exploit their vulnerabilities
but just to have a stable set of test programs.
-
For KLEE, we applied
another patch
in addition to the one enabling the 64-bit build. This patch replaces
the anonymous mmap that all CGC programs use to allocate heap
memory with a combination of malloc and mprotect. Then
we built
with wllvm
as described in
the tutorial
on analyzing GNU Coreutils.
Finally, we wrote a small tool that converts
PoV descriptions from the original XML format to raw inputs, as well
as one that produces ".ktest" files from raw
inputs (required for KLEE). The PoV conversion does not work in all
cases—when the PoV uses received data to derive what to send next, we
cannot translate it to a simple binary input (as noted in the paper). The
tools are not currently very user friendly (sorry!). We are working on
easier to use versions.
Acknowledgements
This work was supported by the DAPCODS/IOTics ANR 2016 project (ANR-16-CE25-0015).
Feel free to reach out to us
if anything is unclear or if you need more information.