SK2Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin
Overview
Overall Novelty Assessment
The paper proposes a two-phase decompilation framework. The first phase recovers program structure (the skeleton) by translating the binary into an Intermediate Representation with obfuscated identifiers, and the second phase generates meaningful identifier names (the skin). Each phase is trained with its own reinforcement learning objective. The paper occupies the 'Two-Phase Skeleton-to-Skin Decompilation' leaf within the 'Neural and LLM-Based Binary Decompilation' branch. This leaf contains only the paper itself, with no sibling papers, which indicates a sparse and potentially novel research direction within the broader binary decompilation landscape.
The taxonomy reveals that the paper's immediate parent branch, 'Neural and LLM-Based Binary Decompilation', also includes a 'Direct Neural Decompilation' leaf with two papers pursuing end-to-end translation without intermediate structure recovery. Neighboring branches address 'Compiler-Aware Structural Decompilation' (traditional algorithms), 'Decompiled Code Refinement and Enhancement' (post-processing), and 'Binary-Source Code Alignment and Mapping' (dataset generation). The two-phase skeleton-to-skin approach diverges from both single-pass neural methods and compiler-driven structural analysis, positioning itself at the intersection of modularity and learning-based refinement.
Sixteen candidates were examined in total. For the two-phase framework contribution, one of the five candidates examined could refute it, suggesting some prior exploration of phased decompilation strategies. For the Intermediate Representation contribution, none of the ten candidates examined clearly refutes it, indicating relative novelty in the specific obfuscation-based IR design. For the phase-specific reinforcement learning contribution, only one candidate was examined, and it does not refute the contribution, though the limited search scope prevents strong conclusions. Overall, the analysis covers a modest candidate pool drawn from semantic search, not an exhaustive survey of the decompilation literature.
Given the limited search scope and the paper's placement in a singleton taxonomy leaf, the work appears to pursue a relatively underexplored direction within neural binary decompilation. The two-phase decomposition and reinforcement learning integration partially overlap with prior phased refinement methods, but the specific skeleton-to-skin framing and obfuscated IR design may offer incremental differentiation. A broader literature review would clarify whether similar modular strategies exist beyond the sixteen candidates examined.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a novel framework that decomposes binary decompilation into two sequential phases. The first phase recovers program structure (skeleton) by translating binary to an intermediate representation, while the second phase recovers meaningful identifiers (skin) that reflect program semantics. Each phase uses reinforcement learning with phase-specific rewards.
The authors design an intermediate representation that consists of source code with all identifiers replaced by generic placeholders. This IR is grounded in the Information Bottleneck principle: it compresses away identifier semantics while preserving structural semantics, and it serves as the bridge between the two decompilation phases.
The authors develop distinct reinforcement learning objectives for each phase: compiler-based rewards for Structure Recovery to ensure syntactic and semantic correctness, and semantic similarity rewards for Identifier Naming to improve human-centric readability. This allows independent optimization of functional correctness and code readability.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Two-phase decompilation framework with Structure Recovery and Identifier Naming
The authors introduce a novel framework that decomposes binary decompilation into two sequential phases. The first phase recovers program structure (skeleton) by translating binary to an intermediate representation, while the second phase recovers meaningful identifiers (skin) that reflect program semantics. Each phase uses reinforcement learning with phase-specific rewards.
[48] A Neural-based Program Decompiler
[46] Decompiling Smart Contracts with a Large Language Model
[47] Type-based decompilation (or program reconstruction via type reconstruction)
[49] A refined decompiler to generate C code with high readability
[50] Detecting Repackaged Android Apps Using Server-Side Analysis
Intermediate Representation based on obfuscated source code
The authors design an intermediate representation that consists of source code with all identifiers replaced by generic placeholders. This IR is grounded in the Information Bottleneck principle: it compresses away identifier semantics while preserving structural semantics, and it serves as the bridge between the two decompilation phases.
[12] DnD: A Cross-Architecture deep neural network decompiler
[37] Exploring the potential of LLMs for code deobfuscation
[38] Quantifying and Mitigating the Impact of Obfuscations on Machine-Learning-Based Decompilation Improvement
[39] Overhead prediction in obfuscated programs
[40] ChatDEOB: An Effective Deobfuscation Method Based on Large Language Model
[41] Deep Learning for Obfuscated Code Analysis
[42] Research of the sustainability of a digital watermark embedded through opcodes substitutions in a class file against decompilation attacks
[43] Edge of the Art in Vulnerability Research Version 5
[44] Obfuscation technologies of high-level source code using artificial intelligence
[45] Attention-Based Decompilation Through Neural Machine Translation
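To make the placeholder-based IR described for this contribution concrete, the following is a minimal sketch of identifier obfuscation on a C snippet. It is an illustration under stated assumptions, not the authors' pipeline: the function name `obfuscate_identifiers`, the placeholder scheme `id1`, `id2`, ..., and the regex-based tokenization are all hypothetical. A real implementation would more likely use a C parser, since this sketch does not skip string literals, comments, or preprocessor directives.

```python
import re
from typing import Dict, Tuple

# C keywords and built-in type names that must be kept as-is.
C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if", "int",
    "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile",
    "while",
}

# An identifier starts at a word boundary, so the "x1F" in "0x1F" is not matched.
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*")


def obfuscate_identifiers(source: str) -> Tuple[str, Dict[str, str]]:
    """Replace each non-keyword identifier with a generic placeholder.

    Returns the rewritten source and the original-name -> placeholder map.
    The placeholder format (id1, id2, ...) is an assumption for illustration.
    """
    mapping: Dict[str, str] = {}

    def rename(match):
        name = match.group(0)
        if name in C_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = "id%d" % (len(mapping) + 1)
        return mapping[name]

    return IDENTIFIER.sub(rename, source), mapping


source = (
    "int sum_array(int *values, int count) { int total = 0; "
    "for (int i = 0; i < count; i++) total += values[i]; return total; }"
)
skeleton, names = obfuscate_identifiers(source)
print(skeleton)
```

In this sketch, the rewritten text keeps control flow, types, and operators intact while discarding naming information. That mirrors the split the contribution describes, in which structure is recovered first and names are restored in a later phase.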
Phase-specific reinforcement learning rewards for correctness and readability
The authors develop distinct reinforcement learning objectives for each phase: compiler-based rewards for Structure Recovery to ensure syntactic and semantic correctness, and semantic similarity rewards for Identifier Naming to improve human-centric readability. This allows independent optimization of functional correctness and code readability.
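The separation of the two reward signals can be sketched as follows. This is a hedged illustration only: the reward values (0.0, 0.5, 1.0), the callables `compiles` and `passes_tests`, and the sub-token Jaccard overlap used as a stand-in for a semantic similarity measure are all assumptions, not the rewards defined in the paper.

```python
import re
from typing import Callable, List, Set


def structure_reward(candidate: str,
                     compiles: Callable[[str], bool],
                     passes_tests: Callable[[str], bool]) -> float:
    """Phase 1 stand-in: reward compilability first, then behavioural correctness.

    `compiles` and `passes_tests` are hypothetical hooks that a caller would
    back with a real compiler invocation and a test harness.
    """
    if not compiles(candidate):
        return 0.0
    return 1.0 if passes_tests(candidate) else 0.5


def subtokens(identifier: str) -> Set[str]:
    """Split snake_case and camelCase names into lowercase sub-tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return {part.lower() for part in parts}


def naming_reward(predicted: List[str], reference: List[str]) -> float:
    """Phase 2 stand-in: mean Jaccard overlap of sub-tokens per identifier pair.

    Lexical overlap substitutes here for a semantic similarity model.
    """
    if not reference or len(predicted) != len(reference):
        return 0.0
    scores = []
    for pred, ref in zip(predicted, reference):
        a, b = subtokens(pred), subtokens(ref)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


# Toy usage with stubbed compiler and test hooks.
print(structure_reward("int id1(void) { return 0; }", lambda s: True, lambda s: False))
print(naming_reward(["total_sum", "idx"], ["sum", "idx"]))
```

Because the two functions share no state, each phase can be optimized against its own signal: the first never looks at names, and the second never needs a compiler.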