RedSage: A Cybersecurity Generalist LLM

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: cybersecurity, large language models, data-driven, dataset curation, agentic augmentation, continual pretraining, supervised fine-tuning, benchmark evaluation, open-source
Abstract:

Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code will be released to advance reproducibility and open cybersecurity LLM research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a domain-adapted cybersecurity assistant through continual pretraining on 11.8B tokens, agentic augmentation for supervised fine-tuning, and a comprehensive benchmark. It resides in the 'Decoder-Based and Generalist Model Adaptation' leaf, which contains five papers total, including the original work. This leaf sits within the broader 'Domain-Specific Language Model Development' branch, indicating a moderately populated research direction focused on adapting generalist LLMs to cybersecurity through curated corpora and specialized pretraining strategies.

The taxonomy reveals neighboring leaves addressing encoder-based adaptation (five papers on BERT-family models) and specialized corpus construction (three papers on dataset curation). The decoder-based leaf explicitly excludes encoder-only approaches, positioning RedSage among works that adapt generalist architectures rather than building domain-specific encoders from scratch. Sibling papers in this leaf explore IoT-specific adaptation and domain fine-tuning strategies, suggesting the work connects to a cluster investigating how to efficiently specialize large language models for security operations without full retraining.

Among the thirty candidates examined (ten per contribution), the continual pretraining corpus shows no clear refutation, the agentic augmentation pipeline encounters one potentially overlapping prior work, and the benchmark likewise shows no refutation. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. Within this sample, the corpus and benchmark contributions appear more distinctive, while the augmentation pipeline faces at least one substantive overlap with prior work.

Based on the top-30 semantic matches examined, the work appears to occupy established territory in decoder-based cybersecurity adaptation, with the corpus scale and benchmark scope potentially offering incremental advances. The taxonomy structure suggests this is a moderately active research direction rather than a sparse frontier, and the contribution-level statistics indicate mixed novelty across the three claimed innovations within the limited literature sample reviewed.

Taxonomy

- Core-task taxonomy papers: 50
- Claimed contributions: 3
- Contribution candidate papers compared: 30
- Refutable papers: 1

Research Landscape Overview

Core task: domain-aware language model training for cybersecurity operations. The field has evolved into a rich ecosystem organized around several complementary branches. Domain-Specific Language Model Development focuses on building specialized encoders and decoders tailored to security corpora, with works like SecureBERT[3] and CyberLlama[4] exemplifying encoder-based and generalist adaptation strategies. Application-Oriented Model Specialization targets concrete operational needs such as threat intelligence extraction, vulnerability analysis, and intrusion detection, often leveraging domain-adapted representations for downstream tasks.

Training Methodologies and Optimization Techniques explore efficient adaptation strategies, ranging from continued pretraining on security texts to parameter-efficient fine-tuning and retrieval-augmented generation, while Autonomous Cyber Defense and Adaptive Systems investigate agentic frameworks that combine language models with planning and decision-making capabilities. Evaluation, Benchmarking, and Surveys provide critical infrastructure through datasets, metrics, and comprehensive reviews like LLM Cybersecurity State[1] and LLM Cyber Review[2], and Cross-Domain and Emerging Applications extend these methods to IoT, satellite communications, and financial security contexts.

Within the decoder-based and generalist adaptation cluster, a central tension emerges between building large-scale foundation models from scratch and efficiently adapting existing generalist architectures to security-specific vocabularies and reasoning patterns. RedSage[0] sits squarely in this space, emphasizing domain-aware pretraining for cybersecurity operations alongside neighbors like IoT Domain Adaptive[26] and Domain Finetuning Cyber[38], which explore targeted adaptation strategies for specialized subdomains.
Compared to FoundationAI SecurityLLM[48], which pursues a broad foundation model approach, RedSage[0] appears to prioritize operational relevance and domain-specific linguistic nuances. The broader landscape reveals ongoing questions about the trade-offs between model scale, domain corpus quality, and task-specific fine-tuning depth, with many studies investigating how to balance generalization across security tasks against deep specialization for particular operational workflows.

Claimed Contributions

Large-scale cybersecurity continual pretraining corpus

The authors curate CyberFineWeb by filtering FineWeb with a fine-tuned classifier and mixing it with general knowledge data, plus RedSage-Seed containing 28.6K high-quality documents from authoritative cybersecurity sources. This corpus enables domain-aware continual pretraining for cybersecurity LLMs.
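The corpus is built by scoring web documents with a fine-tuned relevance classifier and keeping those above a threshold. As a minimal, dependency-free sketch, a keyword-overlap scorer stands in for the paper's fine-tuned classifier; the term list, scoring rule, and threshold below are illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch of classifier-based corpus filtering. A keyword-overlap score
# stands in for a fine-tuned classifier's relevance probability.
SECURITY_TERMS = {"exploit", "vulnerability", "malware", "cve", "phishing",
                  "firewall", "mitre", "payload", "encryption", "intrusion"}

def relevance_score(doc: str) -> float:
    """Fraction of tokens that are cybersecurity terms (stand-in for a
    learned classifier's probability of domain relevance)."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    return sum(t.strip(".,") in SECURITY_TERMS for t in tokens) / len(tokens)

def filter_corpus(docs, threshold=0.1):
    """Keep documents scored as cybersecurity-relevant."""
    return [d for d in docs if relevance_score(d) >= threshold]
```

In the real pipeline the scorer would be a model fine-tuned on labeled seed documents, and the filtered web text would then be mixed with general knowledge data and the RedSage-Seed documents before continual pretraining.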

Retrieved papers compared: 10

Agentic augmentation pipeline for cybersecurity SFT data

The authors design an agentic framework with Planner and Augmenter agents that transforms curated seed data into 266K multi-turn cybersecurity dialogues simulating expert workflows. This pipeline scales efficiently while preserving technical depth across knowledge, skills, and tool proficiency.
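The Planner/Augmenter loop can be outlined schematically as below. The `call_llm` stub, prompt wording, and step count are placeholder assumptions for illustration; the paper's actual agent prompts and orchestration may differ.

```python
# Schematic Planner -> Augmenter loop that expands a seed document into a
# multi-turn dialogue. call_llm() is a stub standing in for a real model call.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    # Placeholder for an actual LLM API call.
    return f"[model response to: {prompt[:40]}...]"

@dataclass
class Dialogue:
    seed: str
    turns: list = field(default_factory=list)

def plan(seed_doc: str, n_steps: int = 3) -> list:
    """Planner agent: derive a workflow outline from a seed document."""
    outline = call_llm(f"Outline a {n_steps}-step expert workflow for: {seed_doc}")
    return [f"step {i + 1}: {outline}" for i in range(n_steps)]

def augment(seed_doc: str) -> Dialogue:
    """Augmenter agent: expand each planned step into a user/assistant turn."""
    dlg = Dialogue(seed=seed_doc)
    for step in plan(seed_doc):
        user = call_llm(f"Write a practitioner question for: {step}")
        assistant = call_llm(f"Answer using the seed context: {seed_doc}\n{step}")
        dlg.turns.append({"user": user, "assistant": assistant})
    return dlg
```

Running `augment` over each of the 28.6K seed documents (with varied plans per seed) is how a pipeline of this shape could scale to hundreds of thousands of multi-turn samples while keeping every turn grounded in curated source material.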

Retrieved papers compared: 10
Status: one retrieved candidate can refute this contribution

RedSage-Bench comprehensive cybersecurity benchmark

The authors create a new benchmark covering three dimensions (knowledge, skills, tool expertise) with 30K multiple-choice questions and 240 open-ended items evaluated using LLM-as-judge scoring. This addresses gaps in existing benchmarks that omit tool proficiency and qualitative free-response assessment.
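For the 240 open-ended items, LLM-as-judge scoring typically prompts a judge model with a rubric and parses a structured verdict. The sketch below illustrates that parsing loop; the `judge()` stub, rubric text, and 1-5 scale are assumptions, not RedSage-Bench's actual rubric.

```python
# Sketch of LLM-as-judge scoring for open-ended Q&A items. judge() is a
# stub for a real judge-model call that returns a parseable verdict.
import re

RUBRIC = ("Rate the answer 1-5 for technical accuracy and completeness. "
          "Reply exactly as: SCORE: <n>")

def judge(question: str, reference: str, answer: str) -> str:
    # Placeholder for an actual judge-model API call using RUBRIC.
    return "SCORE: 4"

def score_item(question: str, reference: str, answer: str) -> int:
    """Extract the numeric score from the judge's verdict."""
    verdict = judge(question, reference, answer)
    m = re.search(r"SCORE:\s*([1-5])", verdict)
    if m is None:
        raise ValueError(f"unparseable judge output: {verdict!r}")
    return int(m.group(1))

def mean_score(items):
    """Average judge score over (question, reference, answer) triples."""
    scores = [score_item(q, r, a) for q, r, a in items]
    return sum(scores) / len(scores)
```

Strict verdict parsing (with an explicit failure path) matters in practice, since judge models occasionally deviate from the requested output format.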

Retrieved papers compared: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
