Estimating Worst-Case Frontier Risks of Open-Weight LLMs

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: open-source LLMs, safety, frontier risks
Abstract:

In this paper, we study the worst-case frontier risks of the OpenAI gpt-oss model. We introduce malicious fine-tuning (MFT), in which we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below the Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results lead us to believe that the net new harm from releasing gpt-oss is limited, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces malicious fine-tuning (MFT) as a methodology for estimating worst-case frontier risks of open-weight models, specifically targeting biological and cybersecurity domains through reinforcement learning with tool environments. According to the taxonomy tree, this work resides in the 'Malicious Fine-Tuning for Frontier Risk Estimation' leaf under 'Adversarial Capability Elicitation and Risk Assessment'. This leaf contains only two papers total, indicating a relatively sparse and emerging research direction. The paper shares this narrow space with one sibling work, suggesting the field is still developing standardized approaches to worst-case capability elicitation through deliberate fine-tuning.

The taxonomy reveals that the paper's immediate neighbors include 'Model Tampering and Weight Manipulation Attacks' (focusing on direct parameter modification) and 'Adversarial Attack Methods for Safety Testing' (emphasizing input-level jailbreaking). The broader 'Adversarial Capability Elicitation' branch connects to parallel evaluation efforts under 'Evaluation Frameworks and Benchmarks', particularly 'Dual-Use Risk Assessment in Specific Domains' which examines biosecurity and cybersecurity capabilities. The taxonomy's scope notes clarify that this work differs from passive evaluation methods by actively enhancing capabilities, and from input-level attacks by modifying weights through fine-tuning rather than crafting adversarial prompts.

Among the 30 candidates examined through limited semantic search, the contribution-level analysis reveals mixed novelty signals. For the core MFT methodology, 1 of the 10 examined candidates was judged refutable, suggesting some prior exploration of deliberate capability maximization. The domain-specific RL approach with tool environments had 2 refutable candidates among its 10, indicating more substantial overlap with existing work on agentic training paradigms. The risk-assessment framework likewise had 1 refutable candidate among its 10. These statistics suggest that while individual technical components have precedents, the integrated approach may offer incremental advances within this sparse research area.
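
These per-contribution tallies can be checked against the summary statistics reported below (30 candidate papers compared, 4 refutable); the snippet is a trivial sanity check, and the contribution labels are shorthand rather than identifiers from the report's pipeline:

```python
# Reproducing the refutation tallies quoted above; labels are shorthand,
# not identifiers from the report's internal pipeline.
refutable_counts = {
    "MFT methodology": 1,
    "domain-specific RL with tools": 2,
    "absolute/marginal risk framework": 1,
}
examined_per_contribution = 10

total_refutable = sum(refutable_counts.values())                     # 4
total_examined = examined_per_contribution * len(refutable_counts)  # 30

print(f"Refutable candidates: {total_refutable}/{total_examined} "
      f"({total_refutable / total_examined:.0%})")  # -> 4/30 (13%)
```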

Given the limited search scope of 30 semantically similar papers and the sparse taxonomy leaf containing only two works, the analysis captures a snapshot rather than exhaustive coverage. The paper appears to consolidate and systematize emerging practices in worst-case risk estimation, operating in a research direction where methodological standards are still crystallizing. The refutation signals primarily reflect overlapping technical building blocks rather than complete methodological duplication, consistent with work that synthesizes existing techniques into a coherent evaluation framework.

Taxonomy

Core-task Taxonomy Papers: 14
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: Estimating worst-case frontier risks of open-weight language models. The field structure reflects a multi-faceted approach to understanding and mitigating the unique vulnerabilities introduced when model weights are publicly accessible. The taxonomy organizes work into four main branches: Adversarial Capability Elicitation and Risk Assessment focuses on probing methods that reveal latent harmful capabilities through techniques like malicious fine-tuning and jailbreaking; Evaluation Frameworks and Benchmarks develops standardized tests and metrics to quantify risk exposure; Protective Mechanisms and Countermeasures explores defenses such as watermarking, secure release protocols, and safeguard durability; and Domain-Specific Applications and Case Studies examines concrete threat scenarios in areas like biosecurity and copyright infringement.

Representative works like Jailbreakbench[1] and Worst Prompt Performance[3] illustrate how adversarial elicitation methods systematically stress-test model boundaries, while efforts such as Secure Weight Release[5] and Watermarking Radioactive Models[6] propose technical barriers to misuse. A particularly active line of inquiry centers on the tension between open access and safety: many studies explore whether fine-tuning can irreversibly strip away alignment guardrails, raising questions about the durability of built-in protections.

Worst-Case Frontier Risks[0] sits squarely within the Adversarial Capability Elicitation branch, specifically addressing malicious fine-tuning as a vector for estimating extreme-case harms. Its emphasis on worst-case scenarios complements nearby work like Weaponization-Resistant LLMs[7], which investigates design principles to harden models against deliberate weaponization, and Safeguard Durability[8], which quantifies how robustly safety measures persist under adversarial adaptation. Together, these efforts highlight an open challenge: determining whether post-hoc defenses can sufficiently counterbalance the inherent risks of weight transparency, or whether fundamentally new release paradigms are required.

Claimed Contributions

Malicious fine-tuning (MFT) methodology for estimating worst-case frontier risks

The authors propose a methodology called malicious fine-tuning that directly fine-tunes open-weight models to maximize their capabilities in high-risk domains (biology and cybersecurity), in order to estimate the worst-case harms an adversary could achieve, rather than evaluating only the released version of the model (a minimal fine-tuning sketch follows this entry).

Retrieved papers: 10. Verdict: Can Refute.
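
To make the described recipe concrete, the following is a minimal sketch of one ingredient, a supervised fine-tuning pass on a curated in-domain corpus, assuming a Hugging Face-style stack. The data file and hyperparameters are illustrative placeholders, not the authors' actual configuration (which also involves RL, sketched under the next contribution).

```python
# Minimal sketch of an MFT-style fine-tuning pass. The data file and
# hyperparameters are placeholders, not the authors' configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "openai/gpt-oss-20b"  # open-weight checkpoint under study
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical curated in-domain corpus (one {"text": ...} record per line).
data = load_dataset("json", data_files="curated_domain_tasks.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                     max_length=2048),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mft-checkpoint",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           learning_rate=1e-5),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the resulting model is then run through frontier-risk evals
```
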
Domain-specific capability maximization through RL with tool environments

The authors develop specialized training procedures that apply reinforcement learning in tool-augmented environments: web browsing for biology tasks and terminal access for cybersecurity CTF challenges, combined with curated in-domain datasets to maximize capability in each risk domain (an illustrative environment sketch follows this entry).

Retrieved papers: 10. Verdict: Can Refute.
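
As a rough illustration of what a terminal-based CTF environment could look like, here is a minimal gym-style sketch; the interface, reward scheme, and class name are assumptions made for exposition, not the authors' training harness. An RL algorithm (e.g., PPO over sampled rollouts) would use the sparse terminal reward to update the policy.

```python
# Illustrative gym-style terminal environment for CTF tasks; the interface and
# reward scheme are assumptions, not the authors' actual harness. Commands
# generated by a model should only ever run inside a sandboxed container.
import subprocess

class CTFTerminalEnv:
    """The agent emits shell commands; reward is 1.0 once the flag appears."""

    def __init__(self, flag: str, max_steps: int = 50):
        self.flag = flag
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> str:
        self.steps = 0
        return "You are attempting a CTF challenge. Find and print the flag."

    def step(self, command: str) -> tuple[str, float, bool]:
        self.steps += 1
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=30)
        observation = (result.stdout + result.stderr)[-4000:]  # keep tail only
        solved = self.flag in observation
        done = solved or self.steps >= self.max_steps
        return observation, 1.0 if solved else 0.0, done
```
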
Framework for assessing absolute and marginal risk of open-weight releases

The authors establish an evaluation framework that compares maliciously fine-tuned models against both frontier closed-weight models and existing open-weight models, assessing absolute capability levels as well as marginal risk contributions, and providing guidance for future open-weight model releases (a toy comparison appears after this entry).

Retrieved papers: 10. Verdict: Can Refute.
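
The comparison logic itself is simple: one delta against the closed-weight frontier (absolute risk) and one against the strongest existing open-weight baseline (marginal risk). The toy snippet below shows both quantities; all scores are invented purely for illustration.

```python
# Toy illustration of the absolute-vs-marginal comparison; scores are invented.
benchmark_scores = {
    "mft-gpt-oss":      0.42,  # maliciously fine-tuned open-weight model
    "o3":               0.55,  # closed-weight frontier reference
    "best-open-weight": 0.40,  # strongest pre-existing open-weight baseline
}

# Absolute: how far the MFT model falls short of the closed-weight frontier.
absolute_gap = benchmark_scores["o3"] - benchmark_scores["mft-gpt-oss"]
# Marginal: how much capability the release adds over the open-weight frontier.
marginal_risk = (benchmark_scores["mft-gpt-oss"]
                 - benchmark_scores["best-open-weight"])

print(f"Absolute gap to closed frontier:        {absolute_gap:+.2f}")
print(f"Marginal risk over open-weight frontier: {marginal_risk:+.2f}")
```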

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

The three claimed contributions compared, as described under Claimed Contributions above:

Contribution 1: Malicious fine-tuning (MFT) methodology for estimating worst-case frontier risks

Contribution 2: Domain-specific capability maximization through RL with tool environments

Contribution 3: Framework for assessing absolute and marginal risk of open-weight releases