Estimating Worst-Case Frontier Risks of Open-Weight LLMs

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: open-source LLMs, safety, frontier risks
Abstract:

In this paper, we study the worst-case frontier risks of the OpenAI gpt-oss model. We introduce malicious fine-tuning (MFT), in which we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below the Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results lead us to believe that the net new harm from releasing gpt-oss is limited, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces malicious fine-tuning (MFT) as a methodology for estimating worst-case frontier risks of open-weight models, specifically targeting biological and cybersecurity domains through reinforcement learning with tool environments. According to the taxonomy tree, this work resides in the 'Malicious Fine-Tuning for Frontier Risk Estimation' leaf under 'Adversarial Capability Elicitation and Risk Assessment'. This leaf contains only two papers total, indicating a relatively sparse and emerging research direction. The paper shares this narrow space with one sibling work, suggesting the field is still developing standardized approaches to worst-case capability elicitation through deliberate fine-tuning.

The taxonomy reveals that the paper's immediate neighbors include 'Model Tampering and Weight Manipulation Attacks' (focusing on direct parameter modification) and 'Adversarial Attack Methods for Safety Testing' (emphasizing input-level jailbreaking). The broader 'Adversarial Capability Elicitation' branch connects to parallel evaluation efforts under 'Evaluation Frameworks and Benchmarks', particularly 'Dual-Use Risk Assessment in Specific Domains' which examines biosecurity and cybersecurity capabilities. The taxonomy's scope notes clarify that this work differs from passive evaluation methods by actively enhancing capabilities, and from input-level attacks by modifying weights through fine-tuning rather than crafting adversarial prompts.

Among the 30 candidates examined through limited semantic search, the contribution-level analysis reveals mixed novelty signals. For the core MFT methodology, 1 of the 10 examined candidates was judged refutable, suggesting some prior exploration of deliberate capability maximization. The domain-specific RL approach with tool environments had 2 refutable candidates among its 10, indicating more substantial overlap with existing work on agentic training paradigms. The risk-assessment framework likewise had 1 refutable candidate among its 10. These statistics suggest that while individual technical components have precedents, the integrated approach may offer incremental advances within this sparse research area.
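
These per-contribution tallies can be checked against the summary statistics reported below (30 candidate papers compared, 4 refutable); the snippet is a trivial sanity check, and the contribution labels are shorthand rather than identifiers from the report's pipeline:

```python
# Reproducing the refutation tallies quoted above; labels are shorthand,
# not identifiers from the report's internal pipeline.
refutable_counts = {
    "MFT methodology": 1,
    "domain-specific RL with tools": 2,
    "absolute/marginal risk framework": 1,
}
examined_per_contribution = 10

total_refutable = sum(refutable_counts.values())                     # 4
total_examined = examined_per_contribution * len(refutable_counts)  # 30

print(f"Refutable candidates: {total_refutable}/{total_examined} "
      f"({total_refutable / total_examined:.0%})")  # -> 4/30 (13%)
```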

Given the limited search scope of 30 semantically similar papers and the sparse taxonomy leaf containing only two works, the analysis captures a snapshot rather than exhaustive coverage. The paper appears to consolidate and systematize emerging practices in worst-case risk estimation, operating in a research direction where methodological standards are still crystallizing. The refutation signals primarily reflect overlapping technical building blocks rather than complete methodological duplication, consistent with work that synthesizes existing techniques into a coherent evaluation framework.

Taxonomy

Core-task Taxonomy Papers: 14
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: Estimating worst-case frontier risks of open-weight language models. The field structure reflects a multi-faceted approach to understanding and mitigating the unique vulnerabilities introduced when model weights are publicly accessible. The taxonomy organizes work into four main branches: Adversarial Capability Elicitation and Risk Assessment focuses on probing methods that reveal latent harmful capabilities through techniques like malicious fine-tuning and jailbreaking; Evaluation Frameworks and Benchmarks develops standardized tests and metrics to quantify risk exposure; Protective Mechanisms and Countermeasures explores defenses such as watermarking, secure release protocols, and safeguard durability; and Domain-Specific Applications and Case Studies examines concrete threat scenarios in areas like biosecurity and copyright infringement.

Representative works like Jailbreakbench[1] and Worst Prompt Performance[3] illustrate how adversarial elicitation methods systematically stress-test model boundaries, while efforts such as Secure Weight Release[5] and Watermarking Radioactive Models[6] propose technical barriers to misuse. A particularly active line of inquiry centers on the tension between open access and safety: many studies explore whether fine-tuning can irreversibly strip away alignment guardrails, raising questions about the durability of built-in protections.

Worst-Case Frontier Risks[0] sits squarely within the Adversarial Capability Elicitation branch, specifically addressing malicious fine-tuning as a vector for estimating extreme-case harms. Its emphasis on worst-case scenarios complements nearby work like Weaponization-Resistant LLMs[7], which investigates design principles to harden models against deliberate weaponization, and Safeguard Durability[8], which quantifies how robustly safety measures persist under adversarial adaptation. Together, these efforts highlight an open challenge: determining whether post-hoc defenses can sufficiently counterbalance the inherent risks of weight transparency, or whether fundamentally new release paradigms are required.

Claimed Contributions

Malicious fine-tuning (MFT) methodology for estimating worst-case frontier risks

The authors propose a methodology called malicious fine-tuning that directly fine-tunes open-weight models to maximize their capabilities in high-risk domains (biology and cybersecurity), in order to estimate the worst-case harms an adversary could achieve, rather than evaluating only the released version of the model (a minimal fine-tuning sketch follows this entry).

Retrieved papers: 10. Verdict: Can Refute.
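
To make the described recipe concrete, the following is a minimal sketch of one ingredient, a supervised fine-tuning pass on a curated in-domain corpus, assuming a Hugging Face-style stack. The data file and hyperparameters are illustrative placeholders, not the authors' actual configuration (which also involves RL, sketched under the next contribution).

```python
# Minimal sketch of an MFT-style fine-tuning pass. The data file and
# hyperparameters are placeholders, not the authors' configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "openai/gpt-oss-20b"  # open-weight checkpoint under study
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical curated in-domain corpus (one {"text": ...} record per line).
data = load_dataset("json", data_files="curated_domain_tasks.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                     max_length=2048),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mft-checkpoint",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           learning_rate=1e-5),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the resulting model is then run through frontier-risk evals
```
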
Domain-specific capability maximization through RL with tool environments

The authors develop specialized training procedures that apply reinforcement learning in tool-augmented environments: web browsing for biology tasks and terminal access for cybersecurity CTF challenges, combined with curated in-domain datasets to maximize capability in each risk domain (an illustrative environment sketch follows this entry).

Retrieved papers: 10. Verdict: Can Refute.
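
As a rough illustration of what a terminal-based CTF environment could look like, here is a minimal gym-style sketch; the interface, reward scheme, and class name are assumptions made for exposition, not the authors' training harness. An RL algorithm (e.g., PPO over sampled rollouts) would use the sparse terminal reward to update the policy.

```python
# Illustrative gym-style terminal environment for CTF tasks; the interface and
# reward scheme are assumptions, not the authors' actual harness. Commands
# generated by a model should only ever run inside a sandboxed container.
import subprocess

class CTFTerminalEnv:
    """The agent emits shell commands; reward is 1.0 once the flag appears."""

    def __init__(self, flag: str, max_steps: int = 50):
        self.flag = flag
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> str:
        self.steps = 0
        return "You are attempting a CTF challenge. Find and print the flag."

    def step(self, command: str) -> tuple[str, float, bool]:
        self.steps += 1
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=30)
        observation = (result.stdout + result.stderr)[-4000:]  # keep tail only
        solved = self.flag in observation
        done = solved or self.steps >= self.max_steps
        return observation, 1.0 if solved else 0.0, done
```
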
Framework for assessing absolute and marginal risk of open-weight releases

The authors establish an evaluation framework that compares maliciously fine-tuned models against both frontier closed-weight models and existing open-weight models, assessing absolute capability levels as well as marginal risk contributions, and providing guidance for future open-weight model releases (a toy comparison appears after this entry).

Retrieved papers: 10. Verdict: Can Refute.
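
The comparison logic itself is simple: one delta against the closed-weight frontier (absolute risk) and one against the strongest existing open-weight baseline (marginal risk). The toy snippet below shows both quantities; all scores are invented purely for illustration.

```python
# Toy illustration of the absolute-vs-marginal comparison; scores are invented.
benchmark_scores = {
    "mft-gpt-oss":      0.42,  # maliciously fine-tuned open-weight model
    "o3":               0.55,  # closed-weight frontier reference
    "best-open-weight": 0.40,  # strongest pre-existing open-weight baseline
}

# Absolute: how far the MFT model falls short of the closed-weight frontier.
absolute_gap = benchmark_scores["o3"] - benchmark_scores["mft-gpt-oss"]
# Marginal: how much capability the release adds over the open-weight frontier.
marginal_risk = (benchmark_scores["mft-gpt-oss"]
                 - benchmark_scores["best-open-weight"])

print(f"Absolute gap to closed frontier:        {absolute_gap:+.2f}")
print(f"Marginal risk over open-weight frontier: {marginal_risk:+.2f}")
```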

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

The three claimed contributions compared, as described under Claimed Contributions above:

Contribution 1: Malicious fine-tuning (MFT) methodology for estimating worst-case frontier risks

Contribution 2: Domain-specific capability maximization through RL with tool environments

Contribution 3: Framework for assessing absolute and marginal risk of open-weight releases