RCPS-OG: Risk-Controlled Prediction Sets for Ontology Grounding

Adapting Risk-Controlling Prediction Sets for grounding free text to biomedical ontologies.

Python 3.9+ | BioRed | Gilda | ~20% risk reduction

RCPS-OG is a Python package that adapts the Risk-Controlling Prediction Sets (RCPS) framework to biomedical entity disambiguation. Given a free-text mention of a biomedical entity, RCPS-OG uses Gilda's scorer as a base ranker and applies conformal calibration to produce prediction sets with formal, finite-sample risk guarantees: with probability at least 1 − δ over the calibration sample, the rate at which the true grounding is missing from the set is at most a user-specified level α.


Background

Biomedical entity grounding (also called entity linking or normalization) maps surface-form text mentions to canonical identifiers in ontologies such as MeSH, NCBIGene, Cellosaurus, NCBITaxon, and dbSNP. Current tools like Gilda return a ranked list of candidate groundings with associated scores, but they provide no statistical guarantee that the top-ranked candidate is correct.

RCPS-OG addresses this by wrapping the grounding candidate list in a risk-controlling prediction set. Given a calibration corpus of labeled mentions, the method finds the largest threshold on Gilda's score for which the expected proportion of mentions whose correct grounding is excluded from the set stays at or below a user-specified level α, with confidence 1 − δ. Because only candidates scoring at or above the threshold are kept, a larger threshold yields smaller sets, so this choice gives the tightest sets that still meet the risk target. The result is a principled trade-off between set size (precision) and coverage (recall), with guarantees that hold at finite sample sizes.
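
To make the trade-off between set size and coverage concrete, the sketch below sweeps a score threshold over a handful of made-up mentions and reports the empirical false-exclusion rate and the mean set size at each value. The data layout (a list of candidate scores plus the index of the correct candidate) and the function names are assumptions for illustration; they are not part of the RCPS-OG API.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical layout: (candidate scores, index of the correct candidate).
# The index is None when the ranker never proposed the correct grounding.
Mention = Tuple[Sequence[float], Optional[int]]


def false_exclusion_rate(data: Sequence[Mention], threshold: float) -> float:
    """Fraction of mentions whose correct grounding is left out of the set."""
    misses = 0
    for scores, correct_idx in data:
        # A miss: the correct grounding was never proposed, or it scores
        # below the inclusion threshold.
        if correct_idx is None or scores[correct_idx] < threshold:
            misses += 1
    return misses / len(data)


def mean_set_size(data: Sequence[Mention], threshold: float) -> float:
    """Average number of candidates kept at or above the threshold."""
    kept = sum(sum(s >= threshold for s in scores) for scores, _ in data)
    return kept / len(data)


# Toy data, invented for illustration only.
toy = [
    ([0.92, 0.40], 0),
    ([0.75, 0.70, 0.20], 1),
    ([0.60, 0.55], 0),
    ([0.85], None),
    ([0.50, 0.45, 0.30], 2),
]

for lam in (0.2, 0.5, 0.8):
    print(lam, false_exclusion_rate(toy, lam), mean_set_size(toy, lam))
```

Raising the threshold shrinks the sets and raises the miss rate; calibration picks a point on this curve that meets the risk target.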

Experiments on the BioRed dataset demonstrate roughly a 20% reduction in model risk relative to a naïve threshold baseline, across multiple biomedical namespaces.

Installation

From source (recommended)

# Clone the repository
git clone https://github.com/buzgalbraith/rcps_og.git
cd rcps_og

# Install in editable mode (uv or pip)
uv pip install -e .
# — or —
pip install -e .

Dependencies

Core dependencies (resolved automatically):

Gilda term resources

After installation, generate the Gilda term database (takes a few minutes):

python -m gilda.generate_terms

Dataset Setup

RCPS-OG is calibrated and evaluated on BioRed, a richly annotated biomedical named-entity and relation corpus from the NLM. Run the following two steps to prepare the calibration dataset:

# Step 1 – pull the raw dataset from NIH
bash scripts/pull_BioRed.sh

# Step 2 – extract into a tabular format for calibration
python scripts/load_BioRed.py

This produces a structured CSV of (mention, namespace, correct_id, gilda_candidates) rows that the calibration routines consume.
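
As a rough illustration of working with rows of this shape, the sketch below writes and reads a small in-memory CSV with the four columns named above. The encoding of gilda_candidates as a JSON list of [id, score] pairs is an assumption made for this example, as are the helper names and placeholder IDs; the file written by scripts/load_BioRed.py may serialize candidates differently.

```python
import csv
import io
import json
from typing import Dict, List

FIELDS = ["mention", "namespace", "correct_id", "gilda_candidates"]


def dump_rows(rows: List[Dict]) -> str:
    """Serialize rows to CSV text, JSON-encoding the candidate list (assumed format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        encoded = dict(row)
        encoded["gilda_candidates"] = json.dumps(row["gilda_candidates"])
        writer.writerow(encoded)
    return buffer.getvalue()


def load_rows(text: str) -> List[Dict]:
    """Parse CSV text back into rows with (id, score) candidate tuples."""
    rows = []
    for row in csv.DictReader(io.StringIO(text, newline="")):
        pairs = json.loads(row["gilda_candidates"])
        row["gilda_candidates"] = [(cid, float(score)) for cid, score in pairs]
        rows.append(row)
    return rows


# Placeholder identifiers, not real ontology IDs.
sample = [
    {
        "mention": "example mention",
        "namespace": "mesh",
        "correct_id": "ID_A",
        "gilda_candidates": [("ID_A", 0.8), ("ID_B", 0.3)],
    }
]
print(load_rows(dump_rows(sample)))
```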

RCPSPredictor

The main interface for producing risk-controlled prediction sets over Gilda's candidate groundings.

class rcps_og.RCPSPredictor(alpha=0.1, namespace=None, delta=0.05)

Wraps Gilda's scorer to produce prediction sets with a bounded false-exclusion rate.

Parameter   Type         Description
alpha       float        Target risk level (false-exclusion rate). Default 0.1.
namespace   str | None   Restrict groundings to a specific namespace (e.g. "mesh", "ncbigene"). None searches all.
delta       float        Confidence parameter for the RCPS bound. Default 0.05.

.calibrate(mentions, labels)

Fit the risk-controlling threshold on a calibration set.

Parameter   Type        Description
mentions    list[str]   Free-text entity mention strings.
labels      list[str]   Corresponding ground-truth ontology IDs.

.predict(mention) → list[ScoredMatch]

Returns the calibrated prediction set for a single mention — all candidate groundings whose Gilda score is at or above the learned threshold.

.predict_batch(mentions) → list[list[ScoredMatch]]

Batched version of .predict.
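
The thresholding step described for .predict can be sketched in a few lines. The Candidate type and both function names below are hypothetical stand-ins, not the package's classes; the sketch only shows the "keep everything at or above the threshold" rule applied to one mention and to a batch.

```python
from typing import List, NamedTuple, Sequence


class Candidate(NamedTuple):
    """Hypothetical stand-in for a scored grounding candidate."""
    identifier: str
    score: float


def prediction_set(candidates: Sequence[Candidate], threshold: float) -> List[Candidate]:
    """Keep candidates scoring at or above the threshold, best first."""
    kept = [c for c in candidates if c.score >= threshold]
    return sorted(kept, key=lambda c: c.score, reverse=True)


def prediction_sets(batch: Sequence[Sequence[Candidate]], threshold: float) -> List[List[Candidate]]:
    """Apply the same threshold to every mention in a batch."""
    return [prediction_set(candidates, threshold) for candidates in batch]


# Placeholder identifiers and scores, invented for illustration.
example = [Candidate("ID_B", 0.35), Candidate("ID_A", 0.90), Candidate("ID_C", 0.60)]
print(prediction_set(example, threshold=0.5))
```

A set can be empty when no candidate clears the threshold, so callers may want to handle that case explicitly.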


GroundingCalibrator

Lower-level utility for computing RCPS thresholds from pre-computed Gilda scores, useful when integrating with other pipelines (e.g. INDRA/DGLink).

class rcps_og.GroundingCalibrator(alpha=0.1, delta=0.05)

Computes the largest score threshold λ that still satisfies the RCPS risk bound, which gives the smallest prediction sets that meet the target risk.

Parameter   Type    Description
alpha       float   Risk tolerance. Default 0.1.
delta       float   Failure probability for the bound; the guarantee holds with confidence 1 − δ. Default 0.05.

.fit(scores, correct_flags) → float

Computes threshold λ from arrays of Gilda scores and binary correctness indicators. Returns the threshold value.

Parameter       Type               Description
scores          np.ndarray         Gilda scores for top candidate per mention.
correct_flags   np.ndarray[bool]   Whether the top candidate is the correct grounding.
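
One way such a threshold could be computed is sketched below using a Hoeffding upper confidence bound on the empirical risk. This is an assumption for illustration: the package may use a different (tighter) bound, and fit_threshold is a hypothetical name, not the library's method. The sketch counts a mention as a miss when its top candidate is incorrect or scores below the threshold, and uses plain Python sequences instead of NumPy arrays to stay dependency-free.

```python
import math
from typing import Optional, Sequence


def hoeffding_slack(n: int, delta: float) -> float:
    """Width of a one-sided Hoeffding bound for a mean of n losses in [0, 1]."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def fit_threshold(
    scores: Sequence[float],
    correct_flags: Sequence[bool],
    alpha: float = 0.1,
    delta: float = 0.05,
) -> Optional[float]:
    """Largest observed score whose upper-bounded risk stays at or below alpha.

    Returns None when no threshold meets the bound (for example, when the
    calibration set is too small for the chosen alpha and delta).
    """
    n = len(scores)
    if n == 0 or n != len(correct_flags):
        raise ValueError("scores and correct_flags must be non-empty and equal in length")
    slack = hoeffding_slack(n, delta)
    best = None
    # Empirical risk is non-decreasing in the threshold, so scan upward
    # and stop at the first violation. Quadratic in n; fine for a sketch.
    for lam in sorted(set(scores)):
        misses = sum(1 for s, ok in zip(scores, correct_flags) if not ok or s < lam)
        if misses / n + slack <= alpha:
            best = lam
        else:
            break
    return best


# Synthetic calibration data: 1000 mentions, every 50th top candidate wrong.
demo_scores = [(i + 1) / 1000 for i in range(1000)]
demo_flags = [i % 50 != 0 for i in range(1000)]
print(fit_threshold(demo_scores, demo_flags))
```

With α = 0.1 and δ = 0.05 the Hoeffding slack alone exceeds α for very small calibration sets, which is why the sketch can return None.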

Quick Start

import pandas as pd
from rcps_og import RCPSPredictor

# Load the BioRed calibration split produced by scripts/load_BioRed.py
df = pd.read_csv("data/biored_calibration.csv")
cal, test = df.iloc[:800], df.iloc[800:]

# Instantiate predictor at 10 % risk, restricted to MeSH groundings
predictor = RCPSPredictor(alpha=0.10, namespace="mesh")

# Calibrate on held-out calibration mentions
predictor.calibrate(
    mentions=cal["mention"].tolist(),
    labels=cal["correct_id"].tolist(),
)

# Produce a prediction set for a new mention
prediction_set = predictor.predict("BRCA1")
for match in prediction_set:
    print(match.term.id, match.score)

# Evaluate empirical risk on the test split
risk = predictor.evaluate_risk(
    mentions=test["mention"].tolist(),
    labels=test["correct_id"].tolist(),
)
print(f"Empirical risk: {risk:.3f}")
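
The Quick Start takes the first 800 rows as the calibration split. Finite-sample guarantees of this kind generally assume that calibration and test data are exchangeable, so if the CSV happens to be ordered (for example by document), a shuffled split may be safer. The helper below is a minimal sketch with a hypothetical name, using only the standard library.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffled_split(rows: Sequence[T], n_calibration: int, seed: int = 0) -> Tuple[List[T], List[T]]:
    """Randomly partition rows into a calibration part and a test part."""
    if not 0 < n_calibration < len(rows):
        raise ValueError("n_calibration must be between 1 and len(rows) - 1")
    indices = list(range(len(rows)))
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(indices)
    calibration = [rows[i] for i in indices[:n_calibration]]
    test = [rows[i] for i in indices[n_calibration:]]
    return calibration, test


rows = list(range(10))
cal_rows, test_rows = shuffled_split(rows, n_calibration=8, seed=0)
print(len(cal_rows), len(test_rows))
```

With pandas, shuffling the frame first via df.sample(frac=1, random_state=0) before slicing has the same effect.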

Results

Empirical risk and average prediction set size on the BioRed test split at α = 0.10. RCPS-OG consistently controls risk below the target while shrinking set sizes relative to an uncalibrated baseline.

Namespace     Baseline risk   RCPS-OG risk   Avg set size
mesh          0.127           0.097          1.8
ncbigene      0.134           0.099          2.1
cellosaurus   0.118           0.091          1.6
ncbitaxon     0.141           0.096          2.4
dbsnp         0.129           0.094          1.5

All methods use Gilda as the underlying scorer. α = 0.10, δ = 0.05.
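
For readers who want the relative risk reduction per namespace, the short calculation below derives it from the baseline and RCPS-OG risk columns of the table above. Only the table's numbers are used; the dictionary layout and function name are illustrative.

```python
# (baseline risk, RCPS-OG risk) copied from the results table.
results = {
    "mesh": (0.127, 0.097),
    "ncbigene": (0.134, 0.099),
    "cellosaurus": (0.118, 0.091),
    "ncbitaxon": (0.141, 0.096),
    "dbsnp": (0.129, 0.094),
}


def relative_reduction(baseline: float, calibrated: float) -> float:
    """Fractional drop in risk relative to the baseline."""
    return (baseline - calibrated) / baseline


for namespace, (baseline, calibrated) in results.items():
    pct = 100 * relative_reduction(baseline, calibrated)
    print(f"{namespace:12s} {pct:.1f}%")
```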

Citation

If you use RCPS-OG in your research, please cite:

Galbraith, W. (2025). Risk-Controlled Prediction Sets for Biomedical Ontology Grounding. Northeastern University. https://github.com/buzgalbraith/rcps_og

For support, please open an issue on GitHub or reach out via the Gyori Lab.