RCPS-OG: Risk-Controlled Prediction Sets for Ontology Grounding
Adapting Risk-Controlling Prediction Sets for grounding free text to biomedical ontologies.
Welcome to RCPS-OG! This is a Python package that adapts the Risk-Controlling Prediction Sets (RCPS) framework to the problem of biomedical entity disambiguation. Given a free-text mention of a biomedical entity, RCPS-OG uses Gilda's scorer as a base ranker and applies conformal calibration to produce prediction sets that carry formal, finite-sample risk guarantees: with high probability, the rate at which the true grounding is missing from the set stays at or below a user-specified risk level.
Background
Biomedical entity grounding (also called entity linking or normalization) maps surface-form text mentions to canonical identifiers in ontologies such as MeSH, NCBIGene, Cellosaurus, NCBITaxon, and dbSNP. Current tools like Gilda return a ranked list of candidate groundings with associated scores, but they provide no statistical guarantee that the top-ranked candidate is correct.
RCPS-OG addresses this by wrapping the grounding candidate list in a risk-controlling prediction set. Given a calibration corpus of labeled mentions, the method finds the largest threshold on Gilda's score, and therefore the smallest prediction sets, such that the expected proportion of mentions for which the correct grounding is excluded from the set is bounded at a user-specified level α with confidence 1 − δ. This provides a principled way to trade off set size (precision) against coverage (recall), with theoretical guarantees even at finite sample sizes.
Experiments on the BioRed dataset demonstrate roughly a 20% reduction in model risk relative to a naïve threshold baseline, across multiple biomedical namespaces.
Installation
From source (recommended)
```bash
# Clone the repository
git clone https://github.com/buzgalbraith/rcps_og.git
cd rcps_og
# Install in editable mode (uv or pip)
uv pip install -e .
# or, with plain pip:
pip install -e .
```
Dependencies
Core dependencies (resolved automatically):
- gilda: biomedical entity scorer and candidate generator
- indra: entity grounding utilities
- numpy, pandas, scikit-learn
- tqdm: progress reporting
Gilda term resources
After installation, generate the Gilda term database (takes a few minutes):
```bash
python -m gilda.generate_terms
```
Dataset Setup
RCPS-OG is calibrated and evaluated on BioRed, a richly annotated biomedical named-entity and relation corpus from the NLM. Run the following two steps to prepare the calibration dataset:
```bash
# Step 1: pull the raw dataset from NIH
bash scripts/pull_BioRed.sh
# Step 2: extract into a tabular format for calibration
python scripts/load_BioRed.py
```
This produces a structured CSV of (mention, namespace, correct_id, gilda_candidates) rows that the calibration routines consume.
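Before calibrating, it can be worth confirming that the generated file has the expected columns. The sketch below is a hypothetical helper (`missing_columns` is not part of the package) and assumes the CSV has a header row whose names match the tuple above; how `gilda_candidates` is serialized inside each row is not specified here, so the check looks at the header only.

```python
import csv
import io

REQUIRED_COLUMNS = ("mention", "namespace", "correct_id", "gilda_candidates")


def missing_columns(csv_file) -> list:
    """Return the required column names absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Illustrative header only; a real file would come from scripts/load_BioRed.py.
sample = io.StringIO("mention,namespace,correct_id,gilda_candidates\n")
print(missing_columns(sample))  # an empty list means the header is complete
```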
RCPSPredictor
The main interface for producing risk-controlled prediction sets over Gilda's candidate groundings.
Wraps Gilda's scorer to produce prediction sets with a bounded false-exclusion rate.
| Parameter | Type | Description |
|---|---|---|
| alpha | float | Target risk level (false-exclusion rate). Default 0.1. |
| namespace | str \| None | Restrict groundings to a specific namespace (e.g. "mesh", "ncbigene"). None searches all. |
| delta | float | Confidence parameter for the RCPS bound. Default 0.05. |
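The roles of alpha and delta can be stated in one line, following the standard RCPS formulation. The symbols here are introduced for illustration and are not taken from the package: $C_{\lambda}(x)$ is the prediction set for mention $x$ at threshold $\lambda$, $y$ is the correct grounding, and $\hat{\lambda}$ is the calibrated threshold.

```latex
\[
  R(\lambda) = \mathbb{E}\bigl[\mathbf{1}\{\, y \notin C_{\lambda}(x) \,\}\bigr],
  \qquad
  \mathbb{P}\bigl( R(\hat{\lambda}) \le \alpha \bigr) \ge 1 - \delta .
\]
```

The probability is taken over the draw of the calibration set. With the defaults, the calibrated threshold has a false-exclusion rate of at most 0.1 for at least 95% of calibration draws.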
.calibrate fits the risk-controlling threshold on a calibration set of mentions and their ground-truth IDs.
| Parameter | Type | Description |
|---|---|---|
| mentions | list[str] | Free-text entity mention strings. |
| labels | list[str] | Corresponding ground-truth ontology IDs. |
.predict returns list[ScoredMatch]: the calibrated prediction set for a single mention, that is, all candidate groundings whose Gilda score is at or above the learned threshold.
The batched version of .predict returns list[list[ScoredMatch]], with one prediction set per input mention.
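The thresholding rule behind both methods is simple enough to show directly. The sketch below is an illustration only: `Candidate` is a hypothetical stand-in for a scored grounding (it is not Gilda's ScoredMatch), the function names are invented, and the scores and identifiers are made up.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a scored grounding (not Gilda's ScoredMatch)."""
    identifier: str
    score: float


def prediction_set(candidates, threshold):
    """Keep every candidate whose score is at or above the threshold."""
    return [c for c in candidates if c.score >= threshold]


def prediction_sets(batches, threshold):
    """Batched variant: one prediction set per list of candidates."""
    return [prediction_set(candidates, threshold) for candidates in batches]


# Made-up scores and placeholder identifiers, for illustration only.
ranked = [Candidate("ID:A", 0.78), Candidate("ID:B", 0.55), Candidate("ID:C", 0.31)]
kept = prediction_set(ranked, threshold=0.5)
print([c.identifier for c in kept])  # ['ID:A', 'ID:B']
```

A set can be empty when no candidate clears the threshold, and it can hold several candidates when the scorer cannot separate them; both outcomes are informative to a downstream consumer.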
GroundingCalibrator
Lower-level utility for computing RCPS thresholds from pre-computed Gilda scores, useful when integrating with other pipelines (e.g. INDRA/DGLink).
Computes the minimum score threshold λ satisfying the RCPS risk bound.
| Parameter | Type | Description |
|---|---|---|
| alpha | float | Risk tolerance. |
| delta | float | Confidence level for the bound. |
Computes the threshold λ from arrays of Gilda scores and binary correctness indicators. Returns the threshold value as a float.
| Parameter | Type | Description |
|---|---|---|
| scores | np.ndarray | Gilda scores for top candidate per mention. |
| correct_flags | np.ndarray[bool] | Whether the top candidate is the correct grounding. |
Quick Start
```python
import pandas as pd
from rcps_og import RCPSPredictor

# Load the BioRed calibration split produced by scripts/load_BioRed.py
df = pd.read_csv("data/biored_calibration.csv")
cal, test = df.iloc[:800], df.iloc[800:]

# Instantiate predictor at 10 % risk, restricted to MeSH groundings
predictor = RCPSPredictor(alpha=0.10, namespace="mesh")

# Calibrate on held-out calibration mentions
predictor.calibrate(
    mentions=cal["mention"].tolist(),
    labels=cal["correct_id"].tolist(),
)

# Produce a prediction set for a new mention
prediction_set = predictor.predict("BRCA1")
for match in prediction_set:
    print(match.term.id, match.score)

# Evaluate empirical risk on the test split
risk = predictor.evaluate_risk(
    mentions=test["mention"].tolist(),
    labels=test["correct_id"].tolist(),
)
print(f"Empirical risk: {risk:.3f}")
```
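The risk being evaluated is a false-exclusion rate: the share of mentions whose correct ID does not appear in the prediction set. A minimal sketch of that computation follows, assuming each prediction set has already been reduced to a collection of ID strings; `empirical_risk` is a hypothetical helper, not the package's `evaluate_risk`, and the identifiers are placeholders.

```python
def empirical_risk(prediction_sets, labels):
    """Fraction of mentions whose correct ID is missing from its prediction set."""
    if len(prediction_sets) != len(labels):
        raise ValueError("prediction_sets and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled mention is required")
    misses = sum(1 for ids, label in zip(prediction_sets, labels) if label not in ids)
    return misses / len(labels)


# Placeholder identifiers; each inner set holds the IDs kept for one mention.
sets = [{"ID:1", "ID:2"}, {"ID:3"}, set(), {"ID:4"}]
gold = ["ID:1", "ID:9", "ID:5", "ID:4"]
print(empirical_risk(sets, gold))  # 0.5
```

An empty prediction set always counts as a miss, which is why an overly aggressive threshold shows up directly in this number.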
Results
Empirical risk and average prediction set size on the BioRed test split at α = 0.10. RCPS-OG consistently controls risk below the target while shrinking set sizes relative to an uncalibrated baseline.
| Namespace | Baseline risk | RCPS-OG risk | Avg set size |
|---|---|---|---|
| mesh | 0.127 | 0.097 | 1.8 |
| ncbigene | 0.134 | 0.099 | 2.1 |
| cellosaurus | 0.118 | 0.091 | 1.6 |
| ncbitaxon | 0.141 | 0.096 | 2.4 |
| dbsnp | 0.129 | 0.094 | 1.5 |
All methods use Gilda as the underlying scorer. α = 0.10, δ = 0.05.
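The relative risk reduction per namespace can be read off the table with a short calculation. The figures below are copied from the table; the `relative_reduction` helper is introduced here for illustration.

```python
# (baseline risk, RCPS-OG risk) per namespace, copied from the table above.
results = {
    "mesh": (0.127, 0.097),
    "ncbigene": (0.134, 0.099),
    "cellosaurus": (0.118, 0.091),
    "ncbitaxon": (0.141, 0.096),
    "dbsnp": (0.129, 0.094),
}


def relative_reduction(baseline: float, calibrated: float) -> float:
    """Relative risk reduction: (baseline - calibrated) / baseline."""
    return (baseline - calibrated) / baseline


reductions = {ns: relative_reduction(b, c) for ns, (b, c) in results.items()}
for ns, value in reductions.items():
    print(f"{ns:12s} {value:.1%}")
```

On these figures the reduction ranges from about 23% (cellosaurus) to about 32% (ncbitaxon).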
Citation
If you use RCPS-OG in your research, please cite:
For support, please open an issue on GitHub or reach out via the Gyori Lab.