AI quantifies 'penetrance' to clarify what rare DNA variants mean for health
When a clinical genetic test returns a rare DNA change, clinicians and patients often face uncertainty: will that variant actually cause disease? Researchers at the Icahn School of Medicine at Mount Sinai have developed a machine-learning tool that uses routine laboratory results and more than one million electronic health records (EHRs) to place genetic risk on a continuous scale. Published online in Science (August 28, 2025) and reported by Mount Sinai on August 30, 2025, the approach produces an "ML penetrance" score from 0 to 1 that reflects the probability a person with a specific variant will develop a related condition.
The system integrates common clinical measures—cholesterol, blood counts, markers of kidney function and more—with diagnosis data to model ten well-characterized diseases. Instead of a binary affected/unaffected label, the AI estimates disease severity and risk as gradual outcomes, better matching how conditions such as hypertension, diabetes, and many cancers manifest in real-world care.
Scientific background and why penetrance matters
In genetics, penetrance refers to the proportion of individuals carrying a particular variant who actually express the associated disease. Traditional variant classification often relies on case reports, family studies, or small cohorts and yields discrete categories such as 'pathogenic', 'benign', or 'variant of uncertain significance' (VUS). Those labels can be misleading: some variants classified as 'pathogenic' show limited impact in broad populations, and many VUS remain unresolved.
Machine learning can leverage continuous clinical signals already present in health records to estimate penetrance more directly. By training models to predict quantitative and diagnostic outcomes from lab trends and coded EHR events, the Mount Sinai team converted diverse clinical data into a probabilistic risk metric for over 1,600 rare variants. A score near 1 suggests high ML-estimated penetrance; a score near 0 implies minimal population-level impact.
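To make the distinction concrete, here is a minimal, hypothetical sketch (not the study's code) contrasting classical penetrance, the fraction of carriers with a recorded diagnosis, with a continuous ML-style score that averages a model's per-carrier risk estimates; the toy cohort and scores are invented for illustration:

```python
# Hypothetical illustration: classical penetrance counts binary diagnoses,
# while an ML-style score averages continuous per-carrier risk estimates.

def classical_penetrance(carriers):
    """Fraction of carriers with a recorded diagnosis (binary outcome)."""
    return sum(c["diagnosed"] for c in carriers) / len(carriers)

def ml_penetrance(carriers):
    """Mean model-predicted disease probability across carriers (0 to 1)."""
    return sum(c["risk_score"] for c in carriers) / len(carriers)

# Toy cohort of five carriers of one rare variant (invented numbers).
carriers = [
    {"diagnosed": True,  "risk_score": 0.92},
    {"diagnosed": False, "risk_score": 0.71},  # subclinical signal in labs
    {"diagnosed": False, "risk_score": 0.10},
    {"diagnosed": True,  "risk_score": 0.88},
    {"diagnosed": False, "risk_score": 0.15},
]

print(classical_penetrance(carriers))  # 0.4 (2 of 5 diagnosed)
print(ml_penetrance(carriers))         # ~0.55 (continuous estimate)
```

The second carrier illustrates why the continuous score can be informative: no diagnosis is coded, but the lab-derived risk estimate is high, a signal a binary affected/unaffected label would discard.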

Methods, dataset and model design
The researchers used more than one million de-identified EHRs aggregated at Mount Sinai to build disease-specific models for ten common conditions. Input features included longitudinal lab values (lipid panels, creatinine, complete blood counts), vital signs, and diagnostic codes. Models were trained to represent disease on a spectrum—capturing gradations in disease markers and clinical severity rather than a single diagnostic label.
After training, the team applied these disease models to cohorts of individuals known to carry rare coding variants. For each variant, the system computed an "ML penetrance" score based on how the carriers' routine clinical data matched patterns associated with the disease. The investigators evaluated more than 1,600 variants and examined concordance with existing clinical annotations.
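The two-stage workflow described above can be sketched as follows. This is an illustrative toy, not the published model: the logistic weights, the single LDL feature, and the variant identifiers are all assumptions chosen to show the shape of the pipeline, in which stage 1 scores each carrier's routine labs against a disease pattern and stage 2 aggregates carrier scores per variant into an ML penetrance value:

```python
# Hypothetical two-stage sketch: per-patient disease scoring from labs,
# then per-variant aggregation into an "ML penetrance" value in [0, 1].

import math
from collections import defaultdict

def disease_score(ldl_mg_dl, weight=0.04, bias=-7.0):
    """Stage 1: toy logistic model mapping an LDL cholesterol value to a
    0-1 probability of a hypercholesterolemia-like phenotype.
    Weights are invented for illustration, not fitted to real data."""
    return 1.0 / (1.0 + math.exp(-(weight * ldl_mg_dl + bias)))

def ml_penetrance_by_variant(carriers):
    """Stage 2: average the per-carrier disease scores for each variant."""
    scores = defaultdict(list)
    for variant_id, ldl in carriers:
        scores[variant_id].append(disease_score(ldl))
    return {v: sum(s) / len(s) for v, s in scores.items()}

# Toy carriers: (hypothetical variant label, latest LDL in mg/dL).
carriers = [
    ("LDLR:c.100A>T", 240), ("LDLR:c.100A>T", 255),  # strong lab signal
    ("LDLR:c.901G>C", 115), ("LDLR:c.901G>C", 130),  # near-normal labs
]

for variant, score in ml_penetrance_by_variant(carriers).items():
    print(f"{variant}: ML penetrance ~ {score:.2f}")
```

In this sketch the first variant's carriers show markedly elevated LDL and receive a high score, while the second variant's carriers look clinically unremarkable and score low, mirroring how the study separates high- from low-penetrance variants using routine clinical data.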
Validation and surprising findings
Results revealed notable reclassifications: some variants labeled as 'uncertain' showed clear signals of elevated risk in the EHR-based models, while some variants historically considered disease-causing had negligible ML penetrance. These real-world discrepancies highlight how population-scale clinical data can refine or challenge prior variant interpretations derived from smaller or more selected datasets.
Ron Do, PhD, senior study author and Charles Bronfman Professor in Personalized Medicine at Mount Sinai, summarized the team’s intent: "By using artificial intelligence and real-world lab data that are already part of most medical records, we can better estimate how likely disease will develop in an individual with a specific genetic variant. It's a much more nuanced, scalable, and accessible way to support precision medicine." Lead author Iain S. Forrest, MD, PhD, added that scores could help triage care: a high ML penetrance for a Lynch syndrome–related variant might prompt earlier cancer screening, while a low score could reduce unnecessary interventions.
Clinical implications, limitations and future directions
Potential clinical uses include prioritizing variants for follow-up, guiding surveillance strategies, and improving genetic counseling by conveying risk as a probabilistic score rather than a detached label. However, the authors and independent experts caution that ML penetrance is an adjunct, not a replacement for detailed clinical assessment, family history, and functional studies.
Key limitations: the current model reflects the demographics and care patterns of the source EHR population; underrepresented ancestries and rare variant contexts will require broader, multi-center data for equitable performance. Prospective validation is also necessary—do individuals with high ML penetrance truly develop disease at expected rates over time, and can early interventions change that trajectory?
The Mount Sinai team is expanding the framework to more diseases, additional variant types, and more diverse cohorts, while planning longitudinal tracking to measure predictive accuracy and clinical benefit in real-world settings.
Expert Insight
Dr. Elena Marquez, a clinical geneticist (fictional) with experience in precision medicine, comments: "This approach represents a pragmatic advance in variant interpretation. Many labs struggle with VUS management; using EHR-derived signals gives us population-level context that can inform conversations with patients. That said, integration into clinical workflows will require clear standards, prospective validation, and careful communication so providers and families do not over-interpret a single score."
Related technologies and broader prospects
The ML penetrance concept sits at the intersection of several trends: federated EHR analytics, explainable AI for healthcare, and large-scale genotype-phenotype mapping. When combined with functional assays, family segregation studies, and global population sequencing, EHR-informed penetrance scores could accelerate variant reclassification, reduce uncertainty in genetic reports, and support targeted prevention strategies.
Ethical and operational challenges remain—data privacy, algorithmic bias, and the need for transparent score reporting are essential considerations before routine clinical deployment.
Conclusion
Mount Sinai’s machine-learning penetrance model demonstrates how routine clinical data can sharpen our understanding of which rare genetic variants truly influence disease risk. By transforming millions of lab values and EHR events into probabilistic scores, the tool moves variant interpretation from categorical labels to a quantitative spectrum. With further validation, expansion to diverse populations, and careful clinical integration, ML-derived penetrance scores could become a practical resource for genetic counseling, risk stratification, and precision prevention.