Digital Pathology Foundation Models in 2025

What we mean by foundation models in this context are vision models used as encoders. They are predominantly vision transformers trained with DINOv2.

In computational pathology we typically have to break the whole slide into bite-sized pieces to work within GPU memory constraints. Preprocessing whole slides into fragments we call tiles, and then into latent vectors, or embeddings, is a foundational setup, especially when developing whole-slide predictors.
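To make the tile-then-embed setup concrete, here is a minimal numpy sketch. The "encoder" is just a random projection standing in for a frozen ViT backbone (e.g. a DINOv2 model); the tile size, embedding dimension, and function names are placeholders, not a real pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

TILE = 64        # tile side in pixels (real pipelines often use 224-512)
EMB_DIM = 128    # embedding dimension (a ViT-S would give 384)

def tile_slide(slide: np.ndarray, tile: int = TILE) -> np.ndarray:
    """Split an (H, W, 3) slide array into non-overlapping (tile, tile, 3) tiles."""
    h, w, _ = slide.shape
    rows, cols = h // tile, w // tile
    return (
        slide[: rows * tile, : cols * tile]
        .reshape(rows, tile, cols, tile, 3)
        .swapaxes(1, 2)
        .reshape(rows * cols, tile, tile, 3)
    )

# Stand-in encoder: flatten each tile and project to EMB_DIM.
proj = rng.standard_normal((TILE * TILE * 3, EMB_DIM)).astype(np.float32)

def encode(tiles: np.ndarray) -> np.ndarray:
    flat = tiles.reshape(len(tiles), -1).astype(np.float32) / 255.0
    return flat @ proj  # (n_tiles, EMB_DIM)

slide = rng.integers(0, 256, size=(256, 192, 3), dtype=np.uint8)
embeddings = encode(tile_slide(slide))
print(embeddings.shape)  # (12, 128): 4 rows x 3 cols of 64px tiles
```

In practice the embeddings are computed once and cached to disk, so downstream whole-slide predictors train against vectors rather than pixels.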

Insights

  • the differentiating factor for whole-slide classification via multiple instance learning (MIL) is the training data: FMs trained with large amounts of data, or with datasets of well-annotated text-image pairs, do well
  • models trained with text alignment in the latent space tend to do just as well as image-only models trained on larger data volumes
  • benchmarking only on tile-based tasks won't necessarily translate to whole-slide image classification tasks
  • MIL aggregation methods have plateaued
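For readers less familiar with the MIL setup the insights refer to: a slide is a "bag" of tile embeddings, and an attention module pools them into one slide-level vector. Below is a minimal numpy sketch in the style of attention-based MIL pooling; all weights are random placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)

EMB_DIM, ATTN_DIM, N_CLASSES = 128, 64, 2

# Random placeholder weights (a real model learns these).
V = rng.standard_normal((EMB_DIM, ATTN_DIM)) * 0.1   # attention projection
w = rng.standard_normal((ATTN_DIM, 1)) * 0.1         # attention scorer
W_cls = rng.standard_normal((EMB_DIM, N_CLASSES)) * 0.1

def softmax(x: np.ndarray, axis: int = 0) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mil_forward(bag: np.ndarray):
    """bag: (n_tiles, EMB_DIM) -> slide-level logits and per-tile attention."""
    a = softmax(np.tanh(bag @ V) @ w, axis=0)   # (n_tiles, 1), sums to 1
    slide_vec = (a * bag).sum(axis=0)           # attention-weighted pooling
    return slide_vec @ W_cls, a.ravel()

bag = rng.standard_normal((12, EMB_DIM))        # e.g. 12 tile embeddings
logits, attn = mil_forward(bag)
print(logits.shape, round(float(attn.sum()), 6))
```

The attention weights double as a crude heatmap over tiles, which is one reason MIL stayed the default aggregator even as its headline numbers flattened out.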

Comments on evaluations

  • if a task is saturating in the upper 0.9 range for accuracy and/or AUC, it might be better to limit the number of training labels (assuming the task is supervised) so that models can still be differentiated
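Sketching what that label-limiting might look like: re-run the probe with progressively smaller stratified label subsamples and report accuracy at each fraction. The toy Gaussian "embeddings" and the nearest-centroid probe below are illustrations only, not a recommended evaluation protocol.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: two overlapping Gaussian classes stand in for embeddings
# produced by a frozen encoder.
DIM, SHIFT, N = 16, 0.5, 500

def make_split(n: int):
    x = np.vstack([rng.standard_normal((n, DIM)),
                   rng.standard_normal((n, DIM)) + SHIFT])
    y = np.array([0] * n + [1] * n)
    return x, y

x_train, y_train = make_split(N)
x_test, y_test = make_split(200)

def probe_accuracy(frac: float) -> float:
    """Nearest-centroid probe fit on a stratified `frac` label subsample."""
    k = max(1, int(frac * N))
    # rows 0..N-1 are class 0, rows N..2N-1 are class 1 (by construction)
    idx = np.concatenate([rng.choice(N, size=k, replace=False),
                          rng.choice(N, size=k, replace=False) + N])
    xs, ys = x_train[idx], y_train[idx]
    centroids = np.stack([xs[ys == c].mean(axis=0) for c in (0, 1)])
    dists = ((x_test[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
    return float((dists.argmin(axis=1) == y_test).mean())

for frac in (1.0, 0.1, 0.02):
    print(f"label fraction {frac:>4}: acc = {probe_accuracy(frac):.3f}")
```

The interesting regime is the smallest fraction where one encoder's probe still holds up and another's collapses; that gap is what a saturated full-label benchmark hides.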

Looking forward

  • the new DINOv2 algorithm will give rise to a new generation of FMs
  • vision-language capabilities to provide better whole-slide and tile captions
  • The next breakthrough in digital pathology FMs won’t come from scale, but from better semantic alignment.

How do we overcome these limitations internally?

  • we need to operationalize our evaluation pipeline and automate triggering it during our SSL runs
  • experiment with different objectives in MIL
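One hypothetical shape for that automated trigger: a small callback that fires the benchmark suite every N training steps and logs the results. `run_benchmarks` here is a placeholder for a real suite of tile- and slide-level evaluations, not our actual pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalTrigger:
    """Fires a benchmark callable every `every_n_steps` SSL training steps."""
    every_n_steps: int
    benchmark: Callable[[int], dict]
    history: list = field(default_factory=list)

    def on_step(self, step: int) -> None:
        if step % self.every_n_steps == 0:
            self.history.append({"step": step, **self.benchmark(step)})

def run_benchmarks(step: int) -> dict:
    # Placeholder: a real version would launch tile-level probes and
    # MIL slide classifiers against the current checkpoint.
    return {"tile_auc": 0.90, "slide_auc": 0.85}

trigger = EvalTrigger(every_n_steps=1000, benchmark=run_benchmarks)
for step in range(1, 5001):        # stand-in for the SSL training loop
    trigger.on_step(step)

print(len(trigger.history))        # 5 evaluations, at steps 1000..5000
```

Keeping the trigger decoupled from the training loop (it only needs the step count and a checkpoint path) makes it easy to swap the benchmark suite without touching the SSL code.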