Ranking

Important Note. Due to limitations of the Grand Challenge platform, the live leaderboard currently ignores the RadFact Logical scores. The ranking on the final test set will be computed offline.

The ranking schema is identical for both tasks and is divided into two phases.

Phase 1 - Automatic Ranking

For each case, all the selected metrics are computed: RadFact Logical Precision, RadFact Logical Recall (clinical/primary metrics), BLEU-4, and METEOR (captioning/secondary metrics).
For the clinical metrics, a single RadFact Logical F1-score is computed as the harmonic mean of RadFact Logical Precision and RadFact Logical Recall. This is the final Clinical score.
For the captioning metrics, BLEU-4 and METEOR are averaged to obtain a single Captioning score.
These two scores are combined using a weighted average to obtain the final score: Final Score = 0.8 × Clinical Score + 0.2 × Captioning Score.
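
The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the organizers' evaluation code: the function names are hypothetical, all four metrics are assumed to lie in [0, 1], and the F1-score is assumed to be 0 when both precision and recall are 0.

```python
def harmonic_mean(precision: float, recall: float) -> float:
    """F1-style harmonic mean of precision and recall.

    Assumption: returns 0.0 when both inputs are 0 (avoids division by zero).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def final_score(precision: float, recall: float,
                bleu4: float, meteor: float) -> float:
    """Combine the clinical and captioning scores with the 0.8 / 0.2 weights."""
    # Clinical score: RadFact Logical F1.
    clinical = harmonic_mean(precision, recall)
    # Captioning score: plain average of BLEU-4 and METEOR.
    captioning = (bleu4 + meteor) / 2
    return 0.8 * clinical + 0.2 * captioning


# Illustrative (made-up) metric values for a single case.
score = final_score(precision=0.6, recall=0.4, bleu4=0.2, meteor=0.3)
print(round(score, 3))  # 0.434
```

With these made-up values the Clinical score is 0.48 and the Captioning score is 0.25, so the final score is 0.8 × 0.48 + 0.2 × 0.25 = 0.434.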

Phase 1 scores are ranked from highest (rank 1) to lowest.
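
As a small sketch of this ordering, the snippet below ranks a set of final scores from highest to lowest. The method names and score values are made up, and breaking ties alphabetically by method name is an assumption added only to keep the output deterministic; the text above does not specify a tie-breaking rule.

```python
def rank_methods(final_scores: dict[str, float]) -> list[tuple[int, str, float]]:
    """Return (rank, method, score) tuples, with rank 1 for the highest score.

    Assumption: ties are broken alphabetically by method name.
    """
    ordered = sorted(final_scores.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, name, score)
            for rank, (name, score) in enumerate(ordered, start=1)]


# Hypothetical Phase 1 final scores.
scores = {"method_a": 0.41, "method_b": 0.73, "method_c": 0.12}
for rank, name, score in rank_methods(scores):
    print(rank, name, score)
```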

Phase 2 - Expert "Arena" Ranking (Manual Clinical Evaluation)

The top-performing methods (up to 7, selected by overall performance across the Phase 1 metrics) will undergo manual evaluation by maxillofacial surgeons to determine the final ranking. More specifically:

An LMArena-style interface (https://lmarena.ai/it) will present each case to a single clinician together with two anonymized reports (A vs B) generated by two different methods. Left/right order and pairings are randomized;
For each matchup, the surgeon selects A, B, or Tie/Unsure;
Methods receive Elo-style (§) rating updates after each matchup;
The final Phase 2 ranking is based on the resulting Arena (Elo) scores;
Matchups repeat until a pre-defined minimum number of comparisons is reached and the rankings stabilize (i.e., negligible score change over the last window).

(§) The Elo rating system is a method for estimating a player's relative skill level: it predicts game outcomes from the rating difference between opponents and adjusts each rating after a win, loss, or draw.
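
To make the rating update concrete, here is a minimal sketch of the classic Elo formula applied to one A-vs-B matchup. It is an illustration under stated assumptions, not the challenge's actual implementation: the K-factor of 32, the starting rating of 1000, the 400-point scale, and the treatment of "Tie/Unsure" as a draw worth 0.5 points are all assumptions, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, outcome: str,
               k: float = 32.0) -> tuple[float, float]:
    """Return the updated (rating_a, rating_b) after a single matchup.

    outcome is "A", "B", or "tie"; a tie is assumed to count as 0.5 for each.
    """
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[outcome]
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two methods starting from the same (assumed) rating; the surgeon prefers A.
print(update_elo(1000.0, 1000.0, "A"))  # (1016.0, 984.0)
```

Because the two updates are equal and opposite, the sum of the two ratings is preserved after every matchup, and a win against a higher-rated method moves the ratings more than a win against a lower-rated one.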

Handling Missing Results

Missing results are treated as an empty output, i.e., a report with zero characters. The employed metrics support empty inputs, and such a report receives a final score of zero.
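
A minimal sketch of this substitution is shown below. The function name, the dictionary layout, and the case identifiers are hypothetical; only the rule itself (a missing result becomes an empty report) comes from the text above.

```python
def fill_missing_reports(predictions: dict[str, str],
                         case_ids: list[str]) -> dict[str, str]:
    """Return one report per expected case, using "" where none was submitted."""
    # `or ""` also maps an explicit None entry to the empty report.
    return {case_id: predictions.get(case_id) or "" for case_id in case_ids}


# Hypothetical submission that lacks a report for case_002.
submitted = {"case_001": "No abnormality detected."}
reports = fill_missing_reports(submitted, ["case_001", "case_002"])
print(reports["case_002"] == "")  # True
```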

Justification of the Ranking Scheme

The ranking scheme is designed to favor clinically reliable methods, which is why the clinical metrics receive the higher weight in the final score. The captioning metrics, which measure readability, structure, and conformity to reporting conventions, give an additional edge to methods that already score well on the clinical metrics and can help separate reports with similar clinical accuracy.