Text-to-Audio Models Generalization Leaderboard

Objective metrics: CE, CU, PC, PQ, CLAP. Subjective metrics, each reported as Crowd / Expert: MPC, MCE, MPQ, MAli, MCU.

| Model | CE | CU | PC | PQ | CLAP | MPC | MCE | MPQ | MAli | MCU |
|---|---|---|---|---|---|---|---|---|---|---|
| AudioGen | 2.91 | 4.69 | 3.12 | 5.42 | 0.34 | 3.23 / 3.07 | 3.55 / 1.36 | 5.44 / 2.86 | 5.95 / 3.64 | 4.52 / 2.31 |
| AudioLDM | 3.51 | 5.40 | 3.42 | 5.92 | 0.42 | 4.27 / 2.79 | 4.67 / 2.99 | 5.82 / 6.16 | 5.81 / 6.70 | 5.29 / 3.87 |
| AudioLDM 2 | 3.71 | 5.88 | 3.21 | 6.27 | 0.37 | 3.30 / 2.76 | 3.64 / 2.79 | 5.56 / 5.07 | 6.00 / 6.80 | 4.51 / 4.04 |
| Auffusion | 3.52 | 5.55 | 3.15 | 5.98 | 0.38 | 3.07 / 2.70 | 3.73 / 3.56 | 5.39 / 5.66 | 6.29 / 7.01 | 4.79 / 4.76 |
| MAGNeT | 3.12 | 4.52 | 3.85 | 5.25 | 0.37 | 3.18 / 3.22 | 3.58 / 2.09 | 4.87 / 3.40 | 5.45 / 4.83 | 3.79 / 3.30 |
| Make-An-Audio | 3.40 | 5.69 | 3.03 | 5.94 | 0.33 | 3.52 / 2.81 | 3.41 / 2.95 | 5.64 / 5.87 | 5.27 / 6.50 | 4.47 / 3.64 |
| Make-An-Audio 2 | 3.39 | 5.27 | 3.44 | 5.68 | 0.40 | 3.69 / 2.88 | 3.71 / 2.64 | 5.06 / 5.81 | 5.23 / 6.63 | 3.25 / 3.61 |
| Stable Audio Open | 3.40 | 5.62 | 2.68 | 6.04 | 0.37 | 3.13 / 2.50 | 3.56 / 2.94 | 5.16 / 5.64 | 5.01 / 6.90 | 4.14 / 3.62 |
| Tango | 3.26 | 5.40 | 3.53 | 6.07 | 0.37 | 3.26 / 2.64 | 3.62 / 3.04 | 4.88 / 5.85 | 4.73 / 6.94 | 4.01 / 3.93 |
| Tango 2 | 3.60 | 5.42 | 4.28 | 6.06 | 0.39 | 3.17 / 3.11 | 3.53 / 3.99 | 4.89 / 6.27 | 5.39 / 7.56 | 4.04 / 4.86 |
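To compare Crowd and Expert judgments side by side, one straightforward reading of the table is to average the two raters' scores for a single column and rank models by that mean. The sketch below does this for the MPQ (Crowd / Expert) pairs copied from the leaderboard; the averaging step is our own illustration, not part of the TTA-Bench protocol.

```python
# MPQ scores from the leaderboard, as (crowd, expert) pairs.
mpq = {
    "AudioGen": (5.44, 2.86),
    "AudioLDM": (5.82, 6.16),
    "AudioLDM 2": (5.56, 5.07),
    "Auffusion": (5.39, 5.66),
    "MAGNeT": (4.87, 3.40),
    "Make-An-Audio": (5.64, 5.87),
    "Make-An-Audio 2": (5.06, 5.81),
    "Stable Audio Open": (5.16, 5.64),
    "Tango": (4.88, 5.85),
    "Tango 2": (4.89, 6.27),
}

# Rank models by the mean of crowd and expert MPQ, best first.
ranking = sorted(mpq, key=lambda m: -(mpq[m][0] + mpq[m][1]) / 2)
for model in ranking:
    crowd, expert = mpq[model]
    print(f"{model}: {(crowd + expert) / 2:.2f}")
```

Note that this simple mean hides rater disagreement: AudioGen's crowd MPQ (5.44) is far above its expert MPQ (2.86), so models with large crowd/expert gaps move sharply depending on which rater group you trust.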

This leaderboard is based on the TTA-Bench evaluation framework, which assesses models across accuracy, generalization, efficiency, bias, fairness, toxicity, and robustness.

Higher scores indicate better performance. The CLAP similarity score ranges from 0 to 1; the remaining objective and subjective metrics are reported on a 10-point scale.

Last updated: May 5, 2025