Text-to-Audio Models Generalization Leaderboard
Objective metrics: CE, CU, PC, PQ, and CLAP. Subjective metrics (reported as Crowd / Expert): MPC, MCE, MPQ, MAli, and MCU.

Model Name | CE | CU | PC | PQ | CLAP | MPC | MCE | MPQ | MAli | MCU
---|---|---|---|---|---|---|---|---|---|---
AudioGen | 2.91 | 4.69 | 3.12 | 5.42 | 0.34 | 3.23 / 3.07 | 3.55 / 1.36 | 5.44 / 2.86 | 5.95 / 3.64 | 4.52 / 2.31 |
AudioLDM | 3.51 | 5.40 | 3.42 | 5.92 | 0.42 | 4.27 / 2.79 | 4.67 / 2.99 | 5.82 / 6.16 | 5.81 / 6.70 | 5.29 / 3.87 |
AudioLDM 2 | 3.71 | 5.88 | 3.21 | 6.27 | 0.37 | 3.30 / 2.76 | 3.64 / 2.79 | 5.56 / 5.07 | 6.00 / 6.80 | 4.51 / 4.04 |
Auffusion | 3.52 | 5.55 | 3.15 | 5.98 | 0.38 | 3.07 / 2.70 | 3.73 / 3.56 | 5.39 / 5.66 | 6.29 / 7.01 | 4.79 / 4.76 |
MAGNeT | 3.12 | 4.52 | 3.85 | 5.25 | 0.37 | 3.18 / 3.22 | 3.58 / 2.09 | 4.87 / 3.40 | 5.45 / 4.83 | 3.79 / 3.30 |
Make-An-Audio | 3.40 | 5.69 | 3.03 | 5.94 | 0.33 | 3.52 / 2.81 | 3.41 / 2.95 | 5.64 / 5.87 | 5.27 / 6.50 | 4.47 / 3.64 |
Make-An-Audio 2 | 3.39 | 5.27 | 3.44 | 5.68 | 0.40 | 3.69 / 2.88 | 3.71 / 2.64 | 5.06 / 5.81 | 5.23 / 6.63 | 3.25 / 3.61 |
Stable Audio Open | 3.40 | 5.62 | 2.68 | 6.04 | 0.37 | 3.13 / 2.50 | 3.56 / 2.94 | 5.16 / 5.64 | 5.01 / 6.90 | 4.14 / 3.62 |
Tango | 3.26 | 5.40 | 3.53 | 6.07 | 0.37 | 3.26 / 2.64 | 3.62 / 3.04 | 4.88 / 5.85 | 4.73 / 6.94 | 4.01 / 3.93 |
Tango 2 | 3.60 | 5.42 | 4.28 | 6.06 | 0.39 | 3.17 / 3.11 | 3.53 / 3.99 | 4.89 / 6.27 | 5.39 / 7.56 | 4.04 / 4.86 |
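For reference, the CLAP column is generally computed as the similarity between embeddings of the text prompt and the generated audio from a CLAP model. Below is a minimal sketch using the Hugging Face `laion/clap-htsat-unfused` checkpoint; the exact checkpoint and preprocessing used by TTA-Bench are not specified here, and `generated.wav` and the prompt string are placeholders.

```python
# Hedged sketch of a text-audio CLAP similarity score.
# Assumes the "laion/clap-htsat-unfused" checkpoint; TTA-Bench's setup may differ.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# "generated.wav" is a placeholder for a model's output; CLAP expects 48 kHz audio.
audio, _ = librosa.load("generated.wav", sr=48000, mono=True)
prompt = "a dog barking while rain falls"  # placeholder text prompt

inputs = processor(
    text=[prompt], audios=[audio], sampling_rate=48000,
    return_tensors="pt", padding=True,
)
with torch.no_grad():
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    audio_emb = model.get_audio_features(input_features=inputs["input_features"])

# Cosine similarity between the projected embeddings serves as the CLAP score.
score = torch.nn.functional.cosine_similarity(text_emb, audio_emb).item()
print(f"CLAP score: {score:.2f}")
```

The subjective Crowd / Expert columns, by contrast, come from human ratings and cannot be reproduced by such a script.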
This leaderboard is based on the TTA-Bench evaluation framework, which comprehensively evaluates text-to-audio models across accuracy, generalization, efficiency, bias, fairness, toxicity, and robustness.
Higher scores indicate better performance across all metrics.
Last updated: May 5, 2025