Text-to-Audio Models Accuracy Leaderboard

                   Objective                       Subjective (Crowd / Expert)
System             CE    CU    PC    PQ    CLAP    MPC          MCE          MPQ          MAli         MCU
AudioGen           2.89  4.54  3.18  5.33  0.39    3.54 / 2.88  3.18 / 1.93  4.82 / 4.35  5.08 / 5.40  3.64 / 3.20
AudioLDM           3.27  5.10  3.23  5.82  0.44    3.11 / 2.88  3.34 / 1.77  5.25 / 3.44  5.52 / 4.51  3.94 / 3.14
AudioLDM 2         3.48  5.54  3.00  6.09  0.40    3.31 / 2.80  3.87 / 3.64  5.29 / 6.84  5.06 / 7.51  4.63 / 4.50
Auffusion          3.32  5.11  3.23  5.72  0.45    3.62 / 2.90  4.25 / 3.71  5.56 / 6.76  5.61 / 7.59  4.94 / 4.57
MAGNeT             2.89  4.26  3.61  5.13  0.39    3.03 / 2.89  2.86 / 2.20  4.06 / 4.30  4.37 / 5.70  2.85 / 3.22
Make-An-Audio      3.28  5.33  3.08  5.78  0.38    3.55 / 3.05  4.28 / 2.51  5.47 / 5.77  5.27 / 6.83  4.46 / 3.89
Make-An-Audio 2    3.23  4.98  3.17  5.58  0.43    3.86 / 2.88  3.70 / 3.30  5.40 / 6.63  5.56 / 7.40  4.55 / 3.90
Stable Audio Open  3.05  5.02  2.74  5.63  0.35    2.73 / 2.41  2.90 / 2.34  4.51 / 4.91  4.20 / 5.99  3.56 / 3.19
Tango              3.27  5.15  3.39  5.96  0.44    4.20 / 3.24  4.72 / 3.35  6.00 / 6.49  5.81 / 6.81  5.20 / 4.45
Tango 2            3.47  5.20  3.84  5.89  0.46    4.14 / 3.15  4.73 / 3.35  6.01 / 6.63  5.94 / 7.59  5.21 / 4.77
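
For readers who want to sort or compare the rows above, the following is a minimal Python sketch that parses each row and ranks systems by a chosen column. It is not part of TTA-Bench: the names ROWS, TABLE, parse_row, and rank are illustrative, and the descending sort assumes that a higher score is better, which may not hold for every column.

```python
# Rows copied verbatim from the leaderboard above.
ROWS = """\
AudioGen 2.89 4.54 3.18 5.33 0.39 3.54 / 2.88 3.18 / 1.93 4.82 / 4.35 5.08 / 5.40 3.64 / 3.20
AudioLDM 3.27 5.10 3.23 5.82 0.44 3.11 / 2.88 3.34 / 1.77 5.25 / 3.44 5.52 / 4.51 3.94 / 3.14
AudioLDM 2 3.48 5.54 3.00 6.09 0.40 3.31 / 2.80 3.87 / 3.64 5.29 / 6.84 5.06 / 7.51 4.63 / 4.50
Auffusion 3.32 5.11 3.23 5.72 0.45 3.62 / 2.90 4.25 / 3.71 5.56 / 6.76 5.61 / 7.59 4.94 / 4.57
MAGNeT 2.89 4.26 3.61 5.13 0.39 3.03 / 2.89 2.86 / 2.20 4.06 / 4.30 4.37 / 5.70 2.85 / 3.22
Make-An-Audio 3.28 5.33 3.08 5.78 0.38 3.55 / 3.05 4.28 / 2.51 5.47 / 5.77 5.27 / 6.83 4.46 / 3.89
Make-An-Audio 2 3.23 4.98 3.17 5.58 0.43 3.86 / 2.88 3.70 / 3.30 5.40 / 6.63 5.56 / 7.40 4.55 / 3.90
Stable Audio Open 3.05 5.02 2.74 5.63 0.35 2.73 / 2.41 2.90 / 2.34 4.51 / 4.91 4.20 / 5.99 3.56 / 3.19
Tango 3.27 5.15 3.39 5.96 0.44 4.20 / 3.24 4.72 / 3.35 6.00 / 6.49 5.81 / 6.81 5.20 / 4.45
Tango 2 3.47 5.20 3.84 5.89 0.46 4.14 / 3.15 4.73 / 3.35 6.01 / 6.63 5.94 / 7.59 5.21 / 4.77
""".splitlines()

OBJECTIVE = ["CE", "CU", "PC", "PQ", "CLAP"]
SUBJECTIVE = ["MPC", "MCE", "MPQ", "MAli", "MCU"]


def parse_row(line):
    """Split one row into (system name, {metric: score})."""
    tokens = line.split()
    # The last 20 tokens are 5 objective scores plus 5 "crowd / expert" triples;
    # everything before them is the system name, which may contain spaces.
    name = " ".join(tokens[:-20])
    scores = {m: float(v) for m, v in zip(OBJECTIVE, tokens[-20:-15])}
    subjective = tokens[-15:]
    for i, metric in enumerate(SUBJECTIVE):
        crowd, _slash, expert = subjective[3 * i : 3 * i + 3]
        scores[metric] = {"crowd": float(crowd), "expert": float(expert)}
    return name, scores


TABLE = dict(parse_row(row) for row in ROWS)


def rank(metric, rater=None):
    """Return (system, score) pairs sorted from highest to lowest score.

    Assumption: higher is better. `rater` ("crowd" or "expert") is required
    for subjective metrics and ignored for objective ones.
    """
    def score(name):
        value = TABLE[name][metric]
        return value[rater] if isinstance(value, dict) else value

    return sorted(((n, score(n)) for n in TABLE), key=lambda p: p[1], reverse=True)


if __name__ == "__main__":
    print(rank("CLAP")[0])           # ('Tango 2', 0.46)
    print(rank("MPQ", "expert")[0])  # ('AudioLDM 2', 6.84)
```

Parsing from the right-hand side keeps multi-word system names such as "Stable Audio Open" intact without needing a delimiter in the source rows.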

This leaderboard is based on the TTA-Bench evaluation framework, which assesses text-to-audio models on accuracy, generalization, efficiency, bias, fairness, toxicity, and robustness. The table above reports the accuracy results.