Discord's data team shares how their Default Metric List grew uncontrollably over time, creating statistical problems. Having too many experiment metrics forces a tradeoff: unadjusted p-values produce false positives, while multiple hypothesis corrections reduce the ability to detect real changes. Their conclusion is that fewer, higher-quality metrics capturing distinct concepts outperform large metric sets, and they used simulations and Principal Component Analysis to find the right balance.
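The tradeoff can be made concrete with a little arithmetic. The sketch below is illustrative only and is not Discord's method: it assumes every metric is independent and has no true effect, and it uses the Bonferroni correction as a stand-in for "multiple hypothesis corrections", since the summary does not say which correction was applied. The function names are hypothetical.

```python
def family_wise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across `num_metrics` tests.

    Assumes the metrics are independent and none has a real effect,
    which is a simplification; real experiment metrics are often correlated.
    """
    return 1.0 - (1.0 - alpha) ** num_metrics


def bonferroni_threshold(num_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric significance threshold under the Bonferroni correction."""
    return alpha / num_metrics


if __name__ == "__main__":
    for m in (1, 10, 50, 100):
        fwer = family_wise_error_rate(m)
        threshold = bonferroni_threshold(m)
        print(f"{m:>3} metrics: unadjusted FWER = {fwer:.3f}, "
              f"Bonferroni per-metric threshold = {threshold:.5f}")
```

As the metric count grows, the unadjusted false-positive rate climbs toward 1, while the corrected per-metric threshold shrinks, so each individual metric needs a larger effect or more data to reach significance. Because Bonferroni is conservative when metrics are correlated, trimming redundant metrics reduces that penalty.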