This study surveys techniques used in interpretability research on Transformer-based language models, including input attribution methods and methods for decoding the information encoded in model internals. It emphasizes that understanding a model's inner workings is important for safety, fairness, and mitigating biases.