MIT researchers tested GPT-4, Claude 3 Opus, and Llama 3 using the TruthfulQA and SciQ datasets, prepending user biographies that varied education level, English proficiency, and country of origin. Results show that all three models deliver less accurate and less truthful responses to users with lower formal education or non-native English proficiency.