The post discusses the SEP Dataset and its test cases, as well as the poor performance of models tested in the paper. It also mentions the observation that stronger models may score lower.