Most AI vendors don't benchmark for reliability. A new benchmark from Princeton researchers measures it.