- Superhuman AI
- Posts
- Experts crowdsource ultimate AI test
Experts crowdsource ultimate AI test
ALSO: Use ChatGPT as a legal assistant
Read time: under 4 minutes
Welcome back, Superhuman
Can you think of a question so tricky it would stump today’s most powerful LLMs? With models like OpenAI o1 "destroying" existing benchmarks, researchers are turning to the public for help.
Today’s Insights
Building the world’s hardest LLM benchmark test
Frontier: OpenAI’s safety team shakeup
Tutorial: Turn ChatGPT into a legal assistant
Everything else you should know today
5 new AI tools to boost your productivity
AI-Generated Images: DIY websites
NEXT IN AI
AI benchmark experts set out to create trickiest test yet
AI benchmark designers are starting to run into an unexpected problem: The latest models now ace many of the tests we throw at them, making it all but impossible to figure out which ones excel at different tasks.
Now, AI experts from the Center for AI Safety (CAIS) and the training data startup Scale AI want to create a long-term solution: A test so tricky it could stump highly intelligent LLMs for years to come. And it’s asking the public for help coming up with the questions.
What led to the initiative?
For one, OpenAI’s new frontier model, o1, has “destroyed the most popular reasoning benchmarks,” CAIS Executive Director Dan Hendrycks wrote on X
Another factor is that as LLMs gobble up more and more data, it’s getting difficult to determine whether the models are actually reasoning through complex problems — or simply mimicking what they’ve already seen
That’s where “Humanity’s Last Exam” comes in: The idea is to gather the hardest problems imaginable, then compile them to build the world’s most challenging and comprehensive AI benchmark.
Rocket science, literally: CAIS is asking the public to “think of something you know that would stump current AI systems,” then formulate it into a question. Your question should be original, objective, and “difficult for non-experts,” but it can come from any field, including math, rocket engineering, and analytic philosophy.
PRESENTED BY AE STUDIO
Hire a world-class AI team
Trusted by leading startups and Fortune 500 companies
Building an AI product is hard. Engineers who understand AI are expensive and difficult to find. And there's no way of telling who's legit and who's not.
That's why companies around the world trust AE Studio. We help you craft and implement the optimal AI solution for your business with our team of world-class AI experts from Harvard, Stanford, and Princeton.
Our development, design, and data science teams work closely with founders and executives to create custom software and AI solutions that get the job done.
Book a free consultation session today. Get in touch here
FROM THE FRONTIER
Why OpenAI is revamping its safety team
Source: AP Photo
When the leaders of OpenAI’s superalignment efforts resigned earlier this year, the company's decision to completely scrap the project only fueled concerns about its safety promises. Now, it’s making some big changes that it hopes will help restore trust.
The details:
OpenAI created a new safety committee in May 2024, but CEO Sam Altman raised eyebrows when he announced he’d be among its new members
This week, the startup abandoned those plans and said it’s forming a new group
It’ll allegedly have more independence and power: Altman won’t be involved this time, and the team can even halt the release of future models it deems too dangerous
THE AI ACADEMY
How to turn ChatGPT into your legal assistant
ChatGPT can be used as your legal assistant to summarize legal documents and identify potential concerns. Here’s how:
Go to ChatGPT and make sure you select GPT-4o as your model.
Upload your legal document and ask it to summarize the terms and conditions for you.
You can also ask ChatGPT to identify potential concerns in the document.
Ask it to give you clarification If the initial results are not satisfactory.
Responses should not be used as a substitute for professional legal advice.
PRESENTED BY VANTA
Does your AI have the latest security compliance?
AI is everywhere, but customers need to know that you're using it safely and responsibly. The ISO 42001 standard helps companies demonstrate their AI security practices in a verifiable way.
Join Vanta and A-LIGN for a Coffee and Compliance session on ISO 42001—what it is, what types of organizations need it, and how it works. The discussion also covers practical strategies and best practices for successfully integrating ISO 42001 into your organization.
AI & TECH NEWS
Everything else you need to know today
Developers will soon be able to access Luma AI’s Dream Machine API. Source: Luma AI
Quota Boost: OpenAI is increasing rate limits for o1-mini to 50 messages per day for all Plus and Team users.
Office Assistant: Generative features are coming to Microsoft Excel and PowerPoint, while Copilot is getting AI agents that can carry out work-related tasks on their own.
Silicon Spinoff: In a major pivot, Intel announced it’s spinning off its chip foundry, which will start producing custom AI processors for Amazon Web Services.
Tit for Tat: Hours after Runway announced it was debuting an API for its text-to-video model, San Francisco-based rival Luma AI revealed it would do the same for its popular Dream Machine video generator.
👔 Expert Advice: A new Harvard Business School study suggests that seeking business advice from AI leads to vastly different results, depending on the health of your company. Firms that are already doing well see a 15% performance boost after taking advice from a generative model. On the other hand, struggling companies experience a 10% decline when they implement AI-generated tips.
🧠 Brain Food: Scientists used AI to accurately predict the visual brain patterns of fruit flies — down to the level of individual neurons. The model is so good that it’s now able to simulate how the flies will react to different experimental conditions virtually, eliminating the need to carry out the tests in real life.
PRODUCTIVITY
5 AI Tools to Supercharge Your Productivity
✅ ConsoleX: The ultimate workbench for crafting your own AI innovations.
✅ AnyParser Sandbox: Quickly and accurately extract content from PDFs, PPTs, and images with AI.
✅ Spinach AI*: Your meeting copilot - takes accurate notes in 100 languages, and captures action items in Monday, Asana, Jira, or ClickUp. Try it here.
✅ Resemble AI: Highlight, type, and hear your changes instantly with an AI-powered audio editor.
✅ Velvet: Log requests, add caching, and run experiments with an AI gateway for engineers.
* indicates a promoted tool, if any
PROMPT OF THE DAY
Web Design
Prompt: Design a responsive navigation menu for a multi-service platform. Ensure the menu accommodates [services ranging from online courses to consulting], [a user account section], [a search function], and [dynamic display depending on user membership level].
Source: Weam
AI-GENERATED IMAGES
DIY Websites
Source: @dan_rocket on Midjourney
Midjourney Prompt: Create a modern, stylish design concept for e-commerce website for a random brand. Include a clean layout with an random color palette. Incorporate an engaging hero screen with interactive elements. Use playful yet easy-to-read typography.
--no people --ar 9:16 --v 6.1 --stylize 250
Acquire new customers and drive revenue by partnering with us
Superhuman is the world’s biggest AI newsletter for businesses and professionals with 600,000+ readers and 1.5 Million followers on socials working at the world’s leading startups and enterprises. Companies like Amazon, Hubspot, and Salesforce feature their products in Superhuman. You can learn more about partnering with us here.
🧞Your wish is my command
What did you think of today's email?Your feedback helps me create better emails for you! |
Got more feedback or just want to get in touch? Reply to this email and we’ll get back to you.
Thanks for reading.
Until next time!
Zain & the Superhuman AI team