Rethinking the way we test AI models
ALSO: How to use Perplexity's focus feature
Read time: under 4 minutes
Welcome back, Superhuman
At the AI Olympics, things aren’t what they seem. Some companies say the benchmarks used to measure each LLM’s performance are too simplistic, leading to inaccurate results. Today, we’ll explore how one AI firm plans to fix the problem.
Today’s Insights
Apple’s long-term AI game plan
Tutorial: How to use Perplexity’s focus feature
Finding a better way to measure AI models’ performance
5 new AI tools to boost your productivity
Everything else you should know today
AI-Generated Images: Neon Panthers
NEXT IN AI
An inside look at Apple’s long-term AI plans
Source: Apple
At last month’s WWDC, Apple showed off all of the AI features coming to iOS this year. But Bloomberg just got an inside scoop into the company’s long game.
Here’s what you can expect:
While Apple Intelligence will be free at the outset, the iPhone maker wants to eventually roll out a paid subscription for more advanced AI features, like unlimited cloud services
We already know ChatGPT is getting iOS integration, but Apple is also negotiating with Alphabet and Anthropic to potentially incorporate their models into its AI offerings; expect at least one major partnership announcement sometime this fall
Meta’s Llama models, meanwhile, are no longer in the running because Apple wasn’t impressed with their performance, sources told Bloomberg
Apple is working on adapting iPhone and MacBook AI features to Apple Vision Pro, the company’s futuristic-but-pricey augmented reality headset
What’s in it for Apple? Devices are now made of titanium and other sturdy materials, and there are more repair options than ever. That means it’s no longer necessary to replace your device as soon as something goes wrong. Apple thinks AI could be the thing that finally gets people excited about upgrading to the latest model: It plans to sell about 10 million more units of the iPhone 16 compared to the previous generation.
Can you make it until September? That’s when the next iteration of the iPhone is slated for release. If you upgrade any sooner, you risk missing out on the device’s new AI features. That’s because they’ll only be available on the iPhone 15 Pro and later models. And while Macs from the past seven years can upgrade to the next version of macOS, certain features will be limited to the latest models.
PRESENTED BY GUIDDE
Turn Your Expertise Into How-To Guides With AI
Are you continuously re-explaining processes to remote teams, new hires, or even your boss?
Just use Guidde, the AI tool that turns complex tasks into stunning visual guides and training videos in seconds.
Click ‘capture’ on the no-cost browser extension
Guidde automatically generates step-by-step video guides
Edit + embed your guide anywhere your team is
Create stunning visual guides 11x faster with Guidde for zero cost!
AI AT WORK
How to search using Perplexity’s focus feature
Go to Perplexity’s website and log in with your email.
Click on the ‘Focus’ option under the search bar.
Choose a category from the options (i.e. All, Academic, Writing, Wolfram|Alpha, YouTube, and Reddit).
Write your search prompt and press enter. You’ll get your category-focused search results within seconds.
You can change and adjust the category as required for different search results.
PROMPT OF THE DAY
July 4th BBQ
Prompt: You are hosting a 4th of July BBQ for 25 people. 16 of those are adults, many of whom are foodies, but there should be kid-friendly options as well. Create a menu for the BBQ including appetizers, sides and desserts. Include some items that can be made or prepped ahead of time.
Follow-up prompt: If the people are arriving at 2PM and we want to serve the main dinner at 5PM, create a timeline and schedule for food preparation and cooking
You can adapt the prompt to your specific needs.
Source: Donna Botti, Delos Inc
PRESENTED BY TELY
Save $200 On Automated B2B Content Marketing
Need quality B2B content marketing without expanding payroll?
Tely AI researches your market, learns your product, and creates expert SEO content–all on its own.
Get indexed in 2 weeks by Google
Reach 15,000 visits in 12 months
Get 100 high-quality articles every month
Hire Tely AI and save $200 today.
AI & BENCHMARKS
Anthropic wants to help build better AI benchmarks
Source: Anthropic
Imagine competing in this summer’s Olympic long jump only for each judge to declare a different winner. And, to make things even more confusing, what if there was no authority to step in and announce the official gold medalist?
That’s how some AI experts say they feel about today’s LLM benchmarks: Because there’s no standardized yardstick, each firm can point to whichever tests show it out front. Besides, models can be trained to ace highly specific tasks, but that’s not always indicative of their overall capabilities. It’d be like memorizing the answers to an exam instead of truly grasping the subject matter.
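To see how every firm can claim first place at once, here is a minimal Python sketch. The model names, test names, and scores are all made up for illustration and do not come from any real benchmark.

```python
# Hypothetical scores (0-100) for three made-up models on three made-up tests.
# None of these numbers come from a real benchmark.
scores = {
    "model_a": {"math_test": 91, "coding_test": 74, "reading_test": 80},
    "model_b": {"math_test": 85, "coding_test": 88, "reading_test": 78},
    "model_c": {"math_test": 79, "coding_test": 81, "reading_test": 90},
}


def winner_on(test_name):
    """Return the model with the highest score on a single test."""
    return max(scores, key=lambda model: scores[model][test_name])


def winner_overall():
    """Return the model with the highest average score across all tests."""
    return max(
        scores,
        key=lambda model: sum(scores[model].values()) / len(scores[model]),
    )


# Each test, taken alone, crowns a different model.
for test in ["math_test", "coding_test", "reading_test"]:
    print(f"{test}: {winner_on(test)} comes out on top")

# Averaging all three gives a single answer.
print(f"average of all three: {winner_overall()}")
```

With these invented numbers, each model wins exactly one test, so each maker could truthfully advertise a first-place finish. Only a shared yardstick (here, a simple average, chosen purely for illustration) settles on one overall leader.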
This week, Anthropic announced a new funding initiative to support benchmarks that do a better job of assessing models’ overall capabilities.
The plan:
Anthropic says it will pay third-party groups that can prove they have a consistent way to measure each model’s performance
The new tests will be much harder to pass, like moving on to a college-level exam after getting an A+ on the high school-level one
The company wants the tests to focus on two things: practicality (making sure models are actually useful for everyday tasks) and safety (weeding out models that are easy to manipulate and jailbreak)
Future benchmarks might ask thousands of users to perform a particular task in order to get a better picture of how a model deals with real-world problems
Why it’s important: With a better window into how each model performs, firms will be able to refine their models with a higher level of precision — and in turn, users will get a better picture of each LLM’s strengths and weaknesses.
PRODUCTIVITY
5 AI Tools to Supercharge Your Productivity
✅ Nylas: An API for email, calendar, and contacts that saves engineers time so they can build secure and engaging experiences their customers love.
✅ Spoken AI: An AI model designed to accurately translate over 300 languages and dialects to a native level.
✅ Firebender: Find early adopters and new startup leads via an AI-powered database.
✅ Study with GPT: A full-stack mentor that tailors AI tutorials specifically for your needs.
✅ Briefy: Turn all kinds of lengthy content into concise, structured summaries and save them in your knowledge base for later review.
PS: Want more? Check out our Top 100 AI Tools.
* indicates a promoted tool, if any
AI & TECH NEWS
Everything else you need to know today
Source: Meta
False Flags: Meta is adjusting its AI labeling process after photographers noticed that the company’s detection software was flagging real photos — including those that had been only minimally edited or cropped.
Common Ground: In a rare point of agreement, the US and China both backed a UN resolution that will make AI technologies more accessible to developing nations.
No Free Lunch: Runway’s powerful text-to-video platform, Gen-3 Alpha, is now available to all users, although you’ll need to purchase credits or get a $12/month subscription to use it.
Trust Busters: Regulators in France are reportedly getting ready to charge Nvidia with antitrust violations. It would be the first enforcement action against the world’s largest chipmaker.
🧠 Brain Food: Researchers at the Chinese Academy of Sciences taught rhesus monkeys how to play Pac-Man by giving them treats each time they beat the game. That’s not even the strangest part: Next, they trained an AI model on the monkeys’ eye movements. The model was eventually able to predict the monkeys’ strategy with about 88% accuracy — suggesting it could problem-solve and “think” just like a mammal.
AI-GENERATED IMAGES
Neon Panthers
Source: liling_090123 on Midjourney
Prompt: Minimalism, bright colors, futuristic colorful waves, a stylish woman with sleek straight hair sitting on the ground, a giant black panther lying on her lap, pine branches, 3D, pink, green and yellow color scheme, fashion, ultra high resolution rendering, super clear, high quality, contemporary art, 32K HD
--style raw --ar 3:4 --stylize 0
Acquire new customers and drive revenue by partnering with us
Superhuman is the world’s biggest AI newsletter for businesses and professionals with 600,000+ readers working at the world’s leading startups and enterprises. Companies like Amazon, Hubspot, and Salesforce feature their products in Superhuman. You can learn more about partnering with us here.