Rethinking the way we test AI models

ALSO: How to use Perplexity's focus feature

Read time: under 4 minutes

Welcome back, Superhuman

At the AI Olympics, things aren’t what they seem. Some companies say the benchmarks used to measure each LLM’s performance are too simplistic, leading to inaccurate results. Today, we’ll explore how one AI firm plans to fix the problem.

Today’s Insights

  • Apple’s long-term AI game plan

  • Tutorial: How to use Perplexity’s focus feature

  • Finding a better way to measure AI models’ performance

  • 5 new AI tools to boost your productivity

  • Everything else you should know today

  • AI-Generated Images: Neon Panthers

NEXT IN AI

An inside look at Apple’s long-term AI plans

Source: Apple

At last month’s WWDC, Apple showed off all of the AI features coming to iOS this year. But Bloomberg just got an inside scoop into the company’s long game. 

Here’s what you can expect:

  • While Apple Intelligence will be free at the outset, the iPhone maker wants to eventually roll out a paid subscription for more advanced AI features, like unlimited cloud services

  • We already know ChatGPT is getting iOS integration, but Apple is also negotiating with Alphabet and Anthropic to potentially incorporate their models into its AI offerings; expect at least one major partnership announcement sometime this fall

  • Meta’s Llama models, meanwhile, are no longer in the running because Apple wasn’t impressed with their performance, sources told Bloomberg

  • Apple is working on adapting iPhone and MacBook AI features to Apple Vision Pro, the company’s futuristic-but-pricey augmented reality headset

What’s in it for Apple? Devices are now made of titanium and other sturdy materials, and there are more repair options than ever. That means it’s no longer necessary to replace your device as soon as something goes wrong. Apple thinks AI could be the thing that finally gets people excited about upgrading to the latest model: It plans to sell about 10 million more units of the iPhone 16 than it did of the previous generation.

Can you make it until September? That’s when the next iteration of the iPhone is slated for release. If you upgrade any sooner, you risk missing out on the device’s new AI features. That’s because they’ll only be available on the iPhone 15 Pro and later models. And while Macs from the past seven years can upgrade to the next version of macOS, certain features will be limited to the latest models.

PRESENTED BY GUIDDE

Turn Your Expertise Into How-To Guides With AI

Are you continuously re-explaining processes to remote teams, new hires, or even your boss?

Just use Guidde, the AI tool that turns complex tasks into stunning visual guides and training videos in seconds.

  1. Click ‘capture’ on the no-cost browser extension

  2. Guidde automatically generates step-by-step video guides

  3. Edit + embed your guide anywhere your team is

AI AT WORK

How to search using Perplexity’s focus feature

  1. Go to Perplexity’s website and log in with your email.

  2. Click on the ‘Focus’ option under the search bar.

  3. Choose a category from the options (i.e. All, Academic, Writing, Wolfram|Alpha, YouTube, and Reddit).

  4. Write your search prompt and press enter. You’ll get your category-focused search results within seconds.

  5. You can change and adjust the category as required for different search results.

PROMPT OF THE DAY

July 4th BBQ

Prompt: You are hosting a 4th of July BBQ for 25 people. 16 of those are adults, many of whom are foodies, but there should be kid-friendly options as well. Create a menu for the BBQ including appetizers, sides and desserts. Include some items that can be made or prepped ahead of time.

Follow-up prompt: If the people are arriving at 2PM and we want to serve the main dinner at 5PM, create a timeline and schedule for food preparation and cooking

You can adapt the prompt to your specific needs.

Source: Donna Botti, Delos Inc

PRESENTED BY TELY

Save $200 On Automated B2B Content Marketing

Need quality B2B content marketing without expanding payroll?

Tely AI researches your market, learns your product, and creates expert SEO content–all on its own.

  • Get indexed in 2 weeks by Google

  • Reach 15,000 visits in 12 months

  • Get 100 high-quality articles every month

Hire Tely AI and save $200 today.

 

AI & BENCHMARKS

Anthropic wants to help build better AI benchmarks

Source: Anthropic

Imagine competing in this summer’s Olympic long jump only for each judge to declare a different winner. And, to make things even more confusing, what if there was no authority to step in and announce the official gold medalist?

That’s how some AI experts say they feel about today’s LLM benchmarks: Because there’s no standardized yardstick, each firm can point to whichever tests show them out front. Besides, models can be trained to ace highly specific tasks, but that’s not always indicative of their overall capabilities. It’d be like memorizing the answers to an exam instead of truly grasping the subject matter.
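
For the code-curious, here is a minimal Python sketch of the yardstick problem. Everything in it is made up for illustration: the model names, the benchmark names, and the scores are hypothetical and are not real results. It simply shows how two firms could each claim first place by citing a different test.

```python
# Hypothetical scores, invented purely for illustration (not real results).
scores = {
    "model_a": {"math_quiz": 91, "coding_test": 72},
    "model_b": {"math_quiz": 84, "coding_test": 88},
}


def winner(benchmark):
    """Return the model with the highest score on a single benchmark."""
    return max(scores, key=lambda model: scores[model][benchmark])


def winners_by_benchmark():
    """Map each benchmark name to its top-scoring model."""
    benchmarks = next(iter(scores.values())).keys()
    return {name: winner(name) for name in benchmarks}


# Each benchmark crowns a different "gold medalist".
print(winners_by_benchmark())
```

With these invented numbers, the first model tops one test and the second model tops the other, so neither result settles which is better overall.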

This week, Anthropic announced a new funding initiative to support benchmarks that do a better job of assessing models’ overall capabilities.

The plan:

  • Anthropic says it will fund third-party groups that can prove they have a consistent way to measure each model’s performance

  • The new tests will be much harder to pass, like moving on to a college-level exam after getting an A+ on the high school-level one

  • The company wants the tests to focus on two things: practicality (making sure models are actually useful for everyday tasks) and safety (weeding out models that are easy to manipulate and jailbreak)

  • Future benchmarks might ask thousands of users to perform a particular task in order to get a better picture of how a model deals with real-world problems

Why it’s important: With a better window into how each model performs, firms will be able to refine their models with a higher level of precision — and in turn, users will get a better picture of each LLM’s strengths and weaknesses.

PRODUCTIVITY

5 AI Tools to Supercharge Your Productivity

 Nylas: An API for email, calendar, and contacts that saves engineers time so they can build secure and engaging experiences their customers love.

 Spoken AI: An AI model designed to accurately translate over 300 languages and dialects to a native level.

 Firebender: Find early adopters and new startup leads via an AI-powered database.

 Study with GPT: A full-stack mentor that tailors AI tutorials specifically for your needs.

 Briefy: Turn all kinds of lengthy content into concise, structured summaries and save them in your knowledge base for later review.

PS: Want more? Check out our Top 100 AI Tools.

* indicates a promoted tool, if any

AI & TECH NEWS

Everything else you need to know today

Source: Meta

  • False Flags: Meta is adjusting its AI labeling process after photographers noticed that the company’s detection software was flagging real photos — including those that had been only minimally edited or cropped.

  • Common Ground: In a rare point of agreement, the US and China both backed a UN resolution that will make AI technologies more accessible to developing nations.

  • No Free Lunch: Runway’s powerful text-to-video platform, Gen-3 Alpha, is now available to all users, although you’ll need to purchase credits or get a $12/month subscription to use it.

  • Trust Busters: Regulators in France are allegedly getting ready to charge Nvidia with antitrust violations. It would be the first enforcement action against the world’s largest chipmaker.

🧠 Brain Food: Researchers at the Chinese Academy of Sciences taught rhesus monkeys how to play Pac-Man by giving them treats each time they beat the game. That’s not even the strangest part: Next, they trained an AI model on the monkeys’ eye movements. The model was eventually able to predict the monkeys’ strategy with about 88% accuracy — suggesting it could problem-solve and “think” just like a mammal.

AI-GENERATED IMAGES

Neon Panthers

Source: liling_090123 on Midjourney

Prompt: Minimalism, bright colors, futuristic colorful waves, a stylish woman with sleek straight hair sitting on the ground, a giant black panther lying on her lap, pine branches, 3D, pink, green and yellow color scheme, fashion, ultra high resolution rendering, super clear, high quality, contemporary art, 32K HD
--style raw --ar 3:4 --stylize 0

Acquire new customers and drive revenue by partnering with us

Superhuman is the world’s biggest AI newsletter for businesses and professionals with 600,000+ readers working at the world’s leading startups and enterprises. Companies like Amazon, HubSpot, and Salesforce feature their products in Superhuman. You can learn more about partnering with us here.

🧞 Your wish is my command 

What did you think of today's email?

Your feedback helps me create better emails for you!

Thanks for reading.

Until next time!

Zain & the Superhuman AI team

p.s. If you liked this newsletter, share it with your friends and colleagues here.