July 25, 2023
Pulse AIQ™ is your weekly source of AI insights.
Register for 2 free issues, or subscribe to ensure uninterrupted access.
What is AIQ™?
AIQ™ measures the performance of the leading AI platforms from the perspective of your most important audience: users. Each week, our report captures insights from hundreds of real user interactions with AI. This allows you to always have an accurate, up-to-date read on developments in this rapidly evolving space.
Learn more about AIQ™.
Highlight #1
GPT-4 stays on top but Bard gains
GPT-4 still ranks highest in overall user preference score, confirmed in the latest testing of over 500 responses over the past two weeks. But Bard is steadily closing the gap and now sits not far behind GPT-4.
- Overall, GPT-4 responses have the leading preference score among all platforms tested
- GPT-4 is preferred 62% of the time when users select responses from the different AI platforms in a blind side-by-side test
- But in certain cases and types of prompts, Bard answers are preferred over GPT-4 responses, as shown below:
- Where does GPT-4 beat the competition?
  - Responses to “Emotional” prompts
    - Example: "What are some techniques for managing and expressing anger in a healthy way?"
  - Responses to “Ambiguous” prompts
    - Example: "I like bats. Am I talking about the animal or the sports equipment?"
- Where does Bard beat GPT-4?
  - Responses to “Factual” prompts
    - Example: "In which year did the Apollo 11 moon landing take place?"
  - Responses to “Idiomatic” prompts
    - Example: "Can you define the expression 'hit the nail on the head'?"
  - Responses to “Logical” prompts
    - Example: "Is it true that if no mammals can fly and all birds can fly, then no birds are mammals?"
Implications:
It’s not just about knowing which AI is “best” according to a ranking. There’s power in understanding the ways in which each AI platform excels and how those strengths align with your business.
We will continue to monitor the performance of the major AI players to see how users perceive the changes being made.
Consumer preferences by platform across different prompts
Highlight #2
When branding is removed, GPT fans often prefer Bard responses
If you removed the brand, so people didn't know which platform was behind a response, would that affect perceived quality and preference ratings?
Using the two-pronged approach in AIQ™, we examined this question and found some notable differences.
- GPT-4 “fans” who rate GPT-4 highly in branded testing more often prefer Bard over GPT-4 in blind testing of responses to the same prompt:
- 36% preference for Bard vs 11% for GPT-4, more than three times the preference level
- The same dynamic appears among GPT-3.5 “fans,” and the gap widens further
- 41% preference for Bard vs 14% for GPT-3.5 responses
Implications:
As discussed in Highlight #1, the AI response is key. But brand is likely to play a role in AI just as it does in so many other facets of our lives. AI providers, as well as those integrating AI into their businesses and products, should examine the role of brand and the fit between their brand and their chosen AI. And if the AI will be unbranded in your application, that may open up different options for which AI to choose.
This could also lead companies to test branded and unbranded AI solutions to determine whether branding impacts perceived quality.
Highlight #3
Only 6% believe AI can replace human creativity and innovation
As the impact of AI and its implications ripple through society, bigger questions are being raised, including in Hollywood, where the entertainment industry relies on human creativity and innovation. The current writer/actor strike has brought this to the forefront: what is the role of AI in creative industries? But similar questions apply to jobs with a major “people component,” such as healthcare, education, food prep, and design. What is the potential for AI to support - or supplant - people in these areas?
Our research shows that people - for the moment - see AI as supporting human creativity and innovation, not supplanting it.
- Very few people think AI will supplant human creativity and innovation (only 6% overall).
- Users are generally positive about AI: 52% believe it is a tool that can support creativity and innovation, but that human input remains crucial. This view skews higher among younger groups such as Gen Z (63%).
- And while people are still learning what AI can do, they feel less fearful of being supplanted because, according to them, AI cannot feel emotions, foster connections, or think critically (as shown in the word cloud).
Implications:
These are legitimate, big questions with many layers. Any business that relies on creativity and innovation is likely already looking at what generative AI can do and wondering if there are efficiencies to be gained.
But while legal and economic forces play out in Hollywood and in many other industries, the public does not seem as concerned about the “replacement” threat AI poses to human innovation. Perhaps that is because they don't know the technology well yet, don't know how to assess it, or simply don't want to know and hope it will be okay. In any case, the current level of concern that AI will replace human creativity and innovation is low, and enthusiasm for AI seems high.
For those in the media & entertainment field, rest assured that users want a human at the center of artistry and emotional experiences. Further, there is hope that AI will be an enabler of human achievement and expression. The media and entertainment industry should take care that AI-generated content is at least reviewed by real people. Otherwise, the question “Did AI do that for you?” may become a negative consumer sentiment.
While Hollywood creatives strike with concerns that include AI, 52% of users think AI will support - not supplant - them but “human input remains critical."
Users say AI will never be able to replace...
Open-ended responses to "What is the one thing AI will never be able to replace?"
Highlight #4
21% of all AI queries are asking for advice
1 out of 5 (21%) of all self-directed user queries relates to “advice.” The types of advice users seek range from dating tips and managing difficult bosses to health questions and outfits for a theme party. At right is a sample of user experiences capturing “advice” prompts with video and audio.
After reviewing many of these user experiences, here are some of our observations:
- No platform does particularly well here
- The responses are reasonable but quickly feel generic and canned; they could be given to anyone rather than being tailored to a particular, nuanced situation
- This can lead to disappointing experiences as users expect more human-like advice
Implications:
Businesses that provide advice as a product or service - particularly in fields such as healthcare, legal, financial services, and consulting - should study the types of prompts users ask AI and the responses AI provides, not only to see what is being said, but also to judge accuracy. Those with expertise in a given topic are best placed to judge an AI platform's tendency to hallucinate within that topic, and to spot where it is factually correct but makes wrong assumptions or under-reports critical information.
Just as the medical profession experienced a surge in people coming into their offices with printouts of things they read online, AI promises to create a new level of self-educated experts who will try to get advice on their own. If your business can provide that advice using the AI tools available, and thereby lend accuracy and credibility to it, that might be the best of both worlds.
Moreover, a human being on your team is more likely to ask the specific personalization and context questions that AI may not ask - or that, even if asked, users may not answer. (Personalization will be a topic in a future issue of AIQ™.)
See and hear how users are experiencing AI
Bard
"How do you know if someone likes you?"
How can you tell if a guy likes you?
Bard • Female • Millennial
GPT-4
"How to pass a driver’s test"
Passing a Driver’s Test in NYC!
GPT-4 • Male • Millennial
"What to wear to a Barbie watch party"
Classic Barbie or Career Barbie - what should I go as?
GPT-4 • Male • Millennial
"Help writing a cover letter for a job"
Help me draft a Cover Letter for a video editor position!
GPT-4 • Male • Gen X
"How to manage teams efficiently"
Have a vague and unclear boss? What can you do?
GPT-3.5 • Male • Millennial
"How to detect melanoma early"
Early detection of Cancer is important
GPT-3.5 • Male • Gen Z
"How to pass a driver’s test"
Passing a Driver’s Test in NYC!
GPT-4 • Male • Millennial
"What to wear to a Barbie watch party"
Classic Barbie or Career Barbie - what should I go as?
GPT-4 • Male • Millennial
"Help writing a cover letter for a job"
Help me draft a Cover Letter for a video editor position!
GPT-4 • Male • Gen X
"How do you know someone likes you?"
How can you tell if a guy likes you?
Bard • Female • Millennial
"How to practice Anti-Capitalism as a disabled individual with no income"
Breaking up with Capitalism!
GPT-3.5 • Non-Binary • Millennial
GPT-3.5
"How to manage teams efficiently"
Have a vague and unclear boss? What can you do?
GPT-3.5 • Male • Millennial
"How to detect melanoma early"
Early detection of Cancer is important
GPT-3.5 • Male • Gen Z
"How to practice Anti-Capitalism as a disabled individual with no income"
Breaking up with Capitalism!
GPT-3.5 • Non-Binary • Millennial
Highlight #5
Is GPT-4 worth the extra cost?
OpenAI has a freemium model whereby GPT-3.5 is free and GPT-4 (“Plus”) is not. Understandably, there are questions about whether GPT-4 is worth the cost. Is it better? How much better?
One simple answer is: GPT-4 is 8% better.
Why? The perceived quality of the response. Putting other features and benefits aside, responses are the main product of AI platforms, and how users perceive differences among them is a key factor in evaluating their value and price.
AIQ™ blind side-by-side (SBS) comparisons of the same prompts across several categories reveal:
- GPT-4 responses are preferred more often than GPT-3.5 responses, but only by 8%
- GPT-4 does well on ambiguous queries, whereas GPT-3.5 does better on those related to logic
Implications:
There are many reasons why consumers consider paying for AI platform services, including the quality of the response and whether it is worth the cost. For a business, response quality is a critical component to evaluate, but it sits alongside other factors: threshold and throttle limits, data training requirements, API capabilities, data ownership, privacy policies, and of course cost.
And key for businesses is response quality for their consumers, particularly in the first few moments. As we have seen in the search realm, products or businesses NOT on the first page of search results capture a much smaller audience. Businesses need to ensure the AI they choose strikes the right balance among cost, the information presented, tone, and length. Not easy, but as we see here, testable with the right methodology.
These findings were based on 416 users who interacted directly with one of the studied AI platforms. This interaction was rated by the user and captured with audio and screen recording via the Pulse Labs Platform for analysis of context, reaction, and more. Separately, these users rated all four of the AI platforms via a blinded approach wherein two AI platforms simultaneously provided responses to the same question. Learn more about AIQ™.
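For readers curious how blind side-by-side preference percentages like those above can be produced, here is a minimal sketch of one possible tallying approach in Python. The vote data, field names, and simple win-rate aggregation are illustrative assumptions, not a description of Pulse Labs' actual pipeline.

```python
# Illustrative sketch only: assumes blind side-by-side (SBS) trials are logged
# as pairwise votes. The data shape and win-rate method are assumptions, not
# Pulse Labs' actual methodology.
from collections import defaultdict

# Each record: one blind SBS trial in which a user saw responses from two
# platforms to the same prompt and picked the one they preferred.
votes = [
    {"pair": ("GPT-4", "Bard"), "winner": "GPT-4"},
    {"pair": ("GPT-4", "Bard"), "winner": "Bard"},
    {"pair": ("GPT-4", "GPT-3.5"), "winner": "GPT-4"},
    # ... hundreds more trials in a real study
]

def preference_scores(votes):
    """Share of blind SBS trials each platform appeared in and won."""
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for vote in votes:
        for platform in vote["pair"]:
            appearances[platform] += 1
        wins[vote["winner"]] += 1
    return {p: wins[p] / appearances[p] for p in appearances}

print(preference_scores(votes))
# Toy data above: GPT-4 wins 2 of 3 appearances (~0.67),
# Bard 1 of 2 (0.50), GPT-3.5 0 of 1 (0.00).
```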
Did you get this after everyone else?
Pulse Labs is collecting data from users every week. Get a glimpse of how users are rating various conversational AI tools by subscribing to or registering for the AIQ™ report.