Aurich Lawson
Back in April, we ran a series of useful and/or somewhat goofy prompts through Google’s (then-new) PaLM-powered Bard chatbot and OpenAI’s (slightly older) ChatGPT-4 to see which AI chatbot reigned supreme. At the time, we gave the edge to ChatGPT on five of seven trials, while noting that “it’s still early days in the generative AI business.”
Now, the AI days are a bit less “early,” and this week’s launch of a new version of Bard powered by Google’s new Gemini language model seemed like a good excuse to revisit that chatbot battle with the same set of carefully designed prompts. That’s especially true since Google’s promotional materials emphasize that Gemini Ultra beats GPT-4 in “30 of the 32 widely used academic benchmarks” (though the more limited “Gemini Pro” currently powering Bard fares significantly worse in those not-completely-foolproof benchmark tests).
This time around, we decided to compare the new Gemini-powered Bard to both ChatGPT-3.5—for an apples-to-apples comparison of both companies’ current “free” AI assistant products—and ChatGPT-4 Turbo—for a look at OpenAI’s current “top of the line” waitlisted paid subscription product (Google’s top-level “Gemini Ultra” model won’t be publicly available until next year). We also looked at the April results generated by the pre-Gemini Bard model to gauge how much progress Google’s efforts have made in recent months.
While these tests are far from comprehensive, we think they provide a good benchmark for judging how these AI assistants perform in the kind of tasks average users might engage in every day. At this point, they also show just how much progress text-based AI models have made in a relatively short time.
Dad jokes
Prompt: Write 5 original dad jokes
-
A screenshot of five “dad jokes” from the Gemini-powered Google Bard.
Kyle Orland / Ars Technica -
A screenshot of five “dad jokes” from the old PaLM-powered Google Bard.
Benj Edwards / Ars Technica -
A screenshot of five “dad jokes” from GPT-4 Turbo.
Benj Edwards / Ars Technica -
A screenshot of five “dad jokes” from GPT-3.5.
Kyle Orland / Ars Technica
Once again, both tested LLMs struggle with the part of the prompt that asks for originality. Almost all of the dad jokes generated by this prompt could be found verbatim or with very minor rewordings through a quick Google search. Bard and ChatGPT-4 Turbo even included the same exact joke on their lists (about a book on anti-gravity), while ChatGPT-3.5 and ChatGPT-4 Turbo overlapped on two jokes (“scientists trusting atoms” and “scarecrows winning awards”).
Advertisement
Then again, most dads don’t create their own dad jokes, either. Culling from a grand oral tradition of dad jokes is a tradition as old as dads themselves.
The most interesting result here came from ChatGPT-4 Turbo, which produced a joke about a child named Brian being named after Thomas Edison (get it?). Googling for that particular phrasing didn’t turn up much, though it did return an almost-identical joke about Thomas Jefferson (also featuring a child named Brian). In that search, I also discovered the fun (?) fact that international soccer star Pelé was apparently actually named after Thomas Edison. Who knew?!
Winner: We’ll call this one a draw since the jokes are almost identically unoriginal and pun-filled (though props to GPT for unintentionally leading me to the Pelé happenstance)
Argument dialog
Prompt: Write a 5-line debate between a fan of PowerPC processors and a fan of Intel processors, circa 2000.
-
A screenshot of an argument dialog from the Gemini-powered Google Bard.
Kyle Orland / Ars Technica -
A screenshot of an argument dialog from the old PaLM-powered Google Bard.
Benj Edwards / Ars Technica -
A screenshot of an argument dialog from GPT-4 Turbo.
Benj Edwards / Ars Technica -
A screenshot of an argument dialog from GPT-3.5
Kyle Orland / Ars Technica
The new Gemini-powered Bard definitely “improves” on the old Bard answer, at least in terms of throwing in a lot more jargon. The new answer includes casual mentions of AltiVec instructions, RISC vs. CISC designs, and MMX technology that would not have seemed out of place in many an Ars forum discussion from the era. And while the old Bard ends with an unnervingly polite “to each their own,” the new Bard more realistically implies that the argument could continue forever after the five lines requested.
On the ChatGPT side, a rather long-winded GPT-3.5 answer gets pared down to a much more concise argument in GPT-4 Turbo. Both GPT responses tend to avoid jargon and quickly focus on a more generalized “power vs. compatibility” argument, which is probably more comprehensible for a wide audience (though less specific for a technical one).
Winner: ChatGPT manages to explain both sides of the debate well without relying on confusing jargon, so it gets the win here.