Google recently unveiled a beta version of its new note-taking and information management tool, Notebook LM. If you've used note-taking tools like Notion or Capacities, Notebook LM will feel familiar.
What makes Notebook LM different is that it uses AI (specifically Google’s Gemini LLM) to offer a set of synthesis, summary, and analysis tools for the documents or notes you share with it. It can write a summary or an FAQ, answer specific questions based on the sources you feed it, and even offer citations showing where it got a particular bit of information. I’m experimenting with the summary tools now, and I could see them being helpful for people whose jobs require being familiar with a large volume of documents: lawyers, doctors, teachers, and… journalists.
But the task that raised my eyebrows is this: It makes a podcast.
What do I mean by podcast here? In this case, it generates the audio of a chat-show-style podcast, with two robot hosts talking knowledgeably about a particular subject. Fans of Hard Fork, The Daily, Today Explained, and Amicus will find the format familiar (and man, the “male” voice REALLY sounds like KCRW’s David Greene).
Now, you’re probably wondering, does it make a GOOD podcast?
No.
Here’s an example of a “podcast” it made based on three sources I fed it, all of them related to the arrival of GPT-4 and other generative AI tools in K-12 learning environments.
Now, the fact that it can do this at all is amazing. Seriously forking amazeballs. And to be clear, Google is not claiming that this tool makes a GOOD podcast. The podcast is one of several summary tools it offers to help people understand the notes, articles, and audio they have collected.
But I don’t even think it’s a great summary tool. The “conversation” does not go into much depth about the ideas in the articles. It does share some specific examples, and it does a fair job summarizing the main arguments, but if I didn’t already KNOW the articles, I don’t think I would learn much from the summary, nor would I be challenged to think differently about the topic the way that great episodes of the podcasts I mentioned above do. The summary misses many of the nuances in the articles I shared, and doesn’t acknowledge that the three sources don’t agree with each other. Some of the other summary tools Notebook offers work a little better (and its tool that lets you ask a QUESTION about the data you’ve assembled is neat).
If you had something you needed to read and literally no time before your commute, you might listen to the summary, and that would be better than nothing. So I’d give Notebook’s podcast tool a C+ when it comes to summary.
That said, I’ll give it an A when it comes to faithfully imitating a certain style of podcast. I’m not sure how they trained the model. One way would be to tokenize the transcripts of actual podcasts in order to learn the speech patterns and structures that dominate the genre. (BTW, this may be copyright infringement.) Or possibly they created a “podcastese” algorithm. According to a recent interview with one of the developers on Hard Fork, the podcast is generated in at least two stages:
1. Use Gemini to create a reasonable summary of the arguments and examples in the text.
2. Translate that summary into “podcastese”.
In this case, “podcastese” includes grunts, uhs, umms, likes, pauses, and very casual, conversational (and white?) American English.
I suspect there are probably TWO steps to “podcastese”: one to create a fairly straightforward script for the dialogue, and two to add the idiosyncratic elements the model has learned to associate with American podcasting. Or maybe they somehow trained it on the audio of podcasts, turning the grunts, “ums,” and “likes” into tokens alongside words and phrases.
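To make that concrete, here’s a rough, hypothetical sketch of what such a two-stage pipeline could look like. The prompts, model name, and helper functions are my guesses based on that Hard Fork interview, not Google’s actual implementation.

```python
# A rough, hypothetical sketch of the two-stage pipeline described above.
# The prompts, model name, and helpers are my guesses, not Google's code.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")          # assumes you supply your own key
model = genai.GenerativeModel("gemini-1.5-pro")  # any Gemini model would do here


def summarize(sources: list[str]) -> str:
    """Stage 1: distill the uploaded sources into arguments and examples."""
    prompt = (
        "Summarize the main arguments and concrete examples in these sources, "
        "noting where they disagree:\n\n" + "\n\n---\n\n".join(sources)
    )
    return model.generate_content(prompt).text


def podcastese(summary: str) -> str:
    """Stage 2: rewrite the summary as a casual two-host chat script."""
    prompt = (
        "Rewrite this summary as a dialogue between two podcast hosts. Use casual "
        "American English, with filler words ('um', 'like'), reactions, "
        "interruptions, and idioms like 'break that down for me':\n\n" + summary
    )
    return model.generate_content(prompt).text


if __name__ == "__main__":
    script = podcastese(summarize(["<article 1>", "<article 2>", "<article 3>"]))
    print(script)  # a text-to-speech step would then voice the two "hosts"
```

However Google actually does it, the giveaway is in the output itself.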
It throws in reactions, interruptions, and idiomatic phrases like “The hype was real,” “insanely good,” “Like, scarily good,” or “break that down for me.”
And it EVEN occasionally throws in mumbly mispronunciations like “CHAT-GHEE-PEE-TEE” at :02. And of course, the “hosts” offer to share the original notes “in the show notes.” (Again, I would suspect they trained it on David Greene, but the fact that I haven’t heard the robotic host mention a Pittsburgh sports team weakens that theory.)
What does this tell us about AI?
While Notebook LM is impressive, it doesn’t represent a fundamentally new AI capacity. Large Language Models still don’t think. They don’t have knowledge apart from knowing patterns. They don’t understand what they’re talking about. BUT, what generative AI is EXTREMELY good at is accurately imitating certain kinds of human communication. What kinds? Well, anything you can train it on. In the past, that meant hundreds of written languages and dialects like “legalese,” “medicalese,” or “Reddit-bro cant.” Now, apparently, that includes chatty two-way podcasts in informal, middlebrow American English. This is quite consistent with the other things Gemini and ChatGPT have proven brilliant at: generating text according to a well-documented formula.
But the fact that the tool can generate a quasi-real simulacrum of a podcast should tell us something about podcasting.
What does this tell us about podcasting?
I think it tells us a few interesting things about our current moment in podcasting. It is frankly surreal to hear the AI imitating hosting and production moves that I hear on podcasts all the time (and have used). And even if Google doesn’t claim its tool can make a podcast, I’m sure somebody is thinking that they could save a lot of money on pesky producers if they could make a podcast with AI.
A format and style that was once fresh and innovative has become so commonplace as to be easily imitable.
I think the modern style of conversational podcasting has its origins in Ira Glass’s hosting and interviewing style on This American Life. Ira has a voice and delivery style that would have raised eyebrows prior to the 1980s, and perhaps kept him from being allowed near a microphone. Lots of people may have sounded like Ira in their ordinary speech, but he trained himself to read his scripts as though he was talking, with pauses, ums, and likes included. He wanted to sound like he was thinking and talking to the listener, rather than reading a script. In other words, he turned a perceived liability into a strength. His approach suggested greater authenticity. While the public radio hosts of the 1970s and 80s didn’t quite sound like Edward R. Murrow and Orson Welles, they tended to have a more formal, stentorian vocal style. Listen to NPR hosts in 1990.
The fact that many of the popular early boom podcasts emerged in the orbit of This American Life (Serial, Gimlet productions) and that casual conversational styles were essential to shows like Radiolab, Invisibilia, and The Daily has led to a casual, reactive, idiomatic, mumbly conversational approach in thousands of American podcasts. It is the norm.
While Ira sounded WEIRD when he first appeared on national radio in the late 1990s (and this probably helped people recognize the show), people now consciously or unconsciously imitate his pauses, his “ums,” his “likes,” the chuckles, the interruptions, and they imitate the popular hosts who arose in his wake.
Some portion of people associate that style with podcasting, and perhaps for them, the STYLE is the product. Or at least it’s essential to the product. It’s not a podcast without ums, likes, and “let me break it down” or “here’s the thing”. It’s become so ingrained that a computer can plausibly imitate it. The world is mistaking the style for the substance.
This feeds into another observation that I seem to make often these days.
People don’t know how much work and skill it takes to make a good podcast.
Perhaps that’s because people can’t immediately tell the difference between a good podcast and a bad podcast, and to people not in the know, simply creating something that SOUNDS like a podcast is impressive enough.
Actually, even before Notebook, it’s long been pretty easy to make something that sounds like a podcast. And it’s ALWAYS been hard to make it enlightening, clear, relevant, amusing and entertaining. Lots of people can imitate Ira Glass’s speaking style, but few have his (and his team’s) sense of narrative structure, surprise, and rhythmic pacing.
Talking about something intelligently, sharing the new information, grounding that information in a deeper context, all while engaging in witty banter, on the fly, is HARD. If the hosts and producers do their job well, it sounds easy and natural. The fact it sometimes sounds clunky (cough fake reporter 2-ways, cough cough) is not because the people working on the shows aren’t skilled, it’s because what they’re attempting is actually HARD.
When a host and reporter, or two hosts, have an actually thoughtful conversation, with moments of banter, with moments of discovery, and with a real narrative structure, that represents a triumph. That team is on it. Usually, the host is exceptionally good. Some co-hosts are EXTREMELY good at talking through complex, new, and relevant information, while remaining light and entertaining (again, hats off to Hard Fork) but even those shows probably have to do a few retakes, and are probably improvising around a structure and plan their producer has created and gone over with an editor.
This is real work, folks.
Other AI Incursions Into Podcasting
Descript/Squadcast has led the charge in incorporating artificial intelligence into podcast production workflows (although I heard a rumor that Adobe is getting into this realm). In the last few years, it has introduced a tool that many of us used to fantasize about: a podcast text editor. Rather than manipulating a Digital Audio Editor like Pro-Tools, Reaper, or Hindenburg, software that takes weeks or months to gain proficiency with, a Descript user can record a conversation on Squadcast and then edit the podcast transcript just as they would edit a Google or Word document. Cut a sentence from the transcript, and it disappears from the audio. Amazing.
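The general idea behind text-based audio editing (this is not Descript’s actual code, just a toy illustration of the technique) is that every word in the transcript carries start and end timestamps, so deleting text tells the editor which spans of audio to cut:

```python
# Not Descript's code -- just a toy illustration of the general technique:
# every transcript word carries timestamps, so a text edit maps to audio cuts.
from pydub import AudioSegment  # assumes pydub (and ffmpeg) are installed

# Hypothetical word-level transcript: (word, start_ms, end_ms),
# the kind of thing a speech-to-text service returns.
transcript = [
    ("So", 0, 180), ("um", 180, 420), ("like", 420, 700),
    ("welcome", 700, 1200), ("to", 1200, 1350), ("the", 1350, 1500),
    ("show", 1500, 2100),
]


def cut_words(audio: AudioSegment, words, to_remove: set[str]) -> AudioSegment:
    """Rebuild the audio, keeping only spans whose words survive the text edit."""
    kept = AudioSegment.empty()
    for word, start_ms, end_ms in words:
        if word.lower() not in to_remove:
            kept += audio[start_ms:end_ms]
    return kept


audio = AudioSegment.from_file("episode_raw.wav")  # hypothetical recording
edited = cut_words(audio, transcript, to_remove={"um", "like"})
edited.export("episode_edited.wav", format="wav")
```

Splicing the surviving spans back together this crudely, with no crossfades or room tone, is more or less why some automated edits don’t sound good, as I describe below.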
It has also unveiled other tools that purport to fix p-pops, cut out filler words, reduce background noise, fix “upcuts,” and do many of the other things formerly done by a relatively skilled producer, at least somebody with 6 months to a year of experience working in DAE tools.
BUT, for now, in my experience, Descript’s tools are not very reliable. If you use the text editor, a certain percentage of the edits just don’t sound good. (The percentage depends on the sound quality and speaking style of the original interview: a mumbly, interrupty conversation with poor sound quality will have more bad edits than a well-recorded one in which people don’t interrupt and speak clearly.) The audio is cut in the wrong place, or not enough silence or room tone is inserted between the newly edited sections. The sound quality tools can create weird phasing or noise-gate effects, and the upcut fixes don’t always work.
A novice podcaster could use the Descript text editing tool in a workflow where they create a first draft by editing the text, and then a skilled producer “fixes” the draft and cleans it up in Reaper or Pro-Tools. This is an excellent use case for a professor or small business owner who has a strong vision for how to edit their chat podcast, but not the time or inclination to use the editing tools. The host can take a first pass and hand it off to a skilled producer, who, in 1-2 hours, can generate the final episode.
So, we have Descript, which allows novice producers without DAE skills to record and edit a podcast… not quite up to broadcasting standards. And we have Notebook LM that can create an impressive but ultimately vacuous imitation of a chat podcast.
Our Jobs Are Safe From AI (so far)
As a podcast producer, does this make me worry that AI is coming for my job? Not yet. AI is NOWHERE NEAR able to make a good podcast. Might it get there someday?
These podcast production tools are likely to improve marginally. The AIs are constantly getting tweaked and improved. But will AI get to the point where the technology can actually replace an associate producer, competently putting together a podcast? Not unless it is actually able to think critically: to combine narrative structure with conversational knowledge and surprising, relevant information. This is literally the trillion-dollar question (one estimate of the tech industry’s investment in AI). Will we create Artificial General Intelligence? Nobody actually knows if it’s possible, or how long it might take. We are bad at predicting the future development of this technology.
If you want to dive into why some scientists think it’s unlikely we’ll get there soon (as in, within ten years), check out AI Snake Oil.
And while we’re waiting for AGI, if you want to make a podcast, hire a good producer.