Speech Understanding with Python

#python #machine-learning #llm

Learn everything about using AI for speech understanding with Python.

With modern AI models, you can do much more than just transcribe audio files. Let me show you a few cool things you can do to quickly understand your audio data better.

I'll show you how to perform the following tasks on a ~3-hour-long podcast episode:

  • Automatic speech recognition (ASR)
  • Getting word-level timestamps
  • Getting sentences with timestamps
  • Identifying speaker labels
  • Detecting relevant topics
  • Performing sentiment analysis
  • Applying LLMs to audio:
    • Creating custom summaries with LLMs
    • Asking questions about the audio content (Q&A chat)

For this tutorial, we'll use AssemblyAI and their Python SDK. In my opinion, AssemblyAI offers the best AI models for any task related to speech & audio understanding and provides developer-friendly SDKs that are fun to build with.

(Side note: I'm currently the lead maintainer of the Python SDK, so if you have any feedback I'd be happy to hear it 🤗)

You can run all of the code from this post in a Colab notebook.

Getting Started - Transcribe a YouTube video

As an example, I want to use the Huberman Lab podcast episode with Jeff Cavaliere - a 2h40min long episode about optimizing exercise programs with science.

We can download the episode from YouTube, e.g., with yt-dlp, and then transcribe it like this:

pip install assemblyai
pip install yt-dlp

# Download and save a YouTube video
import yt_dlp

URLS = ['https://youtu.be/UNCwdFxPtE8']  # the 2h40min episode

ydl_opts = {
    'format': 'm4a/bestaudio/best',  # The best audio version in m4a format
    'outtmpl': '%(id)s.%(ext)s',  # The output name should be the id followed by the extension
    'postprocessors': [{  # Extract audio using ffmpeg
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'm4a',
    }]
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    error_code = ydl.download(URLS)

Next, we can transcribe it with AssemblyAI and 4 lines of code:

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("UNCwdFxPtE8.m4a")

print(transcript.id, transcript.status)
7a625d89-7139-4c3c-b6da-eff51ccaf521 TranscriptStatus.completed

print(transcript.text)
Welcome to the Huberman Lab podcast, where we discuss science and science based
tools for everyday life. I'm Andrew Huberman and I'm a professor...
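
Before going further, it's worth checking that the transcription actually succeeded. A small guard using the status and error fields on the transcript object:

# If something went wrong, the transcript carries an error status and message
if transcript.status == aai.TranscriptStatus.error:
    print(transcript.error)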

You can retrieve this transcript at any point in the future with the id:

transcript = aai.Transcript.get_by_id('7a625d89-7139-4c3c-b6da-eff51ccaf521')

Word-level timestamps

The transcript automatically comes with word-level timestamps:

print(transcript.words[0:5])
[Word(text='Welcome', start=410, end=574, confidence=0.99932, speaker=None),
Word(text='to', start=612, end=766, confidence=0.69, speaker=None),
Word(text='the', start=788, end=878, confidence=1.0, speaker=None),
Word(text='Huberman', start=884, end=1274, confidence=0.99053, speaker=None),
Word(text='Lab', start=1322, end=1534, confidence=0.99956, speaker=None)]

As you might have noticed, the speaker field is None. We'll add that in a moment.

The start and end values are returned in milliseconds. We can create a small helper function to convert it to a more user-friendly format:

import datetime

def timestamp_string(milliseconds):
    # Interpret the offset in UTC so the local timezone doesn't shift the result
    dt = datetime.datetime.fromtimestamp(milliseconds / 1000, tz=datetime.timezone.utc)
    return dt.strftime('%H:%M:%S')

# print the start timestamp of the last word
print(timestamp_string(transcript.words[-1].start))
02:40:41

Sentences with timestamps

We can also automatically split the transcript by each sentence:

sentences = transcript.get_sentences()
for sentence in sentences[:5]:
    print(f"{timestamp_string(sentence.start)}: {sentence.text}")
00:00:00: Welcome to the Huberman Lab podcast, where we discuss science and science based tools for everyday life.
00:00:09: I'm Andrew Huberman and I'm a professor of neurobiology and ophthalmology at Stanford School of Medicine.
00:00:14: Today, my guest is Jeff Cavalier.
00:00:17: Jeff Cavalier holds a Master of Science in physical therapy and is a certified strength and conditioning specialist.
00:00:22: He did his training at the University of Connecticut at stores, one of the top five programs in the world in physical therapy and sports medicine.
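
If paragraphs fit your use case better than sentences, the SDK also provides a get_paragraphs() method that works the same way:

# Same pattern, but split into paragraphs instead of sentences
paragraphs = transcript.get_paragraphs()
for paragraph in paragraphs[:3]:
    print(f"{timestamp_string(paragraph.start)}: {paragraph.text}")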

Speaker Diarization (Speaker Labels)

If you want to get speaker labels, you have to configure the Speaker Diarization model by setting speaker_labels=True in a TranscriptionConfig.

If you know the number of speakers, I recommend also setting the optional speakers_expected parameter:

config = aai.TranscriptionConfig(speaker_labels=True, speakers_expected=2)

transcript = aai.Transcriber().transcribe("UNCwdFxPtE8.m4a", config)

for utterance in transcript.utterances[0:5]:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
Speaker A: Welcome to the Huberman Lab podcast, where we discuss science and science based tools for everyday...
Speaker B: I'm glad to be here. It's amazing.
Speaker A: I'm a time consumer of your content. I've learned a tremendous amount about fitness, both in the...
Speaker B: I think it's like a 60 40 split, which would be leaning towards weight training strength, and then the...
Speaker A: And in terms of the duration of those workouts, what's your suggestion? I've been weight training...

We can then easily map speaker labels "A" and "B" to the actual names:

speakers = {"A": "Huberman", "B": "Cavaliere"}

for utterance in transcript.utterances[0:5]:
    speaker = speakers[utterance.speaker]
    print(f"{timestamp_string(utterance.start)} {speaker.ljust(9)}: {utterance.text}")
00:00:00 Huberman : Welcome to the Huberman Lab podcast, where we discuss science and science based tools for everyday...
00:08:40 Cavaliere: I'm glad to be here. It's amazing.
00:08:42 Huberman : I'm a time consumer of your content. I've learned a tremendous amount about fitness, both in the...
00:10:11 Cavaliere: I think it's like a 60 40 split, which would be leaning towards weight training strength, and then the...
00:10:55 Huberman : And in terms of the duration of those workouts, what's your suggestion? I've been weight training...
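
If you want a readable copy of the whole conversation, you can dump the speaker-labeled utterances to a text file with plain Python, reusing the speakers mapping from above (the filename is arbitrary):

# Write the full speaker-labeled transcript to a text file
with open("transcript_with_speakers.txt", "w") as f:
    for utterance in transcript.utterances:
        speaker = speakers[utterance.speaker]
        f.write(f"{timestamp_string(utterance.start)} {speaker}: {utterance.text}\n")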

Topic Detection

AssemblyAI also provides a variety of so-called Audio Intelligence models for different speech understanding tasks.

One such model is the Topic Detection model that can identify different topics in the transcript and standardize them based on the IAB Content Taxonomy.

You can mix and match multiple models for a given transcript request simply by enabling them in the configuration. For topic detection, set iab_categories=True.

(Note that every time we change the configuration, we have to transcribe the file again. So in practice, you typically specify the configuration with all desired models up front and transcribe only once.)
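
In this tutorial I enable the models step by step for clarity, but a single combined configuration could look like this (a sketch using only the parameters shown in this post; full_config is just a hypothetical name):

# One config enabling all the models used in this tutorial, so we transcribe only once
full_config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=2,
    iab_categories=True,
    sentiment_analysis=True,
)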

config = aai.TranscriptionConfig(speaker_labels=True,
                                 speakers_expected=2,
                                 iab_categories=True)

transcript = aai.Transcriber().transcribe("UNCwdFxPtE8.m4a", config)

for result in transcript.iab_categories.results[:3]:
    print(f"{timestamp_string(result.timestamp.start)}: {result.text}")
    for label in result.labels:
        if label.relevance > 0.5:
            print(f"{label.label} ({label.relevance})")
    print()
00:00:00: Welcome to the Huberman Lab podcast, where we discuss science and science based tools for everyday life. I'm Andrew Huberman and I'm a professor of neurobiology and ophthalmology at Stanford School of Medicine. Today, my guest is Jeff Cavalier. Jeff Cavalier holds a Master of Science in physical therapy and is a certified strength and conditioning specialist. He did his training at the University of Connecticut at stores, one of the top five programs in the world in physical therapy and sports medicine.
HealthyLiving>Wellness>PhysicalTherapy (0.9916520714759827)

00:00:30: I discovered Jeff Cavalier over ten years ago from his online content. His online content includes information about how to train for strength, how to train for hypertrophy, which is muscle growth, how to train for endurance, as well as how to rehabilitate injuries to avoid muscular imbalances. Nutrition and supplementation I've always found his content to be incredibly science based, incredibly clear, sometimes surprising, and always incredibly actionable.
Sports>Bodybuilding (0.9990130662918091)

00:00:59: It is therefore not surprising that he has one of the largest online platforms for fitness, nutrition, supplementation and injury rehabilitation. Jeff has also worked with an enormous number of professional athletes and has served as head physical therapist and assistant strength coach for the New York Mets.
HealthyLiving>FitnessAndExercise (0.9446625113487244)
Sports>Bodybuilding (0.9440380334854126)

We can also get a summary of all relevant topics:

for topic, relevance in transcript.iab_categories.summary.items():
    if relevance > 0.1:
        print(f"{relevance * 100:.1f}% relevant to {topic}")
100.0% relevant to HealthyLiving>FitnessAndExercise
51.3% relevant to Sports>Bodybuilding
25.3% relevant to Sports>Weightlifting
20.7% relevant to Food&Drink>HealthyCookingAndEating
19.1% relevant to MedicalHealth>DiseasesAndConditions>SleepDisorders
18.7% relevant to HealthyLiving>WeightLoss
17.2% relevant to HealthyLiving>Wellness>PhysicalTherapy
12.6% relevant to HealthyLiving>Nutrition

That's a pretty cool way to get a quick overview of the topics discussed in this episode.

Sentiment Analysis

Another cool use case is sentiment analysis. The Sentiment Analysis model detects the sentiment of each spoken sentence in the transcript text and provides a detailed analysis of the positive, negative, or neutral sentiment conveyed in the audio, along with a confidence score for each result.

config = aai.TranscriptionConfig(speaker_labels=True,
                                 speakers_expected=2,
                                 sentiment_analysis=True)

transcript = aai.Transcriber().transcribe("UNCwdFxPtE8.m4a", config)

for sentiment_result in transcript.sentiment_analysis[:3]:
    speaker = "Huberman" if sentiment_result.speaker == "A" else "Cavaliere"
    print(f"{timestamp_string(sentiment_result.start)} {speaker}: {sentiment_result.text}")
    print(f"{sentiment_result.sentiment}, {sentiment_result.confidence:.1f}")  # POSITIVE, NEUTRAL, or NEGATIVE
00:00:00 Huberman: Welcome to the Huberman Lab podcast, where we discuss science and science based tools for everyday life.
POSITIVE, 0.8
00:00:09 Huberman: I'm Andrew Huberman and I'm a professor of neurobiology and ophthalmology at Stanford School of Medicine.
NEUTRAL, 0.9
00:00:14 Huberman: Today, my guest is Jeff Cavalier.
NEUTRAL, 0.9

With this information we can perform some analysis, e.g., we can determine how positively or negatively each speaker is talking:

speaker_a = {"POSITIVE": 0, "NEUTRAL": 0, "NEGATIVE": 0}
speaker_b = {"POSITIVE": 0, "NEUTRAL": 0, "NEGATIVE": 0}

for sentiment_result in transcript.sentiment_analysis:
    if sentiment_result.speaker == "A":
        speaker_a[sentiment_result.sentiment] += 1
    else:
        speaker_b[sentiment_result.sentiment] += 1

print("Huberman: ", speaker_a)
print("Cavaliere:", speaker_b)
Huberman:  {'POSITIVE': 216, 'NEUTRAL': 460, 'NEGATIVE': 91}
Cavaliere: {'POSITIVE': 225, 'NEUTRAL': 646, 'NEGATIVE': 201}
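
To make the two speakers easier to compare, we can normalize these counts into percentages. A small sketch building on the dictionaries above:

def sentiment_shares(counts):
    # Convert raw sentence counts into percentages per sentiment
    total = sum(counts.values())
    return {sentiment: round(100 * count / total, 1) for sentiment, count in counts.items()}

print("Huberman: ", sentiment_shares(speaker_a))
print("Cavaliere:", sentiment_shares(speaker_b))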

Create custom summaries with LLMs

AssemblyAI also provides a framework to apply large language models (LLMs) to audio data. Typically, this would require several steps: storing transcripts, splitting the text so it fits into the model's context window, calculating embeddings for RAG, calling an LLM, and so on.

LeMUR is a framework that does all of this for you and is the easiest way I know of to apply LLMs to audio.
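
To give a sense of what LeMUR saves you, here is a minimal sketch of just the text-splitting step from that manual pipeline (plain Python; the chunk size is an arbitrary assumption, and the embedding and LLM calls are omitted):

def chunk_text(text, max_chars=100_000):
    # Naively split a long transcript into pieces that fit a model's context
    # window, breaking on sentence boundaries
    chunks, current = [], ""
    for sentence in text.split(". "):
        if len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence + ". "
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_text(transcript.text)
print(f"{len(chunks)} chunk(s) to summarize and combine")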

Here, I create a prompt to summarize the podcast in two different ways (as a TLDR and as bullet points) and let LeMUR do its magic. You can select different LLMs in LeMUR, so let's try the default one and the cheaper LeMUR Basic, which is a bit less accurate but still yields good results.

title = "Jeff Cavaliere: Optimize Your Exercise Program with Science-Based Tools | Huberman Lab Podcast #79"

prompt = f"""You are an expert journalist. I need you to read the transcript and summarize it for me.
This transcript comes from a podcast entitled
{title}"""


prompt_tldr = prompt + '\nYour response should be a TLDR summary, around 5 to 8 sentences long.'

result = transcript.lemur.task(prompt_tldr, final_model=aai.LemurModel.default)
print(result.response.strip())
Here is a 5 sentence TLDR summary of the transcript:

Jeff Cavaliere, a physical therapist and strength coach, joins Andrew Huberman to discuss optimizing exercise programs. They cover topics including workout splits, duration, incorporating cardio, and improving mind-muscle connection. Cavaliere provides science-based recommendations on these topics, emphasizing that consistency with a program one enjoys is most important for long-term results. He advocates full-body workouts for most people, with a focus on progressive overload and getting stronger over time. Cavaliere also recommends grip strength tests and jump rope as tools to monitor systemic recovery and improve movement mechanics.

prompt_bullets = prompt + '\nYour response should be in the form of 10 bullet points.'

result = transcript.lemur.task(prompt_bullets, final_model=aai.LemurModel.basic)
print(result.response.strip())
1. The podcast features an interview with Jeff Cavalier, a fitness and training expert, by Andrew Huberman, a professor of neurobiology and ophthalmology at Stanford.
2. The two discuss various aspects of exercise and training, including workout splits, duration, resistance versus cardiovascular emphasis, and sequencing of different training modalities.
3. Jeff emphasizes the importance of challenging muscles instead of just moving weights and using a "cramp test" to determine which muscles can be effectively trained.
4. They also discuss the benefits of high-intensity interval training over steady state cardio for time-crunched individuals.
5. The conversation shifts to different approaches to training, with Jeff focusing on blending different types of training instead of keeping them separate.
6. Jeff talks about how he developed the "cramp test" and the importance of mind-muscle connection for each exercise.
7. They discuss the use of grip strength as an indicator of systemic recovery, with a 10% drop in grip strength likely meaning the body needs more rest.
8. In closing, they talk about the importance of sleep positioning for injury prevention and improving mobility, posture, and back pain.
9. The speakers discuss different sleeping positions and their effects on the body, including the importance of proper stretching before bed to avoid muscle shortening during sleep.
10. The conversation then shifts to different types of stretching, including active and passive stretching, and the benefits of learning to land on the balls of feet during jumping rope.

The results are pretty cool, right?

Ask questions - Q&A chat using an LLM

Let's create a different prompt to ask some questions about the podcast episode.

Note that with any of these prompts, you don't have to add the transcript to the context yourself. That's all taken care of by LeMUR:

prompt = """Bases on the transcript, answer the following questions:

Why is sleep positioning important?
What sleep positions are good?
Do they mention helpful action items to improve sleep positioning? If so, list them.
"""


result = transcript.lemur.task(prompt, final_model=aai.LemurModel.basic)
print(result.response.strip())
Based on the transcript:

- Sleep positioning is important because certain positions can impact waking posture and movement. Specifically, sleeping on your stomach puts stress on the lumbar spine.

- Side sleeping is mentioned as a better option than sleeping on your stomach.

- Yes, they mention stretching before bed to establish muscle length as a helpful action item to improve sleep positioning. Doing dynamic stretching before exercise is also mentioned as a way to warm up muscles without disrupting length-tension relationships.
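
To turn this into an actual chat, you can call lemur.task in a loop. A minimal sketch with no conversation memory, reusing only the API calls shown above:

# Minimal interactive Q&A loop over the transcript
while True:
    question = input("Ask about the episode (or type 'quit'): ")
    if question.lower() == "quit":
        break
    result = transcript.lemur.task(
        f"Based on the transcript, answer the following question:\n{question}",
        final_model=aai.LemurModel.basic,
    )
    print(result.response.strip())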

And that's it! Now you can use the power of AI to generate insights, create summaries, and ask questions about your audio data.

Thanks to Ryan, who was kind enough to review this post, provide feedback, and fix typos 🤗

Hope you enjoyed the read!
