How to Transcribe a Video with 97% Accuracy Using Python
Jacob Naryan - Full-Stack Developer
Posted: Sat Aug 05 2023
Last updated: Wed Nov 22 2023
Transcribing videos can be a time-consuming task, especially if you have a lot of content to go through. Fortunately, you can use Python and some open-source libraries to automate the process and achieve high accuracy rates. In this tutorial, we’ll show you how to transcribe a video with 97% accuracy using just 15 lines of Python.
Prerequisites
Before we get started, you’ll need to have Python installed on your computer, as well as a few libraries that we’ll be using. To install the necessary libraries, run the following commands in your terminal:
pip install SpeechRecognition
pip install pydub
SpeechRecognition is a library that allows you to perform speech recognition on audio files, and pydub is a library that allows you to work with audio files in a variety of formats.
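One prerequisite that pip does not cover: pydub decodes compressed formats such as MP4 through an external ffmpeg (or libav) installation, so it needs to be available on your PATH. The snippet below is a minimal sketch of a pre-flight check. The helper name missing_tools is hypothetical and is not part of pydub or SpeechRecognition.

```python
import shutil


def missing_tools(tools=("ffmpeg",), which=shutil.which):
    """Return the names from `tools` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without
    depending on what is installed on the current machine.
    """
    return [name for name in tools if which(name) is None]
```

Calling missing_tools() before running the rest of the script returns an empty list when ffmpeg can be found, and ["ffmpeg"] otherwise.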
Transcribing the Video
The first step in transcribing a video is to extract the audio from the video file. For this tutorial, we’ll be using an MP4 file, but you can use other formats as well. The code to extract the audio and convert it to a WAV file is as follows:
import speech_recognition as sr
from pydub import AudioSegment
import os
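# Load the video file (pydub reads only its audio track)
video = AudioSegment.from_file("video.mp4", format="mp4")
# Convert to mono, 16 kHz, 16-bit samples
audio = video.set_channels(1).set_frame_rate(16000).set_sample_width(2)
# Export the result as a WAV file
audio.export("audio.wav", format="wav")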
In this code, we’re using the AudioSegment class from pydub to load the video file and extract the audio. We’re then setting the audio to mono, 16 kHz, and 16-bit, a plain PCM layout that the SpeechRecognition library can read from a WAV file. Finally, we’re exporting the audio as a WAV file.
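If you want to confirm that the export produced the layout described above, Python's built-in wave module can read the header back. This is a small sketch using only the standard library. The function name wav_properties is hypothetical.

```python
import wave


def wav_properties(path):
    """Read basic format information from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        frame_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate": frame_rate,
            "duration_seconds": wav.getnframes() / frame_rate,
        }
```

After the export step, wav_properties("audio.wav") should report 1 channel, a sample width of 2 bytes (16-bit), and a frame rate of 16000.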
Now that we have the audio file, we can use the SpeechRecognition library to transcribe it. The code to do this is as follows:
# Initialize recognizer class (for recognizing the speech)
r = sr.Recognizer()
# Open the audio file
with sr.AudioFile("audio.wav") as source:
    audio_text = r.record(source)
# Recognize the speech in the audio
text = r.recognize_google(audio_text, language='en-US')
In this code, we’re initializing the Recognizer class from SpeechRecognition and opening the audio file. We’re then using the record method to read the audio and store it in the audio_text variable. Finally, we’re using the recognize_google method to transcribe the audio and store the result in the text variable.
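The single recognize_google call above sends the whole recording in one request, which can fail or time out for longer videos. One possible workaround, sketched here as an assumption and not as part of the original script, is to read the WAV file in fixed-size pieces and transcribe each piece separately. The helper names chunk_durations and transcribe_in_chunks and the 30-second default are hypothetical. The duration argument of record and the UnknownValueError exception come from SpeechRecognition.

```python
def chunk_durations(total_seconds, chunk_seconds=30.0):
    """Split a recording length into consecutive chunk lengths (in seconds)."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    durations = []
    remaining = total_seconds
    while remaining > 0:
        step = min(chunk_seconds, remaining)
        durations.append(step)
        remaining -= step
    return durations


def transcribe_in_chunks(wav_path, chunk_seconds=30.0, language="en-US"):
    """Transcribe a WAV file piece by piece and join the results."""
    # Imported here so chunk_durations can be used without the library.
    import speech_recognition as sr

    r = sr.Recognizer()
    pieces = []
    with sr.AudioFile(wav_path) as source:
        # DURATION is set by AudioFile once the file has been opened.
        for duration in chunk_durations(source.DURATION, chunk_seconds):
            # Each record() call continues from where the previous one stopped.
            chunk = r.record(source, duration=duration)
            if not chunk.frame_data:
                break  # nothing left to read
            try:
                pieces.append(r.recognize_google(chunk, language=language))
            except sr.UnknownValueError:
                # No intelligible speech in this chunk; skip it.
                continue
    return " ".join(pieces)
```

Cutting at fixed intervals can split a word across two chunks, so treat this as a starting point. Network failures (sr.RequestError) are deliberately left to propagate so that they are not silently hidden.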
Saving the Transcript
The final step is to save the transcript to a file. The code to do this is as follows:
# Save the transcript to a text file
file_name = "transcription.txt"
with open(file_name, "w", encoding="utf-8") as file:
    # Write to the file
    file.write(text)
# Open the file for editing by the user (the start command is Windows-only)
os.system(f"start {file_name}")
In this code, we’re creating a new file named transcription.txt and writing the transcript to it. We’re then using the os library to open the file for editing by the user. The start command is specific to Windows; on macOS the equivalent is open, and on most Linux desktops it is xdg-open, so adjust that line for your operating system.
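If the script needs to run on more than one operating system, the opening step can be chosen at runtime. The following is a minimal sketch under that assumption. The function names opener_command and open_for_editing are hypothetical, and the Linux branch assumes a desktop environment that provides xdg-open.

```python
import os
import subprocess
import sys


def opener_command(path, platform=sys.platform):
    """Return the command used to open `path`, or None on Windows.

    On Windows there is no separate command to run; os.startfile is
    used instead (see open_for_editing).
    """
    if platform.startswith("win"):
        return None
    if platform == "darwin":
        return ["open", path]
    return ["xdg-open", path]


def open_for_editing(path):
    """Open `path` in the default application for the current OS."""
    command = opener_command(path)
    if command is None:
        os.startfile(path)  # only available on Windows
    else:
        subprocess.run(command, check=False)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.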
Conclusion
That’s it! With just 15 lines of Python, we’ve transcribed a video with 97% accuracy. Of course, the accuracy you get will depend on the recording itself, so for the best results use clear audio without a lot of layered sounds or background noise.
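An accuracy figure is easiest to check against a transcript you trust. One common way to quantify it, offered here as a sketch and not as the method used to obtain the number above, is word error rate: the word-level edit distance between a reference transcript and the generated one, divided by the number of reference words. The function name word_error_rate is hypothetical, and the comparison is deliberately simple (lowercased, split on whitespace, punctuation left in place).

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Classic dynamic-programming edit distance, one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Word accuracy is then 1 - word_error_rate(reference, text), where reference is a hand-corrected transcript of the same recording.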
Thank you for reading. If you liked this blog, check out my personal blog for more content like this.