Transcribe large audio files offline with Vosk
Transcription of large audio files for your next NLP project
Inspired by Natural Language Processing (NLP) projects that analyze Reddit data, I came up with the idea of using podcast data. However, since podcasts are (large) audio files, one needs to transcribe them to text first. This process is also called Automatic Speech Recognition (ASR) or Speech-to-Text (STT).
Providers like Google, Azure, or AWS offer excellent APIs for this task. But what if you want to do the transcription offline or, for some reason, you are not allowed to use cloud solutions?
tl;dr
- Vosk is a toolkit that allows you to transcribe audio files offline
- It supports over 20 languages and dialects
- Audio has to be converted to wave format (mono, 16 kHz) first
- Transcription of large audio files can be done by using buffering
- A Colab notebook can be found here
Goal
That's why I wrote this article to give you an overview of alternative solutions and how to use them.
The idea is to use packages or toolkits that offer pre-trained models so that we do not have to train the models ourselves first.
In this article I focus on Vosk. There are many more, like Mozilla's DeepSpeech or the SpeechRecognition package. However, the future of DeepSpeech is uncertain, and SpeechRecognition includes, in addition to online APIs, CMUSphinx and also supports Vosk.
I assume that the data we want to transcribe is not available on YouTube. If it is available, I highly recommend checking out the youtube-transcript-api package. It allows you to get the generated transcript for a given video, and the effort is much less than what we will do in the following.
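As a rough illustration, fetching such a transcript is only a few lines. This is a minimal sketch assuming the classic youtube-transcript-api interface and a placeholder video ID; newer versions of the package use an instance-based API, so the exact call may differ:

from youtube_transcript_api import YouTubeTranscriptApi

# "VIDEO_ID" is the ID from the video's URL (the part after "v=")
transcript = YouTubeTranscriptApi.get_transcript("VIDEO_ID")
text = " ".join(entry["text"] for entry in transcript)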
Prerequisite: Bringing the data into the right format
Before we come to the transcription part, we first have to bring our data into the right format. Podcasts or other (long) audio files are usually in mp3 format. However, this is not the format the packages or toolkits can work with.
To be more specific, we need to convert our (mp3) audio into:
- Wave format (.wav)
- Mono
- 16,000 Hz sample rate
The conversion is pretty straightforward. First we have to install ffmpeg, which can be found at https://ffmpeg.org/download.html.
Mac users can use brew to download and install it:
brew install ffmpeg
Next we install the pydub package:
pip install pydub
The following code snippet converts an mp3 into the needed wav format. It stores the output in the same directory as the given mp3 input file and returns its path. In case we want to skip some seconds (e.g., the intro), we can use the skip parameter by setting the number of seconds we want to skip. If we want to try things out first, we can set the excerpt parameter to True to get only the first 30 seconds of the audio file.
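A minimal sketch of such a function using pydub could look like this; the slicing logic and the output file naming are my assumptions, based on the description above and the example call further below:

import os
from pydub import AudioSegment  # requires ffmpeg to be installed

def mp3_to_wav(mp3_path, skip=0, excerpt=False):
    """Convert an mp3 file to mono, 16,000 Hz wav and return the new file's path."""
    sound = AudioSegment.from_mp3(mp3_path)

    if skip > 0:
        sound = sound[skip * 1000:]  # pydub slices audio in milliseconds

    base, _ = os.path.splitext(mp3_path)
    if excerpt:
        sound = sound[:30 * 1000]    # keep only the first 30 seconds
        wav_path = base + "_excerpt.wav"
    else:
        wav_path = base + ".wav"

    sound = sound.set_channels(1).set_frame_rate(16000)  # mono, 16,000 Hz
    sound.export(wav_path, format="wav")
    return wav_path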
With this function we can now convert our podcast file to the needed wav format.
To have an (interactive) example, I chose to transcribe the following podcast episode:
Please note: The podcast was a random choice. I do not have any connections with the creators, nor do I get paid for naming them.
Since the first 37 seconds are an intro, we can skip them using the skip parameter.
For a first example we will also set the parameter excerpt to True:
mp3_to_wav('opto_sessions_ep_69.mp3', 37, True)
Our new file opto_sessions_ep_69_excerpt.wav is now 30 seconds long, covering 0:37 to 1:07.
Now we can start with the transcription!
Vosk
Vosk is a speech recognition toolkit that supports over 20 languages (e.g., English, German, Hindi, etc.) and dialects. It works offline and even on lightweight devices like a Raspberry Pi.
Its portable models are just 50 MB each. However, there are much bigger models available. A list of all available models can be found here: https://alphacephei.com/vosk/models
Vosk can be easily installed by calling:
pip install vosk
After Vosk is installed, we have to download a pre-trained model. I decided to go with one of the largest ones: vosk-model-en-us-0.22
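If you prefer to stay inside Python, fetching and unpacking the model can be scripted as well. The archive URL below is my assumption; the exact links are listed on the models page mentioned above, and note that this model is large:

import urllib.request
import zipfile

# Assumed archive URL; check https://alphacephei.com/vosk/models for the exact link
model_url = "https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip"
urllib.request.urlretrieve(model_url, "vosk-model-en-us-0.22.zip")

with zipfile.ZipFile("vosk-model-en-us-0.22.zip") as zf:
    zf.extractall(".")  # unpacks into a folder named vosk-model-en-us-0.22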
Now that we have everything we need, let us open our wave file and load our model.
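A minimal sketch of this step, assuming the Vosk Python API and the file names used above; the SetWords(True) call is optional and is explained below:

import wave
from vosk import Model, KaldiRecognizer

wf = wave.open("opto_sessions_ep_69_excerpt.wav", "rb")  # mono, 16,000 Hz wav from above
model = Model("vosk-model-en-us-0.22")                    # path to the unpacked model folder
rec = KaldiRecognizer(model, wf.getframerate())           # the recognizer needs the sample rate
rec.SetWords(True)                                        # also return per-word confidence and timing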
Before we dive into the transcription process, we have to get familiar with Vosk's output.
Vosk returns the transcription in JSON format like:
{
"text" : "cats are dangerous"
}
If we are also interested in how confident Vosk is with each word and also want to get the time of each word, we can make use of SetWords(True). The result for one word would look like this, for example:
{
  "result" : [
    {
      "conf" : 0.953349,
      "end" : 6.090000,
      "start" : 5.700000,
      "word" : "cats"
    },
    {
      etc.
    },
    etc.
  ],
  "text" : "cats are dangerous"
}
Since we want to transcribe large audio files, it makes sense to use a buffering approach by transcribing the wave file chunk by chunk. The following code shows the transcription approach:
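This is a minimal sketch of such a chunk-by-chunk loop, reusing the wf and rec objects from above; the buffer size of 4,000 frames and the variable names follow the description below and are not necessarily the author's exact code:

import json

transcription = []

while True:
    data = wf.readframes(4000)        # read the next chunk of frames
    if len(data) == 0:                # no frames left: end of file reached
        break
    if rec.AcceptWaveform(data):      # True when a full utterance has been recognized
        result_dict = json.loads(rec.Result())
        transcription.append(result_dict.get("text", ""))

# grab the final buffered result and flush the whole pipeline
final_dict = json.loads(rec.FinalResult())
transcription.append(final_dict.get("text", ""))

print(" ".join(transcription))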
We read in the first 4,000 frames and hand them over to our loaded model. The model returns the result in JSON format, which is stored as a dict in result_dict. We then extract the text value only and append it to our transcription list.
If there are no more frames to read, the loop stops and we grab the final result by calling the FinalResult() method. This method also flushes the whole pipeline.
The result should look like this:
to success on today show i'm delighted to introduce beth kinda like a technology analyst with over a decade of experience in the private markets she's now the cofounder of io fund which specializes in helping individuals gain a competitive advantage when investing in tech growth stocks how does beth do this well she's gained hands on experience over the years was i were working for or analyzing a huge amount of relevant tech companies in silicon valley the involved in the market
Note: If you are interested in a more "fancy" solution (using a progress bar), you can find my code here.
Other packages or toolkits for offline transcription
As mentioned in the introduction, there are many more packages or toolkits available. However, their implementation is not as easy as with Vosk. But if you are interested, I can recommend NVIDIA's NeMo.
NVIDIA NeMo
NeMo is a toolkit built for researchers working on automatic speech recognition, natural language processing, and text-to-speech synthesis. Like Vosk, we can also choose from a bunch of pre-trained models, which can be found here.
The implementation needs more time and code. Based on Somshubra Majumdar's notebook I created a compact version that can be found here.
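To give a rough idea, a minimal sketch of loading a pre-trained NeMo model and transcribing a wav file might look like this; the model name is just one example from the pre-trained list, and the transcribe keyword argument can differ between NeMo versions:

import nemo.collections.asr as nemo_asr

# Load a pre-trained English ASR model (example name; see NeMo's model list)
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# Transcribe one or more (mono, 16 kHz) wav files
transcripts = asr_model.transcribe(paths2audio_files=["opto_sessions_ep_69_excerpt.wav"])
print(transcripts[0])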
Conclusion
Vosk is a great toolkit for offline transcription. Compared to other offline solutions I tested, Vosk was the easiest to implement. The only little thing that is missing is punctuation. So far, there are no plans to integrate it. However, in the meantime, external tools can be used for this if needed.
Source: https://towardsdatascience.com/transcribe-large-audio-files-offline-with-vosk-a77ee8f7aa28