Tools for automated transcription of audio and video fragments
Speech recognition, also called speech-to-text (STT) or automatic speech recognition (ASR), is a technology that converts speech in audio or video into written text, as in the automatic subtitles on YouTube or Zoom. In this overview we focus on transcribing interviews for oral history, but virtual assistants such as Siri or Google Assistant also use this technology.
Speech recognition is a relatively old technology. The first commercial tools appeared in the early 1990s. These tools rely on models: systems that are trained on a certain set of data to recognize patterns and make decisions without human intervention. Speech recognition models are trained on audio such as interviews, audiobooks, lectures and presentations. The quality of a speech recognition tool depends heavily on the model it uses.
Possibilities of speech recognition tools
Research by meemoo in 2020-2021 showed that speech recognition technology performed less well on conversational audio. Since then the technology has improved considerably, thanks in part to greater computing power, progress in machine learning and big data, and better language models. As a result, the tools generate more natural language and fewer nonsensical passages, which has greatly improved the transcription of conversations.
In addition, speech recognition tools can do more than just transcribe text. They can also:
- recognize different speakers and indicate which text was spoken by which speaker
- indicate and remove filler words such as "uhm" from the text
- indicate silences
- create summaries and translations
- ...
Several tools also let you add a custom dictionary, in which you can list specialized words that would otherwise be transcribed incorrectly.
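For tools without a dictionary feature, a comparable correction can be approximated after transcription. The sketch below is not how the tools in this overview implement their dictionaries; it simply replaces known mis-transcriptions in the finished text. The function name `apply_dictionary` and the example entries are hypothetical.

```python
import re

def apply_dictionary(text: str, corrections: dict[str, str]) -> str:
    """Replace known mis-transcriptions with the intended term.

    Matching is case-insensitive and limited to whole words or phrases,
    so a short entry does not alter longer words that contain it.
    """
    for wrong, right in corrections.items():
        pattern = re.compile(r"(?<!\w)" + re.escape(wrong) + r"(?!\w)", re.IGNORECASE)
        # A function is used as replacement so that `right` is inserted literally.
        text = pattern.sub(lambda _match, right=right: right, text)
    return text

# Hypothetical entries: how a tool wrote the term -> how it should be written.
corrections = {"me moo": "meemoo", "no scribe": "noScribe"}
print(apply_dictionary("We tested No Scribe at me moo.", corrections))
```

A correction list like this works best for names and jargon that a tool gets wrong in the same way every time.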
Points of attention when using the tools
Taking a number of factors into account helps you achieve better results:
- The strength of a speech recognition tool depends on the model used: which languages does the model support, and how well can it handle dialects or colloquial language? The choice of model depends on your needs and use cases, such as the trade-off between speed and accuracy, and on the language that must be transcribed. For this overview we prioritized accuracy (it is more important that the text is correct than that the tool is fast) and used Dutch as the spoken language. All tools were also tested on interviews in which slight dialect or intermediate language was spoken. The most accurate tools scored well on these, but in general interviews with pronounced dialects resulted in poorer transcriptions.
- Another important point of attention is the quality of the recording. Recordings with clear sound and no background noise give better results than recordings with poor sound quality (e.g. noise) or background noise. Speakers who speak clearly are also transcribed better than speakers who mumble.
- Desktop applications use the computing power of your own computer to transcribe. If your computer has a dedicated GPU, transcription is much faster. A dedicated GPU sits on its own card connected to the motherboard, while an integrated GPU is embedded in the same chip as the CPU. This article (in Dutch) explains more about the different GPUs. If your computer has no dedicated GPU, the CPU (processor) is used, which is slower. You can also use an online service so that you are not restricted by the limits of your own computer.
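Before starting a long transcription with a desktop application, you can check whether your computer has a dedicated GPU. This is a minimal sketch that assumes an NVIDIA card whose driver installs the `nvidia-smi` utility; GPUs from other manufacturers are not detected, and the function name `list_nvidia_gpus` is hypothetical.

```python
import shutil
import subprocess

def list_nvidia_gpus() -> list[str]:
    """Return the GPUs reported by `nvidia-smi -L`, or an empty list.

    Assumption: only NVIDIA cards are checked. An empty list means no such
    card was found, not that the computer has no dedicated GPU at all.
    """
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return []
    try:
        result = subprocess.run(
            [executable, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpus = list_nvidia_gpus()
if gpus:
    print("Dedicated NVIDIA GPU found:", "; ".join(gpus))
else:
    print("No NVIDIA GPU detected; transcription will probably run on the CPU.")
```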
Overview of the different tools
At the request of heritage organisations for support in writing out interviews, a number of tools were tested. Most tools can do more than just transcribe, but those functions were not investigated further.
Amberscript
Amberscript is a commercial web platform that allows you to create transcriptions and subtitles for audio and video. To transcribe audio or video, you must upload the fragments to its web environment. Which model it uses is not documented.
It supports:
- different languages, including Dutch;
- speaker identification;
- and timestamps.
Advantages:
- Editing transcriptions is very clear.
- Possibility to add a dictionary.
Disadvantages:
- Speaker identification is inaccurate when there are pauses in the audio fragment.
- Pauses are not indicated.
- Accuracy for Dutch is poor. The transcribed text contains quite a few errors.
Amberscript is a paid service. With prepaid credit you pay €15/hour and can purchase a maximum of one hundred hours at a time. It is also possible to take out a subscription for five hours of uploaded audio or video per month: this costs €40/month if you take a yearly subscription, or €50/month otherwise. Unused hours cannot be carried over to another month.
Audapolis
Audapolis is an open source and offline desktop application that uses the Vosk model. Vosk is a relatively small, but also older model that is mainly used for chatbots, smart home applications and virtual assistants. It was originally developed for smaller devices such as smartphones and microcomputers.
The tool can transcribe twenty languages, including Dutch, identifies which text was spoken by which speaker and also indicates timestamps. Because it works offline, you can also use it on the train when traveling home after an interview, for example.
Advantages:
- free and open source application;
- works offline;
- includes a built-in editor to improve the transcribed text;
- detects and removes uhms and pauses.
Disadvantages:
- does not work well with accents or (slight) dialects;
- accuracy of transcribed text and speaker identification is low;
- text can only be exported in HTML format;
- tool hasn't been updated for a year.
Limecraft
Limecraft is an online platform for managing audio and video that offers transcription of that content as an extra feature. It uses six models for this: Vocapia, Speechmatics, Google Speech, Microsoft Azure, Scriptix and Kaldi. As a user you cannot choose which of those models to use. The platform makes that choice for you based on the purpose and the language you choose. Because so many models are used, it can transcribe more than one hundred languages, including Dutch. Just like the previous tools, it can identify speakers and indicate timestamps. On the platform, multiple people can collaborate on a transcription, like working together on a document in SharePoint or Google Drive.
Advantages:
- user-friendly interface with simple editing possibilities;
- fast;
- extensive export options (.pdf, .csv, .doc);
- possibility to add your own dictionary;
- has extra features, such as creating subtitles, topic detection and making summaries;
- it is a Belgian company, which means support and contact are in Dutch and in the same time zone.
Disadvantages:
- Uhms are not recognized well and silences are not indicated.
- The distinction between speakers is not always good, but you can edit this.
- It produces strange output for words that it does not know.
- Transcription is an extra feature, so the platform is quite expensive and has many functions you do not need if you only want to transcribe.
Prices range from free (one user with 5h of material) through €85/month (five users with 25h of material) to €275/month (for larger teams). To have audio or video transcribed, you pay an extra €15/hour. Limecraft can also translate the transcription, which also costs €15/hour.
Sonix
Sonix is also a commercial web platform where you can collaborate on transcriptions. It offers extensive possibilities to edit transcriptions, adjust timestamps, etc. It can transcribe more than 49 languages, including Dutch, recognizes the different speakers well and indicates timestamps. Finally, it uses color codes to indicate how confident it is about each part of the transcription.
Advantages:
- user-friendly interface with extensive and simple editing possibilities;
- transcribes quickly;
- possibility to add your own dictionary for specific words;
- extensive export options;
- has an extra (paid) feature to create summaries of transcriptions.
Disadvantages:
- Uhms are not recognized well.
- It compresses the original media files when using the cheapest tariff plan, so you can no longer export the original media files.
- With the cheapest tariff plan you only get support by email.
Sonix has different subscription models:
- Standard: pay-as-you-go, where you pay $10 per hour of audio or video.
- Premium: for organisations that regularly want to transcribe audio and video and need more collaboration possibilities. For this you pay $5 per hour of audio or video plus $22 per user per month.
- Enterprise: for high volumes of transcription with extensive collaboration options and content analysis.
You can test Sonix free of charge with 30 minutes of audio or video.
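Which plan is cheaper depends on how much you transcribe per month. The sketch below compares the two priced plans using only the rates listed above; it ignores the Enterprise plan, taxes and any discounts, and the function names are illustrative.

```python
def standard_cost(hours: float) -> float:
    """Standard plan: $10 per hour of audio or video, no monthly fee."""
    return 10.0 * hours

def premium_cost(hours: float, users: int = 1) -> float:
    """Premium plan: $5 per hour plus $22 per user per month."""
    return 5.0 * hours + 22.0 * users

def break_even_hours(users: int = 1) -> float:
    """Monthly hours at which both plans cost the same.

    10 * h = 5 * h + 22 * users  ->  h = 22 * users / 5
    """
    return 22.0 * users / (10.0 - 5.0)

for hours in (2, 4.4, 10):
    print(f"{hours} h/month: Standard ${standard_cost(hours):.2f}, "
          f"Premium ${premium_cost(hours):.2f}")
print(f"Break-even for one user: {break_even_hours():.1f} h/month")
```

Under these assumptions, Premium only becomes the cheaper option for a single user above 4.4 hours of material per month.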
Speechmatics
Speechmatics is a company that has developed its own closed speech recognition model and offers paid APIs and a platform for transcribing and translating audio and video. It aims to compete with large companies such as Google, Amazon and Microsoft, and according to tests its model scores better than those of these tech companies. It can be used both for recorded media and for real-time audio and video. The software can transcribe 52 languages, including Dutch, identify speakers and indicate timestamps. Speechmatics focuses solely on transcription, so media files and their transcriptions are stored on the platform for only one week. This helps make it one of the cheapest speech-to-text providers.
Advantages:
- very accurate;
- removes uhms;
- platform focuses solely on transcription, so you don’t have to pay for unnecessary bells and whistles;
- exports to plain text (.txt), SRT (for subtitles) and JSON.
Disadvantages:
- hallucinates on terms it doesn’t know;
- does not indicate silences;
- no timestamps when exporting to plain text or when using the copy function;
- media files and transcriptions are only stored for one week on the web platform;
- the web platform has difficulty uploading video.
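If you need timestamps in a plain-text transcript, one workaround is to rebuild them from a structured export. The sketch below does not use the real Speechmatics JSON layout; it assumes a simplified, hypothetical list of segments, each with a start time in seconds, a speaker label and the spoken text.

```python
def format_timestamp(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def to_timestamped_text(segments: list[dict]) -> str:
    """Render segments as one '[HH:MM:SS] speaker: text' line each."""
    lines = []
    for segment in segments:
        stamp = format_timestamp(segment["start"])
        lines.append(f"[{stamp}] {segment['speaker']}: {segment['text']}")
    return "\n".join(lines)

# Hypothetical segments; a real export has to be converted to this shape first.
segments = [
    {"start": 0.0, "speaker": "S1", "text": "When did you start working at the harbour?"},
    {"start": 4.2, "speaker": "S2", "text": "In 1962, I was sixteen."},
    {"start": 3725.9, "speaker": "S1", "text": "And when did you retire?"},
]
print(to_timestamped_text(segments))
```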
Speechmatics does not have a subscription formula. Every month you can have 4h of uploaded audio or video transcribed for free (and also 4h of real-time audio and video). Beyond that, you pay per hour, and the price depends on the desired accuracy of the transcribed text: $0.80/hour for standard accuracy and $1.04/hour for enhanced accuracy, the most accurate model.
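As a worked example of these prices, the sketch below estimates a monthly bill for uploaded (not real-time) material. It assumes that the four free hours are simply deducted from the month's total before the hourly rate applies, which the description above suggests but does not spell out; the function name `monthly_cost` is hypothetical.

```python
FREE_HOURS_PER_MONTH = 4.0
RATE_PER_HOUR = {"standard": 0.80, "enhanced": 1.04}  # US dollars

def monthly_cost(hours: float, accuracy: str = "enhanced") -> float:
    """Estimated cost in dollars for `hours` of uploaded audio or video."""
    if accuracy not in RATE_PER_HOUR:
        raise ValueError(f"unknown accuracy level: {accuracy!r}")
    billable_hours = max(0.0, hours - FREE_HOURS_PER_MONTH)
    return round(billable_hours * RATE_PER_HOUR[accuracy], 2)

print(monthly_cost(3))                # within the free allowance
print(monthly_cost(20))               # 16 billable hours, enhanced accuracy
print(monthly_cost(20, "standard"))   # 16 billable hours, standard accuracy
```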
noScribe
noScribe is a free, open source tool for transcribing audio and video. It is an offline desktop application that uses the Whisper model from OpenAI, the company that also developed ChatGPT (for more information about Whisper, see below). noScribe can transcribe more than 99 languages, including Dutch, identifies speakers and provides timestamps. It does not yet use the most recent Whisper model, which is also the most accurate for Dutch, because that version scores less well in some other languages.
Advantages:
- free and open source;
- very accurate; hallucinates less and remains consistent with terms it doesn’t know;
- has editor software to improve transcriptions;
- can export to HTML, plain text (.txt) and VTT (for subtitles);
- can use a dedicated GPU to make the transcription faster.
Disadvantages:
- speed of transcription depends on your own computer;
- can hallucinate on silence, although we couldn't confirm this in practice;
- multilingual audio (e.g., an interview where different languages are spoken) is not supported;
- sometimes makes mistakes when recognizing speakers.
Read the manual of noScribe here.
Whisper
Whisper is a speech recognition model developed by OpenAI, which was first released as open source software in 2022. It can be used to transcribe different languages and to translate different languages into English. It's built into various speech recognition tools, such as noScribe, but can also be used as a command line tool. Whisper can transcribe one hundred languages, including Dutch, and indicates timestamps. Tests on two datasets also show that the latest version of Whisper scores very well on Dutch.
Advantages:
- open source and free
- very accurate; the command-line tool uses the most recent model, which is also the most accurate for Dutch
- exports are possible in plain text (.txt), SRT (subtitles), VTT (subtitles), TSV (a tabular format similar to CSV) and JSON
Disadvantages:
- speed depends on your own computer, especially if you don't have a dedicated GPU (see noScribe), which can make the transcription very slow (but still faster than if you did it yourself)
- does not indicate silences
- Whisper can hallucinate on silence, but it is possible to adjust this via the command line
- multilingual audio is not supported
- no intuitive graphical interface (GUI); only usable via the command line.
- no environment where you can improve the transcription.
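Because Whisper has no graphical interface, a small wrapper script can make repeated use less error-prone. The following is a minimal sketch, assuming the `openai-whisper` Python package is installed so that the `whisper` command is available; the helper name `build_whisper_command`, the file name and the output directory are hypothetical. The script builds the command and only runs it when both the tool and the audio file are present.

```python
import shutil
import subprocess
from pathlib import Path

def build_whisper_command(
    audio_path: str,
    language: str = "nl",
    model: str = "large",
    output_format: str = "srt",
    output_dir: str = "transcripts",
) -> list[str]:
    """Build the argument list for the `whisper` command-line tool."""
    return [
        "whisper",
        audio_path,
        "--model", model,
        "--language", language,
        "--output_format", output_format,
        "--output_dir", output_dir,
    ]

audio_file = "interview.mp3"  # hypothetical file name
command = build_whisper_command(audio_file)
print(" ".join(command))

# Only start the transcription when the tool and the file actually exist.
if shutil.which("whisper") is not None and Path(audio_file).exists():
    subprocess.run(command, check=False)
else:
    print("Skipping transcription: whisper or the audio file was not found.")
```

Changing `output_format` to `txt`, `vtt`, `tsv` or `json` selects one of the other export formats listed above.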
Conclusion
Depending on your needs, there are different tools that you can use to automatically transcribe audio and video fragments. To make it easier to choose, you can use the table below. The table indicates, among other things, which tools fully (indicated with X) or partially (indicated with /) support certain features, and their price category: €0 means free, € stands for a price of less than €5/hour, €€ for a price between €5/hour and €15/hour, and €€€ for a price higher than €15/hour.
| | Amberscript | Audapolis | Limecraft | Sonix | Speechmatics | noScribe | Whisper |
|---|---|---|---|---|---|---|---|
| Supports Dutch | X | X | X | X | X | X | X |
| Accuracy | | | X | X | X | X | X |
| User-friendly | X | X | X | X | X | X | |
| Speed | X | X | X | X | | | |
| Possibility to improve transcription | X | X | X | X | | | |
| Possibility of collaboration on transcription | | | X | X | | | |
| Identifies speakers | X | X | X | X | X | X | |
| Detects filler words | X | / | X | X | X | | |
| Indicates pauses | X | X | X | / | X | X | |
| Possibility to add own dictionary | X | X | | | | | |
| Export formats | .csv, .doc, .json, .rtf, .srt, .stl, .txt, .vtt | .html | .docx, .pdf, .srt, .txt, .vtt | .csv, .doc, .pdf | .json, .srt, .txt | .html, .txt, .vtt | .csv, .json, .srt, .tsv, .txt, .vtt |
| Open-source | | X | | | | X | X |
| Cloud service | X | | X | X | X | | |
| Price | €€ | €0 | €€€ | €€ | € | €0 | €0 |
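The price categories used in the table can also be written down as a simple rule. This is a small sketch, assuming prices are expressed in euros per hour and that a price of exactly €5/hour or €15/hour falls in the €€ category (the description of the categories leaves these boundaries open); the function name `price_category` is hypothetical.

```python
def price_category(euros_per_hour: float) -> str:
    """Map an hourly price to the category symbols used in the table."""
    if euros_per_hour < 0:
        raise ValueError("price cannot be negative")
    if euros_per_hour == 0:
        return "€0"    # free
    if euros_per_hour < 5:
        return "€"     # less than €5/hour
    if euros_per_hour <= 15:
        return "€€"    # €5/hour up to and including €15/hour (assumed)
    return "€€€"       # more than €15/hour

for price in (0, 1.04, 10, 15, 20):
    print(f"€{price}/hour -> {price_category(price)}")
```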
Authors: Nastasia Vanderperren (meemoo, Flemish Institute for Archives) and Lode Scheers (meemoo, Flemish Institute for Archives)