In a rapidly evolving media consumption landscape, the demand for accurate, high-quality subtitles has never been greater: subtitles play a crucial role in making media more inclusive and understandable to diverse audiences. As the volume of content grows, so does the need for automated subtitle generation systems. Ensuring their quality is essential, and this is where the SubER (Subtitle Edit Rate) metric comes in.
Subtitles are more than just text on the screen; they are a bridge between the spoken word and the viewer’s understanding. High-quality subtitles accurately reflect the spoken content, preserving its original intent and tone of voice, while matching the timing of the speech and letting viewers read and comprehend them quickly enough to keep their attention on, and stay immersed in, the action on the screen. Errors in subtitle quality—such as mistranslations, timing issues, or readability problems—can lead to misunderstandings, reduced viewer engagement, and even frustration.
Traditional metrics for automatically evaluating subtitle outputs generated by Automatic Speech Recognition (ASR) and Machine Translation (MT) systems have often relied on surface-level comparisons between the generated subtitles and a reference set, focusing on word-level accuracy: examples include the WER (Word Error Rate) metric used to assess ASR output, and the BLEU (BiLingual Evaluation Understudy) and TER (Translation Edit Rate) metrics used to assess MT quality. However, subtitles differ from general text in several ways, requiring a metric that considers the unique challenges of subtitle generation, such as synchronization with the audio, readability, and the preservation of the orality of the dialogue and the on-screen context. Even when the subtitle text is accurate, poor timing or awkward line breaks can disrupt the viewing experience.
SubER, or Subtitle Edit Rate, is a novel metric designed to provide a more holistic assessment of subtitle quality by looking at the text as well as the timing and presentation of the subtitles. It is an edit distance metric: it focuses on the differences between subtitles produced by an ASR or MT system and a reference subtitle set.
It calculates the number of edits required to transform the machine-generated subtitles into the professionally created reference set. These edits include substitutions, deletions, insertions, and shifts, similar to how traditional metrics like TER operate in the realm of machine translation. Its innovation lies in the fact that it also accounts for the edits required to satisfy the structural and timing constraints of subtitling, which are crucial for a smooth viewing experience.
The evaluation of subtitles using SubER works as follows. The generated subtitles are first aligned to the reference via their time codes; SubER then computes the number of edits—substitutions, insertions, deletions, and shifts—needed to make the generated subtitles identical to the reference set. The more edits required, the lower the quality of the generated subtitles. TER is used as the basis for this step, extended by break edit operations (line breaks and subtitle breaks) and time-alignment constraints.
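To make this concrete, below is a minimal, illustrative Python sketch of a SubER-style computation. It is not the official implementation: shifts and the time-alignment constraints are omitted for brevity, so it reduces to a plain edit rate over word tokens plus end-of-line (<eol>) and end-of-block (<eob>) break tokens.

```python
# Minimal, illustrative sketch of a SubER-style score. NOT the official
# implementation: shifts and time-alignment constraints are omitted, so this
# reduces to a Levenshtein edit rate over words plus break tokens.

def tokenize(subtitles):
    """Flatten subtitles (each a list of lines) into word and break tokens.

    Line boundaries become <eol> tokens and subtitle boundaries become <eob>
    tokens, so misplaced breaks are counted as edits just like wrong words.
    """
    tokens = []
    for lines in subtitles:
        for i, line in enumerate(lines):
            tokens.extend(line.split())
            tokens.append("<eol>" if i < len(lines) - 1 else "<eob>")
    return tokens

def edit_distance(hyp, ref):
    """Levenshtein distance (substitutions, insertions, deletions only)."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # drop hyp token
                                   d[j - 1] + 1,     # insert ref token
                                   prev + (h != r))  # substitute / match
    return d[-1]

def simplified_suber(hyp_subtitles, ref_subtitles):
    hyp, ref = tokenize(hyp_subtitles), tokenize(ref_subtitles)
    return 100.0 * edit_distance(hyp, ref) / len(ref)

# The hypothesis has the right words but breaks the line after the wrong one,
# which costs two edits: 2 / 6 reference tokens = 33.3%.
hyp = [["Hello there", "my friend"]]
ref = [["Hello there my", "friend"]]
print(f"simplified SubER: {simplified_suber(hyp, ref):.1f}%")
```

The real metric additionally allows word-block shifts, as in TER, and only lets hypothesis words match reference words in subtitles whose on-screen time intervals overlap, which is what ties the score to timing quality.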
The raw edit count is normalized by the length of the reference subtitle sequence, resulting in a score that reflects the proportion of edits relative to the overall subtitle length. This score, SubER, represents the subtitle's accuracy, with lower scores indicating higher quality.
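In the notation of the 2022 paper, this works out to:

SubER = (# word edits + # break edits + # shifts) / (# reference words + # reference breaks) × 100%

A perfect hypothesis therefore scores 0%, and, as with any edit rate, scores above 100% are possible when the hypothesis requires more edits than the reference has tokens.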
SubER is a flexible metric that can be used for any language. By focusing on the number of edits needed to correct a subtitle, it indirectly captures the user experience: fewer edits indicate output that is closer to what a professional subtitler would produce, and hence a more seamless and enjoyable viewing experience.
SubER can be used in various contexts:
Benchmarking ASR and MT systems: SubER helps developers and researchers evaluate and compare the performance of different ASR and MT systems in generating high-quality subtitles, providing an objective way to measure progress. The code to calculate the SubER metric has been released as part of an open-source subtitle evaluation toolkit (a brief usage sketch appears after these examples) to encourage its use in the research community as well as the media industry, and to promote further research in automatic subtitling. SubER is already used as the primary metric in the subtitling track of the IWSLT conference to evaluate the overall quality of automatically generated subtitles.
Quality assurance: Content creators and distributors using automated subtitles can include SubER in their quality assurance pipelines to monitor subtitle quality and ensure that subtitles meet the required standards. As the metric was developed with input from leading industry experts and language technology scientists, it has the potential to become a standard in the field of automatic subtitle evaluation.
Expert-in-the-loop workflows: In workflows where professional subtitlers post-edit machine-generated subtitles, SubER can indicate the level of editing (and, to an extent, the effort) involved in the process and help identify areas that require improvement. It can also help streamline post-editing workflows by identifying content types that are better suited to such workflows.
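For those who want to try the metric, the open-source toolkit mentioned above is distributed as a Python package. The package name, command, and flags shown below are assumptions based on the toolkit's public README and may have changed; consult the repository for the current interface.

```
# Hypothetical invocation; verify names and flags against the toolkit's README.
pip install subtitle-edit-rate
suber -H hypothesis.srt -R reference.srt
```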
SubER represents a significant advancement in the evaluation of automatic subtitle quality. By focusing on the edits needed to align machine-generated subtitles with a professionally created reference set, it provides a more nuanced and user-focused measure of subtitle quality than traditional automated metrics. Its open-source availability and its treatment of the unique challenges of subtitle generation, including timing and text segmentation, make SubER a strong candidate for becoming an industry standard and an essential tool for developers, researchers, and the subtitling industry alike. As adoption increases, it could pave the way for further innovation in automatic subtitle generation, ultimately leading to a better viewing experience for audiences around the world.
The development of the SubER metric was made possible through the collaboration of experts at AppTek and Athena Consultancy, and its evaluation was supported by English and Spanish subtitlers who contributed their expertise. Its development is described in detail in the corresponding paper presented at the IWSLT 2022 conference.
AppTek.ai is a global leader in artificial intelligence (AI) and machine learning (ML) technologies for automatic speech recognition (ASR), neural machine translation (NMT), natural language processing/understanding (NLP/U), large language models (LLMs) and text-to-speech (TTS) technologies. The AppTek platform delivers industry-leading solutions for organizations across a breadth of global markets such as media and entertainment, call centers, government, enterprise business, and more. Built by scientists and research engineers who are recognized among the best in the world, AppTek’s solutions cover a wide array of languages/dialects, channels, domains and demographics.