A machine translation system, 'Shaheen', developed by the Arabic Language Team at Qatar Computing Research Institute (QCRI), has reached a significant milestone: 1bn translated words.
“While statistical approaches were more dominant in the beginning, in the last few years, technology advancements have shifted toward Deep Learning methods, and we sought to apply that as we created Shaheen,” said Dr Hassan Sajjad, a senior scientist at QCRI, part of Hamad Bin Khalifa University.
Initially the team developed a state-of-the-art machine translation system for the conversion of Modern Standard Arabic to and from English.
“With the advent of social media, dialectal Arabic became a de facto language for communication and especially for informal conversations, such as those we see on Twitter and Facebook. Translation systems that are optimised for Modern Standard Arabic cannot work well with dialects. In the current phase of the project, we have achieved a major milestone by developing an Arabic translation system that can translate most of the dialects, as well as standard Arabic, to English, effectively,” Dr Sajjad said.
Shaheen uses a transformer-based sequence-to-sequence model with hierarchical fine-tuning to adapt a Modern Standard Arabic-English translation system to dialectal Arabic translation.
“This hierarchical fine-tuning enables the successful adaptation of a general translation system towards learning various variations of a language in a single system which are different varieties of dialects and their genre in our case,” Dr Sajjad said.
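As a rough illustration of the idea, and not a description of QCRI's actual code, hierarchical fine-tuning can be pictured as training one model through a sequence of stages that move from general data to more specific data. The minimal Python sketch below makes several assumptions: the stage names, learning rates and step counts are hypothetical, the "model" is a single number, and each list of numbers stands in for a parallel corpus.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Stage:
    name: str               # hypothetical label for this level of the hierarchy
    data: Sequence[float]   # toy stand-in for a parallel corpus
    learning_rate: float    # assumed to shrink as the data gets more specific
    steps: int              # number of passes over the stage data


def sgd_fit(weight: float, stage: Stage) -> float:
    """Toy trainer: nudges one weight toward each example in the stage data."""
    for _ in range(stage.steps):
        for x in stage.data:
            # Gradient step on the squared error 0.5 * (weight - x) ** 2
            weight -= stage.learning_rate * (weight - x)
    return weight


def hierarchical_finetune(
    weight: float,
    stages: Sequence[Stage],
    trainer: Callable[[float, Stage], float] = sgd_fit,
) -> List[Tuple[str, float]]:
    """Train the same weight through each stage in order, general to specific."""
    history = []
    for stage in stages:
        weight = trainer(weight, stage)  # each stage starts from the previous result
        history.append((stage.name, weight))
    return history


# Hypothetical three-level hierarchy: general MSA, pooled dialects, one dialect/genre.
stages = [
    Stage("msa_general", [0.0, 0.0, 0.0, 0.0], learning_rate=0.5, steps=5),
    Stage("all_dialects", [1.0, 1.0], learning_rate=0.1, steps=2),
    Stage("dialect_genre", [2.0], learning_rate=0.05, steps=1),
]

history = hierarchical_finetune(10.0, stages)
for name, value in history:
    print(name, round(value, 4))
```

The smaller learning rates in the later stages are a design choice in this sketch: they let the model shift toward the dialect data without discarding what it learned from the general stage, which mirrors the single-system goal described in the quote.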
“Shaheen provides a one-size-fits-all solution that works for a large number of Arabic dialects and genres, an aspect seldom seen with competing translation platforms. In an extensive human evaluation of four dialects — Nile, Gulf, Levantine and Maghrebi — Shaheen outperformed popular online systems in terms of the Nile, Gulf and Levantine dialects. Work remains to be under progress with the Maghrebi dialect, which requires large-scale pooling of dialectical data,” he said.
Dr Sajjad said that automated translation enables many other technologies and facilitates tasks that are related to information extraction, analysis and understanding.
“It eases communication by bridging the language barrier. It can also directly impact the economy, healthcare system, political sphere, and more. For example, the FIFA 2022 World Cup in Qatar will be attracting people from all parts of the world. A translation tool that can effectively translate between dialectal Arabic and English can be regarded as an essential tool of communication,” he said.
Dr Sajjad said that Shaheen can be deployed as the backend of multi-genre, multi-dialectal speech translation. Other potential uses include translating Arabic content on social media into English, which would narrow the language gap and help information reach a wider audience.
The scientist also remarked that competition in the field is fierce given that technology giants such as Google have an enormous amount of data and computation power.
“Shaheen, on the other hand, specialises on handling the linguistic intricacies of Arabic specifically and is now adaptable to dialects, and that is where we have our edge compared to other translation companies. We want to ensure the best performance, and we have been proactively creating data for a large variety of Arabic dialects and are continuously exploring newly emerging methods that can be integrated into Shaheen to boost translation quality,” he added.
Other members of the Shaheen project from QCRI are: Dr Nadir Durrani, scientist; Dr Ahmed Abdelali and Hamdy Mubarak, senior software engineers; and Fahim Dalvi, software engineer.