Translation of French articles followed by Summarization
Introduction
This blog walks through translating a French history article into English and then summarizing the translated text. We show how pre-trained models can be used to do this very quickly. We also compare several translation and summarization models and share a subjective evaluation of their output quality.
The source text is a French article about slavery under Napoleon, borrowed from Wikibooks, which provides open books to all its users (https://en.wikibooks.org/wiki/French/Lessons/Slavery_under_Napoleon).
The code is made public on Colab here. It is also available on my Github.
What’s a good translation and summarization?
A good translation captures the meaning of the original text while being grammatically correct. It should (1) be fluent, (2) stay true to the original text and remain factually correct, and (3) allow the reader to draw the same conclusions as the original writer.
A good summary rephrases and condenses the original article into an objective outline. It should convey the gist of the article and be factually correct.
Translation from French to English
There are many pre-trained models for French to English translation available on the Hugging Face Hub. We test three popular models:
- Facebook mBART: a multilingual machine translation model covering 50 languages, including French to English.
- Sebis Legal T5 translation: a T5-small model trained to translate legal text from French to English.
- Helsinki-NLP Opus French to English: another popular French to English translation model.
The models are compared on translation quality and inference time. All inference times are measured on a P100 GPU.
The translator is called as follows. We load the model and its tokenizer from Hugging Face, set the source and target languages for the mBART model, generate the translated tokens, and then decode them back into text.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the mBART-50 many-to-many checkpoint and its tokenizer
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# The source language is French; the target language is forced at generation time
tokenizer.src_lang = "fr_XX"
encoded_1 = tokenizer(article, return_tensors="pt")
generated_tokens_1 = model.generate(**encoded_1, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])

# Decode the generated token ids back into an English string
decode_trans_1 = tokenizer.batch_decode(generated_tokens_1, skip_special_tokens=True)
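The inference times quoted in this post can be measured by wrapping the generate() call with a simple wall-clock timer. The helper below is our own illustration rather than part of the original notebook; it synchronizes the GPU before and after the call so the measurement is not skewed by asynchronous execution.

import time
import torch

def timed_generate(model, encoded_inputs, **generate_kwargs):
    # Wall-clock timing around a single generate() call
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    output = model.generate(**encoded_inputs, **generate_kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return output, time.time() - start

For example, timed_generate(model, encoded_1, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]) returns both the generated tokens and the elapsed time in seconds.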
The translation from the mBART model is below. The inference time is 5s on a P100 GPU. I would rate the translation as passable; the grammatical coherence could be improved.
In May 1802, after the signing of a treaty with the United Kingdom restoring Martinique to France, “slavery and the Treaty of the Blacks and their importation into the so-called colonies will take place in accordance with the laws and regulations of 1789.” At the beginning of June, he arrests and deports Toussaint Louverture, who had taken the lead of the Black Slave Rebellion of Saint-Domingue eleven years earlier, and who, relying on the ideals of the Revolution and trusting in the men supposed to represent them, had united the island with France. He was to die a year later at Fort de Joux, in Doubs. As for the Naplesian armies, they provoked many massacres during the second Rebellion of the Black Slaves of Saint-Domingue, before they emerged victorious and created the first independent Black Republic.
To run the Helsinki Opus model, we follow the same steps.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the Helsinki-NLP Opus French-to-English model
mname = 'Helsinki-NLP/opus-mt-fr-en'
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname).to(device)

# Tokenize the French article, translate, and decode the output
input_ids = tokenizer.encode(article, return_tensors="pt").to(device)
outputs = model.generate(input_ids)
decoded_1 = tokenizer.decode(outputs[0], skip_special_tokens=True)
The model takes only 1.9s to generate the translation. The translation below is subjectively more coherent: it is grammatically correct and reads more naturally than the mBART output.
Napoleon was also the one who restored slavery, abolished by the Republic in 1794. In May 1802, after the signing of a treaty with the United Kingdom re-establishing Martinique to France, “slavery as well as the Trache des Noirs and their importation into the so-called colonies will take place in accordance with the laws and regulations prior to 1789.” At the beginning of June, he had Toussaint Louverture arrested and deported, who had taken the lead of the revolt of the black slaves of Santo Domingo eleven years earlier, and who, relying on the ideals of the Revolution and confident in the men supposed to represent them, had joined the island to France. He was to die a year later at Fort de Joux, Doubs. As for the Napoleonic armies, they caused massacres during the second revolt of the slaves of Santo Domingo, before they emerged victorious and created the first independent Black Republic in January 1804. Guadeloupe also revolted in 1802 but the rebellion led by Louis Delgrès failed and ended with the collective suicide of the insurgents. However, the letter of the withdrawal was confirmed. The death of the insurgents.
Summarization of the Translated text
Next, we look at how we can summarize the translated text. The Hugging Face Hub has many pre-trained abstractive summarization models. To learn more about different types of summarization and how to train a summarization model, please refer to my blog here.
Most summarization models are trained on news-style datasets. We evaluate three popular summarization models:
- T5-base
- BART large
- Pegasus (trained on the XSum dataset)
To run T5-base, we tokenize the translated output, run it through the summarization model, and decode the generated summary.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base").to(device)

# T5 expects a task prefix; decoded_output holds the translated English text
text = "summarize: " + decoded_output
input_ids = tokenizer.encode(text, return_tensors='pt').to(device)

# Generate and decode the summary
summary_ids = model.generate(input_ids, num_beams=4, min_length=20, max_length=500)
summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
The T5-base model took 0.98s to run and the summary (below) is decent in quality. It is factually correct but incomplete: it should mention how Napoleon restored slavery. The capitalization and punctuation could also be improved.
a treaty with the uk re-established Martinique to France in may 1802. Toussaint Louverture had been arrested and deported. he had led the revolt of the black slaves of Santo Domingo eleven years earlier.
The full code for running translation and summarization is available on Colab here and on GitHub here.
Next, we look at the BART large model. This is a bigger model and has an inference time of 1.5s. Subjectively, the summary (below) is better than the one from the T5 model: it covers the important point that Napoleon restored slavery after it had previously been abolished.
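Running BART follows the same pattern as T5. The exact checkpoint is not named in the post; the sketch below assumes the commonly used facebook/bart-large-cnn summarization checkpoint and reuses the device and decoded_output variables from the previous step.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint: the post only says "BART large"
bart_name = "facebook/bart-large-cnn"
bart_tokenizer = AutoTokenizer.from_pretrained(bart_name)
bart_model = AutoModelForSeq2SeqLM.from_pretrained(bart_name).to(device)

# decoded_output is the translated English text from the previous step
input_ids = bart_tokenizer.encode(decoded_output, return_tensors="pt", truncation=True).to(device)
summary_ids = bart_model.generate(input_ids, num_beams=4, min_length=20, max_length=150)
bart_summary = bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True)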
Napoleon was also the one who restored slavery, abolished by the Republic in 1794. In May 1802, after the signing of a treaty with the United Kingdom re-establishing Martinique to France, “slavery as well as the Trache des Noirs and their importation into the so-called colonies will take place”
Lastly, we try the Pegasus model from Google trained on the XSum dataset. The inference time of this model is also 1.5s. The summary from this model is below. Wow, this is an interesting summary! In contrast to the summaries from T5 and BART, Pegasus doesn't just rely on the first few sentences of the article; the sentence is very well composed, pulling in information from different parts of the article. However, the summary misses the main point about Napoleon re-establishing slavery. If it had captured that, this would hands-down have been the best result.
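Again the exact checkpoint is not specified in the post; the sketch below assumes google/pegasus-xsum, which matches the XSum training data mentioned above.

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# Assumed checkpoint: google/pegasus-xsum (Pegasus fine-tuned on XSum)
pegasus_name = "google/pegasus-xsum"
pegasus_tokenizer = PegasusTokenizer.from_pretrained(pegasus_name)
pegasus_model = PegasusForConditionalGeneration.from_pretrained(pegasus_name).to(device)

# Summarize the translated English text
batch = pegasus_tokenizer(decoded_output, truncation=True, return_tensors="pt").to(device)
summary_ids = pegasus_model.generate(**batch, num_beams=4)
pegasus_summary = pegasus_tokenizer.decode(summary_ids[0], skip_special_tokens=True)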
Napoleon Bonaparte was the leader of the French army that conquered the Caribbean in the early 19th Century and established the first independent Black Republic in Guadeloupe in 1804.
The BART large model performs best here in terms of capturing the salient point of the article. It is also very encouraging that all the summarization models run in less than 2s.
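Putting the two steps together takes only a few lines with the transformers pipelines API. This is a minimal sketch rather than the exact notebook code; the summarization checkpoint (facebook/bart-large-cnn) is our assumption based on the results above.

from transformers import pipeline

# Translate the French article, then summarize the English translation
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

english_text = translator(article)[0]["translation_text"]
summary = summarizer(english_text, num_beams=4, min_length=20, max_length=150)[0]["summary_text"]
print(summary)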
Conclusion
This blog shows how easy it has become to perform NLP tasks such as translation and summarization. The Hugging Face Hub hosts many pre-trained models behind a common interface, which makes it straightforward to run different models and subjectively compare their output quality.
However, pre-trained models don’t work well for all problems. If the data used to train the model differs from the data encountered in the real-world application, then model accuracy can drop materially, and we may need to fine-tune the model for the task at hand. At Deep Learning Analytics, we specialize in building machine learning models by fine-tuning them for specific use cases. We also have expertise in deploying these models. Contact us through our website here if you see an opportunity to collaborate.
References
- Hugging Face Hub: open-source host for many trained Transformer models
- T5: Text-to-Text Transformer from Google
- BART: a popular encoder-decoder model from Facebook