Amazon Translate now supports Office documents

Whether your organization is a multinational enterprise present in many countries, or a small startup hungry for global success, translating your content to local languages may be an enduring challenge. Text data often comes in many formats, and processing each of them may require a different tool. As not all of these tools support the same language pairs, you may also have to convert certain documents to intermediate formats, or even resort to manual translation. All these issues add extra cost, and create unnecessary complexity when building consistent and automated translation workflows.

Amazon Translate aims to solve these problems in a simple and cost-effective fashion. Using either the AWS console or a single API call, Amazon Translate makes it easy for AWS customers to quickly and accurately translate text in 55 different languages and variants.

Earlier this year, Amazon Translate introduced batch translation for plain text and HTML documents. Today, I’m very happy to announce that batch translation now also supports Office documents, namely .docx, .xlsx and .pptx files as defined by the Office Open XML standard.

Introducing Amazon Translate for Office Documents
The process is extremely simple. As you would expect, source documents have to be stored in an Amazon Simple Storage Service (Amazon S3) bucket. Please note that no document may be larger than 20 MB, or contain more than 1 million characters.
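If you want to catch oversized documents before uploading them, a small pre-flight check along these lines could help. This is only a sketch: the function name is hypothetical, and it assumes that the 20 MB limit means 20 * 1024 * 1024 bytes, which the announcement does not spell out.

```python
# Limits quoted in the announcement.
# Assumption: the 20 MB limit is read here as 20 * 1024 * 1024 bytes.
MAX_BYTES = 20 * 1024 * 1024
MAX_CHARACTERS = 1_000_000


def within_batch_limits(size_bytes: int, character_count: int) -> bool:
    """Return True if a document respects both batch translation limits."""
    return size_bytes <= MAX_BYTES and character_count <= MAX_CHARACTERS


print(within_batch_limits(size_bytes=350_000, character_count=42_000))  # True
```

The file size is available from the file system, and for a .docx file the character count could be approximated by summing the length of each paragraph.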

Each batch translation job processes a single file type and a single source language. Thus, we recommend that you organize your documents in a logical fashion in S3, storing each file type and each language under its own prefix.
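One possible layout, sketched below, builds each S3 key from the file type and the source language. The "input" root and the input_key helper are illustrative assumptions, not something Amazon Translate requires.

```python
from pathlib import PurePosixPath

# Office formats supported by batch translation, as listed in this post.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def input_key(filename: str, language_code: str, root: str = "input") -> str:
    """Return an S3 key shaped like <root>/<file type>/<language>/<file name>."""
    name = PurePosixPath(filename).name
    extension = PurePosixPath(name).suffix.lower()
    if extension not in OFFICE_EXTENSIONS:
        raise ValueError(f"unsupported file type: {filename}")
    return f"{root}/{extension.lstrip('.')}/{language_code}/{name}"


print(input_key("blog_post.docx", "en"))  # input/docx/en/blog_post.docx
```

With this layout, every prefix such as input/docx/en/ holds a single file type in a single language, which matches what one batch job can process.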

Then, using either the AWS console or the StartTextTranslationJob API in one of the AWS language SDKs, you can launch a translation job, passing:

  • the input and output location in S3,
  • the file type,
  • the source and target languages.

Once the job is complete, you can collect translated files at the output location.
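With the AWS SDK for Python (boto3), launching a job could look roughly like the sketch below. The helper names are hypothetical, and the bucket, role and job names in the usage comment are placeholders. The file type is passed as the MIME type of the corresponding Office Open XML format.

```python
import uuid

# MIME types identifying each Office Open XML format.
CONTENT_TYPES = {
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def build_job_request(job_name, input_uri, output_uri, file_type,
                      source_language, target_language, role_arn):
    """Assemble the parameters of a StartTextTranslationJob call."""
    return {
        "JobName": job_name,
        "InputDataConfig": {
            "S3Uri": input_uri,
            "ContentType": CONTENT_TYPES[file_type],
        },
        "OutputDataConfig": {"S3Uri": output_uri},
        "DataAccessRoleArn": role_arn,
        "SourceLanguageCode": source_language,
        "TargetLanguageCodes": [target_language],
        # Idempotency token, so that a retried request does not start a second job.
        "ClientToken": str(uuid.uuid4()),
    }


def start_job(translate_client, request):
    """Start the batch job and return its identifier."""
    response = translate_client.start_text_translation_job(**request)
    return response["JobId"]


# Usage (placeholder names, requires AWS credentials):
#   import boto3
#   request = build_job_request(
#       "docx-en-to-fr", "s3://my-bucket/input/docx/en/", "s3://my-bucket/output/",
#       "docx", "en", "fr", "arn:aws:iam::123456789012:role/my-translate-role")
#   job_id = start_job(boto3.client("translate"), request)
```

Keeping the request construction separate from the API call makes it easy to check the parameters without touching AWS.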

Let’s do a quick demo!

Translating Office Documents
Using the S3 console, I first upload a few .docx documents to one of my buckets.

S3 files

Then, moving to the Translate console, I create a new batch translation job, giving it a name, and selecting both the source and target languages.

Creating a batch job

Then, I define the location of my documents in S3, and their format, .docx in this case. Optionally, I could apply a custom terminology, to make sure specific words are translated exactly the way that I want.

Likewise, I define the output location for translated files. Please make sure that this path exists, as Translate will not create it for you.

Creating a batch job

Finally, I set the AWS Identity and Access Management (IAM) role, giving my Translate job the appropriate permissions to access S3. Here, I use an existing role that I created previously, but you could also let Translate create one for you. Then, I click on ‘Create job’ to launch the batch job.

Creating a batch job

The job starts immediately.

Batch job running

A little while later, the job is complete. All three documents have been translated successfully.

Viewing a completed job

Translated files are available at the output location, as visible in the S3 console.

Viewing translated files

Downloading one of the translated files, I can open it and compare it to the original version.

Comparing files

For small-scale use, it’s extremely easy to use the AWS console to translate Office files. Of course, you can also use the Translate API to build automated workflows.
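When a job is driven from the API rather than from the console, one simple way to know when it is done is to poll the DescribeTextTranslationJob API. Here is a minimal sketch with boto3: the wait_for_job helper, the polling interval and the number of attempts are arbitrary choices made for illustration.

```python
import time

# Statuses after which a job no longer changes state.
TERMINAL_STATUSES = {"COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"}


def wait_for_job(translate_client, job_id, poll_seconds=60, max_polls=120,
                 sleep=time.sleep):
    """Poll a batch translation job until it reaches a terminal status."""
    for _ in range(max_polls):
        response = translate_client.describe_text_translation_job(JobId=job_id)
        status = response["TextTranslationJobProperties"]["JobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")


# Usage (requires AWS credentials and an existing job identifier):
#   import boto3
#   status = wait_for_job(boto3.client("translate"), job_id)
```

For long-running jobs, a scheduled check is probably a better fit than a blocking loop, but this is enough for a quick script.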

Automating Batch Translation
In a previous post, we showed you how to automate batch translation with an AWS Lambda function. You could expand on this example, and add language detection with Amazon Comprehend. For instance, here’s how you could combine the DetectDominantLanguage API with the python-docx open source library to detect the language of .docx files.

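import boto3
from docx import Document

# Use the first non-empty paragraph of the document as a text sample
document = Document('blog_post.docx')
text = next(p.text for p in document.paragraphs if p.text.strip())

# Ask Amazon Comprehend which languages it detects in the sample
comprehend = boto3.client('comprehend')
response = comprehend.detect_dominant_language(Text=text)

# Keep the language with the highest confidence score
top_language = max(response['Languages'], key=lambda lang: lang['Score'])
code = top_language['LanguageCode']
score = top_language['Score']
print("%s, %f" % (code, score))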

Pretty simple! You could also detect the type of each file based on its extension, and move it to the proper input location in S3. Then, you could schedule a Lambda function with CloudWatch Events to periodically translate files, and send a notification by email. Of course, you could use AWS Step Functions to build more elaborate workflows. Your imagination is the limit!
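As an illustration of the "move" step, keep in mind that S3 has no native move operation: an object is copied to its new key, then the original is deleted. A minimal sketch with boto3 could look like this, where move_object is a hypothetical helper and the bucket and key names are placeholders.

```python
def move_object(s3_client, bucket, source_key, destination_key):
    """Move an object within a bucket by copying it, then deleting the original."""
    s3_client.copy_object(
        Bucket=bucket,
        Key=destination_key,
        CopySource={"Bucket": bucket, "Key": source_key},
    )
    # Only reached if the copy did not raise an exception.
    s3_client.delete_object(Bucket=bucket, Key=source_key)


# Usage (placeholder names, requires AWS credentials):
#   import boto3
#   move_object(boto3.client("s3"), "my-bucket",
#               "uploads/blog_post.docx", "input/docx/en/blog_post.docx")
```

The destination key would typically be derived from the file extension and from the language code returned by Amazon Comprehend.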

Getting Started
You can start translating Office documents today in the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (London), Europe (Frankfurt), and Asia Pacific (Seoul).

If you’ve never tried Amazon Translate, did you know that the free tier offers 2 million characters per month for the first 12 months, starting from your first translation request?

Give it a try, and let us know what you think. We’re looking forward to your feedback: please post it to the AWS Forum for Amazon Translate, or send it to your usual AWS support contacts.

– Julien
