As a medical translator, I work with a LOT of PDF files. I probably use my OCR tool up to 10 times per day and I’m fairly certain that at
this point, I couldn’t work without it. However, it took some time before I
figured out exactly how to get the most out of it and I’m certain that I
haven’t even scratched the surface. In case you are not familiar with OCR, it
stands for “Optical Character Recognition” and is basically used to turn “dead”
(not editable) documents of all kinds (including pictures and PDFs) into editable
Word documents while preserving the formatting of the original. This sometimes works
better in theory than in practice since a bad fax can ruin the OCR tool’s
ability to properly recreate formatting.
Fixing the strange formatting produced by an OCR tool can be
more difficult than recreating the formatting from scratch. With that said, OCR
still has plenty of uses. I paste the text from the OCR file as unformatted
text into a new, clean file, which I then format from scratch. I find this to be the
easiest way to get around the strange formatting OCR tools can create while
still taking advantage of the benefits.
Quality: When
editing translations from PDF documents, I often find that translators omit text.
Although this is an unacceptable translation error, it does happen. OCR helps ensure
that all of the text gets translated, just as it would when working from a Word file.
Computer-Assisted
Translation tools: OCR enables you to use your favorite CAT tool with a dead
PDF file. This helps speed up the translation process by taking advantage of the
matches and repetitions that are generally inaccessible in PDF translations. You
can also increase consistency by always ensuring that segments and terminology are
translated the same way throughout a document.
Numbers, names and
lists: Have you ever waded through pages and pages of a lab report? Ever painfully
retyped tables full of numbers? An OCR tool will recreate all of those numbers
for you. That means all you need to do is proofread them! Or, how about a list
of names with phone numbers? Don’t type the whole list from scratch—OCR the
list and proofread instead!
Tables: Although
OCR tools can create strange formatting, they are great with simple tables and
lines that they can read well. You may just need to correct the cell alignment
and font.
Word counts: Most
translators estimate how long a project will take based on the number of words
in the document. With a PDF, the word count is usually estimated in a variety of
ways, but the accuracy varies. I recently had a client ask me to translate a
very technical medical document, quoted at 2,000 words, in 24 hours. No problem,
right? It looked a little longer than that to me so I sent the file through my
OCR tool and it turned out that the file was 7,000 words. No, I’m not kidding. That
would have been a long night.
Flat rates: Having
an accurate word count also allows you to give clients a flat rate if you so
choose and/or helps provide a more accurate quote up front so no one is
surprised.
Just remember that OCR tools only give an estimate. If you
use one to check the word count of a document, be sure to scroll through and
make sure that all or most of the text was picked up by the OCR tool. If the tool can’t
read something, it inserts that content as a picture, and maybe a picture is worth a
thousand words, but not to a translator!
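If your OCR tool can export plain text, a few lines of Python can double-check the word count and turn it into a rough time estimate. This is only a sketch under stated assumptions: the function names are made up for illustration, the 350 words-per-hour pace is a hypothetical figure rather than a recommendation, and the count will not necessarily match what Word or a CAT tool reports, since each tool defines a "word" a little differently.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain a letter or digit.

    Stray punctuation that OCR leaves on its own (a lone dash or bullet)
    is not counted as a word.
    """
    return sum(1 for token in text.split() if re.search(r"\w", token))


def estimate_hours(word_count: int, words_per_hour: int = 350) -> float:
    """Rough translation time; words_per_hour is an assumed pace."""
    return word_count / words_per_hour


# Hypothetical snippet of OCR output pasted as unformatted text.
sample = "Hemoglobin 13.5 g/dL - within normal limits"
print(count_words(sample), "words")

# The 7,000-word file from the story above, at the assumed pace.
print(estimate_hours(7000), "hours")
```

Text that the OCR tool inserted as a picture never reaches the exported text, so it is not counted here either. Scrolling through the file is still necessary.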
How do you use your OCR tool?