Paper documents have an annoying habit of becoming harder to read as they age. This causes major problems when digitising old archives. As part of his doctoral research, Dr Tan Lu of the VUB research group Digital Mathematics developed award-winning software that is not fooled by cracks, stains or poor-quality scans.

Major digitisation projects are currently being carried out in the cultural heritage sector. This involves scanning large quantities of old newspapers and manuscripts, which are then converted into machine-readable text via Optical Character Recognition (OCR). This step is essential because it makes documents searchable and the information in them much easier to access. However, OCR is far from perfect. The algorithms struggle with physical damage to pages, such as cracks and stains, and they can be confused by the unusual text layouts that often occur in advertisements and fashion magazines.

Seeing like people

Under the supervision of Prof Dr Ann Dooms, Lu developed a series of homogeneity models that help the computer to greatly improve its text recognition. With these models he addressed a range of problems, including document segmentation, distortion recognition and quality assessment. In doing so, he drew on existing knowledge about how the human brain deals with complicated images. Lu: “Gestalt psychology teaches us, for example, that people naturally group separate objects of the same kind together. Because computers lack this ability, they often stumble when recognising text in difficult layouts or in damaged areas. Unlike humans, they are unable to recombine the different parts of a damaged image.” By integrating insights from perceptual psychology into what’s known as a probabilistic local text homogeneity model, Lu taught the computer to handle documents with difficult layouts and to recognise damage and distortions in them.
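
To give a feel for the grouping idea Lu describes, here is a minimal sketch in Python. It is not Lu's probabilistic model, whose details are not given here. It is a much simpler, deterministic stand-in that groups character-like boxes when they are similar in height and close together, two of the classic Gestalt cues. The `Box` type, the function names and the two thresholds (`size_tol`, `gap_factor`) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of one character-like blob (hypothetical input format)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float  # assumed to be positive


def similar_and_close(a: Box, b: Box,
                      size_tol: float = 0.3,
                      gap_factor: float = 1.5) -> bool:
    """Toy Gestalt test: similar height (similarity) and small gap (proximity)."""
    taller, shorter = max(a.height, b.height), min(a.height, b.height)
    if (taller - shorter) / taller > size_tol:
        return False
    # Gap between the boxes along each axis (0 if they overlap on that axis).
    gap_x = max(0.0, max(a.x, b.x) - min(a.x + a.width, b.x + b.width))
    gap_y = max(0.0, max(a.y, b.y) - min(a.y + a.height, b.y + b.height))
    return max(gap_x, gap_y) <= gap_factor * taller


def group_boxes(boxes: list[Box]) -> list[list[int]]:
    """Group box indices with union-find over the pairwise test above."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if similar_and_close(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    groups: dict[int, list[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Three small body-text boxes on one line, then two large headline boxes below.
example = [
    Box(0, 0, 10, 10), Box(12, 0, 10, 10), Box(24, 0, 10, 10),
    Box(0, 30, 30, 40), Box(45, 30, 30, 40),
]
print(group_boxes(example))  # [[0, 1, 2], [3, 4]]
```

In this toy example the body text and the headline end up in separate groups even though they sit close together, because their sizes differ too much. A probabilistic model such as the one described above would presumably replace the hard thresholds with likelihoods, which is what lets it cope with stains and cracks that break characters apart.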

International prize for document recognition

Lu’s research has solved a number of long-standing problems in text recognition, and his work demonstrates the power of a mathematical approach to image processing. New mathematical models of this kind appear able to unlock the true potential of digitising old and valuable documents. The software developed in this research by the Digital Mathematics research group won the International Conference on Document Analysis and Recognition (ICDAR) prize for document recognition in 2019.