Researchers Propose Easter2.0, a Novel Convolutional Neural Network (CNN)-Based Architecture for the Task of End-to-End Handwritten Text Line Recognition that Utilizes Only 1D Convolutions - MarkTechPost

2022-08-08 10:44:16 By : Ms. Lizzy Zhang

Handwritten text recognition (HTR) is the capability of a computer or device to read handwriting, whether scanned from printed physical documents and images or written directly onto a touchscreen, and translate it into text. An HTR system is a pattern-recognition program that converts the text in a scanned image into digital, machine-readable text.

Typically, two main strategies exist in the state of the art (SOTA) for HTR: CNN-based and RNN/Transformer-based methods. Although CNNs are widely applied to the Optical Character Recognition (OCR) task, some work has proposed combining a CNN with a gating mechanism, since a plain CNN cannot access long-range context information. On the other hand, the most successful models are based on the LSTM, a particular kind of RNN, to benefit from its power to handle long-term context in sequences. In addition, recent Transformer-based architectures utilize a CNN backbone with self-attention as an encoder to understand images and achieve SOTA results. In a recently published paper, researchers introduced Easter2.0, a network composed of multiple 1D convolution layers combined with a Squeeze-and-Excitation (SE) module, together with Tiling and Corruption Augmentation (TACo), a new data-augmentation strategy that helps obtain sufficient high-quality training data.

Easter2.0 is built by stacking 14 layers of standard 1D convolutional layers, batch normalization layers, ReLU, and Dropout. To match the number of channels, a residual connection is first projected through a 1 x 1 convolution layer, followed by a batch normalization layer. The result of this batch normalization layer is combined with the SE layer's output in the final convolution block, which comes before the ReLU and Dropout layers. Finally, a softmax layer predicts the distribution of probabilities over the characters of a given vocabulary. Thanks to the SE module, Easter2.0 can access global context similarly to RNNs/Transformers while retaining a CNN's speed and parameter efficiency. The authors utilize a 1D version of the Squeeze-and-Excitation module to introduce this global context: the local features are first squeezed into a single global context vector of per-channel weights, and the SE module then broadcasts this context back to each local feature vector through an element-wise multiplication of the context weights with the features.
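The squeeze-broadcast mechanism described above can be sketched in a few lines of NumPy. This is an illustrative 1D Squeeze-and-Excitation module, not the authors' exact code; the bottleneck ratio `r` and the weight matrices are assumptions for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite_1d(x, w1, w2):
    """Sketch of a 1D Squeeze-and-Excitation block.

    x  : (T, C) sequence of local feature vectors (T time steps, C channels)
    w1 : (C, C // r) bottleneck projection (illustrative)
    w2 : (C // r, C) expansion projection (illustrative)
    """
    # Squeeze: average-pool the local features into one global context vector.
    ctx = x.mean(axis=0)                        # (C,)
    # Excitation: bottleneck MLP producing per-channel weights in (0, 1).
    weights = sigmoid(relu(ctx @ w1) @ w2)      # (C,)
    # Broadcast: re-weight every local feature vector element-wise.
    return x * weights                          # (T, C)

rng = np.random.default_rng(0)
T, C, r = 32, 16, 4
x = rng.standard_normal((T, C))
out = squeeze_excite_1d(x,
                        rng.standard_normal((C, C // r)),
                        rng.standard_normal((C // r, C)))
print(out.shape)  # (32, 16)
```

Because the excitation weights lie in (0, 1), the module can only attenuate channels based on the global context, which is how every position in the sequence gets access to information from the whole line.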

In addition to the new suggested architecture, the authors introduced a novel data-augmentation technique (TACo). TACo divides the image into multiple small tiles of the same size. Then, vertical and horizontal tiles are corrupted by random noise. Finally, the modified tiles are joined back in the same order to create the new image. With TACo augmentations, the network can acquire valuable features and produce good results even on relatively small training sets.
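A minimal sketch of the tile-and-corrupt idea is below, assuming vertical tiles and uniform random noise as the corruption; the tile width and corruption probability are illustrative parameters, not the paper's settings.

```python
import numpy as np

def taco_augment(img, tile_w=8, corrupt_prob=0.2, rng=None):
    """Sketch of TACo-style augmentation on a grayscale image in [0, 1].

    The image is split into fixed-width vertical tiles; each tile is
    replaced by random noise with probability `corrupt_prob`, then the
    tiles are rejoined in their original order.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape
    out = img.copy()
    for x0 in range(0, w, tile_w):
        if rng.random() < corrupt_prob:
            # Corrupt this tile with uniform noise (one possible corruption).
            out[:, x0:x0 + tile_w] = rng.random((h, min(tile_w, w - x0)))
    return out

img = np.ones((16, 32))                       # dummy white text-line image
aug = taco_augment(img, corrupt_prob=1.0,
                   rng=np.random.default_rng(1))
print(aug.shape)  # (16, 32)
```

Dropping whole tiles forces the network to infer characters from partial strokes and surrounding context, which is why the authors report good results even on relatively small training sets.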

To evaluate Easter2.0, the authors use the publicly available IAM dataset, focusing only on line-level recognition. The contributions of several components of the model are evaluated in ablations, such as the effect of TACo augmentations, the effect of residual connections, and the effect of Squeeze-and-Excitation. The metric chosen to compare results in the experiments is the case-sensitive Character Error Rate (CER). Results show that, when the training dataset is small, Easter2.0 surpasses SOTA works.
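For reference, the CER metric is simply the character-level Levenshtein (edit) distance between the predicted and ground-truth strings, normalized by the reference length. A minimal implementation:

```python
def character_error_rate(reference, hypothesis):
    """Case-sensitive CER: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # dp[j] = distance for first j hypothesis chars
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(character_error_rate("easter", "eastr"))  # 1 edit / 6 chars ≈ 0.1667
```

Being case-sensitive, the metric counts "a" vs. "A" as a full substitution error, which makes it a stricter measure than case-insensitive variants.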

In this article, Easter2.0, a convolutional network for the task of handwritten text recognition, was introduced. The proposed architecture is built from only 1D convolutions, dense residual connections, and an SE module. A new data-augmentation technique (TACo) that can be used in the field of OCR/HTR was also presented. The experimental study demonstrates that this work achieves competitive results on the IAM test set when the training data is small, compared to the SOTA. Finally, Easter2.0 requires less computing power since it has a very small number of trainable parameters compared to other approaches.

Mahmoud is a PhD researcher in machine learning. He also holds a bachelor's degree in physical science and a master's degree in telecommunications and networking systems. His current areas of research concern computer vision, stock-market prediction, and deep learning. He has produced several scientific articles about person re-identification and the study of the robustness and stability of deep networks.