MULTILINGUAL TEXT IDENTIFICATION AND RECOGNITION FROM IMAGE
Keywords:
Multilingual text recognition, Optical Character Recognition (OCR), Image-based text extraction, Natural Language Processing (NLP).
Abstract
Extracting and recognizing text from visual media that contains multiple languages remains challenging, especially for languages with limited linguistic resources. This research addresses the problem through three objectives: (1) detect multilingual text in images, (2) accurately recognize English, Hindi, and Marathi text, and (3) develop an efficient system that integrates visual and language processing techniques. The system is implemented in Python on Google Colaboratory using the EasyOCR, OpenCV (cv2), and LangID libraries. Preprocessing consists of grayscale conversion and binarization, followed by text localization and language identification. Social media images were selected as the primary data source because of their diverse multilingual content and contextual richness. The system achieved effective multilingual text recognition, accurately detecting and classifying words among the three target languages. Visual inspection confirmed that preprocessing succeeded, that bounding boxes were placed precisely, and that language categorization was reliable. Recognized words such as “Leather,” “कम से कम,” and “चमडा” illustrate the system’s accuracy in segmenting and classifying multilingual text. In conclusion, the developed system offers a robust solution for multilingual text extraction from images, with potential applications in digital document conversion, real-time language adaptation, and integration with broader Natural Language Processing (NLP) systems.
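The pipeline summarized in the abstract (grayscale conversion, binarization, text localization, language identification) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `script_of`, `preprocess`, and `recognize` are hypothetical, Otsu thresholding is assumed because the abstract does not name a binarization method, and the Unicode script check that separates English from Devanagari text before calling LangID is an addition made here for illustration.

```python
from typing import List, Tuple

# Unicode Devanagari block, shared by Hindi and Marathi.
DEVANAGARI = range(0x0900, 0x0980)


def script_of(text: str) -> str:
    """Return 'devanagari', 'latin', or 'other' by majority of characters."""
    deva = sum(1 for ch in text if ord(ch) in DEVANAGARI)
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if deva == 0 and latin == 0:
        return "other"
    return "devanagari" if deva >= latin else "latin"


def preprocess(image_path: str):
    """Grayscale conversion followed by binarization (Otsu is an assumption)."""
    import cv2  # imported lazily so script_of works without OpenCV installed

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def recognize(image_path: str) -> List[Tuple[str, str, float]]:
    """Return (text, language, OCR confidence) for each detected text region."""
    import easyocr
    import langid

    reader = easyocr.Reader(["en", "hi", "mr"])
    # Restrict LangID to the two Devanagari-script target languages.
    langid.set_languages(["hi", "mr"])

    results = []
    for _bbox, text, confidence in reader.readtext(preprocess(image_path)):
        script = script_of(text)
        if script == "latin":
            language = "en"
        elif script == "devanagari":
            language, _score = langid.classify(text)
        else:
            language = "unknown"  # digits, punctuation, or other scripts
        results.append((text, language, float(confidence)))
    return results
```

Calling `recognize("post.jpg")` on a hypothetical image file would return one tuple per detected region. Because Hindi and Marathi share the Devanagari script, the script check only separates English from the other two, and distinguishing Hindi from Marathi on short words is left to LangID, which may be unreliable for very short strings.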