Abstract: Multi-label image classification, which involves recognizing multiple objects within a single image, is a fundamental task in computer vision. Recently, Visual-Language Models (VLMs) have ...
Abstract: Text-based Visual Question Answering (TextVQA) focuses on answering questions about the scene text in images. Most works in this field use transformer-based models to model the ...