Abstract: Multi-label image classification, which involves recognizing multiple objects within a single image, is a fundamental task in computer vision. Recently, Visual-Language Models (VLMs) have ...
Abstract: Text-based Visual Question Answering (TextVQA) focuses on answering questions about the scene text in images. Most works in this field uses transformer based models to modeling the ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results