VisChronos: Revolutionizing Image Captioning Through Real-Life Events
P. Nguyen, Hieu Xuan Nguyen, Trung-Nghia Le, et al.

Communications in Computer and Information Science, Journal Year: 2025, Issue: unknown, pp. 127 - 140

Published: Jan. 1, 2025

Language: English

Transformers in the Real World: A Survey on NLP Applications
Narendra Patwardhan, Stefano Marrone, Carlo Sansone, et al.

Information, Journal Year: 2023, Issue: 14(4), pp. 242 - 242

Published: April 17, 2023

The field of Natural Language Processing (NLP) has undergone a significant transformation with the introduction of Transformers. Since the introduction of this technology in 2017, the use of transformers has become widespread and has had a profound impact on NLP. In this survey, we review open-access, real-world applications of transformers in NLP, specifically focusing on those where text is the primary modality. Our goal is to provide a comprehensive overview of the current state of the art, highlight strengths and limitations, and identify future directions for research. In this way, we aim to offer valuable insights to both researchers and practitioners. In addition, we provide a detailed analysis of the various challenges faced when implementing such applications, including computational efficiency, interpretability, and ethical considerations. Moreover, we discuss the influence of transformers on the NLP community and on the research and development of new models.

Language: English

Cited: 82

A Survey of Knowledge Graph Reasoning on Graph Types: Static, Dynamic, and Multi-Modal
Ke Liang, Lingyuan Meng, Meng Liu, et al.

IEEE Transactions on Pattern Analysis and Machine Intelligence, Journal Year: 2024, Issue: 46(12), pp. 9456 - 9478

Published: June 28, 2024

Knowledge graph reasoning (KGR), aiming to deduce new facts from existing facts based on mined logic rules underlying knowledge graphs (KGs), has become a fast-growing research direction. It has been proven to significantly benefit the usage of KGs in many AI applications, such as question answering and recommendation systems. According to graph types, KGR models can be roughly divided into three categories, i.e., static models, temporal models, and multi-modal models. Early works in this domain mainly focus on static KGR, while recent ones try to leverage temporal and multi-modal information, which are more practical and closer to the real world. However, no survey papers or open-source repositories comprehensively summarize and discuss models in this important direction. To fill the gap, we conduct a first survey of KGR, tracing it from static to temporal and then to multi-modal KGs. Concretely, models are reviewed under a bi-level taxonomy, i.e., top-level (graph types) and base-level (techniques and scenarios). Besides, performances, as well as datasets, are summarized and presented. Moreover, we point out challenges and potential opportunities to enlighten readers.

Language: English

Cited: 59

TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
Yushi Hu, Benlin Liu, Jungo Kasai, et al.

2021 IEEE/CVF International Conference on Computer Vision (ICCV), Journal Year: 2023, Issue: unknown, pp. 20349 - 20360

Published: Oct. 1, 2023

Despite thousands of researchers, engineers, and artists actively working on improving text-to-image generation models, systems often fail to produce images that accurately align with the text inputs. We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We calculate faithfulness by checking whether existing VQA models can answer these questions using the generated image. TIFA is reference-free and allows for fine-grained, interpretable evaluations of generated images. It also has better correlations with human judgments than existing metrics. Based on this approach, we introduce TIFA v1.0, a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.). We present a comprehensive evaluation on TIFA v1.0 and highlight the limitations and challenges of current models. For instance, we find that despite doing well on color and material, current models still struggle with counting, spatial relations, and composing multiple objects. We hope our benchmark will help carefully measure research progress in text-to-image synthesis and provide valuable insights for further research.
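As a rough illustration of the metric described above, the sketch below scores an image by the fraction of prompt-derived questions that an off-the-shelf VQA model answers as expected. The hand-written QA pairs and the `tifa_style_score` helper are illustrative assumptions, not the authors' released code (which generates the questions with a language model).

```python
# Minimal sketch of a TIFA-style faithfulness score (not the authors' implementation).
# Assumption: question-answer pairs for the prompt are already given; in TIFA they
# are produced by a language model. An off-the-shelf VQA model judges each pair.
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

def tifa_style_score(image: Image.Image, qa_pairs: list[dict]) -> float:
    """Fraction of prompt-derived questions the VQA model answers as expected."""
    correct = 0
    for qa in qa_pairs:
        pred = vqa(image=image, question=qa["question"], top_k=1)[0]["answer"]
        correct += int(pred.strip().lower() == qa["answer"].strip().lower())
    return correct / max(len(qa_pairs), 1)

# Example usage with hand-written QA pairs for the prompt "a red bicycle next to a bench".
qa_pairs = [
    {"question": "What color is the bicycle?", "answer": "red"},
    {"question": "Is there a bench in the image?", "answer": "yes"},
]
score = tifa_style_score(Image.open("generated.png"), qa_pairs)
print(f"faithfulness: {score:.2f}")
```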

Language: English

Cited: 36

BridgeTower: Building Bridges between Encoders in Vision-Language Representation Learning
Xu Xiao, Chenfei Wu, Shachar Rosenman, et al.

Proceedings of the AAAI Conference on Artificial Intelligence, Journal Year: 2023, Issue: 37(9), pp. 10637 - 10647

Published: June 26, 2023

Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align, and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer representations from deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of the uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up alignment and fusion of visual and textual representations at different semantic levels. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream tasks. In particular, on the VQAv2 test-std set, it reaches an accuracy of 78.73%, outperforming the previous model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, it achieves 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at https://github.com/microsoft/BridgeTower.
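The bridge-layer idea can be pictured as injecting each uni-modal encoder layer's output into the matching cross-modal layer. The sketch below uses a hypothetical fusion (projection, residual add, LayerNorm) chosen only for illustration; the official BridgeTower implementation at the linked repository differs in detail.

```python
# Minimal sketch of the bridge-layer idea (not the official BridgeTower code).
# Assumption: fusion is LayerNorm(cross + proj(uni)); the real model's bridges differ.
import torch
import torch.nn as nn

class BridgeLayer(nn.Module):
    def __init__(self, uni_dim: int, cross_dim: int):
        super().__init__()
        self.proj = nn.Linear(uni_dim, cross_dim)  # map uni-modal features to cross-modal width
        self.norm = nn.LayerNorm(cross_dim)

    def forward(self, cross_hidden: torch.Tensor, uni_hidden: torch.Tensor) -> torch.Tensor:
        # cross_hidden: [batch, seq, cross_dim] input to the k-th cross-modal layer
        # uni_hidden:   [batch, seq, uni_dim] output of the matching uni-modal encoder layer
        return self.norm(cross_hidden + self.proj(uni_hidden))

# Example: bridging a 768-d text-encoder layer into a 768-d cross-modal layer.
bridge = BridgeLayer(uni_dim=768, cross_dim=768)
fused = bridge(torch.randn(2, 32, 768), torch.randn(2, 32, 768))
print(fused.shape)  # torch.Size([2, 32, 768])
```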

Language: English

Cited: 32

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
Qinghao Ye, Guohai Xu, Ming Yan, et al.

2021 IEEE/CVF International Conference on Computer Vision (ICCV), Journal Year: 2023, Issue: unknown, pp. 15359 - 15370

Published: Oct. 1, 2023

Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language paradigms to video-language pre-training, thus not fully exploiting the unique characteristic of video, i.e., the temporal dimension. In this paper, we propose a Hierarchical Temporal-Aware pre-training framework, HiTeA, with two novel tasks for yielding temporal-aware multi-modal representations that capture fine-grained cross-modal temporal moment information and the temporal contextual relations between video-text pairs. First, a cross-modal moment exploration task explores moments in videos by mining the paired texts, which results in a detailed video moment representation. Then, based on the learned representations, the inherent temporal contextual relations are captured by aligning video-text pairs as a whole at different time resolutions with a multi-modal temporal relation exploration task. Furthermore, we introduce a shuffling test to evaluate the temporal reliance of datasets and models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement respectively. HiTeA also demonstrates strong generalization ability when transferred to downstream tasks in a zero-shot manner.
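The shuffling test mentioned above can be approximated by comparing a video-text matching score on ordered versus shuffled frames. The sketch below assumes a generic `score_fn` and an averaged-gap scoring rule; it is not HiTeA's exact protocol.

```python
# Minimal sketch of a frame-shuffling test (an assumption of how such a check could
# look; the paper's actual protocol and scoring function may differ).
import random

def shuffling_gap(score_fn, frames: list, text: str, trials: int = 5) -> float:
    """Drop in a video-text similarity score when frame order is destroyed.

    score_fn(frames, text) -> float is any video-text matching model; a small gap
    suggests the model (or dataset) barely relies on temporal order.
    """
    ordered = score_fn(frames, text)
    shuffled_scores = []
    for _ in range(trials):
        shuffled = frames[:]
        random.shuffle(shuffled)
        shuffled_scores.append(score_fn(shuffled, text))
    return ordered - sum(shuffled_scores) / trials
```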

Language: English

Cited: 27

From image to language: A critical analysis of Visual Question Answering (VQA) approaches, challenges, and opportunities
Md Farhan Ishmam, Md Sakib Hossain Shovon, M. F. Mridha, et al.

Information Fusion, Journal Year: 2024, Issue: 106, pp. 102270 - 102270

Published: Jan. 28, 2024

Language: English

Cited: 18

Review of multimodal machine learning approaches in healthcare
Felix H. Krones, Umar Marikkar, Guy Parsons, et al.

Information Fusion, Journal Year: 2024, Issue: unknown, pp. 102690 - 102690

Published: Sep. 1, 2024

Language: English

Cited: 17

Graph neural networks in vision-language image understanding: a survey
Henry Senior, Greg Slabaugh, Shanxin Yuan, et al.

The Visual Computer, Journal Year: 2024, Issue: unknown

Published: March 29, 2024

Abstract: 2D image understanding is a complex problem within computer vision, but it holds the key to providing human-level scene comprehension. It goes further than identifying the objects in an image and instead attempts to understand the scene. Solutions to this problem form the underpinning of a range of tasks, including image captioning, visual question answering (VQA), and image retrieval. Graphs provide a natural way to represent the relational arrangement between objects in an image, and thus, in recent years graph neural networks (GNNs) have become a standard component of many 2D image understanding pipelines, becoming a core architectural component, especially in the VQA group of tasks. In this survey, we review this rapidly evolving field and provide a taxonomy of the graph types used in these approaches, a comprehensive list of GNN models in this domain, and a roadmap of future potential developments. To the best of our knowledge, this is the first survey that covers image captioning, visual question answering, and image retrieval techniques that focus on using GNNs as the main part of their architecture.

Language: English

Cited: 15

LGR-NET: Language Guided Reasoning Network for Referring Expression Comprehension
Mingcong Lu, Ruifan Li, Fangxiang Feng, et al.

IEEE Transactions on Circuits and Systems for Video Technology, Journal Year: 2024, Issue: 34(8), pp. 7771 - 7784

Published: March 7, 2024

Referring Expression Comprehension (REC) is a fundamental task in the vision and language domain, which aims to locate an image region according to a natural language expression. REC requires models to capture key clues in the text and perform accurate cross-modal reasoning. A recent trend employs transformer-based methods to address this problem. However, most of these methods typically treat textual features equally. They usually perform reasoning in a crude way, utilizing textual features as a whole without detailed considerations (e.g., spatial information). This insufficient utilization will lead to sub-optimal results. In this paper, we propose a Language Guided Reasoning Network (LGR-NET) to fully exploit textual guidance for referring. To localize the referred object, we set a prediction token based on the textual features. Furthermore, to sufficiently exploit the textual features, we extend them with our Textual Feature Extender (TFE) from three aspects. First, we design a novel coordinate embedding based on the prediction token, which is incorporated into the visual features to promote language-related visual reasoning. Second, we employ the extracted textual features for Text-guided Cross-modal Alignment (TCA) and Fusion (TCF), alternately. Third, we devise a loss to enhance the alignment between the expression and the learnable prediction token. We conduct extensive experiments on five benchmark datasets, and the experimental results show that LGR-NET achieves a new state-of-the-art. Source code is available at https://github.com/lmc8133/LGR-NET.
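Since the abstract only names the coordinate embedding, the sketch below shows a generic normalized patch-coordinate embedding added to visual features; LGR-NET's text-guided variant conditions this on the prediction token and is defined in the linked source code.

```python
# Generic illustration of a coordinate embedding for patch features (an assumption
# for exposition; LGR-NET's text-guided variant differs in detail).
import torch
import torch.nn as nn

class CoordEmbedding(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2, dim)  # (x, y) normalized patch center -> feature dim

    def forward(self, visual: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
        # visual: [batch, grid_h * grid_w, dim] patch features
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, grid_h), torch.linspace(0, 1, grid_w), indexing="ij"
        )
        coords = torch.stack([xs, ys], dim=-1).reshape(1, -1, 2)  # [1, H*W, 2]
        return visual + self.proj(coords.to(visual.device))

emb = CoordEmbedding(dim=256)
out = emb(torch.randn(2, 14 * 14, 256), grid_h=14, grid_w=14)
print(out.shape)  # torch.Size([2, 196, 256])
```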

Language: English

Cited: 11

On the Adversarial Robustness of Multi-Modal Foundation Models
Christian Schlarmann, Matthias Hein

Published: Oct. 2, 2023

Multi-modal foundation models combining vision and language, such as Flamingo or GPT-4, have recently gained enormous interest. Alignment of these models is used to prevent them from providing toxic or harmful output. While malicious users have successfully tried to jailbreak such models, an equally important question is whether honest users could be harmed by malicious third-party content. In this paper we show that imperceivable attacks on images $(\varepsilon_\infty = 1/255)$ that change the caption output of a multi-modal foundation model can be used by malicious content providers to harm honest users, e.g. by guiding them to malicious websites or broadcasting fake information. This indicates that countermeasures against adversarial attacks should be used by any deployed multi-modal foundation model. Note: the paper contains such content only to illustrate the outcome of the attacks; it does not reflect the opinion of the authors.
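A PGD-style sketch of such an L∞-bounded image attack is given below. It assumes a differentiable `caption_loss` that is low when the model emits the attacker's target caption; the paper's exact attack, step sizes, and target models may differ.

```python
# Minimal PGD-style sketch of an L-infinity bounded image attack (a generic
# illustration, not the paper's exact method). `caption_loss` is an assumed
# callable, e.g. -log p(target caption | image) under the captioning model.
import torch

def linf_attack(image: torch.Tensor, caption_loss, eps: float = 1 / 255,
                step: float = 0.25 / 255, iters: int = 100) -> torch.Tensor:
    """Return an adversarial image within an eps-ball of `image` (values in [0, 1])."""
    adv = image.clone()
    for _ in range(iters):
        adv.requires_grad_(True)
        loss = caption_loss(adv)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - step * grad.sign()                # descend toward the target caption
            adv = image + (adv - image).clamp(-eps, eps)  # project back into the eps-ball
            adv = adv.clamp(0.0, 1.0)                     # keep a valid image
    return adv.detach()
```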

Language: English

Cited: 22