Communications in Computer and Information Science, Journal year: 2025, Issue: unknown, pp. 127–140
Published: Jan. 1, 2025
Language: English
Information, Journal year: 2023, Issue: 14(4), pp. 242–242
Published: April 17, 2023
The field of Natural Language Processing (NLP) has undergone a significant transformation with the introduction of Transformers. Since this technology was first introduced in 2017, the use of transformers has become widespread and has had a profound impact on NLP. In this survey, we review open-access, real-world applications of transformers in NLP, specifically focusing on those where text is the primary modality. Our goal is to provide a comprehensive overview of the current state of the art, highlight their strengths and limitations, and identify future directions for research. In this way, we aim to provide valuable insights to both researchers and practitioners. In addition, we provide a detailed analysis of the various challenges faced in the implementation of these applications, including computational efficiency, interpretability, and ethical considerations. Moreover, we discuss the impact of transformers on the NLP community and their influence on the research and development of new models.
Language: English
Cited: 82
IEEE Transactions on Pattern Analysis and Machine Intelligence, Journal year: 2024, Issue: 46(12), pp. 9456–9478
Published: June 28, 2024
Knowledge graph reasoning (KGR), which aims to deduce new facts from existing ones based on logic rules mined from the underlying knowledge graphs (KGs), has become a fast-growing research direction. It has been proven to significantly benefit the usage of KGs in many AI applications, such as question answering, recommendation systems, etc. According to the graph types, KGR models can be roughly divided into three categories, i.e., static models, temporal models, and multi-modal models. Early works in this domain mainly focus on static KGR, while recent ones try to leverage temporal and multi-modal information, which are more practical and closer to the real world. However, no survey papers or open-source repositories comprehensively summarize and discuss these important directions. To fill the gap, we conduct the first survey tracing KGR from static to temporal and then to multi-modal KGs. Concretely, the models are reviewed under a bi-level taxonomy, i.e., a top level (graph types) and a base level (techniques and scenarios). Besides, the performances, as well as the datasets, are summarized and presented. Moreover, we point out the challenges and potential opportunities to enlighten the readers.
Language: English
Cited: 59
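To make the rule-based flavor of KGR concrete, the sketch below applies a single hand-written logic rule to a toy triple store by forward chaining; the rule, entities, and relations are illustrative assumptions, not taken from the survey above.

```python
# Forward chaining with one illustrative rule:
# born_in(X, Y) AND city_of(Y, Z) => nationality(X, Z)
from itertools import product

# Toy knowledge graph as (head, relation, tail) triples.
kg = {
    ("alice", "born_in", "paris"),
    ("paris", "city_of", "france"),
    ("bob", "born_in", "berlin"),
    ("berlin", "city_of", "germany"),
}

def apply_rule(triples):
    """Deduce new facts by chaining born_in and city_of."""
    inferred = set()
    for (h1, r1, t1), (h2, r2, t2) in product(triples, repeat=2):
        if r1 == "born_in" and r2 == "city_of" and t1 == h2:
            inferred.add((h1, "nationality", t2))
    return inferred - triples  # only facts not already in the graph

print(sorted(apply_rule(kg)))
# [('alice', 'nationality', 'france'), ('bob', 'nationality', 'germany')]
```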
2021 IEEE/CVF International Conference on Computer Vision (ICCV), Journal year: 2023, Issue: unknown, pp. 20349–20360
Published: Oct. 1, 2023
Despite thousands of researchers, engineers, and artists actively working on improving text-to-image generation models, systems often fail to produce images that accurately align with the text inputs. We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We calculate image faithfulness by checking whether existing VQA models can answer these questions using the generated image. TIFA is reference-free and allows for fine-grained, interpretable evaluations of generated images. TIFA also has better correlations with human judgments than existing metrics. Based on this approach, we introduce TIFA v1.0, a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.). We present a comprehensive evaluation of existing text-to-image models using TIFA v1.0 and highlight the limitations and challenges of current models. For instance, we find that current models, despite doing well on color and material, still struggle in counting, spatial relations, and composing multiple objects. We hope our benchmark will help carefully measure research progress in text-to-image synthesis and provide valuable insights for further research.
Language: English
Cited: 36
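The TIFA score described above reduces to a simple loop: generate question-answer pairs from the text, ask a VQA model each question about the image, and report the fraction answered correctly. The sketch below shows that loop with hypothetical stand-ins (`generate_qa_pairs`, `vqa_answer`) for the language model and the VQA model; it is not the authors' actual API.

```python
# A minimal, self-contained mock of TIFA-style scoring.

def generate_qa_pairs(prompt: str) -> list[tuple[str, str]]:
    # In TIFA this is done by a language model; hard-coded here.
    return [("What animal is shown?", "dog"),
            ("What color is the ball?", "red")]

def vqa_answer(image, question: str) -> str:
    # Placeholder for an off-the-shelf VQA model's answer.
    canned = {"What animal is shown?": "dog",
              "What color is the ball?": "blue"}
    return canned[question]

def tifa_score(image, prompt: str) -> float:
    """Fraction of generated QA pairs the VQA model answers correctly."""
    qa_pairs = generate_qa_pairs(prompt)
    correct = sum(vqa_answer(image, q) == a for q, a in qa_pairs)
    return correct / len(qa_pairs)

print(tifa_score(image=None, prompt="a dog playing with a red ball"))  # 0.5
```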
Proceedings of the AAAI Conference on Artificial Intelligence, Journal year: 2023, Issue: 37(9), pp. 10637–10647
Published: June 26, 2023
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align, and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer representations from pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of the uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up alignment and fusion of visual and textual representations at different semantic levels. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art METER model by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at https://github.com/microsoft/BridgeTower.
Language: English
Cited: 32
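The core idea above, connecting each top uni-modal layer to the matching cross-modal layer rather than feeding only the last layer, can be sketched in a few lines. The combination rule (residual add plus LayerNorm) and all dimensions below are simplifying assumptions, not the exact BridgeTower implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class Bridge(nn.Module):
    """Injects one uni-modal layer's output into the cross-modal stream."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, cross: torch.Tensor, uni: torch.Tensor) -> torch.Tensor:
        return self.norm(cross + uni)  # assumed fusion rule

dim, seq = 64, 8
bridges = nn.ModuleList(Bridge(dim) for _ in range(3))      # one per top layer
cross = torch.zeros(1, seq, dim)                            # cross-modal stream
uni_outputs = [torch.randn(1, seq, dim) for _ in range(3)]  # per-layer features

# Bottom-up: each uni-modal layer contributes at its own semantic level.
for bridge, uni in zip(bridges, uni_outputs):
    cross = bridge(cross, uni)
print(cross.shape)  # torch.Size([1, 8, 64])
```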
2021 IEEE/CVF International Conference on Computer Vision (ICCV), Journal year: 2023, Issue: unknown, pp. 15359–15370
Published: Oct. 1, 2023
Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, thus not fully exploiting the unique characteristic of video, i.e., the temporal dimension. In this paper, we propose HiTeA, a Hierarchical Temporal-Aware video-language pre-training framework with two novel pre-training tasks for yielding temporal-aware multi-modal representations of fine-grained temporal moment information and the contextual relations between video-text pairs. First, a cross-modal moment exploration task explores moments in videos by mining paired texts, which results in detailed video moment representations. Then, based on the learned representations, the inherent contextual relations are captured by aligning video-text pairs as a whole at different time resolutions with a multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement respectively. HiTeA also demonstrates strong generalization ability when transferred to downstream tasks in a zero-shot manner.
Language: English
Cited: 27
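The shuffling test mentioned above is easy to state operationally: score a video-text pair, shuffle the frame order, score again, and treat the drop as a measure of how much the model (or dataset) actually relies on temporal order. Below is a toy version in which `score_clip` is a hypothetical stand-in for a video-text matching model.

```python
import random

def score_clip(frames: list, text: str) -> float:
    # Placeholder model: rewards clips whose frames are in ascending order,
    # i.e. it is maximally temporal-reliant. A real model scores (video, text).
    ordered = sum(a < b for a, b in zip(frames, frames[1:]))
    return ordered / (len(frames) - 1)

frames = list(range(16))              # stand-in for 16 video frames
text = "a person opens a door"

original = score_clip(frames, text)
shuffled = frames[:]
random.shuffle(shuffled)
gap = original - score_clip(shuffled, text)
print(f"temporal reliance gap: {gap:.2f}")  # large gap => temporal order matters
```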
Information Fusion, Journal year: 2024, Issue: 106, pp. 102270–102270
Published: Jan. 28, 2024
Language: English
Cited: 18
Information Fusion, Journal year: 2024, Issue: unknown, pp. 102690–102690
Published: Sep. 1, 2024
Language: English
Cited: 17
The Visual Computer, Journal year: 2024, Issue: unknown
Published: March 29, 2024
2D image understanding is a complex problem within computer vision, but it holds the key to providing human-level scene comprehension. It goes further than identifying the objects in an image, and instead, it attempts to understand the scene. Solutions to this problem form the underpinning of a range of tasks, including image captioning, visual question answering (VQA), and image retrieval. Graphs provide a natural way to represent the relational arrangement between objects in an image, and thus, in recent years graph neural networks (GNNs) have become a standard component of many 2D image understanding pipelines, becoming a core architectural component, especially in the VQA group of tasks. In this survey, we review this rapidly evolving field and provide a taxonomy of the graph types used in these approaches, a comprehensive list of the GNN models used in this domain, and a roadmap of future potential developments. To the best of our knowledge, this is the first survey that covers image captioning, visual question answering, and image retrieval techniques that focus on using GNNs as the main part of their architecture.
Language: English
Cited: 15
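As a reminder of what the surveyed pipelines build on, the sketch below runs one round of message passing over a three-node scene graph; the graph, features, and mean-aggregation update are illustrative assumptions rather than any specific model from the survey.

```python
import torch

# Tiny scene graph: node 0 "dog", node 1 "ball", node 2 "grass".
edges = [(0, 1), (1, 0), (0, 2), (2, 0)]   # directed edge list
x = torch.randn(3, 16)                     # node features
W = torch.randn(16, 16) / 16 ** 0.5        # shared message transform

messages = torch.zeros_like(x)
counts = torch.zeros(3, 1)
for src, dst in edges:
    messages[dst] += x[src] @ W            # message from each neighbor
    counts[dst] += 1

# Mean-aggregate neighbor messages, then a residual ReLU update.
x_new = torch.relu(x + messages / counts.clamp(min=1))
print(x_new.shape)  # torch.Size([3, 16])
```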
IEEE Transactions on Circuits and Systems for Video Technology, Journal year: 2024, Issue: 34(8), pp. 7771–7784
Published: March 7, 2024
Language: English
Cited: 11
Published: Oct. 2, 2023
Multi-modal foundation models combining vision and language, such as Flamingo or GPT-4, have recently gained enormous interest. Alignment of these models is used to prevent them from providing toxic or harmful output. While malicious users have successfully tried to jailbreak such models, an equally important question is whether honest users could be harmed by malicious third-party content. In this paper we show that imperceivable attacks on images (ε∞ = 1/255), designed to change the caption output of a multi-modal foundation model, can be used by malicious content providers to harm honest users, e.g. by guiding them to malicious websites or broadcasting fake information. This indicates that countermeasures to adversarial attacks should be part of any deployed multi-modal model. Note: this paper contains content that illustrates the outcome of our attacks and does not reflect the opinion of the authors.
Language: English
Cited: 22
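The attack described above can be approximated with standard projected gradient descent under the stated L∞ budget of 1/255: take gradient steps on the image that push the model's output toward an attacker-chosen target, clamping the perturbation after each step. The tiny linear "model" below is a placeholder; a real attack would differentiate through a multi-modal model's captioning loss.

```python
import torch

eps = 1 / 255                                   # imperceptibility budget
image = torch.rand(3, 32, 32)                   # clean image in [0, 1]
target = torch.tensor([1.0])                    # attacker-chosen output

# Placeholder victim: a real attack targets a captioner's loss instead.
model = torch.nn.Sequential(torch.nn.Flatten(0),
                            torch.nn.Linear(3 * 32 * 32, 1))

delta = torch.zeros_like(image, requires_grad=True)
for _ in range(10):                             # PGD iterations
    loss = (model(image + delta) - target).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        delta -= (eps / 4) * delta.grad.sign()  # step toward the target
        delta.clamp_(-eps, eps)                 # project back into the L-inf ball
        delta.grad.zero_()

print(f"max |delta| = {delta.abs().max().item():.4f} (budget {eps:.4f})")
```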