Multimodal AI
Advances in multimodal large-model technology and applications that fuse text, images, audio, and video.
PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment
The PSQE (Pseudo-Seed Quality Enhancement) framework represents a significant advancement in unsupervised multimodal ent...
PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment
PSQE (Pseudo-Seed Quality Enhancement) is a novel method that addresses the critical bottleneck of imbalanced pseudo-ali...
PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment
Researchers introduced Pseudo-Seed Quality Enhancement (PSQE), a novel method addressing imbalanced graph coverage in un...
PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment
PSQE (Pseudo-Seed Quality Enhancement) is a novel theoretical-practical method that addresses pseudo-seed imbalance in u...
PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment
The PSQE (Pseudo-Seed Quality Enhancement) framework addresses a core challenge in unsupervised Multimodal Entity Alignm...
PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment
PSQE (Pseudo-Seed Quality Enhancement) is a novel framework addressing imbalanced pseudo-seed coverage in unsupervised m...
Multimodal Multi-Agent Ransomware Analysis Using AutoGen
A novel AI framework using multi-agent systems and multimodal data fusion achieves state-of-the-art ransomware classific...
Multimodal Multi-Agent Ransomware Analysis Using AutoGen
Researchers have developed a novel multimodal multi-agent AI framework for ransomware analysis that achieves a Macro-F1 ...
Multimodal Multi-Agent Ransomware Analysis Using AutoGen
A novel multimodal multi-agent ransomware detection framework using AutoGen achieves superior classification performance...
MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation
MoToRec is a novel AI framework that addresses the item cold-start problem in recommender systems by transforming multim...
MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation
MoToRec is a novel framework that addresses the item cold-start problem in recommender systems by transforming multimoda...
MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation
MoToRec (Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation) is a novel AI framework that addresse...
Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding
Video TokenCom is a novel framework for semantic-aware video transmission that uses discrete tokens as unified units for...
VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings
VL-KGE (Vision-Language Knowledge Graph Embeddings) is a novel framework that integrates Vision-Language Models like CLI...
Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study
Researchers have established a theoretical framework defining the precise conditions for successful unsupervised speech ...
Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
Whisper-RIR-Mega is a new open-source benchmark that pairs clean LibriSpeech audio with real room impulse responses from...
Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
Whisper-RIR-Mega is a benchmark dataset that pairs clean LibriSpeech samples with reverberant counterparts using real ro...
Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
The Whisper-RIR-Mega benchmark is a paired clean-reverberant speech dataset that systematically evaluates Automatic Spee...
An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification
A new empirical analysis reveals that selective prediction—a key safety mechanism where AI models defer uncertain decisi...
Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data
The Unified Modality-Quality (UMQ) framework is a novel AI approach that jointly addresses noisy and missing modalities ...
Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data
The Unified Modality-Quality (UMQ) framework addresses both noisy and missing data modalities in multimodal AI as a sing...
Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data
The Unified Modality-Quality (UMQ) framework is a novel AI solution that jointly addresses noisy and missing modalities ...
Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network
NETRA (Node Evaluation through Transformer-based Representation and Attention) is a novel multimodal graph transformer f...
Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network
NETRA (Node Evaluation through Transformer-based Representation and Attention) is a multimodal AI framework that uses gr...
Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network
Researchers developed NETRA, a multimodal graph transformer AI framework that prioritizes disease-relevant genes for Alz...
UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling
UniTAF (Unified Text-to-Audio-and-Face) is a novel AI framework that integrates Text-to-Speech (TTS) and Audio-to-Face (...
Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
Kinematify is an AI framework that synthesizes articulated object models with high degrees of freedom directly from RGB ...
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
The D2E (Desktop to Embodied AI) framework demonstrates that pretraining on large-scale desktop game interactions dramat...
Zero-shot CT Super-Resolution using Diffusion-based 2D Projection Priors and Signed 3D Gaussians
A novel AI framework enables zero-shot CT super-resolution by integrating diffusion models for 2D projection enhancement...
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
InstructVLA is a novel vision-language-action model that bridges high-level reasoning with precise robotic manipulation ...
Synthetic Perception: Can Generated Images Unlock Latent Visual Prior for Text-Centric Reasoning?
A new AI research paradigm called 'synthetic perception' investigates whether images generated by Text-to-Image (T2I) mo...
Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward
Perception-R1 is a novel reinforcement learning technique that addresses the critical bottleneck in Multimodal Large Lan...
Slot-BERT: Self-supervised Object Discovery in Surgical Video
Slot-BERT is a novel AI model for unsupervised object-centric learning in surgical videos that introduces a bidirectiona...
APPO: Attention-guided Perception Policy Optimization for Video Reasoning
The APPO (Attention-guided Perception Policy Optimization) algorithm addresses the perception bottleneck in video AI by ...
3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection
MedMAP is a novel Medical Modality-Aware Pretraining framework that enhances vision-language models for 3D medical imagi...
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
TraqPoint is a novel reinforcement learning framework that reframes keypoint detection as a sequential decision-making p...
The Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation
The Garbage Dataset (GD) is a comprehensive benchmark containing 12,259 labeled images across ten household waste catego...
WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning
WristMIR is a novel AI framework for retrieving analogous pediatric wrist radiographs to aid fracture diagnosis. It uses...
Reasoning-Driven Multimodal LLM for Domain Generalization
arXiv:2602.23777v1. This paper addresses the domain generalization (DG) problem in deep lear...
Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
arXiv:2602.21646v1. Multimodal Large Language Models (MLLMs) have achieved notable success i...
Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
arXiv:2602.19674v2. Remote monitoring of heart failure (HF) via speech signals pro...
Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers
arXiv:2602.18022v2. Training-free control over editing intensity is a critical req...
Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection
arXiv:2602.17484v2. Image Copy Detection (ICD) aims to identify manipulated conten...
Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT
arXiv:2602.10359v2. Purpose: Translating foundation models into clinical practice ...
FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
arXiv:2601.08026v3. Scientific compound figures combine multiple labeled panels in...
KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
arXiv:2512.09069v2. Age-related macular degeneration (AMD) and choroidal neovascul...
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
arXiv:2512.08639v2. Aerial Vision-and-Language Navigation (VLN) aims to enable unm...
RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
arXiv:2511.06899v3. Large Vision-Language Models (LVLMs) excel in multimodal reaso...
Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
arXiv:2509.24072v4. Large vision-language models (LVLMs) show strong performance a...
Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
arXiv:2509.23744v2. Multimodal large language models (MLLMs) promise enhanced reas...
EO-1: An Open Unified Embodied Foundation Model for General Robot Control
arXiv:2508.21112v5. The human ability to seamlessly perform multimodal reasoning a...
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
arXiv:2506.01085v2. Instruction tuning has been central to the success of recent v...
Renaissance: Investigating the Pretraining of Vision-Language Encoders
arXiv:2411.06657v2. In the past several years there has been an explosion of avail...
Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models
arXiv:2406.17115v3. Despite the outstanding performance in multimodal tasks, Large...
NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
arXiv:2602.22144v1. Object hallucination is a critical issue in Large Vision-Language Mode...
TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
arXiv:2602.22039v1. Low-resource automatic speech recognition (ASR) continues to pose sign...
A Framework for Cross-Domain Generalization in Coronary Artery Calcium Scoring Across Gated and Non-Gated Computed Tomography
arXiv:2602.21935v1. Coronary artery calcium (CAC) scoring is a key predictor of cardiovasc...
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
arXiv:2602.21864v1. Vision-Language Models (VLMs) have emerged as versatile solutions for ...
Excitation: Momentum For Experts
arXiv:2602.21798v1. We propose Excitation, a novel optimization framework designed to acce...