Multimodal AI

Advances in multimodal large-model technology and applications that integrate text, images, audio, and video.

Multimodal

PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment

The PSQE (Pseudo-Seed Quality Enhancement) framework represents a significant advancement in unsupervised multimodal ent...

PSQE (Pseudo-Seed Quality Enhancement) is a novel method that addresses the critical bottleneck of imbalanced pseudo-ali...

Researchers introduced Pseudo-Seed Quality Enhancement (PSQE), a novel method addressing imbalanced graph coverage in un...

PSQE (Pseudo-Seed Quality Enhancement) is a novel theoretical-practical method that addresses pseudo-seed imbalance in u...

The PSQE (Pseudo-Seed Quality Enhancement) framework addresses a core challenge in unsupervised Multimodal Entity Alignm...

PSQE (Pseudo-Seed Quality Enhancement) is a novel framework addressing imbalanced pseudo-seed coverage in unsupervised m...

Multimodal

Multimodal Multi-Agent Ransomware Analysis Using AutoGen

A novel AI framework using multi-agent systems and multimodal data fusion achieves state-of-the-art ransomware classific...

Researchers have developed a novel multimodal multi-agent AI framework for ransomware analysis that achieves a Macro-F1 ...

A novel multimodal multi-agent ransomware detection framework using AutoGen achieves superior classification performance...

Multimodal

MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation

MoToRec is a novel AI framework that addresses the item cold-start problem in recommender systems by transforming multim...

MoToRec is a novel framework that addresses the item cold-start problem in recommender systems by transforming multimoda...

MoToRec (Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation) is a novel AI framework that addresse...

Multimodal

Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding

Video TokenCom is a novel framework for semantic-aware video transmission that uses discrete tokens as unified units for...

Multimodal

VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings

VL-KGE (Vision-Language Knowledge Graph Embeddings) is a novel framework that integrates Vision-Language Models like CLI...

Multimodal

Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study

Researchers have established a theoretical framework defining the precise conditions for successful unsupervised speech ...

Multimodal

Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics

Whisper-RIR-Mega is a new open-source benchmark that pairs clean LibriSpeech audio with real room impulse responses from...

Whisper-RIR-Mega is a benchmark dataset that pairs clean LibriSpeech samples with reverberant counterparts using real ro...

The Whisper-RIR-Mega benchmark is a paired clean-reverberant speech dataset that systematically evaluates Automatic Spee...

Multimodal

An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification

A new empirical analysis reveals that selective prediction—a key safety mechanism where AI models defer uncertain decisi...

Multimodal

Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data

The Unified Modality-Quality (UMQ) framework is a novel AI approach that jointly addresses noisy and missing modalities ...

The Unified Modality-Quality (UMQ) framework addresses both noisy and missing data modalities in multimodal AI as a sing...

The Unified Modality-Quality (UMQ) framework is a novel AI solution that jointly addresses noisy and missing modalities ...

Multimodal

Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network

NETRA (Node Evaluation through Transformer-based Representation and Attention) is a novel multimodal graph transformer f...

NETRA (Node Evaluation through Transformer-based Representation and Attention) is a multimodal AI framework that uses gr...

Researchers developed NETRA, a multimodal graph transformer AI framework that prioritizes disease-relevant genes for Alz...

Multimodal

UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling

UniTAF (Unified Text-to-Audio-and-Face) is a novel AI framework that integrates Text-to-Speech (TTS) and Audio-to-Face (...

Multimodal

Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects

Kinematify is an AI framework that synthesizes articulated object models with high degrees of freedom directly from RGB ...

Multimodal

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

The D2E (Desktop to Embodied AI) framework demonstrates that pretraining on large-scale desktop game interactions dramat...

Multimodal

Zero-shot CT Super-Resolution using Diffusion-based 2D Projection Priors and Signed 3D Gaussians

A novel AI framework enables zero-shot CT super-resolution by integrating diffusion models for 2D projection enhancement...

Multimodal

InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

InstructVLA is a novel vision-language-action model that bridges high-level reasoning with precise robotic manipulation ...

Multimodal

Synthetic Perception: Can Generated Images Unlock Latent Visual Prior for Text-Centric Reasoning?

A new AI research paradigm called 'synthetic perception' investigates whether images generated by Text-to-Image (T2I) mo...

Multimodal

Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward

Perception-R1 is a novel reinforcement learning technique that addresses the critical bottleneck in Multimodal Large Lan...

Multimodal

Slot-BERT: Self-supervised Object Discovery in Surgical Video

Slot-BERT is a novel AI model for unsupervised object-centric learning in surgical videos that introduces a bidirectiona...

Multimodal

APPO: Attention-guided Perception Policy Optimization for Video Reasoning

The APPO (Attention-guided Perception Policy Optimization) algorithm addresses the perception bottleneck in video AI by ...

Multimodal

3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection

MedMAP is a novel Medical Modality-Aware Pretraining framework that enhances vision-language models for 3D medical imagi...

Multimodal

From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

TraqPoint is a novel reinforcement learning framework that reframes keypoint detection as a sequential decision-making p...

Multimodal

The Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation

The Garbage Dataset (GD) is a comprehensive benchmark containing 12,259 labeled images across ten household waste catego...

Multimodal

WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning

WristMIR is a novel AI framework for retrieving analogous pediatric wrist radiographs to aid fracture diagnosis. It uses...

Multimodal

Reasoning-Driven Multimodal LLM for Domain Generalization

arXiv:2602.23777v1 Announce Type: new Abstract: This paper addresses the domain generalization (DG) problem in deep lear...

Multimodal

Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

arXiv:2602.21646v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have achieved notable success i...

Multimodal

Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics

arXiv:2602.19674v2 Announce Type: replace-cross Abstract: Remote monitoring of heart failure (HF) via speech signals pro...

Multimodal

Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers

arXiv:2602.18022v2 Announce Type: replace-cross Abstract: Training-free control over editing intensity is a critical req...

Multimodal

Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection

arXiv:2602.17484v2 Announce Type: replace-cross Abstract: Image Copy Detection (ICD) aims to identify manipulated conten...

Multimodal

Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT

arXiv:2602.10359v2 Announce Type: replace-cross Abstract: Purpose: Translating foundation models into clinical practice ...

Multimodal

FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

arXiv:2601.08026v3 Announce Type: replace-cross Abstract: Scientific compound figures combine multiple labeled panels in...

Multimodal

KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification

arXiv:2512.09069v2 Announce Type: replace-cross Abstract: Age-related macular degeneration (AMD) and choroidal neovascul...

Multimodal

Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

arXiv:2512.08639v2 Announce Type: replace-cross Abstract: Aerial Vision-and-Language Navigation (VLN) aims to enable unm...

Multimodal

RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation

arXiv:2511.06899v3 Announce Type: replace-cross Abstract: Large Vision-Language Models (LVLMs) excel in multimodal reaso...

Multimodal

Uncovering Grounding IDs: How External Cues Shape Multimodal Binding

arXiv:2509.24072v4 Announce Type: replace-cross Abstract: Large vision-language models (LVLMs) show strong performance a...

Multimodal

Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

arXiv:2509.23744v2 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) promise enhanced reas...

Multimodal

EO-1: An Open Unified Embodied Foundation Model for General Robot Control

arXiv:2508.21112v5 Announce Type: replace-cross Abstract: The human ability to seamlessly perform multimodal reasoning a...

Multimodal

Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

arXiv:2506.01085v2 Announce Type: replace-cross Abstract: Instruction tuning has been central to the success of recent v...

Multimodal

Renaissance: Investigating the Pretraining of Vision-Language Encoders

arXiv:2411.06657v2 Announce Type: replace-cross Abstract: In the past several years there has been an explosion of avail...

Multimodal

Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models

arXiv:2406.17115v3 Announce Type: replace-cross Abstract: Despite the outstanding performance in multimodal tasks, Large...

Multimodal

NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors

arXiv:2602.22144v1 Announce Type: cross Abstract: Object hallucination is a critical issue in Large Vision-Language Mode...

Multimodal

TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition

arXiv:2602.22039v1 Announce Type: cross Abstract: Low-resource automatic speech recognition (ASR) continues to pose sign...

Multimodal

A Framework for Cross-Domain Generalization in Coronary Artery Calcium Scoring Across Gated and Non-Gated Computed Tomography

arXiv:2602.21935v1 Announce Type: cross Abstract: Coronary artery calcium (CAC) scoring is a key predictor of cardiovasc...

Multimodal

DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

arXiv:2602.21864v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) have emerged as versatile solutions for ...

Multimodal

Excitation: Momentum For Experts

arXiv:2602.21798v1 Announce Type: cross Abstract: We propose Excitation, a novel optimization framework designed to acce...