Multimodal AI - AI News

Multimodal March 4, 2026

PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment

The PSQE (Pseudo-Seed Quality Enhancement) framework represents a significant advancement in unsupervised multimodal ent...

arXiv cs.LG

Multimodal March 4, 2026

PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment

PSQE (Pseudo-Seed Quality Enhancement) is a novel method that addresses the critical bottleneck of imbalanced pseudo-ali...

arXiv cs.LG

Multimodal March 4, 2026

PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment

Researchers introduced Pseudo-Seed Quality Enhancement (PSQE), a novel method addressing imbalanced graph coverage in un...

arXiv cs.LG

Multimodal March 4, 2026

PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment

PSQE (Pseudo-Seed Quality Enhancement) is a novel theoretical-practical method that addresses pseudo-seed imbalance in u...

arXiv cs.LG

Multimodal March 4, 2026

PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment

The PSQE (Pseudo-Seed Quality Enhancement) framework addresses a core challenge in unsupervised Multimodal Entity Alignm...

arXiv cs.LG

Multimodal March 4, 2026

PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised Multimodal Entity Alignment

PSQE (Pseudo-Seed Quality Enhancement) is a novel framework addressing imbalanced pseudo-seed coverage in unsupervised m...

arXiv cs.LG

Multimodal March 4, 2026

Multimodal Multi-Agent Ransomware Analysis Using AutoGen

A novel AI framework using multi-agent systems and multimodal data fusion achieves state-of-the-art ransomware classific...

arXiv cs.LG

Multimodal March 4, 2026

Multimodal Multi-Agent Ransomware Analysis Using AutoGen

Researchers have developed a novel multimodal multi-agent AI framework for ransomware analysis that achieves a Macro-F1 ...

arXiv cs.LG

Multimodal March 4, 2026

Multimodal Multi-Agent Ransomware Analysis Using AutoGen

A novel multimodal multi-agent ransomware detection framework using AutoGen achieves superior classification performance...

arXiv cs.LG

Multimodal March 4, 2026

MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation

MoToRec is a novel AI framework that addresses the item cold-start problem in recommender systems by transforming multim...

arXiv cs.LG

Multimodal March 4, 2026

MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation

MoToRec is a novel framework that addresses the item cold-start problem in recommender systems by transforming multimoda...

arXiv cs.LG

Multimodal March 4, 2026

MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation

MoToRec (Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation) is a novel AI framework that addresse...

arXiv cs.LG

Multimodal March 4, 2026

Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding

Video TokenCom is a novel framework for semantic-aware video transmission that uses discrete tokens as unified units for...

arXiv cs.LG

Multimodal March 4, 2026

VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings

VL-KGE (Vision-Language Knowledge Graph Embeddings) is a novel framework that integrates Vision-Language Models like CLI...

arXiv cs.LG

Multimodal March 4, 2026

Sequence-Level Unsupervised Training in Speech Recognition: A Theoretical Study

Researchers have established a theoretical framework defining the precise conditions for successful unsupervised speech ...

arXiv cs.LG

Multimodal March 4, 2026

Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics

Whisper-RIR-Mega is a new open-source benchmark that pairs clean LibriSpeech audio with real room impulse responses from...

arXiv cs.LG

Multimodal March 4, 2026

Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics

Whisper-RIR-Mega is a benchmark dataset that pairs clean LibriSpeech samples with reverberant counterparts using real ro...

arXiv cs.LG

Multimodal March 4, 2026

Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics

The Whisper-RIR-Mega benchmark is a paired clean-reverberant speech dataset that systematically evaluates Automatic Spee...

arXiv cs.LG

Multimodal March 4, 2026

An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification

A new empirical analysis reveals that selective prediction—a key safety mechanism where AI models defer uncertain decisi...

arXiv cs.LG

Multimodal March 4, 2026

Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data

The Unified Modality-Quality (UMQ) framework is a novel AI approach that jointly addresses noisy and missing modalities ...

arXiv cs.LG

Multimodal March 4, 2026

Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data

The Unified Modality-Quality (UMQ) framework addresses both noisy and missing data modalities in multimodal AI as a sing...

arXiv cs.LG

Multimodal March 4, 2026

Addressing Missing and Noisy Modalities in One Solution: Unified Modality-Quality Framework for Low-quality Multimodal Data

The Unified Modality-Quality (UMQ) framework is a novel AI solution that jointly addresses noisy and missing modalities ...

arXiv cs.LG

Multimodal March 4, 2026

Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network

NETRA (Node Evaluation through Transformer-based Representation and Attention) is a novel multimodal graph transformer f...

arXiv cs.LG

Multimodal March 4, 2026

Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network

NETRA (Node Evaluation through Transformer-based Representation and Attention) is a multimodal AI framework that uses gr...

arXiv cs.LG

Multimodal March 4, 2026

Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network

Researchers developed NETRA, a multimodal graph transformer AI framework that prioritizes disease-relevant genes for Alz...

arXiv cs.LG

Multimodal March 4, 2026

UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling

UniTAF (Unified Text-to-Audio-and-Face) is a novel AI framework that integrates Text-to-Speech (TTS) and Audio-to-Face (...

arXiv cs.CV

Multimodal March 4, 2026

Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects

Kinematify is an AI framework that synthesizes articulated object models with high degrees of freedom directly from RGB ...

arXiv cs.CV

Multimodal March 4, 2026

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

The D2E (Desktop to Embodied AI) framework demonstrates that pretraining on large-scale desktop game interactions dramat...

arXiv cs.CV

Multimodal March 4, 2026

Zero-shot CT Super-Resolution using Diffusion-based 2D Projection Priors and Signed 3D Gaussians

A novel AI framework enables zero-shot CT super-resolution by integrating diffusion models for 2D projection enhancement...

arXiv cs.CV

Multimodal March 4, 2026

InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

InstructVLA is a novel vision-language-action model that bridges high-level reasoning with precise robotic manipulation ...

arXiv cs.CV

Multimodal March 4, 2026

Synthetic Perception: Can Generated Images Unlock Latent Visual Prior for Text-Centric Reasoning?

A new AI research paradigm called 'synthetic perception' investigates whether images generated by Text-to-Image (T2I) mo...

arXiv cs.CV

Multimodal March 4, 2026

Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward

Perception-R1 is a novel reinforcement learning technique that addresses the critical bottleneck in Multimodal Large Lan...

arXiv cs.CV

Multimodal March 4, 2026

Slot-BERT: Self-supervised Object Discovery in Surgical Video

Slot-BERT is a novel AI model for unsupervised object-centric learning in surgical videos that introduces a bidirectiona...

arXiv cs.CV

Multimodal March 4, 2026

APPO: Attention-guided Perception Policy Optimization for Video Reasoning

The APPO (Attention-guided Perception Policy Optimization) algorithm addresses the perception bottleneck in video AI by ...

arXiv cs.CV

Multimodal March 4, 2026

3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection

MedMAP is a novel Medical Modality-Aware Pretraining framework that enhances vision-language models for 3D medical imagi...

arXiv cs.CV

Multimodal March 4, 2026

From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

TraqPoint is a novel reinforcement learning framework that reframes keypoint detection as a sequential decision-making p...

arXiv cs.CV

Multimodal March 4, 2026

The Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation

The Garbage Dataset (GD) is a comprehensive benchmark containing 12,259 labeled images across ten household waste catego...

arXiv cs.CV

Multimodal March 4, 2026

WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning

WristMIR is a novel AI framework for retrieving analogous pediatric wrist radiographs to aid fracture diagnosis. It uses...

arXiv cs.CV

Multimodal March 3, 2026

Reasoning-Driven Multimodal LLM for Domain Generalization

arXiv:2602.23777v1 (announce type: new). Abstract: This paper addresses the domain generalization (DG) problem in deep lear...

arXiv cs.AI

Multimodal February 28, 2026

Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

arXiv:2602.21646v1 (announce type: new). Abstract: Multimodal Large Language Models (MLLMs) have achieved notable success i...

arXiv cs.CL

Multimodal February 26, 2026

Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics

arXiv:2602.19674v2 (announce type: replace-cross). Abstract: Remote monitoring of heart failure (HF) via speech signals pro...

arXiv cs.AI

Multimodal February 26, 2026

Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers

arXiv:2602.18022v2 (announce type: replace-cross). Abstract: Training-free control over editing intensity is a critical req...

arXiv cs.AI

Multimodal February 26, 2026

Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection

arXiv:2602.17484v2 (announce type: replace-cross). Abstract: Image Copy Detection (ICD) aims to identify manipulated conten...

arXiv cs.AI

Multimodal February 26, 2026

Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT

arXiv:2602.10359v2 (announce type: replace-cross). Abstract: Purpose: Translating foundation models into clinical practice ...

arXiv cs.AI

Multimodal February 26, 2026

FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

arXiv:2601.08026v3 (announce type: replace-cross). Abstract: Scientific compound figures combine multiple labeled panels in...

arXiv cs.AI

Multimodal February 26, 2026

KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification

arXiv:2512.09069v2 (announce type: replace-cross). Abstract: Age-related macular degeneration (AMD) and choroidal neovascul...

arXiv cs.AI

Multimodal February 26, 2026

Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

arXiv:2512.08639v2 (announce type: replace-cross). Abstract: Aerial Vision-and-Language Navigation (VLN) aims to enable unm...

arXiv cs.AI

Multimodal February 26, 2026

RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation

arXiv:2511.06899v3 (announce type: replace-cross). Abstract: Large Vision-Language Models (LVLMs) excel in multimodal reaso...

arXiv cs.AI

Multimodal February 26, 2026

Uncovering Grounding IDs: How External Cues Shape Multimodal Binding

arXiv:2509.24072v4 (announce type: replace-cross). Abstract: Large vision-language models (LVLMs) show strong performance a...

arXiv cs.AI

Multimodal February 26, 2026

Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

arXiv:2509.23744v2 (announce type: replace-cross). Abstract: Multimodal large language models (MLLMs) promise enhanced reas...

arXiv cs.AI

Multimodal February 26, 2026

EO-1: An Open Unified Embodied Foundation Model for General Robot Control

arXiv:2508.21112v5 (announce type: replace-cross). Abstract: The human ability to seamlessly perform multimodal reasoning a...

arXiv cs.AI

Multimodal February 26, 2026

Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

arXiv:2506.01085v2 (announce type: replace-cross). Abstract: Instruction tuning has been central to the success of recent v...

arXiv cs.AI

Multimodal February 26, 2026

Renaissance: Investigating the Pretraining of Vision-Language Encoders

arXiv:2411.06657v2 (announce type: replace-cross). Abstract: In the past several years there has been an explosion of avail...

arXiv cs.AI

Multimodal February 26, 2026

Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models

arXiv:2406.17115v3 (announce type: replace-cross). Abstract: Despite the outstanding performance in multimodal tasks, Large...

arXiv cs.AI

Multimodal February 26, 2026

NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors

arXiv:2602.22144v1 (announce type: cross). Abstract: Object hallucination is a critical issue in Large Vision-Language Mode...

arXiv cs.AI

Multimodal February 26, 2026

TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition

arXiv:2602.22039v1 (announce type: cross). Abstract: Low-resource automatic speech recognition (ASR) continues to pose sign...

arXiv cs.AI

Multimodal February 26, 2026

A Framework for Cross-Domain Generalization in Coronary Artery Calcium Scoring Across Gated and Non-Gated Computed Tomography

arXiv:2602.21935v1 (announce type: cross). Abstract: Coronary artery calcium (CAC) scoring is a key predictor of cardiovasc...

arXiv cs.AI

Multimodal February 26, 2026

DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

arXiv:2602.21864v1 (announce type: cross). Abstract: Vision-Language Models (VLMs) have emerged as versatile solutions for ...

arXiv cs.AI

Multimodal February 26, 2026

Excitation: Momentum For Experts

arXiv:2602.21798v1 (announce type: cross). Abstract: We propose Excitation, a novel optimization framework designed to acce...

arXiv cs.AI