Case Study

Listening or Reading? An Empirical Study of Modality Importance Analysis Across AQA Question Types

Do audio question answering models actually "listen," or can some questions be answered from text alone?

This page presents the project as an interactive research case study rather than a simple model demo. The core question: to what extent does the same audio question answering system rely on sound, on textual cues, or on both together, and how does that reliance shift across question types? By comparing six question types under different modality weights, the study shows which results are more likely to reflect genuine perceptual grounding and which may still stem from shortcut behavior.

DCASE 2025
Audio Question Answering
Modality Analysis
Multimodal Reasoning
Research Perspective

Different question types depend on "listening" to different degrees.

This page is organized around the scientific phenomenon itself: some question types remain partly solvable from text alone, while others depend more directly on perceptual grounding in sound.

The focus is not only on building EchoTwin-QA, but on using it to probe modality dependence, shortcut behavior, and what "audio-grounded reasoning" should mean in practice.

What this page shows

This is not just a model demo. It is an analysis of how modality dependence varies across question types in Audio Question Answering.

The key insight is that different types require different degrees of listening. Some remain partly answerable from text alone, while others genuinely require perceptual grounding in sound.

Highlighted takeaway

Aggregate accuracy alone cannot tell us whether a model is truly grounded in sound.

Question Types

Six question types, six different modality behaviors

Each category invites a different balance between acoustic evidence and textual cues. The cards below summarize the type, while the detail panel explains what can and cannot be shortcut.

mixed
Both


This category is useful because the two selected examples point in opposite directions: one stays solvable from text, while the other needs the sound itself.

This refreshed panel combines contrastive exported examples with a few newly curated reference clips, so the amount of per-mode detail available can differ across types.
What this type asks

The model has to combine what the prompt says with what is actually happening acoustically in the scene.

Modality reliance

Both modalities may matter, but the degree of dependence can vary substantially even within the same category.

Shortcut risk

If text already narrows the scene strongly, the model may still answer correctly without truly grounding in audio.

Per-type trend
Explorer

Inspect representative examples

The refreshed set swaps in the new selected_examples3 clips. Where per-mode outputs are available, you can compare text-only, balanced fusion, and audio-heavy behavior directly.

Both
Listening-dependent case
dev_aqa_39

Selected example

Audio reference
audio_00091.wav
Question

What might the ominous instrumental music and background noises suggest about the scene?

Answer choices
A. It represents a cheerful and festive environment
B. It suggests a light-hearted, comedic scene
C. It implies a romantic and serene setting
D. It indicates a tense and suspenseful atmosphere
Ground truth: D. It indicates a tense and suspenseful atmosphere
GT index 3
This example becomes correct only when audio dominates, showing that the same question type can also contain genuinely grounded cases.
1 / 2

Mode comparison

Use these cards to switch the active modality setting. When available, the card shows the exported correctness and answer probability for that mode.

Contrast Gallery

When the same question type splits in two directions

The refreshed cards keep the within-type comparison idea, while also accommodating newly added qualitative reference clips whose per-mode outputs have not yet been exported.

Both
mixed

Within-type contrast

This category is useful because the two selected examples point in opposite directions: one stays solvable from text, while the other needs the sound itself.

Listening-dependent case
dev_aqa_39

What might the ominous instrumental music and background noises suggest about the scene?

Text-only
Incorrect · 50.9%
Balanced fusion
Incorrect · 40.1%
Audio-heavy
Correct · 22.0%
This example becomes correct only when audio dominates, showing that the same question type can also contain genuinely grounded cases.
Ground truth: D. It indicates a tense and suspenseful atmosphere
Listening-dependent case
dev_aqa_627

What is the background sound accompanying the man's voice?

Text-only
Incorrect · 52.9%
Balanced fusion
Incorrect · 38.1%
Audio-heavy
Correct · 21.2%
This example becomes correct only when audio dominates, showing that the same question type can also contain genuinely grounded cases.
Ground truth: D. Mechanical hum
Sound Detection
text-favoring

Within-type contrast

Although this type sounds inherently acoustic, both selected examples remain answerable from text alone, showing how shortcut-friendly some detection questions can be.

Text-surviving case
fold2-a-0162

Which sound has the longest duration in the audio clip?

Text-only
Correct · 37.3%
Balanced fusion
Incorrect · 30.6%
Audio-heavy
Incorrect · 21.3%
This example stays correct under text-only evaluation, suggesting that same-type questions can still expose shortcut-friendly behavior.
Ground truth: C. Printer
Text-surviving case
fold2-a-0048

What are the time stamps for the onset and offset of the second clinking sound?

Text-only
Correct · 55.6%
Balanced fusion
Correct · 45.6%
Audio-heavy
Incorrect · 23.0%
This example stays correct under text-only evaluation, suggesting that same-type questions can still expose shortcut-friendly behavior.
Ground truth: C. Onset: 1.5s, offset: 7.0s
Apply Frequency
mixed

Within-type contrast

The selected example survives text-only evaluation even though the question nominally depends on actual pitch information.

Nuanced case
part1_dev_aqa_107

Given the following sound sequence: The first sound occurs before the first silence, the second sound occurs after the first silence, and the third sound occurs after the second silence. Which sound has the highest pitch?

Text-only
Correct · 97.0%
Balanced fusion
Correct · 73.5%
Audio-heavy
Correct · 26.3%
This example is included because its behavior changes across modes, even if the contrast is not purely text-versus-audio.
Ground truth: D. The second sound
Sound Counting
audio-grounded

Within-type contrast

Within this type, counting sometimes looks surprisingly easy from text cues, but other examples clearly require tracking repeated acoustic events.

Listening-dependent case
fold2-a-0157

How many times does the scissors sound occur in the audio clip?

Text-only
Incorrect · 57.6%
Balanced fusion
Incorrect · 47.6%
Audio-heavy
Correct · 22.8%
This example becomes correct only when audio dominates, showing that the same question type can also contain genuinely grounded cases.
Ground truth: C. Four times
Listening-dependent case
fold2-a-0298

How many different sounds are present in the audio clip?

Text-only
Incorrect · 57.4%
Balanced fusion
Incorrect · 45.8%
Audio-heavy
Correct · 23.0%
This example becomes correct only when audio dominates, showing that the same question type can also contain genuinely grounded cases.
Ground truth: B. Five
Remember
audio-grounded

Within-type contrast

Remember questions often support strong text-only behavior, but the selected pair shows that not every example in the category is equally shortcut-prone.

Listening-dependent case
part1_dev_aqa_226

What is a distinguishing characteristic of the species that produced the sound in the given recording?

Text-only
Incorrect · 54.9%
Balanced fusion
Incorrect · 40.9%
Audio-heavy
Correct · 23.7%
This example becomes correct only when audio dominates, showing that the same question type can also contain genuinely grounded cases.
Ground truth: D. It constructs bubble nets to trap schools of fish.
Nuanced case
part1_dev_aqa_102

Which of the following statements is true about the species that produced the sound in the given recording?

Text-only
Correct · 94.8%
Balanced fusion
Correct · 72.3%
Audio-heavy
Correct · 25.7%
This example is included because its behavior changes across modes, even if the contrast is not purely text-versus-audio.
Ground truth: C. It is primarily found in freshwater river systems.
Understand Acoustics
audio-grounded

Within-type contrast

Even this seemingly acoustic category is not uniform: the selected pair shows one text-surviving case and one example that more clearly requires listening.

Listening-dependent case
part1_dev_aqa_75

Based on the sound recording, which of the following most accurately describes the acoustic characteristics of the signal?

Text-only
Incorrect · 70.1%
Balanced fusion
Incorrect · 42.6%
Audio-heavy
Correct · 19.6%
This example becomes correct only when audio dominates, showing that the same question type can also contain genuinely grounded cases.
Ground truth: D. High-frequency sounds above 7 kHz with sharp, sporadic intensity changes over time.
Nuanced case
part1_dev_aqa_165

Based on the acoustic characteristics of the sound, which of the following best describes the main feature of the recording?

Text-only
Correct · 99.3%
Balanced fusion
Correct · 84.0%
Audio-heavy
Correct · 26.2%
This example is included because its behavior changes across modes, even if the contrast is not purely text-versus-audio.
Ground truth: C. A high-frequency continuous tone above 2 kHz, with no noticeable modulation.
Method

How the analysis works

The architecture supports the analysis, but the main scientific question is behavioral: what changes when audio and text are weighted differently?

1
Audio -> BEATs
2
Text -> BERT
3
Fusion -> classifier
4
Prediction

EchoTwin-QA uses a dual-tower BEATs + BERT setup. The case study then compares prediction behavior under different modality balances to ask whether different question types are grounded in audio, text, or both. The selected examples extend that logic by showing that even within one category, some items are much more shortcut-prone than others.

Technical notes

Preprocessing
  • Audio resampled to 16 kHz mono.
  • Clipping and padding applied to fit the model input pipeline.
  • Question and answer choices tokenized as a joint text sequence.
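The clip/pad step above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: resampling to 16 kHz would typically be handled by a library such as torchaudio or librosa before this point, and the fixed clip length (`MAX_SECONDS`) is an assumption.

```python
import numpy as np

TARGET_SR = 16_000            # 16 kHz mono, as in the preprocessing notes
MAX_SECONDS = 10              # hypothetical fixed clip length
MAX_SAMPLES = TARGET_SR * MAX_SECONDS

def to_mono(waveform: np.ndarray) -> np.ndarray:
    """Downmix a (channels, samples) array to mono by averaging channels."""
    return waveform.mean(axis=0) if waveform.ndim == 2 else waveform

def clip_or_pad(waveform: np.ndarray, max_samples: int = MAX_SAMPLES) -> np.ndarray:
    """Clip long clips and zero-pad short ones to a fixed sample count."""
    if len(waveform) >= max_samples:
        return waveform[:max_samples]
    return np.pad(waveform, (0, max_samples - len(waveform)))

stereo = np.zeros((2, 3 * TARGET_SR))        # 3 s of silent stereo audio
fixed = clip_or_pad(to_mono(stereo))
print(fixed.shape)                           # (160000,)
```

Fixing the sample count this way keeps batch shapes uniform for the audio encoder regardless of clip duration.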
Representation flow
  • BEATs audio encoder for acoustic representation.
  • BERT text encoder for question and choice representation.
  • Modality-weighted fusion followed by a lightweight classifier.
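The notes do not specify the exact form of the modality-weighted fusion, so the following is a minimal sketch under one plausible assumption: a convex combination of the two tower embeddings, followed by a linear head over the answer choices. All names (`fuse`, `classify`, `w_audio`) and the embedding dimension are illustrative.

```python
import numpy as np

def fuse(audio_emb: np.ndarray, text_emb: np.ndarray, w_audio: float) -> np.ndarray:
    """Convex modality-weighted fusion (hypothetical form):
    w_audio = 0.0 -> text-only, 0.5 -> balanced, 1.0 -> audio-heavy."""
    return w_audio * audio_emb + (1.0 - w_audio) * text_emb

def classify(fused: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Lightweight linear head over answer choices, with softmax probabilities."""
    logits = fused @ W + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
audio_emb, text_emb = rng.normal(size=128), rng.normal(size=128)
W, b = rng.normal(size=(128, 4)), np.zeros(4)

for name, w in [("text-only", 0.0), ("balanced", 0.5), ("audio-heavy", 1.0)]:
    probs = classify(fuse(audio_emb, text_emb, w), W, b)
    print(f"{name}: choice {probs.argmax()}, p={probs.max():.3f}")
```

Sweeping `w_audio` like this is what produces the text-only, balanced-fusion, and audio-heavy settings compared throughout the page.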
Findings

What the six types reveal

The takeaways below are framed as behavioral findings rather than as headline benchmark claims.

Within-type contrast
01

The same question type can contain both shortcut-friendly and listening-dependent examples.

The selected pair for each category is designed to make this internal contrast visible rather than to flatten the type into a single label.

Shortcut diagnosis
02

Correct text-only behavior does not necessarily imply genuine grounding.

Several examples stay correct when audio is down-weighted, showing how textual priors can sometimes substitute for listening.

Perceptual grounding
03

Other examples in the same category fail immediately without audio.

These cases make the perceptual side of AQA visible: the answer only stabilizes when the model has access to the sound signal.

Evaluation design
04

AQA evaluation should be example-aware as well as type-aware.

Per-type trends are useful, but curated within-type comparisons reveal scientific behavior that aggregate numbers alone can hide.

Research significance
05

The deeper question is not only whether the model is accurate, but why it is accurate.

This project uses controlled modality analysis to distinguish between robust multimodal reasoning and shortcut use.

Significance

Why this matters

The broader value of the project is not only in building an AQA system, but in asking a deeper scientific question about how that system behaves.

1

Showing that multimodal behavior must be inspected at the level of examples, not only at the level of categories or benchmark scores.

2

Helping diagnose whether audio-language systems are actually grounded in the intended sensory evidence.

3

Providing a more interpretable way to evaluate AQA systems than headline accuracy alone.

4

Motivating future audio-grounded AQA systems that are designed with shortcut diagnosis in mind.

Appendix

Technical appendix / implementation details

A compact expandable section for technical readers who want the data schema and implementation assumptions in one place.

Show implementation notes

JSON schema, preprocessing, embedding flow, and display pipeline.

Expand
JSON schema
  • id
  • question
  • choices
  • groundTruthText
  • groundTruthIndex
  • audio_url
  • question_type
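The schema fields above can be instantiated as a concrete record; here it is filled with the dev_aqa_39 example shown earlier on this page (the exact serialized layout of the real export may differ).

```python
import json

# Hypothetical record instantiating the schema fields listed above,
# filled with the dev_aqa_39 example shown earlier on this page.
record = {
    "id": "dev_aqa_39",
    "question": ("What might the ominous instrumental music and "
                 "background noises suggest about the scene?"),
    "choices": [
        "It represents a cheerful and festive environment",
        "It suggests a light-hearted, comedic scene",
        "It implies a romantic and serene setting",
        "It indicates a tense and suspenseful atmosphere",
    ],
    "groundTruthText": "It indicates a tense and suspenseful atmosphere",
    "groundTruthIndex": 3,
    "audio_url": "audio_00091.wav",
    "question_type": "mixed",
}

# Sanity check: the index and text fields must agree.
assert record["choices"][record["groundTruthIndex"]] == record["groundTruthText"]
print(json.dumps(record)[:40])
```

Keeping both `groundTruthText` and `groundTruthIndex` redundant like this makes cross-checks cheap when choices are reordered or re-exported.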
Preprocessing
  • Audio resampled to 16 kHz mono.
  • Clipping and padding applied to fit the model input pipeline.
  • Question and answer choices tokenized as a joint text sequence.
Embedding flow
  • BEATs audio encoder for acoustic representation.
  • BERT text encoder for question and choice representation.
  • Modality-weighted fusion followed by a lightweight classifier.
Display pipeline
  • This page now uses 11 refreshed selected examples from the latest curated set in selected_examples3.
  • The enriched export now restores explicit prediction text, answer probability, and correctness for text-only, balanced fusion, and audio-heavy modes.
  • Fallback handling for qualitative-only references is still kept in code, but the current file provides full state traces for the selected set.
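The fallback handling mentioned above might look roughly like this sketch: render the exported correctness and probability when a per-mode trace exists, and a placeholder otherwise. The `modes`, `correct`, and `prob` field names are assumptions, not the actual export keys.

```python
def mode_summary(example: dict, mode: str) -> str:
    """Format one modality mode for display, falling back gracefully for
    qualitative-only references whose per-mode outputs were not exported.
    The `modes` / `correct` / `prob` field names are assumptions."""
    trace = example.get("modes", {}).get(mode)
    if trace is None:
        return "Qualitative reference (no exported trace)"
    verdict = "Correct" if trace["correct"] else "Incorrect"
    return f"{verdict} · {trace['prob']:.1%}"

example = {
    "id": "dev_aqa_39",
    "modes": {"audio-heavy": {"correct": True, "prob": 0.22}},
}
print(mode_summary(example, "audio-heavy"))   # Correct · 22.0%
print(mode_summary(example, "text-only"))     # falls back to the placeholder
```

This keeps the card layout identical whether an example carries full state traces or is a qualitative-only reference clip.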
Resources

Resources

Primary documents and contact points for following up on the project.

Workshop paper

Full paper describing the modality weighting study.

Open
Technical report

Challenge system report for EchoTwin-QA.

Open
Code repository

Current public code entry point and related experiments.

Open
Contact / discuss

Reach out to discuss the project, data, or follow-up work.

Open