Listening or Reading? An Empirical Study of Modality Importance Analysis Across AQA Question Types
Is an audio question answering model really listening, or can some questions be answered from text alone?
This page presents the project as an interactive research case study rather than a simple model demo. The core question: to what extent does the same audio question answering system rely on sound, on textual cues, or on both, and how does that reliance change with question type? By comparing how six question types behave under different modality weights, the study shows which results are more likely to reflect genuine perceptual grounding and which may still be shortcut behavior.
Different question types depend on listening to different degrees.
This page is organized around the scientific phenomenon itself: some question types remain partly solvable from text alone, while others depend more directly on perceptual grounding in sound.
What this page shows
This is not just a model demo. It is an analysis of how modality dependence varies across question types in Audio Question Answering.
The key insight is that different types require different degrees of listening. Some remain partly answerable from text alone, while others genuinely require perceptual grounding in sound.
Highlighted takeaway
Aggregate accuracy alone cannot tell us whether a model is truly grounded in sound.
Six question types, six different modality behaviors
Each category invites a different balance between acoustic evidence and textual cues. The cards below summarize the type, while the detail panel explains what can and cannot be shortcut.
Both
The model has to combine what the prompt says with what is actually happening acoustically in the scene.
Both modalities may matter, but the degree of dependence can vary substantially even within the same category.
If text already narrows the scene strongly, the model may still answer correctly without truly grounding in audio.
Inspect representative examples
The refreshed set swaps in the new selected_examples3 clips. Where per-mode outputs are available, you can compare text-only, balanced fusion, and audio-heavy behavior directly.
Selected example
What might the ominous instrumental music and background noises suggest about the scene?
Mode comparison
Use these cards to switch the active modality setting. When available, the card shows the exported correctness and answer probability for that mode.
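As a sketch of how such a per-mode comparison can be read programmatically, the snippet below mirrors the correctness and answer-probability fields mentioned above; the record layout, field names, and values are assumptions for illustration, not the actual export format:

```python
MODES = ["text_only", "balanced", "audio_heavy"]

def modes_correct(example):
    """List the modality settings under which this example is answered correctly."""
    return [m for m in MODES if example["modes"][m]["correct"]]

# Hypothetical exported record for one selected example.
ex = {
    "id": "demo-001",
    "modes": {
        "text_only":   {"correct": False, "prob": 0.31},
        "balanced":    {"correct": True,  "prob": 0.62},
        "audio_heavy": {"correct": True,  "prob": 0.81},
    },
}
print(modes_correct(ex))  # an example that only succeeds once audio is weighted in
```

An example like this one, which fails text-only but succeeds under both audio-weighted settings, is exactly the listening-dependent pattern the cards are meant to surface.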
When the same question type splits in two directions
The refreshed cards keep the within-type comparison idea, while also accommodating newly added qualitative reference clips whose per-mode outputs have not yet been exported.
Within-type contrast
This category is useful because the two selected examples point in opposite directions: one stays solvable from text, while the other needs the sound itself.
What might the ominous instrumental music and background noises suggest about the scene?
What is the background sound accompanying the man's voice?
Within-type contrast
Although this type sounds inherently acoustic, the selected examples show that some detection questions remain surprisingly shortcut-friendly while others genuinely need listening.
Which sound has the longest duration in the audio clip?
What are the time stamps for the onset and offset of the second clinking sound?
Within-type contrast
The selected pair highlights a strong internal contrast: one example survives text-only evaluation, while the other depends heavily on actual pitch information.
Given the following sound sequence: The first sound occurs before the first silence, the second sound occurs after the first silence, and the third sound occurs after the second silence. Which sound has the highest pitch?
Within-type contrast
Within this type, counting sometimes looks surprisingly easy from text cues, but other examples clearly require tracking repeated acoustic events.
How many times does the scissors sound occur in the audio clip?
How many different sounds are present in the audio clip?
Within-type contrast
Remember questions often support strong text-only behavior, but the selected pair shows that not every example in the category is equally shortcut-prone.
What is a distinguishing characteristic of the species that produced the sound in the given recording?
Which of the following statements is true about the species that produced the sound in the given recording?
Within-type contrast
Even this seemingly acoustic category is not uniform: the selected pair shows one text-surviving case and one example that more clearly requires listening.
Based on the sound recording, which of the following most accurately describes the acoustic characteristics of the signal?
Based on the acoustic characteristics of the sound, which of the following best describes the main feature of the recording?
How the analysis works
The architecture supports the analysis, but the main scientific question is behavioral: what changes when audio and text are weighted differently?
EchoTwin-QA uses a dual-tower BEATs + BERT setup. The case study then compares prediction behavior under different modality balances to ask whether different question types are grounded in audio, text, or both. The selected examples extend that logic by showing that even within one category, some items are much more shortcut-prone than others.
Technical notes
- Audio resampled to 16 kHz mono.
- Clipping and padding applied to fit the model input pipeline.
- Question and answer choices tokenized as a joint text sequence.
- BEATs audio encoder for acoustic representation.
- BERT text encoder for question and choice representation.
- Modality-weighted fusion followed by a lightweight classifier.
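The notes above can be sketched end to end. The following is a minimal illustration under stated assumptions: a fixed 10-second input window, and a simple convex weighting for the fusion step; the trained model's actual window length and fusion mechanism may differ.

```python
import numpy as np

TARGET_SR = 16000   # resample target from the notes above
MAX_SECONDS = 10    # assumed input window; the real pipeline's limit may differ

def fit_length(waveform, max_len=TARGET_SR * MAX_SECONDS):
    """Clip or zero-pad a mono waveform to the fixed model input length."""
    if len(waveform) >= max_len:
        return waveform[:max_len]
    return np.pad(waveform, (0, max_len - len(waveform)))

def fuse(audio_emb, text_emb, alpha):
    """Modality-weighted fusion sketch: alpha=0 is text-only,
    alpha=1 is audio-heavy, alpha=0.5 is balanced. The trained model
    may use a learned fusion rather than this convex combination."""
    return alpha * audio_emb + (1.0 - alpha) * text_emb

wav = np.random.randn(TARGET_SR * 3)        # a 3-second clip
padded = fit_length(wav)                    # zero-padded to the full window
fused = fuse(np.ones(4), np.zeros(4), 0.5)  # stand-ins for BEATs / BERT embeddings
```

Sweeping `alpha` from 0 to 1 while holding the example fixed is the behavioral probe the case study relies on: the prediction trajectory, not the architecture, carries the scientific signal.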
What the six types reveal
The takeaways below are framed as behavioral findings rather than as headline benchmark claims.
The same question type can contain both shortcut-friendly and listening-dependent examples.
The selected pair for each category is designed to make this internal contrast visible rather than to flatten the type into a single label.
Correct text-only behavior does not necessarily imply genuine grounding.
Several examples stay correct when audio is down-weighted, showing how textual priors can sometimes substitute for listening.
Other examples in the same category fail immediately without audio.
These cases make the perceptual side of AQA visible: the answer only stabilizes when the model has access to the sound signal.
AQA evaluation should be example-aware as well as type-aware.
Per-type trends are useful, but curated within-type comparisons reveal scientific behavior that aggregate numbers alone can hide.
The deeper question is not only whether the model is accurate, but why it is accurate.
This project uses controlled modality analysis to distinguish between robust multimodal reasoning and shortcut use.
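One way to make this example-aware reading concrete is a small labeling heuristic over per-mode correctness flags. The labels and the rule below are our own illustration of the idea, not part of the project's export:

```python
def classify_example(correct_by_mode):
    """Label an example by its modality behavior:
    'shortcut-prone' if text-only already succeeds,
    'listening-dependent' if success appears only once audio is weighted in,
    'unsolved' if no setting succeeds."""
    if correct_by_mode["text_only"]:
        return "shortcut-prone"
    if correct_by_mode["balanced"] or correct_by_mode["audio_heavy"]:
        return "listening-dependent"
    return "unsolved"

print(classify_example({"text_only": True,  "balanced": True,  "audio_heavy": True}))
print(classify_example({"text_only": False, "balanced": False, "audio_heavy": True}))
```

Aggregating these labels per question type would reproduce the within-type contrasts shown in the cards: most types contain a mix of shortcut-prone and listening-dependent items rather than a single uniform behavior.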
Why this matters
The broader value of the project is not only in building an AQA system, but in asking a deeper scientific question about how that system behaves.
Showing that multimodal behavior must be inspected at the level of examples, not only at the level of categories or benchmark scores.
Helping diagnose whether audio-language systems are actually grounded in the intended sensory evidence.
Providing a more interpretable way to evaluate AQA systems than headline accuracy alone.
Motivating future audio-grounded AQA systems that are designed with shortcut diagnosis in mind.
Technical appendix / implementation details
A compact expandable section for technical readers who want the data schema and implementation assumptions in one place.
Implementation notes: JSON schema, preprocessing, embedding flow, and display pipeline.
- id
- question
- choices
- groundTruthText
- groundTruthIndex
- audio_url
- question_type
- Audio resampled to 16 kHz mono.
- Clipping and padding applied to fit the model input pipeline.
- Question and answer choices tokenized as a joint text sequence.
- BEATs audio encoder for acoustic representation.
- BERT text encoder for question and choice representation.
- Modality-weighted fusion followed by a lightweight classifier.
- This page now uses 11 refreshed selected examples from the latest curated set in selected_examples3.
- The enriched export now restores explicit prediction text, answer probability, and correctness for text-only, balanced fusion, and audio-heavy modes.
- Fallback handling for qualitative-only references is still kept in code, but the current file provides full state traces for the selected set.
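Putting the schema fields above together, a record might look like the following. The field names come from the list above; the id, choices, and audio path are illustrative values, and the question text is one of the selected counting examples:

```python
import json

# Hypothetical record following the schema fields listed above;
# id, choices, and audio_url are illustrative, not from the actual dataset.
record = {
    "id": "counting_example",
    "question": "How many times does the scissors sound occur in the audio clip?",
    "choices": ["2", "3", "4", "5"],
    "groundTruthText": "3",
    "groundTruthIndex": 1,
    "audio_url": "audio/counting_example.wav",
    "question_type": "counting",
}

# The index and text fields should agree with each other.
assert record["choices"][record["groundTruthIndex"]] == record["groundTruthText"]
print(json.dumps(record, indent=2))
```

Keeping both `groundTruthText` and `groundTruthIndex` is redundant by design; the consistency check above is a cheap guard against mis-exported records.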
Resources
Primary documents and contact points for following up on the project.