Research

Research highlights

A short overview of recent results and ongoing directions, with brief context for each area.

Current emphasis

Building hierarchical audio representations and evaluation protocols for general audio understanding.

The audio QA case study provides one concrete example, while the broader proposal extends toward cross-task transfer, efficiency, and robustness under realistic compute budgets.

Research theme

Audio question answering

I use audio question answering (AQA) as a concrete testbed for checking whether systems answer from acoustic evidence or lean on semantic and textual shortcuts.

  • Listening or Reading? An Empirical Study of Modality Importance Analysis Across AQA Question Types
  • ECHOTWIN-QA: A Dual-Tower BEATSBERT System for DCASE 2025 Task 5 Audio Question Answering
Research theme

Modality importance analysis

A central theme in my work is measuring how much different input modalities contribute across question types, so that model behavior can be interpreted more rigorously.

  • Question-type-specific ablation studies
  • Acoustic versus semantic dependency analysis
Research theme

Hierarchical audio intelligence

My research proposal asks whether explicit hierarchical representations can improve generalization and robustness across heterogeneous audio tasks compared with flat representations at similar compute budgets.

  • Acoustic -> units -> events -> scenes -> semantics
  • Unified representation for general audio understanding
Research theme

Unified evaluation across tasks

A core direction in the proposal is to build a compact evaluation suite spanning recognition, audio-text grounding, and reasoning-style audio tasks, with analysis tied to specific abstraction levels.

  • Cross-task transfer matrix
  • Recognition, grounding, and reasoning tasks
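
One way to picture a cross-task transfer matrix is as a table where each entry is the score on a target task after training on a source task, divided by that target task's in-task score. The sketch below is an illustration under assumptions, not the proposal's actual protocol: the function name `relative_transfer`, the normalization choice, and every number in `scores` are hypothetical placeholders.

```python
def relative_transfer(scores, tasks):
    """Normalize a source-by-target score table by each target's in-task score.

    scores[src][tgt] is a higher-is-better metric on task `tgt` for a model
    trained on task `src`. The diagonal of the result is 1.0 by construction.
    """
    matrix = {}
    for src in tasks:
        matrix[src] = {}
        for tgt in tasks:
            baseline = scores[tgt][tgt]
            # Guard against a zero in-task score rather than dividing by it.
            matrix[src][tgt] = scores[src][tgt] / baseline if baseline else float("nan")
    return matrix


tasks = ["recognition", "grounding", "reasoning"]

# Placeholder numbers for illustration only; these are NOT measured results.
scores = {
    "recognition": {"recognition": 0.8, "grounding": 0.4, "reasoning": 0.3},
    "grounding": {"recognition": 0.6, "grounding": 0.5, "reasoning": 0.3},
    "reasoning": {"recognition": 0.4, "grounding": 0.25, "reasoning": 0.6},
}

transfer = relative_transfer(scores, tasks)
for src in tasks:
    print(src, [round(transfer[src][tgt], 2) for tgt in tasks])
```

Reading across a row shows how well one source task transfers to the others. Reading down a column shows which sources help a given target most.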
Research theme

Efficiency and robustness

I am interested in compute-aware scaling, efficient adaptation, and robustness under compression, noise, domain shift, and long-audio settings.

  • Compute-capability scaling curves
  • Robustness under compression and domain shift
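
For the noise axis of a robustness study, a common building block is mixing a noise clip into a clean signal at a chosen signal-to-noise ratio. The sketch below covers only that one axis and uses plain Python lists so it runs without dependencies. The helper names (`mix_at_snr`, `measured_snr_db`), the synthetic 440 Hz tone, and the 16 kHz sample rate are assumptions made for illustration.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(signal, noise, snr_db):
    """Add noise scaled so that signal power / noise power = 10 ** (snr_db / 10).

    Assumes `signal` and `noise` have the same length.
    """
    p_sig, p_noise = power(signal), power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    scale = math.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(signal, noisy):
    """Recover the SNR in dB from the clean signal and its noisy version."""
    residual = [y - s for s, y in zip(signal, noisy)]
    return 10 * math.log10(power(signal) / power(residual))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

for snr in (20, 10, 0):
    noisy = mix_at_snr(signal, noise, snr)
    print(snr, round(measured_snr_db(signal, noisy), 3))
```

Evaluating the same model on the outputs at several SNR levels gives one curve of accuracy against noise level. Compression and domain shift would need their own perturbation functions.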

Study design and analysis

Recent work uses audio question answering as a testbed for studying modality weighting, question-type-specific behavior, and hierarchical representations.

  • Controlled fusion coefficients for audio/text balance
  • Question-type-stratified accuracy and statistical tests
  • Diagnostics for shortcutting versus perceptual grounding
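
The first two items above can be sketched together: a single coefficient `alpha` blends per-answer scores from an audio branch and a text branch, and accuracy is then reported separately for each question type. This is a minimal illustration under assumptions, not the system described in the papers. The convex-combination form of fusion, the field names (`qtype`, `audio`, `text`, `label`), and the toy examples are all hypothetical, and statistical testing is left out.

```python
from collections import defaultdict


def fuse(audio_scores, text_scores, alpha):
    """Convex combination of per-answer scores.

    alpha = 1.0 uses audio only; alpha = 0.0 uses text only.
    """
    return [alpha * a + (1.0 - alpha) * t for a, t in zip(audio_scores, text_scores)]


def predict(audio_scores, text_scores, alpha):
    """Index of the highest fused score (first index wins on ties)."""
    fused = fuse(audio_scores, text_scores, alpha)
    return max(range(len(fused)), key=fused.__getitem__)


def stratified_accuracy(examples, alpha):
    """Accuracy per question type at a given fusion coefficient."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["qtype"]] += 1
        if predict(ex["audio"], ex["text"], alpha) == ex["label"]:
            correct[ex["qtype"]] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Toy placeholder examples; the scores are made up to show the mechanics.
examples = [
    {"qtype": "acoustic", "audio": [0.9, 0.1], "text": [0.4, 0.6], "label": 0},
    {"qtype": "acoustic", "audio": [0.2, 0.8], "text": [0.7, 0.3], "label": 1},
    {"qtype": "semantic", "audio": [0.6, 0.4], "text": [0.1, 0.9], "label": 1},
    {"qtype": "semantic", "audio": [0.3, 0.7], "text": [0.8, 0.2], "label": 0},
]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, stratified_accuracy(examples, alpha))
```

Sweeping `alpha` and plotting accuracy per question type gives a simple picture of which types depend on the audio branch and which can be answered from text alone.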

Benchmarks and tools

The projects build on publicly available benchmarks and toolchains so that the methods carry over to future work.

  • DCASE 2025 Task 5 multi-domain AQA benchmark
  • BEATs-based audio encoders and BERT-style text towers
  • PyTorch training pipelines with ablations and evaluation