BSc Information and Computing Science

Zeyu Yin (Joey)

BSc Information and Computing Science student working on audio question answering and hierarchical audio intelligence

I am an undergraduate researcher at Xi'an Jiaotong-Liverpool University. My recent work studies modality importance in audio question answering. My current research proposal, on hierarchical audio intelligence, works toward unified representations for general audio understanding across recognition, grounding, and reasoning tasks.


Academic profile

Research on audio question answering and hierarchical audio intelligence.

Current emphasis: modality importance analysis in audio question answering and a research proposal on unified representations for general audio understanding.

Institution

Xi'an Jiaotong-Liverpool University

Location

Suzhou, China

Research proposal direction

Hierarchical audio intelligence for general audio understanding, with emphasis on abstraction levels, unified evaluation across tasks, efficiency, and robustness.

Selected research

A few project pages from recent work

Each entry summarizes a recent project. Selected pages will be expanded as more material becomes available.

Case study

Research project

Listening or Reading? An Empirical Study of Modality Importance Analysis Across AQA Question Types

A case study on how audio question answering systems rely on acoustic evidence versus textual or contextual shortcuts across different question types.

Audio Question Answering

Modality Weighting

Acoustic Reasoning

DCASE 2025

Interactive Research Preview

How do different question types depend on audio?

Compare 6 AQA question types and inspect whether the model is truly listening.

DCASE 2025 Task 5 · EchoTwin-QA · BEATs + BERT · 6 question types

Sound Counting · Mostly audio-grounded · Best 35.7% at lambda=0.9

Lambda sweep

Measured accuracy from the experiment, averaged across the available seeds for this question type.

Text-only (lambda=0.0): 30.4%

Audio-only (lambda=1.0): 25.9%

Readout

Range: 25.9% - 35.7%

Balanced: 30.4%

Audio-only: 25.9%

Key settings

Text-only

lambda=0.0

Accuracy: 30.4%

Reference setting for the comparisons below. Remains strong without audio.

Balanced

lambda=0.5

Accuracy: 30.4%

No material change vs text-only. Useful as a reference point in the sweep.

Audio-heavy

lambda=0.9

Accuracy: 35.7%

+5.4 pts vs text-only. Improves when audio contributes more.

Accuracy improves toward audio-heavy settings and peaks at lambda=0.9 (35.7%), then falls to 25.9% at audio-only (lambda=1.0).
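The page does not state how lambda enters the model. One common reading, assumed here purely for illustration, is late fusion: each answer option gets a score from an audio branch and a score from a text branch, and lambda is the weight on the audio score, so lambda=0.0 is text-only and lambda=1.0 is audio-only. The function names and the toy scores below are hypothetical and are not taken from the EchoTwin-QA code or its results.

```python
from typing import Dict, List, Sequence, Tuple

# One example: (audio_scores, text_scores, index_of_correct_option)
Example = Tuple[Sequence[float], Sequence[float], int]


def fuse_scores(audio: Sequence[float], text: Sequence[float], lam: float) -> List[float]:
    """Convex combination per option: lam weights audio, (1 - lam) weights text."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if len(audio) != len(text):
        raise ValueError("score vectors must have the same length")
    return [lam * a + (1.0 - lam) * t for a, t in zip(audio, text)]


def predict(audio: Sequence[float], text: Sequence[float], lam: float) -> int:
    """Return the index of the option with the highest fused score."""
    fused = fuse_scores(audio, text, lam)
    return max(range(len(fused)), key=fused.__getitem__)


def sweep_accuracy(examples: Sequence[Example], lambdas: Sequence[float]) -> Dict[float, float]:
    """Accuracy over the examples for each lambda in the sweep."""
    results: Dict[float, float] = {}
    for lam in lambdas:
        correct = sum(predict(a, t, lam) == y for a, t, y in examples)
        results[lam] = correct / len(examples)
    return results


# Made-up scores for three two-option questions (NOT experimental data).
TOY_EXAMPLES: List[Example] = [
    ([0.9, 0.1], [0.4, 0.6], 0),    # audio branch right, text branch wrong
    ([0.4, 0.6], [0.8, 0.2], 0),    # text branch right, audio branch wrong
    ([0.3, 0.7], [0.45, 0.55], 1),  # both branches right
]

if __name__ == "__main__":
    for lam, acc in sweep_accuracy(TOY_EXAMPLES, [0.0, 0.5, 1.0]).items():
        print(f"lambda={lam:.1f}: accuracy={acc:.2f}")
```

Under this assumption, a question type whose accuracy rises as lambda grows is leaning on acoustic evidence, while one that stays flat or drops is mostly answerable from text. In a real sweep the per-lambda accuracy would also be averaged over seeds, as the readout above describes.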


Challenge system

Research project

ECHOTWIN-QA: A Dual-Tower BEATs-BERT System for DCASE 2025 Task 5

An end-to-end audio question answering system built for the DCASE 2025 Challenge, with training, evaluation, and ablation studies conducted from scratch.

DCASE 2025

BEATs-BERT

End-to-end AQA


SURF project

Research project

Expressive Timing Modelling in Performed Classical Piano Music

A summer undergraduate research project exploring expressive timing in performed classical piano music through computational modeling.

Music information retrieval

Audio modeling

Research


Publications

Selected publications and reports

A concise view of recent work, including venue, year, contribution summary, and links.

Workshop paper

2025

Listening or Reading? An Empirical Study of Modality Importance Analysis Across AQA Question Types

DCASE 2025 Workshop

Zeyu Yin, Yiqiang Cai, Pingsong Deng, Xinyang Lyu, Shengchen Li

Designed the study, implemented modality-importance experiments, analyzed results across question types, and wrote the paper.


Technical report

2025

ECHOTWIN-QA: A Dual-Tower BEATs-BERT System for DCASE 2025 Task 5 Audio Question Answering

DCASE 2025 Challenge (Task 5)

Zeyu Yin, Ziyang Zhou, Yiqiang Cai, Shengchen Li, Xi Shao

Built the end-to-end AQA system from scratch, ran training and evaluation pipelines, conducted ablations, and wrote the technical report.


Technical report

2025

ADAPTF-SEPNET: AudioSet-Driven Adaptive Pre-training of TF-SEPNet for Multi-device Acoustic Scene Classification

DCASE 2025 Challenge

Ziyang Zhou, Zeyu Yin, Yiqiang Cai, Shengchen Li, Xi Shao

Contributed to model development and experimental evaluation, and supported results analysis and manuscript preparation.


Research interests

Current research themes

The work is organized around a few connected questions in audio question answering, hierarchical representations, evaluation, and robustness.

Hierarchical audio intelligence

Large audio models

Audio question answering

Modality importance analysis

Unified evaluation across tasks

Efficiency and robustness

Research framing

Clear questions, unified evaluation, and careful diagnostics.

The site brings together published work, ongoing research themes, and a case-study page that can later host figures, ablations, and interactive analysis.

Timeline

Academic trajectory

A preview of recent roles and milestones that situate current research interests in a broader academic path.

2025

DCASE 2025 participant and researcher

Developed an end-to-end AQA system and analyzed modality importance across question types

2025

Workshop and challenge papers

Worked on study design, experimentation, ablations, and writing for audio question answering projects

2024

SURF undergraduate researcher

Worked on expressive timing modelling in performed classical piano music at XJTLU

2023

Academic Excellence Award recipient

Received the University Academic Excellence Award with full scholarship support