
Academic Exchange

Progress in Video Captioning and Applying for a China Scholarship Council (CSC) Doctoral Scholarship
Published: 2025/05/24

Time: Monday, May 26, 2025, 10:00–11:30

Venue: Conference Room 406, Minxue Building

Speaker Biography:

Prof. Ruili Wang is a doctoral supervisor and a Fellow of Engineering New Zealand. He received his bachelor's degree from Huazhong University of Science and Technology, his master's degree from Northeastern University, and his PhD from Dublin City University. He currently serves as Associate Dean (Research) of the School of Mathematical and Computational Sciences at Massey University, New Zealand. His research spans artificial intelligence, machine learning, computer vision, speech processing, and natural language processing. He has been funded by several major and key New Zealand national research programmes. He serves on the editorial boards of multiple SCI-indexed journals, including IEEE Transactions on Multimedia (TMM), IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI), ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Knowledge and Information Systems (Springer), Applied Soft Computing (Elsevier), and Neurocomputing (Elsevier).


Abstract:

In this talk, Prof. Wang will present his team's recent research progress in artificial intelligence, with a particular focus on their latest results on video captioning, and will also explain how to apply for a doctoral scholarship from the China Scholarship Council (CSC).

Knowledge Enhancement and Disentanglement Learning for Video Captioning

Video captioning, bridging computer vision and natural language, is crucial for various knowledge-based systems in the age of video streaming. Recent video captioning approaches have shown promise by integrating additional text-related knowledge to enhance understanding of video content and generate more informative captions. However, methods relying heavily on knowledge graphs face several limitations, including (i) a restricted capacity to reason about complex relations among object words due to static logic rules, (ii) a lack of context awareness for spatio-temporal relation analysis in videos, and (iii) the complexity of manually constructing a knowledge graph. These limitations lead to insufficient semantic information and obstruct effective alignment between visual and textual modalities. To tackle these issues, we propose a novel knowledge enhancement and disentanglement learning method for video captioning. Our approach introduces a comprehensive and adaptable knowledge source to enhance text-related knowledge, thus directly improving caption generation. Specifically, we leverage a large language model to infer enriched semantic relations between object words and speech transcripts within video frames. By integrating visual, auditory, and textual information into universal tokens with task-specific prompts, our approach enhances semantic understanding and captures more diverse relations. Furthermore, we propose a novel modality-shared disentanglement learning strategy to better align modalities, enabling visual cues to be linked more precisely to their corresponding textual descriptions. Specifically, we disentangle the two modalities into shared and specific features, leveraging the shared features to ensure alignment while mitigating uncorrelated information. Extensive experiments demonstrate that our proposed method outperforms existing methods in both quantitative and qualitative evaluations.
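To make the disentanglement idea concrete, the sketch below illustrates one common way such a strategy can be realized: each modality's features are split into shared and modality-specific parts, the shared parts of the two modalities are pulled together by an alignment loss, and an orthogonality penalty discourages overlap between shared and specific parts. This is a minimal illustration under assumed design choices (linear projections, cosine-based losses, placeholder dimensions and weights), not the authors' actual implementation.

```python
# Minimal sketch of modality-shared disentanglement, assuming linear
# projections and cosine-based losses; illustrative only, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledEncoder(nn.Module):
    """Splits one modality's features into shared and modality-specific parts."""
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.shared = nn.Linear(in_dim, hid_dim)    # projected into the shared space
        self.specific = nn.Linear(in_dim, hid_dim)  # kept as modality-specific

    def forward(self, x: torch.Tensor):
        return self.shared(x), self.specific(x)

def disentanglement_losses(vis_sh, vis_sp, txt_sh, txt_sp):
    # Alignment: shared features of the two modalities should agree.
    align = 1.0 - F.cosine_similarity(vis_sh, txt_sh, dim=-1).mean()
    # Orthogonality: shared and specific parts of each modality should not overlap.
    ortho = (F.cosine_similarity(vis_sh, vis_sp, dim=-1).abs().mean()
             + F.cosine_similarity(txt_sh, txt_sp, dim=-1).abs().mean())
    return align, ortho

# Example usage with random tensors standing in for video and caption embeddings.
vis_enc, txt_enc = DisentangledEncoder(512, 256), DisentangledEncoder(512, 256)
vis_sh, vis_sp = vis_enc(torch.randn(8, 512))
txt_sh, txt_sp = txt_enc(torch.randn(8, 512))
align_loss, ortho_loss = disentanglement_losses(vis_sh, vis_sp, txt_sh, txt_sp)
loss = align_loss + 0.1 * ortho_loss  # the weighting here is a placeholder choice
```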

All interested faculty and students are cordially invited to attend!


May 23, 2025

