Multiple visual-semantic embedding for video retrieval from query sentence

Research output: Contribution to journal › Article › peer-review


Visual-semantic embedding aims to learn a joint embedding space in which related video and sentence instances lie close to each other. Most existing methods place instances in a single embedding space. However, they struggle to embed instances because it is difficult to match the visual dynamics in videos to the textual features in sentences, and a single space is not enough to accommodate the variety of videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that multiple relationships between instances can be captured, leading to compelling video retrieval. The final similarity between instances is produced by fusing the similarities measured in each embedding space with a weighted sum. Because the weights are determined from the query sentence, the framework can flexibly emphasize a particular embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, and the results are competitive with state-of-the-art methods. These experimental results demonstrate the effectiveness of the proposed multiple-embedding approach compared to existing methods.
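
The weighted-sum fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes cosine similarity within each space and softmax-normalized weights, neither of which the abstract specifies. The names `cosine`, `softmax`, `fused_similarity`, and `weight_logits` are hypothetical, and in a real system the per-space scores would be predicted from the query sentence by a learned module.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fused_similarity(video_embs, sentence_embs, weight_logits):
    """Weighted sum of the similarities measured in each embedding space.

    video_embs[k] and sentence_embs[k] are the embeddings of one video and
    one sentence in the k-th space. weight_logits[k] is a sentence-dependent
    score for that space (assumption: supplied by the caller here).
    """
    weights = softmax(weight_logits)
    sims = [cosine(v, s) for v, s in zip(video_embs, sentence_embs)]
    return sum(w * sim for w, sim in zip(weights, sims))


# Toy example with two embedding spaces (made-up numbers).
video = [[1.0, 0.0], [0.0, 1.0]]
sentence = [[1.0, 0.0], [1.0, 0.0]]

# Equal weights: the two per-space similarities (1.0 and 0.0) are averaged.
equal_score = fused_similarity(video, sentence, weight_logits=[0.0, 0.0])

# Weights that strongly favour the first space emphasize its similarity.
biased_score = fused_similarity(video, sentence, weight_logits=[10.0, -10.0])
```

In this toy case the sentence matches the video only in the first space, so shifting weight toward that space raises the fused score. That is the kind of flexible emphasis the abstract attributes to sentence-dependent weights.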

Original language: English
Article number: 3214
Journal: Applied Sciences (Switzerland)
Issue number: 7
Publication status: Published - 2021 Apr 1
Externally published: Yes


  • Multiple embedding spaces
  • Video retrieval
  • Visual-semantic embedding

ASJC Scopus subject areas

  • Materials Science(all)
  • Instrumentation
  • Engineering(all)
  • Process Chemistry and Technology
  • Computer Science Applications
  • Fluid Flow and Transfer Processes


