Exploration of a Better Semantic Representation for Multimodal Information

Supervisors

Suitable for

MSc in Advanced Computer Science

Abstract

1. Introduction
Interaction between information of different modalities means that the representations interacting with each other come from different domains, such as text/voice and image/video. This is a very promising research area, as it can enable many real-world applications. For example, generating an image from a given text can help non-artists easily create visually appealing images and achieve visual effects that were not possible before, simply through natural language descriptions; manipulating an original image with a given text allows users to edit the image to match their preferences; and object detection and image captioning techniques can help people with disabilities better understand their surroundings.
There are many research directions involving the interaction between information from different domains. These include visual interpretation methods, such as (1) object/scene classification [26, 8, 18], (2) object detection [7, 10, 16], (3) image captioning [5, 20], and (4) visual question answering [2, 3, 1, 22], which aim to transfer visual data, such as videos or images, into abstract representations, such as text. They also include visual synthesis, such as (1) text-to-image generation [17, 21, 24, 25, 15, 23, 19, 27], possibly aided by scene graphs [4, 11] or semantic layouts (e.g., bounding boxes and segmentation masks) [9, 12], where the scene graphs and layouts encode semantic information about the desired objects to ease the generation process, and (2) image manipulation using natural language descriptions [6, 14, 13].
In this project proposal, we aim to explore a better semantic representation for multimodal information, such as knowledge graphs. Current approaches ignore the internal semantic relations within each piece of information; they simply feed the original information into a network and hope that the network captures these relations. If, instead, we first convert the source information from the different domains into the same semantic representation, we can then easily interact with (e.g., combine or filter) this information and achieve better interaction between the modalities.
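As a toy illustration of this idea, the sketch below assumes two hypothetical extractors (one for text, one for images) that both emit (subject, predicate, object) triples, and shows how the resulting graphs could be combined and filtered. The extractors are stubs with fixed outputs; building them well is exactly what the project would investigate.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def triples_from_text(caption: str) -> Set[Triple]:
    """Hypothetical text-side extractor (stubbed; a real one would use a parser or a learned model)."""
    return {("man", "rides", "horse"), ("horse", "on", "beach")}

def triples_from_image(image_path: str) -> Set[Triple]:
    """Hypothetical image-side extractor (stubbed; a real one would use detection + relation prediction)."""
    return {("man", "wears", "hat"), ("man", "rides", "horse")}

def combine(a: Set[Triple], b: Set[Triple]) -> Set[Triple]:
    """Merge two graphs expressed as triple sets (simple union)."""
    return a | b

def filter_about(triples: Set[Triple], entity: str) -> Set[Triple]:
    """Keep only triples that mention a given entity."""
    return {t for t in triples if entity in (t[0], t[2])}

text_graph = triples_from_text("a man rides a horse on the beach")
image_graph = triples_from_image("photo.jpg")

shared = combine(text_graph, image_graph)   # joint semantic representation
print(filter_about(shared, "man"))          # query or edit only the "man" node
```

Here the union and the entity filter merely stand in for whatever interaction operators turn out to be useful; the point is that both modalities end up in a single, queryable structure.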

2. Approach
This project focuses on exploring a better common semantic representation for multimodal information, which may involve investigating different semantic data structures, such as knowledge graphs and scene graphs. We then apply the proposed representation to different downstream tasks (e.g., text-to-image generation and visual question answering) to verify the effectiveness of the proposed method.
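To illustrate how a single shared representation could be evaluated across tasks, the sketch below assumes a graph encoder that pools node/relation embeddings into one feature vector, which a task-specific head (here a toy answer classifier for visual question answering) then consumes. All module names, sizes, and the pooling scheme are placeholders rather than the models the project would actually adopt.

```python
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Toy encoder: embeds node/relation ids of the shared graph and mean-pools them.
    A real system would likely use a relational GNN; this is only a placeholder."""
    def __init__(self, vocab_size: int = 1000, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (num_graph_tokens,) integer ids for nodes and relations
        return self.embed(token_ids).mean(dim=0)  # (dim,) pooled graph feature

class VQAHead(nn.Module):
    """Hypothetical answer classifier over the pooled graph feature."""
    def __init__(self, dim: int = 128, num_answers: int = 100):
        super().__init__()
        self.cls = nn.Linear(dim, num_answers)

    def forward(self, graph_feat: torch.Tensor) -> torch.Tensor:
        return self.cls(graph_feat)

encoder, vqa = GraphEncoder(), VQAHead()
graph_tokens = torch.tensor([3, 17, 42, 7])   # ids of nodes/relations in the shared graph
logits = vqa(encoder(graph_tokens))           # the same pooled feature could instead feed a generator
print(logits.shape)                           # torch.Size([100])
```

Swapping the head (e.g., a conditional image generator instead of the answer classifier) while keeping the encoder fixed is one way the effectiveness of the shared representation could be compared across downstream tasks.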

References
[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-
down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 6077–6086, 2018.
[2] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 39–48, 2016.
[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual
question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
[4] Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE
International Conference on Computer Vision, pages 4561–4569, 2019.
[5] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor
Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[6] Hao Dong, Simiao Yu, Chao Wu, and Yike Guo. Semantic image synthesis via adversarial learning. In Proceedings of the IEEE
International Conference on Computer Vision, pages 5706–5714, 2017.
[7] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic
segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on
computer vision, pages 630–645. Springer, 2016.
[9] Tobias Hinz, Stefan Heinrich, and Stefan Wermter. Semantic object accuracy for generative text-to-image synthesis. arXiv preprint
arXiv:1910.13321, 2019.
[10] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig
Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[11] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 1219–1228, 2018.
[12] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven text-to-image
synthesis via adversarial training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
12174–12182, 2019.
[13] Seonghyeon Nam, Yunji Kim, and Seon Joo Kim. Text-adaptive generative adversarial networks: Manipulating images with natural
language. In Advances in Neural Information Processing Systems, pages 42–51, 2018.
[14] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN
imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2085–2094, 2021.
[15] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. MirrorGAN: Learning text-to-image generation by redescription. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1505–1514, 2019.
[16] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
[17] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to
image synthesis. arXiv preprint arXiv:1605.05396, 2016.
[18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556, 2014.
[19] Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. DF-GAN: Deep fusion generative
adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865, 2020.
[20] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show,
attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages
2048–2057, 2015.
[21] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to
image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 1316–1324, 2018.
[22] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 21–29, 2016.
[23] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image
generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 833–842, 2021.
[24] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN: Text to
photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference
on Computer Vision, pages 5907–5915, 2017.
[25] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN++: Realistic
image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence,
41(8):1947–1962, 2018.
[26] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using
places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
[27] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic memory generative adversarial networks for text-to-image
synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5802–5810, 2019.