[Webinar] Teaching AI to Understand Images with Language: Foundations and Advances in Vision–Language Models (Speaker: Prof. Il-Min Kim)
Brief Center introduction
The AIM Center aims to realize unmanned manufacturing by integrating advanced artificial intelligence technologies across the entire manufacturing process. Its primary target industries include semiconductors, displays, batteries, biotechnology, future mobility, and robotics.
Overview of this seminar
This seminar introduces vision–language models, a type of artificial intelligence that connects images with natural language. It explains the basic ideas behind how these models learn from image–text data. The session also highlights practical applications such as image recognition and search, while discussing important challenges like bias, difficulty with unfamiliar objects, and high computational costs. It concludes with an overview of recent research aimed at improving fairness, robustness, and efficiency, as well as enhancing the models’ ability to capture detailed visual information.
Seminar Abstract
Recent advances in artificial intelligence (AI) have enabled models to learn from both images and natural language. Vision–language models allow computers to connect visual concepts with textual descriptions, making it possible to recognize objects, retrieve images, and perform complex visual reasoning tasks using natural language prompts. One of the most influential models in this area is Contrastive Language-Image Pre-Training (CLIP), which learns visual concepts from large collections of image–text pairs.
In this talk, we introduce the basic ideas behind vision–language models and explain how CLIP works. We then discuss several challenges that arise when deploying these systems in real-world environments, such as biased predictions, difficulty recognizing unfamiliar objects, and the computational cost of adapting large models. Finally, we present recent research that addresses these challenges by improving fairness, robustness, and efficiency in vision–language models, as well as new methods that enhance their ability to capture fine-grained visual details.
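For readers new to the topic, the contrastive idea behind CLIP-style training can be sketched in a few lines. This is a minimal illustration, not the speaker's material or the official CLIP implementation: it assumes image and text embeddings have already been produced by some encoders, the function names (`contrastive_loss`, `zero_shot_predict`) are hypothetical, and the temperature of 0.07 is only an illustrative value.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # pairwise similarities
    idx = np.arange(logits.shape[0])
    # Symmetric cross-entropy: match each image to its text, and each text to its image.
    loss_i2t = -log_softmax(logits)[idx, idx].mean()
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2

def zero_shot_predict(image_emb, label_emb):
    # Assign each image the label whose text embedding is most similar.
    sims = l2_normalize(image_emb) @ l2_normalize(label_emb).T
    return sims.argmax(axis=1)

if __name__ == "__main__":
    # Toy embeddings: four perfectly matched image-text pairs.
    images = np.eye(4)
    texts = np.eye(4)
    print(contrastive_loss(images, texts))
    print(zero_shot_predict(images, texts))
```

In this sketch, matched pairs are pulled together and mismatched pairs pushed apart, which is what later allows recognition and retrieval through natural language prompts: a label is written as text, embedded, and compared with the image embedding.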
Speaker Bio
Il-Min Kim received his B.S. degree in Electronics Engineering from Yonsei University, Seoul, Korea, in 1996, and his M.S. and Ph.D. degrees in Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Taejon, Korea, in 1998 and 2001, respectively. He then worked as a Postdoctoral Research Fellow in the Department of Electrical Engineering and Computer Sciences (EECS) at the Massachusetts Institute of Technology (MIT) from October 2001 to August 2002 and in the Department of Electrical Engineering at Harvard University from September 2002 to June 2003. In July 2003, he joined the Department of Electrical and Computer Engineering (ECE) at Queen’s University, Kingston, Canada, and he is currently serving as Head of the ECE Department.
His research focuses on artificial intelligence (AI), including agentic AI, physical AI, ubiquitous AI, edge AI, on-device AI, safe AI, universal equity AI, AI governance, AI alignment with human values, foundation models, AI for healthcare applications, machine unlearning, data privacy in machine learning, federated learning, distributed learning, continual learning, diffusion models, out-of-distribution (OOD) detection, self-supervised learning, contrastive representation learning, AI for IoT/IoE/IIoT/Mobile Crowd Sensing (MCS), AI-driven 6G wireless systems, AI-driven vehicle-to-everything (V2X) communications, and Geoscience AI (Geo-AI).