Title: Multimodal foundation models: a taxonomy and reflections

Authors: Won Kim

Addresses: School of Computing, Gachon University, Seongnam, South Korea

Abstract: During the past decade, thousands of technical blogs and research papers on foundation machine learning models, in particular large language models and multimodal models, have inundated the technical literature. Tech companies vying for dominant positions or market share in the AI market have been releasing new or more advanced models in rapid succession. There has even been speculation that artificial general intelligence is imminent. In this paper, I first try to organise the many multimodal foundation models into a reasonable taxonomy. I hope such a taxonomy will help readers understand the relationships among the different types of models. I then try to provide a reminder that, despite the amazing images and videos generated by multimodal models and the amazing multimodal reasoning capabilities top multimodal models have shown, there are glaring limitations. Further, I emphasise that there is an inevitable 'dark side' to the advancement of this technology. The LLMs and multimodal models have many potential negative impacts; and manufacturing, running, cooling, and disposing of the computers that power them contribute to the accelerated depletion of natural resources and contamination of the environment.

Keywords: multimodal model; large language model; multimodal generation; multimodal understanding; vision language model; video language model; audio language model; dark side of AI technology.

DOI: 10.1504/IJWGS.2024.143177

International Journal of Web and Grid Services, 2024 Vol.20 No.4, pp.505 - 531

Received: 05 Sep 2024
Accepted: 26 Sep 2024

Published online: 05 Dec 2024
