Title: Multimodal foundation models: a taxonomy and reflections
Authors: Won Kim
Addresses: School of Computing, Gachon University, Seongnam, South Korea
Abstract: During the past decade, thousands of technical blogs and research papers on foundation machine learning models, in particular large language models (LLMs) and multimodal models, have inundated the technical literature. Tech companies vying for dominant positions or market share in the AI market have been releasing new or more advanced models in rapid succession. There has even been speculation that artificial general intelligence is imminent. In this paper, I first try to organise the many multimodal foundation models into a reasonable taxonomy. I hope such a taxonomy will help clarify the relationships among the different types of models. I then offer a reminder that, despite the amazing images and videos generated by the multimodal models and the amazing multimodal reasoning capabilities the top multimodal models have shown, there are glaring limitations. Further, I emphasise that there is an inevitable 'dark side' to the advancement of this technology. The LLMs and multimodal models have many potential negative impacts; and manufacturing, running, cooling, and disposing of the computers that support them contribute to the accelerated depletion of natural resources and contamination of the environment.
Keywords: multimodal model; large language model; multimodal generation; multimodal understanding; vision language model; video language model; audio language model; dark side of AI technology.
DOI: 10.1504/IJWGS.2024.143177
International Journal of Web and Grid Services, 2024 Vol.20 No.4, pp.505–531
Received: 05 Sep 2024
Accepted: 26 Sep 2024
Published online: 05 Dec 2024