to have a crossmodal translation (i.e., bidirectional mapping) are sent to the multimodal space component. Applications requiring retrieval and classification tasks, without translating from the multimodal representation back to the single-modal ones, can use this component without going further. Applications that require crossmodal translation in addition to classification tasks, such as multimodal query expansion and crossmodal retrieval, should proceed to the joint representation space and bidirectional mapping components. Based on this general structure, we have developed an MM-sensing family with two representatives, MM-AQI and MM-trafficEvent, as well as 3DCNN, for dealing with air pollution, driving safety, and congestion problems. While 3DCNN and MM-trafficEvent focus on the multimodal and crossmodal approaches, respectively, MM-AQI mixes both techniques.

3 MM-Sensing

MM-sensing stands for Multimedia Sensing: a virtual intelligent sensor that can predict complex events in the real world from multimodal observation data such as images, videos, sensory data, and texts. As the name suggests, MM-sensing mainly deals with multimedia data, which occupy a significant portion of all data due to the explosion of multimedia IoT devices and the high-speed bandwidth of the Internet (e.g., 5G, 6G). Another reason to build MM-sensing is that multimedia data contain vast semantic meaning that is hard to extract. Hence, it is useful to have an independent component that can provide such high-level semantic information to other processes or applications.

The following subsections explain how to downstream the general framework into different applications running in various domains. We introduce MM-AQI and MM-trafficEvent as two downstream versions of the general framework, working on air pollution prediction and traffic incident querying.

3.1 MM-AQI: a crossmodal AI for estimating air quality index from lifelog images

Air pollution harmfully impacts human life, including health, the economy, urban management, and climate change [9].
Unfortunately, air pollution prediction is not a trivial problem that can be solved by predicting a new value from a sole data source. Many factors can influence air pollution prediction, such as human activities (e.g., transportation, mining, construction, industrial and agricultural activities), weather (e.g., wind, temperature, humidity), and natural disasters (e.g., volcanic eruptions, earthquakes, wildfires). Moreover, data captured by individual modalities may not provide the complementary information needed to express the correlation and causality between these factors and air pollution. Hence, the multimodal learning approach, which consolidates multiple modalities covering the various factors mentioned above into a single air pollution prediction model, has become popular [10]–[12].

Although many methods have been established to monitor and predict air pollution, an eco-friendly method suited to personal usage remains the most significant challenge. Expensive, large-scale deployed devices that provide high-quality air pollution data do not exist in a dense grid even in developed countries, and the situation can be worse in developing and emerging countries. Besides, coping with such big multimodal data requires supercomputers or high-end GPU servers, which makes it hard to build an eco-friendly application for personal usage.

To cope with this challenge, we design a crossmodal

Fig. 1 The multimodal and crossmodal AI framework: a general design (VAE: Variational Autoencoders, Aug-Data: Augmented Data)
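To make the joint representation space and bidirectional mapping idea concrete, the following toy sketch illustrates it with random linear maps standing in for trained networks (e.g., the VAEs of Fig. 1). All names, dimensions, and the linear-encoder simplification are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Toy sketch (NOT the paper's implementation): each modality is encoded into
# a shared joint representation space; decoders map back out, so a sample
# from one modality can be "translated" into another -- the bidirectional
# mapping behind crossmodal retrieval and multimodal query expansion.

rng = np.random.default_rng(0)

D_IMG, D_TXT, D_JOINT = 8, 5, 3  # hypothetical feature dimensions

# Random linear "encoders"/"decoders" stand in for trained VAE networks.
enc_img = rng.normal(size=(D_IMG, D_JOINT))
enc_txt = rng.normal(size=(D_TXT, D_JOINT))
dec_img = rng.normal(size=(D_JOINT, D_IMG))
dec_txt = rng.normal(size=(D_JOINT, D_TXT))

def to_joint(x, enc):
    """Project a single-modal feature vector into the joint space."""
    return x @ enc

def translate(x, enc_src, dec_dst):
    """Crossmodal translation: source modality -> joint space -> target."""
    return to_joint(x, enc_src) @ dec_dst

img_feat = rng.normal(size=D_IMG)                 # e.g., a lifelog image feature
txt_feat = translate(img_feat, enc_img, dec_txt)  # image -> "text" feature

# Retrieval/classification applications can stop at the joint representation
# without going further to the bidirectional mapping component:
z = to_joint(img_feat, enc_img)
print(z.shape, txt_feat.shape)  # (3,) (5,)
```

Applications that only need retrieval or classification use `z` directly; only those needing translation invoke the decoder of the target modality.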