The advantage of multimodal AI is that a joint representation space can compensate for the lack of information in each disjoint modality and strengthen robust prediction across highly correlated modalities. Hence, we can build models that process and correlate data from multiple modalities.

Many surveys have been conducted to understand the use of multimodal AI for smart data analysis. In reference [1], the authors list the challenges of multimodal machine learning (e.g., representation, translation, alignment, fusion, co-learning), data types (e.g., text, video, images, audio), and applications (e.g., speech recognition and synthesis, event detection, emotion and affect, media description, multimedia retrieval). In reference [2], the authors emphasize a particular domain, computer vision, and introduce advances, trends, applications, and datasets of multimodal AI. In this survey, the authors discuss the general architecture of multimodal deep learning, where a dedicated feature-extraction step first processes each modality to create a modality representation. Then, these representations are fused into one joint representation space, and this space is projected onto one unique similarity measure (a minimal sketch of this pattern is given below). Several deep-learning models are covered in this survey, including ANN, CNN, RCNN, LSTM, etc.

In reference [3], the authors discuss crossmodal learning for dealing with the need to map from one modality to another and back, as well as to represent them in a joint representation space. This direction is similar to the human learning process, which composes a global perspective from multiple distinct senses and resources. For example, text-image matching, text-video crossmodal retrieval, emotion recognition, and image captioning are the most popular crossmodal applications, where people can use one modality to query another [4]–[6]. The main difference between multimodal and crossmodal learning is that crossmodal learning requires sharing the characteristics of different modalities to compensate for the lack of information, enabling data of one modality to retrieve/query/predict data of another modality. Unfortunately, this research direction is still far from expectations, and a big gap remains among research teams and domains [7].

In light of the above discussions, we are conducting research and development to build a multimodal and crossmodal AI framework for smart data analysis. The framework aims to provide additional intelligent layers to the data-analysis process that can flexibly switch between using only multimodal AI, only crossmodal AI, or hybrid multi-crossmodal AI for analyzing data. We also introduce several instances of this framework designed for particular domains, such as air pollution forecasting, congestion prediction, and traffic incident querying.

Multimodal and Crossmodal AI Framework for Smart Data Analysis

We have researched and developed the Multimodal and Crossmodal AI Framework (MMCRAI) to contribute to the evolution of multimodal and crossmodal AI in smart data analysis. This framework's significant advantage is a hybrid backbone that can flexibly be re-constructed in different ways to build a suitable individual model for a particular problem.
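To make the joint-representation pattern summarized above concrete, the following is a minimal sketch in PyTorch. The modality names, feature dimensions, and the concatenation-based fusion are illustrative assumptions, not the MMCRAI implementation: each modality has its own encoder, the resulting embeddings are fused into one shared vector, and a task head operates on that fused space.

# Minimal sketch of a joint multimodal representation (assumed example,
# not the MMCRAI implementation): one encoder per modality, fusion by
# concatenation, and a prediction head on the joint space.
import torch
import torch.nn as nn

class JointMultimodalModel(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, sensor_dim=32,
                 joint_dim=256, num_outputs=1):
        super().__init__()
        # One feature-extraction branch per modality.
        self.image_enc = nn.Sequential(nn.Linear(image_dim, joint_dim), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, joint_dim), nn.ReLU())
        self.sensor_enc = nn.Sequential(nn.Linear(sensor_dim, joint_dim), nn.ReLU())
        # Fusion: concatenate modality embeddings and project to the joint space.
        self.fusion = nn.Linear(3 * joint_dim, joint_dim)
        # Task head (e.g., a single regression output such as a pollution level).
        self.head = nn.Linear(joint_dim, num_outputs)

    def forward(self, image_feat, text_feat, sensor_feat):
        z = torch.cat([self.image_enc(image_feat),
                       self.text_enc(text_feat),
                       self.sensor_enc(sensor_feat)], dim=-1)
        joint = torch.relu(self.fusion(z))   # joint multimodal representation
        return self.head(joint)              # prediction from the fused space

model = JointMultimodalModel()
y = model(torch.randn(4, 2048), torch.randn(4, 768), torch.randn(4, 32))
print(y.shape)  # torch.Size([4, 1])

Concatenation is only one fusion option; attention-based or alignment-based fusion, as discussed above, can replace it without changing the overall structure of per-modality encoders feeding a common space.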
The framework is designed to take into account the following criteria:
– Strengthen robust prediction by enhancing similar modalities (i.e., multiple sensors capturing the same data).
– Enhance robust inference and generate new insights from different modalities by carrying complementary information about each other during the learning process (e.g., fusion, alignment, co-learning).
– Establish cross-modal inferences [8] to overcome noisy and missing data of one modality by using information (e.g., data structure, correlation, attention) found in another modality.
– Discover cross-modal attention to enhance cross-modal search (e.g., language-to-video retrieval, image captioning, translation) or to ensure semantic harmony among modalities (e.g., cheapfakes detection); a minimal sketch of this shared-space idea appears at the end of this section.

Currently, multimodal and crossmodal approaches work independently because their architectures are intentionally designed to be domain-dependent. Hence, a framework that allows people to integrate multimodal and crossmodal approaches into one unified process can enhance scaling and the ability to connect or incorporate different unified processes to solve multi-domain problems.

We design the framework as a hierarchical structure of multimodal and crossmodal approaches, where a suitable approach can be utilized depending on the purpose of the application. Figure 1 illustrates the framework's general structure, which aims to create a joint multimodal representation by embedding every single-modal representation into a common representation space. The design starts with the data pre-processing component. Here, disjoint modalities are gathered and pre-processed, for example by cleansing, fusing, and augmenting. Next, those modalities that do not need
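As a complement to the multimodal sketch above, the following is the minimal illustration of the cross-modal search criterion referenced in the list above: a text query and candidate video embeddings are projected into one shared space and ranked by cosine similarity, so one modality can be used to query another. The encoders, dimensions, and the cosine-ranking choice are assumptions for illustration, not the framework's actual components.

# Minimal sketch of cross-modal retrieval in a shared embedding space
# (assumed example): text and video features are projected into the same
# space and candidates are ranked by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEmbedder(nn.Module):
    def __init__(self, text_dim=768, video_dim=1024, shared_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.video_proj = nn.Linear(video_dim, shared_dim)

    def embed_text(self, text_feat):
        # L2-normalize so dot products equal cosine similarities.
        return F.normalize(self.text_proj(text_feat), dim=-1)

    def embed_video(self, video_feat):
        return F.normalize(self.video_proj(video_feat), dim=-1)

embedder = CrossModalEmbedder()
query = embedder.embed_text(torch.randn(1, 768))        # one text query
gallery = embedder.embed_video(torch.randn(100, 1024))  # 100 candidate videos
scores = query @ gallery.T                              # cosine similarities
best = scores.argmax(dim=-1)                            # index of best-matching video
print(best.item())

In practice, such projections are typically trained with a contrastive objective so that matching text-video pairs land close together in the shared space while non-matching pairs are pushed apart.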
