…semantic levels of textual queries (i.e., easy, complex, most complex). For more details, readers can refer to the original paper of MM-tracEvent [18][19].

3DCNN: A Multimodal AI Model for Spatio-temporal Event Prediction

The typical approach to dealing with multimodal data is data fusion, which aims to collect significant fragments of an object distributed across different modalities and normalize these fragments into the same space for easier manipulation. Three fusion methods are popular: early, late, and collaborative fusion [20]. The first fuses the data first and processes the fused data with specific models. The second analyzes each information stream independently with a particular model, then combines the outputs as the outcome. The last aims to promote collaboration among modalities to achieve an ideal consensus. Hence, we introduce a 3DCNN model with a specific collaborative fusion mechanism, called raster-images, to fuse spatio-temporal multimodal data into one unique data format that up-to-date computer vision deep learning models can efficiently utilize. Unlike other multimodal methods working with spatio-temporal data, we want to embed the spatio-temporal dimension into our model without converting it into an alternative space. In other words, we want to keep the geometric topology (i.e., time and location) and map other data into this coordinate system before projecting the whole dataset to other spaces. To do that, we adapt the MMCRAI framework to have two main components: spatio-temporal-based data wrapping and a multimodal space working as collaborative fusion and multimodal embedding.

To conduct the collaborative fusion, we invent a new fusion schema that can wrap different modalities into one unique spatial modality, namely a raster image. Using hashing techniques, we distribute multimodal data collected within a specific time window into the three channels (R, G, B) of a raster image whose pixels represent particular map areas (i.e., spatio-temporal constraints). When arranging these raster images along a time dimension, we produce a so-called raster video whose frames are raster images. Hence, the raster image perfectly replaces the multimodal space component of the general framework and guarantees

Fig. 7: 3D-CNN multi-source data deep learning architecture [26]
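To make the raster-image fusion concrete, the following is a minimal sketch (not the authors' code) of hashing one time window of multimodal records into the R, G, B channels of a raster image whose pixels correspond to map cells. The grid size, the modality-to-channel mapping, the record format, and the helper names `to_pixel` and `rasterize` are all illustrative assumptions.

```python
import numpy as np

GRID_H, GRID_W = 64, 64  # raster resolution: one pixel per map cell (assumed)
CHANNEL = {"traffic": 0, "weather": 1, "social": 2}  # assumed modality -> R/G/B

def to_pixel(lat, lon, bbox):
    """Hash a (lat, lon) point to a pixel inside the bounding box bbox."""
    lat_min, lat_max, lon_min, lon_max = bbox
    row = int((lat - lat_min) / (lat_max - lat_min) * (GRID_H - 1))
    col = int((lon - lon_min) / (lon_max - lon_min) * (GRID_W - 1))
    return row, col

def rasterize(records, bbox):
    """Fuse one time window of multimodal records into a raster image.

    records: iterable of (modality, lat, lon, value) tuples, value in [0, 1].
    Returns a (GRID_H, GRID_W, 3) float array: one channel per modality.
    """
    img = np.zeros((GRID_H, GRID_W, 3), dtype=np.float32)
    for modality, lat, lon, value in records:
        row, col = to_pixel(lat, lon, bbox)
        ch = CHANNEL[modality]
        img[row, col, ch] = max(img[row, col, ch], value)  # keep strongest signal
    return img

# Stacking successive windows along time yields a "raster video":
# video = np.stack([rasterize(w, bbox) for w in windows])  # shape (T, H, W, 3)
```

Because every modality lands in the same spatial grid, time and location are preserved exactly as the text describes: fusion happens by co-locating modalities in pixel space rather than by projecting them into a learned latent space first.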
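Since raster videos have the same shape as ordinary RGB video, any off-the-shelf 3D-CNN can consume them. Below is a minimal sketch of such a consumer, assuming PyTorch; the layer sizes and classification head are illustrative assumptions, not the architecture of Fig. 7.

```python
import torch
import torch.nn as nn

class RasterVideo3DCNN(nn.Module):
    """Toy 3D-CNN over a raster video; input shape (N, 3, T, H, W)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            # 3D kernels convolve jointly over time and the two map axes.
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # collapse remaining spatio-temporal dims
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)   # (N, 32)
        return self.classifier(h)         # event-prediction logits

# Usage with the raster video from the previous sketch:
# x = torch.from_numpy(video).permute(3, 0, 1, 2).unsqueeze(0)  # (1, 3, T, H, W)
# logits = RasterVideo3DCNN()(x)
```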
