the consensus of modalities projected onto the same spatio-temporal dimension. At this stage, we can apply AI models (e.g., CNN, LSTM, RNN, Transformer) to create a multimodal embedding that is then used to predict events. In our case, we develop a model based on a 3D convolutional neural network (3DCNN) architecture that extracts the necessary features from a raster-video input to form the multimodal embedding. Figure 7 illustrates one of our 3DCNN models, published in IEEE Big Data 2019 [21].

To demonstrate the efficiency and influence of the 3DCNN, we apply this model to predict congestion using open data sources that can be easily accessed in cyberspace. We leverage our observation of the correlation between bad weather (e.g., heavy rain, flood, snow), traffic congestion, and human behavior (e.g., complaints about congestion and/or bad weather on SNS) during bad-weather periods to decide which data sources to import into our model. Hence, we use data captured from social networks (e.g., Twitter), meteorology agencies (e.g., XRAIN precipitation data, www.diasjp.net), and traffic agencies (e.g., Traffic Congestion Statistics Data, www.jartic.or.jp) as our multiple modalities.

The reason for choosing the congestion prediction topic is that congestion is one of the most prevalent transport challenges in large urban agglomerations [22]. Therefore, robust congestion prediction and the discovery of correlations between congestion and its surrounding environment, using data from different sources, have become a significant demand from society. Many researchers have developed multimodal AI models to predict congestion [23]–[25]. The common idea of these methods is to create a joint multimodal representation by embedding every single-modal representation into a common representation space, with or without constraints of time and location. Compared to these methods, the 3DCNN model differs significantly by turning the multimodal space, or joint representation space, into a popular data format (i.e., videos) and by being able to wrap an unlimited number of modalities into one space without any extra processing.

Figure 8 depicts how to wrap three different modalities into one space. First, we convert each modality into an individual channel by picking the data value of one mesh code and converting it into one pixel. The (longitude, latitude) of the mesh code is mapped to the image's (width, height), and the data value is normalized into [0, 255]. Then, we merge these channels to form the raster image (R, G, B). A minimal illustrative sketch of this wrapping step appears after the figure captions below.

Figure 9 depicts one example of using the 3DCNN to predict short-term congestion from traffic congestion, weather, and tweet data over the Kobe, Japan area. As depicted in Fig. 9, each picture is a single-channel raster-image instance containing only congestion information over the transportation network of the Kobe area. Each pixel of this image reports the congestion level of one mesh code; the brighter the color, the heavier the congestion. The latest version of the 3DCNN achieves outstanding results with MAE = 8.13 compared to other methods, as shown in Table 3. For more details, readers can refer to the original 3DCNN paper [21] and its latest extended version [26].

Conclusions

In this paper, we comprehensively discuss the vital role

Fig. 8 Collaborative Fusion by wrapping data into a raster video [21]
Table 3 Traffic congestion prediction models comparison (measured by MAE; lower is better) [26]
Fig. 9 Short-term prediction results using the 3D-CNN model on traffic congestion, precipitation, and tweets data [21], based on the raster image created as in Fig. 8
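To make the wrapping step of Fig. 8 concrete, the sketch below builds a short raster video from three mesh-grid modalities (congestion level, precipitation, and tweet counts). This is an illustration only and not the code of [21]: the grid size, the value ranges, and the helper names to_channel and wrap_frame are assumptions chosen for the example.

```python
import numpy as np

def to_channel(values, v_min, v_max):
    """Normalize one modality's mesh-grid values into a [0, 255] uint8 channel.

    values: 2D array indexed as (latitude row, longitude column), one cell per mesh code.
    v_min, v_max: assumed normalization range for this modality.
    """
    clipped = np.clip(values, v_min, v_max)
    scaled = (clipped - v_min) / (v_max - v_min) * 255.0
    return scaled.astype(np.uint8)

def wrap_frame(congestion, precipitation, tweet_count):
    """Merge three single-modality channels into one (H, W, 3) raster image (R, G, B)."""
    r = to_channel(congestion, 0.0, 100.0)    # assumed congestion level range per mesh code
    g = to_channel(precipitation, 0.0, 50.0)  # assumed precipitation range (mm/h)
    b = to_channel(tweet_count, 0.0, 20.0)    # assumed tweets-per-mesh-code range
    return np.stack([r, g, b], axis=-1)

# Toy example: a 4-time-step "raster video" over a 32x32 mesh grid,
# shaped (T, H, W, 3) and ready to feed a 3D convolutional network.
rng = np.random.default_rng(0)
H = W = 32
video = np.stack(
    [wrap_frame(rng.uniform(0, 100, (H, W)),
                rng.uniform(0, 50, (H, W)),
                rng.integers(0, 20, (H, W)))
     for _ in range(4)]
)
print(video.shape)  # (4, 32, 32, 3)
```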
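Continuing the sketch, a toy 3D convolutional network below consumes such a raster video and outputs a congestion map for the next time step. The layer sizes and the choice of PyTorch are again assumptions for illustration; they do not reproduce the 3DCNN architecture of [21] or [26].

```python
import torch
import torch.nn as nn

class TinyCongestionNet(nn.Module):
    """A toy 3D-CNN over a raster video (channels, time, height, width) that
    predicts a per-pixel congestion map; layer sizes are illustrative only."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Collapse the time axis, then map features to one output channel
        # (the predicted congestion level for each mesh code / pixel).
        self.head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, x):          # x: (batch, 3, T, H, W)
        h = self.features(x)       # (batch, 32, T, H, W)
        h = h.mean(dim=2)          # average over time -> (batch, 32, H, W)
        return self.head(h)        # (batch, 1, H, W)

# Reusing the numpy raster video built in the previous sketch:
x = torch.from_numpy(video).float().permute(3, 0, 1, 2).unsqueeze(0)  # (1, 3, T, H, W)
pred = TinyCongestionNet()(x)
print(pred.shape)  # torch.Size([1, 1, 32, 32])
```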