global to consider both low- and high-semantic features in the joint representation space. The key idea here is to estimate the complete AQI from noisy and missing observations of one sensory modality (i.e., the air pollution device) using structure found in another (i.e., images). In other words, with this model we can estimate the AQI value simply by using the features extracted from an image, which are mapped and linked to the proper AQI level inside the joint representation space through the bidirectional mapping.

The evaluation conducted on three different datasets collected from Japan, Vietnam, and India yields an accuracy of over 80% (F1-score). That is an impressive result when running on a low-cost device (e.g., a smartphone). Besides, MM-AQI not only estimates the air quality index but also offers several cues for understanding the causality of air pollution (Fig. 4). Thanks to the flexible architecture of MM-sensing, both autoencoder-decoder and transformer architectures can be applied to design the MM-AQI crossmodal. For more details, readers can refer to the original MM-AQI paper [13] and its extended version [14].

Table 1  PM2.5 Prediction Accuracy Comparison (F1-score)

Fig. 4  MM-AQI: an example of high PM2.5 and human activities captured by a lifelog camera [13]. In this case, objects = {vehicles}, areas = {dirty/dust}, and an abnormally high value of PM2.5 appear at the same place and time.
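As a rough illustration of this joint-representation idea, the sketch below projects an image embedding into a shared space and matches it against per-level AQI anchor embeddings. The module names, dimensions, and nearest-anchor rule are assumptions made for illustration only; this is not the actual MM-AQI implementation.

```python
# Minimal sketch (not the actual MM-AQI code): estimate an AQI level from an
# image embedding by projecting it into a joint representation space and
# matching it against learned AQI-level anchors. Dimensions, module names,
# and the nearest-anchor rule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpaceAQIEstimator(nn.Module):
    def __init__(self, image_dim=2048, sensor_dim=8, joint_dim=256, num_aqi_levels=6):
        super().__init__()
        # Modality-specific projections into the shared (joint) space.
        # During training, sensor readings would be projected with sensor_proj
        # into the same space so the two modalities align (the bidirectional
        # mapping); at inference, the image path alone is enough.
        self.image_proj = nn.Linear(image_dim, joint_dim)
        self.sensor_proj = nn.Linear(sensor_dim, joint_dim)
        # One learnable anchor per AQI level inside the joint space
        self.aqi_anchors = nn.Parameter(torch.randn(num_aqi_levels, joint_dim))

    def forward(self, image_features):
        # Map image features into the joint space and L2-normalize
        z = F.normalize(self.image_proj(image_features), dim=-1)
        anchors = F.normalize(self.aqi_anchors, dim=-1)
        # Cosine similarity to each AQI-level anchor; the closest anchor plays
        # the role of the AQI level "linked" to the image in the joint space
        scores = z @ anchors.t()                      # (batch, num_aqi_levels)
        return scores.argmax(dim=-1), scores

# Usage with dummy data: image features could come from a CNN backbone
estimator = JointSpaceAQIEstimator()
dummy_image_features = torch.randn(4, 2048)
levels, scores = estimator(dummy_image_features)
print(levels.shape, scores.shape)  # torch.Size([4]) torch.Size([4, 6])
```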
3.2 MM-trafficEvent: A crossmodal with attention to query images from textual queries

A dashcam, a video camera mounted on a vehicle, has become a popular and economical device for increasing road safety [15]. A new generation of dashcams, the smart dashcam, not only records all events happening during a journey but also alerts users (e.g., drivers, managers, coaches) to potential risks (e.g., crash, near-crash) and driving behaviors (e.g., distraction, drowsiness). One of the significant benefits of dashcam footage is that it can provide insights that support safe driving [16] (e.g., evidence for the police and insurance companies in traffic accidents, self-coaching, fleet management). Unfortunately, a significant obstacle to event retrieval is the lack of search tools for finding the right events in a large-scale dashcam database. The conventional approach to finding an event in dashcam footage is to manually browse a whole video from beginning to end, which consumes considerable labor, time, and money. The challenge is the semantic gap between the textual queries made by users and the visual dashcam data. It requires a crossmodal translation that enables retrieving related data of one modality (e.g., dashcam video shots) with data of another modality (e.g., textual queries) [5][7].

To provide a user-friendly tool that helps users quickly find the event they need, we introduce MM-trafficEvent, a text-image crossmodal with an attention-based search engine, by modifying the general MMCRAI architecture. Figure 5 illustrates the design of this function. First, we replace the "AI modal" modules with encoder models (i.e., Xception for images, BERT for text) that normalize and polish raw data into vector spaces. Second, we design the attention mechanisms as the joint representative space to provide additional focus on a specific area that has the same mapping from different modalities. In other words, we utilize self- and multi-head attention techniques [17] to generate the bidirectional mapping between text and image. In this design, we replace the joint representative space and the bidirectional mapping with a multi-head attention block and cross-modal attention scores, as depicted in Fig. 5.

The significant difference between our model and others is that we do not have a full training dataset of text-image pairs. In other words, we do not have an annotation/caption for each incident/suspect-event image. Hence, creating a complete text-image crossmodal retrieval system is almost impossible using only the dashcam video dataset. Instead of applying crossmodal translation directly, we utilize …
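As a hedged sketch of the multi-head attention block and cross-modal attention scores described above, the snippet below lets encoded text tokens attend over encoded image-region features and reduces the attention weights to a single query-image relevance score. The encoder output shapes, projection dimensions, and the pooling used for scoring are assumptions for illustration and are not taken from the MM-trafficEvent implementation.

```python
# Minimal sketch (not the MM-trafficEvent code): cross-modal attention where
# text-token embeddings (e.g., from BERT) attend over image-region embeddings
# (e.g., from Xception feature maps). Dimensions, pooling, and scoring are
# illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalAttentionScorer(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, joint_dim=256, num_heads=8):
        super().__init__()
        # Project both modalities into a common dimension for attention
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.image_proj = nn.Linear(image_dim, joint_dim)
        self.cross_attn = nn.MultiheadAttention(joint_dim, num_heads, batch_first=True)

    def forward(self, text_tokens, image_regions):
        # text_tokens:   (batch, num_tokens,  text_dim)
        # image_regions: (batch, num_regions, image_dim)
        q = self.text_proj(text_tokens)
        kv = self.image_proj(image_regions)
        # Text queries attend over image regions; attn_weights has shape
        # (batch, num_tokens, num_regions) when averaged over heads
        attended, attn_weights = self.cross_attn(q, kv, kv)
        # Reduce the attention weights to one relevance score per pair
        relevance = attn_weights.max(dim=-1).values.mean(dim=-1)  # (batch,)
        return relevance, attn_weights

# Usage with dummy encoder outputs: rank candidate frames for one text query
scorer = CrossModalAttentionScorer()
text = torch.randn(3, 12, 768)      # the same query repeated for 3 frames
frames = torch.randn(3, 49, 2048)   # e.g., a 7x7 Xception feature map per frame
scores, _ = scorer(text, frames)
print(scores.shape)                 # torch.Size([3]); higher = more relevant
```

In practice, such a heuristic score would typically be replaced by a ranking head trained with a contrastive or matching objective; the sketch only shows how cross-modal attention weights can serve as a relevance signal between a textual query and dashcam frames.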