Figure 3.8: Combined image results from a mobile device with a Sony IMX363 sensor (f/1.9,
3.94 mm) in low light (1 lux). On the left (a) is the original captured image (1/10 sec., ISO
9600). In the middle (b) is an FPGA-processed result (1/10 sec., ISO 3200, 8 frames). On the
right (c) is the output of the FPGA accelerator (1/10 sec., ISO 3200, 15 frames). Output (c) is
clearly the sharpest, brightest, and crispest of the three.
3.2 Optimized Multimodal Approaches
As the need for intelligent, real-time decision-making grows, the focus expands
beyond preprocessing to encompass sophisticated AI models capable of running
on edge devices.
Edge computing refers to processing data at or near the
source of data generation, reducing latency and bandwidth usage compared to
centralized data centers. This shift necessitates models that are both powerful
and optimized for the limited computational resources typical of edge devices.
In this section, we explore highly optimized AI approaches, stemming from
our design methodology, in the domains of activity recognition, abnormal event
detection, object detection, and facial identification. These AI models feature a
novel multimodal design that employs different fusion levels and techniques to
exploit the complementary data captured by a wide range of sensors.
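One way to picture the fusion levels referred to here is the following minimal sketch, which contrasts feature-level (early) fusion with decision-level (late) fusion of two modality streams. The encoders are omitted, and the embedding sizes, class count, and module names are illustrative assumptions rather than the models developed in this work.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Feature-level (early) fusion: concatenate the modality embeddings,
    then classify with a shared head. Dimensions are assumed for illustration."""
    def __init__(self, img_dim=512, audio_dim=128, num_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, img_feat, audio_feat):
        fused = torch.cat([img_feat, audio_feat], dim=-1)
        return self.head(fused)

class LateFusionClassifier(nn.Module):
    """Decision-level (late) fusion: each modality is classified on its own
    and the per-class scores are averaged."""
    def __init__(self, img_dim=512, audio_dim=128, num_classes=10):
        super().__init__()
        self.img_head = nn.Linear(img_dim, num_classes)
        self.audio_head = nn.Linear(audio_dim, num_classes)

    def forward(self, img_feat, audio_feat):
        return (self.img_head(img_feat) + self.audio_head(audio_feat)) / 2

# Example: a batch of 4 synchronized windows with the assumed embedding sizes.
img_feat, audio_feat = torch.randn(4, 512), torch.randn(4, 128)
print(EarlyFusionClassifier()(img_feat, audio_feat).shape)  # torch.Size([4, 10])
```

Early fusion lets the classifier model cross-modal interactions directly, while late fusion keeps the per-modality models independent, which is often cheaper to deploy on constrained edge hardware.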
3.2.1 Multimodal Stream Classification
Real-world data is rarely confined to a single modality. Images, audio, and
other sensor data often coexist, offering complementary insights into complex
phenomena. Multimodal stream classification addresses the need to analyze and
interpret this diverse data holistically. By integrating information from multiple
modalities, we can unlock a deeper understanding of events, behaviors, and
patterns that would remain hidden when considering each modality in isolation.
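As a toy illustration of how complementary streams can reveal what a single modality hides, the sketch below averages per-modality class scores (a simple decision-level fusion); the probabilities and class labels are invented for the example and do not come from this work.

```python
import numpy as np

# Hypothetical per-window class probabilities over 3 classes
# ("normal", "abnormal_event", "other") from two unimodal classifiers.
video_probs = np.array([[0.50, 0.30, 0.20],   # window 0: video alone misses the event
                        [0.70, 0.20, 0.10]])  # window 1: normal
audio_probs = np.array([[0.20, 0.60, 0.20],   # window 0: audio gives a weak cue
                        [0.65, 0.25, 0.10]])

# Averaging the scores combines the complementary evidence:
# the abnormal event in window 0 is detected only after fusion.
fused = (video_probs + audio_probs) / 2
print(fused.argmax(axis=1))  # -> [1 0]
```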