Figure 3.8: Combined image results from a mobile device with a Sony IMX363 sensor (f/1.9,
3.94 mm) in low light (1 lux). On the left (a) is the original captured image (1/10 sec., ISO
9600). In the middle (b) is an FPGA-processed result (1/10 sec., ISO 3200, 8 frames). On the
right (c) is the output of the FPGA accelerator (1/10 sec., ISO 3200, 15 frames). Output (c) is
clearly the sharpest, brightest, and crispest of the three.
3.2 Optimized Multimodal Approaches
As the need for intelligent, real-time decision-making grows, the focus expands
beyond preprocessing to encompass sophisticated AI models capable of running
on edge devices.
Edge computing refers to processing data at or near the
source of data generation, reducing latency and bandwidth usage compared to
centralized data centers. This shift necessitates models that are both powerful
and optimized for the limited computational resources typical of edge devices.
In this section, we explore highly optimized AI approaches, stemming from
our design methodology, in the domains of activity recognition, abnormal event
detection, object detection, and facial identification. These AI models feature a
novel multimodal design that employs different fusion levels and techniques to
exploit the complementary data captured by a wide range of sensors.
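One way to picture the fusion levels referred to here is the following minimal sketch, which contrasts feature-level (early) fusion with decision-level (late) fusion of two modality streams. The encoders are omitted, and the embedding sizes, class count, and module names are illustrative assumptions rather than the models developed in this work.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Feature-level (early) fusion: concatenate the modality embeddings,
    then classify with a shared head. Dimensions are assumed for illustration."""
    def __init__(self, img_dim=512, audio_dim=128, num_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, img_feat, audio_feat):
        fused = torch.cat([img_feat, audio_feat], dim=-1)
        return self.head(fused)

class LateFusionClassifier(nn.Module):
    """Decision-level (late) fusion: each modality is classified on its own
    and the per-class scores are averaged."""
    def __init__(self, img_dim=512, audio_dim=128, num_classes=10):
        super().__init__()
        self.img_head = nn.Linear(img_dim, num_classes)
        self.audio_head = nn.Linear(audio_dim, num_classes)

    def forward(self, img_feat, audio_feat):
        return (self.img_head(img_feat) + self.audio_head(audio_feat)) / 2

# Example: a batch of 4 synchronized windows with the assumed embedding sizes.
img_feat, audio_feat = torch.randn(4, 512), torch.randn(4, 128)
print(EarlyFusionClassifier()(img_feat, audio_feat).shape)  # torch.Size([4, 10])
```

Early fusion lets the classifier model cross-modal interactions directly, while late fusion keeps the per-modality models independent, which is often cheaper to deploy on constrained edge hardware.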
3.2.1 Multimodal Stream Classification
Real-world data is rarely confined to a single modality. Images, audio, and
other sensor data often coexist, offering complementary insights into complex
phenomena. Multimodal stream classification addresses the need to analyze and
interpret this diverse data holistically. By integrating information from multiple
modalities, we can unlock a deeper understanding of events, behaviors, and
patterns that would remain hidden when considering each modality in isolation.
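As a toy illustration of how complementary streams can reveal what a single modality hides, the sketch below averages per-modality class scores (a simple decision-level fusion); the probabilities and class labels are invented for the example and do not come from this work.

```python
import numpy as np

# Hypothetical per-window class probabilities over 3 classes
# ("normal", "abnormal_event", "other") from two unimodal classifiers.
video_probs = np.array([[0.50, 0.30, 0.20],   # window 0: video alone misses the event
                        [0.70, 0.20, 0.10]])  # window 1: normal
audio_probs = np.array([[0.20, 0.60, 0.20],   # window 0: audio gives a weak cue
                        [0.65, 0.25, 0.10]])

# Averaging the scores combines the complementary evidence:
# the abnormal event in window 0 is detected only after fusion.
fused = (video_probs + audio_probs) / 2
print(fused.argmax(axis=1))  # -> [1 0]
```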