As part of my effort to build a fall detection system, I'll share the paper I read this week.
Slot-VLM: SlowFast Slots for Video-Language Modeling
available here
Introduction
Large Language Models have gained a lot of attention due to their tremendous advancements in generating and understanding human text. These advancements have also brought benefits to captioning techniques [1].
MiniGPT-4 is an implementation that uses a Q-Former (Querying Transformer). The basic principle is that, as with any vision transformer, an image is divided into “tokens”. These tokens are produced by an image encoder, commonly known as the feature-extraction stage. The Q-Former then condenses them into a small set of query tokens, and these queries are fed into the LLM section of the model.
See [2], [3], and [4] for more information.
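To make that idea concrete, here is a minimal PyTorch sketch of such a query-based connector: a fixed set of learned queries cross-attends to the image-encoder tokens and yields a fixed-length sequence for the LLM. The class name, dimensions, and number of heads are my own illustrative assumptions, not the actual BLIP-2/MiniGPT-4 code; see [3, 4] for the real architecture.

```python
import torch
import torch.nn as nn

class QueryConnector(nn.Module):
    """Simplified Q-Former-style connector (illustrative sketch only)."""
    def __init__(self, num_queries=32, vision_dim=1024, hidden_dim=768, llm_dim=4096):
        super().__init__()
        # A fixed number of learned query embeddings (e.g. 32 per image)
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim))
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, image_tokens):                 # image_tokens: (B, N_patches, vision_dim)
        kv = self.vision_proj(image_tokens)
        q = self.queries.expand(image_tokens.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)          # queries attend to all patch tokens
        return self.to_llm(out)                      # (B, num_queries, llm_dim) -> prepended to the LLM input
```

However many patch tokens the image encoder produces, the LLM only ever sees the fixed number of query tokens, which is exactly why the per-frame cost below is counted as 32 tokens.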
What happens if we need to generate a description for a video or a livestream? For a 5-minute video analyzed at 1 fps, we would be looking at 300 frames; at 32 tokens each, that is 9,600 tokens in total. Keep in mind this is at only 1 fps. Some applications are time critical and should run at 25 to 60 fps. At 25 fps, still at 32 tokens per frame, we would need 240,000 tokens in total, a 25-fold increase, and we should not be indifferent to the amount of data we have to process. Having more computing power should not mean we stop caring about optimizing algorithms or models.
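As a quick sanity check on those numbers, here is the back-of-the-envelope arithmetic:

```python
# Visual-token budget for a 5-minute video at 32 tokens per frame
TOKENS_PER_FRAME = 32
DURATION_S = 5 * 60  # 300 seconds

for fps in (1, 25):
    frames = DURATION_S * fps
    tokens = frames * TOKENS_PER_FRAME
    print(f"{fps:>2} fps -> {frames:>5} frames -> {tokens:>7} visual tokens")

# 1 fps ->   300 frames ->    9600 visual tokens
# 25 fps ->  7500 frames ->  240000 visual tokens  (25x as many)
```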
The work proposed in [1] compares other video-oriented solutions that try to reduce the number of tokens needed per frame. However, as shown in the image below, each of those tokens is still dominated by nearby pixels, i.e., by local spatial information. The authors instead propose an approach oriented around objects and events as the units fed into the LLM.
To extract only object-centric or event-centric tokens, they make use of Slot Attention, a technique introduced in [5], where other works usually use a different type of connector, such as the Q-Former. In addition, the authors propose processing the video in two branches: a Slow branch that takes frames at a low frame rate but high spatial resolution to capture information about the objects in a scene, and a Fast branch that takes frames at a high frame rate but low spatial resolution to focus on the events in the video. After the visual tokens have been extracted by a visual encoder, they serve as the input to slot attention, which prepares them for the LLM.
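For readers unfamiliar with Slot Attention [5], here is a compact PyTorch sketch of the core mechanism that would sit between the visual encoder and the LLM: a small set of slots iteratively competes for the visual tokens via attention normalized over the slots, so that each slot ends up summarizing one object or event. The hyperparameters (number of slots, dimension, iterations) are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal Slot Attention module, after Locatello et al. [5]."""
    def __init__(self, num_slots=8, dim=64, iters=3, eps=1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        # Initial slots are sampled from a learned Gaussian
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs):                       # inputs: (B, N, dim) visual tokens
        b, n, d = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        sigma = self.slots_logsigma.exp().expand(b, self.num_slots, -1)
        slots = self.slots_mu.expand(b, self.num_slots, -1) \
              + sigma * torch.randn(b, self.num_slots, d, device=inputs.device)
        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: slots compete for each input token
            attn = (torch.einsum('bnd,bkd->bnk', k, q) * self.scale).softmax(dim=-1) + self.eps
            attn = attn / attn.sum(dim=1, keepdim=True)          # weighted mean over inputs
            updates = torch.einsum('bnk,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, d), slots_prev.reshape(-1, d)).reshape(b, self.num_slots, d)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots                                 # (B, num_slots, dim): a handful of object/event-centric tokens
```

As I understand the paper, one such module runs on the Slow branch to yield object-centric slots and another on the Fast branch to yield event-centric slots, and it is these few slots, rather than thousands of patch tokens, that are handed to the LLM.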
This new approach is indeed comparable to the state of the art, with a focus on faster inference without sacrificing accuracy, as their results in the table below show.
Uses for my research
If you’ve been following my thesis project and my past research, I’ve been trying to expand a model’s capability to understand and identify what constitutes a possible fall scenario, as well as a fallen person. Slot-VLM is an interesting choice, but it is overkill for this task. I read this paper because my advisor asked me to, since it seemed to have potential; however, it requires a very large dataset and resources that would make my application less accessible when I try to make my research open source. In any case, it was a fun and educational read, and I hope you enjoyed it too.
References
[1] J. Xu, C. Lan, W. Xie, X. Chen, and Y. Lu, “Slot-VLM: SlowFast Slots for Video-Language Modeling,” arXiv (Cornell University), Feb. 2024, doi: https://doi.org/10.48550/arxiv.2402.13088.
[2] A. Helwan, “Q-Former,” Medium, Dec. 22, 2023. https://abdulkaderhelwan.medium.com/q-former-1d83163975da (accessed Mar. 05, 2024).
[3] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models,” arXiv, Jan. 2023, doi: https://doi.org/10.48550/arxiv.2301.12597.
[4] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models,” arXiv.org, Apr. 20, 2023. https://arxiv.org/abs/2304.10592
[5] F. Locatello et al., “Object-Centric Learning with Slot Attention,” arXiv (Cornell University), Jan. 2020, doi: https://doi.org/10.48550/arxiv.2006.15055.