My thesis journey — (The Final product)
I’ll share the state of the art, methods, code and more
A Comparison of Transformer implementations on a Siamese Configuration applied to Fall Detection.
It is well known that CNN-based architectures are increasingly being surpassed, and in many cases replaced, by Transformer implementations. There are still models, frameworks and applications that rely on CNN-based networks, but they usually perform poorly with temporal information when processing text or video [1,2,3].
Objectives for this research
- To develop a DL framework based on a Siamese configuration capable of detecting, or accurately classifying, the silhouette of a fallen person, with results comparable to state-of-the-art methods.
- To compare different Transformer implementations and determine which is most suitable.
Scope and Limitations
This research uses the work “Vision based human fall detection with Siamese convolutional neural networks” by S. Jeba Berlin and Mala John [4] both as a baseline for comparison and as the foundation for this framework.
The datasets used were:
- The University of Rzeszow Fall Detection Dataset (URFD) [5]
- The Fall Detection Dataset by the Université de Bourgogne (L2ei-FDD)[6]
which are widely used in this field.
State of the art
RGB camera-based fallen person detection system embedded on a mobile platform — S. Lafuente-Arroyo et al. 2022 [7]
In this work they assembled a mobile robot built around an NVIDIA Jetson TX2 board. The research is particularly interesting because they used two similar approaches: in the first, YOLOv3 extracts the features corresponding to a person and a Support Vector Machine classifies them as a fall or not; in the second, YOLOv3 detects the fall directly. In both cases, after a fall is detected, a facial recognition algorithm runs to identify which person in the nursing home has fallen.
RGB camera-based fall detection algorithm in complex home environments — Z. Tian, et al. 2022 [8]
In this research a fall detection system utilizing an RGB camera has been developed, capable of issuing alarm notifications upon detecting a fall. This system encompasses two key components: hardware and a software algorithm. The algorithm for fall detection comprises the following stages: (1) initialization to gather environmental data, (2) identifying human targets and joint locations through 2-dimensional pose detection, and (3) assessing fall occurrence using limb-length and multiframe analysis to validate practical characteristics.
Vision based human fall detection with Siamese convolutional neural networks
In general, a Siamese Network (SNN) has many advantages. An SNN is typically used to decide whether two images or instances belong to the same class, with applications ranging from signature verification to facial recognition. In the work by Berlin, et al., this approach is combined with a pre-processing step based on optical flow, which is used to determine which postures correspond to a fall [4].
Table 1 summarizes these methods.
Methods
PWC-Net
For the input, each video needs to be converted to optical flow format; here we replicate the same method as in [4]. For information on the model and how to install it, you can refer to the paper and the GitHub repository. The official Caffe implementation can be a bit challenging to install and build; from experience, it helps to use an Ubuntu 16.04 environment. If the installation gives you trouble, there is a pretty good re-implementation in PyTorch (the code is here), which you can build on Ubuntu 18.04.
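As a rough sketch of this conversion step, assuming a PyTorch re-implementation of PWC-Net wrapped behind an `estimate_flow(prev, curr)` helper (the wrapper module and function name are my own placeholders, not the official API), the frame-pair loop looks something like this:

```python
import cv2

# Hypothetical wrapper around a PyTorch PWC-Net re-implementation;
# estimate_flow(prev, curr) -> HxWx2 flow array is a placeholder name.
from pwcnet_wrapper import estimate_flow


def video_to_optical_flow(video_path):
    """Run PWC-Net on every pair of adjacent frames in a video."""
    cap = cv2.VideoCapture(video_path)
    flows = []
    ok, prev = cap.read()
    while ok:
        ok, curr = cap.read()
        if not ok:
            break
        flows.append(estimate_flow(prev, curr))  # 2-channel flow field
        prev = curr
    cap.release()
    return flows
```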
Vision Transformer
The ViT was proposed by Dosovitskiy, et al. in 2021 [9] as an adaptation of the Transformer introduced by Vaswani, et al. in 2017 for Natural Language Processing (NLP) tasks. Unlike CNNs, Transformers are much better at extracting spatial and temporal relationships, a success largely attributed to the attention block and positional encoding.
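To make that idea concrete, here is a minimal sketch of the patch embedding plus learnable positional encoding at the heart of a ViT. The sizes are illustrative defaults, not the configuration used in this thesis, and the class token is omitted for brevity:

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into patches, project them to tokens, and add a
    learnable positional encoding (illustrative sizes, no class token)."""

    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=192):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # A strided convolution patchifies and projects in one step.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.proj(x)                       # (B, dim, H/ps, W/ps)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim)
        return tokens + self.pos                    # positionally encoded tokens
```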
Databases
- Dataset-1: URFD — UR Fall Detection Dataset from the University of Rzeszow in Poland. It consists of 2 camera angles, 30 fall videos and 40 Daily Life Activity videos.
- Dataset-2: FDD — Fall Detection Dataset from the University of Burgundy [6]. It consists of 1 camera angle in different areas (home, kitchen, etc.). It is a public dataset of 130 videos.
Pre-processing
In the image below, we see the process of generating the data needed for training and testing. From a given video, we apply PWC-Net to adjacent frames; then, from the generated optical flow video, we sample only 20 frames per video for later use.
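A minimal sketch of that sampling step, assuming the optical-flow frames are already in a list (the uniform-index scheme below is my own choice; the pipeline only specifies that 20 frames per video are kept):

```python
import numpy as np


def sample_frames(frames, n_samples=20):
    """Pick n_samples frames spread evenly across the video."""
    if len(frames) <= n_samples:
        return list(frames)
    idx = np.linspace(0, len(frames) - 1, num=n_samples).round().astype(int)
    return [frames[i] for i in idx]
```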
Proposed architecture
Compared with the work by Berlin, et al., we use the same loss and similarity functions; the only difference is that we use transformer-based encoders for feature extraction. A Siamese network works by applying a difference or loss function to the pair of feature vectors extracted from the encoder, which gives us a measure of how close (similar) or far apart (different) the feature vectors are. Finally, the similarity function gives us the probability that the two images belong to the same class.
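A minimal sketch of this configuration is shown below: a single shared-weight encoder, an absolute-difference merge, and a sigmoid similarity head. The layer sizes are placeholders and the encoder is assumed to return one feature vector per image; this illustrates the description above rather than reproducing the exact implementation:

```python
import torch
import torch.nn as nn


class SiameseNet(nn.Module):
    """Shared-weight encoder pair with a similarity head (sketch)."""

    def __init__(self, encoder, feat_dim=192):
        super().__init__()
        self.encoder = encoder                     # any transformer encoder
        self.similarity = nn.Sequential(
            nn.Linear(feat_dim, 1), nn.Sigmoid()   # probability of same class
        )

    def forward(self, x1, x2):
        f1 = self.encoder(x1)                      # both branches share weights
        f2 = self.encoder(x2)
        diff = torch.abs(f1 - f2)                  # element-wise distance
        return self.similarity(diff)
```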
Transformer Encoders
Poolformer
Yu, et al. [10] determined that the overall architecture (MetaFormer) achieves competitive results regardless of the token mixer; PoolFormer, which uses simple pooling as the token mixer, reported the highest accuracy while using simpler operations.
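A simplified PoolFormer block, following the idea in [10] but omitting details such as LayerScale, looks roughly like this: the token mixer is plain average pooling, and the rest is the usual MetaFormer skeleton of norms, residuals and a channel MLP:

```python
import torch.nn as nn


class PoolFormerBlock(nn.Module):
    """MetaFormer block whose token mixer is average pooling (simplified)."""

    def __init__(self, dim, pool_size=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)
        # Pooling acts as the token mixer; subtracting the input keeps
        # only the mixing component, as in the PoolFormer paper.
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):                  # x: (B, C, H, W)
        y = self.norm1(x)
        x = x + (self.pool(y) - y)         # token mixing + residual
        x = x + self.mlp(self.norm2(x))    # channel MLP + residual
        return x
```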
Shifted Patch Tokenization — SPT and Locality Self-Attention — LSA
S. Hoon Lee, et al. (2021) introduced the SPT and LSA blocks to address the transformer's need for large amounts of training data or pre-training. Specifically, these blocks are designed to extract features from an image or patch more effectively [11,12].
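As an illustration, here is a sketch of the SPT idea: the image is concatenated with diagonally shifted copies of itself before patchification, so each patch sees a wider neighbourhood. The original paper uses zero-padded shifts; `torch.roll` is used here only for brevity, the sizes are placeholders, and LSA (a learnable softmax temperature plus masking of each token's attention to itself) is omitted for space:

```python
import torch
import torch.nn as nn


class ShiftedPatchTokenization(nn.Module):
    """Sketch of SPT [12]: tokenize the image together with four
    diagonally shifted copies (shift = half the patch size)."""

    def __init__(self, patch_size=16, in_ch=3, dim=192):
        super().__init__()
        self.shift = patch_size // 2
        self.proj = nn.Conv2d(in_ch * 5, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                                # x: (B, C, H, W)
        s = self.shift
        shifts = [(s, s), (s, -s), (-s, s), (-s, -s)]
        shifted = [torch.roll(x, sh, dims=(2, 3)) for sh in shifts]
        x = torch.cat([x] + shifted, dim=1)              # (B, 5C, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)
```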
Results
For the results, the relationship between model size, accuracy and MACs is shown.
In the first image of the first pair, we show the results on the URFD dataset. We can see that the best performing model is the one with the ViT+SPT+LSA encoder. Moreover, all of the Transformer implementations perform far better than the CNN-based implementation, which requires a considerably larger number of operations. In terms of MACs, the largest CNN model has 8 M parameters with 530 G MACs, while the largest Transformer-based model has 16 M parameters with only 8 G MACs, a reduction of (530 − 8)/530 ≈ 98.49%.
Conclusions
In this study, a new framework based on Siamese networks with different transformer implementations was presented. A configuration was found that outperforms state-of-the-art implementations based on convolutional networks in accuracy, achieving 93.5% and 95.03% on the URFD and FDD databases respectively, against 86.03% and 87.31% for the CNN implementations. The ViT model with SPT and LSA also achieved a shorter training time. This is because transformers have a natural affinity for parallelism: thanks to positional encoding, each image patch can be processed independently of the other sections of the image, which uses the GPU more efficiently and consumes less time during training and classification. This model achieves higher accuracy because transformers have a greater capacity to extract relevant information from images, even when the inputs are diverse and not very complex.
Regarding the advantage over the other transformers, the combination of the Shifted Patch Tokenization and Locality Self-Attention blocks enhances performance. The purpose of SPT is to provide a larger receptive field for each image block, meaning each patch receives more spatial information. Combining it with LSA allows more effective feature extraction from the information highlighted by SPT.
In conclusion, it was determined that using a transformer with SPT and LSA blocks in a Siamese configuration delivers superior performance in terms of accuracy, recall, and training time.
Final Thoughts
If you have read this far, thank you very much. This project meant a lot to me. I started knowing next to nothing about AI, and in less than a year I am capable of modeling and training my own projects. I still have a long way to go and more tools to learn, but I no longer have to feel intimidated or afraid of the unknown. Something my advisor told me made me realize the effort involved and how this was not an easy feat: “Not many students get a passing grade on their thesis project, let alone on their first try.” I feel truly humbled and grateful to have had a great teacher, and now I have an even greater hunger for knowledge that I know will take me far. Thank you for reading. Hopefully I can learn more tools that I can share with you.
References
[1]. Mishra, M., 2020. Convolutional Neural Networks, Explained. Towards Data Science.
[2]. Inside AI, 2020. Introduction to Deep Learning with Computer Vision — Types of Convolutions & Atrous Convolutions.
[3]. Cherre, J., 2023. EMGTFNet: Fuzzy Vision Transformer to decode Upperlimb sEMG signals for Hand Gestures Recognition.
[4]. Berlin, S.J., John, M., 2021. Vision based human fall detection with Siamese convolutional neural networks.
[5]. Kwolek, B., et al., 2014. Human fall detection on embedded platform using depth maps and wireless accelerometer.
[6]. Charfi, I., et al. Definition and performance evaluation of a robust SVM based fall detection solution.
[7]. Lafuente-Arroyo, S., et al., 2022. RGB camera-based fallen person detection system embedded on a mobile platform.
[8]. Tian, Z., et al., 2022. RGB camera-based fall detection algorithm in complex home environments.
[9]. Dosovitskiy, A., et al., 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
[10]. Yu, W., et al., 2021. MetaFormer is Actually What You Need for Vision.
[11]. Dey, S., et al., 2022. Fall event detection using vision transformer.
[12]. Lee, S.H., et al., 2021. Vision Transformer for Small-Size Datasets.