Step 3: Object detection

Rick · May 1, 2024
Using MobileNet V2

The next step in trying to replicate some of the work presented by Li et al. (2020) [1] is to implement object detection with MobileNet V2.

Options for how to use the model

There are a few open-source options for using this model. The authors of [1] don't make a definitive statement about which programming language they're using. However, they do mention the libraries they use for 3D plotting and rendering, as well as for audio manipulation (which is not my focus), among others:

Point Cloud Library to handle the 3D functions and views.

Open Multi-Processing (OpenMP) to manage multithreaded processing.

There are also C++ implementations of the model.

Since the paper focuses on a wearable embedded system as its solution, it makes sense that the authors use C/C++. I discussed this with my advisor, and since the focus of my research is to present a new AI model, or more accurately a new framework, for fall prevention and detection, I will continue to work with implementations in Python using PyTorch. That being said, I would still like to experiment with Rust in the future, as it has been shown to be a very fast language, and I even invite you, the reader, to bully me and hold me accountable to actually do this. For now, let's continue with Python implementations.
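
As a reference point for that PyTorch route, here is a minimal sketch of loading a pretrained MobileNet V2 backbone from torchvision. This is only the ImageNet classification model, not an SSD detector, and the weights enum assumes torchvision ≥ 0.13; a detection head would still have to be added or trained separately.

import torch
from torchvision import models

# Pretrained MobileNet V2 classifier from torchvision (ImageNet weights).
mobilenet = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
mobilenet.eval()

# Dummy forward pass with one 224x224 RGB image tensor.
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = mobilenet(dummy)
print(logits.shape)  # torch.Size([1, 1000]) -> ImageNet class scores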

MediaPipe

To avoid delays or additional work, I first want to check whether there are any comparable prebuilt implementations. I tried out some of the code from Google's MediaPipe solutions. Here's some testing code.

First, I read four sample images:

import imageio.v3 as iio
import matplotlib.pyplot as plt
import numpy as np

# Read four consecutive RGB frames from the fall sequence.
d1 = iio.imread('./fall-01-cam0-rgb-001.png')
d2 = iio.imread('./fall-01-cam0-rgb-002.png')
d3 = iio.imread('./fall-01-cam0-rgb-003.png')
d4 = iio.imread('./fall-01-cam0-rgb-004.png')

raw_images = [d1, d2, d3, d4]

# Show the raw frames in a 2x2 grid.
fig = plt.figure(figsize=(4, 4))

for i in range(4):
    ax = fig.add_subplot(2, 2, i + 1)
    ax.imshow(raw_images[i])

Then we define a helper function that draws annotations from the detections; this is adapted from the MediaPipe object detection guide.

import cv2

MARGIN = 10     # pixels
ROW_SIZE = 10   # pixels
FONT_SIZE = 1
FONT_THICKNESS = 1
TEXT_COLOR = (255, 0, 0)  # red

def visualize(image, detection_result) -> np.ndarray:
    """Draws bounding boxes on the input image and returns it.

    Args:
        image: The input RGB image.
        detection_result: The list of all "Detection" entities to be visualized.

    Returns:
        Image with bounding boxes.
    """
    for detection in detection_result.detections:
        # Draw bounding box
        bbox = detection.bounding_box
        start_point = bbox.origin_x, bbox.origin_y
        end_point = bbox.origin_x + bbox.width, bbox.origin_y + bbox.height
        cv2.rectangle(image, start_point, end_point, TEXT_COLOR, 3)

        # Draw label and score
        category = detection.categories[0]
        category_name = category.category_name
        probability = round(category.score, 2)
        result_text = category_name + ' (' + str(probability) + ')'
        text_location = (MARGIN + bbox.origin_x,
                         MARGIN + ROW_SIZE + bbox.origin_y)
        cv2.putText(image, result_text, text_location, cv2.FONT_HERSHEY_PLAIN,
                    FONT_SIZE, TEXT_COLOR, FONT_THICKNESS)

    return image

Now we run the detector on each of the sample images:

# STEP 1: Import the necessary modules.
import numpy as np
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# STEP 2: Create an ObjectDetector object.
base_options = python.BaseOptions(model_asset_path='ssd_mobilenet_v2_32f.tflite')
options = vision.ObjectDetectorOptions(base_options=base_options,
                                       score_threshold=0.48)
detector = vision.ObjectDetector.create_from_options(options)

annotated_images = []
for i in range(1, 5):
    IMAGE_FILE = f'fall-01-cam0-rgb-00{i}.png'
    # STEP 3: Load the input image.
    image = mp.Image.create_from_file(IMAGE_FILE)
    # STEP 4: Detect objects in the input image.
    detection_result = detector.detect(image)
    # STEP 5: Process the detection result. In this case, visualize it.
    image_copy = np.copy(image.numpy_view())
    annotated_image = visualize(image_copy, detection_result)
    annotated_images.append(annotated_image)


# Show the annotated frames in a 2x2 grid.
fig = plt.figure(figsize=(4, 4))

for i in range(4):
    ax = fig.add_subplot(2, 2, i + 1)
    ax.imshow(annotated_images[i])

Now we can see that it doesn't detect every object of interest. These annotations work with the 0.5 confidence threshold mentioned by the authors. By slightly lowering the confidence to 0.48, the model is able to detect more objects. However, lowering the confidence threshold just for the sake of it is not good practice.

Left: confidence threshold set to 0.48. Right: confidence threshold set to 0.5.
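
To reproduce this comparison without re-running the whole notebook twice, a quick sketch like the following reuses the detector setup from above, only swapping the score_threshold value, and counts the detections at each threshold:

# Sketch: count detections at two confidence thresholds on the same frame.
def count_detections(image_path, threshold):
    opts = vision.ObjectDetectorOptions(
        base_options=python.BaseOptions(model_asset_path='ssd_mobilenet_v2_32f.tflite'),
        score_threshold=threshold)
    det = vision.ObjectDetector.create_from_options(opts)
    result = det.detect(mp.Image.create_from_file(image_path))
    return len(result.detections)

for t in (0.5, 0.48):
    print(t, count_detections('fall-01-cam0-rgb-001.png', t))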

We can now see that, in addition to the person, the model detects 2 out of 3 chairs instead of only 1. To keep moving forward, I'll use the MediaPipe solution at a 0.5 confidence threshold for now. Some things I'm planning to do to enhance the model's performance are to train it from scratch using PyTorch, or to apply image processing methods to brighten the image or minimize shadows, so the model works with cleaner information (a rough sketch of that idea follows below). However, this is secondary, since what I'd like to do next is to get a 3D projection of the detected objects, as shown by the authors.
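
As a rough sketch of the illumination idea, something like contrast-limited adaptive histogram equalization (CLAHE) on the lightness channel could be applied before detection. The OpenCV parameters below are placeholder values, not tuned for this dataset.

# Sketch: enhance illumination with CLAHE on the L channel of the LAB color space.
def enhance_illumination(rgb_image):
    lab = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # placeholder parameters
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2RGB)

# Example: preprocess a raw frame before feeding it to the detector.
enhanced = enhance_illumination(raw_images[0])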

References

[1]. Z. Li, F. Song, B. C. Clark, D. R. Grooms, and C. Liu, “A Wearable Device for Indoor Imminent Danger Detection and Avoidance With Region-Based Ground Segmentation,” IEEE Access, vol. 8, pp. 184808–184821, 2020, doi: https://doi.org/10.1109/access.2020.3028527.
