Watch the YouTube video for a complement to this Instructable post.
This robot has been a dream project of mine for a long time. Before diving into how to build this robot, let me give a quick overview of what it is capable of!
This robot is truly special because it can use Machine Learning models to 'see' the world via a generic camera and perform tasks depending on how the detected object's position changes in the camera frame.
This robot is built around the ever-popular Raspberry Pi, the incredibly powerful RoboClaw motor controller, and the common Rover 5 robot platform. Furthermore, all the additional physical parts are 3D printed. This robot also uses the Google Coral USB accelerator to speed up the Raspberry Pi's slow object detection. More on all this in the next few steps.
The first part of this Instructables guide will cover some of the theory behind how the algorithms and code work. The second half will cover how you can build this robot yourself, including the physical build, installing libraries, etc.
All the code, 3D printed files and other opensource files are available for free here:
3D printable files: https://www.thingiverse.com/thing:3794952
Code: https://github.com/SaralTayal123/Object-Finding-Ro...
3D printed files (more details in Step 3)
Raspberry Pi 3b/3b+
2x RoboClaw 2x7A motor controllers
USB Webcam (or Pi Camera + USB Microphone)
Google Coral USB accelerator
Ultrasonic Distance sensor
7-12V battery + 5V step down converter for the Raspberry Pi
Rover 5 platform
Geared brushed DC motor with encoders (robotic sled)
(Many more details available on the electronics parts in Step 4)
Robot Vision (CV) Explained:
First, let me quickly define a few terms:
Computer Vision (CV): Computer Vision is using a computer to process a camera stream (2D mathematical values) to get a higher-level understanding of what the camera is capturing.
Machine learning (ML): Using thousands of images of an object --let's assume a cup-- to calculate a model (picture) of what the average cup looks like, based on all the images in the dataset. More pictures = more accurate model.
This particular robot combines Machine Learning with Computer Vision. This means it passes the webcam stream through a machine-learnt model in order to detect objects in the frame. For example, building on the 'average cup model' ML example, the computer can look at the camera's video frame and try to fit the average cup image onto the new image. If it fits within a certain degree of accuracy, the new image will be labeled as a cup! By tracking where exactly this object is detected in the camera frame, we get object detection!
Sourcing 1000s of images of multiple objects and faces can be difficult. Furthermore, computing what the average object should look like (training a model) is an extremely computationally intensive task. Therefore, I decided to stick with the 'Face Detect' and 'Mobile-lite SSD COCO' models provided for free by Google's TensorFlow library. (TensorFlow is the machine learning library that actually trains the object detection model and compares new images/videos against it to classify them and detect objects within them.)
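To make this more concrete, here is a rough sketch of what running one of these pre-trained detection models on a single camera frame can look like, using the tflite_runtime interpreter with the Coral USB delegate. The model filename, input handling, and output indexing here are assumptions for illustration; the actual code in the GitHub repo may differ.

```python
import cv2
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

# Load a pre-trained detection model and offload it to the Coral USB accelerator
interpreter = Interpreter(
    model_path="mobilenet_ssd_v2_coco_quant_postprocess_edgetpu.tflite",  # assumed filename
    experimental_delegates=[load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

input_detail = interpreter.get_input_details()[0]
_, height, width, _ = input_detail["shape"]

frame = cv2.imread("test.jpg")                # stand-in for one webcam frame
resized = cv2.resize(frame, (width, height))  # SSD models expect a fixed input size
interpreter.set_tensor(input_detail["index"], np.expand_dims(resized, axis=0))
interpreter.invoke()

# Typical SSD post-processed output order: boxes, classes, scores, count
# (verify the order for your specific model)
outputs = interpreter.get_output_details()
boxes = interpreter.get_tensor(outputs[0]["index"])[0]   # normalized [ymin, xmin, ymax, xmax]
scores = interpreter.get_tensor(outputs[2]["index"])[0]
print("top detection box:", boxes[0], "score:", scores[0])
```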
Face detection explained
Now once you have your face-detect model from Google, you can use it to find faces within a video stream! When the face-detect algorithm finds a face in the camera frame, it will output the accuracy with which it thinks it has found your face and the coordinates of your face in the video frame. Now here is the challenging part. How do we use this camera positional data to have the robot follow a person around?
The answer is simple. If the person's X coordinate falls in the leftmost ~30% of the frame, the robot will turn left until the person is centered in the frame again. If the person's X coordinate falls in the rightmost ~30% of the frame, the code will tell the robot to turn right until the person is centered again. For example: if your webcam's resolution is 1280*720 (1280 pixels wide and 720 pixels tall), then your left detection zone boundary would be 0.3 x 1280 = 384 pixels. The right detection zone boundary would be 1280 - (0.3 x 1280) = 896 pixels. Therefore, if the face you are tracking has an X coordinate < 384, the robot will turn left until the face's X coordinate is greater than 384, and vice versa for the right side. You can tweak this detection zone to your liking in the code. Furthermore, you could implement a more elegant PID loop for this task, but this simple solution works well for now!
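Here is a minimal sketch of that detection-zone logic, assuming the detector already gives us the face's X center in pixels. The frame width, the 30% zone, and the helper names are illustrative, not the exact values and names used in the repo code.

```python
# Detection-zone logic: decide which way to turn to re-center the face
FRAME_WIDTH = 1280
ZONE = 0.3  # fraction of the frame on each side that triggers a turn

LEFT_EDGE = ZONE * FRAME_WIDTH           # 384 px for a 1280 px wide frame
RIGHT_EDGE = FRAME_WIDTH - LEFT_EDGE     # 896 px

def pan_decision(face_x_center):
    """Return which way the robot should turn to re-center the face."""
    if face_x_center < LEFT_EDGE:
        return "turn_left"
    elif face_x_center > RIGHT_EDGE:
        return "turn_right"
    return "centered"

print(pan_decision(200))   # -> "turn_left"
print(pan_decision(640))   # -> "centered"
```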
So at this point we have taught the robot how to rotate/pan to track a person. However, the robot cannot successfully follow the person if the robot can't tell if a person is moving closer or further away. Don't worry, the solution to this problem is cleverly simple too!
We achieve distance tracking by tracking changes in face area! The TensorFlow model will return the X starting position, X ending position, Y starting position, and Y ending position of the detected object. By computing X end position - X start position we get the width of the object. By computing Y end position - Y start position we get the height of the object. If you multiply these two values you get the area of the detected object! Now if the person's face area is decreasing, that means the person is getting further away, and the robot will move forward until the face is back at its starting area. If the person's face area is increasing, then that person is getting closer to the robot, and the robot must move back to follow the person and maintain the face area.
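Here is a minimal sketch of that area-based follow logic, assuming the model returns the box as (x_start, y_start, x_end, y_end) and that a target face area was recorded when tracking started. The tolerance value is an assumption you can tune.

```python
TOLERANCE = 0.15  # allow +/-15% change in area before the robot reacts (assumed value)

def box_area(x_start, y_start, x_end, y_end):
    width = x_end - x_start    # X end - X start = width
    height = y_end - y_start   # Y end - Y start = height
    return width * height

def follow_decision(box, target_area):
    area = box_area(*box)
    if area < target_area * (1 - TOLERANCE):
        return "move_forward"   # face shrank: the person moved further away
    elif area > target_area * (1 + TOLERANCE):
        return "move_backward"  # face grew: the person moved closer
    return "hold_position"
```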
Object detection explained
The object detection algorithm runs very similarly to the face detection. Instead of using the 'Face Detect' model, we use the COCO model, which can detect the 90 objects listed here. If the object being detected is to the right of the camera frame, the robot turns right to center the object, and if the object is to the left of the camera frame, the robot turns left to center the object.
However, unlike the face detection algorithm, the object detection doesn't use the detected object's area to determine if it needs to move closer or further away. This is because object detection is used to find objects and pick them up, not to maintain a constant distance like the face detection. Some objects will be larger than others, so using a standardized area value would not work.
To get around this issue, we can use a simple ultrasonic distance sensor mounted to the underside of the robot to measure the distance from the robot to the detected object. When this detected distance is less than a certain threshold, and if the robot is perfectly centered facing the object, the robot will run through a simple set of commands to open its claw, move forward, close its claw, and switch to face detection using the old face size value to find and go back to the person with the collected object.
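Sketched out, that pick-up trigger can look like the snippet below. The 30 cm threshold matches the value used in my code, but the helper objects (claw, drive) and the distance argument are hypothetical stand-ins for the robot's actual ultrasonic and RoboClaw routines.

```python
import time

PICKUP_DISTANCE_CM = 30   # trigger distance used in my code

def should_pickup(object_centered, distance_cm):
    """Start the pick-up sequence only when the object is centered and close enough."""
    return object_centered and distance_cm < PICKUP_DISTANCE_CM

def pickup_sequence(claw, drive):
    """Fixed pick-up routine; 'claw' and 'drive' are hypothetical wrappers
    around the RoboClaw sled and drive commands."""
    claw.open()
    drive.forward()
    time.sleep(1.0)       # illustrative pause while the robot scoops up the object
    claw.close()
    # ...the main loop then switches back to face detection to return the object
```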
Why did we need a secondary face detect model when the COCO model can detect people?
As mentioned in the object detection section above, we use the COCO TensorFlow model since it is already very accurate and fast at detecting 90 commonly used everyday objects. Now, some of you might have noticed that the COCO object detection model can also detect people. So why didn't I use the COCO model for face detection? Firstly, the COCO model runs slower than the face detection model, since it's looking for 90 types of objects vs just one (faces) in the face detect model. When I say slower, I don't mean 'lag', I mean a lower FPS. However, this lower FPS isn't the biggest problem. The biggest problem is tracking. You see, the COCO person detector actually detects the whole person's body rather than just the face, which is problematic for calculating area and deciding whether the robot should move closer or further away. If you are close to the robot and its camera, it will only see your face and calculate the area based on that. But as you move away from the camera, more of your body comes into the webcam's view, so the detected bounding box actually grows. Therefore, although you are walking away, the robot might think you are getting closer and do the opposite of what it should: instead of following you, it will back away from you. With faces, on the other hand, it is very unlikely for your face to be partially cut off by the webcam's field of view, so it is a much more stable way of tracking and following humans.
Filtering the data
Next we come to filtering the actual input data from the object detection and face tracking models. You see, these models are quite accurate, but they sometimes give jumpy/twitchy coordinates between video frames, or give us false positives that can confuse our robot and turn it into a jittery mess. (Look at the project's YouTube video for examples of this.)
To remedy this, I implemented a simple filtering mechanism. Any time an object is detected with a confidence score below 80%, it will be ignored. This is great for filtering out false positives. But what about filtering out jumpy data? To do this I used a simple averaging function that takes the last 3 coordinates and averages them to give you the current coordinate used for the robot's movement calculation. These 3 coordinates are stored in an array and the oldest of the 3 data points gets replaced as soon as a new coordinate data point is returned from the computer vision model (FIFO). By averaging the 3 latest data points, we can easily smooth out jumpy data and also smooth out any outliers in the data. This method theoretically makes the robot slower to respond to fast movements, but with a high enough frame rate from the object detection models, this will not be an issue. You can always tweak the averaging to be more than or less than 3 values depending on your need for responsiveness vs smoothness. I found 3 to be the perfect balance where the robot didn't suffer from noticeable delays while also being smooth in its movement.
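In code, that filter can be as simple as the sketch below: reject anything under the 80% confidence threshold, then average the last 3 accepted coordinates. The names here are illustrative, not the ones used in the repo.

```python
from collections import deque

CONFIDENCE_THRESHOLD = 0.8   # ignore detections below 80% confidence
WINDOW = 3                   # number of coordinates to average

recent_x = deque(maxlen=WINDOW)  # FIFO: the oldest value is dropped automatically

def filtered_x(score, x):
    """Return a smoothed X coordinate, or None if the detection is rejected."""
    if score < CONFIDENCE_THRESHOLD:
        return None                        # likely a false positive
    recent_x.append(x)
    return sum(recent_x) / len(recent_x)   # average of up to the last 3 values
```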
Webcam servo
I also mounted the webcam used for this project on a tilt mechanism powered by a servo. The reason for this is that humans stand tall, so their faces would otherwise be outside the webcam's field of view. The servo-powered tilt mechanism can therefore tilt the webcam upwards when looking for faces, and tilt it back down toward the ground when looking for objects to pick up. Do look at the attached pictures above for a better illustration of how the servo is attached.
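Here is a rough sketch of how a tilt servo like this can be driven from the Pi with RPi.GPIO software PWM. The GPIO pin and the duty-cycle values are assumptions; you would tune them to your own servo and mount angle.

```python
import time
import RPi.GPIO as GPIO

SERVO_PIN = 18   # assumed pin; any free GPIO works

GPIO.setmode(GPIO.BCM)
GPIO.setup(SERVO_PIN, GPIO.OUT)
pwm = GPIO.PWM(SERVO_PIN, 50)   # 50 Hz is the standard hobby-servo frequency
pwm.start(0)

def tilt(duty_cycle):
    pwm.ChangeDutyCycle(duty_cycle)
    time.sleep(0.5)              # give the servo time to reach the position
    pwm.ChangeDutyCycle(0)       # stop pulsing to reduce servo jitter

tilt(7.5)   # roughly level, for face tracking (assumed value)
tilt(5.0)   # tilted down toward the floor, for object pick-up (assumed value)

pwm.stop()
GPIO.cleanup()
```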
Now that the robot can find objects using CV & ML, we come to the actual object pick-up step. This step is performed when the robot has successfully navigated to the object it is looking for and has reached a certain trigger distance (in my code it was 30 cm) from the object it is trying to pick up. The steps for picking up an object are as follows:
The sled itself is powered using brushed DC motors. The reasoning for this is multifaceted. Firstly, geared brushed DC motors are very small and easy to source. Secondly, it is very easy to do precise positional control through the RoboClaw motor controller and the motors' encoders. Finally, geared brushed DC motors have incredible torque thanks to their high gear ratios. This is especially useful when the robot sled is picking up a larger, heavier object.
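For reference, here is a rough sketch of commanding a motor through the Basicmicro RoboClaw Python library over the Pi's hardware serial port. The port, baud rate, address, and speed values are assumptions, and the actual repo code may use different calls (for example, encoder-based position commands).

```python
from roboclaw import Roboclaw   # Basicmicro's library; the file may be named roboclaw_3.py

rc = Roboclaw("/dev/ttyS0", 38400)   # assumed serial port and baud rate
rc.Open()

ADDRESS = 0x80                       # default RoboClaw packet-serial address

rc.ForwardM1(ADDRESS, 40)            # run the sled motor forward at roughly 1/3 speed (0-127)
print(rc.ReadEncM1(ADDRESS))         # encoder count, used for positional control
rc.ForwardM1(ADDRESS, 0)             # stop the motor
```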
I chose the sled design rather than a robotic arm for multiple reasons.
Of those reasons, the biggest advantage of the sled was that it can pick up large objects with much less chance of slippage, and without needing a secondary camera.
The sled works by connecting the brushed DC motor to a linear gear attached to the first sled arm. This linear gear then drives a rotating gear, which flips the direction of movement before transferring it to the second linear gear. Look at the annotated picture I've attached above for a much clearer view of how this works.
The voice detection on the robot works via the Snowboy voice detection program. Snowboy is an incredibly lightweight, offline wake-word detector that can listen for multiple wake words simultaneously.
Since Snowboy can detect multiple wake words at once, I created a voice detection structure like so:
Firstly, the user says the wake word "Robot". A 10 second internal timer starts and the orange light turns on. Then the user has to give a second command such as
The reason for using the "robot" wake word before giving a command is to prevent the wrong command from being triggered accidentally. As the robot moves around, its vibrations can sometimes create a false positive where the program thinks you said a word when you didn't. Adding the "robot" wake word, plus a 10 second window to say your command after it, helps filter out these random false positives.
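Here is a minimal sketch of that wake-word-plus-timer structure, assuming some hotword detector calls on_hotword() whenever a wake word is heard. The Snowboy callback wiring and the LED handling are omitted, and the names are illustrative.

```python
import time

COMMAND_WINDOW_S = 10   # seconds the robot stays "armed" after hearing "robot"
armed_until = 0.0       # time until which a follow-up command is accepted

def on_hotword(word):
    """Called by the hotword detector with the name of the detected wake word."""
    global armed_until
    now = time.time()
    if word == "robot":
        armed_until = now + COMMAND_WINDOW_S
        print("armed: orange light on, waiting for a command")
    elif now < armed_until:
        armed_until = 0.0
        print("executing command:", word)   # e.g. publish the command over MQTT
    else:
        print("ignored (say 'robot' first):", word)
```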
** All the code is available further on in this Instructables post, where I show you how to install it **
Now we move on to the physical building section of the robot. This robot was built around the common Rover 5 platform.
I chose this robot platform because it is very common and sold by multiple distributors, making it easy for anyone in the world to replicate this project. The second reason for choosing this platform was that it uses tank treads rather than wheels, which makes it much easier to pivot (one tank tread moves forward and the other tread moves back) and also allows the robot to easily climb over small obstacles. The third reason for picking the Rover 5 chassis was that it comes with DC motors that have encoders built in. Most robot chassis don't come with motors that have encoders. The encoders tell the RoboClaw motor controller where each motor is positioned, giving the RoboClaw and the robot much finer control over its movement.
Here are some links to where you can find the Rover 5 platform:
https://www.sparkfun.com/products/10336
https://www.dfrobot.com/product-470.html
https://www.pololu.com/product/1551
https://solarbotics.com/product/50858/
The rest of the physical parts are all 3D printed. You can download all the parts here: https://www.thingiverse.com/thing:3794952. I have also added the .STEP and .F3D files so you can modify the parts to suit your needs. If you don't wish to modify the parts, you can download the .STL files and 3D print those directly.
Here is a list of all the parts you will need to print for this project.
Next we come to the actual electronics you need to make this project work.
GPIO pinout I used (this is customizable, except for the TX/RX hardware serial pins on the Raspberry Pi):
Green LED (used when a command is given after the "robot" wake word): GPIO 21 + GND
RoboClaw pinout:
EN2 - Right wheel of the rover (empty if this is the RoboClaw being used to control the sled)
There are 4 Python scripts that run at the same time.
The reason I had to use 4 different scripts instead of merging them all into one is twofold. Firstly, I was having issues with some of the scripts, like the Snowboy voice detection and the ultrasonic distance detection, being buggy and not working because of how long it took to run through the combined code loop. This could be partially solved via multiprocessing (not multi-threading, since in Python only one thread runs at a time, whereas multiprocessing can run two or more processes simultaneously), but I was still having some timing issues. The second reason I could not combine the 4 Python scripts was library compatibility: some of the libraries the scripts needed only work with Python 2.7, while others only work with Python 3.x. Therefore, I was forced to run separate Python scripts.
To ensure communication between these various scripts, I used a very lightweight solution called MQTT. MQTT is often used by makers in IoT applications to add a lightweight and very fast communication protocol to smart devices. By adding a separate thread to each program to check for incoming MQTT data, I was able to integrate all 4 Python processes in a very simple way. An alternative to MQTT could be pipes (if using multiprocessing) or WebSockets. You might think that MQTT would add lag/delay when communicating from one process to another, but since the data stays within the Pi rather than being transmitted over the internet, the actual delay of sending information was almost zero.
The 4 Python scripts use MQTT to communicate in this manner. A mind-map drawing of this relationship is attached as a photo above.
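As a rough sketch, two of the scripts could exchange data through the local Mosquitto broker with paho-mqtt as shown below; the topic name and payload are illustrative, not necessarily what the repo uses.

```python
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    print("received:", msg.topic, msg.payload.decode())

client = mqtt.Client()                  # paho-mqtt 1.x style constructor
client.on_message = on_message
client.connect("localhost", 1883)       # the broker runs on the Pi itself
client.subscribe("robot/distance")      # assumed topic name
client.loop_start()                     # background thread handles incoming messages

# Elsewhere (e.g. in the ultrasonic script) a reading is published like this:
client.publish("robot/distance", "27.5")
```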
Here are the steps to follow to install the code on your device.
Firstly, download the project zip file from this GitHub link: https://github.com/SaralTayal123/Object-Finding-Ro... This will give you all the code for the robot. However, we need to install some libraries on our Raspberry Pi to make the code work properly.
https://coral.withgoogle.com/docs/accelerator/get-... (Click yes on the enable maximum frequency)
Set up OpenCV using the following commands in the terminal
Install Mosquitto using the following instructions
Install RoboClaw libraries and dependencies
Install the Distance Sensor code:
pip install Bluetin_Echo
pip install monotonic
And that's it for installing the libraries.
To launch the code, follow these steps. (Alternatively, you could use a startup script to do this automatically when the rover boots up.)
Terminal 1 (main robot code)-
Terminal 2 (ultrasonic distance sensor)-
Terminal 3 (object detection CV code)-
Terminal 4 (Snowboy voice recognition)-
To end this Instructables guide, I want to highlight some of the alternate methods I tried before building the robot the way I did.
This has been an incredibly ambitious project that has taught me a lot about computer vision, machine learning, robotics, and more. It has been an absolute joy to put this project together, but nonetheless, like every project, it has room for improvement and upgrades.
Possible improvements and upgrades