Pose Estimation Benchmarks on the Intelligent Edge

Benchmarks on Google Coral, Movidius Neural Compute Stick, Raspberry Pi and others


In an earlier article, we covered running PoseNet on Movidius. We saw that we were able to achieve 30FPS with acceptable accuracy. In this article we are going to evaluate PoseNet on the following mix of hardware:

  1. Raspberry Pi 3B
  2. Movidius NCS + RPi 3B
  3. Ryzen 3
  4. GTX1030 + Ryzen 3
  5. Movidius NCS + Ryzen 3
  6. Google Coral + RPi 3B
  7. Google Coral + Ryzen 3
  8. GTX1080 + i7 7th Gen

This comparison of PoseNet’s performance across hardware is meant to help decide which hardware suits a specific use case and whether optimizations can help. It also gives a glimpse into hardware capabilities in the wild. The hardware ranges from baseline prototyping platforms, through accelerators tailored for the edge, to production-grade CPUs and GPUs.
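For readers who want to reproduce this kind of comparison, a frames-per-second figure can be measured with a small timing loop. The sketch below is not the harness used for these benchmarks; `measure_fps` and `fake_infer` are hypothetical names, and `fake_infer` merely sleeps to stand in for a PoseNet forward pass.

```python
import time
from typing import Callable, Iterable


def measure_fps(infer: Callable[[object], object],
                frames: Iterable[object],
                warmup: int = 5) -> float:
    """Return the average frames per second of `infer` over `frames`.

    The first `warmup` frames are processed but not timed, so one-off
    costs such as model loading do not distort the result.
    """
    frames = list(frames)
    for frame in frames[:warmup]:
        infer(frame)

    timed = frames[warmup:]
    if not timed:
        raise ValueError("need more frames than warmup iterations")

    start = time.perf_counter()
    for frame in timed:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed


def fake_infer(frame):
    # Hypothetical stand-in for a model call: pretend each frame takes ~10 ms.
    time.sleep(0.01)
    return frame


fps = measure_fps(fake_infer, range(25))
print(f"{fps:.1f} FPS")
```

Swapping `fake_infer` for a real inference call on each device would give numbers that are comparable across platforms, provided the same input video is used everywhere.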

Hardware Choices

  1. Raspberry Pi: The board of choice for prototyping. Although low powered, it gives a good initial understanding of what to expect and what to choose for production. It may not be able to run the DNN models, but it sure is fun.
  2. Movidius NCS + RPi 3B: Movidius Neural Compute Stick is a promising candidate if the model is to be run on the edge. NCS has Vision Processing Units (VPU) which are optimized to run deep neural networks.
  3. Ryzen 3: AMD’s quad-core CPUs are not a conventional choice for neural networks, but it is worth checking how the networks perform on the platform.
  4. GTX1030 + Ryzen 3: Adding an Nvidia GPU to the rig (granted, it is comparatively old but it is cheap) allows us to benchmark what is possible on older cuDNN versions and GPUs.
  5. Movidius NCS + Ryzen 3: A desktop system allows for better and faster interfacing with the NCS. This setup is preferred while prototyping your edge application. A high-performance CPU allows rapid application development, while the NCS lets you run your models on your development laptop.
  6. Google Coral + RPi 3B: Google’s answer to on-edge ML is their Coral board which has TPUs. Tensor Processing Units are used by Google’s gigantic AI systems. Coral puts the compute power of TPUs in a small form factor. It has native support for Raspberry Pi too.
  7. Google Coral + Ryzen 3: As with the Movidius NCS + Ryzen 3 setup, it will be insightful to see how Coral interfaces with a Ryzen 3 based computer.
  8. GTX1080 + i7 7th Gen: Top of the line system with GTX1080 and Intel i7 CPU. This is the highest performing combination in the list.

Repositories and models used:

  1. PoseNet, tfjs version
    • Based on MobileNetV1_050
    • Based on MobileNetV1_075
    • Based on MobileNetV1_100
  2. PoseNet, Google Coral version
  3. Read our previous blog post to get Movidius versions of PoseNet

Comparing Edge Compute Units

Google Coral’s PoseNet repository provides a model based on MobileNet 0.75 which is optimized specifically for Coral. At the time of writing, the details of the optimizations have not been provided and it is not possible to generate models for MobileNet 0.50 and 1.00.

Google Coral vs Intel Movidius

The optimized Coral model delivers an exceptional 77FPS on the Ryzen 3 system. However, the same model manages only ~9FPS when running on the Raspberry Pi.
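To put those two numbers in perspective, frame rates can be converted into a per-frame time budget. This is plain arithmetic on the figures quoted above (the Raspberry Pi value is approximate in the original measurement); the function name is ours.

```python
def frame_time_ms(fps: float) -> float:
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / fps


coral_ryzen_fps = 77.0  # Coral + Ryzen 3, from the benchmark above
coral_rpi_fps = 9.0     # Coral + RPi 3B, approximate ("~9FPS")

print(round(frame_time_ms(coral_ryzen_fps), 1))  # 13.0 ms per frame
print(round(frame_time_ms(coral_rpi_fps), 1))    # 111.1 ms per frame

speedup = coral_ryzen_fps / coral_rpi_fps
print(round(speedup, 1))                         # 8.6x faster on the desktop host
```

The accelerator and the model are identical in both runs, so the difference of roughly 98 ms per frame comes from somewhere in the host system rather than from the model.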

Movidius also performs differently on the RPi and the Ryzen 3, and is generally faster on the Ryzen 3 system.

Comparing Desktop CPUs and GPUs

The results align with expectations when comparing the CPU with the GTX 1030 and the GTX 1080. The high-end GPU outperforms the other candidates by a huge margin. However, the competition between the Ryzen 3 and the GTX 1030 is close.

Ryzen vs GTX 1030 vs GTX 1080

Final Thoughts

The following chart shows frames per second for a standard video input:

Frames per second

Google Coral, when paired with a desktop computer, outperforms every other platform, including the GTX1080.

Other noteworthy results are:

  1. When paired with Raspberry Pi 3, Coral gives ~9FPS. The reason behind the result is not yet explained but is being looked into.
  2. GTX1080 performs almost equally regardless of the model size.
  3. Movidius NCS performs better than GTX1030.
  4. Raspberry Pi is not able to run the models at all.

Performance varies widely across hardware, and there is scope for model optimization (quantization, for example). It may not always be necessary to go with a high-end GPU such as the GTX 1080 if your use case allows for a good trade-off between accuracy and speed/latency.

Our analysis shows that choosing the right hardware, coupled with a well-optimized neural network, is essential and may require in-depth comparative analysis.

Car or Not a Car

Lessons from Fine Tuning a Convolutional Binary Classifier

Fine tuning has been shown to be very effective in certain types of neural net based tasks such as image classification. Depending upon the dataset used to train the original model, the fine-tuned model can achieve a higher degree of accuracy with comparatively less data. Therefore, we have chosen to fine tune ResNet50 pre-trained on the ImageNet dataset provided by Google.

We are going to explore ways to train a neural network to detect cars, and optimise the model to achieve high accuracy. In technical terms, we are going to train a binary classifier which performs well under real-world conditions.

Taken in a village Near Jaipur (Rajasthan, India) by Sanjay Kattimani http://sanjay-explores.blogspot.com

There are two possible approaches to train such a network:

  • Train from scratch
  • Fine-tune an existing network

To train from scratch, we need a lot of data: millions of positive and negative examples. The process doesn’t end at data acquisition. One has to spend a lot of time cleaning the data and making sure it contains enough examples of the real-world situations that the model is going to encounter in practice. The feasibility of the task is directly determined by the background knowledge and time required to implement it.

Basic Setup

There are certain requisites that are going to be used throughout the exploration:

  1. Datasets
    a. Stanford Cars for car images
    b. Caltech256 for non-car images
  2. Base Network
    a. ResNet (arXiv), pre-trained on ImageNet
  3. Framework and APIs
    a. TensorFlow
    b. TF Keras API
  4. Hardware
    a. Intel i7 6th gen
    b. Nvidia GTX1080 with 8GB VRAM
    c. System RAM 16GB DDR4

Experiment 1

To start with a simple approach, we take ResNet50 without the top layer and add a fully connected (dense) layer on top of it. The dense layer contains 32 neurons with sigmoid activation. This gives approximately 65,000 trainable parameters, which is plenty for the task at hand.

Model Architecture for experiment 1

We then add the final output layer having a single neuron with sigmoid activation. This layer has a single neuron because we are performing binary classification. The neuron will output real values ranging from 0 to 1.
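The figure of roughly 65,000 trainable parameters can be checked with a little arithmetic. The sketch below assumes the headless ResNet50 output is pooled into a 2048-value feature vector (2048 is the channel count of ResNet50's last convolutional block) and that the ResNet50 weights themselves are frozen; neither detail is spelled out above, so treat both as assumptions.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Trainable parameters of a dense layer: one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


# Assumption: pooled ResNet50 features form a vector of 2048 values.
RESNET50_FEATURES = 2048

hidden = dense_params(RESNET50_FEATURES, 32)  # the 32-neuron dense layer
output = dense_params(32, 1)                  # the single-neuron output layer
total = hidden + output

print(hidden, output, total)  # 65568 33 65601
```

Under those assumptions the dense layer alone accounts for 65,568 parameters, which matches the approximate count quoted above.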

Data Preparation

We randomly sample 50% of the images for the training set, 30% for validation and 20% for testing. Although there is a huge gap between the number of car and non-car images in the training set, it should not skew our process too much because the datasets are comparatively clean and reliable.
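A random 50/30/20 split of this kind can be sketched in a few lines. This is an illustration rather than the code used in the experiment; the file names, the `random_split` helper and the fixed seed are all hypothetical.

```python
import random


def random_split(items, fractions=(0.5, 0.3, 0.2), seed=42):
    """Shuffle `items` and cut them into train, validation and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable

    n_train = round(len(items) * fractions[0])
    n_val = round(len(items) * fractions[1])

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # whatever remains goes to the test set
    return train, val, test


# Hypothetical file names standing in for the real image paths.
paths = [f"img_{i:04d}.jpg" for i in range(1000)]
train, val, test = random_split(paths)
print(len(train), len(val), len(test))  # 500 300 200
```

Note that a split like this ignores class labels entirely, so individual classes can end up over- or under-represented in any of the three sets.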



As a trial run, we trained for one epoch. The graphs below illustrate that the model starts at high accuracy, and reaches near-perfect performance within the first epoch. The loss goes down as well.

Epoch Accuracy for Experiment 1
Epoch Loss for Experiment 1

However, validation accuracy does not look nearly as good as training accuracy, and neither does validation loss.

Validation Accuracy for Experiment 1
Validation Loss for Experiment 1

So, we ran for 4 epochs and were left with the following results:

Accuracy and Loss for four epochs
Validation accuracy and validation loss for four epochs

The model performs relatively well, except for the large gap between training and validation losses.

Experiment 2

We decided to keep the model architecture the same as in the first experiment: the same ResNet50 without the top layer, with a fully connected (dense) layer of 32 neurons with sigmoid activation on top of it.

Model Architecture for experiment 2

Data Preparation

This is where the problem lay in the previous experiment. The train/validation/test splits were random. Our hypothesis was that the randomness put too many images of some cars and too few of others into the training set, causing the model to be biased.

So, we took the splits as given by the Cars dataset and added 3000 more images by scraping the good old Web.
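One way to test a hypothesis like the one above is to compare how each class is represented in every split. The sketch below is a diagnostic we are suggesting, not something run in the original experiment; the labels and counts are made up for illustration.

```python
from collections import Counter


def class_share_per_split(splits):
    """Map each split name to {label: fraction of that split's images}."""
    shares = {}
    for name, labels in splits.items():
        counts = Counter(labels)
        total = sum(counts.values())
        if total == 0:
            shares[name] = {}  # an empty split has no class shares
            continue
        shares[name] = {label: n / total for label, n in counts.items()}
    return shares


# Hypothetical labels: the training split is skewed, the validation split is not.
splits = {
    "train": ["sedan"] * 80 + ["suv"] * 20,
    "val": ["sedan"] * 50 + ["suv"] * 50,
}
shares = class_share_per_split(splits)
print(shares["train"]["suv"], shares["val"]["suv"])  # 0.2 0.5
```

A large mismatch between splits, as in this toy example, is the kind of imbalance that a purely random split can introduce.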



These results signify a substantial improvement in the validation accuracy when compared to the previous experiment.

Epoch Accuracy for experiment 2
Epoch Loss for experiment 2

Even though the accuracy matches fairly well, there is a big difference between the training loss and the validation loss.

Validation Accuracy for experiment 2
Validation Loss for experiment 2

This network seems more stable than the previous one. The only change from the previous experiment is the new data splits.

Experiment 3

Here we add an extra dropout layer, which gives each neuron a 30% chance of being dropped during a training pass. Dropout is known to regularize models by preventing the biases caused by interdependence of neurons.
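To make the mechanism concrete, here is a minimal sketch of dropout on a plain list of activations. It uses the "inverted" variant, in which surviving activations are scaled up by 1 / (1 - rate) so that their expected sum stays the same; this is how the Keras `Dropout` layer behaves during training, but the pure-Python function below is our own illustration, not the layer's implementation.

```python
import random


def dropout(activations, rate=0.3, rng=None):
    """Zero each activation with probability `rate`; scale the survivors by 1 / (1 - rate)."""
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # seeded only to make the demo repeatable
out = dropout([1.0] * 10000, rate=0.3, rng=rng)

dropped_fraction = out.count(0.0) / len(out)
print(f"{dropped_fraction:.2f} of activations dropped")
```

At inference time dropout is switched off and all neurons contribute, which is why the scaling during training matters.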

Model Architecture for experiment 3

Since we have a comparatively huge pre-trained network and smaller trainable network, we could add more dense layers to see the effects. We did that and the model ended up achieving saturation in fewer epochs. No other improvements were observed.

Data Preparation

Just like in experiment 2, the default train/validation splits are taken.


Here, we have run the model on a single learning rate but the value can be experimented with. We will talk about the effects of batch size on this network in the results section.


The results here are with a batch size of 32. As seen, the network seems to saturate within 3 epochs (although it might be a bit premature to judge this).

Epoch accuracy for experiment 3
Epoch Loss for experiment 3

At the same time validation accuracy and loss also seem to be performing well.

Validation Accuracy for experiment 3
Validation Loss for experiment 3

So, we increased the batch size to 128, hoping it would help the network find a better local minimum and thereby give better overall performance. Here is what happened:
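One concrete effect of a larger batch is that each epoch consists of fewer weight updates, each averaged over more images. The sketch below shows the arithmetic; the training-set size of 10,000 is a hypothetical round number, not the size of our dataset.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of weight updates in one pass over the data (the last batch may be partial)."""
    return math.ceil(n_samples / batch_size)


n_samples = 10_000  # hypothetical training-set size, for illustration only

for batch_size in (32, 128):
    print(batch_size, steps_per_epoch(n_samples, batch_size))
# 32 313
# 128 79
```

With about a quarter as many updates per epoch, each gradient estimate is less noisy, which tends to smooth the loss curves.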

Epoch Accuracy and Loss for batch size of 128
Validation Accuracy and Loss for batch size of 128

The model now performs reasonably well on both training and validation sets. The losses between training and validation runs are not too far apart either.

Model Drawbacks

Obviously, the model is not one hundred percent accurate. It still misclassifies some images.


When we ran this model on the test dataset, it failed on only 7 images across the combined car and non-car sets. This is a very high level of accuracy and brings the model close to production use.
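Turning a misclassification count into an accuracy figure only requires the size of the test set. The sketch below uses the 7 failures reported above, but the test-set size of 2,000 is a hypothetical placeholder, so the printed percentage is an illustration and not a measured result.

```python
def accuracy(n_total: int, n_wrong: int) -> float:
    """Fraction of test images classified correctly."""
    if n_total <= 0:
        raise ValueError("test set must not be empty")
    return (n_total - n_wrong) / n_total


n_wrong = 7      # misclassified images, from the test run above
n_total = 2000   # hypothetical test-set size, for illustration only

acc = accuracy(n_total, n_wrong)
print(f"{acc:.2%}")  # 99.65%
```

For an imbalanced car/non-car test set it would also be worth counting the failures per class, since overall accuracy can hide a weak minority class.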

In conclusion, we can safely assert that dataset splits are crucial. Rigorous evaluations and experimentation with various hyper-parameters give us a better idea of the network. We should also think about modifying the original architecture based on the evidence provided by the various hyper-parameters.
