How to Develop a Real-Time Object Detection Project

Developing a real-time object detection project

You can develop a video object classification application using pre-trained YOLO models (that is, transfer learning), Deeplearning4j (DL4J), and OpenCV that can detect labels such as cars and trees inside a video frame. You can find the relevant code files for this tutorial at This application is also about extending an image detection problem to video detection. Time to get started!

Step 1¬†‚Äď Loading a pre-trained YOLO model

Since Alpha release 1.0.0, DL4J provides a Tiny YOLO model via ZOO. For this, you need to add a dependency to your Maven friendly pom.xml file:

Apart from this, if possible, make sure that you utilize the CUDA and cuDNN by adding the following dependencies:

Now, use the below code to load the pre-trained Tiny YOLO model as a Computation Graph. You can use the PASCAL Visual Object Classes (PASCAL VOC) dataset (see more at to train the YOLO model.

In the above code segment, the createObjectLabels() method refers to the labels from the PASCAL Visual Object Classes (PASCAL VOC) dataset. The signature of the method can be seen as follows:

Now, create a Tiny YOLO model instance:

Take a look at the model architecture and the number of hyper parameters in each layer:

Network summary and layer structure of a pre-trained Tiny YOLO model

Your Tiny YOLO model has around 1.6 million parameters across its 29-layer network. However, the original YOLO 2 model has more layers. You can look at the original YOLO 2 at

Step 2¬†‚Äď Generating frames from video clips

To deal with real-time video, you can use video processing tools or frameworks such as JavaCV that can split a video into individual frames. Take the image height and width. For this, include the following dependency in the pom.xml file:

JavaCV uses wrappers from the JavaCPP presets of libraries commonly used by researchers in the field of computer vision (for example, OpenCV and FFmpeg). It provides utility classes to make their functionality easier to use on the Java platform, including Android.

For this project, there are two video clips (each 1 minute long) that should give you a glimpse into an autonomous driving car. This dataset has been downloaded from the following YouTube links:

After downloading them, they were renamed as follows:

  • SelfDrivingCar_Night.mp4
  • SelfDrivingCar_Day.mp4

When you play these clips, you’ll see how Germans drive their cars at 160 km/h or even faster. Now, parse the video (first use day 1) and see some properties to get an idea of video quality hardware requirements:

The inputted video clip has 1802 frames. The inputted video clip has frame rate of 29.97002997002997.

Now grab each frame and use Java2DFrameConverter to convert frames to JPEG images:

The above code will generate 1,802 JPEG images against an equal number of frames. Take a look at the generated images:

From video clip to video frame to image

Thus, the 1-minute long video clip has a fair number (that is, 1,800) of frames and is 30 frames per second. In short, this video clip has 720p video quality. So, you can understand that processing this video should require good hardware; in particular, having a GPU configured should help.

Step 3¬†‚Äď Feeding generated frames into the Tiny YOLO model

Now that you know some properties of the clip, start generating the frames to be passed to the Tiny YOLO pre-trained model. First, look at a less important but transparent approach:

In the above code, you send each frame to the model. Then, you use the Mat class to represent each frame in an n-dimensional, dense, numerical multi-channel (that is, RGB) array.

In other words, you split the video clip into multiple frames and pass into the Tiny YOLO model to process them one by one. This way, you applied a single neural network to the full image.

Step 4¬†‚Äď Real Object detection¬†from image frames

Tiny YOLO extracts the features from each frame as an n-dimensional, dense, numerical multi-channel array. Then, each image is split into a smaller number of rectangles (boxes):

In the above code, the prepareImage() method takes video frames as images, parses them using the NativeImageLoader class, does the necessary preprocessing, and extracts image features that are further converted into a INDArray format, consumable by the model:

Then, the markWithBoundingBox() method is used for non-max suppression in the case of more than one bounding box.

Step 5¬†‚Äď Non-max suppression in case of more than one bounding box

As YOLO predicts more than one bounding box per object, non-max suppression is implemented; it merges all detections that belong to the same object. Therefore, instead of using bx, by, bh, and bw, you can use the top-left and bottom-right points. gridWidth and gridHeight are the number of small boxes you split your image into. In this case, it is 13 x 13, where w and h are the original image frame dimensions:

Finally, remove those objects that intersect with the max suppression, as follows:

In the second block, you scaled each image into 416 x 416 x 3 (that is, W x H x 3 RGB channels). This scaled image is then passed to Tiny YOLO for predicting and marking the bounding boxes as follows:

Your Tiny YOLO model predicts the class of an object detected in a bounding box

Once the markObjectWithBoundingBox() method is executed, the following logs containing the predicted class, bx, by, bh, bw, and confidence (that is, the detection threshold) will be generated and shown on the console:

Step 6¬†‚Äď Wrapping up everything and running the application

Up to this point, you know the overall workflow of your approach. You can now wrap up everything and see whether it really works. However, before this, take a look at the functionalities of different Java classes:

  • java: This shows how to grab frames from the video clip and save each frame as a JPEG image. Besides, it also shows some exploratory properties of the video clip.
  • java: This instantiates the Tiny YOLO model and generates the label. It also creates and marks the object with the¬†bounding¬†box. Nonetheless, it shows how to handle non-max suppression for more than one bounding box per object.
  • java: This main class continuously grabs the frames and feeds them to the Tiny YOLO model (until the user presses the¬†Esckey). Then, it predicts the corresponding class of each object successfully detected inside the normal or overlapped bounding boxes with non-max suppression (if required).

In short, first, you create and instantiate the Tiny YOLO model. Then, you grab the frames and treat each frame as a separate JPEG image. Next, you pass all the images to the model and the model does its trick as outlined previously. The whole workflow can now be depicted with some Java code as follows:

Once the preceding class is executed, the application should load the pre-trained model and the UI should be loaded, showing each object being classified:

Your Tiny YOLO model can predict multiple cars simultaneously from a video clip (day)

Now, to see the effectiveness of your model even in night mode, perform a second experiment on the night dataset. To do this, just change one line in the main() method, as follows:

Once the preceding class is executed using this clip, the application should load the pre-trained model and the UI should be loaded, showing each object being classified:

Your Tiny YOLO model can predict multiple cars simultaneously from a video clip (night)

Furthermore, to see the real-time output, execute the given screen recording clips showing the output of the application.

If you found this interesting, you can explore Md. Rezaul Karim’s Java Deep Learning Projects to build and deploy powerful neural network models using the latest Java deep learning libraries. Java Deep Learning Projects starts with an overview of deep learning concepts and then delves into advanced projects.

Leave a Reply