How to Develop a Real-Time Object Detection Project

You can develop a video object classification application using a pre-trained YOLO model (that is, transfer learning), Deeplearning4j (DL4J), and OpenCV, which can detect labels such as cars and trees inside a video frame. You can find the relevant code files for this tutorial at https://github.com/PacktPublishing/Java-Deep-Learning-Projects/tree/master/Chapter06. This application is also about extending an image detection problem to video detection. Time to get started!

Step 1 – Loading a pre-trained YOLO model

Since the 1.0.0-alpha release, DL4J provides a Tiny YOLO model through its model zoo. To use it, add the following dependency to your Maven pom.xml file:

<dependency>
  <groupId>org.deeplearning4j</groupId>
  <artifactId>deeplearning4j-zoo</artifactId>
  <version>${dl4j.version}</version>
</dependency>

Apart from this, if possible, make sure that you utilize CUDA and cuDNN by adding the following dependencies:

<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-cuda-9.0-platform</artifactId>
  <version>${nd4j.version}</version>
</dependency>

<dependency>
  <groupId>org.deeplearning4j</groupId>
  <artifactId>deeplearning4j-cuda-9.0</artifactId>
  <version>${dl4j.version}</version>
</dependency>
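Whether the CUDA backend is actually picked up at runtime depends on which ND4J backend ends up on your classpath. As a minimal sanity check (this snippet is illustrative and not part of the project code), you can print the active backend before loading the model:

import org.nd4j.linalg.factory.Nd4j;

// Prints the ND4J backend resolved on the classpath, for example a CUDA
// backend if nd4j-cuda-9.0-platform is present, otherwise the native (CPU) one.
System.out.println("Active ND4J backend: " + Nd4j.getBackend().getClass().getName());

If this prints a CPU backend even though the CUDA dependencies are declared, double-check the dependency versions and your CUDA installation.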

Now, use the following code to load the pre-trained Tiny YOLO model as a ComputationGraph. The model's object classes come from the PASCAL Visual Object Classes (PASCAL VOC) dataset (see more at http://host.robots.ox.ac.uk/pascal/VOC/).

private ComputationGraph model;

private TinyYoloModel() {
    try {
        model = (ComputationGraph) new TinyYOLO().initPretrained();
        createObjectLabels();
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

In the above code segment, the createObjectLabels() method builds the labels from the PASCAL VOC dataset. The method looks as follows:

private HashMap<Integer, String> labels;

void createObjectLabels() {
    if (labels == null) {
        String label = "aeroplane\n" + "bicycle\n" + "bird\n" + "boat\n" + "bottle\n" + "bus\n" + "car\n" +
                "cat\n" + "chair\n" + "cow\n" + "diningtable\n" + "dog\n" + "horse\n" + "motorbike\n" +
                "person\n" + "pottedplant\n" + "sheep\n" + "sofa\n" + "train\n" + "tvmonitor";
        String[] split = label.split("\n");
        int i = 0;
        labels = new HashMap<>();
        for (String label1 : split) {
            labels.put(i++, label1);
        }
    }
}

Now, create a Tiny YOLO model instance:

static final TinyYoloModel yolo = new TinyYoloModel();

public static TinyYoloModel getPretrainedModel() {
    return yolo;
}

Take a look at the model architecture and the number of parameters in each layer:

TinyYoloModel model = TinyYoloModel.getPretrainedModel();
System.out.println(TinyYoloModel.getSummary());

Network summary and layer structure of a pre-trained Tiny YOLO model

Your Tiny YOLO model has around 1.6 million parameters across its 29-layer network. The original YOLOv2 model has more layers; you can look at its architecture at https://github.com/yhcc/yolo2/blob/master/model_data/model.png.
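The getSummary() helper used above is not shown in this excerpt. A minimal sketch, assuming it simply wraps DL4J's built-in ComputationGraph.summary() method, could look like this (the helper name is taken from the usage above, but its body is an assumption):

// Inside TinyYoloModel: a possible implementation of the summary helper.
public static String getSummary() {
    // ComputationGraph.summary() returns a textual, per-layer breakdown of the
    // network, including the parameter counts discussed above.
    return yolo.model.summary();
}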

Step 2 – Generating frames from video clips

To deal with real-time video, you can use video processing tools or frameworks such as JavaCV to split a video into individual frames and to retrieve the image height and width. For this, include the following dependency in the pom.xml file:

<dependency>
  <groupId>org.bytedeco</groupId>
  <artifactId>javacv-platform</artifactId>
  <version>1.4.1</version>
</dependency>

JavaCV uses wrappers from the JavaCPP presets of libraries commonly used by researchers in the field of computer vision (for example, OpenCV and FFmpeg). It provides utility classes to make their functionality easier to use on the Java platform, including Android.
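For example, JavaCV's converter classes make it a one-liner to move between its Frame type and OpenCV's Mat type; the same converter is used later in the detection loop. A small illustrative sketch (assuming a frame has already been grabbed):

import org.bytedeco.javacpp.opencv_core.Mat;
import org.bytedeco.javacv.Frame;
import org.bytedeco.javacv.OpenCVFrameConverter;

// Convert a grabbed Frame to an OpenCV Mat (and back) for further processing.
OpenCVFrameConverter.ToMat toMat = new OpenCVFrameConverter.ToMat();
Mat mat = toMat.convert(frame);        // Frame -> Mat
Frame converted = toMat.convert(mat);  // Mat -> Frame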

For this project, there are two video clips (each about 1 minute long) that give you a glimpse of an autonomous driving car. This dataset was downloaded from YouTube and, after downloading, the clips were renamed as follows:

  • SelfDrivingCar_Night.mp4
  • SelfDrivingCar_Day.mp4

When you play these clips, you'll see how Germans drive their cars at 160 km/h or even faster. Now, parse the first video (the daytime clip) and inspect some of its properties to get an idea of the video quality and the hardware requirements:

String videoPath = "data/SelfDrivingCar_Day.mp4";
FFmpegFrameGrabber frameGrabber = new FFmpegFrameGrabber(videoPath);
frameGrabber.start();

Frame frame;
double frameRate = frameGrabber.getFrameRate();
System.out.println("The inputted video clip has " + frameGrabber.getLengthInFrames() + " frames");
System.out.println("The inputted video clip has frame rate of " + frameRate);
>>>
The inputted video clip has 1802 frames
The inputted video clip has frame rate of 29.97002997002997

Now, grab each frame and use Java2DFrameConverter to convert it into a BufferedImage that can be saved as a JPEG image:

Java2DFrameConverter converter = new Java2DFrameConverter();

// grab the first frame
frameGrabber.setFrameNumber(1);
frame = frameGrabber.grab();
BufferedImage bufferedImage = converter.convert(frame);
System.out.println("First Frame" + ", Width: " + bufferedImage.getWidth() + ", Height: " + bufferedImage.getHeight());

// grab the second frame
frameGrabber.setFrameNumber(2);
frame = frameGrabber.grab();
bufferedImage = converter.convert(frame);
System.out.println("Second Frame" + ", Width: " + bufferedImage.getWidth() + ", Height: " + bufferedImage.getHeight());
>>>
First Frame, Width: 640, Height: 360
Second Frame, Width: 640, Height: 360

The above code prints the dimensions of the first two frames; extended to all the frames (a sketch of that loop follows the figure below), it generates 1,802 JPEG images, one per frame. Take a look at the generated images:

From video clip to video frame to image
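A minimal sketch of the loop that writes every frame to disk as a JPEG might look like the following (the output directory, file-naming scheme, and omitted exception handling are assumptions, not taken from the original project):

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

// Iterate over all frames and save each one as a JPEG file.
// "data/frames/" is an assumed output directory; create it beforehand.
for (int i = 1; i <= frameGrabber.getLengthInFrames(); i++) {
    frameGrabber.setFrameNumber(i);
    BufferedImage image = converter.convert(frameGrabber.grab());
    if (image != null) {
        ImageIO.write(image, "jpg", new File("data/frames/frame_" + i + ".jpg"));
    }
}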

Thus, the 1-minute-long video clip has a fair number of frames (roughly 60 seconds x 30 fps, that is, about 1,800 frames) at a resolution of 640 x 360. You can understand that processing this video requires good hardware; in particular, having a GPU configured should help.

Step 3 – Feeding generated frames into the Tiny YOLO model

Now that you know some properties of the clip, start generating the frames to be passed to the Tiny YOLO pre-trained model. First, look at a straightforward, if not the most efficient, approach:

private volatile Mat[] v = new Mat[1];
private String windowName = "Object Detection from Video";

try {
    for (int i = 1; i < frameGrabber.getLengthInFrames(); i += (int) frameRate) {
        frameGrabber.setFrameNumber(i);
        frame = frameGrabber.grab();
        v[0] = new OpenCVFrameConverter.ToMat().convert(frame);
        model.markObjectWithBoundingBox(v[0], frame.imageWidth,
                                        frame.imageHeight, true, windowName);
        imshow(windowName, v[0]);

        char key = (char) waitKey(20);
        // Exit on escape:
        if (key == 27) {
            destroyAllWindows();
            break;
        }
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    frameGrabber.stop();
}
frameGrabber.close();

In the above code, you use the Mat class to represent each frame as an n-dimensional, dense, numerical, multi-channel (that is, RGB) array, and then send each frame to the model.

In other words, you split the video clip into multiple frames and pass them to the Tiny YOLO model, which processes them one by one. In this way, you apply a single neural network to each full image.

Step 4 – Real object detection from image frames

Tiny YOLO extracts the features from each frame as an n-dimensional, dense, numerical, multi-channel array. Then, each image is split into a grid of smaller rectangles (boxes):

public void markObjectWithBoundingBox(Mat file, int imageWidth, int imageHeight,
                                      boolean newBoundingBox, String winName) throws Exception {
    // parameters matching the pre-trained TinyYOLO model
    int W = 416;     // width of the video frame
    int H = 416;     // height of the video frame
    int gW = 13;     // grid width
    int gH = 13;     // grid height
    double dT = 0.5; // detection threshold

    Yolo2OutputLayer outputLayer = (Yolo2OutputLayer) model.getOutputLayer(0);
    if (newBoundingBox) {
        INDArray indArray = prepareImage(file, W, H);
        INDArray results = model.outputSingle(indArray);
        predictedObjects = outputLayer.getPredictedObjects(results, dT);
        System.out.println("results = " + predictedObjects);
        markWithBoundingBox(file, gW, gH, imageWidth, imageHeight);
    } else {
        markWithBoundingBox(file, gW, gH, imageWidth, imageHeight);
    }
    imshow(winName, file);
}

In the above code, the prepareImage() method takes video frames as images, parses them using the NativeImageLoader class, does the necessary preprocessing, and extracts image features that are further converted into an INDArray format, consumable by the model:

INDArray prepareImage(Mat file, int width, int height) throws IOException {
    NativeImageLoader loader = new NativeImageLoader(height, width, 3);
    ImagePreProcessingScaler imagePreProcessingScaler = new ImagePreProcessingScaler(0, 1);
    INDArray indArray = loader.asMatrix(file);
    imagePreProcessingScaler.transform(indArray);
    return indArray;
}
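As a quick sanity check (illustrative only, not part of the original code), you can verify that the prepared array has the NCHW shape the model expects:

import java.util.Arrays;

// NativeImageLoader produces a [minibatch, channels, height, width] array,
// so a single 416 x 416 RGB frame should print [1, 3, 416, 416].
INDArray prepared = prepareImage(v[0], 416, 416);
System.out.println(Arrays.toString(prepared.shape()));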

Then, the markWithBoundingBox() method applies non-max suppression when there is more than one bounding box.

Step 5 – Non-max suppression in case of more than one bounding box

As YOLO predicts more than one bounding box per object, non-max suppression is implemented; it merges all detections that belong to the same object. Therefore, instead of using bx, by, bh, and bw, you can use the top-left and bottom-right points. gridWidth and gridHeight are the number of small boxes your image is split into; in this case, it is 13 x 13. w and h are the original image frame dimensions:

void markObjectWithBoundingBox(Mat file, int gridWidth, int gridHeight, int w, int h, DetectedObject obj) {
    double[] xy1 = obj.getTopLeftXY();
    double[] xy2 = obj.getBottomRightXY();
    int predictedClass = obj.getPredictedClass();

    int x1 = (int) Math.round(w * xy1[0] / gridWidth);
    int y1 = (int) Math.round(h * xy1[1] / gridHeight);
    int x2 = (int) Math.round(w * xy2[0] / gridWidth);
    int y2 = (int) Math.round(h * xy2[1] / gridHeight);

    rectangle(file, new Point(x1, y1), new Point(x2, y2), Scalar.RED);
    putText(file, labels.get(predictedClass), new Point(x1 + 2, y2 - 2),
            FONT_HERSHEY_DUPLEX, 1, Scalar.GREEN);
}

Finally, remove those detections that overlap too much with the maximum-confidence detection, as follows:

static void removeObjectsIntersectingWithMax(ArrayList<DetectedObject> detectedObjects,
                                             DetectedObject maxObjectDetect) {
    double[] bottomRightXY1 = maxObjectDetect.getBottomRightXY();
    double[] topLeftXY1 = maxObjectDetect.getTopLeftXY();
    List<DetectedObject> removeIntersectingObjects = new ArrayList<>();

    for (DetectedObject detectedObject : detectedObjects) {
        double[] topLeftXY = detectedObject.getTopLeftXY();
        double[] bottomRightXY = detectedObject.getBottomRightXY();

        double iox1 = Math.max(topLeftXY[0], topLeftXY1[0]);
        double ioy1 = Math.max(topLeftXY[1], topLeftXY1[1]);
        double iox2 = Math.min(bottomRightXY[0], bottomRightXY1[0]);
        double ioy2 = Math.min(bottomRightXY[1], bottomRightXY1[1]);

        double inter_area = (ioy2 - ioy1) * (iox2 - iox1);
        double box1_area = (bottomRightXY1[1] - topLeftXY1[1]) * (bottomRightXY1[0] - topLeftXY1[0]);
        double box2_area = (bottomRightXY[1] - topLeftXY[1]) * (bottomRightXY[0] - topLeftXY[0]);

        double union_area = box1_area + box2_area - inter_area;
        double iou = inter_area / union_area;

        if (iou > 0.5) {
            removeIntersectingObjects.add(detectedObject);
        }
    }
    detectedObjects.removeAll(removeIntersectingObjects);
}
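The markWithBoundingBox(file, gW, gH, imageWidth, imageHeight) driver that ties the two helpers above together is not reproduced in this excerpt. A minimal sketch, assuming a greedy non-max-suppression loop over the predictedObjects list (the structure is an assumption, not necessarily the author's exact implementation), might look like this:

// Greedy non-max suppression: repeatedly take the most confident detection,
// draw it, and discard any remaining detections that overlap it too much.
void markWithBoundingBox(Mat file, int gridWidth, int gridHeight, int w, int h) {
    if (predictedObjects == null) {
        return;
    }
    ArrayList<DetectedObject> detectedObjects = new ArrayList<>(predictedObjects);
    while (!detectedObjects.isEmpty()) {
        DetectedObject max = detectedObjects.get(0);
        for (DetectedObject candidate : detectedObjects) {
            if (candidate.getConfidence() > max.getConfidence()) {
                max = candidate;
            }
        }
        markObjectWithBoundingBox(file, gridWidth, gridHeight, w, h, max);
        detectedObjects.remove(max);
        removeObjectsIntersectingWithMax(detectedObjects, max);
    }
}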

In the prepareImage() method, you scaled each image to 416 x 416 x 3 (that is, W x H with three RGB channels). This scaled image is then passed to Tiny YOLO for predicting and marking the bounding boxes, as follows:

Your Tiny YOLO model predicts the class of an object detected in a bounding box

Once the markObjectWithBoundingBox() method is executed, logs containing the predicted class, bx, by, bh, bw, and the confidence score will be generated and shown on the console:

[4.6233e-11]], predictedClass=6),
DetectedObject(exampleNumber=0,
    centerX=3.5445247292518616, centerY=7.621537864208221,
    width=2.2568163871765137, height=1.9423424005508423,
    confidence=0.7954192161560059,
    classPredictions=[[ 1.5034e-7], [ 3.3064e-9]...

Step 6 – Wrapping up everything and running the application

Up to this point, you know the overall workflow of your approach. You can now wrap up everything and see whether it really works. However, before this, take a look at the functionalities of different Java classes:

  • The frame-grabbing class: This shows how to grab frames from the video clip and save each frame as a JPEG image. It also shows some exploratory properties of the video clip.
  • TinyYoloModel.java: This instantiates the Tiny YOLO model and generates the labels. It also creates and marks each detected object with a bounding box, and shows how to handle non-max suppression when there is more than one bounding box per object.
  • ObjectDetectorFromVideo.java: This main class continuously grabs the frames and feeds them to the Tiny YOLO model (until the user presses the Esc key). It then predicts the corresponding class of each object successfully detected inside the normal or overlapping bounding boxes, applying non-max suppression where required.

In short, first, you create and instantiate the Tiny YOLO model. Then, you grab the frames and treat each frame as a separate JPEG image. Next, you pass all the images to the model and the model does its trick as outlined previously. The whole workflow can now be depicted with some Java code as follows:

// ObjectDetectorFromVideo.java
public class ObjectDetectorFromVideo {
    private volatile Mat[] v = new Mat[1];
    private String windowName;

    public static void main(String[] args) throws java.lang.Exception {
        String videoPath = "data/SelfDrivingCar_Day.mp4";
        TinyYoloModel model = TinyYoloModel.getPretrainedModel();

        System.out.println(TinyYoloModel.getSummary());
        new ObjectDetectorFromVideo().startRealTimeVideoDetection(videoPath, model);
    }

    public void startRealTimeVideoDetection(String videoFileName, TinyYoloModel model)
            throws java.lang.Exception {
        windowName = "Object Detection from Video";
        FFmpegFrameGrabber frameGrabber = new FFmpegFrameGrabber(videoFileName);
        frameGrabber.start();

        Frame frame;
        double frameRate = frameGrabber.getFrameRate();
        System.out.println("The inputted video clip has " + frameGrabber.getLengthInFrames() + " frames");
        System.out.println("The inputted video clip has frame rate of " + frameRate);

        try {
            for (int i = 1; i < frameGrabber.getLengthInFrames(); i += (int) frameRate) {
                frameGrabber.setFrameNumber(i);
                frame = frameGrabber.grab();
                v[0] = new OpenCVFrameConverter.ToMat().convert(frame);
                model.markObjectWithBoundingBox(v[0], frame.imageWidth, frame.imageHeight,
                                                true, windowName);
                imshow(windowName, v[0]);

                char key = (char) waitKey(20);
                // Exit on escape:
                if (key == 27) {
                    destroyAllWindows();
                    break;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            frameGrabber.stop();
        }
        frameGrabber.close();
    }
}

Once the preceding class is executed, the application should load the pre-trained model and the UI should be loaded, showing each object being classified:

Your Tiny YOLO model can predict multiple cars simultaneously from a video clip (day)

Now, to see the effectiveness of your model even in night mode, perform a second experiment on the night dataset. To do this, just change one line in the main() method, as follows:

String videoPath = "data/SelfDrivingCar_Night.mp4";

Once the preceding class is executed using this clip, the application should load the pre-trained model and the UI should be loaded, showing each object being classified:

Your Tiny YOLO model can predict multiple cars simultaneously from a video clip (night)

Furthermore, to see the real-time output in action, watch the screen-recording clips provided with the project, which show the output of the application.

If you found this interesting, you can explore Md. Rezaul Karim’s Java Deep Learning Projects to build and deploy powerful neural network models using the latest Java deep learning libraries. Java Deep Learning Projects starts with an overview of deep learning concepts and then delves into advanced projects.
