From what I've researched, a live streaming pipeline flows like this:

IP Camera -> Live Encoder -> Media Server -> Delivery Server (CDN) -> Video Player -> Client

I want to add a feature that uses an AI inference server to run object detection on, and monitor, the incoming video from the IP camera.

The inference server is currently being developed with FastAPI. Can I pass frames to the inference server at one of these stages, overlay the inference results on the video, and deliver the combined output to the client?

My idea is that the request to the API carries a base64-encoded image; the inference server decodes it, performs object detection, and responds with bounding-box coordinates and object classes.
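Roughly, this is the endpoint shape I have in mind (just a minimal sketch; the route and field names are placeholders, and `run_detection()` stands in for whatever model I end up wrapping):

```python
import base64

import cv2
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class FrameRequest(BaseModel):
    image_b64: str  # base64-encoded JPEG/PNG bytes

class Detection(BaseModel):
    label: str       # object type, e.g. "person"
    score: float     # confidence
    box: list[int]   # [x1, y1, x2, y2] in pixels

def run_detection(frame: np.ndarray) -> list[Detection]:
    # Placeholder: call the actual detection model (YOLO, SSD, ...) here.
    return []

@app.post("/detect", response_model=list[Detection])
def detect(req: FrameRequest):
    raw = base64.b64decode(req.image_b64)
    frame = cv2.imdecode(np.frombuffer(raw, np.uint8), cv2.IMREAD_COLOR)
    return run_detection(frame)
```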

Please advise me if anything is missing from my thought process, and at which stage of the pipeline I should make the request/response round trip.

Currently, I am developing an Object Detection API that receives Base64-encoded image data, and I want to make it as close to real time as possible.
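For reference, this is the kind of producer loop I imagine on the camera/encoder side: grab a frame, POST it as base64, draw the returned boxes, and push the annotated frame onward. Everything here (the RTSP URL, the endpoint, sampling every 5th frame to keep the round trip manageable) is just an illustrative sketch, not a fixed design:

```python
import base64

import cv2
import requests

RTSP_URL = "rtsp://camera.local/stream"            # placeholder camera URL
API_URL = "http://inference-server:8000/detect"    # placeholder endpoint

cap = cv2.VideoCapture(RTSP_URL)
frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame_idx += 1
    if frame_idx % 5:              # only send every 5th frame
        continue
    ok, jpeg = cv2.imencode(".jpg", frame)
    if not ok:
        continue
    payload = {"image_b64": base64.b64encode(jpeg.tobytes()).decode()}
    detections = requests.post(API_URL, json=payload, timeout=1.0).json()
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, det["label"], (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    # The annotated `frame` would then go back into the encoder / media server.
```

One thing I'm aware of is that base64 over HTTP adds roughly a third to the payload size compared with raw JPEG bytes, which is part of why I'm unsure this is the right way to keep things real-time.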