AI revealing specific details in image recognition based on a text input

I’m trying to perform the following:

  • I want to use a free AI API to perform automatic image recognition.
  • The AI must base its analysis on a provided text input that says which detail to focus on, and extract that specific detail from the image.

For example, if I provide the AI with an image of a playground full of joggers, with a person on a red bicycle in the top left and another on a green bicycle in the bottom right, I must also be able to provide it with the text “tell me the color of the bicycle you see in the top left”, and the output must be “red”.

Or, if it’s an image of a winter playground, I need to be able to tell the AI “look at the tree in the top left and recognize its characteristics: does it seem bare or lush?”, and the output should be “bare”.

Since I didn’t find any API that does what I’m asking for in one go, I tried to manually combine the textual input with the tags (labels) that image analysis by the Google Cloud Vision API provides.

The specific area I’m working on is recognizing the sex of certain birds from chromatic details of their bodies.

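from google.cloud import vision

# Client for the Google Cloud Vision API (requires credentials to be configured)
client = vision.ImageAnnotatorClient()

# Define a predefined list of compatible details (it's just an example)
compatible_details = ["yellow chest", "green chest", "red head", "blue head"]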

# Process the textual input to extract details
def process_textual_input(textual_input):
    extracted_details = []
    for detail in compatible_details:
        if detail in textual_input:
            extracted_details.append(detail)
    return extracted_details

# Analyze images and match the details
def analyze_image_with_details(image_path, extracted_details):
    # Use image analysis APIs to get labels
    # Use extracted details to focus only on interesting specifics
    with open(image_path, "rb") as image_file:
        content = image_file.read()

    image = vision.Image(content=content)

    # Analyze the image via Google cloud vision API
    response = client.label_detection(image=image)
    labels = response.label_annotations

    # Compare the extracted details with identified labels
    results = []
    for detail in extracted_details:
        for label in labels:
            if detail.lower() in label.description.lower():
                results.append((detail, "male" if "yellow" in detail.lower() else "female"))
    return results

# main
if __name__ == "__main__":
    # Example textual input
    textual_input = "Look at the chest of the bird. If it's yellow, then it's male."
    
    # Process textual input to get details
    extracted_details = process_textual_input(textual_input)
    
    # List of images to analyze
    images = ["image1.jpg", "image2.jpg", "image3.jpg"]
    
    # Analyze images with extracted details
    for image_path in images:
        results = analyze_image_with_details(image_path, extracted_details)
        print(f"Image: {image_path}")
        for detail, gender in results:
            print(f"Detail: {detail}, Gender: {gender}")
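
To make the limitation concrete, here is a minimal, self-contained sketch of the keyword step on its own, with no API call involved. It restates the list and the substring rule from my code; the two input sentences and the variable names `free_text` and `literal_text` are only illustrative.

```python
# Restates the example list and the literal substring rule used above.
compatible_details = ["yellow chest", "green chest", "red head", "blue head"]


def process_textual_input(textual_input):
    """Return every predefined detail that occurs verbatim in the text."""
    return [detail for detail in compatible_details if detail in textual_input]


# This sentence mentions "chest" and "yellow", but never the exact
# phrase "yellow chest", so the substring test finds nothing.
free_text = "Look at the chest of the bird. If it's yellow, then it's male."
print(process_textual_input(free_text))      # prints []

# Only wording that repeats a predefined phrase verbatim is picked up.
literal_text = "Does the bird have a yellow chest?"
print(process_textual_input(literal_text))   # prints ['yellow chest']
```

So with my own example sentence nothing is extracted at all, and the image analysis has no detail to look for.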

But of course, as you can see, I’m only statically extracting keywords from my text, and not letting the AI interpret and understand it.
How could I achieve this?

Maffe is a new contributor to this site.