How to Combine Semantic Search with SQL Analytical Queries?

  Kiến thức lập trình

I’m currently working on creating an LLM-agent that can provide insights from a complex database. The database includes several columns of different types (datetime, numeric, and text).

For simplicity, let’s assume I have a table containing information about reports with three columns: Date, Category, and Description. Let’s also assume that the number of categories is indefinite.

For text-related queries, semantic search has me covered. For analytical queries, such as filtering data for a time range, I could use text-to-SQL or an agent.

My problem arises when facing queries like “How many incidents related to falling off a roof happened last year?”. This requires to split the task into three sub-tasks:

  • filter date
  • find relevant reports according to the Description AND the Category (AND N columns)
  • sum up the results

Let me present my initial approach:

Semantic Search + SQL filtering

  • Filter date

  • Filter Category with LIKE

  • Semantic Search over Description

Covers more ground that using pure sql filtering, but it can still be inaccurate. My current problems are:

  • Problem #1: This is about something very specific: counting reports. The semantic search depends on the threshold set for the similarity/distance.

  • Problem #2: The Category will not be fully captured with a LIKE operation.

For Problem #2 I could add the Category to the description and leave that for the semantic search, but it might also add some noise to the similarity computation, as we can see here:

from scipy.spatial.distance import cosine

# LangChain Ollama embeddings

eq1 = embeddings_model.embed_query("Last night I fell off my bed when I was having a nightmare and I felt really bad")
eq2 = embeddings_model.embed_query("the other day I slipped over some ice and I fell to the ground")
eq3 = embeddings_model.embed_query("I fall down all the time.")

# Cosine similarity varies with the rest of context,

# even if all three sentences talk about falling

cosine(eq1, eq2), cosine(eq1, eq3), cosine(eq2, eq3)

> > > (0.5038163339975597, 0.6419394542874378, 0.4899502476580482)

For Problem #1 I guess it depends on the threshold for the similarity as well, but then we would be dealing with a precision/recall scenario (counting irrelevant reports vs excluding relevant reports)

I’ve been scratching my head over this problem for a couple of weeks now, and I can’t seem to find a solid way around it.

Has anyone faced a similar issue or have any ideas on how to effectively combine these techniques? Any insights or suggestions would be greatly appreciated.

Theme wordpress giá rẻ Theme wordpress giá rẻ Thiết kế website