I’m currently working on creating an LLM-agent that can provide insights from a complex database. The database includes several columns of different types (datetime, numeric, and text).
For simplicity, let’s assume I have a table containing information about reports with three columns: Date
, Category
, and Description
. Let’s also assume that the number of categories is indefinite.
For text-related queries, semantic search has me covered. For analytical queries, such as filtering data for a time range, I could use text-to-SQL or an agent.
My problem arises when facing queries like “How many incidents related to falling off a roof happened last year?”. This requires to split the task into three sub-tasks:
- filter date
- find relevant reports according to the Description AND the Category (AND N columns)
- sum up the results
Let me present my initial approach:
Semantic Search + SQL filtering
-
Filter date
-
Filter Category with LIKE
-
Semantic Search over Description
Covers more ground that using pure sql filtering, but it can still be inaccurate. My current problems are:
-
Problem #1: This is about something very specific: counting reports. The semantic search depends on the threshold set for the similarity/distance.
-
Problem #2: The Category will not be fully captured with a LIKE operation.
For Problem #2 I could add the Category to the description and leave that for the semantic search, but it might also add some noise to the similarity computation, as we can see here:
from scipy.spatial.distance import cosine
# LangChain Ollama embeddings
eq1 = embeddings_model.embed_query("Last night I fell off my bed when I was having a nightmare and I felt really bad")
eq2 = embeddings_model.embed_query("the other day I slipped over some ice and I fell to the ground")
eq3 = embeddings_model.embed_query("I fall down all the time.")
# Cosine similarity varies with the rest of context,
# even if all three sentences talk about falling
cosine(eq1, eq2), cosine(eq1, eq3), cosine(eq2, eq3)
> > > (0.5038163339975597, 0.6419394542874378, 0.4899502476580482)
For Problem #1 I guess it depends on the threshold for the similarity as well, but then we would be dealing with a precision/recall scenario (counting irrelevant reports vs excluding relevant reports)
I’ve been scratching my head over this problem for a couple of weeks now, and I can’t seem to find a solid way around it.
Has anyone faced a similar issue or have any ideas on how to effectively combine these techniques? Any insights or suggestions would be greatly appreciated.