Folks,

I am wondering if there is any advantage or disadvantage to using JSON objects as message keys.

Assume I have 2 event streams coupled with 2 tables in a source system. Each of those tables has its own business key:

  1. The Product table has product_number and base_product_id as its business key in the source system (key: product_number, base_product_id)

  2. The Invoice table has invoice_id as its business key, and prod_id and prod_num as foreign keys that point to the Product table in the source system.

I’d like to enrich my Invoice stream with records from a GlobalKTable built on top of my Product event stream, in a Kafka Streams application, by applying an inner join between the two.

I can think of 3 ways to configure my data producers for assigning keys:

  1. concatenate the value of the keys. e.g.: key=valueOf(prod_id).concat(prod_num)

  2. define key as JSON object with schema maintained in schema registry. e.g.: key={"prod_id": "AAM64", "prod_num": "334"}

  3. use a hash function to construct the key. e.g.: key=hashFunction(valueOf(prod_id).concat(prod_num))

Option 2 forces me to use the same key structure and field order in both my Product and Invoice event streams, because the field names are part of the key; the join will fail if the field names do not match.

Any recommendation as to which approach would make sense is highly appreciated.

Options 1 & 2 are essentially equivalent, with 1 being better since it retains the two columns of your compound PK. Any RDBMS will certainly let you model this as a compound PK. The single JSON string could be a PK, but it seems less convenient. It also seems a bit fragile, as listing prod_num before prod_id would break things.
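
To make the field-order fragility concrete, here is a minimal Python sketch. It assumes the key is serialized as plain JSON text and that a join lookup ultimately has to match the serialized key bytes; a schema-registry serializer would produce different bytes, so treat this only as an illustration of the idea. The values come from the question, and `canonical_key` is a hypothetical helper, not part of any library.

```python
import json


def canonical_key(prod_id: str, prod_num: str) -> bytes:
    """Hypothetical helper: serialize the compound key with sorted field
    names and fixed separators so every producer emits identical bytes."""
    return json.dumps(
        {"prod_id": prod_id, "prod_num": prod_num},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")


# Same logical key, different field order: the serialized bytes differ.
key_a = json.dumps({"prod_id": "AAM64", "prod_num": "334"}).encode("utf-8")
key_b = json.dumps({"prod_num": "334", "prod_id": "AAM64"}).encode("utf-8")
print(key_a == key_b)  # False

# A canonical form removes the dependence on field order.
print(canonical_key("AAM64", "334"))  # b'{"prod_id":"AAM64","prod_num":"334"}'
```

If you do go with JSON keys, funnelling every producer through one such canonicalizing function is one way to keep the Product and Invoice keys byte-identical.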

I would only recommend option 3, hashing, if you’re willing to have such hashes accompany all records and become the PK. Otherwise it seems like a debugging nightmare. “I’m looking at some new hash, wonder which record it was supposed to go with?” Plus, due to the birthday paradox, you will need to hang on to quite a few bits of hash to make the risk of collisions acceptably low.
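
For a rough feel of how many bits "quite a few" means, here is a small sketch of the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / (2 · 2^b)), for n keys and a b-bit hash. It assumes the hash output is uniformly distributed; the figure of one million keys is an arbitrary illustration, not something from the question.

```python
import math


def collision_probability(n_keys: int, hash_bits: int) -> float:
    """Approximate probability of at least one collision among n_keys
    uniformly random hash values that are hash_bits bits wide."""
    space = 2.0 ** hash_bits
    # 1 - exp(-n(n-1) / (2 * space)); expm1 keeps precision for tiny values.
    return -math.expm1(-n_keys * (n_keys - 1) / (2.0 * space))


# Assumed workload: one million distinct product keys.
for bits in (32, 64, 128):
    print(bits, collision_probability(1_000_000, bits))
```

Under these assumptions a 32-bit hash is almost certain to collide at a million keys, while 64 bits brings the risk down to the order of 1e-8 and 128 bits makes it negligible.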

There is a problem with option 1.

key=valueOf(prod_id).concat(prod_num)

For one item, valueOf(prod_id) is 1 and prod_num is 112.

For another item, valueOf(prod_id) is 11 and prod_num is 12.

These 2 items will end up with the same key, 1112. The solution is to use a separator character that cannot appear in either of the concatenated values, e.g.

key=valueOf(prod_id).concat(":").concat(prod_num)
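
The collision and the fix can be checked with a short Python sketch. The function names are hypothetical, and the guard against the separator appearing inside a value is an added assumption about how you might enforce the "cannot be contained" condition.

```python
def naive_key(prod_id: str, prod_num: str) -> str:
    """Plain concatenation, as in option 1."""
    return prod_id + prod_num


def separated_key(prod_id: str, prod_num: str, sep: str = ":") -> str:
    """Concatenation with a separator that must not occur in either part."""
    if sep in prod_id or sep in prod_num:
        raise ValueError(f"separator {sep!r} appears inside a key part")
    return f"{prod_id}{sep}{prod_num}"


# Two different products collapse to the same key without a separator.
print(naive_key("1", "112") == naive_key("11", "12"))  # True

# With a separator the keys stay distinct.
print(separated_key("1", "112"), separated_key("11", "12"))  # 1:112 11:12
```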