I have a compacted Kafka topic that holds the latest representation of each entity, and those entities participate in a many-to-many relationship that I'd like to invert.
An example would be a topic of Author objects where the topic key is the Author.id (AAA) and the value is an array of `Book` identifier values:
"AAA" -> {"books": [456]}
When an Author writes a new Book with an ID of 333, a new event with the same key is written to the stream with the updated list of books:
"AAA" -> {"books": [456, 333]}
It is also possible that a Book had multiple Authors so that same Book identifier could appear in another event:
"BBB" -> {"books": [333, 555]}
I'd like to invert this using kafka streams into a stream of Books -> [Author], so the above events would result in something like:
456 -> {"authors": ["AAA"]}
333 -> {"authors": ["AAA", "BBB"]}
555 -> {"authors": ["BBB"]}
When I start my app up again, I want the state to be restored such that if I read in another Author record it will invert the relationship appropriately. So this:
"CCC" -> {"books": [555]}
would know that "BBB" was also an Author and would emit the updated event:
555 -> {"authors": ["BBB", "CCC"]}
I've been eyeing the GlobalKTable which reads in the full topic state locally, but can't figure out how to get it to invert the relationship and aggregate the values together.
If I could, I think I could join that GlobalKTable with a stream of the events and get the full list of Authors for each Book.
You don't need a GlobalKTable to achieve this.
In Kafka Streams, the internal data redistribution caused by changing the key happens automatically. For example:
orgKStream
    .flatMapValues(author -> author.getBooks())           // (1)
    .map((k, v) -> new KeyValue<>(v, k))                   // (2)
    .groupByKey()                                          // (3)
    .aggregate(/* aggregate the author list */)            // (4)
    .toStream().to(/* sink topic */);                      // (5)
(1) will change your original topic like below.
<before>
"AAA" -> {"books": [456, 333]}
"BBB" -> {"books": [333, 555]}
<after>
"AAA" -> 456
"AAA" -> 333
"BBB" -> 333
"BBB" -> 555
(2) will swap the key and the value.
<after>
456 -> "AAA"
333 -> "AAA"
333 -> "BBB"
555 -> "BBB"
(3) and (4) will aggregate the authors per book and produce a KTable (backed by a state store).
<after>
456 -> {"authors": ["AAA"]}
333 -> {"authors": ["AAA", "BBB"]}
555 -> {"authors": ["BBB"]}
(5) will write every update of the table to the given sink topic.
Now you have a new topic with the book id as key and the author list as value. If you want the whole result available in one place, just create a GlobalKTable from it like below.
streamsBuilder.globalTable(<sink topic>)
When (2) (map) is followed by (3) (groupByKey), internal data redistribution through a repartition topic takes place. This means that every record with the same book id as its key is published to the same partition of the internal repartition topic. As a result, you will not lose any data during the aggregation.
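Putting the pieces together, here is a minimal sketch of that topology in the Java DSL. The topic names, the AuthorBooks/BookAuthors value classes, and the Serdes (authorBooksSerde, bookAuthorsSerde) are placeholders assumed for illustration, not part of the original question:
```
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

StreamsBuilder builder = new StreamsBuilder();

// "author-books" holds Author.id -> {"books": [...]}. AuthorBooks#getBooks()
// returns the list of book ids, BookAuthors#withAuthor(id) returns the
// aggregate with the author id added (both are assumed helper classes).
KTable<Long, BookAuthors> bookAuthors = builder
        .stream("author-books", Consumed.with(Serdes.String(), authorBooksSerde))
        // (1) one record per (author, book) pair
        .flatMapValues(author -> author.getBooks())
        // (2) make the book id the key and the author id the value
        .map((authorId, bookId) -> new KeyValue<>(bookId, authorId))
        // (3) repartition so all records for the same book land on the same task
        .groupByKey(Grouped.with(Serdes.Long(), Serdes.String()))
        // (4) collect the author ids per book into a state store
        .aggregate(
                BookAuthors::new,
                (bookId, authorId, agg) -> agg.withAuthor(authorId),
                Materialized.with(Serdes.Long(), bookAuthorsSerde));

// (5) publish every update to the sink topic
bookAuthors.toStream()
        .to("book-authors", Produced.with(Serdes.Long(), bookAuthorsSerde));
```
Because the aggregation's state store is backed by a changelog topic, the accumulated author lists are restored when the application restarts, so a later "CCC" -> {"books": [555]} record will correctly emit 555 -> {"authors": ["BBB", "CCC"]}.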
Related
I want to perform a left join against an up-to-date KTable.
I have a KStream reading from input topic A and a materialized KTable reading from input topic B. I perform a KStream-KTable left join and write the join result back to topic B so that my KTable gets updated.
The problem is that when multiple messages come in quickly, the left join runs against the old table state before it has been updated, so data is lost. Below is a diagram of my topology.
Topic A --> KStream \
                     left join --> write to Topic B --+
Topic B --> KTable  /                                 |
               ^                                      |
               |______________________________________|
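For reference, the topology in that diagram would look roughly like this in the Streams DSL; the topic names, the String types, and the computeNewState joiner are placeholders assumed here, not taken from the actual application:
```
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// Topic B feeds the KTable that holds the current state
KTable<String, String> tableB = builder.table("topic-B");

// Topic A feeds the stream of incoming messages
KStream<String, String> streamA = builder.stream("topic-A");

streamA
        // left join each incoming record against the table's current state
        .leftJoin(tableB, (message, currentState) -> computeNewState(message, currentState))
        // write the result back to topic B, which in turn updates the KTable
        .to("topic-B");
```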
Expected:
stream_message1 with state A --> get new state B --> update kTable
stream_message2 with state B --> get new state C --> update kTable
Actual:
stream_message1 with state A --> get new state B
stream_message2 with state A --> get totally different state D
Is there any way to make all of this synchronous?
I have three tables in my database that are connected in a many-to-many relation.
They are called Asset, MeterPoint and AssetMeterPointMember.
When performing this query I get an unexpected result:
query
{
  meterPoints
  {
    id,
    eic,
    assetMeterPointMembers (79 keys)
    {
      asset (80 keys)
      {
        id,
        assetMeterPointMembers (13 keys)
        {
          meterPoint (46 keys)
          {
            id,
            eic
          }
        }
      }
    }
  }
}
Overview of query results
The image shows the relations when fetching from the database and when fetching with GraphQL.
Database
There are 79 MeterPoints.
53 of those have Assets connected through AssetMeterPointMember.
There are 88 Assets and 80 of those are connected to a MeterPoint.
GraphQL
Keys = primary keys for the database.
-> = sent to a GroupDataLoader.
79 keys -> AssetMeterPointMember GroupDataLoader in MeterPoint (Correct)
80 keys -> Asset GroupDataLoader in AssetMeterPointMember (correct)
13 keys -> AssetMeterPointMember GroupDataLoader in Asset (Incorrect)
The 13 keys are always fetched from the newest data, but some keys are left out, so I can't see any pattern there.
46 keys -> MeterPoint GroupDataLoader from AssetMeterPointMember (Incorrect)
Expected
I would expect at least the same number of keys going back up the hierarchy.
Can someone please explain what might cause this?
Let me break down the problem; it will take some time.
Consider that you have entities A, B, and C in your system.
A is the parent of everything
B is the child of A
C can be a child of A or of B. Note that there are more entities like D, E, and F that behave the same as C, so let's consider only C for the time being.
So basically it's a tree-like structure:
```
      A
     / \
    /   \
   B     C  (there are similar elements like D, E, F)
   |
   |
   C
```
Now we are using Elasticsearch as a secondary DB to store this. In the primary database the structure is completely different: since A, B, and C have dynamic fields, they are separate tables and we join them to get the data. But from a business perspective, this is the design.
Now, when we try to flatten it and store it in ES for the set below, we have an entity A1 that has two children, C1 and B1, and B1 has a further child, C2:
    A     B     C
1   A1    null  null
2   A1    null  C1
3   A1    B1    null
4   A1    B1    C2
Now, what can you query?
The user says he wants all columns of A, B, and C where the value of column A is A1. After applying some null-removal rules, we can give him rows 2, 3, and 4.
Now the problem: the user says he wants all As where the value of A is A1. Basically we will return him rows 1, 2, 3, 4 (or 2, 3, 4), so he will see values like:
A
A1
A1
A1
but logically he should see only a single A1, since that is the only unique value, and ES doesn't have the ability to group by like that.
So how did we solve this?
We solved it by creating multiple indices plus one nested index.
When we need to group, we go to the nested index; the other indices work as flat indices.
So we have different indices, like an index for A and B, another for A or B and C. But we have more elements, so it led to the creation of 5 indices.
As the data keeps growing, it is becoming difficult to maintain 5 indices, and reindexing them from scratch takes too much time.
To solve this we started looking at other options and we are testing CrateDB. But first we are still trying to figure out whether there is any way to do this in ES, since we need many ES features such as percolation, watchers, etc. Any clues on that?
Please also note that we need to apply pagination as well; that's why a single nested index will not work.
I have a custom Pig loader like this:
A = LOAD 'myfile' USING myudf_loader();
A contains:
((key1, val1), (key2, val2), (key3, val3), ...)
That is, A has an outer tuple that contains key-value pairs stored in inner tuples.
I am not using maps because maps require that key values within a relation must be unique. The keys I have do not necessarily have to be unique.
The keys are chararrays, while the values can be chararrays, ints, and floats.
I would like to access A's inner tuples, as well as the (key, value) pairs within those tuples.
For example, I want to FILTER the keys of A such that the only fields remaining are key = "city" and value = "New York City".
Example input:
DUMP A;
(("city", "New York City"), ("city", "Boston"),
("city", "Washington, D.C."), ("non-city-key", "non-city-value"),
("city", "New York City"), ("non-city-key", "non-city-value"))
Example output of the filtering, which is stored into B:
DUMP B;
("city", "New York City")
("city", "New York City")
I don't have your full Pig Latin script.
But you can achieve this using the idea below:
grouped_records = GROUP records BY key;
filtered_records = FILTER grouped_records BY group == 'city';
DUMP filtered_records;
Cheers
Nag
I have 3 tables: A, B and C.
Table A is in an (n:1) relation with both B and C.
Typically I store in A the B.Id (or the C.Id) together with the table name.
e.g.
A.ParentId = 1
A.TableName = "B"
A.ParentId = 1
A.TableName = "C"
A.ParentId = 2
A.TableName = "B"
Is it a good solution? Are there any other solutions?
Why not two ParentId columns?
A.ParentIdB = 1
A.ParentIdC = 3
Another possibility is to introduce another table, Content (D), that serves as a "supertype" to Posts and Images. Then a row in Comments (A) would reference a primary key in Content, as would each row in Posts (B) and Images (C). Any common fields in Posts and Images would be moved to Content (perhaps "title" or "date"), and those original tables would then only contain information specific to a post or image (perhaps "body" or "resolution"). This makes joins easier than storing table names in a field, but it does mean that a real-world entity could be both a post and an image (or indeed, several posts or images at once!). Really, though, it depends on the situation that you're trying to model.