Delete data from a KTable that uses a custom StreamPartitioner - apache-kafka-streams

We have a Kafka topic product-update-events which contains data about product updates and their variations.
We aggregate these events into a products KTable using the Kafka Streams 'aggregate' function.
For every product in this products KTable we then want to calculate the 'best' variation (e.g. one of the product's variations, chosen by some criteria).
These 'best' variations are then written to another KTable and to a Kafka topic.
We only want to emit a best-variation update when the best variation has actually changed because of a product update. Therefore we use a custom transformer which checks the current best variation in its state store.
The product-update events and the products table have the 'productId' as key and are partitioned by it. The best-variation records have the 'variationId' as key. We use a custom StreamPartitioner to also partition these records by productId, so that each Kafka Streams application instance has the matching product and best-variation data:
{ _, _, variation, numPartitions -> Utils.toPositive(Utils.murmur2(StringSerializer().serialize("", variation.productId))) % numPartitions }
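For illustration only, wiring such a partitioner into the sink could look roughly like the following Java sketch (the BestVariation type, its getProductId() accessor, the serde and the stream variable are assumptions, not taken from the original post):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.common.utils.Utils;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.processor.StreamPartitioner;

// Partition best-variation records by the productId carried inside the value,
// mirroring the Kotlin lambda above.
StreamPartitioner<String, BestVariation> byProductId =
        (topic, variationId, variation, numPartitions) ->
                Utils.toPositive(Utils.murmur2(
                        new StringSerializer().serialize("", variation.getProductId()))) % numPartitions;

// bestVariationStream is assumed to be a KStream<String, BestVariation> built elsewhere.
bestVariationStream.to("best-variation-per-article",
        Produced.with(Serdes.String(), bestVariationSerde, byProductId));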
Now we come to the actual question :)
We want to delete the best variation when we receive a 'delete' product-update event. To do that we need to set the payload of the best-variation record to 'null'. But then our custom partitioner has no information about the productId this record belongs to.
Do you have any suggestion on how to solve this?
Our topology is as follows:
Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [product-update-events])
      --> KSTREAM-AGGREGATE-0000000002
    Processor: KSTREAM-AGGREGATE-0000000002 (stores: [KSTREAM-AGGREGATE-STATE-STORE-0000000001])
      --> KTABLE-TOSTREAM-0000000003
      <-- KSTREAM-SOURCE-0000000000
    Processor: KTABLE-TOSTREAM-0000000003 (stores: [])
      --> KSTREAM-SINK-0000000004
      <-- KSTREAM-AGGREGATE-0000000002
    Sink: KSTREAM-SINK-0000000004 (topic: products)
      <-- KTABLE-TOSTREAM-0000000003

   Sub-topology: 1
    Source: KSTREAM-SOURCE-0000000007 (topics: [products])
      --> KSTREAM-TRANSFORM-0000000008
    Source: KSTREAM-SOURCE-0000000005 (topics: [best-variation-per-article])
      --> KTABLE-SOURCE-0000000006
    Processor: KSTREAM-TRANSFORM-0000000008 (stores: [best-variation-per-article])
      --> KSTREAM-SINK-0000000009
      <-- KSTREAM-SOURCE-0000000007
    Sink: KSTREAM-SINK-0000000009 (topic: best-variation-per-article)
      <-- KSTREAM-TRANSFORM-0000000008
    Processor: KTABLE-SOURCE-0000000006 (stores: [best-variation-per-article])
      --> none
      <-- KSTREAM-SOURCE-0000000005

You will want to use a tombstone: basically, it is a record with the same key and a null value, and it will cause the store to drop the entry with that key.
This is a pretty decent example that includes deletion.
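As a separate, purely hypothetical sketch (all type names, accessors, the delete flag and the store keying are assumptions), the transformer could forward such a tombstone when it sees the delete; note that the custom partitioner would still need to cope with the null value, which is exactly the open question here:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class BestVariationTransformer
        implements Transformer<String, Product, KeyValue<String, BestVariation>> {

    private KeyValueStore<String, BestVariation> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<String, BestVariation>) context.getStateStore("best-variation-per-article");
    }

    @Override
    public KeyValue<String, BestVariation> transform(String productId, Product product) {
        // A deleted product arrives either as a tombstone from the products topic
        // or carries a hypothetical delete flag.
        if (product == null || product.isDeleted()) {
            BestVariation current = store.get(productId);
            store.delete(productId);
            if (current != null) {
                // tombstone: the previously emitted variationId key with a null value
                return KeyValue.pair(current.getVariationId(), null);
            }
            return null;
        }
        // ... regular "compute best variation and compare with the store" logic ...
        return null;
    }

    @Override
    public void close() {
    }
}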

Related

Netsuite Saved Search Criteria Formula with 2 fields

I'm trying to build a saved search filter with a Formula (Text) criterion based on 2 fields: an item field and a custom field.
How can I write the correct formula text in the criteria to create a filter that finds a specific word in these 2 fields?
E.g.: I have items named with a special nomenclature (SERIAL NUMBER_CODE), and I also created a custom field on the journal entry line called SERIAL NUMBER_CODE_RELATED (not an item). Now I need the saved search to find all types of transactions with a specific SERIAL NUMBER_CODE, plus journal entries that have that SERIAL CODE in the line, and also to add a filter where a user can type the SERIAL CODE and get back both transactions and journals.
I used this formula (text) in criteria:
CASE WHEN {custom_field} = 'SERIALCODE' OR {item} = 'SERIALCODE' THEN '1' ELSE '0' END
IS = space
Type = all kinds of NetSuite transactions
In the Available Filters tab I added Formula (Text), shown in the filter region.
The result doesn't return anything.
Thank you
Try
formulatext NVL({custom_field},{item}) is %
Or
formulatext {custom_field}||' '||{item} contains space

Add a field to existing documents (over a million records)

Scenario
We have over 5 million documents in a bucket, all of which have nested JSON with a simple UUID key. We want to add one extra field to ALL of the documents.
Example
ee6ae656-6e07-4aa2-951e-ea788e24856a
{
  "field1": "data1",
  "field2": {
    "nested_field1": "data2"
  }
}
After adding extra field
ee6ae656-6e07-4aa2-951e-ea788e24856a
{
  "field1": "data1",
  "field3": "data3",
  "field2": {
    "nested_field1": "data2"
  }
}
It has only one Primary Index: CREATE PRIMARY INDEX idx FOR bucket.
Problem
It takes ages. We tried it with N1QL: UPDATE bucket SET field3 = data3. We also tried sub-document mutation. But all of it takes hours. It's written in Go, so we could put it into goroutines, but it still takes too much time.
Question
Is there any solution to reduce that time?
Since you need to add a new field rather than modify an existing one, it is better to use the SDK's sub-document (SUBDOC) API than a N1QL UPDATE (which is a whole-document update and requires fetching the document).
The best option will be: use N1QL to get the document keys, then use the SDK SUBDOC API to add the field you need. You can use the reactive API (asynchronously).
You have 5M documents and a primary index, so use the following keyset pagination:
val = ""
In a loop:
    SELECT RAW META().id FROM mybucket WHERE META().id > $val LIMIT 10000;
    SDK SUBDOC update for each returned key
    val = last META().id value from the SELECT
https://blog.couchbase.com/offset-keyset-pagination-n1ql-query-couchbase/
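As a rough illustration of that loop (a sketch only; it uses the Couchbase Java SDK 3.x for consistency with the other snippets here, although the question is about Go, and the connection details and bucket name are assumptions):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.MutateInSpec;
import com.couchbase.client.java.query.QueryOptions;
import com.couchbase.client.java.query.QueryResult;
import java.util.Collections;
import java.util.List;

public class AddFieldToAllDocs {
    public static void main(String[] args) {
        Cluster cluster = Cluster.connect("127.0.0.1", "user", "password");
        Bucket bucket = cluster.bucket("mybucket");
        Collection collection = bucket.defaultCollection();

        String lastKey = "";
        while (true) {
            // Keyset pagination over the primary index: fetch the next batch of document keys.
            QueryResult result = cluster.query(
                    "SELECT RAW META().id FROM mybucket WHERE META().id > $val ORDER BY META().id LIMIT 10000",
                    QueryOptions.queryOptions().parameters(JsonObject.create().put("val", lastKey)));
            List<String> keys = result.rowsAs(String.class);
            if (keys.isEmpty()) {
                break;
            }
            for (String id : keys) {
                // Sub-document mutation: adds field3 without fetching and rewriting the whole document.
                collection.mutateIn(id, Collections.singletonList(MutateInSpec.upsert("field3", "data3")));
            }
            lastKey = keys.get(keys.size() - 1);
        }
        cluster.disconnect();
    }
}

The answer suggests the reactive API; the synchronous calls above keep the sketch short, but the same mutateIn operation is available on the reactive collection for asynchronous batching.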
The Eventing Service can be quite performant for these sorts of enrichment tasks. Even a low-end system should be able to do 5M rows in under two (2) minutes.
// Note src_bkt is an alias to the source bucket for your handler
// in read+write mode, supported for version 6.5.1+; this uses DCP
// and can be 100X more performant than N1QL.
function OnUpdate(doc, meta) {
    // optional filter to be more selective
    // if (!doc.type || doc.type !== "mytype") return;

    // test if we already have the field we want to add
    if (doc.field3) return;
    doc.field3 = "data3";
    src_bkt[meta.id] = doc;
}
For more details on Eventing, refer to https://docs.couchbase.com/server/current/eventing/eventing-overview.html. I typically enrich 3/4 of a billion documents. The Eventing function will also run faster (enrich more documents per second) if you increase the number of workers in your Eventing function's settings from, say, 3 to 16, provided you have 8+ physical cores on your Eventing node.
I tested the above Eventing function and it enriches 5M documents (modeled on your example) on my non-MDS single node couchbase test system (12 cores at 2.2GHz) in just 72 seconds. Obviously if you have a real multi node cluster it will be faster (maybe all 5M docs in just 5 seconds).

Kafka Streams: Add a sequence to each message within a group of messages

Set Up
Kafka 2.5
Apache KStreams 2.4
Deployment to Openshift(Containerized)
Objective
Group a set of messages from a topic using a set of value attributes & assign a unique group identifier
-- This can be achieved by using selectKey and groupByKey
originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .groupByKey();

groupedStream.mapValues((k, v) -> {
    v.setGroupKey(k);
    return v;
});
For each message within a specific group , create a new message with an itemCount number as one of the attributes
e.g. A group with key "keypart1|keyPart2" can have 10 messages and each of the message should have an incremental id from 1 through 10.
Options considered: aggregate? Or count plus some additional StateStore-based implementation.
One of the options (that I listed above) can make use of a couple of state stores:
state store 1 -> mapping of each groupId to the individual items (KTable)
state store 2 -> count per groupId (KTable)
A join of these 2 tables would stamp a sequence on the messages as they get published to the final topic.
Other statistics:
The average number of messages per group would be in the 1000s, except for an outlier case where it can go up to 500k.
In general the candidates for a group should be made available on the source within a span of 15 minutes max.
The following points are of concern from an optimum-solution perspective:
I am still not clear how I would be able to stamp a sequence number on the messages unless some kind of state store is used for keeping track of the messages published within a group.
Use of KTables and state stores (either explicitly, or implicitly through the use of KTables) would add to the state store size considerably.
Given that the problem involves some kind of stateful processing, the state store can't be avoided, but any possible optimizations might be useful.
Any thoughts or references to similar patterns would be helpful.
You can use one state store with which you maintain the ID for each composite key. When you get a message you select a new composite key and then you lookup the next ID for the composite key in the state store. You stamp the message with the new ID that you just looked up. Finally, you increase the ID and write it back to the state store.
Code-wise, it would be something like:
// create the state store (keyed by the composite key, storing the next ID as a Long)
StoreBuilder<KeyValueStore<String, Long>> keyValueStoreBuilder = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("idMaintainer"),
        Serdes.String(),
        Serdes.Long()
);
// add the store to the builder
builder.addStateStore(keyValueStoreBuilder);

originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .repartition()
    .transformValues(() -> new ValueTransformer<V, NewValueType>() {
        // V and NewValueType are placeholders for your record types
        private KeyValueStore<String, Long> state;

        @Override
        public void init(ProcessorContext context) {
            state = (KeyValueStore<String, Long>) context.getStateStore("idMaintainer");
        }

        @Override
        public NewValueType transform(V value) {
            // your logic to:
            // - get the ID for the new composite key (it can be recomputed from the value's attributes),
            // - stamp the record,
            // - increase the ID,
            // - write the ID back to the state store,
            // - return the stamped record
        }

        @Override
        public void close() {
        }
    }, "idMaintainer")
    .to("output-topic");
You do not need to worry about concurrent access to the state store because in Kafka Streams same keys are processed by one single task and tasks do not share state stores. That means, your new composite keys with the same value will be processed by one single task that exclusively maintains the IDs for the composite keys in its state store.
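For illustration, a hedged sketch of what the transform method above could look like once filled in (the value type, its attribute1/attribute2 fields and the setItemCount setter are assumptions based on the question, not part of the original answer):

@Override
public V transform(V value) {
    // recompute the composite key; selectKey() derived it from the value itself
    String compositeKey = String.join("|", value.attribute1, value.attribute2);

    // look up the next ID for this group, defaulting to 1 for a new group
    Long nextId = state.get(compositeKey);
    if (nextId == null) {
        nextId = 1L;
    }

    // stamp the record with the ID (hypothetical setter for the itemCount attribute)
    value.setItemCount(nextId);

    // increase the ID and write it back to the state store
    state.put(compositeKey, nextId + 1);

    return value;
}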

Search by values in Redis cache - Secondary Indexing

I am new to Redis. I want to search by one or multiple values that come from an API.
E.g. let's say that I want to store some security data as below:
Value1
{
"isin":"isin123",
"id_bb_global":"BBg12345676",
"cusip":"cusip123",
"sedol":"sedol123",
"cpn":"0.09",
"cntry":"US",
"144A":"xyz",
"issue_cntry":"UK"
}
Value2
{
"isin":"isin222",
"id_bb_global":"BBG222",
"cusip":"cusip222",
"sedol":"sedol222",
"cpn":"1.0",
"cntry":"IN",
"144A":"Y",
"issue_cntry":"DE"
}
...
...
I want to search by cusip or cusip and id_bb_global, ISIN plus Exchange, or sedol.
E.g. a search query like {"isin":"isin222", "cusip":"cusip222"} should return all matching data sets.
What is the best way to store this kind of data structure in Redis, and what is the API to retrieve it quickly?
When you insert data, you can create sets to maintain the index.
{
"isin":"isin123",
"id_bb_global":"BBg12345676",
"cusip":"cusip123",
"sedol":"sedol123",
"cpn":"0.09",
"cntry":"US",
"144A":"xyz",
"issue_cntry":"UK"
}
For example, for the above data, if you want to filter by isin and cusip, you can create the respective sets isin:isin123 and cusip:cusip123 and add that item's id to both of those sets.
Later on, if you want to find the items that are in both isin:isin123 and cusip:cusip123, you just have to run SINTER on those 2 sets.
Or, if you want to find the items that are in either isin:isin123 OR cusip:cusip123, you can union them with SUNION.
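A minimal sketch of that pattern, using the Jedis client for consistency with the other Java snippets here (the record key sec:1 and the exact set naming are assumptions, not a fixed convention):

import redis.clients.jedis.Jedis;
import java.util.Set;

public class RedisSecondaryIndexExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // store the record itself as a hash
            jedis.hset("sec:1", "isin", "isin123");
            jedis.hset("sec:1", "cusip", "cusip123");
            jedis.hset("sec:1", "sedol", "sedol123");

            // maintain one set per indexed value, each containing the record keys
            jedis.sadd("isin:isin123", "sec:1");
            jedis.sadd("cusip:cusip123", "sec:1");
            jedis.sadd("sedol:sedol123", "sec:1");

            // AND query: record keys matching both isin and cusip (SINTER)
            Set<String> both = jedis.sinter("isin:isin123", "cusip:cusip123");

            // OR query: record keys matching either isin or cusip (SUNION)
            Set<String> either = jedis.sunion("isin:isin123", "cusip:cusip123");

            System.out.println("AND -> " + both + ", OR -> " + either);
        }
    }
}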

Why Mutation does not insert values for existing columns

I am loading initial data (a URL list for a crawler) into Cassandra with the status crawled=0. Then, using Hadoop, I crawl all the links and try to change crawled from 0 to something else, for example 1, 2, or 3. When I check in the Cassandra CLI with get ColumnFamily['www.somedomain.com'], the value of the crawled column remains the same. If during the initial import I have not mentioned the crawled column, it is added correctly. This is only one part of the algorithm, and I need further updates of this column from other Map/Reduce jobs, etc.
In the Thrift and Cassandra API it is said that we have only inserts and deletions, and an insert should work as an update.
The crawled column has the UTF8 type.
The Mutation code is like this:
private static Mutation getMutationCrawled(Text crawledVal)
{
    Text column = new Text();
    column.set("crawled");

    Column c = new Column();
    c.setName(ByteBuffer.wrap(Arrays.copyOf(column.getBytes(), column.getLength())));
    c.setValue(ByteBuffer.wrap(crawledVal.getBytes()));
    c.setTimestamp(System.currentTimeMillis());

    Mutation m = new Mutation();
    m.setColumn_or_supercolumn(new ColumnOrSuperColumn());
    m.column_or_supercolumn.setColumn(c);

    return m;
}
Cassandra resolves conflicts using the timestamp of the mutation, with the largest timestamp winning. You can set the timestamp value to whatever you want, but the convention is to set the timestamp as a value in microseconds. In the example above, you set the timestamp with:
c.setTimestamp(System.currentTimeMillis());
Most likely the initial import code that populates the values is setting the timestamp in microseconds. The microsecond timestamp values are larger than the millisecond timestamp values, so your updates are being ignored.
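A minimal sketch of the corresponding fix, assuming the importer does indeed write microsecond timestamps: give the update a microsecond-based timestamp as well, so it is larger than the one written at import time and wins conflict resolution.

// use microseconds instead of milliseconds when building the Column above
c.setTimestamp(System.currentTimeMillis() * 1000);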
