I was reading about Kafka Connect (JDBC source) and I couldn't wrap my head around the actual need for timestamp+incrementing mode. Timestamp mode alone can already capture updates and inserts once we have an updatedAt timestamp column, which might not necessarily be unique but will always increase with updates/inserts.
I read somewhere that timestamp+incrementing provides a globally unique ID for streaming, but I am not really sure how that ID would be used if I am dumping the data into Kafka and then from Kafka to S3. I also read that an incrementing column is needed because of some early-shutdown scenario, but again I am not sure whether that is an internal Kafka Connect concern.
Here is the code I found on GitHub for the implementation of timestamp+incrementing:
protected void timestampIncrementingWhereClause(ExpressionBuilder builder) {
  // This version combines two possible conditions. The first checks timestamp == last
  // timestamp and incrementing > last incrementing. The timestamp alone would include
  // duplicates, but adding the incrementing condition ensures no duplicates, e.g. you would
  // get only the row with id = 23:
  //   timestamp 1234, id 22 <- last
  //   timestamp 1234, id 23
  // The second check only uses the timestamp > last timestamp. This covers everything new,
  // even if it is an update of the existing row. If we previously had:
  //   timestamp 1234, id 22 <- last
  // and then these rows were written:
  //   timestamp 1235, id 22
  //   timestamp 1236, id 23
  // We should capture both id = 22 (an update) and id = 23 (a new row)
  builder.append(" WHERE ");
  coalesceTimestampColumns(builder);
  builder.append(" < ? AND ((");
  coalesceTimestampColumns(builder);
  builder.append(" = ? AND ");
  builder.append(incrementingColumn);
  builder.append(" > ?");
  builder.append(") OR ");
  coalesceTimestampColumns(builder);
  builder.append(" > ?)");
  builder.append(" ORDER BY ");
  coalesceTimestampColumns(builder);
  builder.append(",");
  builder.append(incrementingColumn);
  builder.append(" ASC");
}
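For a single timestamp column updatedAt and an incrementing column id (illustrative names, not from the connector itself), the clause it builds looks roughly like WHERE updatedAt < ? AND ((updatedAt = ? AND id > ?) OR updatedAt > ?) ORDER BY updatedAt, id ASC, where the first parameter is an upper bound on the timestamp and the last two are the last-seen timestamp and incrementing value from the stored offset.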
Thanks in advance for the help!
mode=timestamp+incrementing is not "required". There are other options.
"read somewhere that..."
The documentation, perhaps? There you'd also see that this mode is not required.
As you found, it is considered a unique ID that prevents reading the same rows again upon restarts. It helps determine, when there are duplicate timestamps, which rows might have already been read.
S3 is irrelevant to the question and to how the JDBC source connector works.
Set Up
Kafka 2.5
Apache Kafka Streams 2.4
Deployment to OpenShift (containerized)
Objective
Group a set of messages from a topic using a set of value attributes & assign a unique group identifier
-- This can be achieved by using selectKey and groupByKey
originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .groupByKey();

groupedStream.mapValues((k, v) -> {
    v.setGroupKey(k);
    return v;
});
For each message within a specific group, create a new message with an itemCount number as one of the attributes.
e.g. a group with key "keypart1|keyPart2" can have 10 messages, and each message should get an incremental id from 1 through 10.
aggregate?
count plus some additional StateStore-based implementation?
One of the options listed above could make use of a couple of state stores:
state store 1 -> mapping of each groupId to the individual item (KTable)
state store 2 -> count per groupId (KTable)
A join of these two tables would stamp a sequence on each message as it gets published to the final topic.
Other statistics:
The average number of messages per group would be in the low thousands, except for an outlier case where it can go up to 500k.
In general, all candidates for a group should arrive at the source within a span of 15 minutes at most.
The following points are of concern from an optimal-solution perspective:
I am still not clear how I would be able to stamp a sequence number on the messages unless some kind of state store is used to keep track of the messages published within a group.
Use of KTables and state stores (either explicitly, or implicitly through the use of a KTable) would add to the state store size considerably.
Given that the problem involves some kind of stateful processing, the state store can't be avoided, but any possible optimizations would be useful.
Any thoughts or references to similar patterns would be helpful.
You can use one state store with which you maintain the ID for each composite key. When you get a message, you select the new composite key and then look up the next ID for that composite key in the state store. You stamp the message with the ID you just looked up. Finally, you increase the ID and write it back to the state store.
Code-wise, it would be something like:
// create state store
StoreBuilder<KeyValueStore<String, Long>> keyValueStoreBuilder = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("idMaintainer"),
    Serdes.String(),
    Serdes.Long()
);
// add store
builder.addStateStore(keyValueStoreBuilder);
originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .repartition() // requires Kafka Streams 2.6+; on older versions, route through an explicit topic with through()
    .transformValues(() -> new ValueTransformer<V, NewValueType>() {

        private KeyValueStore<String, Long> state;

        @Override
        public void init(ProcessorContext context) {
            state = (KeyValueStore<String, Long>) context.getStateStore("idMaintainer");
        }

        @Override
        public NewValueType transform(V value) {
            // your logic to:
            // - get the ID for the new composite key,
            // - stamp the record
            // - increase the ID
            // - write the ID back to the state store
            // - return the stamped record
        }

        @Override
        public void close() {
        }
    }, "idMaintainer")
    .to("output-topic");
You do not need to worry about concurrent access to the state store, because in Kafka Streams the same key is always processed by a single task, and tasks do not share state stores. That means records with the same composite key will be processed by one single task, which exclusively maintains the IDs for those composite keys in its state store.
I am sending JSON messages containing details about a web service request and response to a Kafka topic. I want to process each message as it arrives in Kafka using Kafka Streams and send the results as a continuously updated summary (JSON message) to a websocket to which a client is connected.
The client will then parse the JSON and display the various counts/summaries on a web page.
Sample input messages are as below
{
"reqrespid":"048df165-71c2-429c-9466-365ad057eacd",
"reqDate":"30-Aug-2017",
"dId":"B198693",
"resp_UID":"N",
"resp_errorcode":"T0001",
"resp_errormsg":"Unable to retrieve id details. DB Procedure error",
"timeTaken":11,
"timeTakenStr":"[0 minutes], [0 seconds], [11 milli-seconds]",
"invocation_result":"T"
}
{
"reqrespid":"f449af2d-1f8e-46bd-bfda-1fe0feea7140",
"reqDate":"30-Aug-2017",
"dId":"G335887",
"resp_UID":"Y",
"resp_errorcode":"N/A",
"resp_errormsg":"N/A",
"timeTaken":23,
"timeTakenStr":"[0 minutes], [0 seconds], [23 milli-seconds]",
"invocation_result":"S"
}
{
"reqrespid":"e71b802d-e78b-4dcd-b100-fb5f542ea2e2",
"reqDate":"30-Aug-2017",
"dId":"X205014",
"resp_UID":"Y",
"resp_errorcode":"N/A",
"resp_errormsg":"N/A",
"timeTaken":18,
"timeTakenStr":"[0 minutes], [0 seconds], [18 milli-seconds]",
"invocation_result":"S"
}
As the stream of messages comes into Kafka, I want to be able to compute on the fly:
total number of requests, i.e. a count of all messages
total number of requests with invocation_result equal to 'S'
total number of requests with invocation_result not equal to 'S'
total number of requests with invocation_result equal to 'S' and UID equal to 'Y'
total number of requests with invocation_result equal to 'S' and UID equal to 'N'
minimum time taken, i.e. min(timeTaken)
maximum time taken, i.e. max(timeTaken)
average time taken, i.e. avg(timeTaken)
and write them out to a KStream with the new key set to the reqDate value and the new value a JSON message that contains the computed values, as shown below for the 3 messages above:
{
"total_cnt":3, "num_succ":2, "num_fail":1, "num_succ_data":2,
"num_succ_nodata":0, "num_fail_biz":0, "num_fail_tech":1,
"min_timeTaken":11, "max_timeTaken":23, "avg_timeTaken":17.3
}
I am new to Kafka Streams. How do I do the multiple counts, over differing columns, all in one step or as a chain of different steps? Would Apache Flink or Calcite be more appropriate? My understanding of a KTable suggests that you can only have a key, e.g. 30-AUG-2017, and a single column value, e.g. a count, say 3. I need a resulting table structure with one key and multiple count values.
All help is very much appreciated.
You can just do a complex aggregation step that computes all those at once. I am just sketching the idea:
class AggResult {
    long total_cnt = 0;
    long num_succ = 0;
    // and many more
}
stream.groupBy(...).aggregate(
    new Initializer<AggResult>() {
        @Override
        public AggResult apply() {
            return new AggResult();
        }
    },
    new Aggregator<KeyType, JSON, AggResult>() {
        @Override
        public AggResult apply(KeyType key, JSON value, AggResult aggregate) {
            ++aggregate.total_cnt;
            if (value.get("invocation_result").equals("S")) {
                ++aggregate.num_succ;
            }
            // add more conditions to get all the other aggregate results
            return aggregate;
        }
    },
    // other parameters omitted for brevity
)
.toStream()
.to("result-topic");
I have data for which I need to compare the month of each record: if it is the previous month, the record should be inserted, otherwise not.
Example:
23.12.2016 12:02:23,Koji,24
22.01.2016 01:21:22,Mahi,24
Now I need to get the first column of the data (23.12.2016 12:02:23) and then get its month (12).
Compare that with the month before the current month, like this:
if the current month is 'JAN_2017', then the month before 'JAN_2017' is 'Dec_2016'.
For the first row,
compare this 'Dec_2016' [month before] with the month of the data, 'Dec_2016' [from 23.12.2016].
If it matches, then insert into the database.
EDIT 1:
I have already tried your suggestion:
"UpdateAttribute to add a new attribute with the previous month value, and then RouteOnAttribute to determine if the flowfile should be inserted"
I have used the Expression Language below in RouteOnAttribute:
${literal('Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec'):getDelimitedField(${csv.1:toDate('dd.MM.yyyy hh:mm:ss'):format('MM')}):equals(${literal('Dec,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov'):getDelimitedField(${now():toDate(' Z MM dd HH:mm:ss.SSS yyyy'):format('MM'):toNumber()})})}
but it fails on data like the following:
23.12.2015,Andy,21
23.12.2017,Present,32
My data may contain some past years and future years.
Such rows still match my expression, so they also get inserted.
I need to check the month together with the year in the data.
How can I check it?
The easiest answer is to use the ExecuteScript processor with simple date logic (this will allow you to use the Groovy/Java date framework to correctly handle things like leap years, time zones, etc.).
If you really don't want to do that, you could probably use a regex and Expression Language in UpdateAttribute to add a new attribute with the previous month value, and then RouteOnAttribute to determine if the flowfile should be inserted into the database.
Here's a simple Groovy test demonstrating the logic. You'll need to add the code to process the session, flowfile, etc.
@Test
public void testScriptShouldFindPreviousMonth() throws Exception {
    // Arrange
    def input = ["23.12.2016 12:02:23,Koji,24", "22.01.2016 01:21:22,Mahi,24"]
    def EXPECTED = ["NOV_2016", "DEC_2015"]

    // Act
    input.eachWithIndex { String data, int i ->
        Calendar calendar = Date.parse("dd.MM.yyyy", data.tokenize(" ")[0]).toCalendar()
        calendar.add(Calendar.MONTH, -1)
        String result = calendar.format("MMM_yyyy").toUpperCase()

        // Assert
        assert result == EXPECTED[i]
    }
}
I am loading initial data (a URL list for a crawler) into Cassandra with status crawled=0. Then, using Hadoop, I crawl all the links and try to change crawled from 0 to something else, for example 1, 2, or 3. When I check in the Cassandra CLI with get ColumnFamily['www.somedomain.com'], the value of the crawled column remains the same. If I do not set the crawled column during the initial import, it is added correctly. This is only one part of the algorithm, and I need further updates of this column from other Map/Reduce jobs, etc.
The Thrift and Cassandra APIs say that we have only inserts and deletions, and that an insert should work as an update.
The crawled column has the UTF8 type.
Mutation class is like this:
private static Mutation getMutationCrawled(Text crawledVal)
{
    Text column = new Text();
    column.set("crawled");

    Column c = new Column();
    c.setName(ByteBuffer.wrap(Arrays.copyOf(column.getBytes(), column.getLength())));
    c.setValue(ByteBuffer.wrap(crawledVal.getBytes()));
    c.setTimestamp(System.currentTimeMillis());

    Mutation m = new Mutation();
    m.setColumn_or_supercolumn(new ColumnOrSuperColumn());
    m.column_or_supercolumn.setColumn(c);
    return m;
}
Cassandra resolves conflicts using the timestamp of the mutation, with the largest timestamp winning. You can set the timestamp value to whatever you want, but the convention is to set the timestamp as a value in microseconds. In the example above, you set the timestamp with:
c.setTimestamp(System.currentTimeMillis());
Most likely the initial import code that populates the values is setting the timestamp in microseconds. Microsecond timestamp values are larger than millisecond timestamp values, so your updates are being ignored.
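If that is the case, a minimal fix is to issue your mutation timestamps in microseconds as well, for example:

// Express the mutation timestamp in microseconds so it compares correctly
// against the (microsecond) timestamps written by the initial import.
c.setTimestamp(System.currentTimeMillis() * 1000);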
Is there an efficient way to delete multiple rows in HBase, or does my use case smell like it is not suitable for HBase?
There is a table say 'chart', which contains items that are in charts. Row keys are in the following format:
chart|date_reversed|ranked_attribute_value_reversed|content_id
Sometimes I want to regenerate the chart for a given date, so I want to delete all rows starting from 'chart|date_reversed_1' up to 'chart|date_reversed_2'. Is there a better way than issuing a Delete for each row found by a Scan? All the rows to be deleted are going to be close to each other.
I need to delete the rows because I don't want one item (one content_id) to have multiple entries, which it would have if its ranked_attribute_value has changed (that change is the reason the chart needs to be regenerated).
Being an HBase beginner, I might be misusing rows for something that columns would handle better; if you have design suggestions, cool! Or maybe the charts are better generated into a file (e.g. no HBase for output)? I'm using MapReduce.
Firstly, on the point of range deletes: there is no range delete in HBase yet, AFAIK. But there is a way to delete more than one row at a time through the HTableInterface API: simply build a Delete object for each row key returned by the scan, collect them in a List, and pass the list to a single delete call, done! To make the scan faster, do not include any column family data in the scan result, since all you need is the row key for deleting whole rows.
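A minimal sketch of that approach, assuming an old-style HTableInterface handle and placeholder start/stop keys for the 'chart|date_reversed' range:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

// Scan only the row keys in the range and delete them in one batched call.
// 'table', 'startKey' and 'stopKey' are placeholders for your HTableInterface
// and the serialized 'chart|date_reversed_1' / 'chart|date_reversed_2' boundaries.
void deleteRange(HTableInterface table, byte[] startKey, byte[] stopKey) throws Exception {
    Scan scan = new Scan(startKey, stopKey);
    scan.setFilter(new FirstKeyOnlyFilter()); // return only one cell per row; we just need the row key
    List<Delete> deletes = new ArrayList<Delete>();
    ResultScanner scanner = table.getScanner(scan);
    try {
        for (Result result : scanner) {
            deletes.add(new Delete(result.getRow()));
        }
    } finally {
        scanner.close();
    }
    if (!deletes.isEmpty()) {
        table.delete(deletes); // one batched call instead of one RPC per Delete
    }
}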
Secondly, about the design. My understanding of the requirement is: there are contents with content IDs, each content has charts generated against it, and those data are stored; there can be multiple charts per content across dates, depending on the rank. In addition, we want the most recently generated content's chart to show at the top of the table.
For my reading of the requirement I would suggest using three tables: auto_id, content_charts and generated_order. The row key for content_charts would be the content id, and the row key for generated_order would be a long that is auto-decremented using the HTableInterface API. For decrementing, use '-1' as the increment amount, and initialize the value to Long.MAX_VALUE in the auto_id table at the first start-up of the app (or manually). Now, if you want to regenerate the chart data, simply clear the column family using a delete, put back the new data, and then put a row into the generated_order table. This way the latest insertion will also be at the top of generated_order, which holds the content id as a cell value. If you want to ensure generated_order has only one entry per content, save the generated_order id first, store that value in content_charts when putting, and before deleting the column family first delete the corresponding row from generated_order. This way you can look up the charts for a content using at most 2 gets, and no scan is required for the charts.
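For the auto-decremented key, a rough sketch (autoIdTable is an HTableInterface; the row, family and qualifier names here are placeholders, not something from your schema):

// Atomically decrement the counter kept in the auto_id table and use the returned
// value as the row key for generated_order. Starting from Long.MAX_VALUE and
// decrementing means newer charts sort first.
long nextOrderKey = autoIdTable.incrementColumnValue(
        Bytes.toBytes("counter"),   // row that holds the counter
        Bytes.toBytes("meta"),      // column family
        Bytes.toBytes("next_id"),   // qualifier
        -1L);                       // offset amount: decrement by 1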
I hope this is helpful.
You can use the BulkDeleteProtocol, which uses a Scan that defines the relevant range (start row, end row, filters); see the BulkDeleteEndpoint coprocessor example that ships with HBase.
I ran into the same situation, and this is my code to implement what you want:
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("Family"));
scan.setStartRow(structuredKeyMaker.key(startDate));
scan.setStopRow(structuredKeyMaker.key(endDate + 1));

try {
    ResultScanner scanner = table.getScanner(scan);
    // EntityIteratorWrapper is a simple iterator of mine that maps rows to my entity type, not so important!
    Iterator<Entity> entityIterator = new EntityIteratorWrapper(scanner.iterator(), EntityMapper.create());

    List<Delete> deletes = new ArrayList<Delete>();
    int bufferSize = 10000000; // this is needed so I don't run out of memory, as I have a huge amount of data! a simple in-memory buffer
    int counter = 0;

    while (entityIterator.hasNext()) {
        if (counter < bufferSize) {
            // the key maker is used to extract the key as byte[] from my entity
            deletes.add(new Delete(KeyMaker.key(entityIterator.next())));
            counter++;
        } else {
            table.delete(deletes);
            deletes.clear();
            counter = 0;
        }
    }

    if (deletes.size() > 0) {
        table.delete(deletes);
        deletes.clear();
    }
} catch (IOException e) {
    e.printStackTrace();
}