Getting duplicates with NiFi HBase_1_1_2_ClientMapCacheService - apache-nifi

I need to remove duplicates from a flow I've developed; it can receive the same ${filename} multiple times. I tried using HBase_1_1_2_ClientMapCacheService with DetectDuplicate (I am using NiFi v1.4), but found that it lets a few duplicates through. If I use DistributedMapCache (ClientService and Server), I do not get any duplicates. Why would I receive some duplicates with the HBase cache?
As a test, I listed a directory (ListSFTP) with 20,000 files on all cluster nodes (4 nodes) and passed the listing to DetectDuplicate (using the HBase cache service). It routed 20,020 flowfiles to "non-duplicate", and interestingly the table actually has 20,000 rows.

Unfortunately I think this is due to a limitation in the operations that are offered by HBase.
The DetectDuplicate processor relies on an operation, getAndPutIfAbsent, which is expected to return the original value and then set the new value if it wasn't there. For example, the first time through it would return null and set the new value, indicating the entry wasn't a duplicate.
HBase doesn't natively support this operation, so the implementation of this method in the HBase map cache client does this:
V got = get(key, keySerializer, valueDeserializer);                          // read the current value
boolean wasAbsent = putIfAbsent(key, value, keySerializer, valueSerializer); // write only if missing
if (!wasAbsent) return got;   // key already existed: report the earlier value
else return null;             // key was absent: report "not a duplicate"
Because these are two separate calls, there is a possible race condition...
Imagine node 1 executes the first line and gets null, but before it reaches the second line, node 2 performs both the get and the putIfAbsent. When node 1 then calls putIfAbsent it gets false, because node 2 just populated the cache, so node 1 returns the null value from its original get. Both nodes therefore look like non-duplicates to DetectDuplicate.
The DistributedMapCacheServer, by contrast, locks the entire cache per operation, so it can provide an atomic getAndPutIfAbsent.
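For contrast, here is a minimal single-JVM sketch (not NiFi's actual code) of what an atomic getAndPutIfAbsent looks like when the check and the write happen as one operation, which is exactly what the two separate HBase calls above cannot give you:
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the NiFi implementation: putIfAbsent here is a single
// atomic operation, so no other caller can slip in between the "get" and the "put".
public class AtomicMapCacheSketch<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();

    // Returns the existing value (a duplicate) or null (first time this key is seen).
    public V getAndPutIfAbsent(K key, V value) {
        return cache.putIfAbsent(key, value);
    }
}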

Related

Concurrency in Redis for flash sale in distributed system

I am going to build a system for a flash sale which will share the same Redis instance and will run on 15 servers at a time.
So the flash-sale algorithm will be:
Set the max inventory for any product id in Redis using redisTemplate.opsForValue().set(key, 400L);
For every request:
get the current inventory using Long val = redisTemplate.opsForValue().get(key);
check whether it is non-zero:
if (val == null || val == 0) {
    System.out.println("not taking order....");
} else {
    // put the order in Kafka
    // and decrement using redisTemplate.opsForValue().decrement(key)
}
But the problem here is concurrency:
If I set the inventory to 400 and test it with 500 request threads, the inventory becomes negative.
If I make the method synchronized, I cannot manage it across distributed servers.
So what would be the best approach to this?
Note: I cannot go to an RDBMS and set an isolation level because of the high request count.
Redis is single-threaded, so running a Lua script on it is always atomic.
You can therefore define a Lua script on your Redis instance and run it from your Spring instances.
The Lua script is just a sequence of operations executed against your Redis instance (the only place that holds the correct stock value); it returns the new value, or an error if the value would become negative.
The Lua script is essentially a Redis transaction. There are other ways to achieve a Redis transaction, but IMHO Lua is the simplest of them all (maybe the least performant, but I have found that in most cases it is fast enough).
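For example, a minimal sketch of that approach with Spring Data Redis; the key handling and the choice to return -1 (instead of raising an error) when the stock is exhausted are my own assumptions:
import java.util.Collections;

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;

public class FlashSaleStock {

    // Runs atomically inside Redis: decrement only while the stock is positive, otherwise return -1.
    private static final String DECR_IF_POSITIVE =
            "local stock = tonumber(redis.call('GET', KEYS[1]) or '0') " +
            "if stock <= 0 then return -1 end " +
            "return redis.call('DECR', KEYS[1])";

    private final StringRedisTemplate redisTemplate;

    public FlashSaleStock(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    // Returns the remaining stock after reserving one unit, or -1 if the sale is sold out.
    public long tryReserve(String key) {
        DefaultRedisScript<Long> script = new DefaultRedisScript<>(DECR_IF_POSITIVE, Long.class);
        Long remaining = redisTemplate.execute(script, Collections.singletonList(key));
        return remaining == null ? -1L : remaining;
    }
}
Because the whole script executes atomically inside Redis, two of your 15 servers can never both consume the last unit; only requests for which tryReserve returns a value >= 0 should put an order into Kafka.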

Neo4j cypher query for updating nodes taking a long time

I have a graph with the following:
(:Customer)-[:MATCHES]->(:Customer)
This is not limited to two customers but the depth does not exceed 5. The following also exists:
(:Customer)-[:MATCHES]->(:AuditCustomer)
There is a label :Master that needs to be added to one customer node in a matching set depending on some criteria.
My query therefore needs to find all sets of matching customers where none has the :Master label, and add that label to one node in the set. There are a large number of customer nodes, and doing it all in one go causes the database to become very slow.
I have attempted to do it using apoc.periodic.commit:
CALL apoc.periodic.commit("MATCH (c:Customer)
WHERE NOT c:Master AND NOT (c)-[:MATCHES*0..5]-(:Master) WITH c limit {limit}
CALL apoc.path.expand(c, 'MATCHES', '+Customer', 0, -1)
YIELD path UNWIND NODES(path) AS nodes WITH c,nodes
order by nodes:Searchable desc, nodes.createdTimestamp asc
with c, head(collect(distinct nodes)) as collectedNodes
set collectedNodes:Master
return count(collectedNodes)", {limit:100})
However this still causes the database to become very slow even if the limit parameter is set to 1. I read that apoc.periodic.commit is a blocking process so this may be causing a problem. Is there a way to do this that is not so resource intensive and the DB can continue to process other transactions while this is running?
The slowest part of the query is the initial match:
MATCH (c:Customer)
WHERE NOT c:Master AND NOT (c)-[:MATCHES*0..5]-(:Master) WITH c limit {limit}
This takes around 3.5s, and the entire query takes about 4s. There is very little difference between limit 1 and limit 20. Maybe there is a way to rewrite this so it is faster; that might be a better approach?
Also, if it is of any use, the following returns ~70K:
MATCH (c:Customer)
WHERE NOT c:Master AND NOT (c)-[:MATCHES*0..5]-(:Master)
RETURN COUNT(c)

Redis pipeline, dealing with cache misses

I'm trying to figure out the best way to implement Redis pipelining. We use redis as a cache on top of MySQL to store user data, product listings, etc.
I'm using this as a starting point: https://joshtronic.com/2014/06/08/how-to-pipeline-with-phpredis/
My question is this: assume you have an array of ids, properly sorted, and you loop through the Redis pipeline like this:
$redis = new Redis();
// Opens up the pipeline
$pipe = $redis->multi(Redis::PIPELINE);
// Loops through the data and performs actions
foreach ($users as $user_id => $username)
{
    // Increment the number of times the user record has been accessed
    $pipe->incr('accessed:' . $user_id);
    // Pulls the user record
    $pipe->get('user:' . $user_id);
}
// Executes all of the commands in one shot
$users = $pipe->exec();
What happens when $pipe->get('user:' . $user_id); is not available, because it hasn't been requested before or has been evicted by Redis, etc.? Assuming it's result #13 of 50, how do we a) find out that we weren't able to retrieve that object and b) keep the array of users properly sorted?
Thank you
I will answer the question in terms of the Redis protocol; how it works in a particular language binding is more or less the same.
First of all, let's check how Redis pipeline works:
It is just a way to send multiple commands to the server, execute them, and get back multiple replies. There is nothing special about it: you simply get an array with a reply for each command in the pipeline.
Pipelines are much faster because the round-trip time per command is saved, i.e. for 100 commands there is only one round trip instead of 100. In addition, Redis executes commands one at a time, so 100 separate commands potentially have to wait 100 times for Redis to pick each single command, whereas a pipeline is treated as one long command and thus only has to wait to be picked up once.
You can read more about pipelining here: https://redis.io/topics/pipelining. One more note: because each pipelined batch runs uninterrupted (in terms of Redis), it makes sense to send these commands in manageable chunks, i.e. don't send 100k commands in a single pipeline, which might block Redis for a long period of time; split them into chunks of 1k or 10k commands, as in the sketch below.
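For illustration only, a minimal sketch of that chunking advice in Java with Jedis (the key list and chunk size are assumptions; phpredis can be chunked the same way):
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class ChunkedPipeline {
    // Issues GETs for a large key list in pipelines of at most 1,000 commands,
    // so no single pipeline keeps Redis busy for too long.
    public static void fetchInChunks(Jedis jedis, List<String> keys) {
        final int chunkSize = 1000;
        for (int start = 0; start < keys.size(); start += chunkSize) {
            List<String> chunk = keys.subList(start, Math.min(start + chunkSize, keys.size()));
            Pipeline pipe = jedis.pipelined();
            for (String key : chunk) {
                pipe.get(key);
            }
            pipe.sync(); // one round trip per chunk
        }
    }
}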
In your case you run in the loop the following fragment:
// Increment the number of times the user record has been accessed
$pipe->incr('accessed:' . $user_id);
// Pulls the user record
$pipe->get('user:' . $user_id);
The question is what ends up in the pipeline. Let's say you update data for the user ids u1, u2, u3 and u4. The pipeline of Redis commands will then look like:
INCR accessed:u1
GET user:u1
INCR accessed:u2
GET user:u2
INCR accessed:u3
GET user:u3
INCR accessed:u4
GET user:u4
Let's say:
u1 was accessed 100 times before,
u2 was accessed 5 times before,
u3 was not accessed before and
u4 and its accompanying data do not exist.
The result in that case will be an array of Redis replies containing:
101
u1 string data stored at user:u1
6
u2 string data stored at user:u2
1
u3 string data stored at user:u3
1
NIL
As you can see, Redis treats the missing key for INCR as 0 and increments from there. Finally, nothing is sorted by Redis; the results come back in exactly the order in which they were requested.
The language binding, i.e. the Redis driver, just parses that protocol for you and gives you a view of the parsed data. Without preserving the order of commands it would be impossible for the Redis driver to work correctly, or for you as a programmer to deduce anything. Just keep in mind that the request is not repeated in the reply, i.e. you will not receive the key for u1 or u2 when doing a GET, just the data for that key. Your implementation must therefore remember that position 1 (zero-based index) holds the result of the GET for u1.
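To make that concrete, here is a minimal sketch in Java with Jedis, purely for illustration (phpredis behaves the same way): replies come back strictly in request order, and a cache miss simply shows up as null at that position, so the original id list keeps everything lined up.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;

public class UserPipeline {
    public static Map<String, String> fetchUsers(Jedis jedis, List<String> userIds) {
        Pipeline pipe = jedis.pipelined();
        List<Response<String>> userReplies = new ArrayList<>();
        for (String id : userIds) {
            pipe.incr("accessed:" + id);             // access counter reply, not collected here
            userReplies.add(pipe.get("user:" + id)); // user record reply, kept in request order
        }
        pipe.sync(); // one round trip: flush all commands and read all replies

        Map<String, String> users = new LinkedHashMap<>();
        for (int i = 0; i < userIds.size(); i++) {
            String data = userReplies.get(i).get();  // null on a cache miss
            if (data == null) {
                // Cache miss: fall back to MySQL for userIds.get(i), then re-populate Redis.
            } else {
                users.put(userIds.get(i), data);
            }
        }
        return users;
    }
}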

Spark RDD Lineage and Storage

inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
badLinesCount = badLinesRDD.count()
warningCount = warningsRDD.count()
In the code above, none of the transformations are evaluated until the second-to-last line is executed, where you count the number of objects in badLinesRDD. So when badLinesRDD.count() runs, it computes the first four RDDs up to and including the union and returns the result. But when warningsRDD.count() runs, it only computes the transformations from the top three lines and returns the result, correct?
Also, when these RDD transformations are computed because an action is called on them, where are the objects from the last RDD transformation (the union) stored? Do they get stored in memory on each of the DataNodes where the filter transformation was run in parallel?
Unless task output is persisted explicitly (cache or persist, for example) or implicitly (shuffle write), and there is enough free space, every action will execute the complete lineage.
So when you call warningsRDD.count() it will load the file (sc.textFile("log.txt")) and filter (inputRDD.filter(lambda x: "warning" in x)).
Also, when these RDD transformations are computed because an action is called on them, where are the objects from the last RDD transformation (the union) stored?
Assuming the data is not persisted, nowhere. All task outputs are discarded after the data is passed to the next stage or to the output. If the data is persisted, where it lives depends on the storage settings (disk, on-heap, off-heap, DFS).
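If you do want warningsRDD reused instead of recomputed by both actions, persist it before calling the actions. A minimal sketch (written in Java here, mirroring the Python code above; cache() keeps the filtered partitions in executor memory once the first action materializes them):
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LineageExample {
    public static void countBadLines(JavaSparkContext sc) {
        JavaRDD<String> input = sc.textFile("log.txt");
        JavaRDD<String> errors = input.filter(line -> line.contains("error"));
        JavaRDD<String> warnings = input.filter(line -> line.contains("warning"));
        warnings.cache(); // ask Spark to keep the filtered partitions in executor memory

        long badLinesCount = errors.union(warnings).count(); // computes and caches warnings
        long warningCount = warnings.count();                // reuses the cached partitions
        System.out.println(badLinesCount + " bad lines, " + warningCount + " warnings");
    }
}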

Spark Streaming and ElasticSearch - Could not write all entries

I'm currently writing a Scala application made of a Producer and a Consumer. The producer gets some data from an external source and writes it to Kafka. The consumer reads from Kafka and writes to Elasticsearch.
The consumer is based on Spark Streaming, and every 5 seconds it fetches new messages from Kafka and writes them to Elasticsearch. The problem is that I'm not able to write to ES, because I get a lot of errors like the one below:
[ERROR] [2015-04-24 11:21:14,734] [org.apache.spark.TaskContextImpl]:
Error in TaskCompletionListener
org.elasticsearch.hadoop.EsHadoopException: Could not write all
entries [3/26560] (maybe ES was overloaded?). Bailing out... at
org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:225)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.apache.spark.TaskContextImpl$$anon$2.onTaskCompletion(TaskContextImpl.scala:57)
~[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
[spark-core_2.10-1.2.1.jar:1.2.1] at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[na:na] at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[na:na] at
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.scheduler.Task.run(Task.scala:58)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
[spark-core_2.10-1.2.1.jar:1.2.1] at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_65] at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_65] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
Consider that the producer is writing 6 messages every 15 seconds, so I really don't understand how this "overload" can possibly happen (I even cleaned the topic and flushed all the old messages; I thought it was related to an offset issue). The task executed by Spark Streaming every 5 seconds can be summarized by the following code:
val result = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Map("wasp.raw" -> 1), StorageLevel.MEMORY_ONLY_SER_2)
val convertedResult = result.map(k => (k._1 ,AvroToJsonUtil.avroToJson(k._2)))
//TO-DO : Remove resource (yahoo/yahoo) hardcoded parameter
log.info(s"*** EXECUTING SPARK STREAMING TASK + ${java.lang.System.currentTimeMillis()}***")
convertedResult.foreachRDD(rdd => {
  rdd.map(data => data._2).saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))
})
If I try to print the messages instead of sending to ES, everything is fine and I actually see only 6 messages. Why can't I write to ES?
For the sake of completeness, I'm using this library to write to ES : elasticsearch-spark_2.10 with the latest beta version.
After many retries, I found a way to write to Elasticsearch without getting any errors. Basically, passing the parameter "es.batch.size.entries" -> "1" to the saveToEs method solved the problem. I don't understand why the default (or any other) batch size leads to the aforementioned error, considering that I would expect an error message if I were trying to write more than the allowed max batch size, not less.
Moreover, I noticed that I actually was writing to ES, just not all of my messages; I was losing between 1 and 3 messages per batch.
When I pushed a DataFrame to ES from Spark, I had the same error message, even with the "es.batch.size.entries" -> "1" configuration.
Once I increased the bulk thread pool in ES, the issue went away.
For example:
Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 600
threadpool.bulk.queue_size: 30000
As was already mentioned here, this is a document write conflict.
Your convertedResult data stream contains multiple records with the same id. When they are written to Elasticsearch as part of the same batch, this produces the error above.
Possible solutions:
Generate a unique id for each record. Depending on your use case this can be done in a few different ways. For example, one common solution is to create a new field by combining the id and lastModifiedDate fields, and use that field as the id when writing to Elasticsearch (see the sketch below).
Perform de-duplication of records based on id: select only one record for a particular id and discard the other duplicates. Depending on your use case, this could be the most recent record (based on a timestamp field), the most complete one (most of the fields contain data), etc.
Solution #1 will store all of the records that you receive in the stream.
Solution #2 will store only one record per id, based on your de-duplication logic. The result is the same as setting "es.batch.size.entries" -> "1", except that you will not limit performance by writing one record at a time.
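For solution #1, a minimal sketch using the Java API of elasticsearch-hadoop (JavaEsSpark) for illustration; "docId" is a hypothetical field assumed to hold the combined unique id in each JSON document:
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class UniqueIdWrite {
    // jsonDocs is assumed to be an RDD of JSON strings that each contain a "docId" field.
    public static void writeWithExplicitIds(JavaRDD<String> jsonDocs) {
        Map<String, String> cfg = new HashMap<>();
        cfg.put("es.input.json", "true");  // the documents are already JSON strings
        cfg.put("es.mapping.id", "docId"); // use the document's own field as the Elasticsearch _id
        JavaEsSpark.saveJsonToEs(jsonDocs, "yahoo/yahoo", cfg);
    }
}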
One other possibility is the cluster/shard status being RED. Please address that issue, which may be due to unassigned replicas. Once the status turned GREEN, the API call succeeded just fine.
This is a document write conflict.
For example:
Multiple documents specify the same _id for Elasticsearch to use.
These documents are located in different partitions.
Spark writes multiple partitions to ES simultaneously.
The result is Elasticsearch receiving multiple updates for a single document at once, from multiple sources / through multiple nodes / containing different data. This explains the symptoms:
"I was losing between 1 and 3 messages per batch."
A fluctuating number of failures when the batch size is > 1.
Success when the batch write size is "1".
Just adding another potential reason for this error; hopefully it helps someone.
If your Elasticsearch index has child documents, then:
If you are using a custom routing field (not _id), then according to the documentation the uniqueness of the documents is not guaranteed, which might cause issues while updating from Spark.
If you are using the standard _id, uniqueness will be preserved; however, you need to make sure the following options are provided while writing from Spark to Elasticsearch (a sketch follows the list):
es.mapping.join
es.mapping.routing
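For illustration, a minimal sketch of passing those options through the elasticsearch-hadoop Java API; "joinField", "routingField" and the index name are placeholders for whatever your documents and index actually use:
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class ParentChildWrite {
    // "joinField" and "routingField" are hypothetical field names in your documents.
    public static void writeChildDocs(JavaRDD<Map<String, Object>> docs) {
        Map<String, String> cfg = new HashMap<>();
        cfg.put("es.mapping.join", "joinField");       // field holding the parent/child join relation
        cfg.put("es.mapping.routing", "routingField"); // field used to route children to the parent's shard
        JavaEsSpark.saveToEs(docs, "myindex/mytype", cfg);
    }
}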

Resources