I have two different sources of data that I need to marry together. Data set A has a foo_key attribute that maps to data set B's bar_key attribute in a one-to-many relationship.
Data set A:
[{ foo_key: 12345, other: 'blahblah' }, ...]
Data set B:
[{ bar_key: 12345, other: '' }, { bar_key: 12345, other: '' }, { bar_key: 12345, other: '' }, ...]
Data set A comes from an SQS queue, and any related items in data set B will already be available as I poll A.
Data set B comes from a separate SQS queue that I am trying to dump into a memcached cache so I can do quick lookups when an object drops into data set A.
Originally I was planning on making the memcached key the bar_key from the objects in data set B, but then I realized that the value could be overwritten, since many objects can share the same bar_key value. Then I thought I could use bar_key as the key and store an array of the SQS messages as the value. But since I have multiple hosts polling the SQS queue, I think the following could happen: one host checks whether the key is in memcached, reads it, appends the new message, and sets it back, while another host is performing the same operation, so the first host's append simply gets overwritten.
I've looked at memcached key locking, but I'm not sure I understand it entirely. Would the solution be that, when I get the key/value pair from memcached, I create a temporary dummy lock on a new key called bar_key_dummy that expires in x seconds, and if I try to fetch a key that has an active bar_key_dummy lock, I just send the SQS message back to the queue without deleting it, to try again in x seconds?
Here's some pseudocode for what I have going on in my head. Does this make any sense?
store = MemCache.new(host)

sqs_messages.poll do |message|
  dummy_key = "#{message.bar_key}_dummy"
  sqs.dont_delete_message && next unless store.get(dummy_key).nil?

  # set dummy_key in memcache with a value of 1 for 3 seconds
  store.set(dummy_key, 1, 3)

  temp_data = store.get(message.bar_key) || []
  temp_data << message
  store.set(message.bar_key, temp_data, 300)

  # delete dummy key when done in case shorter than x seconds
  store.delete(dummy_key)
end
Thanks for any help!
Memcached has a special operation for this: cas, Compare And Swap.
The gets command returns an item along with a unique CAS value.
You can then modify the data and issue the update with the cas command, passing back that original CAS value.
If the CAS value changed between the two commands, the update fails with an EXISTS error and you simply retry.
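As a rough sketch of that retry loop (not from the original answer), here it is in Python using the pymemcache client; the key name, TTLs and JSON serialization are my own assumptions:

import json
from pymemcache.client.base import Client

store = Client(("localhost", 11211))

def append_message(bar_key, message, ttl=300):
    key = str(bar_key)
    while True:
        value, cas_token = store.gets(key)
        if value is None:
            # Key not there yet: try to create it atomically.
            if store.add(key, json.dumps([message]).encode("utf-8"),
                         expire=ttl, noreply=False):
                return
            continue  # another host created it first, start over
        messages = json.loads(value)
        messages.append(message)
        # cas() only stores if nobody touched the key since our gets().
        if store.cas(key, json.dumps(messages).encode("utf-8"),
                     cas_token, expire=ttl):
            return

Because a lost race just means another trip around the loop, no host's append is silently overwritten, and you would only delete the SQS message once the cas succeeds.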
I have a scheduled script execution that needs to persist a value between runs. It is updated with each run. Using gs.setProperty seemed like the natural place until I came across this:
Care should be taken when setting system properties (sys_properties)
using this method as it causes a system-wide cache flush. Each flush
can cause system degradation while the caches rebuild. If a value must
be updated often, it should not be stored as a system property. In
general, you should only place values in the sys_properties table that
do not frequently change.
Creating a separate table to store a single scalar value seems like overkill. Is there a better place to store it?
You could set a preference if you need it in the instance. Another place could be the events table. Log the event with the data in parm1 or parm2 and on next run query the most recent event.
I'd avoid creating a new table, as that has cost implications for some clients. I agree with using sys_properties; if the value is sensitive, you can encrypt it with GlideEncrypter before storing it:
var encrypter = new GlideEncrypter();
var encrypted = encrypter.encrypt('Super Secret Phrase');
gs.info('encrypted: ' + encrypted);
var decrypted = encrypter.decrypt(encrypted);
gs.info('decrypted: ' + decrypted);
/**
*** Script: encrypted: g/bXLJHa7xNRMKZEo5q/YtLMEdse36ED
*** Script: decrypted: Super Secret Phrase
*/
This way only administrators could really read this data. Also if I recall correctly, the sysevent table is cleared after 7 days. You could have the job remove the event as soon as it has it in memory.
I need to remove duplicates from a flow I've developed, it can receive the same ${filename} multiple times. I tried using HBase_1_1_2_ClientMapCacheService with DetectDuplicate (I am using NiFi v1.4), but found that it lets a few duplicates through. If I use DistributedMapCache (ClientService and Server), I do not get any duplicates. Why would I receive some duplicates with the HBase Cache?
As a test, I listed a directory (ListSFTP) with 20,000 files on all cluster nodes (4 nodes) and passed to DetectDuplicate (using the HBase Cache service). It routed 20,020 to "non-duplicate", and interestingly the table actually has 20,000 rows.
Unfortunately I think this is due to a limitation in the operations that are offered by HBase.
The DetectDuplicate processor relies on an operation "getAndPutIfAbsent" which is expected to return the original value, and then set the new value if it wasn't there. For example, first time through it would return null and set the new value, indicating it wasn't a duplicate.
HBase doesn't natively support this operation, so the implementation of this method in the HBase map cache client does this:
V got = get(key, keySerializer, valueDeserializer);
boolean wasAbsent = putIfAbsent(key, value, keySerializer, valueSerializer);
if (! wasAbsent) return got;
else return null;
So because it is two separate calls there is a possible race condition...
Imagine node 1 executes the first line and gets null, but then node 2 performs its get and its putIfAbsent. Now, when node 1 calls putIfAbsent, it gets false because node 2 just populated the cache, so node 1 returns the null value from its original get... both nodes therefore look like they saw a non-duplicate to DetectDuplicate.
In the DistributedMapCacheServer, it locks the entire cache per operation so it can provide an atomic getAndPutIfAbsent.
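To make the difference concrete, here is an illustrative sketch in Python (not NiFi's actual code): the atomic variant does the read and the conditional write under one lock, while the racy variant mirrors the two separate calls shown above.

import threading

class MapCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get_and_put_if_absent_atomic(self, key, value):
        # Read and conditional write happen under one lock, so two callers
        # can never both observe "absent" for the same key.
        with self._lock:
            existing = self._data.get(key)
            if existing is None:
                self._data[key] = value
            return existing

    def get_and_put_if_absent_racy(self, key, value):
        # Two separate steps: another caller can slip in between them,
        # and then both callers end up returning None ("not a duplicate").
        existing = self._data.get(key)
        if key not in self._data:
            self._data[key] = value
            was_absent = True
        else:
            was_absent = False
        return existing if not was_absent else None

In the distributed case the two "callers" are different NiFi nodes talking to HBase, but the shape of the race is the same.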
I'm trying to figure out the best way to implement Redis pipelining. We use redis as a cache on top of MySQL to store user data, product listings, etc.
I'm using this as a starting point: https://joshtronic.com/2014/06/08/how-to-pipeline-with-phpredis/
My question is this: assume you have an array of ids, properly sorted, and you loop through the Redis pipeline like this:
$redis = new Redis();

// Opens up the pipeline
$pipe = $redis->multi(Redis::PIPELINE);

// Loops through the data and performs actions
foreach ($users as $user_id => $username)
{
    // Increment the number of times the user record has been accessed
    $pipe->incr('accessed:' . $user_id);

    // Pulls the user record
    $pipe->get('user:' . $user_id);
}

// Executes all of the commands in one shot
$users = $pipe->exec();
What happens when $pipe->get('user:' . $user_id); is not available, because it hasn't been requested before or has been evicted by Redis, etc? Assuming it's result # 13 from 50, how do we a) find out that we weren't able to retrieve that object and b) keep the array of users properly sorted?
Thank you
I will answer the question in terms of the Redis protocol; how it works in a particular language binding is more or less the same.
First of all, let's check how Redis pipeline works:
It is just a way to send multiple commands to the server, have them executed, and get multiple replies back. There is nothing special about it: you simply get an array with one reply for each command in the pipeline.
Pipelines are much faster mainly because the round-trip time per command is saved: for 100 commands there is only one round trip instead of 100. In addition, Redis executes commands one at a time, so 100 separate commands potentially have to compete 100 times to be picked up, whereas a pipeline is treated as one long command and only waits once.
You can read more about pipelining here: https://redis.io/topics/pipelining. One more note: because each pipelined batch runs uninterrupted (as far as Redis is concerned), it makes sense to send the commands in manageable chunks. Don't send 100k commands in a single pipeline, as that might block Redis for a long time; split them into chunks of, say, 1k or 10k commands.
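Purely as an illustration of the chunking idea (in Python with redis-py rather than phpredis; the chunk size and key names are assumptions):

import redis

r = redis.Redis()                            # assumes a local Redis instance
user_ids = [str(i) for i in range(10_000)]   # placeholder ids
CHUNK_SIZE = 1_000

replies = []
for start in range(0, len(user_ids), CHUNK_SIZE):
    chunk = user_ids[start:start + CHUNK_SIZE]
    pipe = r.pipeline(transaction=False)     # plain pipeline, no MULTI/EXEC
    for user_id in chunk:
        pipe.incr('accessed:' + user_id)
        pipe.get('user:' + user_id)
    # One round trip per chunk instead of one per command.
    replies.extend(pipe.execute())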
In your case, you run the following fragment in the loop:
// Increment the number of times the user record has been accessed
$pipe->incr('accessed:' . $user_id);
// Pulls the user record
$pipe->get('user:' . $user_id);
The question is: what ends up in the pipeline? Let's say you update data for the user ids u1, u2, u3 and u4. The pipeline of Redis commands will then look like:
INCR accessed:u1
GET user:u1
INCR accessed:u2
GET user:u2
INCR accessed:u3
GET user:u3
INCR accessed:u4
GET user:u4
Let's say:
u1 was accessed 100 times before,
u2 was accessed 5 times before,
u3 was not accessed before and
u4 and its accompanying data do not exist.
The result in that case will be an array of Redis replies containing:
101
u1 string data stored at user:u1
6
u2 string data stored at user:u2
1
u3 string data stored at user:u3
1
NIL
As you can see, INCR treats a missing key as 0 and increments it, which is why u3 and u4 both come back as 1. And nothing is sorted by Redis: the replies come back in exactly the order the commands were sent.
The language binding (i.e. the Redis driver) just parses that protocol for you and gives you a view onto the parsed data. Without preserving the order of commands it would be impossible for the driver to work correctly, or for you as a programmer to deduce anything. Just keep in mind that the request is not duplicated in the reply, i.e. you will not receive the key for u1 or u2 with a GET, just the data stored at that key. So your implementation must remember that position 1 (zero-based) holds the result of the GET for u1, position 3 the GET for u2, and so on.
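As a sketch of that bookkeeping (in Python with redis-py rather than phpredis; the key layout is taken from the question, everything else is an assumption), pair each id with its slice of the reply array and treat a None GET reply as a cache miss:

import redis

r = redis.Redis()
user_ids = ['u1', 'u2', 'u3', 'u4']      # already in the order you want

pipe = r.pipeline(transaction=False)
for user_id in user_ids:
    pipe.incr('accessed:' + user_id)     # reply index 2*i
    pipe.get('user:' + user_id)          # reply index 2*i + 1
replies = pipe.execute()

users, missing = {}, []
for i, user_id in enumerate(user_ids):
    data = replies[2 * i + 1]            # the GET reply for this id
    if data is None:                     # nil reply: never cached or evicted
        missing.append(user_id)          # fall back to MySQL for these
    else:
        users[user_id] = data
# users preserves insertion order (Python 3.7+), so the original sort survives;
# missing tells you exactly which ids need to be fetched and re-cached.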
The AWS SimpleDB documentation for the Ruby SDK provides the following example with regard to using the get_attributes method:
resp = client.get_attributes({
domain_name: "String", # required
item_name: "String", # required
attribute_names: ["String"],
consistent_read: false,
})
...and then the following example response:
resp.attributes #=> Array
resp.attributes[0].name #=> String
resp.attributes[0].alternate_name_encoding #=> String
resp.attributes[0].value #=> String
resp.attributes[0].alternate_value_encoding #=> String
It also states the following piece of advice:
If the item does not exist on the replica that was accessed for this operation, an empty set is returned. The system does not return an error as it cannot guarantee the item does not exist on other replicas.
I hope that I'm misunderstanding this, but if your response does return an empty set, then how are you supposed to know if it's because no item exists with the supplied item name, or if your request just hit a replica that doesn't contain your item?
I have never used AWS SimpleDB before, but from what little I know about replication in Amazon's DynamoDB, the data is eventually consistent by default: while one replica handles your request to read the attributes, replication of previously written data may still be in progress across the replicas responsible for storing it. That is why the replica handling your read may not have the data stored (yet), and why it cannot respond with an error message.
To be 100% sure, you should be able to specify the consistent_read: true parameter, which should tell you whether the data exists in AWS SimpleDB or not.
According to the documentation of the get_attributes method:
:consistent_read (Boolean) —
Determines whether or not strong consistency should be enforced when data is read from SimpleDB. If true, any data previously written to SimpleDB will be returned. Otherwise, results will be consistent eventually, and the client may not see data that was written immediately before your read.
From: https://rethinkdb.com/docs/changefeeds/javascript/#including-result-types
Could the uninitial type be defined further? If initial is just an add that happened before I started the feed, then how do I get uninitial?
How do I get state? With includeInitial, includeStates and includeTypes set to true, I'll get separate state docs, but never one with type: "state".
There's a better explanation of what "uninitial" results are in the "Including initial values" section of the document that you linked. To quote:
If an initial result for a document has been sent and a change is made to that document that would move it to the unsent part of the result set (for instance, a changefeed monitors the top 100 posters, the first 50 have been sent, and poster 48 has become poster 52), an “uninitial” notification will be sent, with an old_val field but no new_val field.
The reason these exist is due to how RethinkDB changefeeds implement the initial results logic. Initial results are processed more or less from left to right in the key space of the table. There is always a slice of the key space for which initial results are still being sent, and a remaining slice for which the changefeed has already started "streaming" current updates in realtime. When you first open a changefeed with includeInitial: true, the whole key range is in the initializing state. Then, as initial results are sent over the changefeed, the boundary between the initializing and streaming parts moves and more of the key space becomes streaming.
uninitial values happen if a document's key moves from a part of the key space that is already streaming to a part that is still initializing. This can only happen for changefeeds that use secondary indexes, since the primary key of a given document can never change.
Regarding state: I seem to be getting the type: "state" documents just fine. For example:
r.table('t1').changes({includeStates: true, includeInitial: true, includeTypes: true})

{ "state": "ready", "type": "state" }
{ "state": "initializing", "type": "state" }
Are you not getting these documents?