How do I do an atomic update with etcd?

I am trying to understand what an 'atomic' update is in terms of etcd.
When I think 'atomic', I think there is a 'before' and an 'after' (there isn't a during, and if the update fails, it is still 'before').
Here is an example:
curl -s -XPUT http://localhost:2379/v2/keys/message -d value='Hidee Ho'
So, at this point, anyone can access that message and get the current value:
curl -s http://localhost:2379/v2/keys/message
{"action":"get","node":{"key":"/message","value":"Hidee Ho","modifiedIndex":4748,"createdIndex":4748}}
Later on, I can modify this value, like this:
curl -s -XPUT http://localhost:2379/v2/keys/message -d value='Mr Hanky'
And the result can be fetched, just like before. Before my change the value 'Hidee Ho' comes back; after the change the value 'Mr Hanky' comes back. So, my question is: am I guaranteed one or the other of these results? That is, I want to confirm that one or the other will always be returned (and not a nil value in between).
I don't particularly care about the timing. If I do the Mr Hanky update and subsequent fetchers of the value continue to get Hidee Ho for a (short) period of time, that's OK.
I am confused because there is an Atomic CompareAndSwap function in the protocol. As far as I can tell, it isn't so much Atomic as it is 'only do the update if the value is what I say it is'. In my case I don't much care what the value used to be. I just want to know that it is changed and that no readers will see anything other than the 'before' or 'after' values.

You are correct: a plain PUT is atomic in the sense that a client will only ever see the previous value or the new value.
The CompareAndSwap feature allows you to do optimistic locking, so that you can write new values which depend on the prior value, e.g. a counter. If you were to implement a counter without CompareAndSwap, you'd have something like write("count", 1 + read("count")). Here the read and the write are separate operations; if two callers did this at the same time, it's possible they'd both see the same starting value, and you'd lose one of the increments. Using CAS, the caller can say "set it to 12 only if the previous value is 11". If this happens concurrently, one of the writes will fail, and that caller can re-read and reapply its delta, so no increments are lost.
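As a rough illustration (not part of the original answer), here is a minimal Python sketch of that retry loop against the etcd v2 HTTP API, assuming the requests library and the same localhost:2379 endpoint from the question; prevValue is the v2 query parameter that performs the compare-and-swap:
import requests

ETCD = "http://localhost:2379/v2/keys"

def increment_counter(key, retries=10):
    """Atomically increment an integer key using etcd v2 compare-and-swap."""
    for _ in range(retries):
        # Plain GET: always returns a consistent 'before' or 'after' value.
        current = requests.get("%s/%s" % (ETCD, key)).json()["node"]["value"]
        new_value = str(int(current) + 1)
        # Only write if the value is still what we just read (CAS).
        resp = requests.put(
            "%s/%s" % (ETCD, key),
            params={"prevValue": current},
            data={"value": new_value},
        )
        if resp.ok:
            return int(new_value)
        # Someone else won the race (HTTP 412); re-read and retry.
    raise RuntimeError("could not update %s after %d attempts" % (key, retries))
If the key is updated concurrently, exactly one PUT succeeds per round; the losers simply retry, which is the "re-read and reapply its delta" pattern described above.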

Related

Where is the best place to store an application setting that needs to be updated frequently in ServiceNow

I have a scheduled script execution that needs to persist a value between runs. It is updated with each run. Using gs.setProperty seemed like the natural place until I came across this:
Care should be taken when setting system properties (sys_properties)
using this method as it causes a system-wide cache flush. Each flush
can cause system degradation while the caches rebuild. If a value must
be updated often, it should not be stored as a system property. In
general, you should only place values in the sys_properties table that
do not frequently change.
Creating a separate table to store a single scalar value seems like overkill. Is there a better place to store it?
You could set a preference if you need it in the instance. Another place could be the events table: log an event with the data in parm1 or parm2, and on the next run query the most recent event.
I'd avoid creating a table, as that has cost implications for some clients. I agree with the sys_properties approach; you can encrypt the value with GlideEncrypter so that it isn't stored in plain text:
var encrypter = new GlideEncrypter();
var encrypted = encrypter.encrypt('Super Secret Phrase');
gs.info('encrypted: ' + encrypted);
var decrypted = encrypter.decrypt(encrypted);
gs.info('decrypted: ' + decrypted);
/**
*** Script: encrypted: g/bXLJHa7xNRMKZEo5q/YtLMEdse36ED
*** Script: decrypted: Super Secret Phrase
*/
This way only administrators could really read this data. Also if I recall correctly, the sysevent table is cleared after 7 days. You could have the job remove the event as soon as it has it in memory.

Elasticsearch Delete by Query Version Conflict

I am using Elasticsearch version 5.6.10. I have a query that deletes records for a given agency, so they can later be updated by a nightly script.
The query is written with elasticsearch-dsl and looks like this:
def remove_employees_from_search(jurisdiction_slug, year):
    s = EmployeeDocument.search()
    s = s.filter('term', year=year)
    s = s.query('nested', path='jurisdiction', query=Q("term", **{'jurisdiction.slug': jurisdiction_slug}))
    response = s.delete()
    return response
The problem is that I get a ConflictError exception when trying to delete the records via that function. I have read that this occurs when documents change between the time the delete-by-query starts and when it executes, but I don't see how that can be the case here, because nothing else is modifying these records during the delete.
I am going to add s = s.params(conflicts='proceed') in order to silence the exception. But this is a band-aid as I do not understand why the delete is not processing as expected. Any ideas on how to troubleshoot this? A snapshot of the error is below:
ConflictError: TransportError(409,
u'{
  "took": 10,
  "timed_out": false,
  "total": 55,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 55,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": "employees",
      "type": "employee_document",
      "id": "24681043",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[employee_document][24681043]: version conflict, current version [5] is different than the one provided [4]",
        "index_uuid": "G1QPF-wcRUOCLhubdSpqYQ",
        "shard": "0",
        "index": "employees"
      },
      "status": 409
    },
    {
      "index": "employees",
      "type": "employee_document",
      "id": "24681063",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[employee_document][24681063]: version conflict, current version [5] is different than the one provided [4]",
        "index_uuid": "G1QPF-wcRUOCLhubdSpqYQ",
        "shard": "0",
        "index": "employees"
      },
      "status": 409
    }
You could try making it do a refresh first
client.indices.refresh(index='your-index')
source https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_indices_refresh
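As a rough sketch of how the refresh suggestion above could be combined with the conflicts='proceed' parameter mentioned in the question, using the same elasticsearch-dsl setup (EmployeeDocument and the 'employees' index come from the question; treat this as a hedged suggestion rather than a guaranteed fix):
from elasticsearch_dsl import Q
from elasticsearch_dsl.connections import connections

def remove_employees_from_search(jurisdiction_slug, year):
    # Make recent writes visible to the delete-by-query before it runs.
    connections.get_connection().indices.refresh(index='employees')

    s = EmployeeDocument.search()
    s = s.filter('term', year=year)
    s = s.query('nested', path='jurisdiction',
                query=Q('term', **{'jurisdiction.slug': jurisdiction_slug}))
    # Skip documents whose version changed instead of aborting the whole request.
    s = s.params(conflicts='proceed')
    return s.delete()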
First, this is a question that was asked 2 years ago, so take my response with a grain of salt due to the time gap.
I am using the JavaScript API, but I would bet the flags are similar. When you index or delete, there is a refresh flag which lets you force the result to become visible to search.
I am not an Elasticsearch guru, but the engine has to perform some background maintenance on the indices and shards to move them to a stable state. This happens over time, so you will not necessarily see the state update immediately. Furthermore, from personal experience, I have seen cases where a delete does not appear to remove the item from the index right away: it may mark it as "deleted" and give the document a new version number, but the document seems to stick around (probably until the maintenance sweeps run).
Here I am showing the js API for delete, but it is the same for index and some of the other calls.
client.delete({
  id: string,
  index: string,
  type: string,
  wait_for_active_shards: string,
  refresh: 'true' | 'false' | 'wait_for',
  routing: string,
  timeout: string,
  if_seq_no: number,
  if_primary_term: number,
  version: number,
  version_type: 'internal' | 'external' | 'external_gte' | 'force'
})
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_delete
refresh
'true' | 'false' | 'wait_for' - If true then refresh the affected shards to make this operation visible to search, if wait_for then wait for a refresh to make this operation visible to search, if false (the default) then do nothing with refreshes.
For additional reference, here is the page on Elasticsearch refresh info and what might be a fairly relevant blurb for you.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
Use the refresh API to explicitly refresh one or more indices. If the request targets a data stream, it refreshes the stream’s backing indices. A refresh makes all operations performed on an index since the last refresh available for search.
By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds. You can change this default interval using the index.refresh_interval setting.
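For completeness, here is a small Python sketch of both options using the elasticsearch client (the 'employees' index name is taken from the error output above; the interval value is only an example):
from elasticsearch import Elasticsearch

client = Elasticsearch()  # assumes a locally reachable cluster

# Explicitly refresh the index so all prior operations become searchable.
client.indices.refresh(index='employees')

# Or adjust the automatic refresh interval for this index.
client.indices.put_settings(index='employees',
                            body={'index': {'refresh_interval': '1s'}})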

Redis pipeline, dealing with cache misses

I'm trying to figure out the best way to implement Redis pipelining. We use redis as a cache on top of MySQL to store user data, product listings, etc.
I'm using this as a starting point: https://joshtronic.com/2014/06/08/how-to-pipeline-with-phpredis/
My question is this: assume you have an array of ids, properly sorted, and you loop through the Redis pipeline like this:
$redis = new Redis();
// Opens up the pipeline
$pipe = $redis->multi(Redis::PIPELINE);
// Loops through the data and performs actions
foreach ($users as $user_id => $username)
{
    // Increment the number of times the user record has been accessed
    $pipe->incr('accessed:' . $user_id);
    // Pulls the user record
    $pipe->get('user:' . $user_id);
}
// Executes all of the commands in one shot
$users = $pipe->exec();
What happens when $pipe->get('user:' . $user_id); is not available, because it hasn't been requested before or has been evicted by Redis, etc? Assuming it's result # 13 from 50, how do we a) find out that we weren't able to retrieve that object and b) keep the array of users properly sorted?
Thank you
I will answer the question in terms of the Redis protocol; how it works in a particular language binding is more or less the same.
First of all, let's check how a Redis pipeline works:
It is just a way to send multiple commands to the server, execute them, and get the replies back. There is nothing special about it: you simply get an array with one reply per command in the pipeline.
Pipelines are much faster because the round-trip time for each command is saved, i.e. for 100 commands there is only one round trip instead of 100. In addition, Redis executes commands one at a time: 100 separate commands potentially have to compete 100 times with other clients to be picked up, whereas a pipeline is treated as one long command and only has to wait to be picked up once.
You can read more about pipelining here: https://redis.io/topics/pipelining. One more note: because each pipelined batch runs uninterrupted (as far as Redis is concerned), it makes sense to send the commands in manageable chunks, i.e. don't send 100k commands in a single pipeline, which might block Redis for a long time; split them into chunks of 1k or 10k commands, as in the sketch below.
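To make the chunking point concrete, here is a rough Python sketch using the redis-py client (the chunk size of 1,000 is only an illustrative choice):
import redis

r = redis.Redis()
CHUNK = 1000  # commands per pipeline batch; tune for your workload

def touch_users(user_ids):
    replies = []
    for start in range(0, len(user_ids), CHUNK):
        pipe = r.pipeline(transaction=False)  # plain pipelining, no MULTI/EXEC
        for user_id in user_ids[start:start + CHUNK]:
            pipe.incr('accessed:%s' % user_id)
            pipe.get('user:%s' % user_id)
        replies.extend(pipe.execute())
    return replies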
In your case you run in the loop the following fragment:
// Increment the number of times the user record has been accessed
$pipe->incr('accessed:' . $user_id);
// Pulls the user record
$pipe->get('user:' . $user_id);
The question is: what gets put into the pipeline? Let's say you update the data for the user ids u1, u2, u3 and u4. The pipeline of Redis commands will then look like:
INCR accessed:u1
GET user:u1
INCR accessed:u2
GET user:u2
INCR accessed:u3
GET user:u3
INCR accessed:u4
GET user:u4
Let's say:
u1 was accessed 100 times before,
u2 was accessed 5 times before,
u3 was not accessed before and
u4 and its accompanying data do not exist.
The result will be in that case an array of Redis replies having:
101
u1 string data stored at user:u1
6
u2 string data stored at user:u2
1
u3 string data stored at user:u3
1
NIL
As you can see, Redis treats a missing key for INCR as 0 and increments it from there. Finally, nothing is sorted by Redis; the replies come back in exactly the order the commands were sent.
The language binding, i.e. the Redis driver, just parses that protocol for you and presents the parsed data. Without preserving the order of commands it would be impossible for the driver to work correctly, or for you as a programmer to make sense of anything. Just keep in mind that the request is not duplicated in the reply, i.e. you will not receive the key for u1 or u2 when doing a GET, only the data stored at that key. So your implementation must remember that position 1 (zero-based) holds the result of GET for u1, and so on.
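To answer the original question concretely, here is a hedged Python sketch (redis-py again) that maps each reply back to its user id, detects cache misses (the NIL replies, which arrive as None), and keeps the original ordering; what you do with the misses, e.g. reload from MySQL and backfill the cache, is up to you:
def fetch_users(r, user_ids):
    pipe = r.pipeline(transaction=False)
    for user_id in user_ids:
        pipe.incr('accessed:%s' % user_id)
        pipe.get('user:%s' % user_id)
    replies = pipe.execute()

    users, missing = {}, []
    # Replies come back in command order: [incr, get, incr, get, ...],
    # so the GET reply for user_ids[i] sits at position 2 * i + 1.
    for i, user_id in enumerate(user_ids):
        data = replies[2 * i + 1]
        if data is None:
            missing.append(user_id)  # cache miss: never cached or evicted
        else:
            users[user_id] = data
    return users, missing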

How to get all message history from Hipchat for a room via the API?

I was using the Hipchat API (v2) a bit today and ran into an odd issue where I was not able to pull up all of the history for a room. It seemed as though when I queried a specific date, for example, it would only retrieve a fraction of the history for that date. I had planned to simply iterate across all of the dates for a room to extract the history in a format I could use, but I hit this and am now unsure whether it is really possible to pull the history out fully.
I realize that this is a bit clunky. It pulls the JSON as a string which I then have to turn into a hash, so I know I'm not doing this as well as it could be done, but here is roughly what I quickly did just to test out the history method of the API:
api_token = "MY_TOKEN"
client = HipChat::Client.new(api_token, :api_version => 'v2')
history = client['ROOM_NAME'].history
history = JSON.parse(history)
history.each do |key, history|
if history.is_a? Array
history.each do |message|
if message.is_a? Hash
puts "#{message['from']['name']}: #{message['message']}"
end
end
end
end
Obviously the extension to that was to just iterate through the dates in the desired range (using: client['ROOM_NAME'].history(:date => '2010-11-19', :timezone => 'PST')), but again, I was only getting a fraction of the history for the room. Are there some additional parameters that I'm missing to make this work as expected?
I got this working but it was a big pain.
Start by sending a query with the current time, in UTC, but without including the time zone, as the start date:
https://internal-hipchat-server/v2/room/2/history?reverse=false&date=2015-06-25T20:42:18.658439&max-results=1000&auth_token=XXX
This is very fiddly:
If you specify just the current date, without a timezone, as documented in the API, it is interpreted as midnight last night and you only get messages from yesterday or older.
If you try specifying tomorrow’s date instead, the response is 400 Bad Request This day has not yet come to pass.
If you specify the time as 2015-06-25T20:42:18.658439+00:00, which is the format that times come in HipChat API responses, HipChat’s parser seems to fail and interpret it as midnight last night.
When you get the response back, take the oldest items.date property, strip the timezone, and resubmit the above URL with an updated date parameter:
https://internal-hipchat-server/v2/room/2/history?reverse=false&date=2015-06-17T19:56:34.533182&max-results=1000&auth_token=XXX
Be sure to include the microseconds, in case a notification posted multiple messages to the same room in the same second.
This will get you the next page of messages. Keep doing this until you get fewer than max-results messages back.
There is a start-index parameter I tried passing before I got the above working, and it will give you a few pages of results, with responses lacking a links.next property, but it won’t give you the full history. On a chatroom with 9166 messages in the history according to statistics.messages_sent, it only returned 3217 messages. So don’t use it. You can use statistics.messages_sent as a sanity check for whether you get all messages.
Oh yeah, and the last_active property in the /v2/room call cannot be trusted because it doesn’t update when notification messages are posted to the room.
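Putting those steps together, a rough Python sketch of the pagination loop might look like the following (the server URL, room id, token and page size are the placeholders from the answer above; the date handling simply encodes the behaviour described, so treat it as an assumption rather than documented API semantics):
import requests

BASE = "https://internal-hipchat-server/v2"
TOKEN = "XXX"
ROOM_ID = 2
MAX_RESULTS = 1000

def fetch_full_history(start_date):
    # start_date: current UTC time as 'YYYY-MM-DDTHH:MM:SS.ffffff', no timezone.
    date = start_date
    messages = []
    while True:
        resp = requests.get(
            "%s/room/%s/history" % (BASE, ROOM_ID),
            params={
                "reverse": "false",
                "date": date,
                "max-results": MAX_RESULTS,
                "auth_token": TOKEN,
            },
        )
        items = resp.json()["items"]
        messages.extend(items)
        if len(items) < MAX_RESULTS:
            break
        # Take the oldest item's date, strip the timezone offset, and page again.
        # Dates arrive as e.g. 2015-06-25T20:42:18.658439+00:00.
        oldest = min(item["date"] for item in items)
        date = oldest.split("+")[0]
    return messages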

Python slow to check if MongoDB record found

I have a Python (3.2) query that goes to MongoDB, and the query itself runs fast enough. When I then perform an if-statement check to see whether any records were found, it takes 50 times as long:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
58 27623 6475988 234.4 1.7 itemInDB = db.mainData.find({"x":item[x]}).limit(1)
59
60 #existing item in db
61 27623 293419802 10622.3 77.6 if itemInDB.count():
What on earth is causing that if statement to take so long?! I presume there must be a better way to check whether a record was found, but Google has come up empty.
Thanks for the help.
Perhaps a Better Way
If you're only interested in returning one value, you might want to use find_one instead of find. It will stop looking for values after one has been found, as opposed to find, which has to run through the collection:
itemInDB = db.mainData.find_one({"x": item[x]})
if itemInDB:
    print("Item found")
else:
    print("Item not found")
For Your Example
According to the PyMongo docs, when querying the count of a cursor, you can pass in a parameter (True or False) to take into account any skip or limit calls previously made to the cursor. The default for that parameter is False (namely, not taking those calls into account). That may be affecting the performance of your count query.
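For instance, with the cursor from the question that would look something like the snippet below (with_limit_and_skip is the PyMongo Cursor.count parameter being described; Cursor.count has since been removed in much newer PyMongo releases, so this only applies to versions contemporary with the question):
itemInDB = db.mainData.find({"x": item[x]}).limit(1)
# Honour the .limit(1) when counting, so the count can stop after one match.
if itemInDB.count(with_limit_and_skip=True):
    print("Item found")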
Gauging Query Performance
If you want to see how your query will be carried out by mongo, you can call explain on your cursor:
db.coll.find({"x":4}).explain()
The explain function is also implemented in PyMongo.
It turns out it was due to the find() call and not the if statement itself. I created an index on "x" (as I should have anyway), changed the find to find_one, and removed the .count() from the if statement. Overall it is 75% faster.
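For reference, a minimal PyMongo sketch of that fix (the database name is hypothetical; mainData, "x" and item[x] come from the question):
from pymongo import MongoClient

db = MongoClient().mydb  # hypothetical database name

# One-time: index the queried field so lookups don't scan the whole collection.
db.mainData.create_index("x")

itemInDB = db.mainData.find_one({"x": item[x]})
if itemInDB:
    print("Item found")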
