How to detect if SimpleDB domain contains the requested item? - ruby

The AWS SimpleDB documentation for the Ruby SDK provides the following example with regard to using the get_attributes method:
resp = client.get_attributes({
  domain_name: "String", # required
  item_name: "String", # required
  attribute_names: ["String"],
  consistent_read: false,
})
...and then the following example response:
resp.attributes #=> Array
resp.attributes[0].name #=> String
resp.attributes[0].alternate_name_encoding #=> String
resp.attributes[0].value #=> String
resp.attributes[0].alternate_value_encoding #=> String
It also states the following piece of advice:
If the item does not exist on the replica that was accessed for this operation, an empty set is returned. The system does not return an error as it cannot guarantee the item does not exist on other replicas.
I hope that I'm misunderstanding this, but if your response does return an empty set, then how are you supposed to know if it's because no item exists with the supplied item name, or if your request just hit a replica that doesn't contain your item?

I have never used AWS SimpleDB before, but from what I know about replication in Amazon's DynamoDB, reads are eventually consistent by default. While one replica handles your request to read the attributes, the previously written data may still be in the process of being replicated across the replicas responsible for storing it. That is why the replica handling your read may not have the data stored yet, and why the service cannot respond with an error message.
To be sure, you should be able to specify the consistent_read: true parameter, as a strongly consistent read should tell you whether the data exists in AWS SimpleDB or not. According to the documentation of the get_attributes method:
:consistent_read (Boolean) —
Determines whether or not strong consistency should be enforced when data is read from SimpleDB. If true, any data previously written to SimpleDB will be returned. Otherwise, results will be consistent eventually, and the client may not see data that was written immediately before your read.
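Applied to the request from the question, a minimal sketch could look like this (the domain and item names are placeholders; note that an empty set can also mean the item exists but has none of the requested attributes):
resp = client.get_attributes({
  domain_name: "my-domain", # placeholder
  item_name: "item-123",    # placeholder
  consistent_read: true,    # force a strongly consistent read
})

if resp.attributes.empty?
  # With consistent_read: true, an empty set means the item (or at least
  # the requested attributes) genuinely does not exist.
  puts "item not found"
else
  resp.attributes.each { |attr| puts "#{attr.name}: #{attr.value}" }
end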

Related

golang get kubernetes resources(30000+ configmaps) failed

I want to use client-go to get resources in a Kubernetes cluster. Due to the large amount of data, the connection is closed when I list the ConfigMaps:
stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 695; INTERNAL_ERROR
configmaps:
$ kubectl -n kube-system get cm |wc -l
35937
code:
cms, err := client.CoreV1().ConfigMaps("kube-system").List(context.TODO(), v1.ListOptions{})
I tried the Limit parameter and can get some of the data, but I don't know how to get all of it.
cms, err := client.CoreV1().ConfigMaps("kube-system").List(context.TODO(), v1.ListOptions{Limit: 1000})
I'm new to Go. Any pointers as to how to go about it would be greatly appreciated.
The documentation for v1.ListOptions describes how it works:
limit is a maximum number of responses to return for a list call. If more items exist, the server will set the continue field on the list metadata to a value that can be used with the same initial query to retrieve the next set of results.
This means that you should examine the response, save the value of the continue field (as well as the actual results), then reissue the same command with continue set to the value you just received. Repeat until the returned continue field is empty (or an error occurs).
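A minimal sketch of that loop, assuming client is a *kubernetes.Clientset and v1 is the meta/v1 package, as in the question:
import (
    "context"

    corev1 "k8s.io/api/core/v1"
    v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// listAllConfigMaps pages through every ConfigMap in the namespace,
// following the continue token until the server reports no more results.
func listAllConfigMaps(client *kubernetes.Clientset, namespace string) ([]corev1.ConfigMap, error) {
    var all []corev1.ConfigMap
    opts := v1.ListOptions{Limit: 1000}
    for {
        cms, err := client.CoreV1().ConfigMaps(namespace).List(context.TODO(), opts)
        if err != nil {
            return nil, err
        }
        all = append(all, cms.Items...)
        if cms.Continue == "" {
            return all, nil // last page reached
        }
        opts.Continue = cms.Continue // same query, next chunk
    }
}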
See the API concepts page for details on handling chunking of large results.
Alternatively, you can use a ListPager (from k8s.io/client-go/tools/pager) to paginate requests that need to query many objects. The ListPager can also buffer pages, so it performs better than driving Limit and Continue by hand.
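A sketch of how that could look, with the same client and v1 assumptions as above plus the k8s.io/client-go/tools/pager and k8s.io/apimachinery/pkg/runtime packages:
// processAllConfigMaps streams every ConfigMap through the callback,
// letting the pager follow the continue token internally.
func processAllConfigMaps(client *kubernetes.Clientset, namespace string) error {
    p := pager.New(pager.SimplePageFunc(func(opts v1.ListOptions) (runtime.Object, error) {
        return client.CoreV1().ConfigMaps(namespace).List(context.TODO(), opts)
    }))
    p.PageSize = 1000
    return p.EachListItem(context.TODO(), v1.ListOptions{}, func(obj runtime.Object) error {
        cm, ok := obj.(*corev1.ConfigMap)
        if !ok {
            return nil
        }
        _ = cm // process each ConfigMap here
        return nil
    })
}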

409 error when using streaming_bulk() - certain that document is only included once.

I am attempting to upload a large number of documents - about 7 million.
I have created actions for each document to be added and split them up into about 260 files, about 30K documents each.
Here is the format of the actions:
a = some_document  # a dict with nested fields
esActionFromFile = [{
    '_index': 'mt-interval-test-9',
    '_type': 'doc',
    '_id': 5641254,
    '_source': a,
    '_op_type': 'create'}]
I have tried using helpers.bulk, helpers.parallel_bulk, and helpers.streaming_bulk and have had partial success using helpers.bulk and helpers.streaming_bulk.
Each time I run a test, I delete, and then recreate the index using:
# Delete and recreate the index
es.indices.delete(index=index, ignore=[400, 404])
es.indices.create(index=index, body=mappings_request_body)
When I am partially successful - many documents are loaded, but eventually I get a 409 version conflict error.
I am aware that there can be version conflicts created when there has not been sufficient time for ES to process the deletion of individual documents after doing a delete by query.
At first, I thought that something similar was happening here. However, I realized that I am often getting the errors from files the first time they have ever been processed (i.e. even if the deletion was causing issues, this particular file had never been loaded, so there couldn't be a conflict).
The _id value I am using is the primary key from the original database where I am extracting the data from - so I am certain they are unique. Furthermore, I have checked whether there was unintentional duplication of records in my actions arrays, or the files I created them from, and there are no duplicates.
I am at a loss to explain why this is happening, and struggling to find a solution to upload my data.
Any assistance would be greatly appreciated!
There should be information attached to the 409 response that tells you exactly what went wrong and which document caused it.
Another possible cause is a retry: when elasticsearch-py cannot connect to the cluster, it will resend the request to a different node. In some scenarios a request can end up being sent twice this way, especially if you have enabled the retry_on_timeout option.
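For example, a minimal sketch of surfacing those per-document errors with streaming_bulk (the connection settings are placeholders; esActionFromFile is the actions list from the question):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # placeholder connection settings

# Ask the helper to report failures instead of raising, so each 409
# response can be examined to see which document conflicted and why.
for ok, item in helpers.streaming_bulk(es, esActionFromFile, raise_on_error=False):
    if not ok:
        op_type, details = item.popitem()  # e.g. ('create', {'status': 409, ...})
        print(op_type, details.get('status'), details.get('_id'), details.get('error'))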

Getting duplicates with NiFi HBase_1_1_2_ClientMapCacheService

I need to remove duplicates from a flow I've developed; it can receive the same ${filename} multiple times. I tried using HBase_1_1_2_ClientMapCacheService with DetectDuplicate (I am using NiFi v1.4), but found that it lets a few duplicates through. If I use DistributedMapCache (ClientService and Server), I do not get any duplicates. Why would I receive some duplicates with the HBase cache?
As a test, I listed a directory (ListSFTP) with 20,000 files on all cluster nodes (4 nodes) and passed the flow files to DetectDuplicate (using the HBase cache service). It routed 20,020 to "non-duplicate", yet interestingly the table actually has 20,000 rows.
Unfortunately I think this is due to a limitation in the operations that are offered by HBase.
The DetectDuplicate processor relies on an operation "getAndPutIfAbsent" which is expected to return the original value, and then set the new value if it wasn't there. For example, first time through it would return null and set the new value, indicating it wasn't a duplicate.
HBase doesn't natively support this operation, so the implementation of this method in the HBase map cache client does this:
// First fetch the current value (null if the key has never been seen)...
V got = get(key, keySerializer, valueDeserializer);
// ...then try to store the new value; putIfAbsent returns true only if the key was absent.
boolean wasAbsent = putIfAbsent(key, value, keySerializer, valueSerializer);
if (!wasAbsent) return got;  // the key was already present: return the previously fetched value
else return null;            // the key was absent and we stored our value: report no prior value
So because it is two separate calls there is a possible race condition...
Imagine node 1 calls the first line and gets null, but before it can call putIfAbsent, node 2 performs its own get and putIfAbsent. When node 1 then calls putIfAbsent it gets false, because node 2 just populated the cache, so node 1 returns the null value from its original get. Both nodes therefore look like non-duplicates to DetectDuplicate.
The DistributedMapCacheServer, on the other hand, locks the entire cache per operation, so it can provide an atomic getAndPutIfAbsent.

Using includeTypes flag on changefeeds

From: https://rethinkdb.com/docs/changefeeds/javascript/#including-result-types
Could the uninitial type be further defined? If initial is just an add that happened before I started the feed, then how do I get uninitial?
Also, how do I get state? With includeInitial, includeStates, and includeTypes set to true, I should get separate state docs, but I never see a document with type: "state".
There's a better explanation of what "uninitial" results are in the "Including initial values" section of the document that you linked. To quote:
If an initial result for a document has been sent and a change is made to that document that would move it to the unsent part of the result set (for instance, a changefeed monitors the top 100 posters, the first 50 have been sent, and poster 48 has become poster 52), an “uninitial” notification will be sent, with an old_val field but no new_val field.
The reason these exist is due to how RethinkDB changefeeds implement the initial results logic. Initial results are processed more or less from left to right in the key space of the table. There is always a slice of the key space for which initial results are still being sent, and a remaining slice for which the changefeed has already started "streaming" current updates in realtime. When you first open a changefeed with includeInitial: true, the whole key range is in the initializing state. Then, as initial results are sent over the changefeed, the boundary between the initializing and streaming parts moves and more of the key space becomes streaming.
"uninitial" values happen when a document's key moves from a part of the key space that is already streaming to a part that is still initializing. This can only happen for changefeeds that use secondary indexes, since the primary key of a given document can never change.
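As an illustration (the table, index, and field names here are made up, not from your question), a feed like the "top posters" example in the docs could produce one:
r.table('posts').orderBy({index: r.desc('score')}).limit(100)
 .changes({includeInitial: true, includeTypes: true})
// If a document that was already sent as an initial result falls back into
// the still-initializing part of the range, a notification like this arrives:
// { "type": "uninitial", "old_val": { "id": ..., "score": ... } }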
Regarding state: I seem to be getting the type: "state" documents just fine. For example:
r.table('t1').changes({includeStates: true, includeInitial: true, includeTypes: true})
{ "state": "ready" ,
"type": "state" }
{ "state": "initializing" ,
"type": "state" }
Are you not getting these documents?

Dealing with Memcached Race Conditions

I have two different sources of data which I need to marry together. Data set A will have a foo_key attribute which can map to data set B's bar_key attribute in a one-to-many relationship.
Data set A:
[{ foo_key: 12345, other: 'blahblah' }, ...]
Data set B:
[{ bar_key: 12345, other: '' }, { bar_key: 12345, other: '' }, { bar_key: 12345, other: '' }, ...]
Data set A is coming from a SQS queue and any relationships with data set B will be available as I poll A.
Data set B is coming from a separate SQS queue that I am trying to dump into a memcached cache to do quick lookups on when an object drops into data set A.
Originally I was planning on setting the memcached key to be the bar_key from the objects in data set B, but then I realized that would make it possible to overwrite the value, since there can be many objects with the same bar_key. Then I thought I could key on bar_key and make the value an array of the SQS messages. But since I have multiple hosts polling the SQS queue, it seems possible that while one host checks whether the key is in memcached, reads it, appends the new message, and sets it again, another host could be performing the same operation, and the first host's append would simply be overwritten.
I've looked around at memcached key locking but I'm not sure I understand it entirely. Would the solution be that when I get the key/value pair from memcached I create a temporary dummy lock on a new key called bar_key_dummy that expires in x seconds, and if I try to fetch a key that has a bar_key_dummy lock active I just send the SQS message back to the queue without deleting to try again in x seconds?
Here's some pseudocode for what I have going on in my head. Does this make any sense?
store = MemCache.new(host)
sqs_messages.poll do |message|
dummy_key = "#{message.bar_key}_dummy"
sqs.dont_delete_message && next unless store.get(dummy_key).nil?
# set dummy_key in memcache with a value of 1 for 3 seconds
store.set(dummy_key, 1, 3)
temp_data = store.get(message.bar_key) || []
temp_data << message
store.set(message.bar_key, temp_data, 300)
# delete dummy key when done in case shorter than x seconds
store.delete(dummy_key)
end
Thanks for any help!
Memcached has a special operation for this: cas (Compare and Swap).
The gets command returns an item along with its unique CAS value.
You can then modify the data and issue the update with the cas command, passing in the original CAS value.
If the CAS value was changed between the two commands, the update operation will fail with an EXISTS error, and you can retry.
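A rough sketch of how that retry loop could look in Ruby, assuming the Dalli client (rather than the MemCache client from your pseudocode) and assuming its cas method returns nil when the key is missing and false on a CAS conflict; the key and message variables are illustrative:
require 'dalli'

store = Dalli::Client.new('localhost:11211')

def append_message(store, key, message, ttl = 300)
  loop do
    # cas reads the current value with its CAS token, yields it to the block,
    # and writes the block's result only if nobody changed the value meanwhile.
    result = store.cas(key, ttl) { |messages| messages << message }
    return true if result              # write succeeded
    if result.nil?                     # key did not exist yet: try to create it
      return true if store.add(key, [message], ttl)
    end
    # result was false (CAS conflict) or add lost a race with another host: retry
  end
end

# message as in the pseudocode above
append_message(store, message.bar_key.to_s, message)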
