Using includeTypes flag on changefeeds - rethinkdb

From: https://rethinkdb.com/docs/changefeeds/javascript/#including-result-types
Could uninitial be further defined? If initial is just an add that happened before I started the feed, then how do I get an uninitial?
How do I get state? With includeInitial, includeStates, and includeTypes set to true, I'll get separate state docs, but never one with a type: "state" field.

There's a better explanation of what "uninitial" results are in the "Including initial values" section of the document that you linked. To quote:
If an initial result for a document has been sent and a change is made to that document that would move it to the unsent part of the result set (for instance, a changefeed monitors the top 100 posters, the first 50 have been sent, and poster 48 has become poster 52), an “uninitial” notification will be sent, with an old_val field but no new_val field.
The reason these exist is due to how RethinkDB changefeeds implement the initial results logic. Initial results are processed more or less from left to right in the key space of the table. There is always a slice of the key space for which initial results are still being sent, and a remaining slice for which the changefeed has started "streaming" current updates in realtime. When you first open a changefeed with includeInitial: true, the whole key range is in the initializing state. Then, as initial results are sent over the changefeed, the boundary between the initializing and streaming parts moves and more of the key space becomes streaming.
uninitial values happen if a document's key moves from a part of the key space that is already streaming, to a part that's still initializing. This can only happen for changefeeds that use secondary indexes, since the primary key of a given document can never change.
Regarding state: I seem to be getting the type: "state" documents just fine. For example:
r.table('t1').changes({includeStates: true, includeInitial: true, includeTypes: true})
{ "state": "ready" ,
"type": "state" }
{ "state": "initializing" ,
"type": "state" }
Are you not getting these documents?
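For completeness, here is a minimal sketch of consuming such a feed and dispatching on the type field, using the Python driver (option names are snake_case there; the import style and connection details are assumptions):
import rethinkdb as r

conn = r.connect('localhost', 28015)
feed = r.table('t1').changes(
    include_initial=True, include_states=True, include_types=True
).run(conn)

for doc in feed:
    if doc['type'] == 'state':
        print('feed state:', doc['state'])  # 'initializing', then 'ready'
    elif doc['type'] in ('initial', 'add', 'change'):
        print('in result set:', doc['new_val'])
    elif doc['type'] in ('uninitial', 'remove'):
        print('left result set:', doc['old_val'])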

Related

ElasticSearch: Result window is too large

My friend stored 65000 documents on the Elastic Search cloud and I would like to retrieve all of them (using Python). However, when I run my current script, I get an error saying:
RequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
My script
es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))
docs = es.search(body={"query": {"match_all": {}}, '_source': ["_id"], 'size': 65000})
What would be the easiest way to retrieve all those documents and not be limited to 10000 docs? Thanks!
The limit has been set so that the result set does not overwhelm your nodes. Results occupy memory in the Elasticsearch node, so the bigger the result set, the bigger the memory footprint and the impact on the nodes.
Depending on what you want to do with the retrieved documents,
try to use the scroll API (as suggested in your error message) if it's a batch job; see the sketch below these links. Be mindful of the lifetime of the scroll context in that case.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-scroll
or use Search After:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-search-after
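As a rough sketch of the scroll route with the Python client's helpers.scan wrapper (the index name is a hypothetical placeholder; cloud_id, username, and password are from the question):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))

# helpers.scan wraps the scroll API and yields every matching hit,
# fetching them from Elasticsearch in batches behind the scenes
ids = [
    hit["_id"]
    for hit in helpers.scan(
        es,
        index="my-index",  # hypothetical index name
        query={"query": {"match_all": {}}},
        _source=False,     # only document IDs are needed here
        size=1000,         # batch size per scroll request
    )
]
print(len(ids))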
You should use the scroll API and fetch the results in several calls. The scroll API returns the results in batches of at most 10000 (available to consult for the amount of time you indicate in the call), and you can then paginate through them using the returned scroll_id.
The error message itself tells you how to solve the issue; look carefully at this part of it:
This limit can be set by changing the [index.max_result_window] index level setting.
Please refer to the update index settings API for how to change that.
So for your setting it would look like this (note that 65000 equals the total number of docs in your index):
PUT /<your-index-name>/_settings
{
  "index": {
    "max_result_window": 65000
  }
}
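If you prefer to stay in Python, the client exposes the same call; a minimal sketch (the index name is again a placeholder):
es.indices.put_settings(
    index="my-index",  # hypothetical index name
    body={"index": {"max_result_window": 65000}},
)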

Elasticsearch Delete by Query Version Conflict

I am using Elasticsearch version 5.6.10. I have a query that deletes records for a given agency, so they can later be updated by a nightly script.
The query is written with elasticsearch-dsl and looks like this:
def remove_employees_from_search(jurisdiction_slug, year):
    s = EmployeeDocument.search()
    s = s.filter('term', year=year)
    s = s.query('nested', path='jurisdiction', query=Q("term", **{'jurisdiction.slug': jurisdiction_slug}))
    response = s.delete()
    return response
The problem is that I get a ConflictError exception when trying to delete the records via that function. I have read that this occurs because the documents changed between the time the delete process started and the time it executed. But I don't know how this can be, because nothing else is modifying the records during the delete process.
I am going to add s = s.params(conflicts='proceed') in order to silence the exception. But this is a band-aid, as I do not understand why the delete is not processing as expected. Any ideas on how to troubleshoot this? A snapshot of the error is below:
ConflictError: TransportError(409,
u'{
  "took": 10,
  "timed_out": false,
  "total": 55,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 55,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": "employees",
      "type": "employee_document",
      "id": "24681043",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[employee_document][24681043]: version conflict, current version [5] is different than the one provided [4]",
        "index_uuid": "G1QPF-wcRUOCLhubdSpqYQ",
        "shard": "0",
        "index": "employees"
      },
      "status": 409
    },
    {
      "index": "employees",
      "type": "employee_document",
      "id": "24681063",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[employee_document][24681063]: version conflict, current version [5] is different than the one provided [4]",
        "index_uuid": "G1QPF-wcRUOCLhubdSpqYQ",
        "shard": "0",
        "index": "employees"
      },
      "status": 409
    }
  ]
}')
You could try making it do a refresh first
client.indices.refresh(index='your-index')
source https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_indices_refresh
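In the Python stack the question uses, the equivalent would be something like this sketch (combining the refresh with the conflicts='proceed' parameter the asker already mentioned; es is assumed to be a connected Elasticsearch client):
# force a refresh so the delete-by-query snapshot sees the latest versions
es.indices.refresh(index='employees')

# and/or tell delete-by-query to skip conflicting documents instead of aborting
s = s.params(conflicts='proceed')  # s is the elasticsearch-dsl Search from above
response = s.delete()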
First, this is a question that was asked 2 years ago, so take my response with a grain of salt due to the time gap.
I am using the JavaScript API, but I would bet that the flags are similar. When you index or delete, there is a refresh flag which lets you force the result to become visible to search.
I am not an Elasticsearch guru, but the engine must perform some systematic maintenance on the indices and shards to move them to a stable state. This is probably done over time, so you would not necessarily get an immediate state update. Furthermore, from personal experience, I have seen cases where a delete does not seem to remove the item from the index: it might be marked as "deleted" and the document given a new version number, but it seems to stick around (probably until general maintenance sweeps run).
Here I am showing the js API for delete, but it is the same for index and some of the other calls.
client.delete({
  id: string,
  index: string,
  type: string,
  wait_for_active_shards: string,
  refresh: 'true' | 'false' | 'wait_for',
  routing: string,
  timeout: string,
  if_seq_no: number,
  if_primary_term: number,
  version: number,
  version_type: 'internal' | 'external' | 'external_gte' | 'force'
})
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_delete
refresh
'true' | 'false' | 'wait_for' - If true then refresh the affected shards to make this operation visible to search, if wait_for then wait for a refresh to make this operation visible to search, if false (the default) then do nothing with refreshes.
For additional reference, here is the page on Elasticsearch refresh info and what might be a fairly relevant blurb for you.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
Use the refresh API to explicitly refresh one or more indices. If the request targets a data stream, it refreshes the stream’s backing indices. A refresh makes all operations performed on an index since the last refresh available for search.
By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds. You can change this default interval using the index.refresh_interval setting.

How to detect if SimpleDB domain contains the requested item?

The AWS SimpleDB documentation for the Ruby SDK provides the following example with regard to using the get_attributes method:
resp = client.get_attributes({
  domain_name: "String", # required
  item_name: "String", # required
  attribute_names: ["String"],
  consistent_read: false,
})
...and then the following example response:
resp.attributes #=> Array
resp.attributes[0].name #=> String
resp.attributes[0].alternate_name_encoding #=> String
resp.attributes[0].value #=> String
resp.attributes[0].alternate_value_encoding #=> String
It also states the following piece of advice:
If the item does not exist on the replica that was accessed for this operation, an empty set is returned. The system does not return an error as it cannot guarantee the item does not exist on other replicas.
I hope that I'm misunderstanding this, but if your response does return an empty set, then how are you supposed to know if it's because no item exists with the supplied item name, or if your request just hit a replica that doesn't contain your item?
I have never used AWS SimpleDB before, but from the little I know about replication in Amazon's DynamoDB, the data is usually eventually consistent: while one of the replicas handles your request to read the attributes, the replication of previously written data may still be in progress across the replicas responsible for storing it. That's why it's possible that the replica handling your read does not (yet) have the data stored, and why SimpleDB cannot respond with an error message.
What you should be able to do in order to be 100% sure is to specify the consistent_read: true parameter, as it should tell you whether the data exists in AWS SimpleDB or not.
According to the documentation of the get_attributes method:
:consistent_read (Boolean) —
Determines whether or not strong consistency should be enforced when data is read from SimpleDB. If true, any data previously written to SimpleDB will be returned. Otherwise, results will be consistent eventually, and the client may not see data that was written immediately before your read.
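Concretely, that is just the question's own example with the flag flipped:
resp = client.get_attributes({
  domain_name: "String", # required
  item_name: "String", # required
  attribute_names: ["String"],
  consistent_read: true, # force a strongly consistent read
})
# With a consistent read, an empty resp.attributes means the item
# really does not exist (at the cost of higher read latency).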

Elasticsearch Realtime GET support

When I index a document in ES and then try to access it before the refresh interval has passed, the search does not return the result. Is there realtime GET support which allows getting a document once it is indexed, regardless of the refresh rate of the index? I tried reducing the refresh_interval to 500ms instead of 1s, but my search query can happen even before 500ms, and it is not a good idea to reduce it further.
After indexing a document, you can GET it immediately without waiting for the refresh interval.
The GET API is real-time
So if you index a new document like this
POST index/type/1
{ "name": "John Doe" }
You can get it immediately without waiting using
GET index/type/1
If you search, however, you'll need to wait for the refresh interval to pass (or call the refresh API) in order to retrieve the new document.
For completeness' sake, it's worth stating that when indexing you also have the option of refreshing the shards immediately, by passing the refresh=true parameter like below. Note, however, that this can have bad performance implications, so it should be used sparingly.
POST index/type/1?refresh=true
{ "name": "John Doe" }
Also worth noting that in ES 5, you'll have the option of telling ES to wait for a refresh before returning from the create call:
POST index/type/1?refresh=wait_for
{ "name": "John Doe" }
In this case, once the POST request returns, you're guaranteed that the new document is available in the next search call.
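For reference, the same flow through the Python client might look like this sketch (assuming a client from the same era, where doc_type is still a parameter):
from elasticsearch import Elasticsearch

es = Elasticsearch()

# GET by ID is realtime: the document can be read back immediately
es.index(index='index', doc_type='type', id=1, body={'name': 'John Doe'})
doc = es.get(index='index', doc_type='type', id=1)

# to make a document immediately searchable, wait for a refresh on write
es.index(index='index', doc_type='type', id=2,
         body={'name': 'Jane Doe'}, refresh='wait_for')
hits = es.search(index='index', body={'query': {'match': {'name': 'Jane'}}})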

Dealing with Memcached Race Conditions

I have two different sources of data which I need to marry together. Data set A will have a foo_key attribute which can map to Data set B's bar_key attribute with a one-to-many relationship.
Data set A:
[{ foo_key: 12345, other: 'blahblah' }, ...]
Data set B:
[{ bar_key: 12345, other: '' }, { bar_key: 12345, other: '' }, { bar_key: 12345, other: '' }, ...]
Data set A is coming from a SQS queue and any relationships with data set B will be available as I poll A.
Data set B is coming from a separate SQS queue that I am trying to dump into a memcached cache to do quick look ups on when an object drops into data set A.
Originally I was planning on setting the memcached key to be bar_key from the objects in data set B, but then I realized that doing so could overwrite values, since many objects can share the same bar_key. Then I thought I could create a key bar_key whose value is an array of the SQS messages. But since I have multiple hosts polling the SQS queue, it's possible that while one host checks whether the key is in memcached, reads it, appends the new message, and sets it back, another host could be performing the same operation, and the first host's append would simply be overwritten.
I've looked around at memcached key locking, but I'm not sure I understand it entirely. Would the solution be that when I get the key/value pair from memcached, I create a temporary dummy lock on a new key called bar_key_dummy that expires in x seconds, and if I try to fetch a key that has an active bar_key_dummy lock, I just send the SQS message back to the queue without deleting it, to try again in x seconds?
Here's some pseudocode for what I have going on in my head. Does this make any sense?
store = MemCache.new(host)

sqs_messages.poll do |message|
  dummy_key = "#{message.bar_key}_dummy"
  sqs.dont_delete_message && next unless store.get(dummy_key).nil?
  # set dummy_key in memcache with a value of 1 for 3 seconds
  store.set(dummy_key, 1, 3)
  temp_data = store.get(message.bar_key) || []
  temp_data << message
  store.set(message.bar_key, temp_data, 300)
  # delete dummy key when done in case it took less than x seconds
  store.delete(dummy_key)
end
Thanks for any help!
Memcached has a special operation for this: cas, Compare-And-Swap.
The gets command returns an item along with a unique CAS token. You then modify the data and issue the update with the cas command, passing in the original token. If the value was changed by another client between the two commands, the update fails with an EXISTS error, and you can re-fetch and retry.
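As a sketch of the append-with-retry loop this implies, using the python-memcached client (cache_cas=True enables CAS token tracking in that library; the server address and TTLs are assumptions):
import memcache

mc = memcache.Client(['127.0.0.1:11211'], cache_cas=True)

def append_message(key, message, retries=10):
    # Append message to the list stored at key without losing concurrent
    # appends from other hosts, using compare-and-swap.
    for _ in range(retries):
        current = mc.gets(key)  # fetch the value and remember its CAS token
        if current is None:
            # add() only succeeds if the key does not exist yet, so two
            # hosts cannot both create it at the same time
            if mc.add(key, [message], time=300):
                return True
            continue  # another host created the key first; retry with gets()
        if mc.cas(key, current + [message], time=300):
            return True  # the swap succeeded; nobody raced us in between
        # cas() failed because the key changed since our gets(); retry
    return False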
