Partially updating elasticsearch list field value using python - elasticsearch

The purpose of this question is to ask the community how to go about partially updating a field without removing any other contents of that field.
There are many examples on Stack Overflow of partially updating Elasticsearch _source fields using Python, curl, etc. The elasticsearch Python library comes equipped with an elasticsearch.helpers module containing the functions parallel_bulk, streaming_bulk, and bulk, which allow developers to easily update documents.
If users have data in a pandas dataframe, one can easily iterate over the rows to build a generator that updates/creates documents in Elasticsearch. Elasticsearch documents are immutable, so when an update occurs Elasticsearch takes the information being passed in, creates a new document, and increments the document's version while applying the changes. If a document has a list field and the update request carries a single value for that field, the entire list is replaced with that new value (many SO Q&As cover this).
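For instance, a minimal sketch of that generator-plus-bulk pattern (the dataframe columns, index name, and document id here are only illustrative):
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()  # connection details assumed

# Hypothetical dataframe: one row per document that needs a partial update.
df = pd.DataFrame({"doc_id": ["19010239"],
                   "address": [["1001 country drive", "35 park drive"]]})

def update_actions(frame):
    """Yield one partial-update action per dataframe row."""
    for row in frame.itertuples(index=False):
        yield {
            "_op_type": "update",
            "_index": "docobot",
            "_id": row.doc_id,
            "doc": {"address": row.address},  # replaces the whole address field
        }

bulk(es, update_actions(df))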
I do not want to replace the entire list with the new value; instead, I want to update a single value in the list.
For example, in my _source I have a field whose value is ['101 country drive', '35 park drive', '277 thunderroad belway']. This field has three values, but let's say we realize the document is incorrect and we need to update '101 country drive' to '1001 country drive'.
I do not want to delete the other values in the list; I only want to update that one element with a new value.
Do I need to write a painless script to perform this action, or is there another method to perform this action?
Example:
I want to update the document from:
{'took': 176,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
          'max_score': None,
          'hits': [{'_index': 'docobot',
                    '_type': '_doc',
                    '_id': '19010239',
                    '_source': {'name': 'josephine drwaler',
                                'address': ['101 country drive',
                                            '35 park drive',
                                            '277 thunderroad belway']}}]}}
to:
{'took': 176,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
          'max_score': None,
          'hits': [{'_index': 'docobot',
                    '_type': '_doc',
                    '_id': '19010239',
                    '_source': {'name': 'josephine drwaler',
                                'address': ['1001 country drive',
                                            '35 park drive',
                                            '277 thunderroad belway']}}]}}
Notice that only the first element of address is updated; the position of the element should not be a factor in updating the value of address in _source.
What is the most efficient and pythonic way to go about partially updating documents in elasticsearch while keeping the integrity of the remaining values in that field?

The _source is what was passed to Elasticsearch in the API request when the document was indexed; it's not a "field" in the same sense that address is.
That said, you need to send the entire address field with the contents you want, not just the value you want corrected. Elasticsearch assumes that what you pass in is the entirety of the field's value and will overwrite the field with what it gets.
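In practice that means a partial update that sends the whole corrected list; a minimal sketch with the low-level Python client (index and id taken from the example above, connection details assumed):
from elasticsearch import Elasticsearch

es = Elasticsearch()  # connection details assumed

# Send the full corrected list; Elasticsearch replaces the address field with it.
es.update(
    index="docobot",
    id="19010239",
    body={"doc": {"address": ["1001 country drive",
                              "35 park drive",
                              "277 thunderroad belway"]}},
)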

You need to create a painless script to do the update. When doing so, keep in mind that you can access any field in _source like this:
ctx._source.address = ['1001 country drive', '35 park drive', '277 thunderroad belway']
But this doesn't solve the problem...
The field is a list, so we need to iterate through it. The painless script below loops over each item, compares it to the search parameter, and adds the replacement value when it matches (otherwise it keeps the original item):
def upd_address = [];
for (def item : ctx._source.address) {
    if (item == params.search_id) {
        upd_address.add(params.answer);
    } else {
        upd_address.add(item);
    }
}
ctx._source.address = upd_address;
You can use the above with elasticsearch_dsl like this:
ubq = UpdateByQuery(using=[your es connection], doc_type='doc', index=['your index'])
ubq = ubq.script(source=[above query], params={'search_id': addrss, 'answer': upd_addrss})
res = ubq.execute()
print(res, type(res))
The update-by-query request loops through each item in the list, checks whether the item matches the search value, and keeps the replacement if it does; otherwise it keeps the original item.
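Roughly the same thing can be done with the low-level elasticsearch Python client's update_by_query call; a sketch where the ids query and parameter values mirror the example document (host settings are assumed):
from elasticsearch import Elasticsearch

es = Elasticsearch()  # connection details assumed

script = """
def upd_address = [];
for (def item : ctx._source.address) {
  if (item == params.search_id) { upd_address.add(params.answer); }
  else { upd_address.add(item); }
}
ctx._source.address = upd_address;
"""

body = {
    "query": {"ids": {"values": ["19010239"]}},  # limit the update to the document(s) to fix
    "script": {
        "lang": "painless",
        "source": script,
        "params": {"search_id": "101 country drive",
                   "answer": "1001 country drive"},
    },
}

print(es.update_by_query(index="docobot", body=body))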

Related

Can I apply facets on the entire solr result instead of the filtered solr result?

Let's say I have a collection of solr documents:
{'id': 1, 'title': 'shirt', 'brand': 'adidas'},
{'id': 2, 'title': 'shirt2', 'brand': 'adidas'},
{'id': 3, 'title': 'shirt', 'brand': 'nike'},
{'id': 4, 'title': 'shirt', 'brand': 'puma'}
Now if I try to apply faceting on this with facet.field=brand, I will get the following
{'adidas': 2, 'nike': 1, 'puma': 1}
In addition, if I apply a filter brand:adidas, I get the following for faceting:
{'adidas': 2, 'nike': 0, 'puma': 0}
Is it possible to still get the count of documents returned by solr before the filtering happened? I would like to still see {'adidas': 2, 'nike': 1, 'puma': 1} regardless of what filters I apply
You can use tagging and excluding filters as shown in the reference guide.
q=yourquery&fq={!tag=brand}brand:adidas&facet=true&facet.field={!ex=brand}brand
The {!ex} instruction will then exclude any filters matching that tag from the active filters when generating the facet results, allowing you to keep the complete set of brands matching the query available as facets.
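As a rough illustration, here is the same request built from Python with the requests library (the Solr URL and core name are placeholders):
import requests

SOLR_URL = "http://localhost:8983/solr/products/select"  # placeholder core

params = {
    "q": "*:*",
    "fq": "{!tag=brand}brand:adidas",    # tag the brand filter
    "facet": "true",
    "facet.field": "{!ex=brand}brand",   # exclude that tag when computing brand facets
    "wt": "json",
}

resp = requests.get(SOLR_URL, params=params)
# Solr returns facet counts as a flat [value, count, value, count, ...] list.
print(resp.json()["facet_counts"]["facet_fields"]["brand"])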

Storing data in ElasticSearch

I'm looking at two ways of storing data in Elasticsearch.
[
{
'first': 'dave',
'last': 'jones',
'age': 43,
'height': '6ft'
},
{
'first': 'james',
'last': 'smith',
'age': 43,
'height': '6ft'
},
{
'first': 'bill',
'last': 'baker',
'age': 43,
'height': '6ft'
}
]
or
[
{
'first': ['dave','james','bill'],
'last': ['jones','smith','baker'],
'age': 43,
'height': '6ft'
}
]
(names are +30 character hashes. Nesting would not exceed the above)
My goals are:
Query speed
Disk space
We are talking about the difference between 300 GB and a terabyte.
My question is can Elastic Search search nested data just as quickly as flattened out data?
Elasticsearch will flatten your arrays of objects by default, exactly like you demonstrated in your example:
Arrays of inner object fields do not work the way you may expect. Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values.
So from the point of view of querying, nothing will change. (However, if you need to query individual items of the inner arrays as a unit, for example to match dave jones as one person, you may want to explicitly index them as the nested data type, which has poorer performance.)
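If you do need that per-person matching, a rough sketch of an explicit nested mapping via the Python client could look like this on a 7.x-style cluster (the index name, people field, and keyword types are assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch()  # connection details assumed

# Hypothetical index where each person is a nested object, so that
# `first` and `last` stay paired within a single inner document.
es.indices.create(
    index="people-nested",
    body={
        "mappings": {
            "properties": {
                "people": {
                    "type": "nested",
                    "properties": {
                        "first": {"type": "keyword"},
                        "last": {"type": "keyword"},
                    },
                },
                "age": {"type": "integer"},
                "height": {"type": "keyword"},
            }
        }
    },
)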
Speaking about size on disk, by default there's compression enabled. Here you should keep in mind that Elasticsearch will store your original documents in two ways simultaneously: the original JSONs as source, and implicitly in the inverted indexes (which are actually used for the super fast searching).
If you want to read more about tuning for disk usage, here's a good doc page. For instance, you could enable even more aggressive compression for the source, or not store source on disk at all (although not advised).
Hope that helps!

Rethinkdb: Chain getAll & between

I get a RqlRuntimeError: Expected type TABLE_SLICE but found SELECTION error when chaining the getAll and between methods.
r.db('mydb').table('logs')
.getAll('1', {index: 'userId'})
.between(r.time(2015, 5, 1, 'Z'), r.time(2015, 5, 4, 'Z'), {index: 'createdAt'})
Is there a way to use indexes when querying for data that belongs to userId=1 and where createdAt is between two dates? As I understand it, filter does not use indexes.
You can't use two indexes like that, but you can create a compound index:
r.table('logs').indexCreate('compound', function(row) {
return [row('userId'), row('createdAt')];
})
r.table('logs').between([1, firstTime], [1, secondTime], {index: 'compound'})
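With the Python driver the equivalent would be roughly the following (connection details and the legacy import style are assumptions):
import rethinkdb as r  # legacy driver import; newer drivers use `from rethinkdb import RethinkDB`

conn = r.connect(host="localhost", port=28015, db="mydb")  # connection details assumed

# Compound secondary index on [userId, createdAt].
r.table("logs").index_create(
    "compound", lambda row: [row["userId"], row["createdAt"]]
).run(conn)
r.table("logs").index_wait("compound").run(conn)

# userId == '1' and createdAt between the two dates, served by the compound index.
cursor = r.table("logs").between(
    ["1", r.time(2015, 5, 1, "Z")],
    ["1", r.time(2015, 5, 4, "Z")],
    index="compound",
).run(conn)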

quick check whether an elasticsearch index will return search hits

We're running queries against an Elasticsearch server across a few indices like so:
curl -XGET 'http://es-server:9200/logstash-2015.01.28,logstash-2015.01.27/_search?pretty' -d @/a_query_file_in_json_format
Works great most of the time, and we can parse the results we need.
However, when the indices are in a bad state -- maybe there's been a lag in indexing, or some shards are acting up -- the query above will return no results, and it's impossible to know whether that's because there are no matching records or because the index is unstable in some way.
I've been looking at the Elasticsearch indices recovery API but am a bit overwhelmed. Are there queries I can run that will give a yes/no answer to 'can a search against these indices be relied upon at the moment?'
You have multiple ways to get this information.
1) You can use the cluster health API at the indices level like this :
GET _cluster/health/my_index?level=indices
This will output the status of the cluster, with information about the status and shards of the index my_index:
{
  "cluster_name": "elasticsearch_thomas",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 5,
  "active_shards": 5,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5,
  "indices": {
    "my_index": {
      "status": "yellow",
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "active_primary_shards": 5,
      "active_shards": 5,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 5
    }
  }
}
2) If you want a less verbose answer, or to filter on some specific information only, you can rely on the _cat API, which allows you to customize the output. However, the output is no longer JSON.
For example, if you want only the name and health status of the indices, the following request will do the trick:
GET _cat/indices/my_index?h=index,health&v
which outputs:
index health
my_index yellow
Note that the column headers are shown only because of the verbose flag (v GET parameter in the previous request).
To see the complete list of available columns, use the help parameter:
GET _cat/indices?help
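For a scripted yes/no check, a small sketch with the Python client could look like this (the index name and the choice of which statuses count as "reliable" are assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch()  # connection details assumed

def index_searchable(index_name):
    """Return True if the index status suggests searches can be trusted."""
    health = es.cluster.health(index=index_name, level="indices")
    status = health["indices"][index_name]["status"]
    # green means all shards allocated; yellow means only replicas are missing.
    return status in ("green", "yellow")

print(index_searchable("logstash-2015.01.28"))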

Paginating Distinct Values in Mongo

I would like to retrieve distinct values for a field from a collection in my database. The distinct command is the obvious solution. The problem is that some fields have a large number of possible values and are not simple primitive values (i.e. they are complex sub-documents instead of just strings). This means the results are large, which overloads the client I am delivering the results to.
The obvious solution is to paginate the resulting distinct values, but I cannot find a good way to do this optimally. Since distinct doesn't have pagination options (limit, skip, etc.), I am turning to the aggregation framework. My basic pipeline is:
[
{$match: {... the documents I am interested in ...}},
{$group: {_id: '$myfield'}},
{$sort: {_id: 1}},
{$limit: 10},
]
This gives me the first 10 unique values for myfield. To get the next page a $skip operator is added to the pipeline. So:
[
{$match: {... the documents I am interested in ...}},
{$group: {_id: '$myfield'}},
{$sort: {_id: 1}},
{$skip: 10},
{$limit: 10},
]
But sometimes the field I am collecting unique values from is an array. This means I have to unwind it before grouping. So:
[
{$match: {... the documents I am interested in ...}},
{$unwind: '$myfield'},
{$group: {_id: '$myfield'}},
{$sort: {_id: 1}},
{$skip: 10},
{$limit: 10},
]
Other times the field I am getting unique values for may not be an array, but its parent node might be. So:
[
{$match: {... the documents I am interested in ...}},
{$unwind: '$list'},
{$group: {_id: '$list.myfield'}},
{$sort: {_id: 1}},
{$skip: 10},
{$limit: 10},
]
Finally, sometimes I also need to filter on data inside the field I am getting distinct values from. This means I sometimes need another $match operator after the $unwind:
[
{$match: {... the documents I am interested in ...}},
{$unwind: '$list'},
{$match: {... filter within list.myfield ...}},
{$group: {_id: '$list.myfield'}},
{$sort: {_id: 1}},
{$skip: 10},
{$limit: 10},
]
The above are all simplifications of my actual data. Here is a real pipeline from the application:
[
{"$match": {
"source_name":"fmls",
"data.surroundings.school.nearby.district.id": 1300120,
"compiled_at":{"$exists":true},
"data.status.current":{"$ne":"Sold"}
}},
{"$unwind":"$data.surroundings.school.nearby"},
{"$match": {
"source_name":"fmls",
"data.surroundings.school.nearby.district.id":1300120,
"compiled_at":{"$exists":true},
"data.status.current":{"$ne":"Sold"}
}},
{"$group":{"_id":"$data.surroundings.school.nearby"}},
{"$sort":{"_id":1}},
{"$skip":10},
{"$limit":10}
]
I hand the same $match document to both the initial filter and the filter after the unwind, because the $match document comes from a 3rd party and is somewhat opaque, so I don't really know which parts of the query filter outside my unwind versus inside the data I am getting distinct values for.
Is there an obviously different way to go about this? In general my strategy is working, but some queries take 10-15 seconds to return results. There are about 200,000 documents in the collection, although only about 60,000 remain after the first $match in the pipeline is applied (which can use indexes).
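For reference, a rough PyMongo sketch of the paginated-distinct strategy described above (the database, collection, field names, and helper function are illustrative):
from pymongo import MongoClient

coll = MongoClient()["mydb"]["listings"]  # placeholder database/collection

def distinct_page(match, group_path, page, page_size=10, unwind_path=None, inner_match=None):
    """Return one page of distinct values for group_path."""
    pipeline = [{"$match": match}]
    if unwind_path:
        pipeline.append({"$unwind": unwind_path})
    if inner_match:
        pipeline.append({"$match": inner_match})
    pipeline += [
        {"$group": {"_id": group_path}},
        {"$sort": {"_id": 1}},
        {"$skip": page * page_size},
        {"$limit": page_size},
    ]
    return [doc["_id"] for doc in coll.aggregate(pipeline, allowDiskUse=True)]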
