Elasticsearch remove field performance

I'm calling the Update API frequently to remove fields from documents. What is the performance difference between
"script" : "ctx._source[\"name_of_field\"] = null"
and
"script" : "ctx._source.remove(\"name_of_field\")"
To my understanding, the second one deletes the document and reindexes the entire document. What is the behaviour of the first one?
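For reference, here is what the two variants look like as full Update API calls (the index, document id, and field name are illustrative):

```json
POST /my-index/_update/1
{
  "script": "ctx._source[\"name_of_field\"] = null"
}

POST /my-index/_update/1
{
  "script": "ctx._source.remove(\"name_of_field\")"
}
```

Note that in both cases Elasticsearch retrieves the current _source, applies the script, and reindexes the whole document, so the indexing cost is essentially the same; the practical difference is that the first leaves the field present with a null value, while the second removes the key from the document entirely.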

Update document field in index where source is not stored (_source enabled=false)

My use case involves large, complex source documents and a strict, fairly complex mapping schema.
Due to the number (tens of millions) and size (10 KB to 2 MB) of the source documents, I do not store _source in the index.
The original documents come in different formats (HL7v2 ER7, C-CDA XML, EDI XML, etc.). Those originals are transformed into JSON representations, with the original source stored in S3. The JSON documents are then indexed (without source) in Elasticsearch. The JSON is also stored in S3.
I would like to do some trivial mutations to the information stored in Elasticsearch, primarily for tagging use cases. But as far as I can tell, document updates in Elasticsearch require either the entire original source document to be presented, or _source to be stored, in order to effect the update.
Example: I would like to TAG a subset of documents stored in the Elastic Index. The "tag" field, an array type keyword, could be updated as follows, if source were stored:
POST /my-index/_update/1
{
  "doc": {
    "tag": ["RED", "BLUE"]
  }
}
Again, that update would work properly, provided source were stored. Without _source we get the expected (but unfortunate for me) error below:
...
"error" : {
  "root_cause" : [
    {
      "type" : "document_source_missing_exception",
      "reason" : "[_doc][1]: document source missing",
      "index_uuid" : "SIsgMIeLT_694ATEmHz05g",
      "shard" : "0",
      "index" : "my_index"
    }
  ],
...
I really would prefer not to store _source in the index, as Elastic is handling everything else in my use case, from an ingest/search/performance point of view, just wonderfully.
In a nutshell, I want Elastic to be an index, and not, effectively, a document store+index.
Is there some API that would allow direct updates to the data in the index, per document? In this case, a particular field in the document (e.g., a tag array)?
Cheers for any thoughts.

Combine fields of different documents in the same index

I have two types of documents in my index:
doc1
{
  "category": "15",
  "url": "http://stackoverflow.com/questions/ask"
}
doc2
{
  "url": "http://stackoverflow.com/questions/ask",
  "requestsize": "231",
  "logdate": "22/12/2012",
  "username": "mehmetyeneryilmaz"
}
Now I need a query that filters on the same url field and returns the fields from both documents:
result:
{
  "category": "15",
  "url": "http://stackoverflow.com/questions/ask",
  "requestsize": "231",
  "logdate": "22/12/2012",
  "username": "mehmetyeneryilmaz"
}
The results given by Elasticsearch are always per document: if multiple documents satisfy your query/filter, they will always appear as separate documents in the result and are never merged into a single document. Hence merging them on the client side is one option. To avoid retrieving the complete documents and get only the relevant fields, you can use "fields" in your query.
If this is not what you need and you still want to narrow down the result from the query itself, you can use a top hits aggregation. It will give you the complete list of documents under a single bucket. But it will also include the _source field, which contains the complete document itself.
Give this page a read:
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/search-aggregations-metrics-top-hits-aggregation.html
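As a sketch of the top hits approach (index and aggregation names here are illustrative): group documents by url with a terms aggregation and return the matching documents per bucket with top_hits:

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "by_url": {
      "terms": { "field": "url" },
      "aggs": {
        "docs_for_url": {
          "top_hits": { "size": 10 }
        }
      }
    }
  }
}
```

Each bucket then contains every document sharing a url, but merging their fields into a single object still has to happen on the client side.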

When to use source filtering and when to use fields? Elasticsearch

I read about two ways to filter the fields returned by Elasticsearch: "fields" and source filtering. When should I use which?
If you are storing the complete document (using "_source" : {"enabled" : true}) then source filtering can be used.
If you store individual fields (using "store" : true) then use fields.
However if the individual field isn't found then fields will get the data from the _source anyway.
In addition to the above:
"fields" is usually used when _source is too large and you are only interested in a few fields. For example, you might have one document per news article, each containing a title, content and URL. You would like to search on title or content but return only the URLs. In doing so, you get just what you are looking for, and some network latency is saved in transporting the response.
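A sketch of both styles, assuming a hypothetical news index with title, content and url fields: the first request uses "fields" (which expects url to be stored, or it will fall back to _source), the second uses source filtering:

```json
POST /news/_search
{
  "query": { "match": { "content": "elasticsearch" } },
  "fields": ["url"]
}

POST /news/_search
{
  "query": { "match": { "content": "elasticsearch" } },
  "_source": ["url"]
}
```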

Elasticsearch: Is it possible to index fields not present in source?

Is it possible to make Elasticsearch index fields that are not present in the source document? An example of what I want to do is to index a Geo Point, but not store the value and to leave it out of the _source field. Then I could do searches and aggregations based on location, geohash etc., but not return the position in the result documents themselves, e.g., for privacy reasons.
The possibility does not seem too far-fetched, since mappings can cause fields in the source to be indexed in several different ways; for instance, the Geo Point type can index pos.lon, pos.lat and pos.geohash even though these are not in the original source document.
I have looked at source filtering, but that seems to only apply to searches and not indexing. I did not find a way to use it in aliases.
The only way I've found to accomplish something like this would be to not store _source, but do store all other fields, except the single one I want to hide. That seems overly clumsy though.
I think you can do this with mappings:
In my index creation code, I have the following:
"mappings" : {
  "item" : {
    "_source" : { "excludes" : ["uploader"] },
    "properties" : { ... }
  }
},
"settings" : { ... }
('item' is the document type of my index. 'uploader' in this case is an email address - something we want to search by, but don't want to leak to the user.)
Then I just include 'uploader' as usual when indexing source documents. I can search by it, but it's not returned in any results.
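Putting it together, a full index-creation request might look like the following sketch; the index name and the uploader field definition are assumptions added for illustration:

```json
PUT /my-index
{
  "mappings": {
    "item": {
      "_source": { "excludes": ["uploader"] },
      "properties": {
        "uploader": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
```

Note that the excluded field is still indexed and searchable; it is only omitted from the stored _source.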
My related question: How to create elasticsearch index alias that excludes specific fields - not quite the same :)

How to handle multiple updates / deletes with Elasticsearch?

I need to update or delete several documents.
When I update I do this:
I first search for the documents, setting a large limit for the returned results (say, size: 10000).
For each returned document, I modify certain values.
I resend the whole modified list to Elasticsearch (a bulk index).
This repeats until step 1 no longer returns results.
When I delete I do this:
I first search for the documents, setting a large limit for the returned results (say, size: 10000).
I delete every found document by sending its _id to Elasticsearch (10,000 separate requests).
This repeats until step 1 no longer returns results.
Is this the right way to make an update?
When I delete, is there a way I can send several ids to delete multiple documents at once?
For your massive index/update operation, if you aren't using it already, take a look at the bulk API documentation; it is tailored for this kind of job.
If you want to retrieve lots of documents in small batches, you should use scan/scroll search instead of from/size. Related information can be found here.
To sum up:
the scroll API keeps the search context alive so you can iterate over the results efficiently
the scan search type disables sorting, which is costly
Give it a try; depending on the data volume, it could improve the performance of your batch operations.
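A sketch of the scan/scroll pattern from the era of the linked documentation (the scan search type was later deprecated in favor of a plain scroll sorted by _doc); the index name and sizes are illustrative:

```json
POST /my-index/_search?search_type=scan&scroll=1m
{
  "query": { "match_all": {} },
  "size": 100
}
```

The response contains a _scroll_id rather than hits; POST it to /_search/scroll?scroll=1m to fetch each batch (size × number of shards documents for a scan), repeating until a batch comes back empty.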
For the delete operation, you can use the same _bulk API to send multiple delete operations at once.
The format of each line is the following:
{ "delete" : { "_index" : "indexName", "_type" : "typeName", "_id" : "1" } }
{ "delete" : { "_index" : "indexName", "_type" : "typeName", "_id" : "2" } }
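As a sketch of the full request (index, type and ids are the placeholders from the lines above): each action is a single line of JSON, and the body must end with a newline:

```json
POST /_bulk
{ "delete" : { "_index" : "indexName", "_type" : "typeName", "_id" : "1" } }
{ "delete" : { "_index" : "indexName", "_type" : "typeName", "_id" : "2" } }
```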
For deletion and update, if you want to delete or update by id you can use the bulk api:
Bulk API
The bulk API makes it possible to perform many index/delete operations
in a single API call. This can greatly increase the indexing speed.
The possible actions are index, create, delete and update. index and
create expect a source on the next line, and have the same semantics
as the op_type parameter to the standard index API (i.e. create will
fail if a document with the same index and type exists already,
whereas index will add or replace a document as necessary). delete
does not expect a source on the following line, and has the same
semantics as the standard delete API. update expects that the partial
doc, upsert and script and its options are specified on the next line.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html
You can also delete by query instead:
Delete By Query API
The delete by query API allows to delete documents from one or more
indices and one or more types based on a query. The query can either
be provided using a simple query string as a parameter, or using the
Query DSL defined within the request body.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
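A sketch against the 1.x API that the link above documents (in modern versions this became POST /index/_delete_by_query with the query in the body); the index, field, and value are illustrative:

```json
DELETE /my-index/_query
{
  "query": { "term": { "username": "mehmetyeneryilmaz" } }
}
```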
