Elasticsearch: Is it possible to index fields not present in source?

Is it possible to make Elasticsearch index fields that are not present in the source document? An example of what I want to do is to index a Geo Point, but not store the value and to leave it out of the _source field. Then I could do searches and aggregations based on location, geohash etc., but not return the position in the result documents themselves, e.g., for privacy reasons.
The possibility does not seem too far-fetched, since mappings can already cause fields in the source to be indexed in several different ways; for instance, the Geo Point type can index pos.lon, pos.lat and pos.geohash even though these are not in the original source document.
I have looked at source filtering, but that seems to only apply to searches and not indexing. I did not find a way to use it in aliases.
The only way I've found to accomplish something like this would be to not store _source, but do store all other fields, except the single one I want to hide. That seems overly clumsy though.

I think you can do this with mappings:
In my index creation code, I have the following:
"mappings" : {
"item" : {
"_source" : {"excludes" : ["uploader"]},
"properties" : { ... }
}
},
"settings" : { ... }
('item' is the document type of my index. 'uploader' in this case is an email address - something we want to search by, but don't want to leak to the user.)
Then I just include 'uploader' as usual when indexing source documents. I can search by it, but it's not returned in any results.
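Adapted back to the original geo-point question, a minimal sketch of the same technique (the places index, place type, pos field and the coordinates are made up for illustration). Create the index with the position excluded from _source:
PUT /places
{
  "mappings" : {
    "place" : {
      "_source" : { "excludes" : ["pos"] },
      "properties" : {
        "name" : { "type" : "string" },
        "pos" : { "type" : "geo_point" }
      }
    }
  }
}
A geo search then still works, but the hits come back without the position:
POST /places/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "geo_distance" : {
          "distance" : "10km",
          "pos" : { "lat" : 59.33, "lon" : 18.07 }
        }
      }
    }
  }
}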
My related question: How to create elasticsearch index alias that excludes specific fields - not quite the same :)

Update document field in index where source is not stored (_source enabled=false)

My use case has large, complex source documents and uses strict mapping with a fairly complex mapping schema.
Due to the number [tens of millions] and size [10KB to 2MB] of the source documents, I do not store _source in the index (see the sketch just below).
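For reference, disabling _source is done in the mapping; a minimal sketch of what I mean (index and field names simplified):
PUT /my-index
{
  "mappings" : {
    "_source" : { "enabled" : false },
    "properties" : {
      "tag" : { "type" : "keyword" }
    }
  }
}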
The original documents come in different formats (HL7 v2 ER7, C-CDA XML, EDI XML, etc.). Those original docs are transformed into JSON representations, with the original source stored in S3. The JSON documents are then indexed (without source) in Elastic. The JSON is also stored in S3.
I would like to do some trivial mutates to the information stored in Elastic, primarily for tagging use cases. But it seems that, AFAIK, document updates in Elastic require either presenting the entire original document again or having _source stored in order to perform the mutate/update.
Example: I would like to TAG a subset of documents stored in the Elastic index. The "tag" field, a keyword array, could be updated as follows, if source were stored:
POST /my-index/_update/1
{
  "doc": {
    "tag": ["RED", "BLUE"]
  }
}
Again, that update would work properly, provided source were stored. Without _source we get the expected (but unfortunate for me) error below:
...
"error" : {
"root_cause" : [
{
"type" : "document_source_missing_exception",
"reason" : "[_doc][1]: document source missing",
"index_uuid" : "SIsgMIeLT_694ATEmHz05g",
"shard" : "0",
"index" : "my_index"
}
],
...
I really would prefer not to store _source in the index, as Elastic is handling everything else in my use case, from an ingest/search/performance point of view, just wonderfully.
In a nutshell, I want Elastic to be an index, and not, effectively, a document store+index.
Is there some API that would allow direct updates to the data in the index, by document? In this case, a particular field in the document (e.g., a tag array)?
Cheers for any thoughts.

Combine fields of different documents in the same index

I have two types of documents in my index:
doc1
{
  "category": "15",
  "url": "http://stackoverflow.com/questions/ask"
}
doc2
{
  "url": "http://stackoverflow.com/questions/ask",
  "requestsize": "231",
  "logdate": "22/12/2012",
  "username": "mehmetyeneryilmaz"
}
Now I need a query that filters on the same url value and returns the fields of both documents:
result:
{
  "category": "15",
  "url": "http://stackoverflow.com/questions/ask",
  "requestsize": "231",
  "logdate": "22/12/2012",
  "username": "mehmetyeneryilmaz"
}
The results given by Elasticsearch are always per document: if there are multiple documents satisfying your query/filter, they will always appear as different documents in the result and are never merged into a single document. Hence merging them on the client side is one option you can use. To avoid fetching the complete documents and get just the relevant fields, you can use "fields" in your query, as sketched below.
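A minimal sketch, assuming url is mapped as not_analyzed so that a term filter matches the whole URL (the index name is made up):
POST /myindex/_search
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "url": "http://stackoverflow.com/questions/ask" }
      }
    }
  },
  "fields": ["category", "url", "requestsize", "logdate", "username"]
}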
If this is not what you need and you still want to narrow down the result from the query itself, you can use a top_hits aggregation (sketched below after the link). It will give you the complete list of documents under a single bucket, although each hit would also contain the _source field with the complete document itself.
Try giving this page a read:
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/search-aggregations-metrics-top-hits-aggregation.html
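A minimal sketch of that approach, again assuming url is not_analyzed (names made up):
POST /myindex/_search
{
  "size": 0,
  "aggs": {
    "by_url": {
      "terms": { "field": "url" },
      "aggs": {
        "docs": {
          "top_hits": { "size": 10 }
        }
      }
    }
  }
}
Each by_url bucket then holds all documents sharing that URL, which you can merge client-side.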

Cost of adding field mapping in elasticsearch type

I have a use case where I have a set of predefined fields and also need to support adding dynamic fields to Elasticsearch, with some basic searching on them. I am able to achieve this using dynamic template mapping. However, the frequency of adding such dynamic fields is quite high.
Consider this ES document for the Event type:
{
  "name": "Youth Conference",
  "venue": "Ahmedabad",
  "date": "10/01/2015",
  "organizer": "Invincible",
  "extensions": {
    "about": {
      "vision": "Visualizes the image of an ideal Country.",
      "mission": "Encapsulates the gravity of the top reformative solutions for betterment of Country."
    }
    // Anything can go here..
  }
}
In the example above, each event document may contain arbitrary unknown/new fields. Hence, for every such new dynamic field introduced, ES will update the mapping of the type. My concern is: what is the cost of adding a new field mapping to an existing type?
I am planning to separate out all dynamic mappings (inside extensions) from the Event type by introducing another type, say EventExtensions, and using a parent/child relationship to map it to the Event type. I believe this may limit the cost (if any) of frequently adding dynamic fields to the type. However, to my knowledge, using a parent/child relationship needs more memory.
The first thing to remember here is that fields are per index, not per type.
So wherever you add new fields, they end up in the same index, be it in another type or in a parent or child.
So decoupling the new fields into another type in the same index is not going to change anything.
Second, field addition is not that expensive a thing. I know people who use thousands of fields and are fine with it. That being said, there should be a cap on the number of fields so that it won't grow to crazy numbers.
Here we have multiple approaches to solve the problem.
1) Let's assume that the new field data need not be directly searchable. In this case, you can serialize the entire JSON to a string and add it to a field. Also make sure this field is not indexed. This way you can search based on the other fields, but on retrieval of the document you still get back the information that was serialized. A sketch of such a mapping follows.
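A minimal sketch (the index/type names and the extensions_raw field are made up; the serialized extensions JSON would be stored there as an opaque, non-indexed string):
PUT /events
{
  "mappings": {
    "event": {
      "properties": {
        "name": { "type": "string" },
        "extensions_raw": { "type": "string", "index": "no" }
      }
    }
  }
}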
2) Let's say the new fields look like this:
{
  "newInfo1": "log Of Info",
  "newInfo2": "A lot more info"
}
Instead of this, you can use
{
  "newInfo": [
    {
      "fieldName": "newInfo1",
      "fieldValue": "log Of Info"
    },
    {
      "fieldName": "newInfo2",
      "fieldValue": "A lot more info"
    }
  ]
}
This way, the number of fields won't increase. But then, to make field-level searches, like "give me all documents with fieldName newInfo2 and having the word more in it", you will need to make newInfo a nested field, as in the sketch below.
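A sketch of that nested mapping and of the search described above (index/type names are made up; fieldName is not_analyzed so exact names can be matched with a term query):
PUT /events
{
  "mappings": {
    "event": {
      "properties": {
        "newInfo": {
          "type": "nested",
          "properties": {
            "fieldName": { "type": "string", "index": "not_analyzed" },
            "fieldValue": { "type": "string" }
          }
        }
      }
    }
  }
}

POST /events/_search
{
  "query": {
    "nested": {
      "path": "newInfo",
      "query": {
        "bool": {
          "must": [
            { "term": { "newInfo.fieldName": "newInfo2" } },
            { "match": { "newInfo.fieldValue": "more" } }
          ]
        }
      }
    }
  }
}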
Hope this helps.

ElasticSearch: mappings for fields that are sorted often

Suppose I have a field "epoch_date" that will be sorted on often when I do Elasticsearch queries. How should I map this field? Right now, I just have store: yes. Should I index it even though this field will not count towards relevancy scoring? What should I add to this field if I intend to sort on it often, so that it will be more efficient?
{
  "tweet": {
    "properties": {
      "epoch_date": {
        "type": "integer",
        "store": "yes"
      }
    }
  }
}
There's nothing you need to change to sort on the field given your mapping. You can only sort on a field if it's indexed, and the default is "index": "yes" for numerics and dates. You cannot set a numeric type to analyzed, since there's no text to analyze. Also, it's better to use the date type for a date instead of an integer, as sketched below.
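A minimal sketch of that suggestion; by default a numeric value sent to a date field is interpreted as milliseconds since the epoch:
{
  "tweet": {
    "properties": {
      "epoch_date": {
        "type": "date",
        "store": "yes"
      }
    }
  }
}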
Sorting can be memory expensive if the field you are sorting on has a lot of unique terms. Just make sure you have enough memory for it. Also, keep in mind that by sorting on a specific field you throw away relevance ranking, which is a big part of what a search engine is all about.
Whether you want to store the field too doesn't have anything to do with sorting; it only affects how you retrieve the value in order to return it together with your search results. If you rely on the _source field (the default behaviour) there's no reason to store specific fields. If you ask for specific fields using the fields option when querying, then stored fields are retrieved directly from Lucene rather than extracted from the _source field by parsing the JSON.
An index is used for efficient sorting. So YES, you want to create an index for the field.
As to needing it to be "more efficient", I'd kindly advise you to first check your results and see if they're fast enough. I don't see a reason beforehand (with the limited info you provided) to think it wouldn't be efficient.
If you intend to filter on the field as well (date ranges?), be sure to use filters instead of queries whenever the same filter will be used often, because filters can be efficiently cached. A sketch follows.
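For example, a cached range filter combined with a sort on the field (the index name and the epoch values are made up):
POST /tweets/_search
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": {
          "epoch_date": {
            "gte": 1420070400000,
            "lt": 1451606400000
          }
        }
      }
    }
  },
  "sort": [ { "epoch_date": "desc" } ]
}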

Field not searchable in ES?

I created an index myindex in Elasticsearch and loaded a few documents into it. When I visit:
localhost:9200/myindex/mytype/1023
I noticed that my particular index has the following metadata for mappings:
mappings: {
  mappinggroupname: {
    properties: {
      Aproperty: {
        type: string
      },
      Bproperty: {
        type: string
      }
    }
  }
}
Is there some way to add "store:yes" and index: "analyzed" without having to reload/reindex all the documents?
Note that when I want to view a single document,
i.e. localhost:9200/myindex/mytype/1023
I can see that the _source field contains all the fields of that document, and when I go to the "Browser" section of the head plugin it appears that all the columns are correct and correspond to my field names. So why is "stored" not showing up in the metadata? I can even perform a _search on them.
What is the difference between "stored": "true" and the fact that I can see all my fields and values after indexing all my documents via the means I mention above?
Nope, no way! That's how your documents got indexed in the underlying Lucene. The only way to change it is to reindex them all.
You see all those fields because you see the content of the special _source field in Lucene, which is stored by default by Elasticsearch. You are not storing all the fields separately, but you do have the source document that you originally indexed in _source, a single field that contains the whole document.
Generally the _source field is enough; you don't usually need to configure every field as stored.
Also, the default for all string fields, if not specified, is "index": "analyzed". That means those fields are indexed and analyzed using the standard analyzer unless the mapping says otherwise. Therefore, as far as I can see from your mapping, those two fields should be indexed, and thus searchable. A sketch of an explicit mapping for a new index follows.
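If you do reindex, a minimal sketch of what the new index's mapping could look like with explicit settings (myindex_v2 is a made-up name; all documents would then have to be reindexed into it):
PUT /myindex_v2
{
  "mappings": {
    "mytype": {
      "properties": {
        "Aproperty": { "type": "string", "index": "analyzed", "store": "yes" },
        "Bproperty": { "type": "string", "index": "analyzed", "store": "yes" }
      }
    }
  }
}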
