Only index certain fields from Wikipedia River - elasticsearch

I'm trying to use the Wikipedia River
Is there a way / How can I customize the mapping so that ElasticSearch only index the title fields (I'd still like to access the whole text)?

The mapping is useful more to decide how you index data rather than what you index, unless you set it to dynamic: false which means that elasticsearch effectively accepts only the fields that are explicitly declared in the mapping.
The problem is that the wikipedia river always sends a set of fields for every document and this behaviour is not currently configurable, thus there's no way to index only a subset of those fields (e.g. only title and _source). What you could do is modify your search request so that you get back only the fields that you are interested in, but the content of the index will stay the same.

Related

How to only store the index,not the original text in ES

I am using elastic search 7.10.1. I would store and search against my blogs. The blog has id,title and content fields.
I would like to search against id, title and content, but since the content of blog is too big, so that I would like to save the original content text outside of Elastic Search, such as HBase.
I would ask how to achieve this in ES?
If you are using a static mapping then simply don't define your content field in your index mapping, and don't populate it while indexing your document to ES.
Refer to Mapping param for more info, and specifically, store param default false which means you can't retrieve field value if _source(true by default) is also disabled.
index param default true, which controls whether the field is searchable or not, in your case if you don't want to search and retrieve it you have to disable these two params.

Filtering Elasticsearch fields from index/store

I was wondering what is the recommended approach to filter out some of the fields that are sent to Elasticsearch from Store and Index?
I want to filter our some fields from getting indexed in Elasticsearch. You may ask why you are sending them to Elasticsearch from the first place. Unfortunately, it is sent via another application that doesn't accept any filtering mechanism. Hence, filtering should be addressed at the time of indexing. Here is what we have done, but I am not sure what would be the consequences of these steps:
1- Disable dynamic mapping ("dynamic": "false" ) in ES templates.
2- Including only the required fields in _source and excluding the rest.
According to ES website, some of the ES functionalities will be disabled by disabling _source fields. Given I don't need the filtered fields at all, I was wondering whether the mentioned solution will break anything regarding the remaining fields or not?
There are a few mapping parameters that allow you to do what you want:
index: true/false: if true the field value is indexed in order to be searched later on (default: true)
store: true/false: if true the field values are stored in addition to being indexed. Usually, the field values are stored in the source already, but you can choose to not store the source but store the field value itself (default: false)
enabled: true/false: only for the mapping type as a whole or for object types. you can decide whether to only store the value but not index it
So you can use any combination of the above parameters if you don't want to modify the source documents and simple let ES do it for you.

analyzed field vs doc_values: true field

We have an elasticsearch that contains over half a billion documents that each have a url field that stores a URL.
The url field mapping currently has the settings:
{
index: not_analyzed
doc_values: true
...
}
We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
Go through our database and for each document create a new field that is a list of URL parts. This field could have doc_values: true still so would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
My question is this:
Which is better for performance: having a list of variable lengths that has doc_values on, or having an analysed field? (ie: option 1 or option 2) OR is there an option 3 that would be even better yet?!
Thanks for your help!
Your question is about a field where you need doc_values but can not index with keyword-analyzer.
You did not mention why you need doc_values. But you did mention that you currently not search in this field.
So I guess that the name of the search-field do not have to be the same: you can copy the field value in an other field which is only for search ( "store": false ). For this new field you can use the pattern-analyzer or pattern-tokenizer for your use case.
It seems that no-one has actually performance tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
An index with an analysed field that was setup as suggested in the other answer.
An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
Then I created a set of gatling tests to run against the indices and compared the gatling results and netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netadata landscape: The analysed field showed a spike - although only a small one - on all elastic nodes. The not_analysed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade off - you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms will mean massive field data that will, eventually, never fit in the amount of memory you can allocate it.

Lucene: Filter query by doc ID

I want to have in the search response only documents with specified doc id. In stackoverflow I found this question (Lucene filter with docIds) but as far as I understand there is created the additional field in the document and then doing search by this field. Is there another way to deal with it?
Lucene's docids are intended only to be internal keys. You should not be using them as search keys, or storing them for later use. Those ids are subject to change without warning. They will be changed when updating or reindexing documents, and can change at other times, such as segment merges, as well.
If you want your documents to have a unique identifier, you should generate that key separate from the docId, and index it as a field in your document.

What indexes are created when indexing a document in elasticsearch

If I create a first document of it's type, or put a mapping, is an index created for each field?
Obviously if i set "index" to "analyzed" or "not analyzed" the field is indexed.
Is there a way to store a field so it can be retrieved but never searched by? I imagine this will save a lot of space? If I set this to "no" will this save space?
Will I still be able to search by this, just take more time, or will this be totally unsearchable?
Is there a way to make a field indexed after some documents are inserted and I change my mind?
For example, I might have a mapping:
{
"book":{"properties":{
"title":{"type":"string", "index":"not_analyzed"},
"shelf":{"type":"long","index":"no"}
}}}
so I want to be able to search by title, but also retrieve the shelf the book is on
index:no will indeed not create an index for that field, so that saves some space. Once you've done that you can't search for that particular field anymore.
Perhaps also useful in this context is to know aboutthe _source field, which is returned by default and includes all fields you've stored. http://www.elasticsearch.org/guide/reference/mapping/source-field/
As to your second question:
you can't change your mind halfway. When you want to index a particular field later on you have to reindex the documents.
That's why you may want to reconsider setting index:no, etc. In fact a good strategy to begin is to don't define a schema for fields at all, unless you're 100% sure you need a non-default analyzer for a particular field for instance. Otherwise ES will use generally usable defaults.

Resources