[es painless] Difference between doc and params._source - elasticsearch-painless

When I use doc["abc"], I get a 'no field "abc"' exception, yet params._source["abc"] returns everything correctly.
I checked doc["abc"].value and it is null; doc["abc"].empty is also true.
1. Elasticsearch version: 5.x
2. Using a Painless inline sort script
Can anybody figure out what happened?

Depending on where a script is used, it will have access to certain special variables and document fields. I don't know your mapping, but I think this link will answer your questions: Accessing document fields and special variables
To quote further from above link:
Doc values and text fields
The doc['field'] syntax can also be used for analyzed text fields if
fielddata is enabled, but BEWARE: enabling fielddata on a text field
requires loading all of the terms into the JVM heap, which can be very
expensive both in terms of memory and CPU. It seldom makes sense to
access text fields from scripts.
Doc values are a columnar field value store, enabled by default on all
fields except for analyzed text fields.
The _source provides access to the original document body that was
indexed (including the ability to distinguish null values from empty
fields, single-value arrays from plain scalars, etc).
If your field abc is an analyzed text field or an object, doc won't work.
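A minimal sketch of a sort request that checks for the empty case before reading doc values. It assumes a hypothetical keyword sub-field named abc.keyword (the question does not show the mapping), and Python is used only to build the request body. The "inline" script key is the one 5.x releases accept; later versions call it "source".

```python
import json

def build_sort_body(field, fallback=""):
    """Build a search body that sorts on a doc-values field, guarding against missing values."""
    # Painless: use the fallback when the document has no doc value for the field.
    script = "doc['{f}'].empty ? params.fallback : doc['{f}'].value".format(f=field)
    return {
        "query": {"match_all": {}},
        "sort": {
            "_script": {
                "type": "string",
                "order": "asc",
                "script": {
                    "lang": "painless",
                    "inline": script,  # assumption: 5.x key; newer versions use "source"
                    "params": {"fallback": fallback},
                },
            }
        },
    }

body = build_sort_body("abc.keyword")  # hypothetical field name
print(json.dumps(body, indent=2))
```

If abc has no doc values at all (for example an analyzed text field without a keyword sub-field), the guard only hides the problem; reading params._source["abc"] remains the fallback, at the cost of loading the source for every document.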

Related

Control which multi fields are queried by default

I have a preexisting index that contains field mappings and is currently being queried by many applications. I would like to add additional ways for the data to be queried, specifically, support full text search via analysis. Multi-fields seemed like the obvious way to do this, but I found that adding new multi-fields actually changes the existing query behavior.
For example, I have an "id" field that is a keyword. Applications are already using this field to query on. After I add a new multi-field, like "txt" (using the standard analyzer), new documents can be found by querying with just a partial value match. Values for "id" look like this: "123-abc" so now a query with just "abc" will match when querying against the "id" field. This is not how it worked previously (the keyword only field would require the entire value "123-abc").
Ideally, the top-level "id" field would be keyword only, and if a "full text" search was required, the query would need to specify "id.txt". So my question is... is there a way to disable multi-fields and require that the query explicitly set a sub field when needed?
My only other thought on how to solve this, was to use copy_to so that these fields are completely distinct... but that is a bit more work and there are many many fields to deal with that would require this.
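To make the setup concrete, here is a rough sketch of the mapping and the two kinds of query described above, built as Python dictionaries. The field names id and id.txt come from the question; the helper names are hypothetical. It only illustrates explicit targeting of each field and does not explain why existing queries started matching partial values, which depends on the query type the applications use.

```python
# Mapping: "id" stays a keyword; "id.txt" is the analyzed multi-field.
mapping = {
    "properties": {
        "id": {
            "type": "keyword",
            "fields": {
                "txt": {"type": "text", "analyzer": "standard"}
            },
        }
    }
}

def exact_id_query(value):
    """Term query on the keyword field: requires the whole value, e.g. "123-abc"."""
    return {"query": {"term": {"id": value}}}

def full_text_id_query(text):
    """Match query on the analyzed sub-field: "abc" can match "123-abc"."""
    return {"query": {"match": {"id.txt": text}}}

print(exact_id_query("123-abc"))
print(full_text_id_query("abc"))
```

The standard analyzer splits "123-abc" into the terms "123" and "abc", which is why a match on id.txt finds the document with just "abc".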

Dealing with Empty Fields

I am new to stormcrawler and elasticsearch in general. I am currently using stormcrawler 2.0 to index website data (including non-HTML items such as PDFs and Word documents) into elasticsearch. In some cases, the metadata of a PDF or Word document does not contain a title, so the field is stored blank/null in elasticsearch. This is unfortunately causing issues in the webapp I am using to display search results (search-ui). Is there a way I can have stormcrawler insert a default value of "Untitled" into the title field if none exists in the metadata?
I understand that elasticsearch has a null_value field parameter, but if I understand correctly that parameter cannot be used for text fields and only helps with searching.
Thanks!
One option would be to write a custom ParseFilter to give an arbitrary value to any missing key or a key with an empty value. The StormCrawler code has quite a few examples of ParseFilters, see also the WIKI.
The same could be done as a custom Bolt placed between the parser and the indexer; grab the metadata and normalise to your heart's content.
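The normalisation itself is small. The sketch below shows only the logic, in Python for brevity; it is not StormCrawler's API (a real ParseFilter or Bolt would be written in Java). It assumes metadata is a mapping from key to a list of string values, and the function name is hypothetical.

```python
def apply_default(metadata, key, default):
    """Return a copy of metadata where `key` holds `default` if it is missing or blank."""
    # Keep only values that contain something other than whitespace.
    values = [v for v in metadata.get(key, []) if v and v.strip()]
    result = dict(metadata)
    result[key] = values if values else [default]
    return result

print(apply_default({"title": [""]}, "title", "Untitled"))
print(apply_default({"title": ["Annual report"]}, "title", "Untitled"))
```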

analyzed field vs doc_values: true field

We have an elasticsearch that contains over half a billion documents that each have a url field that stores a URL.
The url field mapping currently has the settings:
{
  "index": "not_analyzed",
  "doc_values": true,
  ...
}
We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
1. Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
2. Go through our database and for each document create a new field that is a list of URL parts. This field could have doc_values: true still so would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
My question is this:
Which is better for performance: having a list of variable lengths that has doc_values on, or having an analysed field? (ie: option 1 or option 2) OR is there an option 3 that would be even better yet?!
Thanks for your help!
Your question is about a field that needs doc_values but cannot be indexed with the keyword analyzer.
You did not mention why you need doc_values, but you did mention that you do not currently search on this field.
So I guess the search field does not have to have the same name: you can copy the field value into another field that is used only for search ("store": false). For this new field you can use the pattern analyzer or pattern tokenizer for your use case.
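A rough sketch of what that could look like, expressed as a Python dictionary for the index creation body. The names url_search, slash_tokenizer, url_parts_analyzer and the type name doc are hypothetical, and the string / not_analyzed syntax assumes the same older Elasticsearch version as the mapping in the question. The small helper only approximates locally which terms a "/" pattern tokenizer would produce.

```python
import re

index_body = {
    "settings": {
        "analysis": {
            # Split on "/" only, so "user#site" stays one term.
            "tokenizer": {"slash_tokenizer": {"type": "pattern", "pattern": "/"}},
            "analyzer": {
                "url_parts_analyzer": {"type": "custom", "tokenizer": "slash_tokenizer"}
            },
        }
    },
    "mappings": {
        "doc": {  # hypothetical type name
            "properties": {
                # Original field keeps doc_values and copies its value to the search field.
                "url": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": True,
                    "copy_to": "url_search",
                },
                # Search-only field, analysed with the custom analyzer.
                "url_search": {
                    "type": "string",
                    "analyzer": "url_parts_analyzer",
                    "store": False,
                },
            }
        }
    },
}

def approximate_terms(path):
    """Local approximation of the terms: split on '/' and drop empty pieces."""
    return [t for t in re.split("/", path) if t]

print(approximate_terms("/part1/user#site/part2/part3.ext"))
```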
It seems that no-one has actually performance tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
1. An index with an analysed field that was set up as suggested in the other answer.
2. An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
Then I created a set of gatling tests to run against the indices and compared the gatling results and netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netdata landscape: The analysed field showed a spike - although only a small one - on all elastic nodes. The not_analysed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade off - you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
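For illustration, the enrichment step could look something like the sketch below. It assumes "permutations of URL segmentation" means every contiguous run of path segments, which is my reading of the examples in the question and not necessarily the exact process used; the function name is hypothetical.

```python
def url_segment_runs(path):
    """Return every contiguous run of path segments, joined with '/'."""
    segments = [s for s in path.split("/") if s]
    runs = []
    for start in range(len(segments)):
        for end in range(start + 1, len(segments) + 1):
            runs.append("/".join(segments[start:end]))
    return runs

runs = url_segment_runs("/part1/user#site/part2/part3.ext")
print(len(runs))
print(runs)
```

A path with n segments yields n(n+1)/2 values, so long paths add many entries per document, which is consistent with the index growth mentioned above.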
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms will mean massive field data that will eventually no longer fit in the amount of memory you can allocate to it.

Only index certain fields from Wikipedia River

I'm trying to use the Wikipedia River
Is there a way to customize the mapping so that Elasticsearch only indexes the title field (I'd still like to access the whole text)?
The mapping is useful more to decide how you index data rather than what you index, unless you set it to dynamic: false which means that elasticsearch effectively accepts only the fields that are explicitly declared in the mapping.
The problem is that the wikipedia river always sends a set of fields for every document and this behaviour is not currently configurable, thus there's no way to index only a subset of those fields (e.g. only title and _source). What you could do is modify your search request so that you get back only the fields that you are interested in, but the content of the index will stay the same.
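As a small sketch of that last suggestion, the request body below asks for only the title in each hit. It assumes an Elasticsearch version that supports _source filtering in the search body (older releases used a "fields" parameter instead); the helper name is hypothetical.

```python
def title_only_search(query_text):
    """Search on title and return only the title from each hit's _source."""
    return {
        "query": {"match": {"title": query_text}},
        "_source": ["title"],  # assumption: _source filtering is available
    }

print(title_only_search("elasticsearch"))
```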

Stored field in elastic search

In the documentation, for some types, such as numbers and dates, it specifies that store defaults to no, but that the field can still be retrieved from the JSON.
It's confusing. Does this mean _source?
Is there a way to not store a field at all, and just have it indexed and searchable?
None of the field types are stored by default. Only the _source field is. That means you can always get back what you sent to the search engine. Even if you ask for specific fields, elasticsearch is going to parse the _source field for you and give you back those fields.
You can disable the _source if you want but then you could only retrieve the fields that you explicitly stored, according to your mapping.
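A sketch of a mapping along those lines, written as a Python dictionary. The field names price and created are hypothetical. With _source disabled, only fields with store enabled can be returned; the others remain indexed and searchable but cannot be retrieved.

```python
mapping = {
    "_source": {"enabled": False},  # nothing can be read back from the original JSON
    "properties": {
        "price": {"type": "integer", "store": True},   # searchable and retrievable
        "created": {"type": "date", "store": False},   # searchable only
    },
}

def retrievable_fields(mapping):
    """List the fields that can still be returned when _source is disabled."""
    props = mapping["properties"]
    return sorted(name for name, cfg in props.items() if cfg.get("store") is True)

print(retrievable_fields(mapping))
```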
