How can I make ElasticSearch yield just the first couple of words for a field?

I'm using ElasticSearch to query a set of rather long documents. Each document has (among other things) a title, a URL and a body.
When presenting the results to the user, I'd like to present just an 'abstract' of each document (along with the title and the URL). However, returning the full body only to trim it client-side seems wasteful.
Alas, I don't have a dedicated 'abstract' field or the like. Hence I wonder: is there a way to make ElasticSearch yield just the beginning (e.g. the first 200 words) of the 'body' field for each hit? I looked at source filtering (which I'm already using in my queries) but that seems to just select/deselect individual fields for the response. I'm rather looking for a way to transform the returned data.

It appears that Script Fields are one way to solve this. Here is an example query which fetches the title, the uri and a scripted(!) abstract field for each document. The abstract consists of the first 200 characters of the actual content field (capped at the field's length, so that shorter documents don't cause an out-of-bounds error):
{
  "query": {
    "match": {
      "title": "Scripting"
    }
  },
  "_source": ["title", "uri"],
  "script_fields": {
    "abstract": {
      "script": {
        "lang": "painless",
        "source": "params['_source'].content.substring(0, Math.min(200, params['_source'].content.length()))"
      }
    }
  }
}
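For reference, script fields come back under fields rather than _source, so each hit in the response looks roughly like this (the values shown are illustrative):

{
  "_source": {
    "title": "Scripting basics",
    "uri": "https://example.org/docs/scripting"
  },
  "fields": {
    "abstract": ["Painless is a simple, secure scripting language designed..."]
  }
}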

Related

Most performant way to update a single document in Elasticsearch via an alias

I have an Elasticsearch setup with an alias that points to many indices. I need to update a single document, but I don't know which index it resides in.
There are two ways I can accomplish this as far as I can see:
_update_by_query:
POST my-alias/_update_by_query
{
  "query": {
    "terms": {
      "_id": ["my-id-to-update"]
    }
  },
  "script": {
    "source": "ctx._source['Field'] = 'new value'"
  }
}
read (which returns the specific index) then write:
GET my-alias/_search
{
  "query": {
    "terms": {
      "_id": ["my-id-to-update"]
    }
  }
}

POST my-index-returned-from-the-get/_update/my-id-to-update
{
  "doc": {
    "Field": "new value"
  }
}
Which method is more performant?
Which method is preferred?
Is there a better way than either of these two?
The performance of both approaches will be the same, with one difference: the first approach needs to send only one request, while the second needs two. So the first approach is better, as you cut the number of API calls in half.
In my opinion the first approach is also much cleaner and fits better with the concept of aliases in Elasticsearch, because it keeps the exact index name encapsulated away from your application; the application doesn't need any clue about which exact index its documents live in.
An important note about updating a document in Elasticsearch: documents in Elasticsearch don't get updated in place. The existing document is flagged as deleted and a new document is created (this is due to the Lucene implementation); the flagged document is only physically removed later, during the Lucene segment-merging process.
You can find a good blog post about segment merging here.

How to boost Elasticsearch results based on another field?

Kind of a simple use case, but I cannot come up with a good solution.
Basically I have two indexed fields: content and keywords (keyword tokenizer), where content is a long text field and keywords contain important terms within that content. When I query with some long text, I have to boost those results based on the keywords present in the matching document.
I tried querying the complete text against both the content and keywords fields, but it is too slow, or it throws a too_many_clauses error for text with more than 40 words.
{"query": {
"match": {
"keywords": {
"query": "some long text",
"analyzer": "custom_analyzer"
}
}
}}
Is there any better way? Would percolator work here?
I can relate this to my application, which is similar to Stack Overflow: it consists of questions and answers, and each question has a subject, a body, tags, etc.
The subject here corresponds to your keywords field and the body to your content field. Normally the subject contains the important keywords about the post, which is also the case for you.
Now, coming to the solution part: we solve this by querying both the subject and body indexed fields, but boosting subject by a factor of 15, which is configurable.
The ES query we use:
{
  "query": {
    "multi_match": {
      "query": "this is a test",
      "fields": ["subject^15", "message"]
    }
  }
}
This ES doc also has a similar example, where a subject field in a multi_match query is boosted by a factor of 3.
Let me know if you have any questions.

Elasticsearch nested objects with query_string as first class attributes

I'm trying to index a nested field as a first-class attribute in my document so that I can search them using query_string without dot syntax.
For example, if I have a document like
"data": { "name": "Bob" }
instead of searching for data.name:Bob I would like to be able to search for name:Bob
The root of my issue is that we index a jsonb column that may have varying attributes. In some instances the data property may contain a data.business attribute, etc. I would like users to be able to search on these attributes without needing to "dig" into the object.
The data field does not have to be indexed as a nested type unless necessary; I was indexing it as an object previously.
I have tried to leverage the _all field as suggested in this post.
I have also tried to use include_in_parent:true and set the datatype as nested for my data field as suggested in this post.
I have also looked into the inner_hits feature to no avail.
Here's an example of my mapping for the data attribute.
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "data": {
          "type": "object"
        }
      }
    }
  }
}
Example document
PUT my_index/_doc/1
{
  "data": {
    "name": "bob",
    "business": "None of yours"
  }
}
And how my query currently looks:
GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "name:bob",
      "fields": ["data.*"]
    }
  }
}
With the current setup I almost get my desired results. I can search on individual properties like data.name:bob and data.business:"None of yours" and get back the correct documents.
However, I want to be able to get exactly the same results with business:"None of yours" or name:bob.
Thanks in advance for any help!
I figured it out using dynamic templates. For anyone coming across this in the future, here is how I solved the issue:
I used path_match to match the data object (data.*).
Then using copy_to and {name} I dynamically created top-level fields on my parent object.
{
  "dynamic_templates": [
    {
      "template_1": {
        "path_match": "data.*",
        "mapping": {
          "copy_to": "{name}"
        }
      }
    }
  ]
}
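Putting it all together, here is a minimal end-to-end sketch (using the typeless mapping syntax of newer Elasticsearch versions; the index and field names follow the question):

PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "template_1": {
          "path_match": "data.*",
          "mapping": {
            "copy_to": "{name}"
          }
        }
      }
    ]
  }
}

PUT my_index/_doc/1
{
  "data": {
    "name": "bob",
    "business": "None of yours"
  }
}

GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "name:bob"
    }
  }
}

The copy_to in the dynamic template copies every data.* value into a top-level field of the same name, so the last query matches the document without any dot syntax.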

Find documents in Elasticsearch where `ignore_malformed` was triggered

Elasticsearch by default throws an exception if inserting data to a field which does not fit the existing type. For example, if a field has been created as number type, inserting a document with a string value for that field causes an error.
This behavior can be changed by enabling the ignore_malformed setting, which means such fields are silently ignored for indexing purposes but retained in the _source document. That is, the invalid values cannot be searched or aggregated, but they are still included in the returned document.
This is the preferable behavior in our use case, but we would like to be able to locate such documents somehow so we can fix them in the future.
Is there any way to somehow flag documents for which some malformed fields were ignored? We control the document insertion process fully, so we can modify all insertion flags, or do a trial insert, or anything, to reach our goal.
You can use the exists query to find documents where this field does not exist; see this example:
PUT foo
{
  "mappings": {
    "bar": {
      "properties": {
        "baz": {
          "type": "integer",
          "ignore_malformed": true
        }
      }
    }
  }
}

PUT foo/bar/1
{
  "baz": "field"
}

GET foo/bar/_search
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must_not": [
            {
              "exists": {
                "field": "baz"
              }
            }
          ]
        }
      }
    }
  }
}
There is no dedicated mechanism though, so this search also finds documents where the field was intentionally left unset.
You cannot: when you search in Elasticsearch, you don't search the document source but the inverted index, which contains the analyzed data.
The ignore_malformed flag says "always store the document, analyze it if possible".
You can try it yourself: create a malformed document and use the _termvectors API to see how the document was analyzed and stored in the inverted index. In the case of a string field, you can see that an array is stored as an empty string, etc., but the field will still exist.
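Such a term-vectors request might look like this (a sketch assuming old-style typed URLs and the country_name field used below):

GET my_index/my_type/1/_termvectors?fields=country_name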
So forget the inverted index, let's use the source!
Scroll over all your data until you find the anomalies. I use a small Python script that runs a scroll search, deserializes each hit and tests the field type for every document (very slow), but in the end I have a list of the wrong document IDs.
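In console form, the scroll part of that approach looks roughly like this (index and field names are illustrative; the client repeats the second request until no hits remain, checking each hit's field type client-side):

GET my_index/_search?scroll=1m
{
  "size": 1000,
  "_source": ["country_name"]
}

POST _search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id from the previous response>"
}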
Using a script query can be very slow and can crash your cluster; use it with caution, maybe as a post_filter.
Here I want to retrieve the documents where country_name is not a string:
{
  "_source": false,
  "timeout": "30s",
  "query": {
    "query_string": {
      "query": "locale:de_ch"
    }
  },
  "post_filter": {
    "script": {
      "script": "!(_source.country_name instanceof String)"
    }
  }
}
"_source:false" => I want only document ID
"timeout" => prevent crash
As you can see, this is a missing feature. I know Logstash will tag documents that fail, so Elasticsearch could implement the same thing.
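Since the question notes that the insertion process is fully controlled, one possible workaround is a trial insert into a twin index where ignore_malformed is disabled (a sketch, not a built-in feature; the strict_foo index name is made up):

PUT strict_foo
{
  "mappings": {
    "bar": {
      "properties": {
        "baz": {
          "type": "integer",
          "ignore_malformed": false
        }
      }
    }
  }
}

PUT strict_foo/bar/1
{
  "baz": "field"
}

The second request fails with a mapper_parsing_exception; on such a failure the client can add a flag field (e.g. malformed: true) to the document before indexing it into the real index, which makes the affected documents trivially searchable later.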

Sorting a match query with ElasticSearch

I'm trying to use ElasticSearch to find all records containing a particular string. I'm using a match query for this, and it's working fine.
Now, I'm trying to sort the results based on a particular field. When I try this, I get some very unexpected output, and none of the records even contain my initial search query.
My request is structured as follows:
{
  "query": {
    "match": { "_all": "some_search_string" }
  },
  "sort": [
    {
      "some_field": {
        "order": "asc"
      }
    }
  ]
}
Am I doing something wrong here?
In order to sort on a string field, your mapping must contain a non-analyzed version of this field. Here's a simple blog post I found that describes how you can do this using the multi_field mapping type.
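For illustration, here is a minimal sketch of such a mapping using the fields sub-field syntax (the modern successor of the multi_field type; index and field names are made up). The text version of the field is used for full-text matching, and the keyword sub-field for sorting:

PUT my_index
{
  "mappings": {
    "properties": {
      "some_field": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "match": { "some_field": "some_search_string" }
  },
  "sort": [
    { "some_field.raw": { "order": "asc" } }
  ]
}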
