Is it possible to retrieve only doc id and score from Elasticsearch without performing the fetch phase of search?

Understanding "Query Then Fetch" shows that an Elasticsearch query is a two-step process of querying (find/score/sort matching documents from all servers) and fetching (go back to the servers and collect the content of the matching documents).
Is there a way to retrieve only a sorted list of doc_id and score but avoid the fetch? I know that the fetch can be avoided by setting size to 0... but I still need the matching docs and their scores, and that would return none.
I figure I might be able to turn off _source, but I'm not sure that would work if, for example, the query portion of the search only knows the internal doc_id and needs to go and retrieve the public doc_id.

GET /_search
{
  "_source": false,
  "query": {
    "term": { "user": "kimchy" }
  }
}
Of course, you have to use your own IDs, not auto-generated ones.

The scores are separated from the docs' sources so I don't see why a fetch would be necessary to retrieve them.
You can surely turn off _source and then also sort by _id, like so:
GET your_index/_search
{
  "_source": false,
  "size": 200,
  "sort": [
    {
      "_id": {
        "order": "asc"
      }
    },
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}
Interestingly enough, sorting the response by a doc's _source field seems to be ~3x faster than sorting by the inner _id (contrary to what I expected). I've tested this with quite a small index -- ~1.5M docs. I wonder what you get when you run
GET your_index/_search?request_cache=false
{
  "_source": false,
  "size": 200,
  "sort": [
    {
      "_id": {
        "order": "asc"
      }
    }
  ]
}
and then replace _id with another sortable field from the doc's _source.

Indeed, by setting size to 0, we skip the fetch phase. In all other cases, if there is even a single hit, the fetch phase will be executed, and there is no way to skip it.
As you correctly noted, the query phase doesn't know the real _ids of the matched documents, only their internal doc ids on their respective shards. As part of the fetch phase we retrieve those _ids, which are stored as a stored field in Lucene. _source is a stored field separate from _id, and it is also loaded during the fetch phase. To speed up the fetch phase, you can disable loading _source if you don't need it. Being a separate field from _id, disabling _source doesn't affect the correct loading of _ids.
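So the cheapest request that still returns IDs and scores simply disables _source and lets the (unavoidable) fetch phase load only the lightweight _id stored field. A minimal Python sketch of building such a request body and reading the hits back out of a response (the query and the sample response are placeholders for illustration):

```python
# Build a search body that skips _source loading; the fetch phase still
# runs, but only the lightweight _id stored field is read per hit.
def build_id_score_query(query, size=100):
    return {"_source": False, "size": size, "query": query}

# Extract (doc_id, score) pairs from a standard search response.
def ids_and_scores(response):
    return [(h["_id"], h["_score"]) for h in response["hits"]["hits"]]

# Abbreviated example of the response shape Elasticsearch returns:
sample = {
    "hits": {
        "hits": [
            {"_id": "doc-1", "_score": 1.42},
            {"_id": "doc-7", "_score": 0.97},
        ]
    }
}
print(ids_and_scores(sample))  # [('doc-1', 1.42), ('doc-7', 0.97)]
```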

Related

How can I make ElasticSearch yield just the first couple of words for a field?

I'm using ElasticSearch to query a set of rather long documents. Each document has (among other things) a title, a URL and a body.
When presenting the results to the user, I'd like to present just an 'abstract' of each document (along with the title and the URL). However, returning the full body only to trim it client-side seems wasteful.
Alas, I don't have a dedicated 'abstract' field or the like. Hence I wonder: is there a way to make ElasticSearch yield just the beginning (e.g. the first 200 words) of the 'body' field for each hit? I looked at source filtering (which I'm already using in my queries) but that seems to just select/deselect individual fields for the response. I'm rather looking for a way to transform the returned data.
It appears that Script Fields are one way to solve this. Here is an example query which gets the title, uri, and a scripted(!) abstract field for each document. The abstract consists of the first 200 characters of the actual content field:
{
  "query": {
    "match": {
      "title": "Scripting"
    }
  },
  "_source": ["title", "uri"],
  "script_fields": {
    "abstract": {
      "script": {
        "lang": "painless",
        "source": "params['_source'].content.substring(0, 200)"
      }
    }
  }
}
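One caveat with a bare substring(0, 200): Painless will throw an error for documents whose content is shorter than 200 characters, so it is safer to clamp the length with Math.min. A small Python helper that builds such a body (the field names title, uri, and content come from the question above):

```python
# Build a search body that returns a scripted "abstract" field instead of
# the full body. Clamping with Math.min avoids an out-of-bounds error for
# documents whose content is shorter than the requested length.
def build_abstract_query(text, length=200):
    script = (
        "def c = params['_source'].content; "
        "c.substring(0, (int) Math.min({}, c.length()))".format(length)
    )
    return {
        "query": {"match": {"title": text}},
        "_source": ["title", "uri"],
        "script_fields": {
            "abstract": {"script": {"lang": "painless", "source": script}}
        },
    }

body = build_abstract_query("Scripting")
```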

Checking for not null with completion suggester query in Elastic Search

I have an existing query that provides suggestions for a postcode, using the query below (I have hard-coded the postcode as T0L):
"suggest": {
  "suggestions": {
    "text": "T0L",
    "completion": {
      "field": "postcode.suggest"
    }
  }
}
This works fine, but it returns some results where the city contains null values. So I need to filter out the addresses where the city is null.
I followed the suggested solution and prepared the query like this:
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "city"
        }
      }
    }
  },
  "suggest": {
    "suggestions": {
      "text": "T0L",
      "completion": {
        "field": "postcode.suggest"
      }
    }
  }
}
But unfortunately this does not return the required addresses where the postcode contains T0L; rather, I am getting results where the postcode starts with A1X. So I believe it is querying for all the addresses where the city is present and ignoring the completion suggester query. Can you please let me know where the mistake is, or how to write the query correctly?
There is no way to filter out suggestions at query time, because the completion suggester uses an FST (a special in-memory data structure built at index time) for lightning-fast search.
But you can change your mapping and add a context for your suggester. The basic idea of a context is that it is also filled at index time, along with the completion field, and can therefore be used at query time in the suggest query.
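As a sketch of what the query side would look like: assuming the mapping declares a category context on postcode.suggest (the context name has_city is hypothetical, and it would have to be populated at index time, e.g. with "true"/"false" depending on whether the city is set), the suggest request filters by that context:

```python
# Suggest query filtered by a hypothetical "has_city" category context.
# The context must be declared in the completion field's mapping and
# filled at index time for each document.
def build_suggest_query(prefix):
    return {
        "suggest": {
            "suggestions": {
                "text": prefix,
                "completion": {
                    "field": "postcode.suggest",
                    "contexts": {"has_city": ["true"]},
                },
            }
        }
    }

query = build_suggest_query("T0L")
```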

Find documents in Elasticsearch where `ignore_malformed` was triggered

Elasticsearch by default throws an exception when inserting data into a field that does not fit the existing type. For example, if a field has been created as a number type, inserting a document with a string value for that field causes an error.
This behavior can be changed by enabling the ignore_malformed setting, which means such fields are silently ignored for indexing purposes but retained in the _source document: the invalid values cannot be searched or aggregated, but are still included in the returned document.
This is the preferable behavior in our use case, but we would like to be able to locate such documents somehow so we can fix them in the future.
Is there any way to somehow flag documents for which some malformed fields were ignored? We control the document insertion process fully, so we can modify all insertion flags, or do a trial insert, or anything, to reach our goal.
You can use the exists query to find documents where this field does not exist; see this example:
PUT foo
{
  "mappings": {
    "bar": {
      "properties": {
        "baz": {
          "type": "integer",
          "ignore_malformed": true
        }
      }
    }
  }
}
PUT foo/bar/1
{
  "baz": "field"
}
GET foo/bar/_search
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must_not": [
            {
              "exists": {
                "field": "baz"
              }
            }
          ]
        }
      }
    }
  }
}
There is no dedicated mechanism, though, so this search also finds documents where the field is intentionally not set.
You cannot. When you search in Elasticsearch, you don't search the document source but the inverted index, which contains the analyzed data.
The ignore_malformed flag says "always store the document, analyze if possible".
You can try it: create a malformed document and use the _termvectors API to see how the document is analyzed and stored in the inverted index. In the case of a string field, you can see that an array is stored as an empty string, etc., but the field will exist.
So forget the inverted index; let's use the source!
Scroll through all your data until you find the anomalies. I use a small Python script that scrolls with a search, deserializes each document, and tests the field type for every document (very slow), but it gives me a list of wrong document IDs.
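The type check at the heart of that scroll script can be sketched like this (the field name country_name follows the example below; the scroll loop itself is omitted, and this only inspects hits that are already in hand):

```python
# Core of the scroll-and-check approach: flag documents whose _source
# field does not have the expected Python type after deserialization.
def malformed_ids(hits, field, expected_type):
    bad = []
    for hit in hits:
        value = hit["_source"].get(field)
        if value is not None and not isinstance(value, expected_type):
            bad.append(hit["_id"])
    return bad

# Two hits as they might come back from a scroll page:
hits = [
    {"_id": "1", "_source": {"country_name": "Switzerland"}},
    {"_id": "2", "_source": {"country_name": ["not", "a", "string"]}},
]
print(malformed_ids(hits, "country_name", str))  # ['2']
```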
Using a script query can be very slow and can crash your cluster; use it with caution, maybe as a post_filter.
Here I want to retrieve the documents where country_name is not a string:
{
  "_source": false,
  "timeout": "30s",
  "query": {
    "query_string": {
      "query": "locale:de_ch"
    }
  },
  "post_filter": {
    "script": {
      "script": "!(_source.country_name instanceof String)"
    }
  }
}
"_source": false => I want only the document IDs
"timeout" => prevents a crash
As you noticed, this is a missing feature. I know Logstash tags documents that fail, so Elasticsearch could implement the same thing.

Elasticsearch delete duplicates

Some of the records in my index are duplicates, identified by a numeric field recordid.
There is delete-by-query in Elasticsearch. Can I use it to delete one of each pair of duplicated records? Or is there some other way to achieve this?
Yes, you can find duplicated documents with an aggregation query:
curl -XPOST http://localhost:9200/your_index/_search -d '
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "recordid",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}'
then delete the duplicated documents, preferably using a bulk query. Have a look at es-deduplicator for automated duplicate removal (disclaimer: I'm the author of that script).
NOTE: Aggregation queries can be very expensive and might even crash your nodes (in case your index is too large and the number of data nodes too small).
Elasticsearch recommends "us[ing] the scroll/scan API to find all matching ids and then issue a bulk request to delete them".
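Turning the aggregation response above into bulk delete actions can be sketched as follows: for each duplicateCount bucket, keep the first top_hits document and delete the rest (the index name and the abbreviated sample response are illustrative):

```python
# From a terms/top_hits aggregation response, keep the first document of
# each duplicate group and emit bulk "delete" action lines for the rest.
def bulk_delete_actions(agg_response, index):
    actions = []
    for bucket in agg_response["aggregations"]["duplicateCount"]["buckets"]:
        hits = bucket["duplicateDocuments"]["hits"]["hits"]
        for hit in hits[1:]:  # keep hits[0], delete the duplicates
            actions.append({"delete": {"_index": index, "_id": hit["_id"]}})
    return actions

# Abbreviated aggregation response with one duplicate group:
sample = {
    "aggregations": {
        "duplicateCount": {
            "buckets": [
                {
                    "key": 42,
                    "duplicateDocuments": {
                        "hits": {"hits": [{"_id": "a"}, {"_id": "b"}]}
                    },
                }
            ]
        }
    }
}
print(bulk_delete_actions(sample, "your_index"))
# [{'delete': {'_index': 'your_index', '_id': 'b'}}]
```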
Edit:
The first challenge here is to identify the duplicate documents. For that you need to run a terms aggregation on the field that defines the uniqueness of a document. On the second level of the aggregation, use top_hits to get the document IDs too. Once you are there, you will have the IDs of the documents that have duplicates.
Now you can safely remove them, perhaps using the Bulk API.
You can read about other approaches to detecting and removing duplicate documents here.

Sorting a match query with ElasticSearch

I'm trying to use ElasticSearch to find all records containing a particular string. I'm using a match query for this, and it's working fine.
Now, I'm trying to sort the results based on a particular field. When I try this, I get some very unexpected output, and none of the records even contain my initial search query.
My request is structured as follows:
{
  "query": {
    "match": { "_all": "some_search_string" }
  },
  "sort": [
    {
      "some_field": {
        "order": "asc"
      }
    }
  ]
}
Am I doing something wrong here?
In order to sort on a string field, your mapping must contain a non-analyzed version of that field. Here's a simple blog post I found that describes how you can do this using the multi_field mapping type.
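On recent Elasticsearch versions the old multi_field type is expressed with "fields" sub-fields instead: the analyzed text field gets a keyword sub-field that is usable for sorting. A sketch of the mapping and the adjusted search body (the sub-field name raw is a convention chosen here, not required; _all comes from the question and is removed in newer versions):

```python
# Mapping with a keyword sub-field for sorting; "raw" is an illustrative
# name -- any sub-field name works as long as the sort clause matches it.
mapping = {
    "properties": {
        "some_field": {
            "type": "text",
            "fields": {"raw": {"type": "keyword"}},
        }
    }
}

# The match query stays the same; only the sort targets the non-analyzed
# sub-field instead of the analyzed text field.
search_body = {
    "query": {"match": {"_all": "some_search_string"}},
    "sort": [{"some_field.raw": {"order": "asc"}}],
}
```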