Elasticsearch, understanding completion suggester - elasticsearch

I got the completion suggester working for autocomplete.
However, I have a question that I can't answer myself:
why are we storing the suggestions in a field of the document?
GET /my_index/_search
{
"hits": [{
"_id": 1,
"suggest": {
"input": [
"p1",
"p22"
],
"weight": 1
}
}, {
"_id": 2,
"suggest": {
"input": [
"p22",
"p3"
],
"weight": 1
}
}]
}
For autocomplete, don't we just need a list of phrases?
[
"p1",
"p22",
"p3"
]
What do we gain by associating the suggestions with the document?
As in the example, multiple docs can have the same suggest input (p22 here). When I ask for autocomplete for p2, I get p22 twice.
Is there a way of handling this?

There's no other way to store suggestions than storing them in a completion field inside the document itself. This gives you maximum flexibility, because even if two documents have the same or similar suggestions, you can give one a higher weight than the other if you deem it necessary.
If you have multiple documents with the same suggestions, you can leverage the skip_duplicates setting and ES will filter out duplicate suggestions from the response.
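As a rough sketch of what such a request might look like, the body below is built as a plain Python dict that can be sent to /my_index/_search with any HTTP client. The suggestion name doc-suggest and the helper name build_suggest_body are made up for illustration; suggest is the completion field from the question and skip_duplicates is the setting mentioned above.

```python
import json


def build_suggest_body(prefix, field="suggest", size=5):
    """Build a _search body for a completion suggester that drops duplicates.

    "doc-suggest" is an arbitrary, hypothetical name for the suggestion.
    """
    return {
        "suggest": {
            "doc-suggest": {
                "prefix": prefix,
                "completion": {
                    "field": field,
                    "size": size,
                    # Ask ES to filter out suggestions with identical text.
                    "skip_duplicates": True,
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_suggest_body("p2"), indent=2))
```

With the two documents from the question, the expectation is that p22 is then suggested once for the prefix p2 rather than once per document.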

Related

Elastic Search - Search using multiple indexes with documents that have different data structures

I'm working with two different indexes, with a different data structure in each. Documents in both indexes share a matching property that is their unique identifier. Below you can see the document structures and what a document should look like when it is complete.
I need to create filters and metrics (metrics can be computed in code later on if necessary) that remove from the query the documents that don't fully match the conditions. The properties are spread across these two indexes, so when I filter by status code 2, the paired document in the other index should not be included in the search.
https://elasticsearch.*.com.br/index_a,index_b/_search
index_a
{"matching_pair_key": "1", "status_code": 2},
{"matching_pair_key": "2", "status_code": 2},
{"matching_pair_key": "2", "status_code": 1},
index_b
{"matching_pair_key": "1", "age": "31"},
{"matching_pair_key": "2", "age": "33"},
{"matching_pair_key": "3", "age": "18"},
{"matching_pair_key": "4", "age": "52"},
Complete document
{"matching_pair_key": "1", "age": "31", "status_code": 2}
I need to create a query that "joins" and retrieves these documents, as well as filtering them, such as:
{
"must": [
{
"match": {
"status_code": 2
}
}
]
}
Should respond (this would be the ideal thing for me, but I understand that it is not easy, or maybe even impossible):
{"matching_pair_key": "1", "status_code": 2, "age": "31"},
{"matching_pair_key": "2", "status_code": 2, "age": "33"},
{"matching_pair_key": "2", "status_code": 2, "age": "33"}
I tried doing an aggregation that joins these documents in buckets:
"aggs": {
"joined_docs": {
"terms": {
"field": "matching_pair_key.keyword"
},
"aggs": {
"_top_hits_agg": {
"top_hits": {
"_source": {
"includes": [
"matching_pair_key",
"status_code",
"age"
]
},
"size": 10
}
}
}
}
}
But when I do that, I can't use filters in my must clause, because if I filter on a property that doesn't exist in the matching document, that document won't be included in the result.
I could fix this if I knew a way of including a doc in the search when it doesn't have the property that I'm trying to filter on, but I don't know how. I also want to avoid using bucket aggregations if I can, because the database is quite large.
I'm open to suggestions on how to approach this issue. So far I have only thought about using aggregations to work with this data, but I'm afraid of how much this will cost in terms of processing and how long it will take.
Key points:
Every doc is guaranteed to have the matching property (matching_pair_key).
Very large database.
I can do it using buckets, but the filters have been constraining so far (and avoiding buckets will make the code easier later on).
I'm open to any sort of suggestion.
You're looking for a join, which Elasticsearch can't do.
Your best option would be to merge these two indices together. This could be achieved by reindexing through an ingest pipeline that uses an enrich processor: https://www.elastic.co/guide/en/elasticsearch/reference/current/enrich-processor.html
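A minimal sketch of how that merge could be set up, written as a Python helper that only builds the REST calls (method, path, body) without sending them. The names pair-policy, pair-pipeline, index_merged and the target field pair are hypothetical; index_a, index_b, matching_pair_key and age come from the question.

```python
def build_enrich_requests(policy="pair-policy", pipeline="pair-pipeline",
                          dest="index_merged"):
    """Return the REST calls, in order, that merge index_b's age into index_a docs."""
    return [
        # 1. Enrich policy: look up index_b documents by matching_pair_key.
        ("PUT", f"/_enrich/policy/{policy}", {
            "match": {
                "indices": "index_b",
                "match_field": "matching_pair_key",
                "enrich_fields": ["age"],
            }
        }),
        # 2. Execute the policy so the enrich index gets built.
        ("POST", f"/_enrich/policy/{policy}/_execute", None),
        # 3. Ingest pipeline that copies the matched fields under "pair".
        ("PUT", f"/_ingest/pipeline/{pipeline}", {
            "processors": [{
                "enrich": {
                    "policy_name": policy,
                    "field": "matching_pair_key",
                    "target_field": "pair",
                }
            }]
        }),
        # 4. Reindex index_a through the pipeline into the merged index.
        ("POST", "/_reindex", {
            "source": {"index": "index_a"},
            "dest": {"index": dest, "pipeline": pipeline},
        }),
    ]


if __name__ == "__main__":
    for method, path, body in build_enrich_requests():
        print(method, path, body)
```

After such a reindex, each document in the merged index would carry status_code plus the looked-up age under pair, so a plain filter on status_code works without any join. Documents that exist only in index_b (keys 3 and 4 in the example) would not appear in the merged index with this direction of enrichment.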

Searching for a field in AWS ElasticSearch

After indexing ddb records into ElasticSearch, when doing a simple search /_search?q=test, I see the hits shown like this
"hits": [
{
// ignore other fields ...
"_id": "z0YdS3I",
"_source": {
"M": {
"name": {
"S": "test name"
},
"age": {
"N": "18"
},
// ignore other fields ...
}
}
},
....
]
However, when I search for a specific field, e.g. /_search?q=name:test, I get zero hits. This happens with every field.
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
So instead I have to search like this: /_search?q=M.name.S:test, which is a bit cumbersome. I just wonder if there's a cleaner way to search for a field. Maybe I'm missing some configuration during the indexing step?
You could try this:
First define mappings for your index as per your requirements, for example:
"name": "text",
"age": "integer",
etc.
Then check that the mappings were applied properly using the /_mapping API. Once you see the data types are applied as you desire, start indexing data.
Details on mappings: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
I found out I could use the DynamoDB Converter provided by the AWS SDK to convert back and forth between a JavaScript object and its equivalent DDB AttributeValue type. That way I can index a document with the right mapping and access it through the normal fields.
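To illustrate what that conversion does, here is a small hand-written Python sketch that unwraps the AttributeValue structure shown in the question before indexing. It is not the AWS SDK converter, only an approximation covering a few common type tags (S, N, BOOL, NULL, L, M); the function name unmarshal is an assumption.

```python
def unmarshal(attr):
    """Convert one DynamoDB AttributeValue dict into a plain Python value."""
    # An AttributeValue has exactly one key: the type tag.
    (tag, value), = attr.items()
    if tag == "S":
        return value
    if tag == "N":
        # Numbers arrive as strings; keep integers as int, everything else as float.
        return int(value) if value.lstrip("-").isdigit() else float(value)
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    if tag == "L":
        return [unmarshal(item) for item in value]
    if tag == "M":
        return {key: unmarshal(item) for key, item in value.items()}
    raise ValueError(f"unsupported attribute type: {tag}")


if __name__ == "__main__":
    record = {"M": {"name": {"S": "test name"}, "age": {"N": "18"}}}
    print(unmarshal(record))
```

Indexing the unwrapped document means the field is stored as name rather than M.name.S, so a query such as /_search?q=name:test can address it directly.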

Count of "actual hits" (not just matching docs) for arbitrary queries in Elasticsearch

This one really frustrates me. I tried to find a solution for quite a long time, but wherever I try to find questions from people asking for the same, they either want something a little different (like here or here or here) or don't get an answer that solves the problem (like here).
What I need
I want to know how many hits my search has in total, independently from the type of query used. I am not talking about the number of hits you always get from ES, which is the number of documents found for that query, but rather the number of occurrences of document features matching my query.
For example, I could have two documents with a text field "description", both containing the word hero, but one of them containing it twice.
Like in this minimal example here:
Index mapping:
PUT /sample
{
"settings": {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
},
"mappings": {
"doc": {
"properties": {
"name": { "type": "keyword" },
"description": { "type": "text" }
}
}
}
}
Two sample documents:
POST /sample/doc
{
"name": "Jack Beauregard",
"description": "An aging hero"
}
POST /sample/doc
{
"name": "Master Splinter",
"description": "This rat is a hero, a real hero!"
}
...and the query:
POST /sample/_search
{
"query": {
"match": { "description": "hero" }
},
"_source": false
}
... which gives me:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.22396864,
"hits": [
{
"_index": "sample",
"_type": "doc",
"_id": "hoDsm2oB22SyyA49oDe_",
"_score": 0.22396864
},
{
"_index": "sample",
"_type": "doc",
"_id": "h4Dsm2oB22SyyA49xDf8",
"_score": 0.22227617
}
]
}
}
So there are two hits ("total": 2), which is correct, because the query matches two documents. BUT I want to know how many times my query matched inside each document (or the sum of this), which would be 3 in this example, because the second document contained the search term twice.
IMPORTANT: This is just a simple example. But I want this to work for any type of query and any mapping, also nested documents with inner_hits and all.
I didn't expect this to be so difficult, because it must be an information ES comes across during search anyway, right? I mean it ranks the documents with more hits inside them higher, so why can't I get the count of these hits?
I am tempted to call them "inner hits", but that is the name of a different ES feature (see below).
What I tried / could try (but it's ugly)
I could use highlighting (which I do anyway) and try to make the highlighter generate one highlight for each "inner match" (without combining them), then post-process the complete set of search results and count all the highlights. Of course, this is very ugly, because (1) I don't really want to post-process my results and (2) I'd have to get all results to do this by setting size to a high enough value, but actually I only want to get the number of results requested by the client. This would be a lot of overhead!
The feature inner_hits sounds very promising, but it just means that you can handle the hits inside nested documents independently to get a highlighting for each of them. I use this for my nested docs already, but it doesn't solve this problem because (1) it persists on inner hit level and (2) I want this to work with non-nested queries, too.
Is there a way to achieve this in a generic way for arbitrary queries? I'd be most thankful for any suggestions. I'm even down for solving it by tinkering with the ranking or using script fields, anything.
Thanks a lot in advance!
I would definitely not recommend this for any kind of practical use due to the awful performance, but this data is technically available in the term frequency calculation in the results from the explain API. See What is Relevance? for a conceptual explanation and Explain API for usage.
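As a rough sketch of how the term frequencies could be pulled out of an explain response, the Python function below walks the nested explanation tree (value / description / details) and sums the nodes that look like raw term frequencies. The description prefixes it checks ("termFreq=" and "freq,") are assumptions about how these nodes are commonly labelled; they differ between Elasticsearch versions and query types, so treat this as an illustration only.

```python
def sum_term_frequencies(explanation):
    """Recursively sum the values of term-frequency nodes in an explain tree."""
    total = 0.0
    description = explanation.get("description", "")
    # Assumed labels for raw term-frequency nodes; adjust for your ES version.
    if description.startswith("termFreq=") or description.startswith("freq,"):
        total += float(explanation.get("value", 0))
    for child in explanation.get("details", []):
        total += sum_term_frequencies(child)
    return total


if __name__ == "__main__":
    # Hand-made tree shaped like an explanation, not real ES output.
    sample = {
        "value": 0.22,
        "description": "weight(description:hero in 1), result of:",
        "details": [{
            "value": 1.2,
            "description": "tfNorm, computed from:",
            "details": [
                {"value": 2.0, "description": "termFreq=2.0", "details": []},
            ],
        }],
    }
    print(sum_term_frequencies(sample))
```

The function would have to be applied to the explanation of every matching document, which is where the performance cost mentioned above comes from.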

Using Timelion in ElasticSearch/Kibana 5.0

I'm trying to visualize a timeseries in Timelion. I have a few hundred datapoints in elasticsearch with this sort of format - I've manually removed some fields which I never meant to use in the timeseries plot.
"_index": "foo-2016-11-06",
"_type": "bar",
"_id": "7239171989271733678",
"_score": 1,
"_source": {
"timestamp": "2016-11-06T15:27:37.123581+00:00",
"rank": 2,
}
What I want is to quite simply plot the change in rank over time. I found this post Kibana Timelion plugin how to specify a field in the elastic search which seems to describe the same thing and I understand I should be able to just do .es(metric='sum:rank').
My problem is that no matter how I define my timelion query (even just calling .es(*)), I end up just getting a horizontal line where y=0.
[screenshot: timelion]
Things I've tried so far:
Changed timefield in timelion.json from #timefield to just timefield
Extending the timeseries window (even into the future)
Set default_index to _all in timelion.json
Queried specific indices that I know contain data
All of them give me the same outcome which you can see in the attached picture. Does anyone have any idea what might be going on here?
Set timelion.json as follows:
{
"quandl": {
"key": ""
},
"es": {
"timefield": "timestamp",
"default_index": "_all",
"allow_url_parameter": false
},
"graphite": {
"url": "https://www.hostedgraphite.com/UID/ACCESS_KEY/graphite"
},
"default_interval": "1h",
"max_buckets": 2000
}
Then set the granularity to 'Auto' and use the following Timelion query: .es(index='foo-2016-11-06', metric='max:rank').

Is it possible to eliminate "empty" facets with Elastic Search?

I've finally managed to get Elastic Search indexing to work the way I want it to work, indexing the raw values of certain fields using subfields and not_analyzed. The facets are what I expect, however, in some cases, due to the source data having null/empty values for those fields, I get results like this in the facets section:
"things": {
"_type": "terms",
"missing": 187,
"total": 12214,
"other": 10608,
"terms": [
{
"term": "foo",
"count": 912
},
{
"term": "",
"count": 532
},
{
"term": "bar",
"count": 37
]
}
}
Note the "" in the second item. I can see why ElasticSearch wouldn't automatically exclude this, as one might want to know how many documents don't have the field. But for my purposes I'd like to just not have this returned.
Is there some way that I can configure ElasticSearch to ignore these, either in the indexing or in the query?
Try putting
"exclude" : ""
in the terms section of your aggregation.
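A small sketch of where that setting would sit, again as a Python dict that can be sent as a search body. The field name things.raw and the helper name are hypothetical; the exclude value is taken as-is from the suggestion above and has not been verified against a particular Elasticsearch version.

```python
import json


def terms_agg_without_empty(field, size=10):
    """Build a search body with a terms aggregation that excludes the "" term."""
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "aggs": {
            "things": {
                "terms": {
                    "field": field,
                    "size": size,
                    "exclude": "",  # as suggested above
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(terms_agg_without_empty("things.raw"), indent=2))
```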
