Trying to search unique results with Elasticsearch and highlighting feature

I am trying to implement an Elasticsearch query that will return highlighted distinct results based on the field queried upon.
I am aware that this isn't natively supported in ES and that most people are satisfied with aggregation features like facets. But since I need more data than a facet can provide, and result highlighting is key to the project, I have been looking for alternative solutions.
I am using the Tire gem for a Rails project, and so far this has been my strategy:
Query Elasticsearch with, on one side, the match query with highlighting, and on the other, a terms facet to collect the distinct names:
{
  "query": {
    "match": {
      "name": {
        "query": "Banana",
        "analyzer": "query_analyzer",
        "operator": "AND"
      }
    }
  },
  "facets": {
    "group_by": {
      "terms": {
        "fields": ["name"],
        "size": 10,
        "all_terms": false
      }
    }
  },
  "highlight": {
    "fields": {
      "name": {}
    },
    "pre_tags": ["<span class=\"highlight\">"],
    "post_tags": ["</span>"]
  },
  "size": 100
}
Cross-reference the unique facet results with the first matching element from the query results, to retrieve both the missing information and the highlighting for each result.
The problem with this approach is that even if I request ten times more query results than my target result size, the cross-reference may still not find 10 unique results among them.
Also, if I disregard the query and cross-reference the facet results with my database, I lose the highlighting.
I was also thinking I could index my data a second time, differently, to enforce uniqueness server-side, but that has proven to be another challenge altogether.
I am running out of ideas, so if anyone sees something I'm missing I would be very grateful for any help.
Edit:
As an example, let's say I have these documents indexed in ES:
[
  { "id": 1, "name": "Banana", "countryOfOrigin": "Banana land" },
  { "id": 2, "name": "Banana", "countryOfOrigin": "Candy mountain" },
  { "id": 3, "name": "Carrot", "countryOfOrigin": "United Kingdom" },
  { "id": 4, "name": "Barrel", "countryOfOrigin": "Canada" }
]
If I then search for "Ba" in the same fashion as the query above, I would expect a result like this:
{
  "_shards": {
    /* ... */
  },
  "hits": {
    "total": 2,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_source": {
          "id": "1",
          "name": "Banana",
          "countryOfOrigin": "Banana land"
        },
        "highlight": {
          "name": ["<span class='highlight'>Ba</span>nana"]
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_source": {
          "id": "4",
          "name": "Barrel",
          "countryOfOrigin": "Canada"
        },
        "highlight": {
          "name": ["<span class='highlight'>Ba</span>rrel"]
        }
      }
    ]
  }
}
This would basically allow me to search for distinct item names in my records.
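The cross-referencing step described in the strategy above can also be done purely client-side on the raw hits: keep only the first hit per distinct name, which preserves that hit's highlighting. A minimal sketch (hypothetical Python helper, assuming the response shape shown above):

```python
def dedupe_hits_by_name(hits, limit=10):
    """Keep only the first hit (with its highlight) per distinct name.

    `hits` is the "hits" list from a raw Elasticsearch response.
    Returns at most `limit` hits with unique names.
    """
    seen = set()
    unique = []
    for hit in hits:
        name = hit["_source"]["name"]
        if name in seen:
            continue  # a hit with this name was already kept
        seen.add(name)
        unique.append(hit)
        if len(unique) == limit:
            break
    return unique

# Hits shaped like the expected response above:
hits = [
    {"_source": {"id": "1", "name": "Banana"},
     "highlight": {"name": ["<span class='highlight'>Ba</span>nana"]}},
    {"_source": {"id": "2", "name": "Banana"},
     "highlight": {"name": ["<span class='highlight'>Ba</span>nana"]}},
    {"_source": {"id": "4", "name": "Barrel"},
     "highlight": {"name": ["<span class='highlight'>Ba</span>rrel"]}},
]
unique = dedupe_hits_by_name(hits)
# Two distinct names survive: Banana (id 1) and Barrel (id 4)
```

This still suffers from the problem described above: if the oversized result window does not contain 10 distinct names, the page comes up short.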

Related

Skipping indexing of some values in an array, but keeping them in _source

In ElasticSearch, I am trying to index documents like:
{
  "tags": [
    {
      "value": "...",
      "quality": 0.7
    },
    ...
  ]
}
I would like _source to contain the full document, but only those values to be indexed which have a quality above some threshold. I read the documentation, and it looks to me that if I modify the input document in any way before indexing (e.g., filter out values), then the modified document will be stored in _source, not the original one.
Is there a way to achieve this?
There is one way to achieve this. First, the tags structure is disabled in the mapping (i.e. not indexed). Then, by leveraging an ingest pipeline, you can create a secondary tags structure (which I called indexedTags) that will only contain the tag elements whose quality value is above a given threshold.
So the mapping should look like this:
PUT test
{
  "mappings": {
    "properties": {
      "tags": {                  <--- won't be indexed at all, but still present in _source
        "enabled": false,
        "properties": {
          "value": { "type": "text" },
          "quality": { "type": "float" }
        }
      },
      "indexedTags": {           <--- will only contain indexable values above the threshold
        "properties": {
          "value": { "type": "text" },
          "quality": { "type": "float" }
        }
      }
    }
  }
}
Then we need to create an ingest pipeline that filters out the right tag values. The following pipeline uses a script processor to create the indexedTags array out of the tags one; it will only contain elements whose quality field is above a defined threshold (e.g. 0.6 in this case):
PUT _ingest/pipeline/quality-threshold
{
  "processors": [
    {
      "script": {
        "source": """
          ctx.indexedTags = ctx.tags.stream()
                               .filter(t -> t.quality > params.threshold)
                               .collect(Collectors.toList());
        """,
        "params": {
          "threshold": 0.6
        }
      }
    }
  ]
}
Finally, we can leverage that ingest pipeline while indexing documents:
PUT test/_doc/1?pipeline=quality-threshold
{
  "tags": [
    {
      "value": "test",
      "quality": 0.5
    },
    {
      "value": "test2",
      "quality": 0.8
    }
  ]
}
When running the above command, the whole tags array will still be present in _source, but it won't be indexed. What will be indexed, however, is another array called indexedTags, which only contains the second element (i.e. test2), because its quality value (0.8) is higher than the 0.6 threshold.
The document looks like this:
{
  "_index": "test",
  "_type": "_doc",
  "_id": "1",
  "_score": 0.2876821,
  "_source": {
    "indexedTags": [
      {
        "value": "test2",
        "quality": 0.8
      }
    ],
    "tags": [
      {
        "value": "test",
        "quality": 0.5
      },
      {
        "value": "test2",
        "quality": 0.8
      }
    ]
  }
}
You can now verify that the first element (test) was not indexed at all by searching for it:
GET test/_search?q=test
=> no results
While searching for test2 retrieves your document:
GET test/_search?q=test2
=> returns document 1
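The script processor's filtering logic is easy to mirror client-side, which can be handy for testing the threshold before wiring up the pipeline. A minimal sketch of the same filter (hypothetical Python helper, not part of the answer's pipeline):

```python
def filter_tags(doc, threshold=0.6):
    """Mimic the ingest script: copy tags whose quality exceeds the
    threshold into a secondary indexedTags list, leaving tags intact."""
    doc["indexedTags"] = [t for t in doc["tags"] if t["quality"] > threshold]
    return doc

# Same document as in the PUT above:
doc = {"tags": [{"value": "test",  "quality": 0.5},
                {"value": "test2", "quality": 0.8}]}
filter_tags(doc)
# doc["indexedTags"] now holds only the test2 tag; doc["tags"] is unchanged
```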

Aggregate by fields in _source

I have an index in elasticsearch with documents that look like this:
"hits": [
  {
    "_index": "my-index2",
    "_type": "my-type",
    "_id": "1",
    "_score": 1,
    "_source": {
      "entities": {
        "persons": ["Kobe Bryant", "Michael Jordan"],
        "dates": ["Yesterday"],
        "locations": ["Munich", "New York"]
      },
      "my_field": "Kobe Bryant was one of the best basketball players of all times. Not even Michael Jordan has ever scored 81 points in one game. Munich is really an awesome city, but New York is as well. Yesterday has been the hottest day of the year."
    }
  }
]
Is it possible to use an aggregation to aggregate by fields in the entities object? I tried this and it didn't work:
{
  "aggs": {
    "avg_date": {
      "avg": {
        "script": {
          "source": "doc.entities.dates"
        }
      }
    }
  }
}
The error said that my index doesn't have an entities field.
EDIT: With the following terms aggregation query:
{
  "aggs": {
    "dates": {
      "terms": { "field": "entities.dates" }
    }
  }
}
I get an error saying:
Fielddata is disabled on text fields by default. Set fielddata=true on [entities.dates] in order to load fielddata in memory by uninverting the inverted index.
I can set fielddata=true as the error suggests, but the documentation warns against this because it uses a lot of heap space. Is there another way I can do this query?
EDIT 2: Solved this by setting all fields in entities to keywords in the index.
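For reference, the fix from EDIT 2 amounts to mapping the entities sub-fields as keyword, which makes them aggregatable without fielddata. A sketch of what that mapping could look like (hypothetical, built as a Python dict so the structure is explicit and checkable):

```python
# Hypothetical mapping: entities.* as keyword so terms aggregations
# work without enabling fielddata on text fields.
mapping = {
    "mappings": {
        "properties": {
            "entities": {
                "properties": {
                    "persons":   {"type": "keyword"},
                    "dates":     {"type": "keyword"},
                    "locations": {"type": "keyword"},
                }
            },
            # The free-text field stays analyzed for full-text search.
            "my_field": {"type": "text"},
        }
    }
}

# The terms aggregation from the question then works unchanged:
agg = {"aggs": {"dates": {"terms": {"field": "entities.dates"}}}}
```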

Elastic Search get top grouped sums with additional filters (Elasticsearch version 5.3)

This is my mapping:
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "cpt_logs_mapping": {
      "properties": {
        "channel_id":        { "type": "integer", "store": "yes", "index": "not_analyzed" },
        "playing_date":      { "type": "string",  "store": "yes", "index": "not_analyzed" },
        "country_code":      { "type": "text",    "store": "yes", "index": "analyzed" },
        "playtime_in_sec":   { "type": "integer", "store": "yes", "index": "not_analyzed" },
        "channel_name":      { "type": "text",    "store": "yes", "index": "analyzed" },
        "device_report_tag": { "type": "text",    "store": "yes", "index": "analyzed" }
      }
    }
  }
}
I want to query the index similar to the way I do using the following MySQL query :
SELECT
channel_name,
SUM(`playtime_in_sec`) as playtime_in_sec
FROM
channel_play_times_bar_chart
WHERE
country_code = 'country' AND
device_report_tag = 'device' AND
channel_name = 'channel'
playing_date BETWEEN 'date_range_start' AND 'date_range_end'
GROUP BY channel_id
ORDER BY SUM(`playtime_in_sec`) DESC
LIMIT 30;
So far my query DSL looks like this:
{
  "size": 0,
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "channel_id",
        "size": 30,
        "order": {
          "sum_agg": "desc"
        }
      },
      "aggs": {
        "sum_agg": {
          "sum": {
            "field": "playtime_in_sec"
          }
        }
      }
    }
  }
}
QUESTION 1
The query DSL above does return the top 30 channel_ids by playtime, but I am confused about how to add the other filters to the search, i.e. country_code, device_report_tag and playing_date.
QUESTION 2
Another issue is that the result set contains only the channel_id and playtime fields, unlike the MySQL result set which returns the channel_name and playtime_in_sec columns. That is, I want to aggregate on the channel_id field, but the result set should also return the corresponding channel_name of each group.
NOTE: Performance is a top priority here, as this is supposed to run behind a graph generator querying millions of docs or more.
TEST DATA
hits: [
  {
    _index: "cpt_logs_index",
    _type: "cpt_logs_mapping",
    _id: "",
    _score: 1,
    _source: {
      ChID: 1453,
      playtime_in_sec: 35,
      device_report_tag: "mydev",
      channel_report_tag: "Sony Six",
      country_code: "SE",
      #timestamp: "2017-08-11"
    }
  },
  {
    _index: "cpt_logs_index",
    _type: "cpt_logs_mapping",
    _id: "",
    _score: 1,
    _source: {
      ChID: 145,
      playtime_in_sec: 25,
      device_report_tag: "mydev",
      channel_report_tag: "Star Movies",
      country_code: "US",
      #timestamp: "2017-08-11"
    }
  },
  {
    _index: "cpt_logs_index",
    _type: "cpt_logs_mapping",
    _id: "",
    _score: 1,
    _source: {
      ChID: 12,
      playtime_in_sec: 15,
      device_report_tag: "mydev",
      channel_report_tag: "HBO",
      country_code: "PK",
      #timestamp: "2017-08-12"
    }
  }
]
QUESTION 1:
Are you looking to add a filter/query to the example above? If so, you can simply add a "query" node to the query document:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "terms": { "country_code": ["pk", "us", "se"] } },
        { "range": { "#timestamp": { "gt": "2017-01-01", "lte": "2017-08-11" } } }
      ]
    }
  },
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "ChID",
        "size": 30
      },
      "aggs": {
        "ch_report_tag_agg": {
          "terms": {
            "field": "channel_report_tag.keyword"
          },
          "aggs": {
            "sum_agg": {
              "sum": {
                "field": "playtime_in_sec"
              }
            }
          }
        }
      }
    }
  }
}
You can use all of Elasticsearch's normal queries/filters to pre-filter your search before it starts aggregating. (Regarding performance: Elasticsearch applies any filters/queries before starting to aggregate, so any filtering you can do here helps a lot.)
QUESTION 2:
Off the top of my head I would suggest one of two solutions (unless I'm completely misunderstanding the question):
1. Add aggs levels for the fields you want in the output, in the order you want to drill down. (You can nest aggs within aggs quite deeply without issues, and you get a count on each level as a bonus.)
2. Use the top_hits aggregation on the "lowest" level of aggs, and specify which fields you want in the output using "_source": { "include": [/fields/] }
Can you provide a few records of test data?
Also, it is useful to know which version of ElasticSearch you're running as the syntax and behaviour change a lot between major versions.
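As a sketch of option 2, a top_hits sub-aggregation could surface channel_name alongside each channel_id bucket without a second lookup. This is a hypothetical combination of the question's field names with the answer's suggestion, built as a Python dict so the structure is explicit:

```python
# Sketch: terms agg on channel_id ordered by summed playtime, plus a
# single top_hits per bucket to expose channel_name in the response.
query = {
    "size": 0,
    "aggs": {
        "ch_agg": {
            "terms": {
                "field": "channel_id",
                "size": 30,
                "order": {"sum_agg": "desc"},
            },
            "aggs": {
                "sum_agg": {"sum": {"field": "playtime_in_sec"}},
                "channel": {
                    "top_hits": {
                        "size": 1,  # one representative doc per bucket
                        "_source": {"include": ["channel_name"]},
                    }
                },
            },
        }
    },
}
```

Each bucket in the response would then carry a one-hit `channel.hits` section containing the group's channel_name, mirroring the MySQL SELECT list.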

Elasticsearch 5.1: applying additional filters to the "more like this" query

Building a search engine on top of emails. MLT is great at finding emails with similar bodies or subjects, but sometimes I want to do something like: show me the emails with content similar to this one, but only from joe#yahoo.com and only within this date range. This seems to have been possible with ES 2.x, but 5.x doesn't seem to allow filtering on fields other than the one being considered for similarity. Am I missing something?
I still can't figure out how to do what I described. Imagine I have an index of emails with two fields for the sake of simplicity: body and sender. I know that to find messages restricted to a sender, the posted query would be something like:
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "sender": "mike#foo.com"
              }
            }
          ]
        }
      }
    }
  }
}
Similarly, to find messages that are similar to a single hero message based on the contents of the body, I can issue a query like:
{
  "query": {
    "more_like_this": {
      "fields": ["body"],
      "like": [
        {
          "_index": "foo",
          "_type": "email",
          "_id": "a1af33b9c3dd436dabc1b7f66746cc8f"
        }
      ],
      "min_doc_freq": 2,
      "min_word_length": 2,
      "max_query_terms": 12,
      "include": "true"
    }
  }
}
Both of these queries specify their constraints inside the query clause of the root object. However, any way I try to put them together gives me parse exceptions. I can't find any examples or documentation that would say: give me emails that are similar to this hero message, but only from mike#foo.com.
You're almost there, you can combine them both using a bool/filter query like this, i.e. make an array out of your filter and put both constraints in there:
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "sender": "mike#foo.com"
          }
        },
        {
          "more_like_this": {
            "fields": ["body"],
            "like": [
              {
                "_index": "foo",
                "_type": "email",
                "_id": "a1af33b9c3dd436dabc1b7f66746cc8f"
              }
            ],
            "min_doc_freq": 2,
            "min_word_length": 2,
            "max_query_terms": 12,
            "include": "true"
          }
        }
      ]
    }
  }
}

Elasticsearch shuffle index sorting

Thanks in advance. I'll describe the situation first and give the solution at the end.
I have a collection of 2M documents with the following mapping:
{
  "image": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "dateOptionalTime"
      },
      "title": { "type": "string" },
      "url": { "type": "string" }
    }
  }
}
I have a webpage which paginates through all the documents with the following search:
{
  "from": STARTING_POSITION_NUMBER,
  "size": 15,
  "sort": [
    { "_id": { "order": "desc" } }
  ],
  "query": {
    "match_all": {}
  }
}
And a hit looks like this (note that the _id value is a hash of the url, to prevent duplicate documents):
{
  "_index": "images",
  "_type": "image",
  "_id": "2a750a4817bd1600",
  "_score": null,
  "_source": {
    "url": "http://test.test/test.jpg",
    "timestamp": "2014-02-13T17:01:40.442307",
    "title": "Test image!"
  },
  "sort": [
    null
  ]
}
This works pretty well. The only problem I have is that the documents appear in chronological order (the oldest documents on the first page, the most recently indexed ones on the last page), but I want them to appear in a random order. For example, page 10 should always show the same N documents, but they don't have to be sorted by date.
I thought of sorting all the documents by their hash, which is kind of random yet deterministic. How could I do that?
I've searched the docs, and the sorting API only sorts the results, not the full index. If I don't find a solution I will pick documents randomly and index them in a separate collection.
Thank you.
I solved it using the following search:
{
"from":STARTING_POSITION_NUMBER,
"size":15,
"query" : {
"function_score": {
"random_score": {
"seed" : 1
}
}
}
}
Thanks to David from the Elasticsearch mailing list for pointing out the function score with random scoring.
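The key property making this work is that random_score is deterministic for a fixed seed, so from/size pagination stays stable as long as every page request reuses the same seed. A sketch of a page-builder helper (hypothetical Python; note that recent Elasticsearch versions also require a "field" alongside "seed" in random_score, which the era of this answer did not):

```python
def random_order_page(page, page_size=15, seed=1):
    """Build a function_score/random_score query body for one page of a
    stable shuffled ordering: the same seed yields the same order on
    every request, so pages never overlap or repeat documents."""
    return {
        "from": page * page_size,
        "size": page_size,
        "query": {"function_score": {"random_score": {"seed": seed}}},
    }

q = random_order_page(10)
# q["from"] is 150: page 10 starts after ten 15-document pages
```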
