get all similar documents in the entire index - elasticsearch

Is there a way to find documents that match a query when the query has no specific values?
For example, I have index person with mapping:
{
"properties": {
"fullname": {
"type": "text"
},
"email": {
"type": "keyword"
}
}
}
I have a query to find similar persons:
{
"query": {
"bool": {
"must": [
{
"match": {
"fullname": "Foo Bar"
}
},
{
"term": {
"email": "foobar#gmail.com"
}
}
]
}
}
}
It works for finding persons similar to one specific person.
Is there a way to find all persons in the index that are similar to each other? Maybe some kind of aggregation?
It would be helpful for setting up an alert when new similar documents appear.

First off, defining what's similar is arbitrary -- but you may want to look into fuzzy match queries.
Secondly, when you query using term on a keyword field, your results will be restricted to exact matches -- somewhat defeating the purpose of similar persons.
Finally, aggregations operate on concrete values so once you've found your similar persons using the match query, you can aggregate in multiple ways but you've 'lost' the fuzziness aspect, and rightly so.
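As a rough illustration of the first point, a match query can be given a fuzziness setting so that small typos in the name still match. The sketch below only builds the request body as a Python dict; the helper name fuzzy_name_query is made up for this example, and "AUTO" is just one reasonable fuzziness value.

```python
import json


def fuzzy_name_query(name, fuzziness="AUTO"):
    """Build a search body that matches `fullname` with edit-distance tolerance.

    `fuzzy_name_query` is a hypothetical helper, not part of any client library.
    """
    return {
        "query": {
            "match": {
                "fullname": {
                    "query": name,
                    "fuzziness": fuzziness,
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON that would be sent to the _search endpoint.
    print(json.dumps(fuzzy_name_query("Foo Bar"), indent=2))
```

The printed JSON can be pasted into the same GET .../_search request used elsewhere in this answer.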
Side note: when you intend to aggregate on text fields like fullname, you can either set fielddata: true on that field or add another subfield with the keyword mapping like so:
...
"fullname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
...
In concrete terms, then, after ditching the term query, we can proceed as follows:
GET similar/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"fullname": "Foo Bar"
}
}
]
}
},
"aggs": {
"by_email": {
"terms": {
"field": "email"
}
},
"by_name": {
"terms": {
"field": "fullname.keyword"
},
"aggs": {
"by_email": {
"terms": {
"field": "email"
}
}
}
}
}
}
The by_email aggregation gives us the top 10 emails associated with the persons matching the query, ordered by the number of occurrences of those emails. I suspect this won't help because emails are ... unique ;)
The by_name aggregation is more useful -- there may be lots of people called "Foo Bar" and the sub-aggregation, also called by_email will give you their emails.
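To get closer to "all similar persons in the index" rather than persons similar to one given name, one option is to drop the query and let the terms aggregation return only names that occur more than once, using min_doc_count. This groups exactly equal names only (it relies on the fullname.keyword sub-field), so treat it as a sketch of duplicate detection, not fuzzy similarity. The function names and the bucket size of 100 are assumptions made for this example.

```python
def duplicate_names_body(min_docs=2, size=100):
    """Request body: names that appear in at least `min_docs` documents."""
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "by_name": {
                "terms": {
                    "field": "fullname.keyword",
                    "min_doc_count": min_docs,
                    "size": size,
                },
                "aggs": {"by_email": {"terms": {"field": "email"}}},
            }
        },
    }


def duplicate_groups(response):
    """Turn the aggregation response into {name: [emails]}."""
    groups = {}
    for bucket in response["aggregations"]["by_name"]["buckets"]:
        emails = [b["key"] for b in bucket["by_email"]["buckets"]]
        groups[bucket["key"]] = emails
    return groups
```

Each key in the returned dict is a name shared by several documents, with the emails that belong to it.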
Alerting is an entirely different topic -- feel free to ask another question.

Related

Nested Fields, Wildcard Queries and Aggregations in Elasticsearch

I have an index that collects web redirects data for various sites. I am using a nested field to collect the data as shown in the mapping below:
"chain": {
"type": "nested",
"properties": {
"url.position": {
"type": "long"
},
"url.full": {
"type": "text"
},
"url.domain": {
"type": "keyword"
},
"url.path": {
"type": "keyword"
},
"url.query": {
"type": "text"
}
}
}
As you can imagine, each document contains an array of url chains, the size of the array being equal to number of web redirects. I want to get aggregations based on wildcard/regexp matches to url.query field. Here is a sample query:
GET push_url_chain/_search
{
"query": {
"nested": {
"path": "chain",
"query": {
"regexp": {
"chain.url.query": "aff_c.*"
}
}
}
},
"size": 0,
"aggs": {
"dataFields": {
"nested": {
"path": "chain"
},
"aggs": {
"offers": {
"terms": {
"field": "chain.url.domain",
"size": 30
}
}
}
}
}
}
The above query does produce aggregated results but not the way I want.
I want to see chain.url.domain aggregations for the urls that contain the aff_c.* phrase. Right now it is looking at all the urls in the chain and then aggregating the buckets by doc_count regardless of whether that url/domain has the particular phrase. I hope I have been able to explain this clearly. How do I get my results to show bucket aggregations that contain domains that have aff_c.* phrase match to the query field of the url.
I would also like to know how I can use = or / in my wildcard or regexp queries. It is not producing any results if I use the above symbols in my queries.
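A nested query returns every document in which at least one nested document matches the condition; the matching nested documents themselves are only available through inner_hits.
The aggregation then runs over all nested documents of those parent documents, which is why every domain shows up in the terms buckets.
To get only the matching terms, add a filter aggregation on the url query inside the nested aggregation and put the terms aggregation under it: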
{
"size": 0,
"aggs": {
"Name": {
"nested": {
"path": "chain"
},
"aggs": {
"matched_doc": {
"filter": {
"match_phrase_prefix": {
"chain.url.query": "abc"
}
},
"aggs": {
"domain": {
"terms": {
"field": "chain.url.domain",
"size": 10
}
}
}
}
}
}
}
}
You can use match_phrase_prefix instead of regexp; it performs better.
The standard analyzer drops characters such as "/" and "=" when it generates tokens, so a regexp or wildcard query containing them cannot match on a text field. To search for these characters, query a keyword field instead.
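The difference between filtering in the query and filtering inside the nested aggregation can be simulated in plain Python. This is only a toy model under stated assumptions: the documents are invented, the field names are shortened to domain and query, and each query value is treated as a single untokenized string (as a keyword field would be), with the pattern required to match the whole value.

```python
import re
from collections import Counter

# Hypothetical documents shaped like the `chain` nested field in the question.
DOCS = [
    {"chain": [
        {"domain": "a.com", "query": "aff_c=1"},
        {"domain": "b.com", "query": "utm=x"},
    ]},
    {"chain": [
        {"domain": "c.com", "query": "ref=2"},
    ]},
]

PATTERN = re.compile(r"aff_c.*")


def domains_query_only(docs):
    """Nested query picks parent docs; the aggregation then counts every hop."""
    counts = Counter()
    for doc in docs:
        if any(PATTERN.fullmatch(hop["query"]) for hop in doc["chain"]):
            counts.update(hop["domain"] for hop in doc["chain"])
    return counts


def domains_filtered_agg(docs):
    """Filter inside the nested aggregation: only matching hops are counted."""
    counts = Counter()
    for doc in docs:
        counts.update(
            hop["domain"] for hop in doc["chain"] if PATTERN.fullmatch(hop["query"])
        )
    return counts
```

With the sample data, the first function also counts b.com because it sits in the same parent document as the matching hop, while the second counts a.com only.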

Elasticsearch ranking aggregation with multiple terms query

tl;dr: Want to rank aggregations based on whether bucket key has used either of the search terms.
I have two indices documents and recommendations with the following mappings:
Documents:
{
"id": string,
"document_text" : string,
"author" : { "name": string }
...other fields
}
Recommendations:
{
"id": string,
"recommendation_text" : string,
"author" : { "name": string }
...other fields
}
The problem I am solving is to have top authors for query terms.
This works quite well with multimatch for a single query term like this:
{
"size": 0,
"query": {
"multi_match": {
"query": "science",
"fields": [
"document_text",
"recommendation_text"
],
"type": "phrase"
}
},
"aggs": {
"search-authors": {
"terms": {
"field": "author.name.keyword",
"size": 50
},
"aggs": {
"top-docs": {
"top_hits": {
"size": 100
}
}
}
}
}
}
But when I have multiple keywords, say zoology and botany, I want the aggregation ranking to place authors who have talked about both zoology and botany higher than those who have used only one of them.
Having multiple multi_match clauses in a bool query doesn't help, since this isn't exactly an and/or situation.
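One possible direction, offered as a sketch and not as a tested answer: under each author bucket, add a filters sub-aggregation with one named filter per search term, then re-rank the author buckets client-side by how many terms each author covers. The helper names, the sub-aggregation name per_term, and the tie-break on total document count are all assumptions made for this example.

```python
FIELDS = ["document_text", "recommendation_text"]


def author_aggs(terms, size=50):
    """Aggregation part of the request: per-author counts for each search term."""
    per_term = {
        term: {"multi_match": {"query": term, "fields": FIELDS, "type": "phrase"}}
        for term in terms
    }
    return {
        "search-authors": {
            "terms": {"field": "author.name.keyword", "size": size},
            "aggs": {"per_term": {"filters": {"filters": per_term}}},
        }
    }


def rank_authors(buckets):
    """Order author buckets by number of terms covered, then by document count."""

    def score(bucket):
        counts = [b["doc_count"] for b in bucket["per_term"]["buckets"].values()]
        covered = sum(1 for count in counts if count > 0)
        return (covered, bucket["doc_count"])

    return [bucket["key"] for bucket in sorted(buckets, key=score, reverse=True)]
```

The buckets passed to rank_authors would be the list found under aggregations, search-authors, buckets in the response. An author who matched both zoology and botany then sorts ahead of one with more documents on a single term.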

How to sort elasticsearch results based on number of collapsed items?

I'm using a query with collapse to gather documents under a certain person, but I want to sort the results by the number of documents in which the search found a match. This is my query:
GET documents/_search
{
"_source": {
"includes": [
"text"
]
},
"query": {
"query_string": {
"fields": [
"text"
],
"query": "some text"
}
},
"collapse": {
"field": "person_id",
"inner_hits": {
"name": "top_mathing_docs",
"_source": {
"includes": [
"doc_year",
"text"
]
}
}
}
}
Any suggestions?
Thanks
If I understand correctly, you want to sort the parent documents by the number of inner_hits, i.e. the number of matching documents per person_id.
That means the _score of the parent documents in the result doesn't matter.
The only way I've found to do this is a terms aggregation with a top_hits sub-aggregation, as in the "Top Hits Aggregation for Field Collapse" example in the documentation. Terms buckets are ordered by document count, highest first, by default, which gives the sort you are after. Below is what your query would look like.
Aggregation Query Field Collapse Example:
POST <your_index_name>/_search
{
"size":0,
"query": {
"query_string": {
"fields": [
"text"
],
"query": "some text"
}
},
"aggs": {
"top_person_ids": {
"terms": {
"field": "person_id"
},
"aggs": {
"top_tags_hits": {
"top_hits": {
"size": 10
}
}
}
}
}
}
Note that I'm assuming person_id is of type keyword or any numeric.
Also, if you look at the query closely, I've set "size": 0, which means only the result of the aggregation is returned.
Another note is that the above aggregation has nothing to do with Field Collapse in Search Request feature that you have posted in the question. It's just that using this aggregation, your result could be formatted in a similar way.
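For completeness, here is a small sketch of how the same request could be built and read from Python. The function names are hypothetical, and the explicit order clause only spells out what the terms aggregation already does by default.

```python
def persons_by_count_body(text, hits_per_person=10):
    """Request body: one bucket per person_id, most matching documents first."""
    return {
        "size": 0,
        "query": {"query_string": {"fields": ["text"], "query": text}},
        "aggs": {
            "top_person_ids": {
                "terms": {
                    "field": "person_id",
                    "order": {"_count": "desc"},  # same as the default ordering
                },
                "aggs": {
                    "top_tags_hits": {"top_hits": {"size": hits_per_person}}
                },
            }
        },
    }


def persons_by_match_count(response):
    """Return [(person_id, number_of_matching_docs), ...] in bucket order."""
    buckets = response["aggregations"]["top_person_ids"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]
```

The matching documents for each person are in the top_tags_hits part of each bucket.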
Let me know if this helps!

I don't get any documents back from my elasticsearch query. Can someone point out my mistake?

I thought I had figured out Elasticsearch but I suspect I have failed to grok something, and hence this problem:
I am indexing products, which have a huge number of fields, but the ones in question are:
{
"show_in_catalogue": {
"type": "boolean",
"index": "no"
},
"prices": {
"type": "object",
"dynamic": false,
"properties": {
"site_id": {
"type": "integer",
"index": "no"
},
"currency": {
"type": "string",
"index": "not_analyzed"
},
"value": {
"type": "float"
},
"gross_tax": {
"type": "integer",
"index": "no"
}
}
}
}
I am trying to return all documents where "show_in_catalogue" is true, and there is a price with site_id 1:
{
"filter": {
"term": {
"prices.site_id": "1",
"show_in_catalogue": true
}
},
"query": {
"match_all": {}
}
}
This returns zero results. I also tried an "and" filter with two separate terms - no luck.
A subset of one of the documents returned if I have no filters looks like:
{
"prices": [
{
"site_id": 1,
"currency": "GBP",
"value": 595,
"gross_tax": 1
},
{
"site_id": 2,
"currency": "USD",
"value": 745,
"gross_tax": 0
}
]
}
I hope I am OK to omit so much of the document here; I don't believe it to be contingent but I cannot be certain, of course.
Have I missed a vital piece of knowledge, or have I done something terminally thick? Either way, I would be grateful for an expert's knowledge at this point. Thanks!
Edit:
At the suggestion of J.T. I also tried reindexing the documents so that prices.site_id was indexed - no change. Also tried the bool/must filter below to no avail.
To clarify, the reason I'm using an empty query is that the web interface may supply a query string, but the same code is used to simply filter all products. Hence I left in the query, but empty, since that's what Elastica seems to produce with no query string.
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"term": {
"show_in_catalogue": true
}
},
{
"term": {
"prices.site_id": 1
}
}
]
}
}
}
}
}
You have site_id set as {"index": "no"}. This tells ElasticSearch to exclude the field from the index which makes it impossible to query or filter on that field. The data will still be stored. Likewise, you can set a field to only be in the index and searchable, but not stored.
I'm new to ElasticSearch as well and can't always grok the questions! I'm actually confused by your query. If you are going to "just filter", then you don't need a query. What I don't understand is your use of two fields inside the term filter. I've never done this. I guess it acts as an OR? Also, if nothing matches, it seems to return everything. If you want a query whose results are then filtered, use a filtered query:
-d '{
"query": {
"filtered": {
"query": {},
"filter": {}
}
}
}'
If you just want to apply filters, this is the filter that should work without any "query":
-d '{
"filter": {
"bool": {
"must": [
{
"term": {
"show_in_catalogue": true
}
},
{
"term": {
"prices.site_id": 1
}
}
]
}
}
}'
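The filtered query and the top-level filter shown above belong to old Elasticsearch versions: filtered was deprecated in 2.0 and removed in 5.0 in favour of a bool query with a filter clause. A minimal sketch of the equivalent request body, built as a Python dict, is below. The function name is made up for this example, and the query can only match if both fields are actually indexed.

```python
import json


def catalogue_filter_body(site_id):
    """Bool query with two non-scoring filter clauses (both must match)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"show_in_catalogue": True}},
                    {"term": {"prices.site_id": site_id}},
                ]
            }
        }
    }


if __name__ == "__main__":
    # json.dumps renders Python's True as JSON true.
    print(json.dumps(catalogue_filter_body(1), indent=2))
```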

Returning a partial nested document in ElasticSearch

I'd like to search an array of nested documents and return only those that fit specific criteria.
An example mapping would be:
{
"book": {
"properties": {
"title": {"type": "string"},
"chapters": {
"type": "nested",
"properties": {
"title": {"type": "string"},
"length": {"type": "long"}
}
}
}
}
}
So, say I want to look for chapters titled "epilogue".
Not all the books have such a chapter, but if I use a nested query I get, as a result, all the chapters of any book that has such a chapter, while all I'm interested in is the chapters that have such a title.
I'm mainly concerned about i/o and net traffic since there might be a lot of chapters.
Also, is there a way of retrieving ONLY the nested document, without the containing doc?
This is a very old question I stumbled upon, so I'll show two different approaches to how this can be handled.
Let's prepare index and some test data first:
PUT /bookindex
{
"mappings": {
"book": {
"properties": {
"title": {
"type": "string"
},
"chapters": {
"type": "nested",
"properties": {
"title": {
"type": "string"
},
"length": {
"type": "long"
}
}
}
}
}
}
}
PUT /bookindex/book/1
{
"title": "My first book ever",
"chapters": [
{
"title": "epilogue",
"length": 1230
},
{
"title": "intro",
"length": 200
}
]
}
PUT /bookindex/book/2
{
"title": "Book of life",
"chapters": [
{
"title": "epilogue",
"length": 17
},
{
"title": "toc",
"length": 42
}
]
}
Now that we have this data in Elasticsearch, we can retrieve just the relevant hits using inner_hits. This approach is very straightforward, but I prefer the approach outlined at the end.
# Inner hits query
POST /bookindex/book/_search
{
"_source": false,
"query": {
"nested": {
"path": "chapters",
"query": {
"match": {
"chapters.title": "epilogue"
}
},
"inner_hits": {}
}
}
}
The nested query with inner_hits returns the parent documents, and each hit contains an inner_hits object with all of the matching nested documents, including scoring information.
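To show where the matching chapters end up in that response, here is a small Python sketch that walks the structure. It assumes inner_hits was left unnamed, so the entry is keyed by the nested path (chapters); the function name is hypothetical.

```python
def matching_chapters(response, path="chapters"):
    """Collect (parent_id, chapter) pairs from an inner_hits search response."""
    chapters = []
    for hit in response["hits"]["hits"]:
        # Each parent hit carries its matching nested docs under inner_hits.
        inner = hit.get("inner_hits", {}).get(path, {})
        for nested_hit in inner.get("hits", {}).get("hits", []):
            chapters.append((hit["_id"], nested_hit["_source"]))
    return chapters
```

Because the request sets "_source": false, the parent hits carry no source of their own; only the nested chapters are read here.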
My preferred approach to this type of query is using a nested aggregation with filtered sub aggregation which contains top_hits sub aggregation. The query looks like:
# Nested and filter aggregation
POST /bookindex/book/_search
{
"size": 0,
"aggs": {
"nested": {
"nested": {
"path": "chapters"
},
"aggs": {
"filter": {
"filter": {
"match": { "chapters.title": "epilogue" }
},
"aggs": {
"t": {
"top_hits": {
"size": 100
}
}
}
}
}
}
}
}
The top_hits sub aggregation is the one doing the actual retrieving of nested documents and supports from and size properties among others. From the documentation:
"If the top_hits aggregator is wrapped in a nested or reverse_nested aggregator then nested hits are being returned. Nested hits are in a sense hidden mini documents that are part of regular document where in the mapping a nested field type has been configured. The top_hits aggregator has the ability to un-hide these documents if it is wrapped in a nested or reverse_nested aggregator. Read more about nested in the nested type mapping."
The response from Elasticsearch is (IMO) prettier and "easier" to parse, and it seems to come back faster, though that is not a scientific observation.
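As an illustration of the "easier to parse" point, the nested chapters can be pulled out of the aggregation response with one lookup. The keys nested, filter and t are the aggregation names used in the query above; the function name is made up for this sketch.

```python
def chapters_from_aggregation(response):
    """Return the nested chapter objects found by the top_hits sub aggregation."""
    hits = response["aggregations"]["nested"]["filter"]["t"]["hits"]["hits"]
    # For nested hits, _source holds just the nested object (the chapter).
    return [hit["_source"] for hit in hits]
```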
