Elasticsearch sort parent by inner hits doc count - elasticsearch

Let's say I am indexing into Elasticsearch a bunch of Products and Stores in which the product is available. For example, a document looks something like:
{
  "name": "iPhone 6s",
  "price": 600.0,
  "stores": [
    {
      "name": "Apple Store Union Square",
      "location": "San Francisco, CA"
    },
    {
      "name": "Target Cupertino",
      "location": "Cupertino, CA"
    },
    {
      "name": "Apple Store 5th Avenue",
      "location": "New York, NY"
    }
    ...
  ]
}
and using the nested type, the mappings will be:
"mappings" : {
"product" : {
"properties" : {
"name" : {
"type" : "string"
},
"price" : {
"type" : "float"
},
"stores" : {
"type" : "nested",
"properties" : {
"name" : {
"type" : "string"
},
"location" : {
"type" : "string"
}
}
}
}
}
}
I want to create a query to find all the products that are available in a certain location, say "CA", and then sort them by the number of stores matched. I know Elasticsearch has an inner hits feature which lets me find hits in the nested Store documents, but is it possible to sort Products based on the doc_count of the inner hits? And to extend the question further, is it possible to sort the parent documents based on some inner aggregation? Thanks in advance.

What you are trying to achieve is possible. You are currently not getting the expected results because, by default, the score_mode parameter of the nested query is avg: a product matching 5 stores might be scored lower than one matching only 2 stores, because the _score is calculated by averaging the scores of the matching nested documents.
This can be solved by summing the scores of all inner hits instead, i.e. by setting score_mode to sum. One minor remaining problem is the field length norm: a match in a shorter field gets a higher score than one in a longer field, so in your example "Cupertino, CA" will score slightly higher than "San Francisco, CA". You can verify this behavior with inner hits. To solve it, you need to disable field norms. Change the location mapping to:
"location": {
"type": "string",
"norms": {
"enabled": false
}
}
After that, this query will give you the desired results. I included inner_hits to demonstrate that every matched nested doc now gets an equal score.
{
  "query": {
    "nested": {
      "path": "stores",
      "query": {
        "match": {
          "stores.location": "CA"
        }
      },
      "score_mode": "sum",
      "inner_hits": {}
    }
  }
}
This will sort the products based on the number of stores matched.
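If you prefer not to touch the mapping, an alternative (just a sketch along the same lines, not from the original answer) is to wrap the inner match in a constant_score query so that every matching store contributes exactly 1.0; with score_mode set to sum, the parent _score then equals the number of matching stores and field norms no longer matter:
{
  "query": {
    "nested": {
      "path": "stores",
      "score_mode": "sum",
      "query": {
        "constant_score": {
          "filter": {
            "match": {
              "stores.location": "CA"
            }
          }
        }
      },
      "inner_hits": {}
    }
  }
}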
Hope this helps!

Related

Get bucket key within scripted_metric

Is there any way I can grab a bucket's key from within a scripted_metric?
I have an issue where I need to grab some specific data from within a document that is being aggregated.
This is an example of the document I am working with:
{
  "attr1": "thing",
  "groups": [
    {
      "id": 1,
      "name": "foo"
    },
    {
      "id": 2,
      "name": "bar"
    },
    {
      "id": 3,
      "name": "baz"
    }
  ],
  "otherAttrs": true
}
Figure 1 (Document structure)
I am doing a terms aggregation on the distinct group IDs, but within each bucket, I'd like to put the name of the group that is represented by the bucket_key (which would be the id).
This is an example of the terms aggregation I am using:
{
  "terms": {
    "execution_hint": "global_ordinals_hash",
    "field": "actors.groups.id",
    "min_doc_count": 1
  }
}
Figure 2 (Terms Aggregation to create buckets where I am trying to set name as a field)
So ideally my response would look something like this:
{
  "...": "...",
  "buckets" : [
    {
      "key" : 1,
      "group_name": "foo",
      "doc_count" : 42684,
      "measure 0" : {
        "value" : 37180
      },
      "measure 3" : {
        "doc_count" : 37180,
        "measure 3" : { "value" : 68 }
      },
      "measure 4" : {
        "doc_count" : 3008,
        "measure 4" : {
          "value" : 3008
        }
      }
    }
  ]
}
Figure 3 (Ideal Response format)
Notice how the key corresponds to the name found in Figure 1.
So I am currently receiving a response similar to Figure 3 (without group_name) and I cannot for the life of me figure out how to extract the name field because it's within a document being aggregated.
Due to the nature of the documents I'm working with, this has to happen within a bucket aggregation but this one attribute is not an aggregation, it's just a single metric that I need to pluck off of one document.
So my attempt to solve this issue was to use a scripted_metric:
{
  "...": "...",
  "group_name": {
    "scripted_metric": {
      "map_script": {
        "lang": "painless",
        "source": """
          for (HashMap group : params._source.actor.groups) {
            String groupId = < bucket_key_here >;
            if (groupId != null && !groupId.isEmpty()) {
              params._aggs.name = params._source.actor.groups[groupId].name;
            }
          }
        """
      },
      "reduce_script": {
        "lang": "painless",
        "source": "return params._aggs.length > 0 ? params._aggs[0].name : null;"
      }
    }
  },
  "...": "..."
}
Figure 4 (Current attempt to use a scripted_metric to tease out the group name)
I cannot figure out how to access the bucket's key value, which means that even if I use _source to access the JSON structure of the document being aggregated, I cannot see the bucket in order to determine which group has the correct name.
Notice in Figure 1 that it's possible for one document to contain multiple groups. So I need to be able to reference the key in order to match the name from the corresponding id.
Please let me know if I can clarify or expound on anything to make this issue more clear.
Regards

Elastic Search Query for Multi-valued Data

ES data is indexed like this:
{
  "addresses" : [
    {
      "id" : 69,
      "location": "New Delhi"
    },
    {
      "id" : 69,
      "location": "Mumbai"
    }
  ],
  "goods" : [
    {
      "id" : 396,
      "name" : "abc",
      "price" : 12500
    },
    {
      "id" : 167,
      "name" : "XYz",
      "price" : 12000
    },
    {
      "id" : 168,
      "name" : "XYz1",
      "price" : 11000
    },
    {
      "id" : 169,
      "name" : "XYz2",
      "price" : 13000
    }
  ]
}
In my query I want to fetch records that have at least one matching address, plus goods with a price between 11000 and 13000 and the name XYz.
When your data contains arrays of complex objects, like a list of addresses or a list of goods, you probably want to have a look at Elasticsearch's nested objects to avoid running into problems when your queries return more items than you would expect.
The issue here is the way Elasticsearch (and, in effect, Lucene) stores the data. Since there is no direct concept of lists of nested objects, the data is flattened and the connection between e.g. XYz and 12000 is lost. So you would also get this document as a result when you query for XYz and 12500, because the price 12500 is also present in the list of values for goods.price. To avoid this, you can use Elasticsearch's nested objects feature, which essentially extracts all inner objects into a hidden index and allows querying for several fields that occur in one specific object instead of "in any of the objects". For more details, have a look at the docs on nested objects, which explain this pretty well.
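To illustrate the flattening (a rough sketch of the internal representation, not an actual API response), without the nested type the goods array is effectively stored as independent multi-valued fields, so any name can combine with any price:
{
  "addresses.id"       : [69],
  "addresses.location" : ["New Delhi", "Mumbai"],
  "goods.id"           : [396, 167, 168, 169],
  "goods.name"         : ["abc", "XYz", "XYz1", "XYz2"],
  "goods.price"        : [12500, 12000, 11000, 13000]
}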
In your case, a mapping could look like the following. I assume you only want to query the addresses.location text without providing the id, so that list can remain the simple object type instead of also being a nested type. I also assume you query for exact matches. If that is not the case, you need to switch from keyword to text and replace the term query with a match query.
PUT nesting-sample
{
  "mappings": {
    "item": {
      "properties": {
        "addresses": {
          "properties": {
            "id": { "type": "integer" },
            "location": { "type": "keyword" }
          }
        },
        "goods": {
          "type": "nested",
          "properties": {
            "id": { "type": "integer" },
            "name": { "type": "keyword" },
            "price": { "type": "integer" }
          }
        }
      }
    }
  }
}
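To try it out, you could index the sample document from your question (a sketch; the document id 1 is arbitrary):
PUT nesting-sample/item/1
{
  "addresses": [
    { "id": 69, "location": "New Delhi" },
    { "id": 69, "location": "Mumbai" }
  ],
  "goods": [
    { "id": 396, "name": "abc",  "price": 12500 },
    { "id": 167, "name": "XYz",  "price": 12000 },
    { "id": 168, "name": "XYz1", "price": 11000 },
    { "id": 169, "name": "XYz2", "price": 13000 }
  ]
}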
You can then use a bool query on the location and a nested query to match the inner documents of your goods list.
GET nesting-sample/item/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "addresses.location": "New Delhi"
          }
        },
        {
          "nested": {
            "path": "goods",
            "query": {
              "bool": {
                "must": [
                  {
                    "range": {
                      "goods.price": {
                        "gte": 12200,
                        "lt": 12999
                      }
                    }
                  },
                  {
                    "term": {
                      "goods.name": {
                        "value": "XYz"
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
This query will not match the document, because the price range and the exact name of the good are not satisfied by the same nested object. If you change the lower bound to 12000, it will match.
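For instance, the adjusted nested range clause would then be:
"range": {
  "goods.price": {
    "gte": 12000,
    "lt": 12999
  }
}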
Please check your use case and be aware of the warning at the bottom of the docs regarding mapping explosion when using nested fields.

Get top 100 most used three word phrases in all documents

I have about 15,000 scraped websites with their body texts stored in an Elasticsearch index. I need to get the top 100 most used three-word phrases across all these texts:
Something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I'm new to this. I looked into term vectors but they appear to apply to single documents. So I feel it will be a combination of term vectors and aggregation with n-gram analysis of sorts. But I have no idea how to go about implementing this. Any pointers will be helpful.
My current mapping and settings:
{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
What you're looking for are called Shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")
Take a look here: https://www.elastic.co/blog/searching-with-shingles
Basically, you need a field with a shingle analyzer that produces solely 3-term shingles. Use the configuration from the Elastic blog post, but with:
"filter_shingle":{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":3,
"output_unigrams":"false"
}
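For reference, the complete index settings could look roughly like this (a sketch based on the blog post's approach; the shingle_analyzer name and the reuse of your whitespace/lowercase chain are assumptions on my part):
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 3,
          "output_unigrams": "false"
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "filter_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "shingle_analyzer"
        }
      }
    }
  }
}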
Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a simple terms aggregation on your body field to see the top one hundred 3-word phrases:
{
  "size" : 0,
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "three-word-phrases" : {
      "terms" : {
        "field" : "body",
        "size" : 100
      }
    }
  }
}

Combining Search Across Multiple Geolocations in Multiple Indices

I'm afraid I don't know the terminology to succinctly describe what I'm trying to do, but I will explain what I'm currently doing and what I'd like to do. I'm trying to converge two search queries into a single query, taking geo point data from one index to use as a search parameter for searching in a second index/doctype.
My current ES set up:
Indices and DocTypes:
|- 1) locations
|---- 1a) UK_postcode
|- 2) accounts
|---- 2a) client
Each of the doctypes has a field named 'location' which is mapped to the geo_point type.
My Current Process:
1) Users search for clients based on keywords and distance from location (a UK postcode).
2) System takes the postcode and searches for the matching results to get the geo_point latitude and longitude data from the locations.UK_postcode.
3) System uses the provided keywords and latitude and longitude to search on the accounts.client index/doctype.
4) System returns nice looking results to the user, based on ES search results.
My Question:
Can steps 2 and 3 be rolled into a single search query? If yes how do I do this? I want to pass a postcode to the search query and for ES to find the geo_point data for fulfilling the requirements of a geo distance query on the client doctype.
Using pre-indexed shapes, you can definitely eliminate step 2. Note that this solution only works with pre-defined distances.
The main idea would be:
to store in your locations index a geo_shape of type circle for each postcode and each pre-defined distance.
to store in your accounts index a geo_shape of type point for your client location.
to create a geo_shape query that leverages the pre-indexed postcode circle shapes.
So as a quick example, you'd have this:
A. Create the postcode locations index:
PUT /locations
{
  "mappings": {
    "UK_postcode": {
      "properties": {
        "location": { "type" : "geo_shape" }
      }
    }
  }
}
B. Create client locations index
PUT /accounts
{
  "mappings": {
    "client": {
      "properties": {
        "name": { "type": "string" },
        "location": { "type" : "geo_shape" }
      }
    }
  }
}
C. Create sample postcode circle of 1, 2, 3 mile radius for "M32 0JG"
PUT /locations/UK_postcode/M320JG-1
{
  "location": {
    "type" : "circle",
    "coordinates" : [-2.30283674284007, 53.4556572899372],
    "radius": "1mi"
  }
}

PUT /locations/UK_postcode/M320JG-2
{
  "location": {
    "type" : "circle",
    "coordinates" : [-2.30283674284007, 53.4556572899372],
    "radius": "2mi"
  }
}
# ... repeat until radius = 10
D. Create sample client very close to "M32 0JG"
PUT /accounts/client/1234
{
  "name": "Big Corp",
  "location": {
    "type" : "point",
    "coordinates" : [-2.30293674284007, 53.4557572899372]
  }
}
E. Query all clients whose name matches "big" and who are in a 2-mile radius of the postcode "M32 0JG"
POST /accounts/client/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "big" <--- free text name match
          }
        }
      ],
      "filter": {
        "geo_shape": {
          "location": {
            "indexed_shape": {
              "id": "M320JG-2", <--- located within two miles of M32 0JG
              "type": "UK_postcode",
              "index": "locations",
              "path": "location"
            }
          }
        }
      }
    }
  }
}

How should I query Elastic Search given my mapping and using keywords?

I have a very simple mapping which looks like this (I streamlined the example a bit):
{
  "location" : {
    "properties": {
      "name": { "type": "string", "boost": 2.0, "analyzer": "snowball" },
      "description": { "type": "string", "analyzer": "snowball" }
    }
  }
}
Now I index a lot of locations using some random values which are based on real English words.
I'd like to be able to search for locations that match any of the given keywords in either the name or the description field (name is more important, hence the boost I gave it). I tried a few different queries and they don't return any results.
{
  "fields" : ["name", "description"],
  "query" : {
    "terms" : {
      "name" : ["savage"],
      "description" : ["savage"]
    },
    "from" : 0,
    "size" : 500
  }
}
Considering there are locations that have the word "savaged" in the description, it should return some results ("savage" is the stem of "savaged"), yet the above query yields 0 results. I've been using curl to query ES:
curl -XGET -d @query.json http://localhost:9200/myindex/locations/_search
If I use a query string instead:
curl -XGET http://localhost:9200/fieldtripfinder/locations/_search?q=description:savage
I actually get one result (of course now it would be searching the description field only).
Basically I am looking for a query that will do a OR kind of search using multiple keywords and compare them to the values in both the name and the description field.
Snowball stems "savage" into "savag", which is why the terms query for "savage" didn't return any results. However, when you specify "savage" in the URL, it gets analyzed, and that's why you get a result. Depending on your intention, you can either use the correct stem ("savag") in the terms query or have your terms analyzed by using a "match" query instead:
{
  "fields" : ["name", "description"],
  "query" : {
    "bool" : {
      "should" : [
        { "match" : { "name" : "savage" } },
        { "match" : { "description" : "savage" } }
      ]
    }
  },
  "from" : 0,
  "size" : 500
}
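To double-check what snowball actually produces for a term, the _analyze API is handy (using the index name from your curl example):
curl -XGET "http://localhost:9200/myindex/_analyze?analyzer=snowball&text=savaged"
And as a variation on the same idea (a sketch, not tested against your data), a multi_match query expresses the same OR across both fields more compactly; the ^2 boost mirrors the boost you already set on name in the mapping:
{
  "query": {
    "multi_match": {
      "query": "savage",
      "fields": ["name^2", "description"]
    }
  },
  "from": 0,
  "size": 500
}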
