Word Cloud in Elasticsearch 5

I was able to get a word cloud in older Elasticsearch versions using terms aggregations. Now I want to build a word cloud from post content in ES 5, and I am using the query below.
"aggs": {
"tagcloud": {
"terms": {
"field": "content.raw",
"size": 10
}
}
}
I set up the mapping like this:
"content": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
But the result is not a word cloud as expected. It groups identical posts (the whole post string) and returns them as a list, as shown below:
"buckets": [
{
"key" : "This car is awesome.",
"doc_count" : 199
},
..
..
How can I achieve this?

The keyword type does pretty much the same thing as the old string type with not_analyzed index mode: the whole string is indexed, and you can only search by exact value.
In your case, I think you need to use a field that is analyzed and tokenized, such as the content field itself. However, you need to make sure that the field's fielddata option is set to true; otherwise the server returns an exception.
Your mapping should therefore look like this:
"content": {
"fielddata" : true,
"type": "text"
}
and aggregation
"aggs": {
"tagcloud": {
"terms": {
"field": "content",
"size": 10
}
}
}
As a result you should see something like the following (the exact terms depend on which analyzer you choose):
"buckets": [
{
"key" : "this",
"doc_count" : 199
},
{
"key" : "car",
"doc_count" : 199
},
{
"key" : "is",
"doc_count" : 199
},
{
"key" : "awesome",
"doc_count" : 199
},
...
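To see why the two mappings bucket so differently, here is a client-side Python sketch. This is not how Elasticsearch works internally; the tokenizer below is a crude stand-in for the standard analyzer, and the sample posts are made up:

```python
import re
from collections import Counter

posts = ["This car is awesome.", "This car is awesome.", "This bike is slow."]

# keyword-style aggregation: each whole string is a single term
keyword_buckets = Counter(posts)

# text-style aggregation: a rough stand-in for the standard analyzer
# (lowercase, split on non-alphanumeric characters)
def analyze(text):
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

token_buckets = Counter(t for post in posts for t in analyze(post))

print(keyword_buckets.most_common(2))  # whole posts as bucket keys
print(token_buckets.most_common(4))    # individual words as bucket keys
```

The second counter is the word-cloud shape the question is after: one bucket per token, not per post.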

Related

How can I aggregate the whole field value in Elasticsearch

I am using Elasticsearch 7.15 and need to aggregate a field and sort the buckets by count.
My document saved in Elasticsearch looks like:
{
"logGroup" : "/aws/lambda/myLambda1",
...
},
{
"logGroup" : "/aws/lambda/myLambda2",
...
}
I need to find out which logGroup has the most documents. To do that, I tried an aggregation in Elasticsearch:
GET /my-index/_search?size=0
{
"aggs": {
"types_count": {
"terms": {
"field": "logGroup",
"size": 10000
}
}
}
}
The output of this query looks like:
"aggregations" : {
"types_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "aws",
"doc_count" : 26303620
},
{
"key" : "lambda",
"doc_count" : 25554470
},
{
"key" : "myLambda1",
"doc_count" : 25279201
}
...
}
As you can see from the above output, it splits the logGroup value into terms and aggregates on each term, not on the whole string. Is there a way for me to aggregate on the whole string?
I expect the output looks like:
"buckets" : [
{
"key" : "/aws/lambda/myLambda1",
"doc_count" : 26303620
},
{
"key" : "/aws/lambda/myLambda2",
"doc_count" : 25554470
},
The logGroup field in the index mapping is:
"logGroup" : {
"type" : "text",
"fielddata" : true
},
Can I achieve it without updating the index?
In order to get what you expect you need to change your mapping to this:
"logGroup" : {
"type" : "keyword"
},
Failing that, your log groups get analyzed by the standard analyzer, which splits the whole string, and you won't be able to aggregate by full log groups.
If you don't want or can't change the mapping and reindex everything, what you can do is the following:
First, add a keyword sub-field to your mapping, like this:
PUT /my-index/_mapping
{
"properties": {
"logGroup" : {
"type" : "text",
"fields": {
"keyword": {
"type" : "keyword"
}
}
}
}
}
And then run the following so that all existing documents pick up this new field:
POST my-index/_update_by_query?wait_for_completion=false
Finally, you'll be able to achieve what you want with the following query:
GET /my-index/_search
{
"size": 0,
"aggs": {
"types_count": {
"terms": {
"field": "logGroup.keyword",
"size": 10000
}
}
}
}
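What that final query returns can be sketched client-side. The log group values and counts below are hypothetical; the point is that each whole string becomes one bucket, ordered by doc count descending (the terms aggregation default):

```python
from collections import Counter

# hypothetical sample of logGroup values as stored in _source
log_groups = (
    ["/aws/lambda/myLambda1"] * 3
    + ["/aws/lambda/myLambda2"] * 2
)

# keyword sub-field aggregation: one bucket per whole string,
# ordered by doc_count descending
buckets = [
    {"key": key, "doc_count": count}
    for key, count in Counter(log_groups).most_common()
]
print(buckets)
```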

How to store real estate data in Elasticsearch?

I have real estate data. I am looking into storing it in Elasticsearch to allow users to search the database in real time.
I want to let my users search by key fields like price, lot size, year built, total bedrooms, etc. However, I also want to let them filter by keywords or amenities like "Has Pool", "Has Spa", "Parking Space", "Community", etc.
Additionally, I need to keep a distinct list of property types, property statuses, schools, communities, etc. so I can create drop-down menus for my users to select from.
What should the stored data structure look like? How can I maintain distinct lists of schools, communities, and types to build those drop-down menus?
The current data I have is basically key/value pairs. I can clean it up and standardize it before storing it in Elasticsearch, but I'm puzzled about what is considered a good approach to storing this data.
Based on your question I will provide baseline mappings and a basic query with facets/filters for you to start working with.
Mappings
PUT test_jay
{
"mappings": {
"properties": {
"amenities": {
"type": "keyword"
},
"description": {
"type": "text"
},
"location": {
"type": "geo_point"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"status": {
"type": "keyword"
},
"type": {
"type": "keyword"
}
}
}
}
We use the "keyword" field type for fields on which you will always do exact matches, like a drop-down list.
For fields we only want full-text search on, like description, we use the "text" type. In some cases, like titles, I want both field types.
I created a location field of type geo_point in case you want to put your properties on a map or do distance-based searches, like finding nearby houses.
For amenities, a keyword field type is enough to store an array of amenities.
Ingest document
POST test_jay/_doc
{
"name": "Nice property",
"description": "nice located fancy property",
"location": {
"lat": 37.371623,
"lon": -122.003338
},
"amenities": [
"Pool",
"Parking",
"Community"
],
"type": "House",
"status": "On sale"
}
Remember that keyword fields are case-sensitive!
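A quick client-side illustration of that case sensitivity (a simplified analogy, not the real term-query implementation):

```python
# A term query against a keyword field is an exact, case-sensitive match:
# no analysis is applied on either side, unlike a match query on a text field.
stored_amenities = ["Pool", "Parking", "Community"]

def term_match(field_values, term):
    # exact comparison against the stored values
    return term in field_values

print(term_match(stored_amenities, "Pool"))  # matches
print(term_match(stored_amenities, "pool"))  # does not match
```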
Search query
POST test_jay/_search
{
"query": {
"bool": {
"must": {
"multi_match": {
"query": "nice",
"fields": [
"name",
"description"
]
}
},
"filter": [
{
"term": {
"status": "On sale"
}
},
{
"term": {
"amenities":"Pool"
}
},
{
"term": {
"type": "House"
}
}
]
}
},
"aggs": {
"amenities": {
"terms": {
"field": "amenities",
"size": 10
}
},
"status": {
"terms": {
"field": "status",
"size": 10
}
},
"type": {
"terms": {
"field": "type",
"size": 10
}
}
}
}
The multi_match part does a full-text search on the name and description fields; you'd fill this one from the regular search box.
The filter part is then populated from the drop-down lists.
Query Response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "test_jay",
"_type" : "_doc",
"_id" : "zWysGHgBLiMtJ3pUuvZH",
"_score" : 0.2876821,
"_source" : {
"name" : "Nice property",
"description" : "nice located fancy property",
"location" : {
"lat" : 37.371623,
"lon" : -122.003338
},
"amenities" : [
"Pool",
"Parking",
"Community"
],
"type" : "House",
"status" : "On sale"
}
}
]
},
"aggregations" : {
"amenities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Community",
"doc_count" : 1
},
{
"key" : "Parking",
"doc_count" : 1
},
{
"key" : "Pool",
"doc_count" : 1
}
]
},
"type" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "House",
"doc_count" : 1
}
]
},
"status" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "On sale",
"doc_count" : 1
}
]
}
}
}
With the query response you can fill in the facets for further filtering.
I recommend playing around with this and then coming back with more specific questions.
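The search-box-plus-dropdowns pattern above can be sketched client-side. This Python snippet mimics (in a very simplified way, on made-up documents) what the bool filter and the terms aggregations do: filter first, then count facets over the filtered set:

```python
from collections import Counter

# hypothetical property documents
docs = [
    {"name": "Nice property", "amenities": ["Pool", "Parking"],
     "type": "House", "status": "On sale"},
    {"name": "Cozy flat", "amenities": ["Parking"],
     "type": "Apartment", "status": "Sold"},
]

# dropdown selections become exact-match filters (term queries)
filters = {"status": "On sale"}

def matches(doc, filters):
    # exact match on single-valued keyword fields (status, type)
    return all(doc.get(field) == value for field, value in filters.items())

hits = [d for d in docs if matches(d, filters)]

# facet counts over the filtered set, like the terms aggregations
amenity_facet = Counter(a for d in hits for a in d["amenities"])
type_facet = Counter(d["type"] for d in hits)

print(hits)
print(amenity_facet, type_facet)
```

The facet counters are what you would use to render the drop-down options with their document counts.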

How to combine completion, suggestion and match phrase across multiple text fields?

I've been reading about Elasticsearch suggesters, match phrase prefix, and highlighting, and I'm a bit confused as to which suits my problem.
Requirement: I have a bunch of different text fields and need to be able to autocomplete and autosuggest across all of them, as well as handle misspellings. Basically the way Google works.
See in the following Google snapshot: when we start typing "Can", it lists words like Canadian, Canada, etc. This is autocomplete. However, it also lists additional words like tire, post, post tracking, coronavirus, etc. This is autosuggest: it searches for the most relevant words across all fields. If we type "canxad", the misspelling should still suggest the same results.
Could someone please give me some hints on how to implement the above functionality across a bunch of text fields?
At first I tried this:
GET /myindex/_search
{
"query": {
"match_phrase_prefix": {
"myFieldThatIsCombinedViaCopyTo": "revis"
}
},
"highlight": {
"fields": {
"*": {}
},
"require_field_match" : false
}
}
but it returns highlights like this:
"In the aforesaid revision filed by the members of the Committee, the present revisionist was also party",
So that's not a "prefix" anymore...
I also tried this:
GET /myindex/_search
{
"query": {
"multi_match": {
"query": "revis",
"fields": ["myFieldThatIsCombinedViaCopyTo"],
"type": "phrase_prefix",
"operator": "and"
}
},
"highlight": {
"fields": {
"*": {}
}
}
}
But it still returns
"In the aforesaid revision filed by the members of the Committee, the present revisionist was also party",
Note: I have about 5 text fields that I need to search across. One of those fields is quite long (thousands of words). If I break things up into keywords, I lose the phrase. So it's as if I need match phrase prefix across a combined text field, with fuzziness?
EDIT
Here's an example of a document (some fields taken out, content snipped):
{
"id" : 1,
"respondent" : "Union of India",
"caseContent" : "<snip>..against the Union of India, through the ...<snip>"
}
As @Vlad suggested, I tried this:
POST /cases/_search
{
"suggest": {
"respondent-suggest": {
"prefix": "uni",
"completion": {
"field": "respondent.suggest",
"skip_duplicates": true
}
},
"caseContent-suggest": {
"prefix": "uni",
"completion": {
"field": "caseContent.suggest",
"skip_duplicates": true
}
}
}
}
Which returns this:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"caseContent-suggest" : [
{
"text" : "uni",
"offset" : 0,
"length" : 3,
"options" : [ ]
}
],
"respondent-suggest" : [
{
"text" : "uni",
"offset" : 0,
"length" : 3,
"options" : [
{
"text" : "Union of India",
"_index" : "cases",
"_type" : "_doc",
"_id" : "dI5hh3IBEqNFLVH6-aB9",
"_score" : 1.0,
"_ignored" : [
"headNote.suggest"
],
"_source" : {
<snip>
}
}
]
}
]
}
}
So it looks like it matches on the respondent field, which is great! But it didn't match on the caseContent field, even though the text (see above) includes the phrase "against the Union of India". Shouldn't it match there? Or is it because of how the text is broken up?
Since you need autocomplete/suggest on each field, then you need to run a suggest query on each field and not on the copy_to field. That way you're guaranteed to have the proper prefixes.
copy_to fields are great for searching in multiple fields, but not so good for auto-suggest/-complete type of queries.
The idea is that for each of your fields, you should have a completion sub-field so that you can get auto-complete results for each of them.
PUT index
{
"mappings": {
"properties": {
"text1": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
},
"text2": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
},
"text3": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
}
}
}
}
Your suggest queries would then run on all the sub-fields directly:
POST index/_search?pretty
{
"suggest": {
"text1-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text1.suggest"
}
},
"text2-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text2.suggest"
}
},
"text3-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text3.suggest"
}
}
}
}
That takes care of the autocomplete/suggest part. For misspellings, suggest queries also allow you to specify a fuzzy parameter.
UPDATE
If you need to do prefix search on all sentences within a body of text, the approach needs to change a bit.
The new mapping below creates a new completion field next to the text one. The idea is to apply a small transformation (i.e. split sentences) to what you're going to store in the completion field. So first create the index mapping like this:
PUT index
{
"mappings": {
"properties": {
"text1": {
"type": "text",
},
"text1Suggest": {
"type": "completion"
}
}
}
}
Then create an ingest pipeline that will populate the text1Suggest field with sentences from the text1 field:
PUT _ingest/pipeline/sentence
{
"processors": [
{
"split": {
"field": "text1",
"target_field": "text1Suggest.input",
"separator": "\\.\\s+"
}
}
]
}
Then we can index a document such as this one (with only the text1 field as the completion field will be built dynamically)
PUT test/_doc/1?pipeline=sentence
{
"text1": "The crazy fox. The quick snail. John goes to the beach"
}
What gets indexed looks like this (your text1 field + another completion field optimized for sentence prefix completion):
{
"text1": "The crazy fox. The cat drinks milk. John goes to the beach",
"text1Suggest": {
"input": [
"The crazy fox",
"The cat drinks milk",
"John goes to the beach"
]
}
}
And finally you can search for prefixes of any sentence, below we search for John and you should get a suggestion:
POST test/_search?pretty
{
"suggest": {
"text1-suggest": {
"prefix": "John",
"completion": {
"field": "text1Suggest"
}
}
}
}
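The split-then-prefix behavior, and why "uni" matched the respondent field but not the long caseContent field earlier, can be sketched client-side. Completion suggestions match from the *start* of each input, so splitting a long text into one input per sentence is what makes mid-text sentences reachable. (This is a simplification: lowercasing here roughly mimics the completion field's default simple analyzer.)

```python
import re

text1 = "The crazy fox. The quick snail. John goes to the beach"

# the ingest pipeline's split processor: one completion input per sentence
inputs = re.split(r"\.\s+", text1)

def suggest(inputs, prefix):
    # completion suggestions only match from the start of each input
    return [s for s in inputs if s.lower().startswith(prefix.lower())]

print(suggest(inputs, "John"))   # matches the third sentence
print(suggest(inputs, "beach"))  # mid-sentence words still won't match
```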

Elastic Search Filter on the result of terms aggregation

Apply a match phrase prefix query to the result of a terms aggregation in Elasticsearch.
I have terms query and the result looks something like below
"buckets": [
{
"key": "KEY",
"count": 20
},
{
"key": "LOCK",
"count": 30
}
]
Now the requirement is to filter those buckets whose key starts with a certain prefix, similar to a match phrase prefix query. For example, if the input prefix is "LOC", only one bucket (the 2nd one) should be returned. So effectively it's a filter on the terms aggregation. Thanks for your thoughts.
You could use the include parameter on your terms aggregation to filter the values with a regex.
Something like this should work:
GET stackoverflow/_search
{
"_source": false,
"aggs": {
"groups": {
"terms": {
"field": "text.keyword",
"include": "LOC.*"
}
}
}
}
Example: let's say you have three documents with three different terms (LOCK, KEY & LOL) in an index. If you perform the following request:
GET stackoverflow/_search
{
"_source": false,
"aggs": {
"groups": {
"terms": {
"field": "text.keyword",
"include": "L.*"
}
}
}
}
You'll get the following buckets:
"buckets" : [
{
"key" : "LOCK",
"doc_count" : 1
},
{
"key" : "LOL",
"doc_count" : 1
}
]
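One detail worth noting, shown in this client-side Python sketch: the include regex has to match the *entire* term, which is why the patterns above end in `.*` rather than being bare prefixes:

```python
import re

# bucket keys and doc counts from the example above
terms = {"LOCK": 1, "KEY": 1, "LOL": 1}

def include(buckets, pattern):
    # like the terms agg's include parameter: the regex must match
    # the whole term, so a prefix filter needs a trailing ".*"
    return {k: v for k, v in buckets.items() if re.fullmatch(pattern, k)}

print(include(terms, "L.*"))    # LOCK and LOL
print(include(terms, "LOC.*"))  # only LOCK
```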
Hope it is helpful.

Elasticsearch: how to filter by summed values in nested objects?

I have the following products structure in the elasticsearch:
POST /test/products/1
{
"name": "product1",
"sales": [
{
"quantity": 10,
"customer": "customer1",
"date": "2014-01-01"
},
{
"quantity": 1,
"customer": "customer1",
"date": "2014-01-02"
},
{
"quantity": 5,
"customer": "customer2",
"date": "2013-12-30"
}
]
}
POST /test/products/2
{
"name": "product2",
"sales": [
{
"quantity": 1,
"customer": "customer1",
"date": "2014-01-01"
},
{
"quantity": 15,
"customer": "customer1",
"date": "2014-02-01"
},
{
"quantity": 1,
"customer": "customer2",
"date": "2014-01-21"
}
]
}
The sales field is nested object. I need to filter products like this:
"get all products which have total quantity >= 16 and sales.customer = 'customer1'".
The total quantity is sum(sales.quantity) where sales.customer = 'customer1'.
Therefore the search results should contain only 'product2'.
I tried to use aggs but I didn't understand how to filter in this case.
I haven't found any information about it in the elasticsearch documentation.
Is it possible?
I would welcome any ideas, thanks!
First of all, be clear about what you want as a result: a count, or the matching documents? Aggregations only give you counts; to get documents back you need a filter in the query. If you want documents, you can't filter on sum(sales.quantity) >= 16; if you want a count, you could use a range aggregation, but even then I believe ranges can only be applied to actual document fields, not computed values.
The nearest solution I can give you is the following:
{
"size": 0,
"query": {
"filtered": {
"query": { "match_all": {} },
"filter": {
"nested": {
"path": "sales",
"filter": { "term": { "sales.customer": "customer1" } }
}
}
}
},
"aggregations": {
"salesNested": {
"nested": { "path": "sales" },
"aggregations": {
"aggByRange": {
"range": {
"field": "sales.quantity",
"ranges": [
{ "from": 16 }
]
}
},
"quantityStats": {
"stats": { "field": "sales.quantity" }
}
}
}
}
}
In the query above we aggregate on "field": "sales.quantity". For your requirement you would need to replace that with the summed quantity from the quantityStats aggregation, which I don't think Elasticsearch provides.
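Since the per-customer sum can't be used as a filter in the query itself, the logic the question asks for can at least be expressed client-side. This Python sketch uses the two sample products from the question and applies "sum of quantity for customer1 >= 16":

```python
# the two sample products from the question (dates omitted)
products = [
    {"name": "product1", "sales": [
        {"quantity": 10, "customer": "customer1"},
        {"quantity": 1, "customer": "customer1"},
        {"quantity": 5, "customer": "customer2"},
    ]},
    {"name": "product2", "sales": [
        {"quantity": 1, "customer": "customer1"},
        {"quantity": 15, "customer": "customer1"},
        {"quantity": 1, "customer": "customer2"},
    ]},
]

def total_for(product, customer):
    # sum quantities over only the nested sales of the given customer
    return sum(s["quantity"] for s in product["sales"]
               if s["customer"] == customer)

matching = [p["name"] for p in products
            if total_for(p, "customer1") >= 16]
print(matching)  # only product2 reaches the threshold
```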
