Elasticsearch plugin to classify documents

Is there an Elasticsearch plugin that would let me classify the documents I add to an index?
The ideal solution for me would be a classification of the most frequent terms (or concepts), displayed as a kind of tag cloud that the user can navigate.
Is there a way to achieve this? Any suggestions?
Thanks

The basic idea is to use a terms aggregation, which yields one bucket per term.
POST /_search
{
"aggs" : {
"genres" : {
"terms" : { "field" : "genre" }
}
}
}
The buckets in the response are ordered by decreasing document count:
{
...
"aggregations" : {
"genres" : {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets" : [
{
"key" : "jazz",
"doc_count" : 10
},
{
"key" : "rock",
"doc_count" : 5
},
{
"key" : "electronic",
"doc_count" : 2
}
]
}
}
}
If you're using Kibana, you can directly create a tag cloud visualization based on those terms.
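If Kibana is not an option, the buckets can also be turned into tag cloud weights on the client. Below is a minimal Python sketch, assuming the response has the shape shown above. The function name `tag_cloud_weights` and the font size range are hypothetical choices for illustration, not part of Elasticsearch.

```python
def tag_cloud_weights(response, agg_name="genres", min_size=10, max_size=40):
    """Map each bucket key to a font size scaled linearly by doc_count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    if not buckets:
        return {}
    counts = [b["doc_count"] for b in buckets]
    low, high = min(counts), max(counts)
    span = high - low
    sizes = {}
    for b in buckets:
        # When every term has the same count, fall back to the largest size.
        ratio = (b["doc_count"] - low) / span if span else 1.0
        sizes[b["key"]] = round(min_size + ratio * (max_size - min_size))
    return sizes


# Same buckets as in the sample response above.
response = {
    "aggregations": {
        "genres": {
            "buckets": [
                {"key": "jazz", "doc_count": 10},
                {"key": "rock", "doc_count": 5},
                {"key": "electronic", "doc_count": 2},
            ]
        }
    }
}
print(tag_cloud_weights(response))
```

Linear scaling is only one option; a logarithmic scale may read better when a few terms dominate the counts.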

Related

Elasticsearch Aggregation most common list of integers

I am looking for an Elasticsearch aggregation and mapping
that will return the most common list for a certain field.
For example for docs:
{"ToneCurvePV2012": [1,2,3]}
{"ToneCurvePV2012": [1,5,6]}
{"ToneCurvePV2012": [1,7,8]}
{"ToneCurvePV2012": [1,2,3]}
I would like the aggregation result to be:
[1,2,3] (since it appears twice).
So far, every aggregation I have tried returns: 1
This is not possible with the default terms aggregation, because it buckets each array element separately. You need to use a terms aggregation with a script. Please note that this might impact your cluster's performance.
Here, I have used a script that builds a string from the array and uses it as the aggregation key. So if the array value is [1,2,3], the script creates its string representation '[1,2,3]', and that key is used for the aggregation.
Below is a sample query you can use to generate the aggregation you expect:
POST index1/_search
{
"size": 0,
"aggs": {
"tone_s": {
"terms": {
"script": {
"source": "def value='['; for(int i=0;i<doc['ToneCurvePV2012'].length;i++){value= value + doc['ToneCurvePV2012'][i] + ',';} value+= ']'; value = value.replace(',]', ']'); return value;"
}
}
}
}
}
Output:
{
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"tone_s" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "[1,2,3]",
"doc_count" : 2
},
{
"key" : "[1,5,6]",
"doc_count" : 1
},
{
"key" : "[1,7,8]",
"doc_count" : 1
}
]
}
}
}
PS: the key is returned as a string, not as an array, in the aggregation response.
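Since the keys come back as strings, the client has to convert them if it needs real lists. Here is a minimal Python sketch, assuming the response shape shown above and keys of the form produced by the script; the helper name `most_common_list` is hypothetical.

```python
import json


def most_common_list(response, agg_name="tone_s"):
    """Return (list, doc_count) for the first bucket, or None if there are no buckets."""
    buckets = response["aggregations"][agg_name]["buckets"]
    if not buckets:
        return None
    top = buckets[0]  # terms buckets are ordered by doc_count descending by default
    # The script emits keys such as "[1,2,3]", which are also valid JSON arrays.
    return json.loads(top["key"]), top["doc_count"]


# Buckets copied from the output above.
sample = {
    "aggregations": {
        "tone_s": {
            "buckets": [
                {"key": "[1,2,3]", "doc_count": 2},
                {"key": "[1,5,6]", "doc_count": 1},
                {"key": "[1,7,8]", "doc_count": 1},
            ]
        }
    }
}
print(most_common_list(sample))
```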

Elasticsearch: How many words occurred once

I am modifying my question to be a little more generic so please humor me...
Say I have an Elasticsearch index with each document holding a word from a textbook. Is there a way I can tell how many words occurred just once, how many twice, and so on?
i.e., the result would be something like this:
# words occurring once = 10,001,
twice = 503,
thrice = 807,
four times = 997,
five times = 23
Is there a way to do this in Elasticsearch?
I am not looking for "give me the top x words that occur most often"; that is easily retrieved with an aggregation.
Thanks!
Suppose your documents have a field word that holds a word from the textbook. Your use case can be solved with a terms aggregation, which groups all the occurrences of a word into one bucket. Your query would look like this:
{
"aggs" : {
"word_count" : {
"terms" : { "field" : "word" }
}
}
}
With the following output:
{
"aggregations" : {
"word_count" : {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets" : [
{
"key" : "The",
"doc_count" : 10
},
{
"key" : "wild",
"doc_count" : 2
},
{
"key" : "fox",
"doc_count" : 3
}
]
}
}
}
where doc_count indicates the number of occurrences of each word.
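The aggregation above gives one count per word, while the question asks how many words share each count. One way to get there is to regroup the buckets on the client. The following Python sketch assumes all buckets have been retrieved (a terms aggregation returns only the top 10 by default, so a larger size or a paginated composite aggregation would be needed for a full vocabulary); the function name `occurrence_histogram` is hypothetical.

```python
from collections import Counter


def occurrence_histogram(buckets):
    """Count how many words share each doc_count (once, twice, ...)."""
    return dict(Counter(b["doc_count"] for b in buckets))


# Buckets copied from the sample output above.
buckets = [
    {"key": "The", "doc_count": 10},
    {"key": "wild", "doc_count": 2},
    {"key": "fox", "doc_count": 3},
]
print(occurrence_histogram(buckets))
```

Each key of the result is an occurrence count and each value is the number of distinct words with that count.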

Elasticsearch slow results with IN query and Scoring

I have text document data (approximately 500k documents) stored in Elasticsearch, where each document's text is mapped to its corresponding document number.
I am trying to fetch results in batches for "Sample Text" within a particular set of document numbers (approximately 300k) with scoring, and the results are extremely slow.
Here is the mapping
PUT my_index
{
"mappings" : {
"doc_repo" : {
"properties" : {
"doc_number" : {
"type" : "integer"
},
"document" : {
"type" : "string",
"term_vector" : "with_positions_offsets_payloads"
}
}
}
}
}
Here is the request query
{
"query" : {
"bool" : {
"must" : [
{
"terms" : {
"document" : [
"sample text"
]
}
},
{
"terms" : {
"doc_number" : [1,2,3....,300K] //ArrayOf_300K_DocNumbers
}
}
]
}
},
"fields" : [
"doc_number"
],
"size" : 500,
"from" : 0
}
I tried fetching results in two other ways:
1. Results without scoring within a particular set of document numbers (I used filtering for this)
2. Results with scoring but without any particular set of document numbers (in batches)
Both of these were pretty quick, but the problem appears when I try to achieve both.
Do I need to change the mapping or the search query, or is there another way to achieve this?
Thanks in advance.
The issue was specific to Elasticsearch 2.x; upgrading Elasticsearch solves it.
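On a current version, one common arrangement is to keep the scoring clause in must and move the document number restriction into filter, which does not contribute to the score. The Python sketch below only builds the request body; it is an illustration under assumptions, not the asker's final query. The function name `build_scored_query` is hypothetical, and replacing the terms query on the text field with match is an assumption about the intent. Recent versions also limit the number of values in a terms query (the index.max_terms_count setting), so a 300k list may need that limit raised or the list split into batches.

```python
def build_scored_query(text, doc_numbers, size=500, offset=0):
    """Build a bool query: full-text match is scored, doc_number is only filtered."""
    return {
        "query": {
            "bool": {
                # Scored clause: relevance comes from the text match only.
                "must": [{"match": {"document": text}}],
                # Filter clause: restricts the candidate set without scoring.
                "filter": [{"terms": {"doc_number": list(doc_numbers)}}],
            }
        },
        "size": size,
        "from": offset,
    }


body = build_scored_query("sample text", [1, 2, 3])
print(body)
```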

Do query results impact elasticsearch phrase suggestions?

I'd like to know whether Elasticsearch uses the query results to populate phrase suggestions for the direct generator.
Or does it simply pick tokens from the given index?
My queries are based on some permission sets.
So for instance, that'd be my query:
{
"size" : 0,
"query" : {
"filtered" : {
"query" : {
"match_all" : {}
},
"filter" : {
"bool" : {
"must" : [{
"terms" : {
"Permissions" : ["permission1", "permission2", "permission3"
]
}
}
]
}
}
}
},
"suggest" : {
"DidYouMean" : {
"text" : "{{SearchPhrase}}",
"phrase" : {
"field" : "_all",
"analyzer" : "simple",
"size" : 1,
"real_word_error_likelihood" : 0.96,
"max_errors" : 5,
"gram_size" : 3,
"direct_generator" : [{
"field" : "_all",
"suggest_mode" : "popular",
"min_word_length" : 3
}
]
}
}
}
}
How can I ensure that the direct generator creates suggestions without violating my permissions clause?
Is this even possible?
The term suggester and the phrase suggester both generate their results from the indexed tokens. The query does not affect the suggest results. The suggester works directly on the inverted index and takes its tokens from there, so its scope is the whole index and never the query.
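The phrase suggester does offer a collate option that runs a templated query for each suggestion and drops suggestions with no matching document, which can be used to check suggestions against the permissions filter. The Python sketch below only builds such a request body and is an assumption-laden illustration: the function name `build_suggest_body` is hypothetical, the match_phrase check is one possible choice, and the template key is source in recent versions (2.x used inline). Candidate generation itself still reads from the whole index.

```python
def build_suggest_body(search_phrase, permissions, field="_all"):
    """Phrase suggester body whose suggestions are checked by a collate query."""
    # Run once per suggestion; {{suggestion}} is filled in by Elasticsearch.
    collate_query = {
        "bool": {
            "must": [{"match_phrase": {field: "{{suggestion}}"}}],
            "filter": [{"terms": {"Permissions": list(permissions)}}],
        }
    }
    return {
        "size": 0,
        "suggest": {
            "DidYouMean": {
                "text": search_phrase,
                "phrase": {
                    "field": field,
                    "size": 1,
                    "direct_generator": [
                        {"field": field, "suggest_mode": "popular", "min_word_length": 3}
                    ],
                    "collate": {
                        "query": {"source": collate_query},
                        # False: only suggestions with a matching document are returned.
                        "prune": False,
                    },
                },
            }
        },
    }


suggest_body = build_suggest_body("serach phrase", ["permission1", "permission2"])
print(suggest_body)
```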

unique value from each bucket in elastic search

I have a sample database of 1000 bank accounts.
{"account_number":1,"balance":39225,...,"state":"IL"}
What I want is a list of the highest-balance accounts in each state. Using a terms aggregation, I got the count of accounts for each state,
e.g.
"aggregations" : {
"states" : {
"buckets" : [ {
"key" : "tx",
"doc_count" : 30
}, ....
But this doesn't return the required list. Any suggestions?
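Use a max aggregation, nested under the terms aggregation so that it is computed per state:
{
"aggs" : {
"states" : {
"terms" : { "field" : "state" },
"aggs" : {
"max_balance" : { "max" : { "field" : "balance" } }
}
}
}
}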
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-max-aggregation.html
You should look at the significant terms aggregation: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html. It generates buckets with related terms and is worth exploring.
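A max aggregation returns only the value, not the account it belongs to. To get the account document itself, a top_hits sub-aggregation sorted by balance can be nested under the terms aggregation. The Python sketch below builds such a request and reads the result; the function names, the aggregation names, and the assumption that state is aggregatable as mapped (it may need to be state.keyword) are illustrative, not taken from the question.

```python
def build_top_account_per_state(max_states=50):
    """Terms aggregation on state with the single highest-balance hit per bucket."""
    return {
        "size": 0,
        "aggs": {
            "states": {
                "terms": {"field": "state", "size": max_states},
                "aggs": {
                    "top_account": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"balance": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


def top_accounts_by_state(response):
    """Map each state key to the _source of its highest-balance account."""
    result = {}
    for bucket in response["aggregations"]["states"]["buckets"]:
        hits = bucket["top_account"]["hits"]["hits"]
        if hits:
            result[bucket["key"]] = hits[0]["_source"]
    return result


request_body = build_top_account_per_state()
print(request_body)
```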
