Elasticsearch Aggregation of a large list

I'm trying to count how many times ingredients show up in different documents. My index body is similar to this:
index_body = {
    "settings": {
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 4,
            "refresh_interval": "-1",
            "knn": "true"
        }
    },
    "mappings": {
        "properties": {
            "recipe_id": {
                "type": "keyword"
            },
            "recipe_title": {
                "type": "text",
                "analyzer": "standard",
                "similarity": "BM25"
            },
            "description": {
                "type": "text",
                "analyzer": "standard",
                "similarity": "BM25"
            },
            "ingredient": {
                "type": "keyword"
            },
            "image": {
                "type": "keyword"
            },
            ....
        }
    }
}
In the ingredient field I've stored an array of ingredient strings, [ingredient1, ingredient2, ....].
I have around 900 documents, each with its own ingredient list.
I've tried using Elasticsearch's aggregations, but they don't return what I expected.
Here is the query I've been using:
{
    "size": 0,
    "aggs": {
        "ingredients": {
            "terms": { "field": "ingredient" }
        }
    }
}
But it returns this:
{'took': 4,
 'timed_out': False,
 '_shards': {'total': 4, 'successful': 4, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 994, 'relation': 'eq'}, 'max_score': None, 'hits': []},
 'aggregations': {'ingredients': {'doc_count_error_upper_bound': 56,
   'sum_other_doc_count': 4709,
   'buckets': [{'key': 'salt', 'doc_count': 631},
    {'key': 'oil', 'doc_count': 320},
    {'key': 'sugar', 'doc_count': 314},
    {'key': 'egg', 'doc_count': 302},
    {'key': 'butter', 'doc_count': 291},
    {'key': 'flour', 'doc_count': 264},
    {'key': 'garlic', 'doc_count': 220},
    {'key': 'ground pepper', 'doc_count': 185},
    {'key': 'vanilla extract', 'doc_count': 146},
    {'key': 'lemon', 'doc_count': 131}]}}}
This is clearly incomplete, as I have many more ingredients than that. What am I doing wrong? Why is it returning only these ten? Is there a way to force Elasticsearch to return all the counts?

You need to specify size inside the aggregation. By default, the terms aggregation returns only the top 10 buckets; everything else is summarized in sum_other_doc_count (4709 in your response). Set size high enough to cover all your distinct ingredients:
{
    "size": 0,
    "aggs": {
        "ingredients": {
            "terms": { "field": "ingredient", "size": 10000 }
        }
    }
}
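For completeness, here is a minimal sketch of running that aggregation from Python and printing every count. The host, the recipes index name, and the 7.x-style client call are assumptions, so substitute your own:

from elasticsearch import Elasticsearch

# Placeholder host and index name; adjust to your deployment.
es = Elasticsearch("http://localhost:9200")

query = {
    "size": 0,
    "aggs": {
        "ingredients": {
            # size must cover the number of distinct ingredients
            "terms": {"field": "ingredient", "size": 10000}
        }
    }
}

resp = es.search(index="recipes", body=query)
for bucket in resp["aggregations"]["ingredients"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])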

Related

Parsing the Hugging Face Transformer Output

I am looking to use the bert-english-uncased-finetuned-pos transformer, mentioned here:
https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos?text=My+name+is+Clara+and+I+live+in+Berkeley%2C+California.
I am querying the transformer this way...
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
model = AutoModelForTokenClassification.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
text = "My name is Clara and I live in Berkeley, California."
input_ids = tokenizer.encode(text + '</s>', return_tensors='pt')
outputs = model(input_ids)
But the output looks like this:
(tensor([[[-1.8196e+00, -1.9783e+00, -1.7416e+00,  1.2082e+00, -7.0337e-02,
           -7.0322e-03,  3.4300e-01, -9.6914e-01, -1.3546e+00,  7.7266e-03,
            3.7128e+00, -3.4061e-01,  4.8385e+00, -1.2548e+00, -5.1845e-01,
            7.0140e-01,  1.0394e+00],
          ...
          [-9.5640e-01, -5.7213e-01, -1.0245e+00, -5.3566e-01, -1.5287e-01,
           -6.6977e-01, -5.3392e-01, -3.1967e-02, -7.3077e-01, -3.1048e-01,
           -7.2973e-01, -3.1701e-01,  1.0196e+01, -5.2346e-01,  4.0820e-01,
           -2.1350e-01,  1.0340e+00]]], grad_fn=<...>),)
But as per the documentation, I am expecting the output to be in a JSON format like this:
[
  { "entity_group": "PRON",  "score": 0.9994694590568542, "word": "my" },
  { "entity_group": "NOUN",  "score": 0.997125506401062,  "word": "name" },
  { "entity_group": "AUX",   "score": 0.9938186407089233, "word": "is" },
  { "entity_group": "PROPN", "score": 0.9983252882957458, "word": "clara" },
  { "entity_group": "CCONJ", "score": 0.9991229772567749, "word": "and" },
  { "entity_group": "PRON",  "score": 0.9994894862174988, "word": "i" },
  { "entity_group": "VERB",  "score": 0.9983153939247131, "word": "live" },
  { "entity_group": "ADP",   "score": 0.999370276927948,  "word": "in" },
  { "entity_group": "PROPN", "score": 0.9987357258796692, "word": "berkeley" },
  { "entity_group": "PUNCT", "score": 0.9996636509895325, "word": "," },
  { "entity_group": "PROPN", "score": 0.9985638856887817, "word": "california" },
  { "entity_group": "PUNCT", "score": 0.9996631145477295, "word": "." }
]
What am I doing wrong? How can I parse the current output into the desired JSON output?
What you see there is the proprietary inference API from Hugging Face. This API is not part of the transformers library, but you can build something similar. All you need is the TokenClassificationPipeline:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
tokenizer = AutoTokenizer.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
model = AutoModelForTokenClassification.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
p = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
p('My name is Clara and I live in Berkeley, California.')
Output:
[{'word': 'my', 'score': 0.9994694590568542, 'entity': 'PRON', 'index': 1},
 {'word': 'name', 'score': 0.9971255660057068, 'entity': 'NOUN', 'index': 2},
 {'word': 'is', 'score': 0.9938186407089233, 'entity': 'AUX', 'index': 3},
 {'word': 'clara', 'score': 0.9983252882957458, 'entity': 'PROPN', 'index': 4},
 {'word': 'and', 'score': 0.9991229772567749, 'entity': 'CCONJ', 'index': 5},
 {'word': 'i', 'score': 0.9994894862174988, 'entity': 'PRON', 'index': 6},
 {'word': 'live', 'score': 0.9983154535293579, 'entity': 'VERB', 'index': 7},
 {'word': 'in', 'score': 0.999370276927948, 'entity': 'ADP', 'index': 8},
 {'word': 'berkeley', 'score': 0.9987357258796692, 'entity': 'PROPN', 'index': 9},
 {'word': ',', 'score': 0.9996636509895325, 'entity': 'PUNCT', 'index': 10},
 {'word': 'california', 'score': 0.9985638856887817, 'entity': 'PROPN', 'index': 11},
 {'word': '.', 'score': 0.9996631145477295, 'entity': 'PUNCT', 'index': 12}]
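If you also want the entity_group-style grouping that the hosted API shows, the pipeline accepts a grouped_entities flag. This is a sketch assuming a transformers version from around the time of the question; newer releases renamed the parameter to aggregation_strategy:

# model and tokenizer as loaded above; grouped_entities merges consecutive
# tokens that share an entity into one entry, like the hosted API's output
p = TokenClassificationPipeline(model=model, tokenizer=tokenizer,
                                grouped_entities=True)
p('My name is Clara and I live in Berkeley, California.')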
You can find the other available pipelines, which might be used by the inference API, in the transformers pipelines documentation.

cannot aggregate in elasticsearch

I have a service that logs to Elasticsearch, and I want to find the users who have used it.
My request returns detailed log lines, but what I want is the unique values of "kubernetes.pod_name":
{
    "size": 10000,
    "_source": ["kubernetes.pod_name"],
    "query": {
        "bool": {
            "filter": [
                { "match": { "kubernetes.labels.app": "jupyterhub" } },
                { "match_phrase": { "log": "200 GET" } }
            ]
        }
    },
    "aggs": {
        "pods": { "terms": { "field": "kubernetes.pod_name" } }
    }
}
Why aren't the log lines grouped by the "aggs" section? What should I do to get the unique users?
Update:
My query returns:
{'took': 614,
'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': 17703,
'max_score': 0.0,
'hits': [{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': 'vQ6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-lyisova-2evg'}}},
{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': 'xA6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-lyisova-2evg'}}},
{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': '6g6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-bogdanov'}}},
...
I want to get 20 lines instead of 17703 where each line corresponds to a unique "kubernetes.pod_name"
You can combine a terms aggregation with a filter aggregation. Note that the filter aggregation takes a single query, so the two match clauses go inside a bool filter; adding "size": 0 also stops the 17,703 raw hits from being returned alongside the buckets:
{
    "size": 0,
    "aggs": {
        "labels_filter": {
            "filter": {
                "bool": {
                    "filter": [
                        { "match": { "kubernetes.labels.app": "jupyterhub" } },
                        { "match_phrase": { "log": "200 GET" } }
                    ]
                }
            },
            "aggs": {
                "pods": {
                    "terms": { "field": "kubernetes.pod_name" }
                }
            }
        }
    }
}
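The unique pod names then come back as bucket keys. As a sketch with the Python client (the host, the index pattern, and the bumped terms size are assumptions):

from elasticsearch import Elasticsearch

# Placeholder host and index pattern; adjust to your cluster.
es = Elasticsearch("http://localhost:9200")

query = {
    "size": 0,
    "aggs": {
        "labels_filter": {
            "filter": {"bool": {"filter": [
                {"match": {"kubernetes.labels.app": "jupyterhub"}},
                {"match_phrase": {"log": "200 GET"}},
            ]}},
            # raise size if you expect more than 10 distinct pods
            "aggs": {"pods": {"terms": {"field": "kubernetes.pod_name", "size": 100}}},
        }
    },
}

resp = es.search(index="dwh-dev-*", body=query)
print([b["key"] for b in resp["aggregations"]["labels_filter"]["pods"]["buckets"]])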

Aggregating values inside an array of arrays in Elasticsearch

I have a JSON structure like this:
[{
    "id": 1,
    "result": [{
        "score": 0.0,
        "result_rules": [
            { "rule_id": "sr-1" },
            { "rule_id": "sr-2" }
        ]
    }]
},
{
    "id": 2,
    "result": [{
        "score": 0.0,
        "result_rules": [
            { "rule_id": "sr-1" },
            { "rule_id": "sr-4" }
        ]
    }]
}]
I want to count rule_id values, so the result would be:
[
    { "rule_id": "sr-1", "doc_count": 2 },
    { "rule_id": "sr-2", "doc_count": 1 },
    { "rule_id": "sr-4", "doc_count": 1 }
]
I've tried something like this, but it returns an empty aggregation:
{
    "aggs": {
        "group_by_rule_id": {
            "terms": {
                "field": "result.result_rules.rule_id.keyword"
            }
        }
    }
}
For aggregations over a nested structure you have to map the fields as nested and use a nested aggregation.
See the example in the ES DOC.
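A sketch of what that could look like for this document shape (the field names follow the question; mapping both result and result_rules as nested is an assumption). First the mapping:
{
    "mappings": {
        "properties": {
            "result": {
                "type": "nested",
                "properties": {
                    "result_rules": {
                        "type": "nested",
                        "properties": {
                            "rule_id": { "type": "keyword" }
                        }
                    }
                }
            }
        }
    }
}
Then a nested aggregation that steps down to the inner path before the terms aggregation runs:
{
    "size": 0,
    "aggs": {
        "rules": {
            "nested": { "path": "result.result_rules" },
            "aggs": {
                "group_by_rule_id": {
                    "terms": { "field": "result.result_rules.rule_id" }
                }
            }
        }
    }
}
With rule_id mapped as keyword, the .keyword suffix from the original query is no longer needed.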

elasticsearch response hits are not showing up

I am using Elasticsearch, and after running a search this is the response I get:
{'took': 7, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 1, 'max_score': 0.2876821, 'hits': []}}
My question is why is hits.total = 1 but hits.hits is empty?
Here is the query I used:
"query": {
"bool": {
"must": [
{"match_phrase": { "theName": "bill" } }
]
}
}
I know the data exists in my node because when I ran the search below (with the same URL, index, and type in the POST request), hits.hits was filled with results.
"query" : {
"match_all" : {}
}
I had from = 40 in my request, and that was causing the issue: with only one matching document, a from of 40 pages past it, so hits.total is 1 while hits.hits comes back empty.
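For illustration, dropping from (or setting it to 0) fixes it; a hypothetical corrected body, reusing the query from the question:
{
    "from": 0,
    "query": {
        "bool": {
            "must": [
                { "match_phrase": { "theName": "bill" } }
            ]
        }
    }
}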

Elasticsearch: does not give back result when searching for a simple 'a' character

I want to store tags for messages in ElasticSearch. I've defined the tags field as this:
{
'tags': {
'type': 'string',
'index_name': 'tag'
}
}
For a message I've stored the following list in the tags field:
['a','b','c']
Now if I try to search for tag 'b' with the following query, it gives back the message and the tags:
{
'filter': {
'limit': {
'value': 100
}
},
'query': {
'bool': {
'should': [
{
'text': {
'tags': 'b'
}
}
],
'minimum_number_should_match': 1
}
}
}
The same goes for tag 'c'.
But if I search for tag 'a' with this:
{
'filter': {
'limit': {
'value': 100
}
},
'query': {
'bool': {
'should': [
{
'text': {
'tags': 'a'
}
}
],
'minimum_number_should_match': 1
}
}
}
It gives back no results at all! The response is:
{
'hits': {
'hits': [],
'total': 0,
'max_score': None
},
'_shards': {
'successful': 5,
'failed': 0,
'total': 5
},
'took': 1,
'timed_out': False
}
What am I doing wrong? (It doesn't matter that the 'a' is the first element of the list; the same goes for ['b','a','c'].) It seems to have problems only with the single 'a' character.
If you didn't set any analyzer or mapping on your index, Elasticsearch uses its default analyzer. Elasticsearch's default analyzer has a stopword filter that by default ignores English stopwords such as:
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
Before going further, check the Elasticsearch mapping and analyzer guides:
Analyzer Guide
Mapping Guide
There might be some stemming or stop word lists involved. Try making sure the field is not analyzed.
'tags': {'type': 'string', 'index_name': 'tag', "index" : "not_analyzed"}
Similar: matching whole string with dashes in elasticsearch
