I want to use the bert-english-uncased-finetuned-pos transformer, described here:
https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos?text=My+name+is+Clara+and+I+live+in+Berkeley%2C+California.
I am querying the transformer this way...
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
model = AutoModelForTokenClassification.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
text = "My name is Clara and I live in Berkeley, California."
input_ids = tokenizer.encode(text + '</s>', return_tensors='pt')
outputs = model(input_ids)
But the output looks something like this:
(tensor([[[-1.8196e+00, -1.9783e+00, -1.7416e+00, 1.2082e+00, -7.0337e-02,
-7.0322e-03, 3.4300e-01, -9.6914e-01, -1.3546e+00, 7.7266e-03,
3.7128e+00, -3.4061e-01, 4.8385e+00, -1.2548e+00, -5.1845e-01,
7.0140e-01, 1.0394e+00],
[-1.2702e+00, -1.5518e+00, -1.1553e+00, -4.4077e-01, -9.8661e-01,
-3.2680e-01, -6.5338e-01, -3.9779e-01, -7.5383e-01, -1.2677e+00,
9.6353e+00, 1.9938e-01, -1.0282e+00, -7.5071e-01, -1.0307e+00,
-8.0589e-01, 4.2073e-01],
[-9.6988e-01, -5.0090e-01, -1.3858e+00, -1.0554e+00, -1.4040e+00,
-7.5977e-01, -7.4156e-01, 8.0594e+00, -5.1854e-01, -1.9098e+00,
-1.6362e-02, 1.0594e+00, -8.4962e-01, -1.7415e+00, -1.0628e+00,
-1.7485e-01, -1.1490e+00],
[-1.4368e+00, -1.6313e-01, -1.3202e+00, 8.7465e+00, -1.3782e+00,
-9.8889e-01, -1.1371e+00, -1.0917e+00, -9.8495e-01, -9.3237e-01,
-9.6111e-01, -4.1658e-01, -7.3133e-01, -9.6004e-01, -9.5337e-01,
3.1836e+00, -8.3462e-01],
[-7.9476e-01, -7.9640e-01, -9.0027e-01, -6.9506e-01, -8.9706e-01,
-6.9383e-01, -3.1590e-01, 1.2390e+00, -1.0443e+00, -9.9977e-01,
-8.8189e-01, 8.7941e+00, -9.9445e-01, -1.2076e+00, -1.1424e+00,
-9.7801e-01, 5.6683e-01],
[-8.2837e-01, -5.5060e-01, -2.1352e-01, -8.8721e-01, 9.5536e+00,
1.0478e+00, -5.6208e-01, -7.1037e-01, -7.0248e-01, 1.1298e-01
...
-7.3788e-01, 4.3640e-03, 1.6994e+00, 1.1528e-01, -1.0983e+00,
-8.9202e-01, -1.2869e+00, 4.9141e+00, -6.2096e-01, 4.8374e+00,
3.2384e-01, 4.6213e-01],
[-1.3622e+00, 2.0772e+00, -1.6680e+00, -8.8679e-01, -8.6959e-01,
-1.7468e+00, -1.1424e+00, 1.6996e+00, 3.5800e-01, -4.3927e-01,
-3.6129e-01, -4.2220e-01, -1.7912e+00, 8.0154e-01, 7.4594e-01,
-1.0620e+00, 3.8152e+00],
[-1.2889e+00, -2.9379e-01, -1.6543e+00, -4.3326e-01, -2.4919e-01,
-4.0112e-01, -4.4255e-01, 2.2697e-01, -4.6042e-01, -3.7862e-03,
-6.3061e-01, -1.3280e+00, 8.5533e+00, -4.6881e-01, 2.3882e+00,
2.4533e-01, -1.4095e-01],
[-9.5640e-01, -5.7213e-01, -1.0245e+00, -5.3566e-01, -1.5287e-01,
-6.6977e-01, -5.3392e-01, -3.1967e-02, -7.3077e-01, -3.1048e-01,
-7.2973e-01, -3.1701e-01, 1.0196e+01, -5.2346e-01, 4.0820e-01,
-2.1350e-01, 1.0340e+00]]], grad_fn=),)
But according to the documentation, I expected the output to be in JSON format:
[ {
"entity_group": "PRON",
"score": 0.9994694590568542,
"word": "my" }, {
"entity_group": "NOUN",
"score": 0.997125506401062,
"word": "name" }, {
"entity_group": "AUX",
"score": 0.9938186407089233,
"word": "is" }, {
"entity_group": "PROPN",
"score": 0.9983252882957458,
"word": "clara" }, {
"entity_group": "CCONJ",
"score": 0.9991229772567749,
"word": "and" }, {
"entity_group": "PRON",
"score": 0.9994894862174988,
"word": "i" }, {
"entity_group": "VERB",
"score": 0.9983153939247131,
"word": "live" }, {
"entity_group": "ADP",
"score": 0.999370276927948,
"word": "in" }, {
"entity_group": "PROPN",
"score": 0.9987357258796692,
"word": "berkeley" }, {
"entity_group": "PUNCT",
"score": 0.9996636509895325,
"word": "," }, {
"entity_group": "PROPN",
"score": 0.9985638856887817,
"word": "california" }, {
"entity_group": "PUNCT",
"score": 0.9996631145477295,
"word": "." } ]
What am I doing wrong? How can I parse the current output to the desired JSON output?
What you see there is the proprietary Inference API from Hugging Face. This API is not part of the transformers library, but you can build something similar. All you need is the TokenClassificationPipeline:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
tokenizer = AutoTokenizer.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
model = AutoModelForTokenClassification.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
p = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
p('My name is Clara and I live in Berkeley, California.')
Output:
[{'word': 'my', 'score': 0.9994694590568542, 'entity': 'PRON', 'index': 1},
{'word': 'name', 'score': 0.9971255660057068, 'entity': 'NOUN', 'index': 2},
{'word': 'is', 'score': 0.9938186407089233, 'entity': 'AUX', 'index': 3},
{'word': 'clara', 'score': 0.9983252882957458, 'entity': 'PROPN', 'index': 4},
{'word': 'and', 'score': 0.9991229772567749, 'entity': 'CCONJ', 'index': 5},
{'word': 'i', 'score': 0.9994894862174988, 'entity': 'PRON', 'index': 6},
{'word': 'live', 'score': 0.9983154535293579, 'entity': 'VERB', 'index': 7},
{'word': 'in', 'score': 0.999370276927948, 'entity': 'ADP', 'index': 8},
{'word': 'berkeley',
'score': 0.9987357258796692,
'entity': 'PROPN',
'index': 9},
{'word': ',', 'score': 0.9996636509895325, 'entity': 'PUNCT', 'index': 10},
{'word': 'california',
'score': 0.9985638856887817,
'entity': 'PROPN',
'index': 11},
{'word': '.', 'score': 0.9996631145477295, 'entity': 'PUNCT', 'index': 12}]
You can find the other available pipelines which might be used by the inference API here.
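If you would rather decode the raw tensor yourself, the idea is a softmax over each token's row of logits followed by an argmax. The sketch below is a minimal illustration in plain Python, assuming a made-up three-label mapping and made-up logits. With the real model, the logits would come from outputs[0][0].tolist(), the tokens from tokenizer.convert_ids_to_tokens(input_ids[0].tolist()), and the label names from model.config.id2label. The real token list also contains [CLS] and [SEP], which you would probably want to skip.

```python
import json
import math


def softmax(row):
    """Turn one row of raw logits into probabilities that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def decode(tokens, logits, id2label):
    """Pick the highest-scoring label for each token."""
    results = []
    for token, row in zip(tokens, logits):
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append(
            {"word": token, "score": probs[best], "entity": id2label[best]}
        )
    return results


# Hypothetical three-label mapping and made-up logits, for illustration only.
# The real model has 17 labels, available as model.config.id2label.
id2label = {0: "NOUN", 1: "PRON", 2: "VERB"}
tokens = ["my", "name"]
logits = [[-1.0, 6.0, 0.5], [7.5, -0.5, 0.0]]

decoded = decode(tokens, logits, id2label)
print(json.dumps(decoded, indent=2))
```

Because decode returns plain Python floats and strings, json.dumps can serialize the result directly. If you pass pipeline output to json.dumps instead, you may need to convert the scores with float() first.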
I have a service whose logs are stored in Elasticsearch. I want to find the users who have used my service.
My request returns the detailed log lines, but I only want the unique "kubernetes.pod_name" values:
{
"size": 10000,
"_source": ["kubernetes.pod_name"],
"query": {"bool": {"filter": [
{"match": {"kubernetes.labels.app" : "jupyterhub"}},
{"match_phrase": {"log": "200 GET"}}
]}},
"aggs": {"pods": {"terms": {"field": "kubernetes.pod_name"}}}
}
Why aren't the log lines grouped in the "aggs" section? How can I get the unique users?
Update:
My query returns:
{'took': 614,
'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': 17703,
'max_score': 0.0,
'hits': [{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': 'vQ6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-lyisova-2evg'}}},
{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': 'xA6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-lyisova-2evg'}}},
{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': '6g6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-bogdanov'}}},
...
I want to get 20 lines instead of 17703, where each line corresponds to a unique "kubernetes.pod_name".
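You can combine a terms aggregation with a filter aggregation. A filter aggregation takes a single query, so the two conditions are wrapped in a bool query, and "size": 0 suppresses the individual hits so that only the buckets come back:
{
  "size": 0,
  "aggs": {
    "labels_filter": {
      "filter": {
        "bool": {
          "filter": [
            {
              "match": {
                "kubernetes.labels.app": "jupyterhub"
              }
            },
            {
              "match_phrase": {
                "log": "200 GET"
              }
            }
          ]
        }
      },
      "aggs": {
        "pods": {
          "terms": {
            "field": "kubernetes.pod_name"
          }
        }
      }
    }
  }
}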
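The unique pod names then sit in the buckets of the nested "pods" aggregation, not in "hits". Since the response in the question is printed as a Python dict, here is a minimal sketch of pulling them out. The response fragment is made up for illustration (the doc_count values are not real), and unique_pod_names is a hypothetical helper name.

```python
def unique_pod_names(response):
    """Return the bucket keys of the nested "pods" terms aggregation."""
    buckets = response["aggregations"]["labels_filter"]["pods"]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Made-up response fragment; the doc_count values are illustrative only.
sample_response = {
    "aggregations": {
        "labels_filter": {
            "doc_count": 3,
            "pods": {
                "buckets": [
                    {"key": "jupyter-lyisova-2evg", "doc_count": 2},
                    {"key": "jupyter-bogdanov", "doc_count": 1},
                ]
            },
        }
    }
}

pods = unique_pod_names(sample_response)
print(pods)
```

A terms aggregation returns only the top 10 buckets by default, so to see around 20 pods you would set "size" inside the "terms" block to a larger value.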
I want to store tags for messages in Elasticsearch. I've defined the tags field like this:
{
'tags': {
'type': 'string',
'index_name': 'tag'
}
}
For a message I've stored the following list in the tags field:
['a','b','c']
Now if I try to search for tag 'b' with the following query, it gives back the message and the tags:
{
'filter': {
'limit': {
'value': 100
}
},
'query': {
'bool': {
'should': [
{
'text': {
'tags': 'b'
}
}
],
'minimum_number_should_match': 1
}
}
}
The same goes for tag 'c'.
But if I search for tag 'a' with this:
{
'filter': {
'limit': {
'value': 100
}
},
'query': {
'bool': {
'should': [
{
'text': {
'tags': 'a'
}
}
],
'minimum_number_should_match': 1
}
}
}
It gives back no results at all!
The response is:
{
'hits': {
'hits': [],
'total': 0,
'max_score': None
},
'_shards': {
'successful': 5,
'failed': 0,
'total': 5
},
'took': 1,
'timed_out': False
}
What am I doing wrong? (It doesn't matter that 'a' is the first element of the list; the same happens with ['b','a','c']. It seems to have a problem only with the single character 'a'.)
If you didn't set an analyzer or a mapping for your index, Elasticsearch uses its default analyzer. That default analyzer includes a stopword filter which, by default, drops English stopwords such as:
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
Before going any further, check the Elasticsearch analyzer and mapping guides:
Analyzer Guide
Mapping Guide
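To see why only 'a' goes missing, it can help to imitate the stopword step by hand. The sketch below is a simplified stand-in written in plain Python, not Elasticsearch's actual analyzer: it only lowercases, splits on whitespace, and drops the stopwords listed above. The function name analyze is hypothetical.

```python
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with",
}


def analyze(values, stopwords=ENGLISH_STOPWORDS):
    """Lowercase each value, split on whitespace, and drop stopwords."""
    terms = []
    for value in values:
        for token in value.lower().split():
            if token not in stopwords:
                terms.append(token)
    return terms


indexed_terms = analyze(["a", "b", "c"])
print(indexed_terms)  # ['b', 'c']
```

The tag 'a' never reaches the index, and a search for 'a' is analyzed the same way and ends up with no terms at all, which is consistent with the empty result in the question.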
There might be some stemming or stop word lists involved. Try making sure the field is not analyzed.
'tags': {'type': 'string', 'index_name': 'tag', 'index': 'not_analyzed'}
Similar: matching whole string with dashes in elasticsearch