Parsing the Hugging Face Transformer Output - huggingface-transformers

I am looking to use the bert-english-uncased-finetuned-pos transformer, mentioned here:
https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos?text=My+name+is+Clara+and+I+live+in+Berkeley%2C+California.
I am querying the transformer this way...
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
model = AutoModelForTokenClassification.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
text = "My name is Clara and I live in Berkeley, California."
input_ids = tokenizer.encode(text + '</s>', return_tensors='pt')
outputs = model(input_ids)
But the output I am getting looks like this:
(tensor([[[-1.8196e+00, -1.9783e+00, -1.7416e+00, 1.2082e+00, -7.0337e-02,
-7.0322e-03, 3.4300e-01, -9.6914e-01, -1.3546e+00, 7.7266e-03,
3.7128e+00, -3.4061e-01, 4.8385e+00, -1.2548e+00, -5.1845e-01,
7.0140e-01, 1.0394e+00],
[-1.2702e+00, -1.5518e+00, -1.1553e+00, -4.4077e-01, -9.8661e-01,
-3.2680e-01, -6.5338e-01, -3.9779e-01, -7.5383e-01, -1.2677e+00,
9.6353e+00, 1.9938e-01, -1.0282e+00, -7.5071e-01, -1.0307e+00,
-8.0589e-01, 4.2073e-01],
[-9.6988e-01, -5.0090e-01, -1.3858e+00, -1.0554e+00, -1.4040e+00,
-7.5977e-01, -7.4156e-01, 8.0594e+00, -5.1854e-01, -1.9098e+00,
-1.6362e-02, 1.0594e+00, -8.4962e-01, -1.7415e+00, -1.0628e+00,
-1.7485e-01, -1.1490e+00],
[-1.4368e+00, -1.6313e-01, -1.3202e+00, 8.7465e+00, -1.3782e+00,
-9.8889e-01, -1.1371e+00, -1.0917e+00, -9.8495e-01, -9.3237e-01,
-9.6111e-01, -4.1658e-01, -7.3133e-01, -9.6004e-01, -9.5337e-01,
3.1836e+00, -8.3462e-01],
[-7.9476e-01, -7.9640e-01, -9.0027e-01, -6.9506e-01, -8.9706e-01,
-6.9383e-01, -3.1590e-01, 1.2390e+00, -1.0443e+00, -9.9977e-01,
-8.8189e-01, 8.7941e+00, -9.9445e-01, -1.2076e+00, -1.1424e+00,
-9.7801e-01, 5.6683e-01],
[-8.2837e-01, -5.5060e-01, -2.1352e-01, -8.8721e-01, 9.5536e+00,
1.0478e+00, -5.6208e-01, -7.1037e-01, -7.0248e-01, 1.1298e-01
...
-7.3788e-01, 4.3640e-03, 1.6994e+00, 1.1528e-01, -1.0983e+00,
-8.9202e-01, -1.2869e+00, 4.9141e+00, -6.2096e-01, 4.8374e+00,
3.2384e-01, 4.6213e-01],
[-1.3622e+00, 2.0772e+00, -1.6680e+00, -8.8679e-01, -8.6959e-01,
-1.7468e+00, -1.1424e+00, 1.6996e+00, 3.5800e-01, -4.3927e-01,
-3.6129e-01, -4.2220e-01, -1.7912e+00, 8.0154e-01, 7.4594e-01,
-1.0620e+00, 3.8152e+00],
[-1.2889e+00, -2.9379e-01, -1.6543e+00, -4.3326e-01, -2.4919e-01,
-4.0112e-01, -4.4255e-01, 2.2697e-01, -4.6042e-01, -3.7862e-03,
-6.3061e-01, -1.3280e+00, 8.5533e+00, -4.6881e-01, 2.3882e+00,
2.4533e-01, -1.4095e-01],
[-9.5640e-01, -5.7213e-01, -1.0245e+00, -5.3566e-01, -1.5287e-01,
-6.6977e-01, -5.3392e-01, -3.1967e-02, -7.3077e-01, -3.1048e-01,
-7.2973e-01, -3.1701e-01, 1.0196e+01, -5.2346e-01, 4.0820e-01,
-2.1350e-01, 1.0340e+00]]], grad_fn=),)
But as per the documentation, I was expecting the output to be in a JSON format like this:
[
  {"entity_group": "PRON", "score": 0.9994694590568542, "word": "my"},
  {"entity_group": "NOUN", "score": 0.997125506401062, "word": "name"},
  {"entity_group": "AUX", "score": 0.9938186407089233, "word": "is"},
  {"entity_group": "PROPN", "score": 0.9983252882957458, "word": "clara"},
  {"entity_group": "CCONJ", "score": 0.9991229772567749, "word": "and"},
  {"entity_group": "PRON", "score": 0.9994894862174988, "word": "i"},
  {"entity_group": "VERB", "score": 0.9983153939247131, "word": "live"},
  {"entity_group": "ADP", "score": 0.999370276927948, "word": "in"},
  {"entity_group": "PROPN", "score": 0.9987357258796692, "word": "berkeley"},
  {"entity_group": "PUNCT", "score": 0.9996636509895325, "word": ","},
  {"entity_group": "PROPN", "score": 0.9985638856887817, "word": "california"},
  {"entity_group": "PUNCT", "score": 0.9996631145477295, "word": "."}
]
What am I doing wrong? How can I parse the current output to the desired JSON output?

What you see on the model page is the output of Hugging Face's hosted inference API. This API is not part of the transformers library, but you can build something similar. All you need is the TokenClassificationPipeline:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline
tokenizer = AutoTokenizer.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
model = AutoModelForTokenClassification.from_pretrained("vblagoje/bert-english-uncased-finetuned-pos")
p = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
p('My name is Clara and I live in Berkeley, California.')
Output:
[{'word': 'my', 'score': 0.9994694590568542, 'entity': 'PRON', 'index': 1},
{'word': 'name', 'score': 0.9971255660057068, 'entity': 'NOUN', 'index': 2},
{'word': 'is', 'score': 0.9938186407089233, 'entity': 'AUX', 'index': 3},
{'word': 'clara', 'score': 0.9983252882957458, 'entity': 'PROPN', 'index': 4},
{'word': 'and', 'score': 0.9991229772567749, 'entity': 'CCONJ', 'index': 5},
{'word': 'i', 'score': 0.9994894862174988, 'entity': 'PRON', 'index': 6},
{'word': 'live', 'score': 0.9983154535293579, 'entity': 'VERB', 'index': 7},
{'word': 'in', 'score': 0.999370276927948, 'entity': 'ADP', 'index': 8},
{'word': 'berkeley', 'score': 0.9987357258796692, 'entity': 'PROPN', 'index': 9},
{'word': ',', 'score': 0.9996636509895325, 'entity': 'PUNCT', 'index': 10},
{'word': 'california', 'score': 0.9985638856887817, 'entity': 'PROPN', 'index': 11},
{'word': '.', 'score': 0.9996631145477295, 'entity': 'PUNCT', 'index': 12}]
You can find the other available pipelines which might be used by the inference API here.
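Note that the pipeline above reports one entry per token with an entity key, while the hosted API merges word pieces and reports an entity_group. If you want something closer to that, here is a minimal sketch, assuming a transformers version whose pipeline accepts grouped_entities (older releases) or aggregation_strategy (newer ones):
from transformers import pipeline

# Minimal sketch: let the token-classification pipeline merge word pieces into groups.
# Older transformers releases take grouped_entities=True; newer ones use
# aggregation_strategy="simple" instead.
nlp = pipeline(
    "ner",
    model="vblagoje/bert-english-uncased-finetuned-pos",
    tokenizer="vblagoje/bert-english-uncased-finetuned-pos",
    grouped_entities=True,  # or aggregation_strategy="simple" on newer versions
)

result = nlp("My name is Clara and I live in Berkeley, California.")
print(result)  # each dict should now contain an 'entity_group' key, like the hosted API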

Related

Elasticsearch Aggregation of large list

I'm trying to count how many times ingredients show up across different documents. My index body is similar to this:
index_body = {
    "settings": {
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 4,
            "refresh_interval": "-1",
            "knn": "true"
        }
    },
    "mappings": {
        "properties": {
            "recipe_id": {
                "type": "keyword"
            },
            "recipe_title": {
                "type": "text",
                "analyzer": "standard",
                "similarity": "BM25"
            },
            "description": {
                "type": "text",
                "analyzer": "standard",
                "similarity": "BM25"
            },
            "ingredient": {
                "type": "keyword"
            },
            "image": {
                "type": "keyword"
            },
            ....
        }
    }
}
In the ingredient field I've stored an array of strings, one per ingredient: [ingredient1, ingredient2, ....]
I have around 900 documents, each with its own ingredients list.
I've tried using Elasticsearch's aggregations, but they don't return what I expected.
Here is the query I've been using:
{
    "size": 0,
    "aggs": {
        "ingredients": {
            "terms": {"field": "ingredient"}
        }
    }
}
But it returns this:
{'took': 4, 'timed_out': False,
 '_shards': {'total': 4, 'successful': 4, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 994, 'relation': 'eq'}, 'max_score': None, 'hits': []},
 'aggregations': {'ingredients': {'doc_count_error_upper_bound': 56,
   'sum_other_doc_count': 4709,
   'buckets': [{'key': 'salt', 'doc_count': 631},
    {'key': 'oil', 'doc_count': 320},
    {'key': 'sugar', 'doc_count': 314},
    {'key': 'egg', 'doc_count': 302},
    {'key': 'butter', 'doc_count': 291},
    {'key': 'flour', 'doc_count': 264},
    {'key': 'garlic', 'doc_count': 220},
    {'key': 'ground pepper', 'doc_count': 185},
    {'key': 'vanilla extract', 'doc_count': 146},
    {'key': 'lemon', 'doc_count': 131}]}}}
This is clearly wrong, as I have many ingredients. What am I doing wrong? Why is it returning only these ones? Is there a way to force Elasticsearch to return all counts?
By default, the terms aggregation only returns the top 10 buckets. You need to specify size inside the aggregation:
{
    "size": 0,
    "aggs": {
        "ingredients": {
            "terms": {"field": "ingredient", "size": 10000}
        }
    }
}

OData - How to filter on child elements?

We have an endpoint where the payload looks like this:
{
    "#odata.context": "http://localhost/AbcWebApi/v1.0/-/ABC/AP/$metadata#APDistributionSets/$entity",
    "DistributionSetKey": "CC",
    "Description": "Credit cards",
    "Status": "Active",
    "InactiveDate": null,
    "DateLastMaintained": "2010-08-18T00:00:00Z",
    "CodeType": "Purchase",
    "DistributionMethod": "Manual",
    "DistributionsEntered": 3,
    "Currency": "CAD",
    "DistributionSetDetails": [
        {
            "DistributionSet": "CC",
            "LineNumber": 1,
            "DistributionCode": "VISA",
            "Description": "VISA card",
            "GLAccount": "1022",
            "Discountable": "Yes",
            "Percentage": 0,
            "Amount": 0,
            "UpdateOperation": "Unspecified"
        },
        {
            "DistributionSet": "CC",
            "LineNumber": 2,
            "DistributionCode": "MASTER",
            "Description": "Mastercard",
            "GLAccount": "1023",
            "Discountable": "Yes",
            "Percentage": 0,
            "Amount": 0,
            "UpdateOperation": "Unspecified"
        },
        {
            "DistributionSet": "CC",
            "LineNumber": 3,
            "DistributionCode": "AMEX",
            "Description": "American Express",
            "GLAccount": "1021",
            "Discountable": "Yes",
            "Percentage": 0,
            "Amount": 0,
            "UpdateOperation": "Unspecified"
        }
    ],
    "UpdateOperation": "Unspecified"
}
Basically, there is a DistributionSetKey called CC, and under CC there are child elements like VISA, AMEX, MC, etc. Can I filter by VISA, which is the DistributionCode?
This is what I have tried:
http://localhost/AbcWebApi/v1.0/-/ABC/AP/APDistributionSets?$filter=(DistributionSetDetails/DistributionCode eq 'VISA')
And this is the error I'm getting:
"The parent value for a property access of a property 'DistributionCode' is not a single value. Property access can only be applied to a single value."
Any advice? Thanks.
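For what it's worth, OData v4 handles this case with lambda operators: a collection-valued property such as DistributionSetDetails cannot be addressed with a plain path (hence the "not a single value" error), but it can usually be filtered with any() or all(). Assuming the endpoint supports lambda operators, the request would look roughly like this:
http://localhost/AbcWebApi/v1.0/-/ABC/AP/APDistributionSets?$filter=DistributionSetDetails/any(d: d/DistributionCode eq 'VISA')
Whether this works depends on the service's OData implementation.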

Cannot aggregate in Elasticsearch

I have a service whose logs are in Elasticsearch, and I want to find the users who have used my service.
My request returns detailed log lines, but I want to get the unique values of "kubernetes.pod_name":
{
    "size": 10000,
    "_source": ["kubernetes.pod_name"],
    "query": {
        "bool": {
            "filter": [
                {"match": {"kubernetes.labels.app": "jupyterhub"}},
                {"match_phrase": {"log": "200 GET"}}
            ]
        }
    },
    "aggs": {
        "pods": {"terms": {"field": "kubernetes.pod_name"}}
    }
}
Why aren't the log lines grouped by the "aggs" section? What should I do to get unique users?
Update:
My query returns:
{'took': 614,
'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': 17703,
'max_score': 0.0,
'hits': [{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': 'vQ6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-lyisova-2evg'}}},
{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': 'xA6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-lyisova-2evg'}}},
{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': '6g6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-bogdanov'}}},
...
I want to get 20 lines instead of 17703, where each line corresponds to a unique "kubernetes.pod_name".
You can combine a terms aggregation with a filter aggregation:
{
    "aggs": {
        "labels_filter": {
            "filter": {
                "bool": {
                    "filter": [
                        {"match": {"kubernetes.labels.app": "jupyterhub"}},
                        {"match_phrase": {"log": "200 GET"}}
                    ]
                }
            },
            "aggs": {
                "pods": {
                    "terms": {"field": "kubernetes.pod_name"}
                }
            }
        }
    }
}
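A couple of practical notes. Since your original query already restricts the documents with the same bool filter, an alternative sketch is to keep that query and read only the aggregation from the response: set the top-level "size" to 0 so the 17703 raw hits are not returned, and raise "size" inside the terms aggregation (it defaults to 10 buckets) so every pod name comes back. Depending on the mapping, the aggregation may also need a keyword sub-field such as kubernetes.pod_name.keyword. Something like (the value 100 is just an illustrative bucket limit):
{
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"match": {"kubernetes.labels.app": "jupyterhub"}},
                {"match_phrase": {"log": "200 GET"}}
            ]
        }
    },
    "aggs": {
        "pods": {
            "terms": {"field": "kubernetes.pod_name", "size": 100}
        }
    }
}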

RASA NLU: Can't extract entity

I've trained my Rasa NLU model so that it recognizes the content between square brackets as a pst entity. For training, I covered both scenarios with more than 50 examples each.
There are two scenarios (the only difference is spacing):
When I pass http://www.google.comm, 1283923, [9283911,9309212,9283238], it recognizes only the [ bracket as the pst entity.
When I pass http://www.google.comm, 1283923, [9283911, 9309212, 9283238], it works fine and recognizes [9283911, 9309212, 9283238] as the pst entity, as expected.
For scenario 1, I've tried all the available pipelines, but they only recognize the opening square bracket [ as the pst entity.
In the response, I get this output:
{
    'intent': {
        'name': None,
        'confidence': 0.0
    },
    'entities': [
        {
            'start': 0,
            'end': 22,
            'value': 'http://www.google.comm',
            'entity': 'url',
            'confidence': 0.8052099168500071,
            'extractor': 'ner_crf'
        },
        {
            'start': 24,
            'end': 31,
            'value': '1283923',
            'entity': 'defect_id',
            'confidence': 0.8334249141074151,
            'extractor': 'ner_crf'
        },
        {
            'start': 33,
            'end': 34,
            'value': '[',
            'entity': 'pst',
            'confidence': 0.5615805162522188,
            'extractor': 'ner_crf'
        }
    ],
    'intent_ranking': [],
    'text': 'http://www.google.comm, 1283923, [9283911,9309212,9283238]'
}
So, can anyone tell me what I am missing in the configuration? The problem seems to be caused by the spacing alone, and the model should be able to handle it, since I provide training data for both scenarios.
It is a good idea to use a regex for your purpose. Rasa NLU supports extraction of entities by regex. Normal NLU training data looks something like this:
{
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "Hi",
                "intent": "greet",
                "entities": []
            }
        ]
    }
}
You can provide regex data for training in the NLU JSON file as below:
{
    "rasa_nlu_data": {
        "regex_features": [
            {
                "name": "pst",
                "pattern": "\\[..*\\]"
            }
        ]
    }
}
Reference: Regular Expression in Rasa NLU
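One caveat, as far as I understand the ner_crf-based pipelines: regex_features act as additional features for the CRF rather than extracting entities on their own, so the pipeline also needs a regex featurizer component (e.g. intent_entity_featurizer_regex), and the bracketed lists still have to be annotated as pst entities in common_examples, ideally covering both the spaced and unspaced variants.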

Elasticsearch: does not give back result when searching for a simple 'a' character

I want to store tags for messages in Elasticsearch. I've defined the tags field like this:
{
    'tags': {
        'type': 'string',
        'index_name': 'tag'
    }
}
For a message I've stored the following list in the tags field:
['a','b','c']
Now if I try to search for tag 'b' with the following query, it gives back the message and the tags:
{
    'filter': {
        'limit': {'value': 100}
    },
    'query': {
        'bool': {
            'should': [
                {'text': {'tags': 'b'}}
            ],
            'minimum_number_should_match': 1
        }
    }
}
The same goes for tag 'c'.
But if I search for tag 'a' with this:
{
    'filter': {
        'limit': {'value': 100}
    },
    'query': {
        'bool': {
            'should': [
                {'text': {'tags': 'a'}}
            ],
            'minimum_number_should_match': 1
        }
    }
}
It gives back no results at all!
The response is:
{
    'hits': {
        'hits': [],
        'total': 0,
        'max_score': None
    },
    '_shards': {
        'successful': 5,
        'failed': 0,
        'total': 5
    },
    'took': 1,
    'timed_out': False
}
What am I doing wrong? (It doesn't matter that 'a' is the first element of the list; the same happens with ['b','a','c'].) It seems to have problems only with the single 'a' character.
If you didn't set any analyzer or mapping for your index, Elasticsearch uses its default analyzer. The default analyzer has a stop-words filter that, by default, removes English stopwords such as:
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
Before going further, check the Elasticsearch mapping and analyzer guides:
Analyzer Guide
Mapping Guide
There might be some stemming or stop word lists involved. Try making sure the field is not analyzed.
'tags': {'type': 'string', 'index_name': 'tag', "index" : "not_analyzed"}
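(Side note, in case you are on a newer Elasticsearch: from version 5.x onward the string type was split into text and keyword, so the not-analyzed equivalent would be roughly 'tags': {'type': 'keyword'}.)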
Similar: matching whole string with dashes in elasticsearch
