Dumping Beam-scores to an HDF File - returnn

I have a translation model (TM) which generates its hypotheses using beam search. For analysis purposes, I would like to study all hypotheses in each beam emitted by the TM’s ChoiceLayer. I’m able to fetch the hypotheses for each input sequence from the TM’s ChoiceLayer and write them to my file system using the HDFDumpLayer:
'__SEARCH_dump_beam__': {
    'class': 'hdf_dump',
    'from': ['output'],
    'filename': "<my-path>/beams.hdf",
    'is_output_layer': True
}
But besides the hypotheses, I would also like to store the score of each hypothesis. I’m able to fetch the beam scores from the ChoiceLayer using a ChoiceGetBeamScoresLayer, but I was not able to dump the scores using an HDFDumpLayer:
'get_scores': {'class': 'choice_get_beam_scores', 'from': ['output']},
'__SEARCH_dump_scores__': {
    'class': 'hdf_dump',
    'from': ['get_scores'],
    'filename': "<my-path>/beam_scores.hdf",
    'is_output_layer': True
}
Running a config like this makes RETURNN complain that the ChoiceGetBeamScoresLayer output has no time axis:
Exception creating layer root/'__SEARCH_dump_scores__' of class HDFDumpLayer with opts:
{'filename': '<my-path>/beam_scores.hdf',
'is_output_layer': True,
'name': '__SEARCH_dump_scores__',
'network': <TFNetwork 'root' train=False search>,
'output': Data(name='__SEARCH_dump_scores___output', shape=(), time_dim_axis=None, beam=SearchBeam(name='output/output', beam_size=12, dependency=SearchBeam(name='output/prev:output', beam_size=12)), batch_shape_meta=[B]),
'sources': [<ChoiceGetBeamScoresLayer 'get_scores' out_type=Data(shape=(), time_dim_axis=None, beam=SearchBeam(name='output/output', beam_size=12, dependency=SearchBeam(name='output/prev:output', beam_size=12)), batch_shape_meta=[B])>]}
Unhandled exception <class 'AssertionError'> in thread <_MainThread(MainThread, started 139964674299648)>, proc 31228.
...
File "<...>/returnn/repository/returnn/tf/layers/basic.py", line 6226, in __init__
line: assert self.sources[0].output.have_time_axis()
locals:
self = <local> <HDFDumpLayer '__SEARCH_dump_scores__' out_type=Data(shape=(), time_dim_axis=None, beam=SearchBeam(name='output/output', beam_size=12, dependency=SearchBeam(name='output/prev:output', beam_size=12)), batch_shape_meta=[B])>
self.sources = <local> [<ChoiceGetBeamScoresLayer 'get_scores' out_type=Data(shape=(), time_dim_axis=None, beam=SearchBeam(name='output/output', beam_size=12, dependency=SearchBeam(name='output/prev:output', beam_size=12)), batch_shape_meta=[B])>]
output = <not found>
output.have_time_axis = <not found>
AssertionError
I tried to alter the shape of the score data using ExpandDimsLayer and EvalLayer in several different configurations, but those attempts all led to different errors.
I’m sure I am not the first person trying to dump beam scores. Can anybody tell me how to do that properly?

To answer the question:
The HDFDumpLayer assert you are hitting simply means that this is not implemented yet: dumping data without a time axis is not supported.
You could create a pull request and add this support to HDFDumpLayer.
If you just want to dump this information in any way, not necessarily via HDFDumpLayer, there are a couple of other options:
With task="search", as the search_output_layer, just select the layer which still includes the beam information, i.e. not after a DecideLayer.
That will simply dump all hypotheses including their beam scores.
Use a custom EvalLayer and dump it in whatever way you want.
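Not from the original answer, but as a rough sketch of the first option; the exact option names (search_output_layer, search_output_file, search_output_file_format) should be checked against your RETURNN version:
task = "search"

# Point the search output at the layer that still carries the beam
# (the ChoiceLayer "output" here), not at a DecideLayer on top of it.
search_output_layer = "output"

# With the "py" output format, each hypothesis in the n-best list is
# written out together with its beam score.
search_output_file = "<my-path>/search_output.py"
search_output_file_format = "py"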

Related

How to use Chunk Size for kedro.extras.datasets.pandas.SQLTableDataSet in the kedro pipeline?

I am using kedro.extras.datasets.pandas.SQLTableDataSet and would like to use the chunk_size argument from pandas. However, when running the pipeline, the table gets treated as a generator instead of a pd.DataFrame.
How would you use the chunk_size within the pipeline?
My catalog:
table_name:
  type: pandas.SQLTableDataSet
  credentials: redshift
  table_name: rs_table_name
  layer: output
  save_args:
    if_exists: append
    schema: schema.name
    chunk_size: 1000
Looking at the latest pandas doc, the actual kwarg to be used is chunksize, not chunk_size. Please see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html.
Since kedro only wraps your save_args and passes them to pd.DataFrame.to_sql, these need to match:
def _save(self, data: pd.DataFrame) -> None:
    try:
        data.to_sql(**self._save_args)
    except ImportError as import_error:
        raise _get_missing_module_error(import_error) from import_error
    except NoSuchModuleError as exc:
        raise _get_sql_alchemy_missing_error() from exc
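As a hedged illustration (not kedro code) of what that pass-through boils down to, the pandas keyword must be chunksize; here with an SQLite engine as a stand-in for the Redshift connection from the catalog above:
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine; in the catalog, "credentials" and "schema" map to the real
# Redshift connection and the schema save_arg.
engine = create_engine("sqlite:///example.db")

df = pd.DataFrame({"a": range(10)})
df.to_sql(
    "rs_table_name",
    con=engine,
    if_exists="append",
    chunksize=1000,  # note: "chunksize", not "chunk_size"
)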
EDIT: Once you have this working in your pipeline, the docs show that pandas.read_sql with chunksize set will return an Iterator[DataFrame]. This means that in your node function, you should iterate over the input (and annotate accordingly, if appropriate), such as:
def my_node_func(input_dfs: Iterator[pd.DataFrame], *args):
    for df in input_dfs:
        ...
This works for the latest version of pandas. I have noticed, however, that pandas is aligning the API so that read_csv with chunksize set returns a ContextManager from pandas>=1.2, so I would expect this change to occur in read_sql as well.
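For reference, a small sketch of the context-manager style that remark refers to, using read_csv on pandas>=1.2 (the file name is hypothetical):
import pandas as pd

# Iterate over a large CSV in chunks of 1000 rows; the reader is also a
# context manager from pandas 1.2 onwards.
with pd.read_csv("large.csv", chunksize=1000) as reader:
    for chunk in reader:
        print(len(chunk))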

Individually update a large amount of documents with the Python DSL Elasticsearch UpdateByQuery

I'm trying to use the UpdateByQuery to update a property of a large number of documents. But as each document will have a different value, I need to execute it one by one. I'm traversing a large number of documents, and for each document I call this function:
def update_references(self, query, script_source):
    try:
        ubq = UpdateByQuery(using=self.client, index=self.index).update_from_dict(query).script(source=script_source)
        ubq.execute()
    except Exception as err:
        return False
    return True
Some example values are:
query = {'query': {'match': {'_id': 'VpKI1msBNuDimFsyxxm4'}}}
script_source = 'ctx._source.refs = [\'python\', \'java\']'
The problem is that when I do that, I got an error: "Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.max_compilations_rate] setting".
If I change the max_compilations_rate using Kibana, it has no effect:
PUT _cluster/settings
{
  "transient": {
    "script.max_compilations_rate": "1500/1m"
  }
}
Anyway, it would be better to use a parametrized script. I tried:
def update_references(self, query, script_source, script_params):
    try:
        ubq = UpdateByQuery(using=self.client, index=self.index).update_from_dict(query).script(source=script_source, params=script_params)
        ubq.execute()
    except Exception as err:
        return False
    return True
So, this time:
script_source = 'ctx._source.refs = params.value'
script_params = {'value': ['python', 'java']}
But as I have to update the query and the parameters each time, I need to create a new instance of the UpdateByQuery for each document in the large collection, and the result is the same error.
I also tried to traverse and update the large collection with:
es.update(
    index=kwargs["index"],
    doc_type="paper",
    id=paper["_id"],
    body={"doc": {
        "refs": paper["refs"]  # e.g. ['python', 'java']
    }}
)
But I'm getting the following error: "Failed to establish a new connection: [Errno 99] Cannot assign requested address juil. 10 18:07:14 bib gunicorn[20891]: POST http://localhost:9200/papers/paper/OZKI1msBNuDimFsy0SM9/_update [status:N/A request:0.005s"
So, please, if you have any idea on how to solve this it will be really appreciated.
Best,
You can try it like this.
PUT _cluster/settings
{
  "persistent": {
    "script.max_compilations_rate": "1500/1m"
  }
}
The version update is causing these errors.
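For completeness, a small sketch of applying the same persistent setting from Python with the elasticsearch client the question already uses (exact call signatures can differ a little between elasticsearch-py versions):
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

# Persistent cluster setting, equivalent to the Kibana request above.
client.cluster.put_settings(
    body={"persistent": {"script.max_compilations_rate": "1500/1m"}}
)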

Vision API: How to get JSON-output

I'm having trouble saving the output given by the Google Vision API. I'm using Python and testing with a demo image. I get the following error:
TypeError: [mid:...] is not JSON serializable
Code that I executed:
import io
import os
import json

# Imports the Google Cloud client library
from google.cloud import vision
from google.cloud.vision import types

# Instantiates a client
vision_client = vision.ImageAnnotatorClient()

# The name of the image file to annotate
file_name = os.path.join(
    os.path.dirname(__file__),
    'demo-image.jpg')  # Your image path from current directory

# Loads the image into memory
with io.open(file_name, 'rb') as image_file:
    content = image_file.read()

image = types.Image(content=content)

# Performs label detection on the image file
response = vision_client.label_detection(image=image)
labels = response.label_annotations

print('Labels:')
for label in labels:
    print(label.description, label.score, label.mid)

with open('labels.json', 'w') as fp:
    json.dump(labels, fp)
The output appears on the screen; however, I do not know exactly how I can save it. Does anyone have any suggestions?
FYI to anyone seeing this in the future, google-cloud-vision 2.0.0 has switched to using proto-plus which uses different serialization/deserialization code. A possible error you can get if upgrading to 2.0.0 without changing the code is:
object has no attribute 'DESCRIPTOR'
Using google-cloud-vision 2.0.0 and protobuf 3.13.0, here is an example of how to serialize and deserialize (the example includes both JSON and protobuf):
import io, json
from google.cloud import vision_v1
from google.cloud.vision_v1 import AnnotateImageResponse

with io.open('000048.jpg', 'rb') as image_file:
    content = image_file.read()

image = vision_v1.Image(content=content)
client = vision_v1.ImageAnnotatorClient()
response = client.document_text_detection(image=image)

# serialize / deserialize proto (binary)
serialized_proto_plus = AnnotateImageResponse.serialize(response)
response = AnnotateImageResponse.deserialize(serialized_proto_plus)
print(response.full_text_annotation.text)

# serialize / deserialize json
response_json = AnnotateImageResponse.to_json(response)
response = json.loads(response_json)
print(response['fullTextAnnotation']['text'])
Note 1: proto-plus doesn't support converting to snake_case names, which is supported in protobuf with preserving_proto_field_name=True. So currently there is no way around the field names being converted from response['full_text_annotation'] to response['fullTextAnnotation'].
There is a feature request for this (now closed): googleapis/proto-plus-python#109
Note 2: The Google Vision API doesn't return an x coordinate if x=0. If x doesn't exist, the protobuf will default to x=0. In python vision 1.0.0, using MessageToJson(), these x values weren't included in the JSON, but now with python vision 2.0.0 and .to_json() these values are included as x:0.
Maybe you were already able to find a solution to your issue (if that is the case, I invite you to share it as an answer to your own post too), but in any case, let me share some notes that may be useful for other users with a similar issue:
As you can check using the type() function in Python, response is an object of type google.cloud.vision_v1.types.AnnotateImageResponse, while labels[i] is an object of type google.cloud.vision_v1.types.EntityAnnotation. Neither of them seems to have any out-of-the-box implementation to transform them to JSON, as you are trying to do, so I believe the easiest way to transform each of the EntityAnnotation in labels would be to turn them into Python dictionaries, then group them all into an array, and transform this into a JSON.
To do so, I have added some simple lines of code to your snippet:
[...]

label_dicts = []  # Array that will contain all the EntityAnnotation dictionaries

print('Labels:')
for label in labels:
    # Write each label (EntityAnnotation) into a dictionary
    label_dict = {'description': label.description, 'score': label.score, 'mid': label.mid}
    # Populate the array
    label_dicts.append(label_dict)

with open('labels.json', 'w') as fp:
    json.dump(label_dicts, fp)
There is a library released by Google
from google.protobuf.json_format import MessageToJson
webdetect = vision_client.web_detection(blob_source)
jsonObj = MessageToJson(webdetect)
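If the goal is a .json file on disk: MessageToJson returns a JSON string, so (as a small follow-up sketch, with an illustrative file name) it can be written out directly:
with open('web_detection.json', 'w') as f:
    f.write(jsonObj)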
I was able to save the output with the following function:
# Save output as JSON
def store_json(json_input):
    with open(json_file_name, 'a') as f:
        f.write(json_input + '\n')
And as #dsesto mentioned, I had to define a dictionary. In this dictionary I have defined what types of information I would like to save in my output.
with open(photo_file, 'rb') as image:
    image_content = base64.b64encode(image.read())
    service_request = service.images().annotate(
        body={
            'requests': [{
                'image': {
                    'content': image_content
                },
                'features': [{
                    'type': 'LABEL_DETECTION',
                    'maxResults': 20,
                }, {
                    'type': 'TEXT_DETECTION',
                    'maxResults': 20,
                }, {
                    'type': 'WEB_DETECTION',
                    'maxResults': 20,
                }]
            }]
        })
The objects in the current Vision library lack serialization functions (although adding them would be a good idea).
It is worth noting that they are about to release a substantially different library for Vision (it is on master of vision's repo now, although not released to PyPI yet) where this will be possible. Note that it is a backwards-incompatible upgrade, so there will be some (hopefully not too much) conversion effort.
That library returns plain protobuf objects, which can be serialized to JSON using:
from google.protobuf.json_format import MessageToJson
serialized = MessageToJson(original)
You can also use something like protobuf3-to-dict.
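For example, a minimal sketch with that package (assuming it is installed via pip install protobuf3-to-dict, and original is a protobuf message as above):
from protobuf_to_dict import protobuf_to_dict

# Convert the protobuf message into a plain Python dict, which json.dump can handle.
result_dict = protobuf_to_dict(original)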

Cloudwatch to Elasticsearch parse/tokenize log event before push to ES

Appreciate your help in advance.
In my scenario, CloudWatch multiline logs need to be shipped to the Elasticsearch service.
ECS --awslogs--> CloudWatch --using Lambda--> ES Domain
(basic flow, though I am very open to changing how data is shipped from CW to ES)
I was able to solve the multi-line issue using multi_line_start_pattern, BUT the main issue I am experiencing now is that my logs are in ODL format (the following format):
[yyyy-mm-ddThh:mm:ss.SSS-Z][ProductName-Version][Log Level]
[Message ID][LoggerName][Key Value Pairs][[
Message]]
AND I would like to parse and tokenize log events before storing them in ES (vs. the complete log line).
For example:
[2018-05-31T11:08:49.148-0400] [glassfish 4.1] [INFO] [] [] [tid: _ThreadID=43 _ThreadName=Thread-8] [timeMillis: 1527692929148] [levelValue: 800] [[
[] INFO : (DummyApplicationFunctionJPADAO) EntityManagerFactory located under resource lookup name [null], resource name=AuthorizationPU]]
This needs to be parsed and tokenized into the format:
timestamp 2018-05-31T11:08:49.148-0400
ProductName-Version glassfish 4.1
LogLevel INFO
MessageID
LoggerName
KeyValuePairs tid: _ThreadID=43 _ThreadName=Thread-8
Message [] INFO : (DummyApplicationFunctionJPADAO)
EntityManagerFactorylocated under resource lookup name
[null], resource name=AuthorizationPU
In the above, the key-value pairs repeat and are variable; for simplicity I can store them all as one long string.
From what I have gathered about CloudWatch, Subscription Filter Pattern regex support seems very limited, and I am really not sure how to fit the above pattern. For the Lambda function that pushes the data to ES, I have not seen AWS docs or examples that use Lambda as a means to parse and push to ES.
I would appreciate it if someone could advise on what/where the best option is to parse CW logs before they get into ES: Subscription Filter Pattern vs. the Lambda function, or any other way.
Thank you.
From what I can see, your best bet is what you're suggesting: a CloudWatch-log-triggered Lambda that reformats the logged data into your preferred ES format and then posts it into ES.
You'll need to subscribe this lambda to your CloudWatch logs. You can do this on the lambda console, or the cloudwatch console (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html).
The lambda's event payload will be: { "awslogs": { "data": "encoded-logs" } }, where encoded-logs is a Base64 encoding of gzipped JSON.
For example, the sample event (https://docs.aws.amazon.com/lambda/latest/dg/eventsources.html#eventsources-cloudwatch-logs) can be decoded in Node using:
const zlib = require('zlib');
const data = event.awslogs.data;
const gzipped = Buffer.from(data, 'base64');
const json = zlib.gunzipSync(gzipped);
const logs = JSON.parse(json);
console.log(logs);
/*
{ messageType: 'DATA_MESSAGE',
  owner: '123456789123',
  logGroup: 'testLogGroup',
  logStream: 'testLogStream',
  subscriptionFilters: [ 'testFilter' ],
  logEvents:
   [ { id: 'eventId1',
       timestamp: 1440442987000,
       message: '[ERROR] First test message' },
     { id: 'eventId2',
       timestamp: 1440442987001,
       message: '[ERROR] Second test message' } ] }
*/
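Not part of the original answer, but since the rest of this page uses Python, a rough equivalent of the same decoding step in a Python Lambda might look like this:
import base64
import gzip
import json

def lambda_handler(event, context):
    # event["awslogs"]["data"] is base64-encoded, gzipped JSON
    payload = base64.b64decode(event["awslogs"]["data"])
    logs = json.loads(gzip.decompress(payload))
    for log_event in logs["logEvents"]:
        print(log_event["message"])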
From what you've outlined, you'll want to extract the logEvents array and parse this into an array of strings. I'm happy to give some help on this too if you need it (but I'll need to know what language you're writing your lambda in; there are libraries for tokenizing ODL, so hopefully it's not too hard).
At this point you can then POST these new records directly into your AWS ES Domain. Somewhat cryptically, the S3-to-ES guide gives a good outline of how to do this in Python: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-aws-integrations.html#es-aws-integrations-s3-lambda-es
You can find a full example for a lambda that does all this (by someone else) here: https://github.com/blueimp/aws-lambda/tree/master/cloudwatch-logs-to-elastic-cloud
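Not from the original answer, but as a rough Python sketch of the tokenizing step against the ODL sample in the question (the field names are illustrative, and the key-value pairs are kept as one string, as the question suggests):
import re

# Five bracketed ODL header fields, then everything up to "[[" as the raw
# key-value pairs, then the message body between "[[" and "]]".
ODL_PATTERN = re.compile(
    r"\[(?P<timestamp>[^\]]*)\]\s*"
    r"\[(?P<product>[^\]]*)\]\s*"
    r"\[(?P<level>[^\]]*)\]\s*"
    r"\[(?P<message_id>[^\]]*)\]\s*"
    r"\[(?P<logger>[^\]]*)\]\s*"
    r"(?P<key_values>.*?)\s*"
    r"\[\[\s*(?P<message>.*?)\s*\]\]\s*$",
    re.DOTALL,
)

def tokenize_odl(raw_event):
    match = ODL_PATTERN.match(raw_event)
    return match.groupdict() if match else {"message": raw_event}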

With Zabbix API, how do I get the values of items/resources rather than just the ID's?

I have some data in a Custom Screen in Zabbix, and would like to pull the data from the screen via the API. I'm using this Ruby gem: https://github.com/express42/zabbixapi
I'm able to successfully connect and query, but the results I'm getting are not very useful:
p zbx.query(
  :method => "item.get",
  :params => {
    :itemids => "66666",
    :output => "extend"
  }
)
# [{"itemid"=>"66666", "type"=>"0", "snmp_community"=>"", "snmp_oid"=>"", "hostid"=>"77777", "name"=>"Fro Packages", "key_"=>"system.sw.packages[davekey1|davekey2|davekey3|davekey4]", "delay"=>"300", "history"=>"90", "trends"=>"365", "status"=>"0", "value_type"=>"1", "trapper_hosts"=>"", "units"=>"", "multiplier"=>"0", "delta"=>"0", "snmpv3_securityname"=>"", "snmpv3_securitylevel"=>"0", "snmpv3_authpassphrase"=>"", "snmpv3_privpassphrase"=>"", "formula"=>"1", "error"=>"", "lastlogsize"=>"0", "logtimefmt"=>"", "templateid"=>"88888", "valuemapid"=>"0", "delay_flex"=>"", "params"=>"", "ipmi_sensor"=>"", "data_type"=>"0", "authtype"=>"0", "username"=>"", "password"=>"", "publickey"=>"", "privatekey"=>"", "mtime"=>"0", "flags"=>"0", "filter"=>"", "interfaceid"=>"25", "port"=>"", "description"=>"", "inventory_link"=>"0", "lifetime"=>"30", "snmpv3_authprotocol"=>"0", "snmpv3_privprotocol"=>"0", "state"=>"0", "snmpv3_contextname"=>""}]
You can see that it's returning a bunch of ID's for the items, including the correct keys, but I can't seem to get the actual plain text values, which is the data I'm interested in.
I started with the screen_id, then got the screenitem_id, now the item_id, but I don't seem to be getting any closer to what I want!
Thanks for any help
Getting items or getting hosts means getting their description, not the data; you are after history (the history.get API method). Reading the actual Zabbix user manual and API docs is highly recommended.
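For illustration only (not from the original answer): history is a separate API method, so the same request expressed as a raw JSON-RPC call looks roughly like the sketch below; with the Ruby gem above, the equivalent is a zbx.query with :method => "history.get". The endpoint URL and auth token are placeholders, and the history type must match the item's value_type (the item.get output above shows "value_type"=>"1", i.e. character):
import requests

url = "http://your-zabbix-server/api_jsonrpc.php"  # placeholder endpoint
payload = {
    "jsonrpc": "2.0",
    "method": "history.get",
    "params": {
        "itemids": "66666",
        "history": 1,          # 1 = character, matching the item's value_type
        "sortfield": "clock",
        "sortorder": "DESC",
        "limit": 10,
    },
    "auth": "your-auth-token",  # placeholder session token
    "id": 1,
}
print(requests.post(url, json=payload).json())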
