BSON to MessagePack - bson

The problem I am facing is that BSON comes with ObjectId and Timestamp types, which are not supported by MessagePack, and it isn't possible to define a custom serializer for MessagePack (at least as far as I know).
I wrote a piece of Python code to compare pymongo's BSON against msgpack. With not much optimization I could achieve a 300% performance improvement.
So, is there any way to convert BSON to MessagePack?

Here is how I solved the problem.
Unfortunately, since MongoDB's non-REST API doesn't offer a Strict or JS mode for document retrieval (as opposed to its REST API, where you can specify the format in which you want to retrieve a document), we are left with no option but to do the conversion manually.
import json
from bson import json_util
import msgpack
from pymongo import Connection  # MongoClient in newer pymongo versions

con = Connection()
db = con.test
col = db.collection
d = col.find().limit(1)[0]
s = json.dumps(d, default=json_util.default)  # s is now in a JSON-compatible format (ObjectId => {'$oid': ...})
packer = msgpack.Packer()
packer.pack(s)  # msgpack can pack it successfully since the format is JSON compatible
The nice observation is that even with the extra json.dumps step, the MessagePack serializer is still faster than BSON encode, though no longer three times faster; for 10,000 repetitions the difference is about three tenths of a second.

Related

What is the point of google.protobuf.StringValue?

I've recently encountered all sorts of wrapper types in Google's protobuf package. I'm struggling to imagine the use case. Can anyone shed some light on what problem these were intended to solve?
Here's one of the documentation links: https://developers.google.com/protocol-buffers/docs/reference/csharp/class/google/protobuf/well-known-types/string-value (it says nothing about what this can be used for).
One behavioral difference between this and the plain string type is that this field will be written less efficiently (a couple of extra bytes, plus a redundant memory allocation). For the other wrappers the story is even worse, since the repeated variants of those fields will be written inefficiently (Google's official Protobuf serializer doesn't support packed encoding for non-numeric types).
Neither seems to be desirable. So, what's this all about?
There are a few reasons, mostly to do with where these are used - see struct.proto.
StringValue can be null, string often can't be in a language interfacing with protobufs. e.g. in Go strings are always set; the "zero value" for a string is "", the empty string, so it's impossible to distinguish between "this value is intentionally set to empty string" and "there was no value present". StringValue can be null and so solves this problem. It's especially important when they're used in a StructValue, which may represent arbitrary JSON: to do so it needs to distinguish between a JSON key which was set to empty string (StringValue with an empty string) or a JSON key which wasn't set at all (null StringValue).
Also, if you look at struct.proto, you'll see that these aren't fully fledged message types in the proto - they're all generated from message Value, which has a oneof kind { number_value, string_value, bool_value, ... }. By using a oneof, struct.proto can represent a variety of different values in one field. Again this makes sense considering what struct.proto is designed to handle - arbitrary JSON - where you don't know what type of value a given JSON key has ahead of time.
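To make the null-vs-empty distinction concrete, here is a small sketch in Ruby. Person is a hypothetical generated message with a plain string name field and a google.protobuf.StringValue nickname field; only the wrapper class itself comes from the google-protobuf gem:
require 'google/protobuf/wrappers_pb'

person = Person.new                  # Person is hypothetical, see above
person.name       # => ""  - a plain string field can't express "not set"
person.nickname   # => nil - a wrapper field defaults to nil, so "not set" is visible
person.nickname = Google::Protobuf::StringValue.new(value: '')
person.nickname.value                # => "" - set, but deliberately empty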
In addition to George's answer, you can't use a Protobuf primitive as the parameter or return value of a gRPC procedure.

Protobuf parse text

I have the following protobuf text and I am using google-protobuf to parse it but I'm not sure how to do it.
# HELP Type about service.
# TYPE gauge
metadata_server1{namespace="default",service="nginx"} 1
metadata_server2{namespace="default",service="operator"} 1
metadata_server3{namespace="default",service="someservice"} 1
...
Whenever I try to decode it, I get this error:
/usr/lib/ruby/gems/2.3.0/gems/protobuf-3.8.3/lib/protobuf/decoder.rb:21:in `decode_each_field'
This is how I am trying to decode it:
class Metrics < ::Protobuf::Message
  required :string, :namespace, 1
  required :string, :value, 2
  required :string, :map, 3
end

class Message < ::Protobuf::Message
  repeated Metrics, :metrics, 1
end

data = get_data('http://localhost:8080/')
parsed_data = Metrics.decode(data)
puts parsed_data.metrics # does not work
Does anyone know how I can parse this?
Your data is not a Protobuf. Protobuf is a binary format, not text, so it would not be human-readable like the data you are seeing. Technically, Protobuf has an alternative text representation used for debugging, but your data is not that format either.
Instead, your data appears to be Prometheus text format, which is not a Protobuf format. To parse this, you will need a Prometheus text parser. Usually, only Prometheus itself consumes this format, so not a lot of libraries for parsing it are available (whereas there are lots of libraries for creating it). The format is pretty simple, though, and you could probably parse it with a suitable regex.
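For the simple gauge lines shown in the question, something along these lines would be a starting point (a rough sketch, not a complete parser for the exposition format):
# Extracts metric name, labels and value; skips the # HELP / # TYPE comments.
LINE = /\A(?<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?<labels>[^}]*)\})?\s+(?<value>\S+)/

def parse_prometheus(text)
  text.each_line.map do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#')
    m = LINE.match(line)
    next unless m
    labels = (m[:labels] || '').scan(/(\w+)="([^"]*)"/).to_h
    { name: m[:name], labels: labels, value: m[:value].to_f }
  end.compact
end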
Some servers which export Prometheus metrics also support exporting it in an alternative Protobuf-based format. If your server supports that, you can request it by sending the header:
Accept: application/vnd.google.protobuf; proto=io.prometheus.client.MetricFamily; encoding=delimited
If you send that in the request, you might get a Protobuf-based format back, if the server supports it. Note that the Protobuf format is deprecated and removed in Prometheus 2, so fewer servers are likely to support it these days.
If your server does support this format, note that the result is still not a plain Protobuf. Rather, it is a collection of Protobufs in "delimited" format. Each Protobuf is prefixed by a varint-encoded length ("varint" is Protobuf's variable-width integer encoding). In C++ or Java, there are "parseDelimitedFrom" functions you can use to parse this format, but it looks like Ruby does not have built-in support currently.
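If you do end up with the delimited format in Ruby, the length prefix is easy enough to decode by hand. A rough sketch follows; MetricFamily stands in for whatever generated message class you use, so treat that name as an assumption:
# Each message is preceded by a varint (base-128, 7 bits per byte, least
# significant group first) giving its length in bytes.
def read_varint(io)
  result = 0
  shift = 0
  loop do
    byte = io.readbyte
    result |= (byte & 0x7f) << shift
    return result if (byte & 0x80).zero?
    shift += 7
  end
end

def each_delimited(io)
  until io.eof?
    len = read_varint(io)
    yield MetricFamily.decode(io.read(len))  # MetricFamily is a placeholder
  end
end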

How to read a large file into a string

I'm trying to save and load the states of Matrices (using Matrix) during the execution of my program with the functions dump and load from Marshal. I can serialize the matrix and get a ~275 KB file, but when I try to load it back as a string to deserialize it into an object, Ruby gives me only the beginning of it.
# when I want to save
mat_dump = Marshal.dump(@mat) # serialize object - OK
File.open('mat_save', 'w') { |f| f.write(mat_dump) } # write String to file - OK
# somewhere else in the code
mat_dump = File.read('mat_save') # read String from file - only reads about 5%
@mat = Marshal.load(mat_dump) # deserialize object - "ArgumentError: marshal data too short"
I tried to change the arguments for load but didn't find anything yet that doesn't cause an error.
How can I load the entire file into memory? If I could read the file chunk by chunk, then loop to store it in the String and then deserialize, that would work too. The file is basically one big line, so I can't even say I'll read it line by line; the problem stays the same.
I saw some questions about the topic:
"Ruby serialize array and deserialize back"
"What's a reasonable way to read an entire text file as a single string?"
"How to read whole file in Ruby?"
but none of them seem to have the answers I'm looking for.
Marshal is a binary format, so you need to read and write in binary mode. The easiest way is to use IO.binread/write.
...
IO.binwrite('mat_save', mat_dump)
...
mat_dump = IO.binread('mat_save')
@mat = Marshal.load(mat_dump)
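If you prefer the chunk-by-chunk approach mentioned in the question, that works too, as long as the file is opened in binary mode (the chunk size below is arbitrary):
mat_dump = ''
File.open('mat_save', 'rb') do |f|   # 'rb' = read in binary mode
  while (chunk = f.read(64 * 1024))  # 64 KB at a time
    mat_dump << chunk
  end
end
@mat = Marshal.load(mat_dump)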
Remember that Marshaling is Ruby version dependent. It's only compatible under specific circumstances with other Ruby versions. So keep that in mind:
In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number.

ARFF parsing in ELKI 0.6.5 or 0.6.0

I would like to use the newest version of ELKI, but I get errors leading to NullPointerExceptions and the task fails. When using 0.6.0 it works fine.
Here is some toy ARFF data:
@ATTRIBUTE 'var_0032' real
@ATTRIBUTE 'id' real
@ATTRIBUTE 'outlier' {'no','yes'}
@DATA
0.185185185185,1.0,'no'
0.0740740740741,2.0,'no'
But in 0.6.5 I get this failure:
Invalid quoted line in input: no closing quote found in: @ATTRIBUTE 'outlier' {'no','yes'}
Task failed
java.lang.NullPointerException
at de.lmu.ifi.dbs.elki.visualization.VisualizerContext.processNewResult(VisualizerContext.java:300)
at de.lmu.ifi.dbs.elki.visualization.VisualizerContext.<init>(VisualizerContext.java:141)
at de.lmu.ifi.dbs.elki.visualization.VisualizerParameterizer.newContext(VisualizerParameterizer.java:193)
at de.lmu.ifi.dbs.elki.visualization.gui.ResultVisualizer.processNewResult(ResultVisualizer.java:116)
at de.lmu.ifi.dbs.elki.workflow.OutputStep.runResultHandlers(OutputStep.java:70)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:120)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:60)
at [...]
In 0.6.0 this just seems to be a warning:
Invalid quoted line in input: no closing quote found in: @ATTRIBUTE 'outlier' {'no','yes'}
and it still produces the ROC curve.
Should I be worried?
Should I change my arff file, and how?
The ARFF file format (https://weka.wikispaces.com/ARFF+%28developer+version%29) doesn't use quotes there.
@RELATION example
@ATTRIBUTE var_0032 NUMERIC
@ATTRIBUTE id NUMERIC
@ATTRIBUTE outlier {no,yes}
@DATA
0.185185185185,1.0,no
0.0740740740741,2.0,no
Also, if your id column is really an id, don't give it the real datatype (which is only an alias for numeric). It's not a numerical column, and if you aren't careful it may be misused in the analysis.
So maybe better use something like this:
@RELATION example
@ATTRIBUTE var_0032 NUMERIC
@ATTRIBUTE id STRING
@ATTRIBUTE class {no,yes}
@DATA
0.185185185185,'1',no
0.0740740740741,'2',no
to get a proper ARFF file. I haven't tested it - does this work better?
First of all, definitely use 0.6.5. ELKI is not at a 1.0 release yet, and there are bugs. They will not be fixed in old versions, only in the new version, because we still need to be able to make larger API changes. Essentially, there should be no reason to use anything but the latest version. ELKI 0.7 will appear at the end of August at VLDB 2015.
ARFF is not used a lot. There may be errors in the parser, and ARFF support for categorical data is very limited right now. The strength of the ARFF format is when you have lots of categorical attributes, but that is mostly used in classification - and ELKI doesn't include many classification algorithms yet (since Weka is already a strong tool for that, we focus on algorithms that are not available or not good in Weka).
Batik errors like this are usually due to NaN or infinite values. There are still some errors in the visualization code because SVG doesn't give good type safety, unfortunately. You can easily build SVG documents that are invalid, or that contain invalid characters such as ∞ in some coordinate, and then the Batik renderer will fail with such an error message.
What exactly are you trying to do? It looks a bit as if you are trying to compute the ROC curve for the existing output of an algorithm. I don't think there is an easy way to read an ARFF file containing (score, id, label) rows and compute a ROC curve using the MiniGUI. It's not hard to do in Java code, but it's not a use case of the KDD process workflow of the UI.

ruby parse HTTParty array - json

Still new to Ruby - I apologize in advance if this has been asked.
I am using HTTParty to get data from an API, and it is returning an array of JSON data that I can't quite figure out how to parse.
#<Net::HTTPOK:0x1017fb8c0>
{"ERRORARRAY":[],"DATA":[{"ALERT":1,"LABEL":"hello","WATCHDOG":1},{"LABEL":"goodbye","WATCHDOG":1}
I guess the first issue is that I don't really know what I am looking at. When I do response.class I get HTTParty::Response. It appears to be a Hash inside an Array? I am not sure. Anyway, I want a way to just grab the "LABEL" from every element, so the result would be "hello", "goodbye". How would I go about doing that?
You don't need to parse it per se. What you could do is replace ':' with '=>' and evaluate it.
Example: say s is set to the string ["one":"a","two":"b"]; then eval s.gsub(/^\[/, '{').gsub(/\]$/, '}').gsub('":', '"=>') will yield a Ruby hash (with inspect showing {"one"=>"a", "two"=>"b"}).
Alternatively, you could do something like this:
require 'json'
string_to_parse = "{\"one\":\"a\",\"two\":\"b\"}"
parsed_and_a_hash = JSON.parse(string_to_parse)
parsed_and_a_hash is a hash!
If that's JSON, then your best bet is to install a library that handles the JSON format. There's really no point in reinventing the wheel (although it is fun). Have a look at this article.
If you know that the JSON data will always, always be in exactly the same format, then you might manage something relatively simple without a JSON gem. But I'm not sure that it's worth the hassle.
If you're struggling with the json gem, consider using the Crack gem. It has the added benefit of also parsing XML.
require 'crack'
my_hash_array = Crack::JSON.parse(my_json_string)
my_hash_array = Crack::XML.parse(my_xml_string)
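Whichever parser you use, once the response behaves like a Hash (HTTParty will usually have parsed the JSON for you already, which is why you can index into response directly), grabbing every LABEL is a one-liner:
labels = response["DATA"].map { |item| item["LABEL"] }
# => ["hello", "goodbye"]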
