How to encode protocol buffer string to binary using protoc

I've been trying to encode strings using the protoc CLI utility.
I noticed that the output still contains plain text.
What am I doing wrong?
osboxes@osboxes:~/proto/bin$ cat ./teststring.proto
syntax = "proto2";
message Test2 {
  optional string b = 2;
}
echo b:\"my_testing_string\"|./protoc --encode Test2 teststring.proto>result.out
result.out contains:
^R^Qmy_testing_string
protoc versions: libprotoc 3.6.0 and libprotoc 2.5.0.

Just to formalize in an answer:
The command as written should be fine; the output is protobuf binary. It just resembles text because protobuf uses UTF-8 to encode strings, and your content is dominated by a single string. Despite this, the file isn't actually text, and you should usually use a hex viewer or similar if you need to inspect it.
If you want to understand the internals of a file, https://protogen.marcgravell.com/decode is a good resource - it rips an input file or hex string following the protocol rules, and tells you what each byte means (field headers, length prefixes, payloads, etc).
I'm guessing your file is actually:
(hex) 12 11 6D 79 5F etc
i.e. 0x12 = "field 2, length prefixed", 0x11 = 17 (the payload length, encoded as a varint), then "my_testing_string" encoded as 17 bytes of UTF-8.
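To see the same structure programmatically, here is a minimal Ruby sketch (mine, not from the original answer; it assumes the result.out produced above, where the length fits in a single-byte varint):
require 'base64' # not needed here, just stdlib; no gems required

# Pull apart the first field of the encoded message by hand.
bytes = File.binread('result.out').bytes

tag          = bytes[0]        # 0x12
field_number = tag >> 3        # => 2
wire_type    = tag & 0x07      # => 2 (length-prefixed)
length       = bytes[1]        # => 17 (single-byte varint)
payload      = bytes[2, length].pack('C*').force_encoding('UTF-8')

puts "field #{field_number}, wire type #{wire_type}, #{length} bytes: #{payload}"
# => field 2, wire type 2, 17 bytes: my_testing_string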

protoc --proto_path=${protobuf_path} --encode=${protobuf_message} ${protobuf_file} < ${source_file} > ${output_file}
and in this case:
protoc --proto_path=~/proto/bin --encode="Test2" ~/proto/bin/teststring.proto < source.txt > ./output.bin
or:
echo b:\"my_testing_string\" | protoc --proto_path=~/proto/bin --encode="Test2" ~/proto/bin/teststring.proto > ./output.bin
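As a sanity check, the binary can be turned back into text form with the inverse option: protoc --decode=Test2 teststring.proto < output.bin should print b: "my_testing_string" again.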

Serialize and deserialize protobufs through CLI?

I am trying to deserialize a file saved as a protobuf through the CLI (seems like the easiest thing to do). I would prefer not to use protoc to compile, import it into a programming language and then read the result.
My use case: A TensorFlow lite tool has output some data in a protobuf format. I've found the protobuf message definition in the TensorFlow repo too. I just want to read the output quickly. Specifically, I am getting back a tflite::evaluation::EvaluationStageMetrics message from the inference_diff tool.
I assume that the tool outputs a protobuf message in binary format.
protoc can decode the message and output in text format. See this option:
--decode=MESSAGE_TYPE       Read a binary message of the given type from
                            standard input and write it in text format
                            to standard output. The message type must
                            be defined in PROTO_FILES or their imports.
While Timo Stamm's answer was instrumental, I still struggled with the paths needed to get protoc to work in a complex repo (e.g. TensorFlow).
In the end, this worked for me:
cat inference_diff.txt | \
protoc --proto_path="/Users/ben/butter/repos/tensorflow/" \
--decode tflite.evaluation.EvaluationStageMetrics \
$(pwd)/evaluation_config.proto
Here I pipe in the binary contents of the file containing the protobuf (inference_diff.txt in my case, generated by following this guide), and specify the fully qualified protobuf message (which I got by combining the package declaration, package tflite.evaluation;, with the message name, EvaluationStageMetrics), the absolute path of the project for the proto_path (the TensorFlow repo root), and the absolute path of the .proto file that actually contains the message definition. proto_path is just used for resolving imports, whereas the PROTO_FILE (in this case, evaluation_config.proto) is used to decode the file.
Example Output
num_runs: 50
process_metrics {
  inference_profiler_metrics {
    reference_latency {
      last_us: 455818
      max_us: 577312
      min_us: 453121
      sum_us: 72573828
      avg_us: 483825.52
      std_deviation_us: 37940
    }
    test_latency {
      last_us: 59503
      max_us: 66746
      min_us: 57828
      sum_us: 8992747
      avg_us: 59951.646666666667
      std_deviation_us: 1284
    }
    output_errors {
      max_value: 122.371696
      min_value: 83.0335922
      avg_value: 100.17548828125
      std_deviation: 8.16124535
    }
  }
}
If you just want to get the numbers in a rush and can't be bothered to fix the paths, you can do
cat inference_diff.txt | protoc --decode_raw
Example output
1: 50
2 {
  5 {
    1 {
      1: 455818
      2: 577312
      3: 453121
      4: 72573828
      5: 0x411d87c6147ae148
      6: 37940
    }
    2 {
      1: 59503
      2: 66746
      3: 57828
      4: 8992747
      5: 0x40ed45f4b17e4b18
      6: 1284
    }
    3 {
      1: 0x42f4be4f
      2: 0x42a61133
      3: 0x40590b3b33333333
      4: 0x41029476
    }
  }
}
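Reading the raw output side by side with the schema-aware one shows what is lost without the .proto file: protoc prints field numbers instead of names, and since it cannot know that fields like avg_us are doubles, it renders the fixed 64-bit and 32-bit values as raw hex (0x411d87c6147ae148 is just the bit pattern of the double 483825.52).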

Why does YAML interpret '0777' as 511?

In my YAML file I have:
foo:
  - '0777'
When I load the file in my code (result = YAML.load_file(...)) I get
result[:foo] = [511]
This happens on Ubuntu. On Mac it is correct (["0777"]). When changed to:
foo:
- "'0777'"
it works on Ubuntu, but the string includes the quotes: '0777'.
Why?
In Ruby, for Integer, if the argument is a string that happens to start with 0x, 0b, or 0, it is interpreted as a hex, binary, or octal string respectively.
Therefore 0777 here is being treated as an octal string. Since 0777 octal = 511 decimal, you are getting 511 as the result.
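A quick irb check (my addition, not the answerer's) illustrates the prefix rules being described:
# Kernel#Integer applies the same prefix rules as Ruby integer literals:
Integer("0777")   # => 511  (leading 0 means octal)
Integer("0x1f")   # => 31   (hex)
Integer("0b101")  # => 5    (binary)
0777              # => 511  (same rule for a literal)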

Using ruby SAX parsers for GB2312 encoded xml

Good day,
I have a lot of big xml files that I need to parse, but the problem is they have 'gb2312' encoding. I would normally use a SAX parser for this.
So here is in example of xml:
<?xml version="1.0" encoding="gb2312"?>
<Root>
<ValueList Count="112290" FieldCount="11">
<Item1 Value1="23743" Value2="Дипломатия � Пустой кувшин" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item2 Value1="6611" Value2="ДЛ � 018 омела � золотой кинжал" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item3 Value1="6608" Value2="Наука (ДЛ)�круг фей 021�тяпка" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item4 Value1="6612" Value2="Знаки ДЛ � 003руны � разрушение" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
....
</Root>
I'm trying to use the Nokogiri SAX parser (I also tried libxml-ruby with the same result):
require 'nokogiri'

class SchemaParser < Nokogiri::XML::SAX::Document
  def initialize
    @cnt = 0
  end

  def start_element(name, attrs = [])
    if name == "Item1"
      @cnt += 1
      puts @cnt
    end
  end
end

parser = Nokogiri::XML::SAX::Parser.new(SchemaParser.new)
parser.parse_io(File.open('2_4_EQUIPMENT_ESSENCE.xml'), 'gb2312')
But this gives the error "`check_encoding': 'GB2312' is not a valid encoding (ArgumentError)". If I remove the encoding declaration and let Nokogiri detect the encoding itself, I get this error:
encoding error : input conversion failed due to input error, bytes 0xA8 0x43 0x20 0xA7
encoding error : input conversion failed due to input error, bytes 0xA8 0x43 0x20 0xA7
I/O error : encoder error
I also tried to open the File with the proper encoding, but that didn't help the SAX parser:
[3] pry(main)> f = File.open('2_4_EQUIPMENT_ESSENCE.xml', "r:gb2312")
=> #<File:2_4_EQUIPMENT_ESSENCE.xml>
[4] pry(main)> f.external_encoding.name
=> "GB2312"
Has anyone used 'gb2312' encoding with SAX parsers in Ruby? Any recommendations on how to proceed?
It seems the issue is that Libxml2 does not support the GB2312 encoding (see here for a list of supported encodings).
I'm not sure if you have tried this, but I think you can work around this by removing the encoding declaration from the XML files (so Libxml2 does not try to transcode the data) and set the external encoding of the File object to GB2312, because then Ruby will transcode the file to UTF-8 as it is read, and from then on everything will remain as UTF-8.
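A minimal sketch of that idea (my own, untested against the original files; it assumes the data is valid GB18030, a superset of GB2312, and reuses the SchemaParser class from the question):
require 'nokogiri'

# Read with an external encoding and transcode to UTF-8 on the fly,
# then fix the stale declaration so libxml2 never sees "gb2312".
xml = File.read('2_4_EQUIPMENT_ESSENCE.xml', encoding: 'GB18030:UTF-8')
xml.sub!(/encoding="gb2312"/i, 'encoding="UTF-8"')

parser = Nokogiri::XML::SAX::Parser.new(SchemaParser.new)
parser.parse(xml)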
So, here is my workaround.
Problems:
Some of the characters present in the xml are not valid 'gb2312'; I found that 'GB18030' is a better choice, since it covers the full Chinese character set.
I converted all the xml files to UTF-8 so I could use the SAX parser.
I ended up with this rake task:
desc "convert chinese xml files to utf-8"
task :convert do
rm_rf 'data/utf8'
mkdir 'data/utf8'
Dir.foreach('data') {|f|
if f.end_with?('.xml')
puts "converted:: data/utf8/#{f}" if system("iconv -f GB18030 -t UTF-8 data/#{f} > data/utf8/#{f}")
end
}
#replace encodings for xml files
system("bundle exec ruby -pi -e \"gsub(/gb2312/, 'UTF-8')\" data/utf8/*.xml")
end
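This works because GB18030 is a superset of GB2312 (via GBK), so decoding files declared as gb2312 with GB18030 is safe and also covers the stray characters that fall outside GB2312.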

Cannot read unicode .csv into R

I have a .csv file, which contains the following data:
"Ա","Բ"
1,10
2,20
I cannot read it into R so that the column names are displayed like they are in the file.
d <- read.csv("./Data/1.csv", fileEncoding="UTF-8")
head(d)
Produces the following:
> d <- read.csv("./Data/1.csv", fileEncoding="UTF-8")
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection './Data/1.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on './Data/1.csv'
> head(d)
[1] X.
<0 rows> (or 0-length row.names)
Meanwhile, doing the same without specifying the fileEncoding produces this:
> d <- read.csv("./Data/1.csv")
> head(d)
Ô. Ô²
1 1 10
2 2 20
When I run the "file" utility to find out the encoding of the file, it says it is UTF-8:
Data\1.csv: UTF-8 Unicode text, with CRLF line terminators
I am using RStudio, Windows 7, R version 2.15.2, 32-bit.
Thanks in advance.
I wrote a longer answer on the same issue here: R on Windows: character encoding hell.
Quick answer: using the parameter encoding instead of fileEncoding should fix your first issue. You may not be able to view it properly in the console or in RStudio's table view, but you will be able to use it in formulas.
d <- read.csv("./Data/1.csv", encoding="UTF-8")
head(d)
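The difference, briefly: fileEncoding makes R re-encode the whole file into your native locale as it reads it (which fails on a Windows locale that cannot represent Armenian characters), whereas encoding just marks the strings it reads as UTF-8 and leaves the bytes alone.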
Having saved your table into a UTF-8 file:
> test2 <- read.csv("test2.csv", header = FALSE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", encoding = "UTF-8")
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'test2.csv'
This is how it looks in the console and in the RStudio view:
> test2
V1 V2
1 <U+0531> <U+0532>
2 1 10
3 2 20
Importantly, however, you are able to manipulate this within R. Thus, in my case, it is possible to see that the script window input Ա has UTF-8 encoding, and a grep correctly finds this encoding in your table.
> Encoding("Ա")
[1] "UTF-8"
> grep("Ա", as.character(test2[1,1]))
[1] 1
You may need to find suitable encoding variants that work on your settings, or possibly change them. Unfortunately I am not sure where that is done.
You might not be able to make it pretty in all stages, but it is definitely possible to get it to work also in Windows 7 environment.
I tried two ways to replicate your problem.
I copied the characters above into RStudio and saved them to a CSV with this code:
write.csv(c("Ա","Բ",
            1,10,
            2,20), "test.csv")
df <- read.csv("test.csv")
This worked fine.
Then I thought, well, maybe R is cheating when I save it to CSV with R? So I just pasted the characters into a text file and saved it as a CSV. This approach didn't have problems either.
Here's my session info:
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8
[4] LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] party_1.0-9 modeltools_0.2-21 strucchange_1.4-7 sandwich_2.2-10 zoo_1.7-10
[6] GGally_0.4.4 reshape_0.8.4 plyr_1.8 ggplot2_0.9.3.1
loaded via a namespace (and not attached):
[1] coin_1.0-23 colorspace_1.2-2 dichromat_2.0-0 digest_0.6.3
[5] gtable_0.1.2 labeling_0.2 lattice_0.20-23 MASS_7.3-29
[9] munsell_0.4.2 mvtnorm_0.9-9995 proto_0.3-10 RColorBrewer_1.0-5
[13] reshape2_1.2.2 scales_0.2.3 splines_3.0.1 stringr_0.6.2
I had the same problem and found out that the file was corrupted.
I opened the file with OpenOffice and saved it back using the "UTF8" character set (you need to click the edit filter settings box), then imported it with read.csv() (no encoding or fileEncoding option) and it worked fine.

Base64.decode64 in ruby returning strange results

I'm having problems decoding a string using Base64.decode64 in Ruby. As a test, I'm using this site, which decodes strings in PHP: https://rnd.feide.no/simplesaml/module.php/saml2debug/debug.php.
As a test, I'm using this string:
fZJNT%2BMwEIbvSPwHy%2Fd8tMvHympSdUGISuwS0cCBm%2BtMUwfbk%2FU4zfLvSVMq2Euv45n3fd7xzOb%2FrGE78KTRZXwSp5yBU1hpV2f8ubyLfvJ5fn42I2lNKxZd2Lon%2BNsBBTZMOhLjQ8Y77wRK0iSctEAiKLFa%2FH4Q0zgVrceACg1ny9uMy7rCdaM2%2Bs0BWrtppK2UAdeoVjW2ruq1bevGImcvR6zpHmtJ1MHSUZAuDKU0vY7Si2h6VU5%2BiMuJuLx65az4dPql3SHBKaz1oYnEfVkWUfG4KkeBna7A%2Fxm6M14j1gZihZazBRH4MODcoKPOgl%2BB32kFz08PGd%2BG0JJIkr7v46%2BhRCaEpod17DCRivYZCkmkd4N28B3wfNyrGKP5bws9DS6PKDz%2FMpsl36Tyz%2F%2Fax1jeFmi0emcLY7C%2F8SDD0Z7dobcynHbbV3QVbcZW0TlqQemNhoqzJD%2B4%2Fn8Yw7l8AA%3D%3D
The output should be:
<?xml version="1.0" encoding="UTF-8"?>
<samlp:AuthnRequest xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" ID="agdobjcfikneommfjamdclenjcpcjmgdgbmpgjmo" Version="2.0" IssueInstant="2007-04-26T13:51:56Z" ProtocolBinding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST" ProviderName="google.com" AssertionConsumerServiceURL="https://www.google.com/a/solweb.no/acs" IsPassive="true"><saml:Issuer xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">google.com</saml:Issuer><samlp:NameIDPolicy AllowCreate="true" Format="urn:oasis:names:tc:SAML:2.0:nameid-format:unspecified" /></samlp:AuthnRequest>
But in Ruby I keep getting a strange output:
}\222MO?0\206?H?\a??|\264???jRuA\210J????\233?LS\aۓ?8???IS*?K\257??}????\377\254a;??e|\022\247\234\201SXiWg????~?y~~6#iM+\026]غ'??6L:\022?C?;?J?$\234\264#\"(\261Z?~?8\255ǀ\n\rg??\214˺?u\2436??Z?i\244\255\224\001רV5\266\256?m??"g/G\254?kI???Q\220.\f\2454\275\216ҋhzUN~\210ˉ\270\274z??t???!?)\254????}Y\026Q?*G\201\235\256??\031\2723^#?b\205\226\263\005\021?0?ܠ\243΂_\201?i\005?O\017\031߆ВH\222\276??\241D&\204\246\207u?0?\212?
I\244w\203v??|ܫ\030\243?o
.\217(<\3772\233%ߤ?????X?h\264zg\vc\260\277? ??\236ݡ\2672\234v?Wt\025m?V?9jA鍆\212\263$?\270\376\177\030ù|\000
The code I use is:
require 'cgi'
require 'base64'
Base64::decode64(CGI::unescape('fZJNT%2BMwEIbvSPwHy%2Fd8tMvHympSdUGISuwS0cCBm%2BtMUwfbk%2FU4zfLvSVMq2Euv45n3fd7xzOb%2FrGE78KTRZXwSp5yBU1hpV2f8ubyLfvJ5fn42I2lNKxZd2Lon%2BNsBBTZMOhLjQ8Y77wRK0iSctEAiKLFa%2FH4Q0zgVrceACg1ny9uMy7rCdaM2%2Bs0BWrtppK2UAdeoVjW2ruq1bevGImcvR6zpHmtJ1MHSUZAuDKU0vY7Si2h6VU5%2BiMuJuLx65az4dPql3SHBKaz1oYnEfVkWUfG4KkeBna7A%2Fxm6M14j1gZihZazBRH4MODcoKPOgl%2BB32kFz08PGd%2BG0JJIkr7v46%2BhRCaEpod17DCRivYZCkmkd4N28B3wfNyrGKP5bws9DS6PKDz%2FMpsl36Tyz%2F%2Fax1jeFmi0emcLY7C%2F8SDD0Z7dobcynHbbV3QVbcZW0TlqQemNhoqzJD%2B4%2Fn8Yw7l8AA%3D%3D'))
What could possibly be wrong? Thanks in advance.
I have no idea where you got the idea that that string of yours is a base64-encoded version of your XML. If you pass the first bit of it (<?x) through Base64.encode64() then CGI.escape(), you get:
PD94
at the start, which is nothing like your string. In fact, your first four characters "fZJN" are values 31, 25, 9 and 13 in base64, so they will give you:
011111 011001 001001 001101
then, grouping them in octets instead of sextets (I guess that's the right word):
01111101 10010010 01001101
7D 92 4D
which are not the characters you're expecting to see.
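You can confirm that arithmetic in irb (my addition, not part of the original answer):
require 'base64'

# The first four base64 characters decode to exactly the three octets above.
Base64.decode64('fZJN').unpack('H*')  # => ["7d924d"]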
Putting the whole string in gives you:
PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4gPHNhbWxw
OkF1dGhuUmVxdWVzdCB4bWxuczpzYW1scD0idXJuOm9hc2lzOm5hbWVzOnRj
OlNBTUw6Mi4wOnByb3RvY29sIiBJRD0iYWdkb2JqY2Zpa25lb21tZmphbWRj
bGVuamNwY2ptZ2RnYm1wZ2ptbyIgVmVyc2lvbj0iMi4wIiBJc3N1ZUluc3Rh
bnQ9IjIwMDctMDQtMjZUMTM6NTE6NTZaIiBQcm90b2NvbEJpbmRpbmc9InVy
bjpvYXNpczpuYW1lczp0YzpTQU1MOjIuMDpiaW5kaW5nczpIVFRQLVBPU1Qi
IFByb3ZpZGVyTmFtZT0iZ29vZ2xlLmNvbSIgQXNzZXJ0aW9uQ29uc3VtZXJT
ZXJ2aWNlVVJMPSJodHRwczovL3d3dy5nb29nbGUuY29tL2Evc29sd2ViLm5v
L2FjcyIgSXNQYXNzaXZlPSJ0cnVlIj48c2FtbDpJc3N1ZXIgeG1sbnM6c2Ft
bD0idXJuOm9hc2lzOm5hbWVzOnRjOlNBTUw6Mi4wOmFzc2VydGlvbiI+Z29v
Z2xlLmNvbTwvc2FtbDpJc3N1ZXI+PHNhbWxwOk5hbWVJRFBvbGljeSBBbGxv
d0NyZWF0ZT0idHJ1ZSIgRm9ybWF0PSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FN
TDoyLjA6bmFtZWlkLWZvcm1hdDp1bnNwZWNpZmllZCIgLz48L3NhbWxwOkF1
dGhuUmVxdWVzdD4=
When you escape that, you get:
PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4gPHNhbWxw%0AOkF1dGhuUmVxdWVzdCB4bWxuczpzYW1scD0idXJuOm9hc2lzOm5hbWVzOnRj%0AOlNBTUw6Mi4wOnByb3RvY29sIiBJRD0iYWdkb2JqY2Zpa25lb21tZmphbWRj%0AbGVuamNwY2ptZ2RnYm1wZ2ptbyIgVmVyc2lvbj0iMi4wIiBJc3N1ZUluc3Rh%0AbnQ9IjIwMDctMDQtMjZUMTM6NTE6NTZaIiBQcm90b2NvbEJpbmRpbmc9InVy%0AbjpvYXNpczpuYW1lczp0YzpTQU1MOjIuMDpiaW5kaW5nczpIVFRQLVBPU1Qi%0AIFByb3ZpZGVyTmFtZT0iZ29vZ2xlLmNvbSIgQXNzZXJ0aW9uQ29uc3VtZXJT%0AZXJ2aWNlVVJMPSJodHRwczovL3d3dy5nb29nbGUuY29tL2Evc29sd2ViLm5v%0AL2FjcyIgSXNQYXNzaXZlPSJ0cnVlIj48c2FtbDpJc3N1ZXIgeG1sbnM6c2Ft%0AbD0idXJuOm9hc2lzOm5hbWVzOnRjOlNBTUw6Mi4wOmFzc2VydGlvbiI%2BZ29v%0AZ2xlLmNvbTwvc2FtbDpJc3N1ZXI%2BPHNhbWxwOk5hbWVJRFBvbGljeSBBbGxv%0Ad0NyZWF0ZT0idHJ1ZSIgRm9ybWF0PSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FN%0ATDoyLjA6bmFtZWlkLWZvcm1hdDp1bnNwZWNpZmllZCIgLz48L3NhbWxwOkF1%0AdGhuUmVxdWVzdD4%3D%0A
So, the bottom line is that you're getting junk from the decode because the data is not of the correct format.
It appears that the data is also deflated/compressed.
require 'cgi'
require 'base64'
require 'zlib'
deflated = Base64::decode64(CGI::unescape('fZJNT%2BMwEIbvSPwHy%2Fd8tMvHympSdUGISuwS0cCBm%2BtMUwfbk%2FU4zfLvSVMq2Euv45n3fd7xzOb%2FrGE78KTRZXwSp5yBU1hpV2f8ubyLfvJ5fn42I2lNKxZd2Lon%2BNsBBTZMOhLjQ8Y77wRK0iSctEAiKLFa%2FH4Q0zgVrceACg1ny9uMy7rCdaM2%2Bs0BWrtppK2UAdeoVjW2ruq1bevGImcvR6zpHmtJ1MHSUZAuDKU0vY7Si2h6VU5%2BiMuJuLx65az4dPql3SHBKaz1oYnEfVkWUfG4KkeBna7A%2Fxm6M14j1gZihZazBRH4MODcoKPOgl%2BB32kFz08PGd%2BG0JJIkr7v46%2BhRCaEpod17DCRivYZCkmkd4N28B3wfNyrGKP5bws9DS6PKDz%2FMpsl36Tyz%2F%2Fax1jeFmi0emcLY7C%2F8SDD0Z7dobcynHbbV3QVbcZW0TlqQemNhoqzJD%2B4%2Fn8Yw7l8AA%3D%3D'))
# Negative window bits tell Zlib to expect a raw deflate stream (no zlib header).
zlib = Zlib::Inflate.new(-Zlib::MAX_WBITS)
zlib.inflate(deflated)
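Inflating then yields the AuthnRequest XML shown at the top of the question. This deflate-then-Base64-then-URL-encode pipeline is what SAML's HTTP-Redirect binding prescribes, which is why the raw-deflate window bits are needed.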
