New line character in serialized messages - protocol-buffers

Some protobuf messages, when serialized to a string, contain a newline character \n. Usually, when the first field of the message is a string, the newline character is prepended to the message, but we also found messages with a newline character somewhere in the middle.
The problem with the newline character arises when you want to save the messages to one file, line by line: the newline breaks the line and makes the message invalid.
example.proto
syntax = "proto3";
package data_sources;
message StringFirst {
  string key = 1;
  bool valid = 2;
}
message StringSecond {
  bool valid = 1;
  string key = 2;
}
example.py
from protocol_buffers.data_sources.example_pb2 import StringFirst, StringSecond
print(StringFirst(key='some key').SerializeToString())
print(StringSecond(key='some key').SerializeToString())
output
b'\n\x08some key'
b'\x12\x08some key'
Is this expected / desired behaviour? How can one prevent the newline character?

protobuf is a binary protocol (unless you're talking about the optional json thing). So: any time you're treating it as text-like in any way, you're using it wrong and the behaviour will be undefined. This includes worrying about whether there are CR/LF characters, but it also includes things like the nul-character (0x00), which is often interpreted as end-of-string in text-based APIs in many frameworks (in particular, C-strings).
Specifically:
LF (0x0A) is identical to the field header for "field 1, length-prefixed"
CR (0x0D) is identical to the field header for "field 1, fixed 32-bit"
any of 0x00, 0x0A or 0x0D could occur as a length prefix (to signify a length of 0, 10, or 13)
any of 0x00, 0x0A or 0x0D could occur naturally in binary data (bytes)
any of 0x00, 0x0A or 0x0D could occur naturally in any numeric type
0x0A or 0x0D could occur naturally in text data (as could 0x00 if your originating framework allows nul-characters arbitrarily in strings, so... not C-strings)
and probably a range of other things
So: again - if the inclusion of "special" text characters is problematic: you're using it wrong.
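To make the length-prefix case concrete: continuing the question's example.py, a 10-character value makes the length byte itself 0x0A, even though the field header here is 0x12:

print(StringSecond(key='0123456789').SerializeToString())
# b'\x12\n0123456789' -- the \n is not a field header, just the length 10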
The most common way to handle binary data as text is to use a base-N encode; base-16 (hex) is convenient to display and read, but base-64 is more efficient in terms of the number of characters required to convey the same number of bytes. So if possible: convert to/from base-64 as required. Base-64 never includes any of the non-printable characters, so you will never encounter CR/LF/nul.
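A minimal sketch of that approach in Python, reusing the StringFirst message from the question (the file name messages.txt is only an example):

import base64
from protocol_buffers.data_sources.example_pb2 import StringFirst

# Write: base-64 output never contains CR, LF or NUL, so one message per line is safe.
msg_bytes = StringFirst(key='some key').SerializeToString()
with open('messages.txt', 'a') as f:
    f.write(base64.b64encode(msg_bytes).decode('ascii') + '\n')

# Read back, line by line:
with open('messages.txt') as f:
    for line in f:
        restored = StringFirst()
        restored.ParseFromString(base64.b64decode(line))
        print(restored.key)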

Related

Calculating checksum or XOR operations

I'm using HyperTerminal and trying to send strings to a 6-digit scoreboard. I was sent a sample string from the manufacturer to test with and it worked, but to be able to change the displayed message I was told to calculate a new checksum value.
The sample string is: &AHELLO N-12345\71
Characters A and N are addresses for the scoreboards (allowing two displays to be used through one RS232 connection). HELLO and -12345 are the characters to be shown on the display. The "71" is where I am getting stuck.
How can you obtain 71 from "AHELLO N-12345"?
In the literature supplied with the scoreboard, the "71" from the sample string is described as a character-by-character logical XOR operation on the characters "AHELLO N-12345". The manufacturer, however, called it a checksum. I'm not trained in this sort of thing, and although I did try to research it, I can't put it together on my own.
The text below is copied from the supplied literature and describes the "71" (ckck) in question...
- ckck = 2 ASCII control characters: corresponds to the two hexadecimal digits obtained by performing the character by character logical XOR operation on characters "AxxxxxxByyyyyy". If there is an error in these characters, the string is ignored.
Example: if the byte by byte logical XOR operation carried out on the ASCII codes of the characters of the "AxxxxxxByyyyyy" string returns the hexadecimal value 0x2A, the control characters ckck are "2" and "A".
You don't specify a language, but here's the algorithm in C#. Basically, XOR the character values of the string together and you'll end up with 113, which is 0x71 in hex; hence the "71" on the end of the sample string.
string input = "AHELLO N-12345";
UInt16 chk = 0;
foreach (char ch in input) {
    chk ^= ch;   // XOR the ASCII code of each character into the running value
}
MessageBox.Show("value is " + chk);   // 113 decimal == 0x71
Outputs "value is 113"

How can I convert ASCII code to characters in Verilog language

I've been looking into this but searching seems to lead to nothing.
It might be too simple to be described, but here I am, scratching my head...
Any help would be appreciated.
Verilog knows about "strings".
A single ASCII character requires 8 bits. Thus to store 8 characters you need 64 bits:
wire [63:0] string8;
assign string8 = "12345678";
There are some gotchas:
There is no end-of-string character (like the C null character).
The right-most character is stored in bits 7:0. Thus string8[7:0] will hold 8'h38 ("8").
To walk through a string you have to use an indexed part-select, e.g. string8[8*i +: 8] for character i, counting from the right (see the sketch after this list).
As with all Verilog vector assignments, unused bits are set to zero, thus:
assign string8 = "ABCD"; // MS bits 63:32 are zero
You cannot use two-dimensional arrays:
wire [7:0] string5 [0:4]; assign string5 = "Wrong"; // illegal
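If it helps to see that byte layout outside a simulator, here is a small Python illustration (not Verilog) of the same packing; the left-most character lands in the most significant byte:

value = 0
for ch in "12345678":
    value = (value << 8) | ord(ch)   # left-most character ends up in the MS byte

print(hex(value & 0xFF))             # bits 7:0 -> 0x38, i.e. the character '8'
print(chr((value >> 8) & 0xFF))      # the next byte up -> '7'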
You are probably misled by a misconception about characters: there is no such thing as a character in hardware. There are only sets of bits, or codes. The only thing which converts binary codes to characters is your terminal. It interprets codes in a certain way, forming letters for you to see. So all the printfs in C and $display calls in Verilog only send the codes to the terminal (or to a file).
The thing which converts characters to codes is your keyboard, which you also use to type in the program. The compiler then interprets your program. The Verilog compiler, like the C compiler, represents the double-quoted strings you typed as a sequence of bytes directly. Verilog, like C, uses 8-bit ASCII encoding for such character strings, meaning that the code for 'a' is decimal 97, 'b' is 98, and so on. Every character is 8 bits wide, and the quoted string forms a concatenation of bytes of ASCII codes.
So, answering your question: you can convert ASCII codes to characters by sending them to the terminal via the $display (or similar) system task, using the %s format specifier.
So, an example:
module A;
  reg [8*5-1:0] hello;
  reg [8*3-1:0] bye;
  initial begin
    hello = "hello";                 // 5 bytes of characters
    bye   = {8'd98, 8'd121, 8'd101}; // 3 bytes: 'b' 'y' 'e'
    $display("hello=%s bye=%s", hello, bye);
  end
endmodule

How to use protobuf to deserialize string between c# and java?

In C#, I use protobuf-net to serialize a string to a byte array and send it through the network:
var bytes = Serializer.SerializeObject("Hello,world");
This byte array contains 13 elements, including 2 prefix bytes: the field tag 0x0a, then 0x0b for the string length.
I tried to deserialize it in Java: I used ByteString to convert that byte array to a string, and I got an incorrect string: \n Hello,world!
This means that Java does not ignore the prefix bytes.
Does anybody know why? Thanks!
The protobuf format doesn't allow for naked data, so protobuf-net interprets Serializer.Serialize("Hello, world") as though it were a message of the form:
message Foo {
  optional string value = 1;
}
and as if you had - in C# terms - used:
Serializer.Serialize(new Foo { value = "Hello, world" });
The 0x0a is the field header for field 1 (length-prefixed), etc.
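You can reproduce that framing by hand; here is a short, purely illustrative Python sketch of the 13 bytes:

# field header = (field_number << 3) | wire_type = (1 << 3) | 2 = 0x0A
payload = b'Hello,world'
encoded = bytes([0x0A, len(payload)]) + payload
print(encoded)        # b'\n\x0bHello,world'
print(len(encoded))   # 13: header + length (0x0B) + 11 characters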
If you ever want to check the internals of an encoded message without knowing the schema, this tool may help: https://protogen.marcgravell.com/decode

how to document a single space character within a string in reST/Sphinx?

I've gotten lost in an edge case of sorts. I'm working on a conversion of some old plaintext documentation to reST/Sphinx format, with the intent of outputting to a few formats (including HTML and text) from there. Some of the documented functions are for dealing with bitstrings, and a common case within these is a sentence like the following: Starting character is the blank " " which has the value 0.
I tried writing this as an inline literal in the following ways: Starting character is the blank `` `` which has the value 0. or Starting character is the blank :literal:` ` which has the value 0. but there are a few problems with how these end up working:
1. reST syntax objects to whitespace immediately inside the literal, so it doesn't get recognized.
2. The above can be "fixed" (it looks correct in the HTML () and plaintext (" ") output) with a non-breaking space character inside the literal, but technically this is a lie in our case, and if a user copied this character, they wouldn't be copying what they expect.
3. The space can be wrapped in regular quotes, which allows the literal to be properly recognized, and while the output in HTML is probably fine (" "), in plaintext it ends up double-quoted as "" "".
In both 2 and 3 above, if the literal falls on the wrap boundary, the plaintext writer (which uses textwrap) will gladly wrap inside the literal and trim the space, because it's at the start or end of the line.
I feel like I'm missing something; is there a good way to handle this?
Try using the unicode character codes. If I understand your question, this should work.
Here is a "|space|" and a non-breaking space (|nbspc|)
.. |space| unicode:: U+0020 .. space
.. |nbspc| unicode:: U+00A0 .. non-breaking space
You should see:
Here is a “ ” and a non-breaking space ( )
I was hoping to get out of this without needing custom code to handle it, but, alas, I haven't found a way to do so. I'll wait a few more days before I accept this answer in case someone has a better idea. The code below isn't complete, nor am I sure it's "done" (we will sort out exactly what it should look like during our review process), but the basics are intact.
There are two main components to the approach:
introduce a char role which expects the unicode name of a character as its argument, and which produces an inline description of the character while wrapping the character itself in an inline literal node.
modify the text-wrapper Sphinx uses so that it won't break at the space.
Here's the code:
import re
import unicodedata

from docutils import nodes
from textwrap import TextWrapper   # base class assumed; the glue to swap this into Sphinx's text writer is omitted

class TextWrapperDeux(TextWrapper):
    _wordsep_re = re.compile(
        r'((?<!`)\s+(?!`)|'                        # whitespace not between backticks
        r'(?<=\s)(?::[a-z-]+:)`\S+|'               # interpreted text start
        r'[^\s\w]*\w+[a-zA-Z]-(?=\w+[a-zA-Z])|'    # hyphenated words
        r'(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))')    # em-dash

    @property
    def wordsep_re(self):
        return self._wordsep_re
def char_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
    """Describe a character given by its unicode name.

    e.g., :char:`SPACE` -> "char:` `(U+00020 SPACE)"
    """
    try:
        character = unicodedata.lookup(text)   # lookup lives in the unicodedata module
    except KeyError:
        msg = inliner.reporter.error(
            ':char: argument %s must be a valid unicode name at line %d' % (text, lineno))
        prb = inliner.problematic(rawtext, rawtext, msg)
        return [prb], [msg]
    describe_char = "(U+%05X %s)" % (ord(character), text)
    char = nodes.inline("char:", "char:", nodes.literal(character, character))
    char += nodes.inline(describe_char, describe_char)
    return [char], []

def setup(app):
    app.add_role('char', char_role)
The code above still lacks some glue to actually force the use of the new TextWrapper. When a full version settles out I may try to find a meaningful way to republish it; if so, I'll link it here.
Markup: Starting character is the :char:`SPACE` which has the value 0.
It'll produce plaintext output like this: Starting character is the char:` `(U+00020 SPACE) which has the value 0.
And HTML output like: Starting character is the <span>char:<code class="docutils literal"> </code><span>(U+00020 SPACE)</span></span> which has the value 0.
The HTML output ends up looking roughly like: Starting character is the char:(U+00020 SPACE) which has the value 0.

Extended ASCII codes incomplete in Dart? No characters from 128 to 160

I created a small piece of code to print the extended ASCII characters in Dart, but it seems the ones between 128 and 160 are blank.
PrintExtendedASCII() {
  var listCodes = <int>[];
  for (var i = 128; i < 256; i++) {
    listCodes.add(i);
  }
  var list = String.fromCharCodes(listCodes);
  print(list);
}
It only prints :  ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
Is there something different about the extended ASCII characters in Dart?
There is no "extended ASCII" in Dart. The character codes you are using in the code example are not ASCII - they are Unicode code points. For code points 0-127, the character codes match ASCII exactly. The block you are missing, from 128 to 160 (0x80 to 0x9F), is all non-printable control characters.
Here is a table of Unicode code points for the 0x000-0xFFF block. If you look carefully, the order of characters exactly matches the string printed on your machine.
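You can confirm this against the Unicode character database; for example, a quick Python check (illustrative):

import unicodedata

for i in (0x80, 0x9F, 0xA0, 0xA1):
    ch = chr(i)
    print(hex(i), unicodedata.category(ch), unicodedata.name(ch, '<unnamed control>'))

# 0x80 Cc <unnamed control>        (non-printable C1 control)
# 0x9f Cc <unnamed control>        (non-printable C1 control)
# 0xa0 Zs NO-BREAK SPACE           (renders as a blank)
# 0xa1 Po INVERTED EXCLAMATION MARK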
