I'm new to protocol buffers and I was wondering whether it is possible to search a protocol buffers binary file and read the data back in a structured format. For example, if a message in my .proto file has 4 fields, I would like to serialize the message, write multiple messages into a file, and then search the file for a particular field. If I find the field, I would like to read back the message in the same structured format in which it was written. Is this possible with protocol buffers? If so, any sample code or examples would be very helpful. Thank you
You should treat the protobuf library as a serialization protocol, not an all-in-one library that supports complex operations (such as querying, indexing, or picking out particular data). Google has various libraries built on top of the open-sourced portion of protobuf that do this, but they have not been released as open source because they are tied to Google's own infrastructure. That said, what you want is certainly possible, but you need to write some code.
Anyhow, your requirements seem to be:
One file contains several serialized messages.
Search for a particular field in each serialized message and extract that message.
There are several ways to achieve this.
The most popular way to do sequential read/write is for the file to contain a series of [size, type, serialized output] records. That is, each serialized output is always prefixed by its size and type (as a fixed 4- or 8-byte value or a variable-length one) to help with reading and parsing. You then just repeat this procedure: 1) read the size and type, 2) read that many bytes, 3) parse them as the given type, 4) go to 1). If you use a union type, or every record in the file has the same type, you may skip the type. You cannot drop the size, because a serialized message does not mark its own end, so there is no way to know where it stops. If you want random read/write, some other data structure is necessary.
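The read loop described above can be sketched in Python as follows. This is only a minimal illustration, assuming a single message type per file (so the type is skipped) and a fixed 4-byte big-endian size prefix; the helper names `write_framed` and `read_framed` are made up for this example, and plain byte strings stand in for what would normally be the output of `SerializeToString()`.

```python
import io
import struct


def write_framed(stream, payload: bytes) -> None:
    """Write one record: a 4-byte big-endian size, then the payload."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_framed(stream):
    """Yield each payload from a stream of size-prefixed records."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated size prefix")
        (size,) = struct.unpack(">I", header)
        payload = stream.read(size)
        if len(payload) < size:
            raise ValueError("truncated record")
        yield payload


# Plain bytes stand in for serialized messages here.
buf = io.BytesIO()
for record in (b"first", b"", b"third record"):
    write_framed(buf, record)

buf.seek(0)
records = list(read_framed(buf))
```

With real messages, each yielded payload would be passed to the generated class's parse method, and the search would then be an ordinary check on the parsed fields.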
Searching for a field in a binary file is trickier. One way is to read and parse the messages one by one and check for the existence of the field with HasField(). It is the most obvious and straightforward way to do it, but also the slowest. If instead you want to search for a field by number (say, you want to find 'optional string email = 3;') by scanning for its binary pattern (such as 0x1A, which encodes field number 3 with wire type 2), that cannot be done reliably. In a serialized binary stream, field information is stored as nothing more than a number. Without exact context (the .proto schema or the structure of the binary file), the number alone doesn't mean anything. There is no guarantee that a given 0x1A is a field tag at all, or that it is a tag of the message type you care about rather than another one; it could be the number 26, or part of some other number, and so on. That is, you need to maintain this information yourself. You may create another file or a database holding whatever is needed to fetch a particular message (such as the offsets of the serialized messages that contain a given field).
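To make the 0x1A ambiguity concrete, here is a small Python sketch. The helpers `decode_tag` and `encode_varint` are written for this example only (they are not part of the protobuf library), and the two byte strings are hand-built toy messages rather than output from a real schema.

```python
def decode_tag(byte: int):
    """Split a single-byte protobuf tag into (field_number, wire_type)."""
    return byte >> 3, byte & 0x07


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


# Read as a tag, 0x1A means field number 3 with wire type 2.
tag = decode_tag(0x1A)

# Field 3 (length-delimited) holding the 1-byte string "a": 1A 01 61
as_tag = bytes([0x1A, 0x01]) + b"a"

# Field 1 (varint) holding the integer 26: 08 1A
as_value = bytes([0x08]) + encode_varint(26)

# The same byte appears in both, once as a tag and once as a value.
both_contain_0x1a = 0x1A in as_tag and 0x1A in as_value
```

A byte-level scan for 0x1A would match both toy messages, which is why an index kept alongside the file is the safer route.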
Long story short, what you are asking for is beyond what the open-source protobuf library itself does, but you can write it yourself to fit your requirements.
I hope this is what you are looking for:
http://temk.github.io/protobuf-utils/
This is a command-line utility for searching within a protobuf file.
Related
I'm looking for a string representation of arbitrary fields inside protocol buffer messages. Is there any library that implements this? I've looked at using field masks, but they don't have strong support for repeated fields.
Protocol buffer message and field descriptors provide field access by name. This allows you to find a particular field using a path and to erase it, if that's what you are asking for (if not, I'd recommend expanding the question to include an example of what you'd like to do).
One corresponding Java method is getDescriptorForType (the return type is a message descriptor, where you'll find field descriptors).
There is a similar descriptor API for C++ (in Java, you could theoretically also use reflection).
This API is not available in the lite runtime.
Without any encryption, if the recipient has the serialized Protobuf file but does not have the generated Protobuf class (they don't have access to the .proto file that defines its structure), is it possible for them to get any data out of the binary?
If they have access to a part of the .proto file (for example, just one related message in the file) can they get a part of that data out from the entire file while skipping other unknown parts?
To the first question: yes, absolutely; the protoc tool can help with this (see: --decode_raw), as can https://protogen.marcgravell.com/decode - so it should not be treated as "secure" at all
To the second question: yes, absolutely - that's a key part built into the protocol that allows messages to be extensible, such that readers can decode the bits they understand and either ignore or just store (for round-trip or "extension" fields) the bits they don't understand
protobuf is not a security device; to someone with the right tools it is just as readable as xml or json, with the slight issue that it can be uncertain how to interpret some values; but: you can infer and guess and reverse engineer
OK, I have found this page: https://developers.google.com/protocol-buffers/docs/encoding
The encoded message discards all the field names and is just a sequence of pairs of a key (the field number) and a value. The generated class might offer some protection for reading this data safely, and it cannot interpret unknown data (naturally, because the generated class was generated from a known structure, the .proto file).
But if I were an attacker, I could refer to that Encoding page and try to figure out which area in the binary corresponds to which data. For example, a varint might be easy to spot after changing some data. I could then write my own .proto file to attack this unknown data, or even a custom binary reader that selectively reads part of the binary.
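As an illustration of such a custom reader, here is a minimal Python sketch that walks the wire format without any .proto file, roughly in the spirit of what --decode_raw reports. It handles only wire types 0, 1, 2 and 5, does not recurse into nested messages, and the function names are invented for this example. The sample bytes are the "150" and "testing" examples from the Encoding page.

```python
def read_varint(data: bytes, pos: int):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_raw(data: bytes):
    """Return a list of (field_number, wire_type, value) without a schema."""
    fields = []
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:            # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:          # fixed 64-bit, kept as raw bytes
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:          # length-delimited
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:          # fixed 32-bit, kept as raw bytes
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if pos > len(data):
            raise ValueError("truncated field")
        fields.append((field_number, wire_type, value))
    return fields


# Field 1 = varint 150, field 2 = the string "testing".
sample = bytes.fromhex("089601" "120774657374696e67")
fields = decode_raw(sample)
```

The reader recovers field numbers and raw values, but not names or exact types: a length-delimited value could be a string, bytes, or a nested message, which is the uncertainty mentioned in the answer above.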
In a Hadoop Cascading flow, I have a number of tuples that are processed and finally written to a sink.
Now my requirement is to write those tuples to the destination file with certain constant string values at the beginning and at the end.
For example, I have the following input tuples:
10|11|12|13|14|15|16|17|18|19|20
20|21|22|23|24|25|26|27|28|29|30
1|2|3|4|5|6|7|8|9|10
Now I need output like this:
Certain data before those data
10|11|12|13|14|15|16|17|18|19|20
20|21|22|23|24|25|26|27|28|29|30
1|2|3|4|5|6|7|8|9|10
Certain data after those data
I have looked a little at the DelimitedParser class and its methods such as joinLine and joinFirstLine, but because of the poor documentation I cannot work out exactly what they do.
It may depend on what "Certain data before those data" means.
If you are using TextDelimited, then you can write the header values to the sink. According to the documentation, header values are not written by default, so you will need to enable this. Another thing to remember is that the header values represent the output fields.
-Amit
I have a location that stores plain string data right now and I want to store at the same location a protobuf object in the future.
Is there a way for my new code to read the old data (plain string) and reliably determine: "this is not a protobuf object"?
Given that I can't reliably distinguish the current plain string data (e.g. hostnames, which could be anything) from other formats, and that some protocol buffer objects might happen to resemble hostnames, it doesn't seem feasible, right?
It seems that you haven't yet stored any protocol buffers data in the location. How about marking the new data with some specific prefix?
For example, generate a GUID or just some random 64-bit value, and prepend it to the data whenever you store a protobuf message. On reading, just check if this marker is present.
By varying the length of the marker, you can make any collisions with existing data arbitrarily improbable.
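A minimal Python sketch of the marker idea follows. The 16-byte `MAGIC` value and the `wrap`/`unwrap` helpers are hypothetical, chosen only for this example; in practice you would generate your own random marker once and keep it fixed. This particular marker starts with byte 0x8F, which cannot begin valid UTF-8 text, so it also cannot collide with old data that is plain ASCII or UTF-8.

```python
# Hypothetical 16-byte marker; generate your own random value once
# and never change it afterwards.
MAGIC = bytes.fromhex("8f3c1e9a5b7d4f20a1c6e2d4b8f09e17")


def wrap(serialized: bytes) -> bytes:
    """Prefix a serialized protobuf message with the marker before storing."""
    return MAGIC + serialized


def unwrap(stored: bytes):
    """Classify stored bytes; return (kind, payload)."""
    if stored.startswith(MAGIC):
        return "protobuf", stored[len(MAGIC):]
    return "legacy", stored  # old plain-string data, returned unchanged


new_kind, new_payload = unwrap(wrap(b"\x08\x96\x01"))
old_kind, old_payload = unwrap(b"host.example.com")
```

Only the "protobuf" payload should then be handed to the message parser; "legacy" values go through the old string-handling path.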
I am processing some CDRs (call detail records). I don't know exactly what kind of files they are, but I suppose they are BER-encoded 'ASN.1' files. My problem is that I want to modify some data in these files, but I don't know which editor or decoder I can use to do so. I searched a lot and found many ASN.1 decoders as well as ASN.1 BER viewers/editors, but none of them lets me do what I want.
These CDRs are supposed to contain customer details, phone numbers, telecom services (telephony, SMS, MMS), etc.
One CDR file name is - GGSN01_20120105000102_56641-09-12-01-09%3A30
and its file type is - File
No other information is available. When I open this file in a text editor it shows some rectangles and some text data.
Any telecom person can definitely help me. I am new to the telecom domain.
Please ask if you need more information. Thanks
You would need to know something about ASN.1 and BER to be able to edit your file correctly. BER is a binary format, not ASCII text, which explains what you see in your text editor. Even modifying embedded plain text is only safe if you do not change the length of the string; BER uses nested structures that encode lengths, so a change in the length of a string value requires adjusting the encoded lengths of all the enclosing structures. Additionally, in order to really know what your data is, you would need the ASN.1 specification that describes it (that is, that defines the types of your encoded data).
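The effect of nested lengths can be seen in a small Python sketch. It is a toy reader, not a full BER decoder: it assumes single-byte tags and definite lengths, the name `read_tlv` is invented for this example, and the bytes are a hand-built SEQUENCE containing one UTF8String rather than a real CDR.

```python
def read_tlv(data: bytes, pos: int = 0):
    """Read one BER header; return (tag, length, value_start).

    Assumes a single-byte tag and a definite length.
    """
    tag = data[pos]
    if tag & 0x1F == 0x1F:
        raise ValueError("multi-byte tags are not handled in this sketch")
    first = data[pos + 1]
    pos += 2
    if first < 0x80:                      # short form: length in one byte
        length = first
    elif first == 0x80:
        raise ValueError("indefinite length is not handled in this sketch")
    else:                                 # long form: next n bytes hold length
        n = first & 0x7F
        length = int.from_bytes(data[pos:pos + n], "big")
        pos += n
    return tag, length, pos


# SEQUENCE { UTF8String "abc" }  ->  30 05 0C 03 61 62 63
original = bytes.fromhex("30050c03616263")
outer = read_tlv(original)
inner = read_tlv(original, outer[2])

# Changing the string to "abcd" changes BOTH length bytes:
# 30 06 0C 04 61 62 63 64
edited = bytes.fromhex("30060c0461626364")
outer_edited = read_tlv(edited)
inner_edited = read_tlv(edited, outer_edited[2])
```

Growing the string by one byte changes the inner length from 3 to 4 and the enclosing SEQUENCE length from 5 to 6, which is why patching text in a hex or text editor usually corrupts the file.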
You could use a tool such as ASN.1 editor, but without the requisite background knowledge, I think it will not be very helpful to you. You can follow various links on this resources page to get more information about ASN.1. (full disclosure: I am currently an Obj-Sys employee).
Look for tools like enber and unber; they come as debugging tools with Lev Walkin's free ASN.1 compiler. At least you get a text format from them.
The systematic solution is, of course, to write a program that reads the BER file, applies the changes, and then writes out the altered BER file. To do so you need the ASN.1 specification file for your CDR format (usually found in the specifications of the standard you are using, e.g. IMS), an ASN.1 compiler such as Lev's, and some programming skills.