How to verify and validate parsed Google Protobuf v2 file - protocol-buffers

First, to get the acknowledgement out of the way: yes, I am aware of protoc, but I have a specific requirement to derive some specialized target-language artifacts from the output of my own .proto parser.
That established, I already have the parser itself working, and I am now resolving imported .proto dependencies, which on the surface is not a terribly difficult endeavor in itself.
The next step after that, I think, is to perform a kind of "transitive linkage", as I've learned it's called, but I am curious what I should be aware of. Prima facie, I think I should be collating a map of element paths to field numbers, along with the reserved ranges/names and the extension declarations, then verifying as I traverse the .proto dependency tree.
However, I'd like to get an idea of others' experience, guidance, and feedback along these lines.
For what I'm wanting to accomplish, I do not think this verification step needs to be very elaborate, only enough to rule out invalid .proto files, etc.
Oh, and last but not least: I need to handle this for the Protobuf v2 language spec.
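Not an authoritative recipe, but to make the "collate field numbers, reserved ranges, and extensions, then verify" idea concrete, here is a minimal Go sketch of the kind of per-message checks involved. The Message, Field, and ReservedRange types are hypothetical stand-ins for whatever your parser produces, and cross-file concerns (import resolution, type references) are deliberately left out.

```go
// check.go — a minimal sketch of per-message proto2 verification against
// hypothetical parser-output types; it only aims to rule out clearly
// invalid constructs, not to be a full linker.
package check

import "fmt"

// ReservedRange covers an inclusive range as written in the .proto source,
// e.g. "reserved 9 to 11;".
type ReservedRange struct{ Start, End int32 }

type Field struct {
	Name   string
	Number int32
}

type Message struct {
	FullName        string // e.g. "pkg.Outer.Inner"
	Fields          []Field
	ReservedRanges  []ReservedRange
	ReservedNames   []string
	ExtensionRanges []ReservedRange
}

// Validate applies the basic proto2 rules for a single message: legal and
// unique field numbers and names, and no collisions with reserved ranges,
// reserved names, or declared extension ranges.
func Validate(m Message) error {
	byNumber := map[int32]string{}
	byName := map[string]bool{}
	for _, f := range m.Fields {
		// Field numbers must lie in [1, 536870911] and must avoid
		// 19000-19999, which is reserved for the protobuf implementation.
		if f.Number < 1 || f.Number > 536870911 || (f.Number >= 19000 && f.Number <= 19999) {
			return fmt.Errorf("%s.%s: illegal field number %d", m.FullName, f.Name, f.Number)
		}
		if prev, ok := byNumber[f.Number]; ok {
			return fmt.Errorf("%s: field number %d used by both %s and %s", m.FullName, f.Number, prev, f.Name)
		}
		byNumber[f.Number] = f.Name
		if byName[f.Name] {
			return fmt.Errorf("%s: duplicate field name %q", m.FullName, f.Name)
		}
		byName[f.Name] = true

		for _, r := range m.ReservedRanges {
			if f.Number >= r.Start && f.Number <= r.End {
				return fmt.Errorf("%s.%s: field number %d is reserved (%d to %d)", m.FullName, f.Name, f.Number, r.Start, r.End)
			}
		}
		for _, r := range m.ExtensionRanges {
			if f.Number >= r.Start && f.Number <= r.End {
				return fmt.Errorf("%s.%s: field number %d collides with extension range %d to %d", m.FullName, f.Name, f.Number, r.Start, r.End)
			}
		}
		for _, n := range m.ReservedNames {
			if n == f.Name {
				return fmt.Errorf("%s: field name %q is reserved", m.FullName, f.Name)
			}
		}
	}
	return nil
}
```

The same collation (a map keyed by fully qualified element path) extends naturally to enums and to checking that imported files actually define the types referenced, once the dependency tree has been resolved.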

Related

Using unexported functions/types from stdlib in Go

Disclaimer: yes I know that this is not "supposed to be done" and "use interface composition and delegation" and "the authors of the language know better". However, I am confronted with a choice of either copy-pasting from the standard library and creating my own packages, or doing what I am asking. So please do not reply with "What you want to do is wrong, you are a bad dev and you should feel bad."
So, in Go we have the http stdlib package. This package has a number of functions for dealing with HTTP Range headers and responses (parsers, a struct for "offset+size" and so forth). For various reasons I want to use something that is very similar to ServeContent but works a bit differently (long story short - the amount of plumbing needed to do the ReaderAt gymnastics is suboptimal for what I want to accomplish) so I want to parse the HTTP Range header myself, using the utility functions/structs from the http stdlib package and then deal with them manually. Basically, I want a changed version of ServeContent :-)
Is there a way for me to "reopen" the http stdlib package to use its unexported identifiers? ABI is not a concern for me, as the source is mine, the program gets compiled from scratch every time, etc., and it does not need binary compatibility with older/other Go versions. I.e. I am able to ensure that the build is done on a specific Go version, and there are tests to catch the case where an unexported identifier disappears. So...
If there is a package called foo in the Go standard library, but it only exposes a MagicMegamethod that does the thing I do not need, and uses usefulFunc and usefulStruct that I want to get access to, is there a way for me to get access to those identifiers? Either by reopening the package, or using some other way... that does not involve copy-pasting dozens of lines from stdlib without tests etc.
There exist (rather gruesome) ways of accessing unexported symbols, but they require nontrivial amounts of tricky code, so there's unlikely to be a net win.
Since you've ruled out the "don't do this" direction, it seems that the answer is either NO, or use the methods described in the post I linked to (and this repo).
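For concreteness, one of the gruesome approaches usually meant here is the //go:linkname directive. A rough sketch is below; the net/http.parseRange signature and the httpRange layout are copied from reading the stdlib source of a particular release and are not part of any stable API, so treat every detail as an assumption.

```go
// a.go — a sketch only; this is exactly the fragile trick being warned about.
package main

import (
	"fmt"

	_ "net/http" // ensure the target package is linked into the binary
	_ "unsafe"   // required for go:linkname
)

// httpRange mirrors the unexported net/http.httpRange struct; its layout
// must match the stdlib's exactly or the call below is undefined behaviour.
type httpRange struct {
	start, length int64
}

// parseRange is declared without a body and linked against the unexported
// net/http.parseRange. The package also needs an empty .s file so the
// compiler accepts a bodyless declaration, and newer toolchains may refuse
// to link this unless the linkname check is relaxed (see the -checklinkname
// linker flag for your Go version).
//
//go:linkname parseRange net/http.parseRange
func parseRange(s string, size int64) ([]httpRange, error)

func main() {
	ranges, err := parseRange("bytes=0-99,200-", 1000)
	fmt.Println(ranges, err)
}
```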
FWIW I'd personally just copy the code I need from the standard library and tweak it to my needs. This would likely take less time than the time it took you to write this SO question :-)

Decoding Protobuf encoded data using non-supported platform

I am new to Protobufs; I haven't had much exposure to them. One of the API endpoints we require data from uses Protobuf-encoded data. This generally wouldn't be an issue if I were using a 'supported' language such as JavaScript, Java, Python or even R to decode the data...
Unfortunately, I am trying to automate the process using Alteryx. Rather than this being an Alteryx-specific question, I have a few questions about Protobufs themselves so that I understand the situation better. I've read through the implementation of Protobufs in Java and Python, and have a basic understanding of how to use them.
To summarize (please correct me if I am wrong): a Protobuf is a method of serializing structured data where a .proto schema is used to encode/decode data to and from raw binary. My confusion lies with the compiler. Google's documentation and examples for Python/Java show how a Protobuf compiler (library) is required in order to run the encoding and decoding process. The Google website advises that Protobufs are 'language neutral and platform neutral', but I can't see how that is possible if you need the compiler (and the .proto file!) to do the decoding. For example, how would anyone using a language outside of those for which Google has created a compiler possibly decode Protobuf-encoded data? Am I missing something?
I figure I'm missing something, since it seems weird that a public API would force this constraint.
"language/platform neutral" here simply means that you can reliably get the same data back from any language/framework/platform. The serialization format is defined independently and does not rely on the nuances of any particular framework.
This might seem a low bar, but you'd be surprised how many serialization formats fail to clear it.
Because the format is specified, anyone can create a tool for some other platform. It is a little fiddly if you're not used to dealing in bits, but: totally doable. The protobuf landscape is not dependent on Google - here's a list of some of the known non-Google tools: https://github.com/protocolbuffers/protobuf/blob/master/docs/third_party.md
Also, note that technically you don't even need a .proto; you just need some mechanism for specifying which fields map to which field numbers (since protobuf doesn't include the names). Quite a few in that list can work either from a .proto, or from the field/number map being specified in some other way. The advantage of .proto is simply that it is easy to convey as the schema - and again: isn't tied to any particular language. You can write plugins for "protoc" to add your own tooling, so you don't need to write your own parser from scratch. Or you can write your own parser from scratch if you prefer.
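For what it's worth, here is a small Go sketch (not tied to any particular library) showing the level of bit-fiddling involved: it walks the top level of a wire-format message and reports each field's number and wire type, using the "field 1 = varint 150" example from the protobuf documentation.

```go
// decode.go — a minimal illustration of walking protobuf wire format by hand.
// Groups, malformed input, and value decoding for fixed/length-delimited
// fields are left out; this only shows the tag/varint mechanics.
package main

import "fmt"

// readVarint decodes a base-128 varint starting at buf[i] and returns the
// value together with the index just past it.
func readVarint(buf []byte, i int) (uint64, int) {
	var v uint64
	for shift := uint(0); ; shift += 7 {
		b := buf[i]
		i++
		v |= uint64(b&0x7f) << shift
		if b&0x80 == 0 {
			return v, i
		}
	}
}

func main() {
	// Wire bytes for a message whose field 1 is the varint 150.
	buf := []byte{0x08, 0x96, 0x01}

	for i := 0; i < len(buf); {
		key, next := readVarint(buf, i)
		i = next
		fieldNum, wireType := key>>3, key&0x07

		switch wireType {
		case 0: // varint
			v, next := readVarint(buf, i)
			i = next
			fmt.Printf("field %d (varint) = %d\n", fieldNum, v)
		case 1: // 64-bit fixed
			fmt.Printf("field %d (fixed64)\n", fieldNum)
			i += 8
		case 2: // length-delimited: string, bytes, nested message, packed
			n, next := readVarint(buf, i)
			i = next
			fmt.Printf("field %d (len-delimited), %d bytes\n", fieldNum, n)
			i += int(n)
		case 5: // 32-bit fixed
			fmt.Printf("field %d (fixed32)\n", fieldNum)
			i += 4
		default:
			fmt.Printf("unsupported wire type %d\n", wireType)
			return
		}
	}
}
```

Without a .proto you only get field numbers and wire types; attaching names and higher-level types is exactly what the schema (or an equivalent field/number map) provides.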
You can't really speak of a non-supported platform in this case: it is more a matter of languages for which you can't find a protobuf implementation.
My 2 cents: if you can't find a protobuf implementation for your language, find another language you're familiar with (and that is popular in the protobuf community) and handle the protobuf serialization/deserialization with it. Then call it via a REST API, an executable ... whatever.

Java Bytecode manipulation libraries

I am starting to work on a project and for one of the tasks I need to analyze the source code in order to gather information about the classes and their methods. More specifically, for each method I need to know which internal attributes and external objects (references) it uses throughout the entire method body.
I discussed it with my supervisors and they think that Bytecode manipulation libraries is the way to go. I already looked at BCEL, ASM and Javassist but I'm not sure which one I need to use. Do they all provide access to the method body where I can see all the instructions and get the information I need?
Any advice would be appreciated. Thank you!
If you really “need to analyze the source code”, then libraries which allow to inspect the bytecode are not the way to go.
Otherwise, you really need to define your task precisely. Either you are about to analyze classes, regardless of whether you look at their source code or byte code, or you want to analyze source code and consider doing so by compiling it first and then analyzing the compiled result. In the latter case, you have to weigh the effort of both steps against alternative solutions, which may, e.g., involve direct source code analysis.
Parsing byte code is rather easy, easier than analyzing source code, which is the reason why byte code is produced prior to the execution of Java programs. To answer your concrete question: yes, all three libraries offer you a way to analyze the instructions and associated information. Which one best fits your needs is a question beyond the scope of Stack Overflow.
Whether analyzing the byte code helps depends on your exact requirements. When it comes to field and method access, you can get most of them precisely with that approach; only inlined compile-time constants lose their origins. When it comes to type use, you have to consider that not every source code artifact has a counterpart in the byte code, e.g. widening casts produce no actual code, and local variables usually don't have a declared type (debugging information aside), only an implied type that depends on how they are actually used. They also carry no information about generics, unless debugging information has been included.

CoreNLP: How can I get only collapsed dependencies?

I'm parsing over 60,000 sentences with CoreNLP to get dependencies relations.
Because I only need the collapsed dependencies, the other dependency types -- basic and collapsed-cc-processed -- are redundant for my use, and they make it harder to build my own code, which takes the XML output as input.
Can I get only collapsed dependencies?
If so, please let me know.
Thanks.
There is currently no way to do this. Computing the additional representations takes very little time, so they are always reported. They should be marked specially in the XML output, however, so hopefully it's not hard to filter out the correct representation in your downstream code.
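If it helps, the filtering is only a few lines in most languages. Here is a rough Go sketch that keeps only the collapsed dependencies; it assumes the usual CoreNLP XMLOutputter layout, where each <dependencies> element inside a <sentence> carries a type attribute such as "collapsed-dependencies" (check the attribute values in your own output), and the file name is made up.

```go
// filter.go — a sketch of extracting only collapsed dependencies from
// CoreNLP XML output; layout assumptions noted above.
package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

type dependency struct {
	Relation  string `xml:"type,attr"`
	Governor  string `xml:"governor"`
	Dependent string `xml:"dependent"`
}

type dependencySet struct {
	Type string       `xml:"type,attr"`
	Deps []dependency `xml:"dep"`
}

type corpus struct {
	Sentences []struct {
		Dependencies []dependencySet `xml:"dependencies"`
	} `xml:"document>sentences>sentence"`
}

func main() {
	f, err := os.Open("corenlp-output.xml") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var c corpus
	if err := xml.NewDecoder(f).Decode(&c); err != nil {
		panic(err)
	}

	for i, s := range c.Sentences {
		for _, set := range s.Dependencies {
			if set.Type != "collapsed-dependencies" {
				continue // skip basic and collapsed-cc-processed
			}
			for _, d := range set.Deps {
				fmt.Printf("sentence %d: %s(%s, %s)\n", i+1, d.Relation, d.Governor, d.Dependent)
			}
		}
	}
}
```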

Cross version line matching

I'm considering how to do automatic bug tracking, and as part of that I'm wondering what is available to match source code line numbers (or, more accurately, numbers mapped from instruction pointers via something like addr2line) in one version of a program to the same line in another. (Assume everything is in some kind of source control and is available to my code.)
The simplest approach would be to use a diff tool/lib on the files and do some math on the line number spans; however, this has some limitations:
- It doesn't handle cross-file motion.
- It might not play well with lines that get changed.
- It doesn't look at the information available in the intermediate versions.
- It provides no way to manually patch up lines when the diff tool gets things wrong.
- It's kinda clunky.
Before I start diving into developing something better:
What already exists to do this?
What features do similar systems have that I've not thought of?
Why do you need to do this? If you use decent source version control, you have access to old versions of the code, so you can simply provide a link to the original line so people can see the bug in its original place. In fact, the main problem I see with this system is that the bug may have already been fixed, but your automatic line tracking code will still point to a line and say there's a bug there. It seems this system would be a pain to build and not provide a whole lot of help in practice.
My suggestion is: instead of trying to track line numbers, which as you observed can quickly get out of sync as software changes, you should decorate each assertion (or other line of interest) with a unique identifier.
Assuming you're using C, in the case of assertions, this could be as simple as changing something like assert(x == 42); to assert(("check_x", x == 42)); -- this is functionally identical, due to the semantics of the comma operator in C and the fact that a string literal will always evaluate to true.
Of course this means that you need to identify a priori those items that you wish to track. But given that there's no generally reliable way to match up source line numbers across versions (by which I mean that for any mechanism you could propose, I believe I could propose a situation in which that mechanism does the wrong thing) I would argue that this is the best you can do.
Another idea: If you're using C++, you can make use of RAII to track dynamic scopes very elegantly. Basically, you have a Track class whose constructor takes a string describing the scope and adds this to a global stack of currently active scopes. The Track destructor pops the top element off the stack. The final ingredient is a static function Track::getState(), which simply returns a list of all currently active scopes -- this can be called from an exception handler or other error-handling mechanism.
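Go lacks destructors, so purely as a rough analogue of the RAII-based Track class described above (a swapped-in technique using defer, with made-up names), the same scope-tracking idea might look like this:

```go
// track.go — a rough Go analogue of the Track idea, using defer in place of
// a destructor. This single-goroutine sketch ignores concurrency; a real
// version would want per-goroutine stacks.
package track

// scopes is the global stack of currently active scope descriptions.
var scopes []string

// Enter pushes a description of the scope being entered and returns the
// function that pops it again. Intended usage:
//
//	defer track.Enter("loading config")()
func Enter(desc string) func() {
	scopes = append(scopes, desc)
	return func() { scopes = scopes[:len(scopes)-1] }
}

// State returns a snapshot of the active scopes, the counterpart of
// Track::getState(), e.g. for use in a recover() or error handler.
func State() []string {
	return append([]string(nil), scopes...)
}
```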
