Parsing CArchive (MFC classes) files in Ruby

I have a legacy app that seems to be exporting/saving files with CArchive (legacy MFC application).
We're currently refactoring the tool for the web. Is there a library I can look at in Ruby for parsing and loading these legacy files?
What possible libraries could I look into?
Problems with the file format (according to XML serialization for MFC) include:
- Non-robustness: your program will probably crash if you read an archive produced by another version of your program. This can only be avoided with complex and unwieldy version management; using XML largely avoids the problem.
- Heavy dependencies between your program object model and the archived data. Change the program model and it is almost impossible to read data from a previous version.
- Archived data cannot be edited, understood, or changed, except with the associated application.
Also: four versions of the legacy software exist. How would I be able to overcome this object model / archived data coupling for the different versions? Full backward (import) capability is required.

CArchive doesn't have a format that you can parse. It's just a binary file. You have to know what is in it to know how to read it. A library could make it easier to read some data types (CString, CArray, etc.) but I'm not sure you'll find anything like this.
CArchive works like this (storing part):
CFile file(_T("data.bin"), CFile::modeCreate | CFile::modeWrite);
CArchive ar(&file, CArchive::store);
int i = 5;
float f = 5.42f;
CString str("string");
ar << i << f << str;
Then all of this is dumped into a binary file. You would have to read that binary data and interpret it yourself. This is easy in C++ because MFC knows how to serialize its types, including complex ones like CString and CArray, but you'll have to do it on your own in Ruby.
For example, you might read 4 bytes (because you know that an int is that big) and interpret them as an integer, then the next four bytes as a float. Loading a CString takes more care: it stores the length first and then the data, but you'll have to look at the exact format it uses. You could create utility functions for each type to make your life easier, but don't expect this to be simple.
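For illustration only, here is the rough shape such hand-rolled reading could take in Ruby. The 32-bit sizes, the byte order, the field order and the one-byte string length prefix are all assumptions you would have to verify against real archives (as far as I recall, the CString length prefix is actually variable-length and depends on the MFC version and character width):
# Hypothetical sketch: read an int, a float and a length-prefixed string
# from a CArchive dump. Sizes, byte order and the string length encoding
# are assumptions -- verify them against real files from your application.
def read_carchive_sample(path)
  data = File.binread(path)
  offset = 0
  i = data[offset, 4].unpack1('l<')   # 32-bit little-endian signed int
  offset += 4
  f = data[offset, 4].unpack1('e')    # 32-bit little-endian float
  offset += 4
  len = data[offset, 1].unpack1('C')  # assume a one-byte length prefix
  offset += 1
  str = data[offset, len]
  { int: i, float: f, string: str }
end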

You could write an exporter in C++ that uses the old functionality: read in the CArchive and then output the contents as an XML file (or whatever format suits you). Reading CArchives directly from Ruby (or any language other than C++/MFC) is going to be a major project. Maybe you can get away with it if the data written is just a struct with a few ints or longs, but as soon as your CArchive contains UDTs you're in for a world of pain. For example, I don't even think CArchive makes any promises about alignment.


Is it possible to use CompUnit modules for collected data?

I have developed a module for processing a collection of documents.
One run of the software collects information about them. The data is stored in two structures called %processed and %symbols. The data needs to be cached for subsequent runs of the software on the same set of documents, some of which can change. (The documents are themselves cached using CompUnit modules).
Currently the data structures are stored / restored as follows:
# storing
'processed.raku'.IO.spurt: %processed.raku;
'symbols.raku'.IO.spurt: %symbols.raku;
# restoring
my %processed = EVALFILE 'processed.raku';
my %symbols = EVALFILE 'symbols.raku';
Writing these structures out to files, which can be quite large, can be slow because the hashes have to be stringified, and reading them back in is slow because the stringified forms have to be recompiled.
It is not intended for the cached files to be inspected, only to save state between software runs.
In addition, although this is not a problem for my use case, this technique cannot be used in general because Stringification (serialisation) does not work for Raku closures - as far as I know.
I was wondering whether the CompUnit modules could be used because they are used to store compiled versions of modules. So perhaps, they could be used to store a 'compiled' or 'internal' version of the data structures?
Is there already a way to do this?
If there isn't, is there any technical reason it might NOT be possible?
(There's a good chance that you've already tried this and/or it isn't a good fit for your use case, but I thought I'd mention it just in case it's helpful either to you or to anyone else who finds this question.)
Have you considered serializing the data to/from JSON with JSON::Fast? It has been optimized for (de)serialization speed in a way that basic stringification hasn't been / can't be. That doesn't allow for storing Blocks or other closures; as you mentioned, Raku doesn't currently have a good way to serialize them. But since you mentioned that isn't an issue, it's possible that JSON would fit your use case.
[EDIT: as you pointed out below, this can make support for some Raku data structures more difficult. There are typically (but not always) ways to work around the issue by specifying the data type as part of the serialization step:
use JSON::Fast;
my $a = <a a a b>.BagHash;
my $json = $a.&to-json;
my BagHash() $b = from-json($json);
say $a eqv $b; # OUTPUT: «True»
This gets more complicated for data structures that are harder to represent in JSON (such as those with non-string keys). The JSON::Class module could also be helpful, but I haven't tested its speed.]
After looking at other answers and looking at the code for Precompilation, I realised my original question was based on a misconception.
The Rakudo compiler generates an intermediate "byte code", which is then used at run time. Since modules are self-contained units for compilation purposes, they can be precompiled. This intermediate result can be cached, thus significantly speeding up Raku programs.
When a Raku program uses code that has already been compiled, the compiler does not compile it again.
I had thought of the precompilation cache as a sort of storage of the internal state of a program, but it is not quite that. That is why, I think, @ralph was confused by the question: I was not asking the right sort of question.
My question is about the storage (and restoration) of data. JSON::Fast, as discussed by @codesections, is very fast because it is used by the Rakudo compiler at a low level and so is highly optimised. Consequently, restructuring the data on restoration will be faster than restoring native data types, because the slow, rate-determining step is storing to and restoring from "disk", which JSON does very quickly.
Interestingly, the CompUnit modules I mentioned use low level JSON functions that make JSON::Fast so quick.
I am now considering other ways of storing data using optimised routines, perhaps using a compression/archiving module. It will come down to testing which is fastest. It may be that the JSON route is the fastest.
So this question does not have a clear answer because the question itself is "incorrect".
Update: As @RichardHainsworth notes, I was confused by their question, though I felt it would be helpful to answer as I did. Based on his reaction, and his decision not to accept @codesections' answer (which at that point was the only other answer), I concluded it was best to delete this answer to encourage others to respond. Now that Richard has provided an answer with a good resolution, I'm undeleting mine in the hope that it's more useful.
TL;DR Instead of using EVALFILE, store your data in a module which you then use. There are simple ways to do this that would be minimal but useful improvements over EVALFILE. There are more complex ways that might be better.
A small improvement over EVALFILE
I've decided to first present a small improvement so you can solidify your shift in thinking from EVALFILE. It's small in two respects:
It should only take a few minutes to implement.
It only gives you a small improvement over EVALFILE.
I recommend you properly consider the rest of this answer (which describes more complex improvements with potentially bigger payoffs instead of this small one) before bothering to actually implement what I describe in this first section. I think this small improvement is likely to turn out to be redundant beyond serving as a mental bridge to later sections.
Write a program, say store.raku, that creates a module, say data.rakumod:
use lib '.';
my %hash-to-store = :a, :b;
my $hash-as-raku-code = %hash-to-store .raku;
my $raku-code-to-store = "unit module data; our %hash = $hash-as-raku-code";
spurt 'data.rakumod', $raku-code-to-store;
(My use of .raku is of course overly simplistic. The above is just a proof of concept.)
This form of writing your data will have essentially the same performance as your current solution, so there's no gain in that respect.
Next, write another program, say, using.raku, that uses it:
use lib '.';
use data;
say %data::hash; # {a => True, b => True}
use-ing the module will entail compiling it. So the first time you use this approach for reading your data instead of EVALFILE it'll be no faster, just as it was no faster to write it. But it should be much faster for subsequent reads. (Until you next change the data and have to rebuild the data module.)
This section also doesn't deal with closure stringification, and means you're still doing a data writing stage that may not be necessary.
Stringifying closures; a hack
One can extend the approach of the previous section to include stringifications of closures.
You "just" need to access the source code containing the closures; use a regex/parse to extract the closures; and then write the matches to the data module. Easy! ;)
For now I'll skip filling in details, partly because I again think this is just a mental bridge and suggest you read on rather than try to do as I just described.
Using CompUnits
Now we arrive at:
I was wondering whether the CompUnit modules could be used because they are used to store compiled versions of modules. So perhaps, they could be used to store a 'compiled' or 'internal' version of the data structures?
I'm a bit confused by what you're asking here for two reasons. First, I think you mean the documents ("The documents are themselves cached using CompUnit modules"), and that documents are stored as modules. Second, if you do mean the documents are stored as modules, then why wouldn't you be able to store the data you want stored in them? Are you concerned about hiding the data?
Anyhow, I will presume that you are asking about storing the data in the document modules, and that you're interested in ways to "hide" that data.
One simple option would be to write the data as I did in the first section, but insert the generated our %hash = ... code at the end, after the actual document, rather than at the start.
But perhaps that's too ugly / not "hidden" enough?
Another option might be to add Pod blocks with Pod block configuration data at the end of your document modules.
For example, putting all the code into a document module and throwing in a say just as a proof-of-concept:
# ...
# Document goes here
# ...
# At end of document:
=begin data :array<foo bar> :hash{k1=>v1, k2=>v2} :baz :qux(3.14)
=end data
say $=pod[0].config<array>; # foo bar
That said, that's just code being executed within the module; I don't know if the compiled form of the module retains the config data. Also, you need to use a "Pod loader" (cf Access pod from another Raku file). But my guess is you know all about such things.
Again, this might not be hidden enough, and there are constraints:
- The data can only be literal scalars of type Str, Int, Num, or Bool, or aggregations of them in Arrays or Hashes.
- Data can't have actual newlines in it. (You could presumably have double-quoted strings with \ns in them.)
Modifying Rakudo
As I understand it, presuming RakuAST lands, it'll be relatively easy to write Rakudo plugins that can do arbitrary work with a Raku module. And it seems like a short hop from RakuAST macros to basic is parsed macros, which in turn seem like a short hop from extracting source code (e.g. the source of closures) as it goes through the compiler and then spitting it back out into the compiled code as data, possibly attached to Pod declarator blocks that are in turn attached to the code.
So, perhaps just wait a year or two to see if RakuAST lands and gets the hooks you need to do what you need to do via Rakudo?

Decoding Protobuf encoded data using non-supported platform

I am new to Protobufs; I haven't had much exposure to them. One of the API endpoints we require data from, uses Protobuf encoded data. This generally wouldn't be an issue if I was using a 'supported' language such as JavaScript, Java, Python or even R to decode the data...
Unfortunately, I am trying to automate the process using Alteryx. Rather than this being an Alteryx-specific question, I have a few questions about Protobufs themselves so I can understand this situation better. I've read through the implementations of Protobufs in Java and Python, and have a basic understanding of how to use them.
To summarise (please correct me if I am wrong): a Protobuf is a method of serializing structured data, where a .proto schema is used to encode/decode data to and from raw binary. My confusion lies with the compiler. Google's documentation and examples for Python/Java show that a Protobuf compiler (library) is required in order to run the encoding and decoding process. The Google website says that Protobufs are 'language neutral and platform neutral', but I can't see how that is possible if you need the compiler (and the .proto file!) to do the decoding. For example, how would anyone using a language for which Google hasn't created a compiler possibly decode Protobuf-encoded data? Am I missing something?
I figure I'm missing something, since it seems weird that a public API would force this constraint.
"language/platform neutral" here simply means that you can reliably get the same data back from any language/framework/platform. The serialization format is defined independently and does not rely on the nuances of any particular framework.
This might seem a low bar, but you'd be surprised how many serialization formats fail to clear it.
Because the format is specified, anyone can create a tool for some other platform. It is a little fiddly if you're not used to dealing in bits, but: totally doable. The protobuf landscape is not dependent on Google - here's a list of some of the known non-Google tools: https://github.com/protocolbuffers/protobuf/blob/master/docs/third_party.md
Also, note that technically you don't even need a .proto; you just need some mechanism for specifying which fields map to which field numbers (since protobuf doesn't include the names). Quite a few in that list can work either from a .proto, or from the field/number map being specified in some other way. The advantage of .proto is simply that it is easy to convey as the schema - and again: isn't tied to any particular language. You can write plugins for "protoc" to add your own tooling, so you don't need to write your own parser from scratch. Or you can write your own parser from scratch if you prefer.
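For a feel of what "dealing in bits" means here, a minimal sketch in Ruby (not any official library) of the tag/varint layer might look like the following. It recovers field numbers and raw values; mapping those numbers to names and higher-level types is exactly what a .proto (or an equivalent field/number map) provides:
# Minimal sketch of the protobuf wire format (varint and length-delimited
# fields only). Repeated fields simply overwrite each other here; this is
# an illustration, not a complete decoder.
def read_varint(bytes, offset)
  result = 0
  shift = 0
  loop do
    byte = bytes[offset]
    offset += 1
    result |= (byte & 0x7f) << shift
    break if byte < 0x80
    shift += 7
  end
  [result, offset]
end

def decode_fields(data)
  bytes = data.unpack('C*')
  offset = 0
  fields = {}
  while offset < bytes.length
    key, offset = read_varint(bytes, offset)
    field_number = key >> 3
    wire_type = key & 0x07
    case wire_type
    when 0 # varint (ints, bools, enums)
      value, offset = read_varint(bytes, offset)
    when 2 # length-delimited (strings, bytes, nested messages)
      len, offset = read_varint(bytes, offset)
      value = bytes[offset, len].pack('C*')
      offset += len
    else
      raise "wire type #{wire_type} not handled in this sketch"
    end
    fields[field_number] = value
  end
  fields
end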
You can't really speak of a non-supported platform in this case: it is more about languages for which you can't find a protobuf implementation.
My 2 cents: if you can't find a protobuf implementation for your language, pick another language you're familiar with (and that is popular in the protobuf community) and handle the protobuf serialization/deserialization with it. Then call it via a REST API, an executable... whatever.

How to decode a single UTF-8 character and step onto the next using only the Rust standard library?

Does Rust provide a way to decode a single character (unicode-scalar-value to be exact) from a &[u8], which may be multiple bytes, returning a single USV?
Something like GLib's g_utf8_get_char & g_utf8_next_char:
// Example of what glib's functions might look like once ported to Rust.
let mut i = 0;
while i < slice.len() {
let unicode_char = g_utf8_get_char(&slice[i..]);
// do something with the unicode character
function(unicode_char);
// move onto the next.
i += g_utf8_next_char(&slice[i..]);
}
Short of porting parts of the GLib API to Rust, does Rust provide a way to do this, besides some trial & error calls to from_utf8 which stop once the second character is reached?
See GLib's code.
No, there is no such functionality publicly exposed in the Rust standard library as of Rust 1.14.
And neither should there be. Rust doesn't believe in a gigantic standard library. Crates are trivial to use and prevent people from rewriting code. Many people have an incorrect opinion (yeah, that's right: an opinion is incorrect) that using dependencies makes their program weaker.
Anything put in the standard library has to be maintained forever. There are zero plans for a Rust 2.0 that would break backwards compatibility. Python is the usual example here, with a multitude of "get data from a URL" modules in the standard library that are now all redundant and deprecated. The Python maintainers have to waste time keeping those working instead of advancing the language.
Third-party crates allow things to be created, evolve, and die without burdening the entire language.
You can convert a byte slice (&[u8]) into a string slice (&str) by using str::from_utf8 (note that this validates that the whole byte slice is valid UTF-8). You can then use the chars() iterator on the string slice to iterate over each character (char) in the string.

Why do format conversion libraries lack a single method to write the output to a file?

From my experience with Ruby, libraries that parse/convert a format (such as YAML, JSON, XML, SASS, etc.) into objects often have a single method that covers everything from reading the file to parsing it, usually named something like load or load_file. (In addition, they usually have a method that only parses a string read in advance, usually named something like decode or parse.)
On the other hand, when it comes to converting the objects back into the target file format, such libraries rarely have a single method that covers everything from conversion to writing the destination file. Usually they only have a method that does the conversion, named something like encode or render, and the resulting string has to be written to the file with another method such as File.write.
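A minimal Ruby illustration of that asymmetry, using the standard YAML library (the file name is just a placeholder):
require 'yaml'
# Reading: one call covers opening the file and parsing it.
config = YAML.load_file('config.yml')
# Writing: serialisation and file output are two separate steps.
File.write('config.yml', config.to_yaml)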
What is the reason for this asymmetry? Why does writing to a file require an extra step?
I'd guess that it's because of error handling. Reading a file can go wrong in plenty of ways, but writing a file is even more error-prone. It seems silly for a library whose main purpose is parsing to have to deal with file writing. I don't know why these libraries even include file read-and-parse methods.
Also, these kinds of methods become useless as soon as you need access to any of the options of the underlying file reading and writing methods. Then the library has to add an options parameter that gets passed through to the file method, and the code becomes an unclear mess.
That's my 2¢.

Unifying enums across multiple languages

I have one large project with components in multiple languages that each depend on some of the same enum values. What solutions have you come up with to unify enums across multiple arbitrary languages? I can think of a few, but I'm looking for the best solution.
(In my implementation, I'm using PHP, Java, JavaScript, and SQL.)
You can put all of the enums in a text file, then use a code generator to write out the appropriate syntax for each language from that common file so that each component has the enums. Make that text file the authoritative source of information.
You can express the text file in XML but I'd think a tab-delimited flat file would work just fine.
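As a rough sketch of that idea (the file names, layout and output target here are placeholders, not anything prescribed), a small Ruby generator could read the tab-delimited source of truth and emit, for example, a JavaScript enum; PHP, Java or SQL targets would follow the same pattern with different templates:
require 'json'
# Hypothetical authoritative source, enums.txt: tab-separated lines of
# enum name, member name, value, e.g. "OrderStatus\tPENDING\t0".
enums = Hash.new { |h, k| h[k] = {} }
File.readlines('enums.txt', chomp: true).each do |line|
  name, key, value = line.split("\t")
  enums[name][key] = Integer(value)
end
# Emit one frozen JavaScript object per enum.
js = enums.map do |name, values|
  pairs = values.map { |k, v| "  #{k}: #{v}" }.join(",\n")
  "export const #{name} = Object.freeze({\n#{pairs}\n});"
end.join("\n\n")
File.write('enums.js', js)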
Make them in a format that every language can understand or has a library for. I am using JSON for this at the moment.
Then you can include it in two ways:
- For development: load it from a file/URL at runtime
  - good for small changes you want to see immediately
  - slow
- For production use: bake it into the source files
  - using a build script
  - fast
  - but no instant feedback
I would apply the DRY principle and use a code generator; that way you can easily add a new language, even one that has no native enum support.
