What does it mean to convert an XML document to binary? - performance

I am having a performance issue with XDocument.Load("large_file.xml"), where it takes about 25 seconds to load the file.
I read in this question that using a binary format could offer up to a 10x performance increase.
What does a binary format look like? How do you go about converting an XML file to it?

Lets start with the implied question:
Q: What is a Binary format?
A: It is a format in which data is represented in a non-textual form. For example, a Java int might be represented as 4 bytes, rather than a sequence of decimal digits and a sign.
Q: What does it look like?
A: If you view it with a text editor / viewer, it looks like garbage.
Q: How do you go about converting an XML file to a binary form?
A: By hand. Since a binary format is essentially a format (any format) that is not text, there is no magical method of converting it.
Q: How and why is a binary format faster?
A: A binary format isn't automatically faster to load than XML (or JSON). The idea is that you (the programmer) design a specific binary format for your application that will be faster to load. You typically do this by such things as:
avoiding the inclusion of verbose / repetitive structuring information (e.g. XML tag and attribute name),
using data encodings that require less CPU effort to turn into the in-memory representations,
avoiding the inclusion of unnecessary metadata,
avoiding things that require extra in-memory data copying,
and so on.

There is lots of information in an XML format. So it's big and slow. You can create your own format.
For example:
<Data>Value</Data> can be changed to just value at a concrete address in a binary file.

Related

Searching through the protocol buffer file

I'm new to protocol buffers and I was wondering whether it was possible to search a protocol buffers binary file and read the data in a structured format. For example if a message in my .proto file has 4 fields I would like to serialize the message and write multiple messages into a file and then search for a particular field in the file. If I find the field I would like to read back the message in the same structured format as it was written. Is this possible with protocol buffers ? If possible any sample code or examples would be very helpful. Thank you
You should treat protobuf library as one serialization protocol, not an all-in-one library which supports complex operations (such as querying, indexing, picking up particular data). Google has various libraries on top of open-sourced portion of protobuf to do so, but they are not released as open source, as they are tied with their unique infrastructure. That being said, what you want is certainly possible, yet you need to write some code.
Anyhow, some of your requirements are:
one file contains various serialized binaries.
search a particular field in each serialized binary and extract that chunk.
There are several ways to achieve them.
The most popular way for serial read/write is that the file contains a series of [size, type, serialization output]. That is, one serialized output is always prefixed by size and type (either 4/8 byte or variable-length) to help reading and parsing. So you just repeat this procedure: 1) read size and type, 2) read binary with given size, 3) parse with given type 4) goto 1). If you use union type or one file shares same type, you may skip type. You cannot drop size, as there is no way know the end of output by itself. If you want random read/write, other type of data structure is necessary.
'search field' in binary file is more tricky. One way is to read/parse output one by one and to check the existance of field by HasField(). It is most obvious and slow yet straightforward way to do so. If you want to search field by number (say, you want to search 'optional string email = 3;'), thus search by binary blob (like 0x1A, field number 3, wire type 2), it is not possible. In a serialized binary stream, field information is saved merely a number. Without an exact context (.proto scheme or binary file's structure), the number alone doesn't mean anything. There is no guarantee that 0x1A is from field information, or field information from other message type, or actually number 26, or part of other number, etc. That is, you need to maintain the information by yourself. You may create another file or database with necessary information to fetch particular message (like the location of serialization output with given field).
Long story short, what you ask is beyond what open-sourced protobuf library itself does, yet you can write them with your requirements.
I hope, this is what you are looking for:
http://temk.github.io/protobuf-utils/
This is a command line utility for searching within protobuf file.

Ruby equivalent of ReadString?

I'm working on a project with a "customer made" database. He developed a C++/CLI application that stores and retrieves his data from a binary file using the BinaryWriter.Write(String) and BinaryReader.ReadString() methods.
I'm no C++/CLI expert but from what I understand these methods use a 7-bits encoding of the first bytes to determine the String length.
I need to access his data from a rail application, anyone's got an idea of how to do the same think in ruby?
If you're dealing with raw binary data, you'll probably need to spend some time familiarizing yourself with the pack and unpack methods and their various options. Maybe what you're describing is a "Pascal string" where the length is encoded up front, or a variation on that.
For example:
length = data.unpack("C")[0]
string = data.unpack("Ca#{length}")[0]
The double-unpack is required because you don't know the length of the string to unpack until you do the first step. You could probably do this using a substring as well, like data[1,length] if you're reasonably certain you're not dealing with UTF-8 data.

how to check a ruby string is an actural string or a blob data such as image

In ruby how to check a string is an actural string or a blob data such as image, from the data type of view they are ruby string, but really their contents are very different since one is literal string, the other is blob data such as image.
Could anyone provide some clue for me? Thank you in advance.
Bytes are bytes. There is no way to declare that something isn't file data. It'd be fairly easy to construct a valid file in many formats consisting only of printable ASCII. Especially when dealing with Unicode, you're in very murky territory. If possible, I'd suggest modifying the method so that it takes two parameters... use one for passing text and the other for binary data.
One thing you might do is look at the length of the string. Most image formats are at least 500-600 bytes even for a tiny image, and while this is by no means an accurate test, if you get passed, say, a 20k string, it's probably an image. If it were text, it would be quite a bit (Like a quarter of a typical novel, or thereabouts)
Files like images or sound files have defined blocks that can be "sniffed". Wotsit.org has a lot of info about the key bytes and ways to determine what the files are. By looking at those byte offsets in your data you could figure it out.
Another way way is to use some "magic", which is code to sniff key-bytes or byte-types in a file to try to figure out what its type is. *nix systems have it built in via the file command. Do a man file or man magic for more info or check Wikipedia's article on Magic numbers in files.
Ruby Filemagic uses the same technique but is based on GNU's libmagic.
What would constitute a string? Are you expecting simple ASCII? UTF-8? Or text encoded some other way?
If you know you're going to get ASCII text or a blob then you can just spin through the first n bytes and see if anything has the eight bit set, that would tell you that you have binary. OTOH, not finding anything wouldn't guarantee that you had text.
If you're going to get UTF-8 Unicode then you'd do the same thing but look for invalid UTF-8 sequences. Of course, the same caveats apply.
You could scan the first n bytes for anything between 0x00 and 0x20. If you find any bytes that low then you probably have a binary blob of some sort. But maybe not.
As Tyler Eaves said: bytes are bytes. You're starting with a bunch of bytes and trying to find an interpretation of them that makes sense.
Your best bet is to make the caller supply the expected interpretation or take Greg's advice and use a magic number library.

How do I extract ASCII data from binary file with unknown format, in Windows?

On Windows, what is the best way to convert a binary file where the internal structure is unknown less that its contents are ASCII in nature back to plain text?
Ideally the conversion would produce a "human"-readable version. I think the file should contain something like the following:
Date: 10 FEB 2010
House: 345 Dogwood Drive
Exterior: Brick
In Linux/Unix:
$ strings < unknown.dat > ascii-from-unknown.txt
This is of course not so much a "conversion" as a straight up extraction, by just filtering out the non-ASCII bytes. It's useful quite often, though.
In general, without more knowledge of the file's internal format, I don't think you can do much better.
Depending on what exactly you want to achieve, a hex dump might fit the bill: It's a pure ASCII format that represents the entire file without any loss of data (but being quite wasteful with space).
It is not really human readable, but since you don't explain why you want to do that, it's the best I can offer.
There are several simple tools that produce a hex dump on Windows.

Is there a standard format for describing a flat file?

Is there a standard or open format which can be used to describe the formating of a flat file. My company integrates many different customer file formats. With an XML file it's easy to get or create an XSD to describe the XML file format. I'm looking for something similar to describe a flat file format (fixed width, delimited etc). Stylus Studio uses a proprietary .conv format to do this. That .conv format can be used at runtime to transform an arbitrary flat file to an XML file. I was just wondering if there was any more open or standards based method for doing the same thing.
I'm looking for one method of describing a variety of flat file formats whether they are fixed width or delimited, so CSV is not an answer to this question.
XFlat:
http://www.infoloom.com/gcaconfs/WEB/philadelphia99/lyons.HTM#N29
http://www.unidex.com/overview.htm
For complex cases (e.g. log files) you may consider a lexical parser.
About selecting existing flat file formats: There is the Comma-separated values (CSV) format. Or, more generally, DSV. But these are not "fixed-width", since there's a delimiter character (such as a comma) that separates individual cells. Note that though CSV is standardized, not everybody adheres to the standard. Also, CSV may be to simple for your purposes, since it doesn't allow a rich document structure.
In that respect, the standardized and only slightly more complex (but thus more useful) formats JSON and YAML are a better choice. Both are supported out of the box by plenty of languages.
Your best bet is to have a look at all languages listed as non-binary in this overview and then determine which works best for you.
About describing flat file formats: This could be very easy or difficult, depending on the format. Though in most cases easier solutions exist, one way that will work in general is to view the file format as a formal grammar, and write a lexer/parser for it. But I admit, that's quite† heavy machinery.
If you're lucky, a couple of advanced regular expressions may do the trick. Most formats will not lend themselves for that however.‡ If you plan on writing a lexer/parser yourself, I can advise PLY (Python Lex-Yacc). But many other solutions exists, in many different languages, a lot of them more convenient than the old-school Lex & Yacc. For more, see What parser generator do you recommend?
  †: Yes, that may be an understatement.
  ‡: Even properly describing the email address format is not trivial.
COBOL (whether you like it or not) has a standard format for describing fixed-width record formats in files.
Other file formats, however, are somewhat simpler to describe. A CSV file, for example, is just a list of strings. Often the first row of a CSV file is the column names -- that's the description.
There are examples of using JSON to formulate metadata for text files. This can be applied to JSON files, CSV files and fixed-format files.
Look at http://www.projectzero.org/sMash/1.1.x/docs/zero.devguide.doc/zero.resource/declaration.html
This is IBM's sMash (Project Zero) using JSON to encode metadata. You can easily apply this to flat files.
At the end of the day, you will probably have to define your own file standard that caters specifically to your storage needs. What I suggest is using xml, YAML or JSON as your internal container for all of the file types you receive. On top of this, you will have to implement some extra validation logic to maintain meta-data such as the column sizes of the fixed width files (for importing from and exporting to fixed width). Alternatively, you can store or link a set of metadata to each file you convert to the internal format.
There may be a standard out there, but it's too hard to create 'one size fits all' solutions for these problems. There are entity relationship management tools out there (Talend, others) that make creating these mappings easier, but you will still need to spend a lot of time maintaining file format definitions and rules.
As for enforcing column width, xml might be the best solution as you can describe the formats using xml schemas (with the length restriction). For YAML or JSON, you may have to write your own logic for this, although I'm sure someone else has come up with a solution.
See XML vs comma delimited text files for further reference.
I don't know if there is any standard or open format to describe a flat file format. But one industry has done this: the banking industry. Financial institutions are indeed communicating using standardized message over a dedicated network called SWIFT. SWIFT messages were originally positional (before SWIFTML, the XMLified version). I don't know if it's a good suggestion as it's kinda obscure but maybe you could look at the SWIFT Formatting Guide, it may gives you some ideas.
Having that said, check out Flatworm, an humble flat file parser. I've used it to parse positional and/or CSV file and liked its XML descriptor format. It may be a better suggestion than SWIFT :)
CSV
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. However, if a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
The CSV entry on wikipedia allowed me to find a comparison of data serialization formats that is pretty much what you asked for.
The only similar thing I know of is Hachoir, which can currently parse 70 file formats:
http://bitbucket.org/haypo/hachoir/wiki/Home
I'm not sure if it really counts as a declarative language, since it's plugin parser based, but it seems to work, and is extensible, which may meet your needs just fine.
As an aside, there are interesting standardised, extensible flat-file FORMATS, such as IFF (Interchange File Format).

Resources