What is the proper way to convert Mathematica expressions losslessly to a string (a string kept in memory, not exported to a file)?
I am looking for a textual representation that
will preserve all information, including keeping special (and possibly atomic) objects, such as SparseArray, Graph, Dispatch, CompiledFunction, etc. intact. E.g. cycling a SparseArray through this representation should keep it sparse (and not convert it to a normal list).
is relatively fast to cycle through (convert back and forth).
Is ToString[expr, FullForm] sufficient for this? What about ToString[expr, InputForm]?
Note 1: This came up while trying to work around some bugs in Graph where the internal representation gets corrupted occasionally. But I'm interested in an answer to the general question above.
Note 2: Save will surely do this, but it writes to files (this could probably be worked around using streams), and it only writes definitions associated with symbols.
If you are not going to perform any string manipulation on the resulting string, you may consider Compress and Uncompress as an alternative to ToString. While I don't know of cases where a ToString[expr, InputForm] / ToExpression cycle would break, I can easily imagine that they exist. The Compress solution seems more robust, as Uncompress invoked on a Compress-ed string is guaranteed to reconstruct the original expression. An added advantage of Compress is that it is pretty memory-efficient: I have used it a few times to save large amounts of numerical data in the notebook, without saving them to disk.
Should Compress exhibit round-tripping problems, ExportString and ImportString might present a useful alternative -- particularly, if they are used in conjunction with the Mathematica-native MX format:
string = ExportString[originalExpr, "MX"]
recoveredExpr = ImportString[string, "MX"]
Note that the MX format is not generally transferable between Mathematica instances, but that might not matter for the described in-memory application.
ExpressionML is another Mathematica-related export format, but it is distinctly not a compact format.
It seems easy to parallelize parsers for large amounts of input data that is already given in a split format, e.g. a large list of individual database entries, or is easy to split by a fast preprocessing step, e.g. parsing the grammatical structure of sentences in large texts.
A bit harder seems to be parallel parsing that already requires quite some effort to locate sub-structures in a given input. Common programming language code looks like a good example. In languages like Haskell, that use layout/indentation for separating individual definitions, you could probably check the number of leading spaces of each line after you've found the start of a new definition, skip all lines until you find another definition and pass each skipped chunk to another thread for full parsing.
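The layout-based splitting described above can be sketched as follows. This is a minimal sketch in Python, not a real Haskell front end: it assumes that a definition starts on a non-blank line with no leading whitespace and that every indented or blank line belongs to the definition above it. `split_top_level` and `parse_chunk` are hypothetical names, and `parse_chunk` is only a stand-in for a full parser.

```python
from concurrent.futures import ThreadPoolExecutor


def split_top_level(source: str) -> list[str]:
    """Split layout-sensitive source into one chunk per top-level definition.

    Assumption: a line that starts in column 0 opens a new definition;
    indented and blank lines are attached to the preceding definition.
    """
    chunks: list[list[str]] = []
    for line in source.splitlines():
        starts_definition = bool(line) and not line[0].isspace()
        if starts_definition or not chunks:
            chunks.append([line])
        else:
            chunks[-1].append(line)
    return ["\n".join(chunk) for chunk in chunks]


def parse_chunk(chunk: str) -> tuple[str, int]:
    # Hypothetical stand-in for a real parser:
    # returns the first word of the chunk and its number of lines.
    return chunk.split()[0], chunk.count("\n") + 1


source = """\
double x =
    x + x

main = do
    print (double 2)
    print (double 3)
"""

chunks = split_top_level(source)
# Each chunk is handed to a worker independently of the others.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(parse_chunk, chunks))
```

In CPython a thread pool would not speed up CPU-bound parsing; a process pool or a language with real threads would be needed for that. The point here is only that the cheap pre-pass yields independent chunks.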
When it comes to languages like C, JavaScript etc., that use balanced braces to define scopes, the amount of work for doing the preprocessing would be much higher. You'd need to go through the whole input, thereby counting braces, taking care of text inside string literals and so on. Even worse with languages like XML, where you also need to keep track of tag names in the opening/closing tags.
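The brace-counting pre-pass could look roughly like this. It is a sketch under simplifying assumptions: only double-quoted string literals with backslash escapes are recognised, and comments and character literals are ignored. The function name `split_top_level_blocks` is made up for illustration.

```python
def split_top_level_blocks(source: str) -> list[str]:
    """Return the text of every top-level {...} block in source.

    Assumptions: only double-quoted strings with backslash escapes are
    skipped; comments and character literals are not handled.
    """
    blocks = []
    depth = 0
    start = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(source):
        if in_string:
            if escaped:
                escaped = False      # this character was escaped, skip it
            elif ch == "\\":
                escaped = True       # next character is escaped
            elif ch == '"':
                in_string = False    # closing quote
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i            # a new top-level block begins here
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                blocks.append(source[start:i + 1])
    return blocks


code = 'int f() { return g("}"); } int h() { if (x) { y(); } }'
blocks = split_top_level_blocks(code)
```

Note that the whole input still has to be scanned sequentially once; only the full parse of each returned block can then be distributed.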
I found a parallel version of the CYK parsing algorithm that seems to work for all context-free grammars. But I'm curious what other general concepts/algorithms exist that make it possible to parallelize parsers, including such things as the brace counting described above, which would only work for a limited set of languages. This question is not about specific implementations but about the ideas such implementations are based on.
I think you will find McKeeman's 1982 paper on Parallel LR Parsing quite interesting, as it appears to be practical and applies to a broad class of grammars.
The basic scheme is standard LR parsing. What is clever is that the (presumably long) input is divided into roughly N equal-sized chunks (for N processors), and each chunk is parsed separately. Because the starting point for a chunk may (must!) be in the middle of some production, McKeeman's individual parsers, unlike classic LR parsers, start with all possible left contexts (requiring that the LR state machine be augmented) to determine which LR items apply to the chunk. (It shouldn't take very many tokens before an individual parser has determined which states really apply, so this isn't very inefficient.) Then the results of all the parsers are stitched together.
He sort of ducks the problem of partitioning the input in the middle of a token. (You can imagine an arbitrarily big string literal containing text that looks like code, to fool the parser that starts in the middle.) What appears to happen is that the parser runs into an error and abandons its parse; the parser to its left takes up the slack. One can imagine the chunk splitter using a little bit of smarts to mostly avoid this.
He goes on to demonstrate a real parser in which speedups are obtained.
Clever, indeed.
There is a lot of information about binary diff algorithms for pretty big files (1MB+). However, my use case is different. This is why it is not a duplicate.
I have a collection of many objects, each in the 5-100 byte range. I want to send updates on those objects over the network. I want to compile updates into individual TCP packets (with a reasonable MTU of ~1400). I want to fit as many updates into each packet as possible: first add their IDs, and then put the binary diff of all the binary objects, combined.
What is the best binary diff algorithm to be used for such a purpose?
With such small objects, you could use the classic longest-common-subsequence algorithm to create a 'diff'.
That is not the best way to approach your problem, however. The LCS algorithm is constrained by the requirement to use each original byte only once when matching to the target sequence. You probably don't really need that constraint to encode your compressed packets, so it will result in a sub-optimal solution in addition to being kinda slow.
Your goal is really to use the example of the original objects to compress the new objects, and you should think about it in those terms.
There are many, many ways, but you probably have some idea about how you want to encode those new objects. Probably you are thinking of replacing sections of the new objects with references to sections of the original objects.
In that case, it would be practical to make a suffix array for each original object (https://en.wikipedia.org/wiki/Suffix_array). Then when you're encoding the corresponding new object, at each byte you can use the suffix array to find the longest matching segment of the old object. If it's long enough to result in a savings, then you can replace the corresponding bytes with a reference.
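A minimal sketch of that idea in Python, under a few assumptions that are mine and not part of the answer: the suffix array is built naively (acceptable for 5-100 byte objects), a reference is a plain `(offset, length)` tuple, and `MIN_MATCH = 4` is a guessed break-even length that would depend on how references are actually serialised.

```python
def build_suffix_array(data: bytes) -> list[int]:
    # Naive construction by sorting all suffixes; fine for tiny objects.
    return sorted(range(len(data)), key=lambda i: data[i:])


def common_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def longest_match(old: bytes, sa: list[int], target: bytes) -> tuple[int, int]:
    """Return (offset in old, length) of the longest prefix of target in old."""
    # Binary search for where target would be inserted among the suffixes.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if old[sa[mid]:] < target:
            lo = mid + 1
        else:
            hi = mid
    # The longest match is one of the two suffixes next to that position.
    best = (0, 0)
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            length = common_prefix_len(old[sa[idx]:], target)
            if length > best[1]:
                best = (sa[idx], length)
    return best


MIN_MATCH = 4  # assumed break-even length for a reference


def encode(old: bytes, new: bytes) -> list:
    """Encode new as a list of literal byte runs and (offset, length) references."""
    sa = build_suffix_array(old)
    ops, literal, i = [], bytearray(), 0
    while i < len(new):
        offset, length = longest_match(old, sa, new[i:])
        if length >= MIN_MATCH:
            if literal:
                ops.append(bytes(literal))
                literal = bytearray()
            ops.append((offset, length))
            i += length
        else:
            literal.append(new[i])
            i += 1
    if literal:
        ops.append(bytes(literal))
    return ops


def decode(old: bytes, ops: list) -> bytes:
    out = bytearray()
    for op in ops:
        if isinstance(op, tuple):
            offset, length = op
            out += old[offset:offset + length]
        else:
            out += op
    return bytes(out)


old = b"player:42;hp=100;pos=10,20"
new = b"player:42;hp=095;pos=10,25"
ops = encode(old, new)
restored = decode(old, ops)
```

Unlike LCS, nothing stops two references from pointing at the same bytes of the old object, which is exactly the constraint the answer suggests dropping.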
The simple answer is to combine your many small objects into a single large one, and then use the existing binary diff algorithms to efficiently send the diff on that combined object.
But there is no need to roll your own. I would personally solve this problem by putting all of the objects into a filesystem, and then sending the diff using rsync.
This is theoretically not optimal. However consider the following:
The implementation is simple.
The code is battle tested - it will take a long time for your code to meet a similar level of reliability.
This implementation covers edge cases where the receiving side's state is not what the sender expects, as could happen if, say, the receiver had a crash/restart.
There are cases where this is the wrong solution. But I would insist on having this straightforward solution fail before being clever and sophisticated about it.
I am looking for an algorithm to compress small ASCII strings. They contain lots of letters but they also can contain numbers and rarely special characters. They will be small, about 50-100 bytes average, 250 max.
Examples:
Android show EditText.setError() above the EditText and not below it
ImageView CENTER_CROP dont work
Prevent an app to show on recent application list on android kitkat 4.4.2
Image can't save validable in android
Android 4.4 SMS - Not receiving sentIntents
Imported android-map-extensions version 2.0 now my R.java file is missing
GCM registering but not receiving messages on pre 4.0.4. devices
I want to compress the titles one by one, not many titles together and I don't care much about CPU and memory usage.
You can use Huffman coding with a shared Huffman tree among all texts you want to compress.
While you typically construct a Huffman tree for each string to be compressed separately, this would require a lot of overhead in storage which should be avoided here. That's also the major problem when using a standard compression scheme for your case: most of them have some overhead which kills your compression efficiency for very short strings. Some of them don't have a (big) overhead but those are typically less efficient in general.
When constructing a Huffman tree which is later used for compression and decompression, you typically use the texts which will be compressed to decide which character is encoded with which bits. Since in your case the texts to be compressed seem to be unknown in advance, you need to have some "pseudo" texts to build the tree, maybe from a dictionary of the human language or some experience of previous user data.
Then construct the Huffman tree and store it once in your application; either hardcode it into the binary or provide it in the form of a file. Then you can compress and decompress any texts using this tree. Whenever you decide to change the tree since you gain better experience on which texts are compressed, the compressed string representation also changes. It might be a good idea to introduce versioning and store the tree version together with each string you compress.
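As a rough illustration of the shared-tree approach, here is a sketch in Python. The training text is simply the example titles from the question, the bit string is kept as a Python `str` of "0"/"1" for readability, and `TREE_VERSION`, `build_codes`, `encode` and `decode` are names invented for this example. A real table would also need an escape code for characters that never appeared in the training text; that is left out here.

```python
import heapq
from collections import Counter

TREE_VERSION = 1  # hypothetical version tag to store next to each compressed string


def build_codes(sample_text: str) -> dict[str, str]:
    """Build a Huffman code table once, from representative sample text."""
    freq = Counter(sample_text)
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every code in them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text: str, codes: dict[str, str]) -> str:
    return "".join(codes[ch] for ch in text)


def decode(bits: str, codes: dict[str, str]) -> str:
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:   # prefix-free codes: first hit is the symbol
            out.append(reverse[current])
            current = ""
    return "".join(out)


training = "\n".join([
    "Android show EditText.setError() above the EditText and not below it",
    "ImageView CENTER_CROP dont work",
    "Prevent an app to show on recent application list on android kitkat 4.4.2",
    "Image can't save validable in android",
    "Android 4.4 SMS - Not receiving sentIntents",
    "Imported android-map-extensions version 2.0 now my R.java file is missing",
    "GCM registering but not receiving messages on pre 4.0.4. devices",
])
codes = build_codes(training)

title = "Android SMS not receiving messages"
bits = encode(title, codes)
restored = decode(bits, codes)
```

The table is built once and shipped with the application, so each compressed title carries no per-string header apart from the version tag.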
Another improvement you might think about is to use multi-character Huffman encoding. Instead of compressing the texts character by character, you could find frequent syllables or words and put them into the tree too; then they require even fewer bits in the compressed string. This requires a slightly more complicated compression algorithm, but it might be well worth the effort.
To process a string of bits in the compression and decompression routine in C++(*), I recommend either boost::dynamic_bitset or std::vector<bool>. Both internally pack multiple bits into bytes.
(*)The question once had the c++ tag, so OP obviously wanted to implement it in C++. But as the general problem is not specific to a programming language, the tag was removed. But I still kept the C++-specific part of the answer.
I'm working on an online editor for a datatype that consists of nested lists of strings. Traffic can get unbearable if I transfer the entire structure every time a single value is changed. So, in order to reduce traffic, I've thought of applying a diff tool. The problem is: how do I find and report the diff of two trees? For example:
["ah","bh",["ha","he",["li","no","pz"],"ka",["kat","xe"]],"po","xi"] ->
["ah","bh",["ha","he",["li","no","pz"],"ka",["rag","xe"]],"po","xi"]
There, the only change is "kat" -> "rag" deep down on the tree. Most of the diff tools around work for flat lists, files, etc, but not trees. I couldn't find any literature on that specific problem. What is the minimal way to report such change, and what is an efficient algorithm to find it out?
XML is a tree-like data structure in common use, often used to describe structured documents or other hierarchical objects whose changes over time need to be monitored. So it should be unsurprising that most of the recent work in tree diffing has been in the context of XML.
Here's a 2006 survey with a lot of possibly useful links: Change Detection in XML Trees
One of the more interesting links from the above, which was accompanied by an open source implementation called TreePatch, but now seems to be defunct: Kyriakos Komvoteas' thesis
Another survey article, by Daniel Ehrenberg, with a bunch more references. (That one comes from a question on http://cstheory.stackexchange.com)
Good luck.
Finding the difference between two trees looks kind of like searching in the tree. The only difference is that you know you will have to get to the bottom of both of them.
You could search through both trees simultaneously, and when you hit a difference, change one into the other (if that is your goal: to end up with identical trees, without sending one over every time).
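The simultaneous walk could be sketched like this in Python for the nested lists of strings from the question. The sketch assumes both trees have the same shape; where two lists differ in length it falls back to replacing the whole sublist rather than detecting insertions or deletions. `diff_trees` and `apply_patch` are illustrative names.

```python
def diff_trees(old, new, path=()):
    """Yield (path, new_value) for every position where the trees differ.

    A path is a tuple of list indices leading from the root to the change.
    """
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        for i, (a, b) in enumerate(zip(old, new)):
            yield from diff_trees(a, b, path + (i,))
    elif old != new:
        # Leaves differ, or the shapes differ: report the whole new subtree.
        yield path, new


def apply_patch(tree, changes):
    """Apply changes produced by diff_trees and return the patched tree."""
    for path, value in changes:
        if not path:
            tree = value          # the root itself was replaced
            continue
        node = tree
        for i in path[:-1]:
            node = node[i]
        node[path[-1]] = value
    return tree


old = ["ah", "bh", ["ha", "he", ["li", "no", "pz"], "ka", ["kat", "xe"]], "po", "xi"]
new = ["ah", "bh", ["ha", "he", ["li", "no", "pz"], "ka", ["rag", "xe"]], "po", "xi"]

changes = list(diff_trees(old, new))
patched = apply_patch(old, changes)
```

For the example from the question, the only thing that has to be sent is the path `(2, 4, 0)` together with the new value `"rag"`.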
Some links that I've found on diff'ing 2 trees:
How can i diff two trees to determine parental changes?
Detect differences between tree structures
Diff algorithms
Hope that those links will be useful to you. :)
You can use any general diff algorithm; it is not a problem to find a ready-to-use library.
If you can use the ZLIB library, I can suggest another solution. With a trick it is possible to use this library to send a highly compressed difference between any two binaries; let's call them A and B (and the difference Bc).
Side 1:
Init ZLIB stream
Compress A->Ac with Z_SYNC_FLUSH (we don't need the result, so Ac can be freed)
Compress B->Bc with Z_SYNC_FLUSH
Deinit ZLIB stream
We compress block A first with a special flag which forces ZLib to process and output all data. But it doesn't reset the compression state! When we compress block B, the compressor already knows the subsequences of A and will compress block B very efficiently (if they have a lot in common). Bc is the only data to send.
Side 2:
Init ZLIB stream
Compress A->Ac with Z_SYNC_FLUSH
Deinit ZLIB stream
We need to decompress exactly the same blocks as we compressed. That is why we need Ac.
Init ZLIB stream again
Decompress Ac->A with Z_SYNC_FLUSH
Decompress Bc->B with Z_SYNC_FLUSH
Deinit ZLIB stream
Now we can decompress Ac->A (we have to, because we did it on the other side and it lets the decompressor learn all the subsequences of block A) and finally Bc->B.
It is a somewhat unusual and tricky use of ZLib, but Bc in this case is not just a compressed block B; it is actually the compressed difference between blocks A and B. It will be very efficient if the size of the ZLIB dictionary (the sliding window, at most 32 KB) is comparable with the size of block A. For huge blocks of data it will not be so efficient.
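The same sequence of steps can be tried out with Python's standard `zlib` module, which exposes `Z_SYNC_FLUSH` on its compression objects. This is a sketch of the trick described above, assuming both sides use the same default zlib settings; `make_delta` and `apply_delta` are names chosen for this example.

```python
import zlib


def make_delta(a: bytes, b: bytes) -> bytes:
    """Side 1: compress A, then B, in one stream; only Bc is returned."""
    comp = zlib.compressobj()
    comp.compress(a)
    comp.flush(zlib.Z_SYNC_FLUSH)   # Ac is produced here and thrown away
    return comp.compress(b) + comp.flush(zlib.Z_SYNC_FLUSH)


def apply_delta(a: bytes, bc: bytes) -> bytes:
    """Side 2: rebuild Ac from the local copy of A, then decompress Ac and Bc."""
    comp = zlib.compressobj()
    ac = comp.compress(a) + comp.flush(zlib.Z_SYNC_FLUSH)
    decomp = zlib.decompressobj()
    decomp.decompress(ac)           # primes the decompressor with block A
    return decomp.decompress(bc)


a = b'["ah","bh",["ha","he",["li","no","pz"],"ka",["kat","xe"]],"po","xi"]'
b = b'["ah","bh",["ha","he",["li","no","pz"],"ka",["rag","xe"]],"po","xi"]'

bc = make_delta(a, b)
restored = apply_delta(a, bc)
plain = zlib.compress(b)            # B compressed on its own, for comparison
```

Comparing `len(bc)` with `len(plain)` on your own data shows how much the shared history of block A saves.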
I need to compress some text data of the form
[70,165,531,0|70,166,562|"hi",167,578|70,171,593|71,179,593|73,188,609|"a",1,3|
The data contains a few thousand characters(10000 - 50000 approx).
I read upon the various compression algorithms, but cannot decide which one to use here.
The important thing here is: the compressed string should contain only alphanumeric characters (or a few special characters like +-/&%#$..). I mean, most algorithms produce gibberish ASCII characters as compressed data, right? That must be avoided.
Can someone guide me on how to proceed here?
P.S The text contains numbers , ' and the | character predominantly. Other characters occur very very rarely.
Actually, your requirement to limit the output character set to printable characters automatically costs you 25% of your compression gain, since out of the 8 bits per byte you'll end up using roughly 6.
But if that's what you really want, you can always encode the output with base64, or the more space-efficient base85, to convert the raw byte stream into printable characters.
Regarding the compression algorithm itself, stick to one of the better known ones like gzip or bzip2, for both well tested open source code exists.
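A minimal sketch of the two-step approach using only Python's standard library: compress with zlib (the algorithm behind gzip), then make the result printable with base64. The sample data is just the record pattern from the question repeated, which is far more regular than real input, so the sizes it produces are not representative. `pack` and `unpack` are illustrative names.

```python
import base64
import zlib


def pack(text: str) -> str:
    """Compress text and return it using only the base64 alphabet."""
    raw = zlib.compress(text.encode("utf-8"), 9)
    return base64.b64encode(raw).decode("ascii")


def unpack(packed: str) -> str:
    raw = base64.b64decode(packed)
    return zlib.decompress(raw).decode("utf-8")


record = '[70,165,531,0|70,166,562|"hi",167,578|70,171,593|71,179,593|73,188,609|"a",1,3|'
data = record * 200               # artificial sample, about 16,000 characters

packed = pack(data)
restored = unpack(packed)

# base85 needs 5 characters per 4 bytes instead of base64's 4 per 3,
# but its alphabet contains many more punctuation characters.
packed85 = base64.b85encode(zlib.compress(data.encode("utf-8"), 9)).decode("ascii")
```

Base64 output uses only letters, digits, `+`, `/` and `=`, which matches the character set asked for in the question; whether the base85 alphabet is acceptable depends on where the string has to travel.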
Selecting "the best" algorithm is actually not that easy, here's an excerpt of the list of questions you have to ask yourself:
do I need the best speed on the encoding or the decoding side (e.g. bzip is quite asymmetric)
how important is memory efficiency, both for the encoder and the decoder? This could be important for embedded applications
is the size of the code important? Also relevant for embedded
do I want pre-existing, well-tested code for the encoder, the decoder or both, only in C or also in another language
and so on
The bottom line here is probably, take a representative sample of your data and run some tests with a couple of existing algorithms, and benchmark them on the criteria that are important for your use case.
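Such a test run can be quite short. The sketch below compares three compressors from Python's standard library on size and encoding time; the sample is generated pseudo-randomly in the rough shape of the question's records and is only a placeholder for a representative sample of the real data.

```python
import bz2
import lzma
import random
import time
import zlib


def benchmark(sample: bytes) -> dict[str, tuple[int, float]]:
    """Return {codec name: (compressed size in bytes, seconds to compress)}."""
    codecs = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        out = compress(sample)
        results[name] = (len(out), time.perf_counter() - start)
    return results


# Placeholder data: 2000 records of three comma-separated numbers each.
rng = random.Random(0)
sample = "|".join(
    f"{rng.randint(60, 80)},{rng.randint(1, 200)},{rng.randint(500, 700)}"
    for _ in range(2000)
).encode("ascii")

results = benchmark(sample)
for name, (size, seconds) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(f"{name:5s} {size:6d} bytes  {seconds * 1000:.2f} ms")
```

Decoding time and memory use can be measured the same way if they matter for the use case.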
Just one thought: You can solve your two problems independently. Use whatever algorithm gives you the best compression (just try out a few on your kind of data. bz2, zip, rar -- whatever you like, and check the size), and then to get rid of the "gibberish ascii" (that's actually just bytes there...), you can encode your compressed data with Base64.
If you really put much thought into it, you might find a better algorithm for your specific problem, since you only use a few different chars, but if you stumble upon one, I think it's worth a try.