How to convert binary data to utf8 string after dart2js - utf-8

I'm storing my data in the localStorage. Because I store a lot of data, I need to compress it. I do it with GZip. Everything works, but I found out that my convertion from List which the result of GZip to Utf8 is extremely slow. On 500k of data, it can take up to 5 min on a fast computer.
The problem appear mainly during the loading function.
Load:
import 'dart:convert';
import 'package:archive/archive.dart';
import 'package:crypto/crypto.dart';
var base64 = CryptoUtils.base64StringToBytes(window.localStorage[storageName]);
var decompress = new GZipDecoder().decodeBytes(base64);
storage = JSON.decode(new Utf8Decoder().convert(decompress));
Save:
var g = JSON.encode(storage);
List<int> bz2 = new GZipEncoder().encode(new Utf8Encoder().convert(g));
window.localStorage[storageName] = CryptoUtils.bytesToBase64(bz2);
Am I doing something wrong? Is there other way of converting a List to a String in Dart? Also, I do both save and load, so there is no need of utf8.
I tried Ascii, but it throws error for unsupported character. I got kind of working latin1, for which a much higher performance, but I still get some unsupported character.

If UTF-8 is really the bottleneck, and you don't need it, you can just use Dart's native String encoding which is UTF-16.
var compressed = new GZipEncoder().encode(str.codeUnits);
var decompressed = new GZipDecoder().decodeBytes(compressed);
var finalStr = new String.fromCharCodes(decompressed);

This is just a hack and is not the best, but it works and it's giving me good performance for now:
String convertListIntToUtf8(List<int> codeUnits) {
var endIndex = codeUnits.length;
StringBuffer buffer = new StringBuffer();
int i = 0;
while (i < endIndex) {
var c = codeUnits[i];
if (c < 0x80) {
buffer.write(new String.fromCharCode(c));
i += 1;
} else if (c < 0xE0) {
buffer.write(new String.fromCharCode((c << 6) + codeUnits[i+1] - 0x3080));
i += 2;
} else if (c < 0xF0) {
buffer.write(new String.fromCharCode((c << 12) + (codeUnits[i+1] << 6) + codeUnits[i+2] - 0xE2080));
i += 3;
} else {
buffer.write(new String.fromCharCode((c << 18) + (codeUnits[i+1] << 12) + (codeUnits[i+2] << 6) + codeUnits[i+3] - 0x3C82080));
i += 4;
}
if (i >= endIndex) break;
}
return buffer.toString();
}

Related

Adding Pharmacode to PyZbar

I have an inspection application in pharma. I need to inspect batch numbers and Pharmacodes on labels using OpenCV. I thought of using PyZbar but it doesn't support Pharmacode. How can I add more codes, like Pharmacode, to PyZbar?
Many thanks
Pharmacode (also known as code32) is implemented as code39 using a radix-32 "compression scheme". So you can use PyZbar, with the understanding that you will get a code39 decode from the library and then have to convert the value from base-32 to base-10.
Note that there is no way of knowing whether a code39 "read" is actually code32 or just plain code39.
This is some javascript that converts a code39 value to its original code32 string:
// code32 radix set (missing the vowels AEIO)
const code32set = '0123456789BCDFGHJKLMNPQRSTUVWXYZ';
function code39toCode32(val) {
if (/[^0-9BCDFGHJKLMNPQRSTUVWXYZ]/.test(val)) {
throw new Error("Not code32");
}
let res = 0;
for (let i = 0; i < val.length; i++) {
res = res * 32 + code32set.indexOf(val[i]);
}
let code32 = '' + res;
if (code32.length < 9) {
code32 = ('000000000' + code32).slice(-9);
}
return code32;
}
That function should translate to python pretty easily.

InDesign Text Modification Script Skips Content

This InDesign Javascript iterates over textStyleRanges and converts text with a few specific appliedFont's and later assigns a new appliedFont:-
var textStyleRanges = [];
for (var j = app.activeDocument.stories.length-1; j >= 0 ; j--)
for (var k = app.activeDocument.stories.item(j).textStyleRanges.length-1; k >= 0; k--)
textStyleRanges.push(app.activeDocument.stories.item(j).textStyleRanges.item(k));
for (var i = textStyleRanges.length-1; i >= 0; i--) {
var myText = textStyleRanges[i];
var converted = C2Unic(myText.contents, myText.appliedFont.fontFamily);
if (myText.contents != converted)
myText.contents = converted;
if (myText.appliedFont.fontFamily == 'Chanakya'
|| myText.appliedFont.fontFamily == 'DevLys 010'
|| myText.appliedFont.fontFamily == 'Walkman-Chanakya-905') {
myText.appliedFont = app.fonts.item("Utsaah");
myText.composer="Adobe World-Ready Paragraph Composer";
}
}
But there are always some ranges where this doesn't happen. I tried iterating in the forward direction OR in the backward direction OR putting the elements in an array before conversion OR updating the appliedFont in the same iteration OR updating it a different one. Some ranges are still not converted completely.
I am doing this to convert the Devanagari text encoded in glyph based non-Unicode encoding to Unicode. Some of this involves repositioning vowel signs etc and changing the code to work with find/replace mechanism may be possible but is a lot of rework.
What is happening?
See also: http://cssdk.s3-website-us-east-1.amazonaws.com/sdk/1.0/docs/WebHelp/app_notes/indesign_text_frames.htm#Finding_and_changing_text
Sample here: https://www.dropbox.com/sh/7y10i6cyx5m5k3c/AAB74PXtavO5_0dD4_6sNn8ka?dl=0
This is untested since I'm not able to test against your document, but try using getElements() like below:
var doc = app.activeDocument;
var stories = doc.stories;
var textStyleRanges = stories.everyItem().textStyleRanges.everyItem().getElements();
for (var i = textStyleRanges.length-1; i >= 0; i--) {
var myText = textStyleRanges[i];
var converted = C2Unic(myText.contents, myText.appliedFont.fontFamily);
if (myText.contents != converted)
myText.contents = converted;
if (myText.appliedFont.fontFamily == 'Chanakya'
|| myText.appliedFont.fontFamily == 'DevLys 010'
|| myText.appliedFont.fontFamily == 'Walkman-Chanakya-905') {
myText.appliedFont = app.fonts.item("Utsaah");
myText.composer="Adobe World-Ready Paragraph Composer";
}
}
A valid approach is to use hyperlink text sources as they stick to the genuine text object. Then you can edit those source texts even if they were actually moved elsewhere in the flow.
//Main routine
var main = function() {
//VARS
var doc = app.properties.activeDocument,
fgp = app.findGrepPreferences.properties,
cgp = app.changeGrepPreferences.properties,
fcgo = app.findChangeGrepOptions.properties,
text, str,
found = [], srcs = [], n = 0;
//Exit if no documents
if ( !doc ) return;
app.findChangeGrepOptions = app.findGrepPreferences = app.changeGrepPreferences = null;
//Settings props
app.findChangeGrepOptions.properties = {
includeHiddenLayers:true,
includeLockedLayersForFind:true,
includeLockedStoriesForFind:true,
includeMasterPages:true,
}
app.findGrepPreferences.properties = {
findWhat:"\\w",
}
//Finding text instances
found = doc.findGrep();
n = found.length;
//Looping through instances and adding hyperlink text sources
//That's all we do at this stage
while ( n-- ) {
srcs.push ( doc.hyperlinkTextSources.add(found[n] ) );
}
//Then we edit the stored hyperlinks text sources 's texts objects contents
n = srcs.length;
while ( n-- ) {
text = srcs[n].sourceText;
str = text.contents;
text.contents = str+str+str+str;
}
//Eventually we remove the added hyperlinks text sources
n = srcs.length;
while ( n-- ) srcs[n].remove();
//And reset initial properties
app.findGrepPreferences.properties = fgp;
app.changeGrepPreferences.properties = cgp;
app.findChangeGrepOptions.properties =fcgo;
}
//Running script in a easily cancelable mode
var u;
app.doScript ( "main()",u,u,UndoModes.ENTIRE_SCRIPT, "The Script" );

ldap searchtype substring not working when value is an integer is less than one thousand

I search and users from active directory. My code is below:
List<DirectoryEntry> dirEntries = ActiveDirectoryActions.getListByQuery("(&(objectClass=user)(displayName~=*" + q + "*))");
for (int i = 0; i < dirEntries.Count; i++)
{
SiteSearchResult r = new SiteSearchResult();
r.title = dirEntries[i].Properties["displayName"].Value.ToString();
r.url = "/" + lang + "/directory/user/" + dirEntries[i].Properties["sAMAccountName"].Value.ToString();
r.content = dirEntries[i].Properties["title"].Value.ToString();
result.Add(r);
}
And it is getListByQuery() function
public static List<DirectoryEntry> getListByQuery(string q)
{
DirectorySearcher drSearch = new DirectorySearcher(rootEntry);
drSearch.Filter = "(distinguishedName=" + Config.xml().Root.Elements("active_directory").Elements("root_ou").Select(x => x.Value).FirstOrDefault().ToString() + ")";
DirectoryEntry searchRoot = drSearch.FindAll()[0].GetDirectoryEntry();
drSearch.SearchRoot = searchRoot;
drSearch.Filter = q;
List<DirectoryEntry> r = new List<DirectoryEntry>();
SearchResultCollection sr = drSearch.FindAll();
for (int i = 0; i < sr.Count; i++)
{
r.Add(sr[i].GetDirectoryEntry());
}
return r;
}
Everthing is ok on my local server. But gives error on global server when I search integer value. And that is interesting when the value less than 1000 (<1000) .
[NullReferenceException: Object reference not set to an instance of an
object.] Myproject.Controllers.SearchController.Index(String
lang, String q) in
D:\dotNET\Myproject\Myproject\Controllers\SearchController.cs:60
Help please.
I think there are few issues you need to check
1) Does it make sense to call DirectoryEntry.FindAll() two times? Instead of using DirectorySearcher(rootEntry) you could try to set
string ldapPath = "LDAP://" + Config.xml().Root.Elements("active_directory").Elements("root_ou").Select(x => x.Value).FirstOrDefault().ToString();
DirectoryEntry de = new DirectoryEntry(ldapPath);
DirectorySearcher desearch = new DirectorySearcher(de);
deSearch.Filter = ...
2) Approximate search ~= might be not compatible with a substring search (=*substring*), e.g. for me it does not work. So try to change to =*1000* (without ~)
3) FindAll() in both cases could return null, so you should check for null in both cases.
4) [MSDN]
Due to implementation restrictions, the SearchResultCollection class
cannot release all of its unmanaged resources when it is garbage
collected. To prevent a memory leak, you must call the Dispose method
when the SearchResultCollection object is no longer needed.
So you either need to call sr.Dispose or use a using Statement.

Audio Unit and Writing to file

I'm creating real-time audio sequencer app on OS X.
Real-time synth part is implemented by using AURenderCallback.
Now I'm making function to write rendered result to Wave File (44100Hz 16bit Stereo).
Format for render-callback function is 44100Hz 32bit float Stereo interleaved.
I'm using ExtAudioFileWrite to write to file.
But ExtAudioFileWrite function returns error code 1768846202;
I searched 1768846202 but I couldn't get information.
Would you give me some hints?
Thank you.
Here is code.
outFileFormat.mSampleRate = 44100;
outFileFormat.mFormatID = kAudioFormatLinearPCM;
outFileFormat.mFormatFlags =
kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked;
outFileFormat.mBitsPerChannel = 16;
outFileFormat.mChannelsPerFrame = 2;
outFileFormat.mFramesPerPacket = 1;
outFileFormat.mBytesPerFrame =
outFileFormat.mBitsPerChannel / 8 * outFileFormat.mChannelsPerFrame;
outFileFormat.mBytesPerPacket =
outFileFormat.mBytesPerFrame * outFileFormat.mFramesPerPacket;
AudioBufferList *ioList;
ioList = (AudioBufferList*)calloc(1, sizeof(AudioBufferList)
+ 2 * sizeof(AudioBuffer));
ioList->mNumberBuffers = 2;
ioList->mBuffers[0].mNumberChannels = 1;
ioList->mBuffers[0].mDataByteSize = allocByteSize / 2;
ioList->mBuffers[0].mData = ioDataL;
ioList->mBuffers[1].mNumberChannels = 1;
ioList->mBuffers[1].mDataByteSize = allocByteSize / 2;
ioList->mBuffers[1].mData = ioDataR;
...
while (1) {
//Fill buffer by using render callback func.
RenderCallback(self, nil, nil, 0, frames, ioList);
//i want to create one sec file.
if (renderedFrames >= 44100) break;
err = ExtAudioFileWrite(outAudioFileRef, frames , ioList);
if (err != noErr){
NSLog(#"ERROR AT WRITING TO FILE");
goto errorExit;
}
}
Some of the error codes are actually four character strings. The Core Audio book provides a nice function to handle errors.
static void CheckError(OSStatus error, const char *operation)
{
if (error == noErr) return;
char str[20];
// see if it appears to be a 4-char-code
*(UInt32 *)(str + 1) = CFSwapInt32HostToBig(error);
if (isprint(str[1]) && isprint(str[2]) && isprint(str[3]) && isprint(str[4])) {
str[0] = str[5] = '\'';
str[6] = '\0';
} else
// no, format it as an integer
sprintf(str, "%d", (int)error);
fprintf(stderr, "Error: %s (%s)\n", operation, str);
exit(1);
}
Use it like this:
CheckError(ExtAudioFileSetProperty(outputFile,
kExtAudioFileProperty_CodecManufacturer,
sizeof(codec),
&codec), "Setting codec.");
Before you can do any sort of debugging, you probably need to figure out what that error message actually means. Have you tried passing that status code to GetMacOSStatusErrorString() or GetMacOSStatusCommentString()? They aren't documented so well, but they are declared in CoreServices/CarbonCore/Debugging.h.

Lossless compression in small blocks with precomputed dictionary

I have an application where I am reading and writing small blocks of data (a few hundred bytes) hundreds of millions of times. I'd like to generate a compression dictionary based on an example data file and use that dictionary forever as I read and write the small blocks. I'm leaning toward the LZW compression algorithm. The Wikipedia page (http://en.wikipedia.org/wiki/Lempel-Ziv-Welch) lists pseudocode for compression and decompression. It looks fairly straightforward to modify it such that the dictionary creation is a separate block of code. So I have two questions:
Am I on the right track or is there a better way?
Why does the LZW algorithm add to the dictionary during the decompression step? Can I omit that, or would I lose efficiency in my dictionary?
Thanks.
Update: Now I'm thinking the ideal case be to find a library that lets me store the dictionary separate from the compressed data. Does anything like that exist?
Update: I ended up taking the code at http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c and adapting it. I am Chris in the comments on that page. I emailed my mods back to that blog author, but I haven't heard back yet. The compression rates I'm seeing with that code are not at all impressive. Maybe that is due to the 8-bit tree size.
Update: I converted it to 16 bits and the compression is better. It's also much faster than the original code.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace Book.Core
{
public class Huffman16
{
private readonly double log2 = Math.Log(2);
private List<Node> HuffmanTree = new List<Node>();
internal class Node
{
public long Frequency { get; set; }
public byte Uncoded0 { get; set; }
public byte Uncoded1 { get; set; }
public uint Coded { get; set; }
public int CodeLength { get; set; }
public Node Left { get; set; }
public Node Right { get; set; }
public bool IsLeaf
{
get { return Left == null; }
}
public override string ToString()
{
var coded = "00000000" + Convert.ToString(Coded, 2);
return string.Format("Uncoded={0}, Coded={1}, Frequency={2}", (Uncoded1 << 8) | Uncoded0, coded.Substring(coded.Length - CodeLength), Frequency);
}
}
public Huffman16(long[] frequencies)
{
if (frequencies.Length != ushort.MaxValue + 1)
{
throw new ArgumentException("frequencies.Length must equal " + ushort.MaxValue + 1);
}
BuildTree(frequencies);
EncodeTree(HuffmanTree[HuffmanTree.Count - 1], 0, 0);
}
public static long[] GetFrequencies(byte[] sampleData, bool safe)
{
if (sampleData.Length % 2 != 0)
{
throw new ArgumentException("sampleData.Length must be a multiple of 2.");
}
var histogram = new long[ushort.MaxValue + 1];
if (safe)
{
for (int i = 0; i <= ushort.MaxValue; i++)
{
histogram[i] = 1;
}
}
for (int i = 0; i < sampleData.Length; i += 2)
{
histogram[(sampleData[i] << 8) | sampleData[i + 1]] += 1000;
}
return histogram;
}
public byte[] Encode(byte[] plainData)
{
if (plainData.Length % 2 != 0)
{
throw new ArgumentException("plainData.Length must be a multiple of 2.");
}
Int64 iBuffer = 0;
int iBufferCount = 0;
using (MemoryStream msEncodedOutput = new MemoryStream())
{
//Write Final Output Size 1st
msEncodedOutput.Write(BitConverter.GetBytes(plainData.Length), 0, 4);
//Begin Writing Encoded Data Stream
iBuffer = 0;
iBufferCount = 0;
for (int i = 0; i < plainData.Length; i += 2)
{
Node FoundLeaf = HuffmanTree[(plainData[i] << 8) | plainData[i + 1]];
//How many bits are we adding?
iBufferCount += FoundLeaf.CodeLength;
//Shift the buffer
iBuffer = (iBuffer << FoundLeaf.CodeLength) | FoundLeaf.Coded;
//Are there at least 8 bits in the buffer?
while (iBufferCount > 7)
{
//Write to output
int iBufferOutput = (int)(iBuffer >> (iBufferCount - 8));
msEncodedOutput.WriteByte((byte)iBufferOutput);
iBufferCount = iBufferCount - 8;
iBufferOutput <<= iBufferCount;
iBuffer ^= iBufferOutput;
}
}
//Write remaining bits in buffer
if (iBufferCount > 0)
{
iBuffer = iBuffer << (8 - iBufferCount);
msEncodedOutput.WriteByte((byte)iBuffer);
}
return msEncodedOutput.ToArray();
}
}
public byte[] Decode(byte[] bInput)
{
long iInputBuffer = 0;
int iBytesWritten = 0;
//Establish Output Buffer to write unencoded data to
byte[] bDecodedOutput = new byte[BitConverter.ToInt32(bInput, 0)];
var current = HuffmanTree[HuffmanTree.Count - 1];
//Begin Looping through Input and Decoding
iInputBuffer = 0;
for (int i = 4; i < bInput.Length; i++)
{
iInputBuffer = bInput[i];
for (int bit = 0; bit < 8; bit++)
{
if ((iInputBuffer & 128) == 0)
{
current = current.Left;
}
else
{
current = current.Right;
}
if (current.IsLeaf)
{
bDecodedOutput[iBytesWritten++] = current.Uncoded1;
bDecodedOutput[iBytesWritten++] = current.Uncoded0;
if (iBytesWritten == bDecodedOutput.Length)
{
return bDecodedOutput;
}
current = HuffmanTree[HuffmanTree.Count - 1];
}
iInputBuffer <<= 1;
}
}
throw new Exception();
}
private static void EncodeTree(Node node, int depth, uint value)
{
if (node != null)
{
if (node.IsLeaf)
{
node.CodeLength = depth;
node.Coded = value;
}
else
{
depth++;
value <<= 1;
EncodeTree(node.Left, depth, value);
EncodeTree(node.Right, depth, value | 1);
}
}
}
private void BuildTree(long[] frequencies)
{
var tiny = 0.1 / ushort.MaxValue;
var fraction = 0.0;
SortedDictionary<double, Node> trees = new SortedDictionary<double, Node>();
for (int i = 0; i <= ushort.MaxValue; i++)
{
var leaf = new Node()
{
Uncoded1 = (byte)(i >> 8),
Uncoded0 = (byte)(i & 255),
Frequency = frequencies[i]
};
HuffmanTree.Add(leaf);
if (leaf.Frequency > 0)
{
trees.Add(leaf.Frequency + (fraction += tiny), leaf);
}
}
while (trees.Count > 1)
{
var e = trees.GetEnumerator();
e.MoveNext();
var first = e.Current;
e.MoveNext();
var second = e.Current;
//Join smallest two nodes
var NewParent = new Node();
NewParent.Frequency = first.Value.Frequency + second.Value.Frequency;
NewParent.Left = first.Value;
NewParent.Right = second.Value;
HuffmanTree.Add(NewParent);
//Remove the two that just got joined into one
trees.Remove(first.Key);
trees.Remove(second.Key);
trees.Add(NewParent.Frequency + (fraction += tiny), NewParent);
}
}
}
}
Usage examples:
To create the dictionary from sample data:
var freqs = Huffman16.GetFrequencies(File.ReadAllBytes(#"D:\nodes"), true);
To initialize an encoder with a given dictionary:
var huff = new Huffman16(freqs);
And to do some compression:
var encoded = huff.Encode(raw);
And decompression:
var raw = huff.Decode(encoded);
The hard part in my mind is how you build your static dictionary. You don't want to use the LZW dictionary built from your sample data. LZW wastes a bunch of time learning since it can't build the dictionary faster than the decompressor can (a token will only be used the second time it's seen by the compressor so the decompressor can add it to its dictionary the first time its seen). The flip side of this is that it's adding things to the dictionary that may never get used, just in case the string shows up again. (e.g., to have a token for 'stackoverflow' you'll also have entries for 'ac','ko','ve','rf' etc...)
However, looking at the raw token stream from an LZ77 algorithm could work well. You'll only see tokens for strings seen at least twice. You can then build a list of the most common tokens/strings to include in your dictionary.
Once you have a static dictionary, using LZW sans the dictionary update seems like an easy implementation but to get the best compression I'd consider a static Huffman table instead of the traditional 12 bit fixed size token (as George Phillips suggested). An LZW dictionary will burn tokens for all the sub-strings you may never actually encode (e.g, if you can encode 'stackoverflow', there will be tokens for 'st', 'sta', 'stac', 'stack', 'stacko' etc.).
At this point it really isn't LZW - what makes LZW clever is how the decompressor can build the same dictionary the compressor used only seeing the compressed data stream. Something you won't be using. But all LZW implementations have a state where the dictionary is full and is no longer updated, this is how you'd use it with your static dictionary.
LZW adds to the dictionary during decompression to ensure it has the same dictionary state as the compressor. Otherwise the decoding would not function properly.
However, if you were in a state where the dictionary was fixed then, yes, you would not need to add new codes.
Your approach will work reasonably well and it's easy to use existing tools to prototype and measure the results. That is, compress the example file and then the example and test data together. The size of the latter less the former will be the expected compressed size of a block.
LZW is a clever way to build up a dictionary on the fly and gives decent results. But a more thorough analysis of your typical data blocks is likely to generate a more efficient dictionary.
There's also room for improvement in how LZW represents compressed data. For instance, each dictionary reference could be Huffman encoded to a closer to optimal length based on the expected frequency of their use. To be truly optimal the codes should be arithmetic encoded.
I would look at your data to see if there's an obvious reason it's so easy to compress. You might be able to do something much simpler than LZ78. I've done both LZ77 (lookback) and LZ78 (dictionary).
Try running a LZ77 on your data. There's no dictionary with LZ77, so you could use a library without alteration. Deflate is an implementation of LZ77.
Your idea of using a common dictionary is a good one, but it's hard to know whether the files are similar to each other or just self-similar without doing some tests.
The right track is to use an library -- almost every modern language have a compression library. C#, Python, Perl, Java, VB.net, whatever you use.
LZW save some space by depending the dictionary on previous inputs. It have an initial dictionary, and when you decompress something, you add them to the dictionary -- so the dictionary is growing. (I am omitting some details here, but this is the general idea)
You can omit this step by supply the whole (complete) dictionary as the initial one. But this would cost some space.
I find this aproach quite interesting for repeated log entries and something I would like to explore using.
Can you share the compression statistics for using this approach for your use case so I can compare it with other alternatives?
Have you considered having the common dictionary grow over time or is that not a valid option?

Resources