Is there an off-the-shelf binary format that allows string caching?

I am investigating migrating from a highly customized and efficient binary format to one of the available off-the-shelf binary formats. The data is stored on low-powered mobile devices, among other places, so performance is an important requirement.
An advantage of the current format is that all strings are stored in a pool. This means that we don't repeat the same string hundreds of times in the file: we read it only once during deserialization, and all objects reference it by its index. It also means that we keep only one copy in memory. So a lot of advantages :)
I was not able to find a way for Cap'n Proto or FlatBuffers to support this. Or would I need to build a layer on top and explicitly use integer indexes to strings in the generated objects?
Thank you!

FlatBuffers supports string pooling. Simply serialize a string once, then refer to that string multiple times in other objects. The string will only occur in memory once.
Simplest example, schema:
table MyObject { name: string; id: string; }
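code (C++):
#include "flatbuffers/flatbuffers.h"
// Also include the header that flatc generates from the schema above; it defines CreateMyObject.

flatbuffers::FlatBufferBuilder fbb;
// Serialize the string once and keep its offset.
auto s = fbb.CreateString("MyPooledString");
// Both string fields point to the same data:
auto o = CreateMyObject(fbb, s, s);
fbb.Finish(o);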

In Cap'n Proto, you can always do this manually, like:
struct MyMessage {
  stringTable @0 :List(Text);
  # Now encode string fields as integer indexes into the string table.
  someString @1 :UInt32;
  otherString @2 :UInt32;
}
Cap'n Proto could in theory allow multiple pointers to point at the same object, but currently prohibits this for security reasons: it would be too easy to DoS servers that don't expect it by sending messages that are cyclic or contain lots of overlapping references. See the section on amplification attacks in the docs.
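The manual string-table approach can be sketched independently of any serialization library. The following is a minimal C++ sketch, not part of Cap'n Proto or FlatBuffers: StringPool, intern, get and table are hypothetical names introduced here for illustration. It assigns each distinct string an index on first sight, so objects can store a UInt32 index and the table itself is written to the message once.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: interns strings and hands out stable indexes.
class StringPool {
public:
    // Returns the index of `s`, adding it to the table the first time it is seen.
    std::uint32_t intern(const std::string& s) {
        auto it = index_.find(s);
        if (it != index_.end()) return it->second;
        auto id = static_cast<std::uint32_t>(table_.size());
        table_.push_back(s);
        index_.emplace(s, id);
        return id;
    }

    // Looks a string up by index; the index must have come from intern().
    // The returned reference may be invalidated by a later intern() call.
    const std::string& get(std::uint32_t id) const {
        assert(id < table_.size());
        return table_[id];
    }

    // The table to serialize as the message's string list.
    const std::vector<std::string>& table() const { return table_; }

private:
    std::vector<std::string> table_;
    std::unordered_map<std::string, std::uint32_t> index_;
};
```

When writing, each string field would be passed through intern() and the resulting index stored in the message. When reading, the table is loaded once and fields are resolved with get().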

Related

Data structure for storing list of boolean values mapped to string names in vb

What would be a good (in terms of speed, code safety etc.) data structure for storing boolean values mapped to a list of string names in VB? The strings are a list of length 22 with distinct names like "A201" "A202" etc.
<string, True>
<string, False>
<string, True>
<string, False>
I then need to loop through this list. Speed is essential as each iteration needs to be as quick as possible.
I was thinking, alternative to using a data structure, I could place the 22 strings inside an array and then set them to nothing, and loop through them to see which ones are not set to nothing, then process based only on the strings that are not set to nothing. I was just wondering whether a particular data structure would be a better solution and how the solutions would compare?
Thanks for the help.
I think I could remove the string from the original list for the ones that don't need any more work, then pass the updated list (containing the ones that need more work) to the re-loop when the first loop through the list is completed.

What is the point of google.protobuf.StringValue?

I've recently encountered all sorts of wrappers in Google's protobuf package. I'm struggling to imagine the use case. Can anyone shed some light: what problem were these intended to solve?
Here's one of the documentation links: https://developers.google.com/protocol-buffers/docs/reference/csharp/class/google/protobuf/well-known-types/string-value (it says nothing about what this can be used for).
One thing that will be different in behavior between this and a simple string type is that this field will be written less efficiently (a couple of extra bytes, plus a redundant memory allocation). For other wrappers, the story is even worse, since the repeated variants of those fields will be written inefficiently (Google's official Protobuf serializer doesn't support packed encoding for non-numeric types).
Neither seems to be desirable. So, what's this all about?
There are a few reasons, mostly to do with where these are used - see struct.proto.
StringValue can be null, whereas string often can't be in a language interfacing with protobufs. For example, in Go strings are always set; the "zero value" for a string is "", the empty string, so it's impossible to distinguish between "this value is intentionally set to the empty string" and "there was no value present". StringValue can be null and so solves this problem. It's especially important when they're used in a StructValue, which may represent arbitrary JSON: to do so it needs to distinguish between a JSON key which was set to the empty string (StringValue with an empty string) and a JSON key which wasn't set at all (null StringValue).
Also, if you look at struct.proto, you'll see that these aren't fully fledged message types in the proto - they're all generated from message Value, which has a oneof kind { number_value, string_value, bool_value, ... }. By using a oneof, struct.proto can represent a variety of different values in one field. Again, this makes sense considering what struct.proto is designed to handle - arbitrary JSON - since you don't know ahead of time what type of value a given JSON key has.
In addition to George's answer, you can't use a Protobuf primitive as the parameter or return value of a gRPC procedure; both must be message types, so a wrapper message such as StringValue lets a procedure take or return a single string.

Dealing with huge data

Let's assume that I have a big file (500 GB+) and a data record declaration Sample that represents a row in that file:
data Sample = Sample {
  field1 :: Int,
  field2 :: Int
}
Now, what data structure is suitable for processing (filter/map/fold) a collection of these Sample values? Don Stewart has answered here that the collection should not be treated as a list [Sample] but as a Vector. My question is: how does representing it as a Vector solve the problem? Won't representing the file contents as a vector of Sample also occupy around 500 GB?
What is the recommended method for solving these types of problems?
As far as I can see, the operations you want to use (filter, map and fold) can be done via both conduit (see Data.Conduit.List) and pipes (see Pipes.Prelude).
Both libraries are perfectly capable of manipulating/folding and filtering streaming data. Depending on your scenario they might solve your actual problem.
If, however, you need to inspect values several times, you're better off loading chunks into a vector, as @Don said.
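To illustrate why a streaming approach avoids holding the whole file in memory, here is a minimal sketch of a combined filter and fold over records read one at a time. It is written in C++ to match the other examples in this thread rather than with conduit or pipes, and it rests on assumptions not in the question: records are stored as whitespace-separated "field1 field2" pairs, and sumField2Where is a hypothetical name.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>  // std::istringstream, used only in the usage note below

// Mirrors the Sample record from the question.
struct Sample {
    std::int64_t field1;
    std::int64_t field2;
};

// Filter + fold in one pass: sums field2 over records whose field1 exceeds
// `threshold`. Only one Sample is alive at any time, so memory use does not
// grow with the size of the input.
std::int64_t sumField2Where(std::istream& in, std::int64_t threshold) {
    std::int64_t total = 0;
    Sample s{};
    while (in >> s.field1 >> s.field2) {
        if (s.field1 > threshold) {
            total += s.field2;
        }
    }
    return total;
}
```

The same function works on a std::ifstream opened on the large file or on a std::istringstream for testing; the single pass is what keeps memory constant, at the cost of not being able to revisit earlier records.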

What kind of data structure will be best for storing a key-value pair where the value will be a String for some key and a List<String> for some keys?

For example, key 1 will have values "A","B","C" but key 2 will have value "D". If I use
Map<String, List<String>>
I need to populate the List<String> even when I have only a single String value.
What data structure should be used in this case?
Map<String,List<String>> would be the standard way to do it (using a size-1 list when there is only a single item).
You could also have something like Map<String, Object> (which should work in either Java or presumably C#, to name two), where the value is either a List<String> or a String, but this would be fairly bad practice: there are readability issues (you can't tell what Object represents just from seeing the type) and casting happens at runtime, which isn't ideal, among other things.
It does, however, depend on what type of queries you plan to run. Map<String,Set<String>> might be a good idea if you plan on doing existence checks in the collection and it can be large. Set<StringPair> (where StringPair is a class with 2 String members) is another consideration if there are plenty of keys with only 1 mapped value. There are plenty of solutions which would be more appropriate under various circumstances - it basically comes down to looking at the type of queries you want to perform and picking an appropriate structure accordingly.
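The standard map-of-lists shape can be sketched as follows. This is a C++ analogue (to match the other examples in this thread) of Map<String, List<String>>, not Java code; MultiMap and addValue are hypothetical names. A key with a single value simply ends up with a one-element list, so callers never need to special-case it.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Analogue of Map<String, List<String>>.
using MultiMap = std::unordered_map<std::string, std::vector<std::string>>;

// Appends `value` under `key`. operator[] creates an empty list the first
// time a key is seen, so single-valued keys need no separate code path.
void addValue(MultiMap& m, const std::string& key, const std::string& value) {
    m[key].push_back(value);
}
```

Readers iterate over the list for every key in the same way, whether it holds one element or many.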

Is it possible to add an array to Microsoft.Office.Interop.Outlook.UserProperties?

Is it possible to add an array/list of integers to Microsoft.Office.Interop.Outlook.UserProperties, and if so, how? Using the type OlUserPropertyType.olEnumeration throws an exception when the property is added.
There is no array support in the MAPI-supported user properties. You would have to serialize the array to a string - OlUserPropertyType.olText (PT_STRING8) - using some serialized array format (XML, CSV, JSON, etc.).
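The serialize-to-string idea can be sketched as a CSV round trip. This is a minimal C++ sketch to match the other examples in this thread, not Outlook interop code: toCsv and fromCsv are hypothetical helpers, and storing the resulting string in a text user property is left to the caller.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Joins the integers with commas, e.g. {1, 2, 3} becomes "1,2,3".
std::string toCsv(const std::vector<int>& values) {
    std::ostringstream out;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i > 0) out << ',';
        out << values[i];
    }
    return out.str();
}

// Splits on commas and parses each item; an empty string yields an empty list.
// std::stoi throws std::invalid_argument if an item is not a number.
std::vector<int> fromCsv(const std::string& text) {
    std::vector<int> values;
    std::istringstream in(text);
    std::string item;
    while (std::getline(in, item, ',')) {
        values.push_back(std::stoi(item));
    }
    return values;
}
```

CSV is enough for plain integers; XML or JSON would be a better fit if the elements themselves could contain commas.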
