We are developing a set of C++ applications that exchange data through protobuf messages. One of the messages that we want to exchange contains a list of type-value pairs. The type is just an integer; the value can be one of a number of different data types, both basic ones like integer or string and more complex ones like IP addresses or prefixes. But for every specific type, only one data type is allowed for the value.
type | value data type
-----|----------------
  1  | string
  2  | integer
  3  | list<ip_addr>
  4  | integer
  5  | struct
  6  | string
 ... | ...
Note: one of the communicating apps will ultimately encode this list of type-value pairs into a byte array in a network packet according to a fixed protocol format.
There are a few ways to encode this into a protobuf message, but we're currently leaning towards creating a protobuf message for each type number separately:
message Type1
{
  string value = 1;
}

message Type2
{
  int32 value = 1;  // "integer" is not a protobuf scalar type; int32/int64/etc. as appropriate
}

message Type3
{
  repeated IpAddr value = 1;
}
...
message TVPair
{
  oneof type
  {
    Type1 type_1 = 1;
    Type2 type_2 = 2;
    Type3 type_3 = 3;
    ...
  }
}
message Foo
{
  repeated TVPair tv_pairs = 1;
}
This is clear and easy to use for all applications and it hides the details of the network protocol encoding in the only app that actually needs to take care of it.
The only worry I have is that the list of type numbers is on the order of a few hundred items. This means a few hundred protobuf messages need to be defined, and the oneof structure in the TVPair message will contain that many members. I know the field numbers in protobuf messages can go much higher (up to roughly 500,000,000), so that's not really an issue. But are there any downsides to having hundreds of fields in a single protobuf message?
The comment from @DazWilkin pointed me towards some best practices on the Protocol Buffers documentation website:
Don’t Make a Message with Lots of Fields
Don’t make a message with “lots” (think: hundreds) of fields. In C++ every field adds roughly 65 bits to the in-memory object size whether it’s populated or not (8 bytes for the pointer and, if the field is declared as optional, another bit in a bitfield that keeps track of whether the field is set). When your proto grows too large, the generated code may not even compile (for example, in Java there is a hard limit on the size of a method).
Large Data Sets
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
That said, Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are a collection of small pieces, where each small piece is structured data. Even though Protocol Buffers cannot handle the entire set at once, using Protocol Buffers to encode each piece greatly simplifies your problem: now all you need is to handle a set of byte strings rather than a set of structures.
Protocol Buffers do not include any built-in support for large data sets because different situations call for different solutions. Sometimes a simple list of records will do while other times you want something more like a database. Each solution should be developed as a separate library, so that only those who need it need pay the costs.
So although it might be technically possible, it is not advised to create big messages with lots of fields.
I understand that we need to marshal an object to serialise it, and unmarshal it to deserialise it. However, my question is: when do we invoke marshal and unmarshal, and why do we serialise objects if we are going to deserialise them again soon?
PS: I just started to learn Go and Proto, so I would really appreciate your help. Thank you.
Good question!
Marshaling (also called serializing) converts a struct to raw bytes. Usually, you do this when you're sending data to something outside your program: you might be writing to a file or sending the struct in an HTTP request body.
Unmarshaling (also called deserializing) converts raw bytes to a struct. You do this when you're accepting data from outside your program: you might be reading from a file or an HTTP response body.
In both situations, the program sending the data has it in memory as a struct. We have to marshal and unmarshal because the recipient program can't just reach into the sender's memory and read the struct. Why?
Often the two programs are on different computers, so they can't access each other's memory directly.
Even for two processes on the same computer, shared memory is complex and usually dangerous. What happens if you're halfway through overwriting the data when the other program reads it? What if your struct includes sensitive data (like a decrypted password)?
Even if you share memory, the two programs need to interpret the data in the same way. Different languages, and even different versions of Go, represent the same struct differently in memory. For example, do you represent integers with the most-significant bit first or last? Depending on their answers, one program might interpret 001 as the integer 1, while the other might interpret the same memory as the integer 4.
Nowadays, we're usually marshaling to and unmarshaling from some standardized format, like JSON, CSV, or YAML. (Back in the day, this wasn't always true - you'd often convert your struct to and from bytes by hand, using whatever logic made sense to you at the time.) Protobuf schemas make it easier to generate marshaling and unmarshaling code that's CPU- and space-efficient (compared to CSV or JSON), strongly typed, and works consistently across programming languages.
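Here's what that looks like concretely in Go, using the standard library's encoding/json (a minimal sketch with a made-up User struct; with protobuf you'd call proto.Marshal/proto.Unmarshal on a generated type instead, but the flow is identical):

package main

import (
    "encoding/json"
    "fmt"
)

// User is a hypothetical struct we want to send to another program.
type User struct {
    Name string `json:"name"`
    Age  int    `json:"age"`
}

func main() {
    u := User{Name: "Ada", Age: 36}

    // Marshal: struct -> raw bytes, ready to write to a file or an HTTP body.
    data, err := json.Marshal(u)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(data)) // {"name":"Ada","age":36}

    // Unmarshal: raw bytes -> struct, on the receiving side.
    var received User
    if err := json.Unmarshal(data, &received); err != nil {
        panic(err)
    }
    fmt.Println(received.Name, received.Age) // Ada 36
}

The receiver only needs the bytes and the agreed-upon format, never access to the sender's memory, which is the whole point of the round trip.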
Preamble
This is about improving message send efficiency in a JIT compiler. Despite referring to Smalltalk, this question applies to most dynamic JIT-compiled languages.
Problem
A message send site can be classified as monomorphic, polymorphic or megamorphic. If the receiver of the message send is always of the same type, it is a monomorphic send, as in
10 timesRepeat: [Object new].
where the receiver of new is always Object. For this kind of send, JITs emit monomorphic inline caches.
Sometimes a given send site refers to a few different object types, like:
#(1 'a string' 1.5) do: [:element | element print]
In this case, print is sent to different types of objects. For these cases, JITs usually emit polymorphic inline caches.
Megamorphic message sends occur when a message is sent to not just a few but many different object types at the same send site. One of the most prominent examples is this:
Behavior>>#new
^self basicNew initialize
Here, basicNew creates the object, then initialize does initialization. You could do:
Object new
OrderedCollection new
Dictionary new
and they will all execute the same Behavior>>#new method. As the implementation of initialize is different in a lot of classes, the PIC will quickly fill up. I'm interested in this kind of send site, knowing they occur only infrequently (only about 1% of sends are megamorphic).
Question
What are the possible and specific optimizations for megamorphic send sites to avoid doing a lookup?
I imagine a few, and want to know more. After a PIC gets full, we'll have to call the lookup (be it the full lookup or the globally cached one), but to optimize we can:
Recycle the PIC, throwing away all entries (many entries could be old and not used frequently).
Call some sort of specific megamorphic lookup, i.e. one that caches all previously dispatched types in an array accessed by the type hash (see the sketch after this list).
Inline the containing method (when inlined, the send site may stop being megamorphic).
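On the second option: the usual shape is a single global lookup cache, a fixed-size array indexed by a hash of (receiver class, selector), sitting in front of the full hierarchy walk. A rough sketch of the idea, written in Go with made-up stand-in types rather than any real VM's structures:

package vm

// Class and CompiledMethod stand in for the VM's real runtime structures.
type Class struct {
    Hash    uint32
    Methods map[string]*CompiledMethod
    Super   *Class
}

type CompiledMethod struct{ Name string }

const cacheSize = 1024 // power of two, so indexing is a mask instead of a modulo

type cacheEntry struct {
    class    *Class
    selector string
    method   *CompiledMethod
}

// megamorphicCache remembers every (class, selector) pair dispatched through it,
// so even a megamorphic send usually costs one hash plus one comparison.
var megamorphicCache [cacheSize]cacheEntry

// hashSelector is a simple FNV-1a string hash.
func hashSelector(s string) uint32 {
    h := uint32(2166136261)
    for i := 0; i < len(s); i++ {
        h ^= uint32(s[i])
        h *= 16777619
    }
    return h
}

// fullLookup walks the class hierarchy: the slow path we are trying to avoid.
func fullLookup(c *Class, selector string) *CompiledMethod {
    for ; c != nil; c = c.Super {
        if m, ok := c.Methods[selector]; ok {
            return m
        }
    }
    return nil
}

// lookupMegamorphic is what a full PIC could fall back to instead of fullLookup.
func lookupMegamorphic(c *Class, selector string) *CompiledMethod {
    idx := (c.Hash ^ hashSelector(selector)) & (cacheSize - 1)
    e := &megamorphicCache[idx]
    if e.class == c && e.selector == selector {
        return e.method // hit: no hierarchy walk
    }
    m := fullLookup(c, selector)
    megamorphicCache[idx] = cacheEntry{class: c, selector: selector, method: m}
    return m
}

This is essentially a one-level method lookup cache shared by all send sites; a real VM would also need to flush or invalidate it when methods are added or redefined.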
I am using jQuery, JSON, and AJAX for a comment system. I am curious: is there a size limit on what you can send through/store with JSON? For example, if a user types a large amount of text and I send it through JSON, is there some sort of maximum limit?
Also, can any kind of text be sent through JSON? For example, I sometimes allow users to use HTML; will this be OK?
JSON is similar to other data formats like XML - if you need to transmit more data, you just send more data. There's no inherent size limitation to the JSON request. Any limitation would be set by the server parsing the request. (For instance, ASP.NET has the "MaxJsonLength" property of the serializer.)
There is no fixed limit on how large a JSON data block is or any of the fields.
There are limits to how much JSON the JavaScript implementation of various browsers can handle (e.g. around 40MB in my experience). See this question for example.
It depends on the implementation of your JSON writer/parser. Microsoft's DataContractJsonSerializer seems to have a hard limit around 8kb (8192 I think), and it will error out for larger strings.
Edit:
We were able to resolve the 8K limit for JSON strings by setting the MaxJsonLength property in the web config as described in this answer: https://stackoverflow.com/a/1151993/61569
Surely everyone's missed a trick here. The current file size limit of a JSON file is 18,446,744,073,709,551,616 characters (or bytes, if you prefer), i.e. 2^64 bytes, at least on 64-bit infrastructures.
For all intents and purposes, we can assume it's unlimited, as you'll probably have a hard time hitting this issue...
Implementations are free to set limits on JSON documents, including the size, so choose your parser wisely. See RFC 7159, Section 9. Parsers:
"An implementation may set limits on the size of texts that it accepts. An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers. An implementation may set limits on the length and character contents of strings."
There is really no limit on the size of JSON data to be sent or received. We can send JSON data in a file too. How much JSON data can be handled depends on the capabilities of the browser you are working with.
If you are working with ASP.NET MVC, you can solve the problem by setting MaxJsonLength on your result:
var jsonResult = Json(new
{
    draw = param.Draw,
    recordsTotal = count,
    recordsFiltered = count,
    data = result
}, JsonRequestBehavior.AllowGet);
jsonResult.MaxJsonLength = int.MaxValue;
return jsonResult;
What is the requirement? Are you trying to send a large SQL table as a JSON object? I think that is not practical.
You will consume a large chunk of server resources if you push through with this.
You will also not be able to measure the progress with a progress bar, because your app will just wait for the server to reply, which could take ages.
What I recommend is to chop the request into batches: say, fetch the first 1,000 records, then request the next 1,000, until you get what you need.
This way you could also show a nice progress bar, as extracting that amount of data could really take too long.
MaxJsonLength gets or sets the maximum length of JSON strings. The default is 2,097,152 characters, which is equivalent to 4 MB of Unicode string data.
Refer to the URL below:
https://learn.microsoft.com/en-us/dotnet/api/system.web.script.serialization.javascriptserializer.maxjsonlength?view=netframework-4.7.2
I've started to use Google's protobuf, and there is a repeated field type:
repeated: this field can be repeated any number of times (including zero)
in a well-formed message. The order of the repeated values will be preserved.
What I need to know is how to make a message with a repeated field that repeats at least once. So I need to exclude zero occurrences of this type somehow. I can do such an assertion in my code, but what is the proper way to do this inside the .proto file?
You generally don't want to do that with protobufs. It's much better to assert in the code. The big reason is that once you have such a requirement (e.g. required fields), you end up in a situation where you can never loosen it. You may have old binaries running that suddenly start failing to read your newly constructed protobufs because they don't meet the requirements. And if you add a required field, you hit a situation where old data that you need to replay with current binaries will fail because it doesn't have the bits set.
Given that it's a serialization format, it's really best to separate the application logic (validation of values, etc.) from the serialization logic. The format doesn't offer an "at least one" repeated field enforced at the serialization layer.
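For example, keeping the check in ordinary application code next to your other business rules (a sketch in Go; the message and accessor names are whatever your own generated code provides):

package validate

import "errors"

// requireNonEmpty enforces the "at least one element" rule in application code,
// right after unmarshaling, instead of trying to express it in the .proto file.
func requireNonEmpty[T any](field string, items []T) error {
    if len(items) == 0 {
        return errors.New(field + " must contain at least one element")
    }
    return nil
}

// Usage, with whatever accessor protoc generated for your repeated field:
//   if err := requireNonEmpty("addresses", msg.GetAddresses()); err != nil {
//       // reject the message
//   }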
If you absolutely must fake it, you could have a required foo, and then a repeated extra_foos, but your logic would be harder to write.
I have a web app that uses Guids as the PK in the DB for an Employee object and an Association object.
One page in my app returns a large amount of data showing, for all Employees, the Associations they may be a part of.
So right now, I am sending to the client essentially a bunch of objects that look like:
{association_id: guid, employees: [guid1, guid2, ..., guidN]}
It turns out that many employees belong to many associations, so I am sending down the same Guids for those employees over and over again in these different objects. For example, it is possible that I am sending down 30,000 total guids across all associations in some cases, of which there are only 500 unique employees.
I am wondering if it is worth me building some kind of lookup index that I also send to the client like
{ 1: Guid1, 2: Guid2 ... }
and replacing all of the Guids in the objects I send down with those ints,
or if simply gzipping the response will compress it enough that this extra effort is not worth it?
Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).
You wrote the following at the end of your question:
Note: please don't get caught up in the details of if I should be
sending down 30,000 pieces of data or not -- this is not my choice and
there is nothing I can do about it (and I also can't change Guids to
ints or longs in the DB).
I think this is your main problem. If you don't solve the main problem, you may be able to reduce the size of the transferred data by a factor of 10, for example, but you still won't have solved the main problem. Let us think about the question: why should so much data be sent to the client (to the web browser)?
The data on the client side is needed to display some information to the user. No monitor is large enough to show 30,000 items on one page, and no user is able to grasp so much information. So I am sure that you display only a small part of the information; in that case you should send only the small part that you actually display.
You don't describe how the GUIDs will be used on the client side. If you need the information during row editing, for example, you can transfer the data only when the user starts editing; in that case you need to transfer the data for only one association.
If you need to display the GUIDs directly, then you can't display all the information at once, so you can send the information for one page only. If the user starts to scroll or presses the "next page" button, you can send the next portion of data. In this way you can dramatically reduce the size of the transferred data.
If you have no possibility to redesign that part of the application, you can implement your original suggestion: by replacing a GUID like "{7EDBB957-5255-4b83-A4C4-0DF664905735}" or "7EDBB95752554b83A4C40DF664905735" with a number like 123, you reduce the size of a GUID from 34 characters to 3. If you additionally send an array of "guid mapping" elements like
123:"7EDBB95752554b83A4C40DF664905735",
you can reduce the original data size from 30000*34 = 1,020,000 (about 1 MB) to 500*39 + 30000*3 = 19,500 + 90,000 = 109,500 (about 110 KB). So you can reduce the size of the data by roughly a factor of 10. Using compression of dynamic data on the web server can reduce the size further.
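A sketch of that remapping just before serialization, shown in Go for concreteness since the question doesn't say what the backend is (all type and field names here are illustrative):

package main

import (
    "encoding/json"
    "fmt"
)

// Association mirrors the objects currently being sent to the client.
type Association struct {
    AssociationID string   `json:"association_id"`
    Employees     []string `json:"employees"` // employee GUIDs, repeated many times
}

// compactPayload carries one GUID dictionary plus associations that refer to
// employees by small integer ids instead of full GUIDs.
type compactPayload struct {
    EmployeeGuids map[int]string `json:"employee_guids"` // short id -> GUID
    Associations  []compactAssoc `json:"associations"`
}

type compactAssoc struct {
    AssociationID string `json:"association_id"`
    Employees     []int  `json:"employees"` // short ids instead of GUIDs
}

func compact(assocs []Association) compactPayload {
    shortIDs := map[string]int{} // GUID -> short id
    out := compactPayload{EmployeeGuids: map[int]string{}}
    for _, a := range assocs {
        ca := compactAssoc{AssociationID: a.AssociationID}
        for _, guid := range a.Employees {
            id, ok := shortIDs[guid]
            if !ok {
                id = len(shortIDs) + 1
                shortIDs[guid] = id
                out.EmployeeGuids[id] = guid
            }
            ca.Employees = append(ca.Employees, id)
        }
        out.Associations = append(out.Associations, ca)
    }
    return out
}

func main() {
    // Placeholder GUIDs; in reality these come from the database.
    in := []Association{
        {AssociationID: "assoc-guid-1", Employees: []string{"emp-guid-a", "emp-guid-b"}},
        {AssociationID: "assoc-guid-2", Employees: []string{"emp-guid-b", "emp-guid-a"}},
    }
    b, _ := json.Marshal(compact(in))
    fmt.Println(string(b))
}

On the client it is then just one dictionary lookup per short id to get the GUID back.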
In any case you should examine why your page is so slow. If the program runs on a LAN, then transferring even 1 MB of data can be quick enough. Probably the page is slow while placing the data on the web page. I mean the following: if you modify some element on the page, the positions of all existing elements have to be recalculated. If you work with disconnected DOM objects first and then place the whole portion of data on the page at once, you can improve the performance dramatically. You didn't post which technology you use in your web application, so I don't include any examples. If you use jQuery, for example, I could give an example that makes clearer what I mean.
The lookup index you propose is nothing other than a "custom" compression scheme. As amdmax stated, this will improve your performance if you have a lot of the same GUIDs, but so will gzip.
IMHO, the extra effort of writing the custom coding will not be worth it.
Oleg correctly states that it might be worth fetching the data only when the user needs it. But this of course depends on your specific requirements.
if simply gzipping the response will compress it enough that this extra effort is not worth it?
The answer is: Yes, it will.
Compressing the data will remove redundant parts as well as possible (depending on the algorithm) until decompression.
To be sure, just generate and send the data both uncompressed and compressed and compare the results. You can count the duplicate GUIDs to calculate how big your data block would be with the dictionary compression method. But I guess gzip will do better, because it can also compress syntactic elements like braces, colons, etc. inside your data object.
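If you want numbers rather than a guess, gzipping the payload in memory takes only a few lines; here is a sketch in Go using the standard compress/gzip package (assuming you can reproduce the JSON server-side):

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
)

// gzippedSize compresses payload in memory and returns the compressed length,
// so it can be compared against len(payload) and against the dictionary scheme.
func gzippedSize(payload []byte) (int, error) {
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    if _, err := zw.Write(payload); err != nil {
        return 0, err
    }
    if err := zw.Close(); err != nil { // Close flushes the remaining output
        return 0, err
    }
    return buf.Len(), nil
}

func main() {
    payload := []byte(`{"association_id":"...","employees":["...","..."]}`) // substitute the real response here
    n, err := gzippedSize(payload)
    if err != nil {
        panic(err)
    }
    fmt.Printf("uncompressed: %d bytes, gzipped: %d bytes\n", len(payload), n)
}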
So what you are trying to accomplish is Dictionary compression, right?
http://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression
Instead of GUIDs, which are 16 bytes long, you will get ints, which are 4 bytes long. And you will get a dictionary full of key-value pairs that associate each GUID with some int value, right?
It will decrease your transfer time when there are many objects with the same id. But you will spend CPU time before the transfer to compress and after the transfer to decompress. So what is the amount of data you transfer? Is it MB / GB / TB? And is there any good reason to compress it before sending?
I do not know how dynamic your data is, but I would:
on a first call, send two directories/dictionaries mapping short ids to long GUIDs, one for your associations and one for your employees, e.g. {1: AssoGUID1, 2: AssoGUID2, ...} and {1: EmpGUID1, 2: EmpGUID2, ...}. These directories may also contain additional information on the Association and Employee instances; I suspect you do not simply display GUIDs;
on subsequent calls, just send the index of Employees per Association, { 1: [2,4,5], 3: [2,4], ... }, the key being the association short id and the ids in the array value being the short ids of the employees. Given your description, building the reverse index (Employee to Associations) may give a better result size-wise (but with higher processing cost).
Then it's all down to associative-array manipulation, which is straightforward in JS.
Again, if your data is (very) dynamic server side, the two directories will soon be obsolete and maintaining synchronization may cost you a lot.
I would start by answering the following questions:
What are the performance requirements? Are there size requirements? Speed requirements? What is the minimum performance that is truly needed?
What are the current performance metrics? How far are you from the requirements?
You characterized the data as possibly being mostly repeats. Is that the normal case? If not, what is?
The 2 options you listed above sound reasonable and trivial to implement. Try creating a look-up table and see what performance gains you get on actual queries. Try zipping the results (with look-ups and without), and see what gains you get.
In my experience, if you're not TOO far from the goal, meeting performance requirements is often a matter of trial and error.
If those options don't get you close to the requirements, I would take a step back and see if the requirements are reasonable in the time you have to solve the problem.
What you do next depends on which performance goals are lacking. If it is size, you're starting to be limited if you're required to send the entire association list every time. Is that truly a requirement? Can you send the entire list once, and then just updates?