What is its data structure? Is it XML-based? How can it distinguish between different content types, for example text, image, files etc.?
It is a system-wide bit bucket, which means it just holds a sequence of bytes and an integer value (a Windows atom) that describes its format, but it does not ensure that the byte sequence really is in that format.
The only feature other than this is that an application can decide whether it wants the system to store the byte sequence, or whether the application keeps the data itself and only provides it when someone requests it.
So as you see it is an API and not a data structure.
My question is in the title, this provides context to help you understand my confusion. Everything is sent over https.
My understanding of base 64 encoding is that it is a way of representing binary data as text, such that the text is safe to transmit across networks or the internet because it avoids anything that might be interpreted as a control code by the various possible protocols that might be involved at some point.
Given this understanding, I am confused why everything sent over the internet is not base 64 encoded. When is it safe not to base 64 encode something before sending it? I understand that not everything understands or expects to receive things in base 64, but my question is why doesn't everything expect and work with this if it is the only way to send data without the possibility it could be interpreted as control codes?
I am designing an Android app and server API such that the app can use the API to send data to the server. There are some potentially large SQLite database files the client will be sending to the server (I know this sounds strange, yes it needs to send the entire database files). They are being gzipped prior to uploading. I know there is also a header that can be used to indicate this: Content-Encoding: gzip. Would it be safe to compress the data and send it with this header without base 64 encoding it? If not, why does such a header exist if it is not safe to use? I mean, if you base 64 encode it first and then compress it, you undo the point of base 64 encoding and it is not at that point base 64 encoded. If you compress it first and then base 64 encode it, that header would no longer be valid as it is not in the compressed format at that point. We actually don't want to use the header because we want to save the files in a compressed state, and using the header will cause the server to decompress it prior to our API code running. I'm only asking this to further clarify why I am confused about whether it is safe to send gzip compressed data without base 64 encoding it.
My best guess is that it depends on if what you are sending is binary data or not. If you are sending binary data, it should be base 64 encoded as the final step before uploading it. But if you are sending text data, you may not need to do this. However, it still seems to me that this might depend on the character encoding used. Perhaps some character encodings can result in sending data that could be interpreted as a control code? If this is true, which character encodings are safe to send without base 64 encoding them as the final step prior to sending? If I am correct about this, it implies you should only use that gzip header if you are sending compressed text that has not been base 64 encoded. Does compressing it create the possibility of something that could be interpreted as a control code?
I realize this was rather long, so I will repeat my primary questions (the title) here: Is either Gzip compressed binary data or uncompressed text safe to transmit, or should it be base 64 encoded as the final step before sending it? Okay I lied there is one more question involved in this. Would sending gzip compressed text always be safe to send without base 64 encoding it at the end, no matter which character encoding it had prior to compression?
My understanding of base 64 encoding is that it is a way of representing binary data as text,
Specifically, as text consisting of characters drawn from a 64-character set, plus a couple of additional characters serving special purposes.
such that the text is safe to transmit across networks or the internet because it avoids anything that might be interpreted as a control code by the various possible protocols that might be involved at some point.
That's a bit of an overstatement. For two endpoints to communicate with each other, they need to agree on one protocol. If another protocol becomes involved along the way, then it is the responsibility of the endpoints for that transmission to handle any needed encoding considerations for it.
What bytes and byte combinations can successfully be conveyed is a matter of the protocol in use, and there are plenty that handle binary data just fine.
At one time there was also an issue that some networks were not 8-bit clean, so that bytes with numeric values greater than 127 could not be conveyed across those networks, but that is not a practical concern today.
Given this understanding, I am confused why everything sent over the internet is not base 64 encoded.
Given that the understanding you expressed is seriously flawed, it is not surprising that you are confused.
When is it safe not to base 64 encode something before sending it?
It is not only safe but essential to avoid base 64 encoding when the recipient of the transmission expects something different. The two or more parties to a given transmission must agree about the protocol to be used. That establishes the acceptable parameters of the communication. Although Base 64 is an available option for part or all of a message, it is by no means the only one, nor is it necessarily the best one for binary data, much less for data that are textual to begin with.
I understand that not everything understands or expects to receive things in base 64, but my question is why doesn't everything expect and work with this if it is the only way to send data without the possibility it could be interpreted as control codes?
Because it is not by any means the only way to avoid data being misinterpreted.
They are being gzipped prior to uploading. I know there is also a header that can be used to indicate this: Content-Encoding: gzip. Would it be safe to compress the data and send it with this header without base 64 encoding it?
It would be expected to transfer such data without base-64 encoding it. HTTP(S) handles binary data just fine. The Content-Encoding header tells the recipient how the message body has been encoded, and if it specifies a binary content coding (such as gzip) then binary data conforming to that coding are what the recipient will expect.
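As a rough illustration of the point above, this Python sketch gzips some bytes and compares the result with a base-64 encoded copy. The payload is made up and only stands in for a database file; in a real upload the `compressed` bytes would simply be sent as the request body as they are.

```python
import base64
import gzip

# Stand-in for a SQLite file: any byte string will do for this illustration.
payload = b"INSERT INTO readings VALUES (1, 'sensor-a', 20.5);\n" * 2000

compressed = gzip.compress(payload)
encoded = base64.b64encode(compressed)

# Base64 emits 4 output characters for every 3 input bytes (rounded up),
# so encoding after compression gives back about a third of the savings.
expected_b64_len = 4 * ((len(compressed) + 2) // 3)

print("original bytes:", len(payload))
print("gzipped bytes: ", len(compressed))
print("gzip + base64: ", len(encoded))

# The gzipped bytes survive a round trip unchanged; nothing needs escaping
# as long as the transport (here, an HTTP body) is binary-safe.
restored = gzip.decompress(compressed)
```

The only thing base 64 would add here is roughly 33% more bytes on the wire.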
My best guess is that it depends on if what you are sending is binary data or not.
No. These days, for all practical intents and purposes, it depends only on what application-layer protocol you are using for the transmission. If it specifies that some or all of the message is to be base-64 encoded (according to a particular base-64 scheme, as there is more than one) then that's what the sender must do and how the receiver will interpret the message. If the protocol does not specify that, then the sender must not perform base-64 encoding. Some protocols afford the sender the option to make this choice, but those also provide a way for the sender to indicate inside the transmission what choice has been made.
Is either Gzip compressed binary data or uncompressed text safe to transmit, or should it be base 64 encoded as the final step before sending it?
Neither is inherently unsafe to transmit on today's networks. Whether data are base-64 encoded for transmission is a question of agreement between sender and receiver.
Okay I lied there is one more question involved in this. Would sending gzip compressed text always be safe to send without base 64 encoding it at the end, no matter which character encoding it had prior to compression?
The character encoding of the uncompressed text is not a factor in whether a gzipped version can be safely and successfully conveyed. But it probably matters for the receiver or anyone to whom they forward that data to understand the uncompressed text correctly. If you intend to accommodate multiple character encodings then you will want to provide a way to indicate which applies to each text.
I need to design a binary format to save data from a scientific application. This data has to be encoded in a binary format that can't be easily read by any other application (it is a requirement by some of our clients). As a consequence, we decided to build our own binary format, its encoder and its decoder.
We got some inspiration from many binary formats, including protobuf. One thing that puzzles me is the way protobuf encodes the length of embedded messages. According to https://developers.google.com/protocol-buffers/docs/encoding, the size of an embedded message is encoded at its very beginning as a varint.
But before we encode an embedded message, we don't yet know its size (think for instance of an embedded message that contains many integers encoded as varints). As a consequence, we need to encode the message entirely before we write it to disk, so that we know its size.
Imagine that this message is huge. As a consequence, it is very difficult to encode it in an efficient way. We could encode this size as a full int and seek back to this part of the file once the embedded message is written, but we lose the nice property of varints: you don't need to specify whether you have a 32-bit or a 64-bit integer. So going back to Google's implementation using a varint:
Is there an implementation trick I am missing, or is this scheme likely to be inefficient for large messages?
Yes, the correct way to do this is to write the message first, at the back of the buffer, and then prepend the size. With proper buffer management, you can write the message in reverse.
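A minimal Python sketch of the "encode the body first, then prepend its size" idea follows. It is not protobuf's actual implementation: field tags are left out, the function names are made up for illustration, and the body is simply buffered in memory rather than written in reverse into a shared buffer.

```python
from typing import List, Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_position) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_embedded(values: List[int]) -> bytes:
    """Serialize the body first, then prepend its byte length as a varint."""
    body = b"".join(encode_varint(v) for v in values)
    return encode_varint(len(body)) + body


def decode_embedded(data: bytes) -> List[int]:
    """Read the length prefix, then decode exactly that many body bytes."""
    length, pos = decode_varint(data)
    end = pos + length
    values = []
    while pos < end:
        value, pos = decode_varint(data, pos)
        values.append(value)
    return values
```

The cost is that the whole embedded message has to exist in a buffer before its prefix can be emitted, which is exactly the trade-off the question describes for very large messages.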
That said, why write your own message format when you can just use protobuf? It would be better to just use Protobuf directly and encrypt the file format. That would be easy for you to use, and still be hard for other applications to read.
I have some highly variable time series data and am looking for an efficient way to store it. It only needs to be queried based on timestamp and two tags (type and device).
A device typically logs a set of 15-20 values (from a possible pool of around 200) and this set of values will generally never change over the entire life of the device, so it seems inefficient to have all these extra fields (columns) for values I'll likely never use.
Because of the above, I'm considering protocol buffers as an efficient way to store this data, but I am struggling to find a way to store this data in InfluxDB.
Is it possible to store either protocol buffers OR binary data in InfluxDB? I really don't want to use an encoding scheme (e.g. base64) to store it as a string.
I have a web app that uses Guids as the PK in the DB for an Employee object and an Association object.
One page in my app returns a large amount of data showing all Associations all Employees may be a part of.
So right now, I am sending to the client essentially a bunch of objects that look like:
{association_id: guid, employees: [guid1, guid2, ..., guidN]}
It turns out that many employees belong to many associations, so I am sending down the same Guids for those employees over and over again in these different objects. For example, it is possible that I am sending down 30,000 total guids across all associations in some cases, of which there are only 500 unique employees.
I am wondering if it is worth me building some kind of lookup index that I also send to the client like
{ 1: Guid1, 2: Guid2 ... }
and replacing all of the Guids in the objects I send down with those ints,
or if simply gzipping the response will compress it enough that this extra effort is not worth it?
Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).
You wrote the following at the end of your question:
Note: please don't get caught up in the details of if I should be
sending down 30,000 pieces of data or not -- this is not my choice and
there is nothing I can do about it (and I also can't change Guids to
ints or longs in the DB).
I think that is your main problem. If you don't solve the main problem, you may be able to reduce the size of the transferred data by a factor of 10, for example, but the main problem will still be there. Let us think about the question: why should so much data be sent to the client (to the web browser)?
The data on the client side is needed to display some information to the user. The monitor is not large enough to show 30,000 items on one page, and no user is able to grasp so much information. So I am sure that you display only a small part of the information. In that case you should send only the small part of the information that you display.
You don't describe how the guids will be used on the client side. If you need the information during row editing, for example, you can transfer the data only when the user starts editing. In that case you need to transfer the data for only one association.
If you need to display the guids directly, then you can't display all the information at once, so you can send the information for one page only. If the user starts to scroll or clicks a "next page" button, you can send the next portion of data. In this way you can dramatically reduce the size of the transferred data.
If you have no possibility of redesigning that part of the application, you can implement your original suggestion: by replacing a GUID such as "{7EDBB957-5255-4b83-A4C4-0DF664905735}" or "7EDBB95752554b83A4C40DF664905735" with a number like 123, you reduce the size of each GUID from 34 characters to 3. If you additionally send an array of "guid mapping" elements like
123:"7EDBB95752554b83A4C40DF664905735",
you can reduce the original size of the data from 30000*34 = 1020000 (about 1 MB) to 500*39 + 30000*3 = 19500 + 90000 = 109500 (about 110 KB), taking the 500 unique employees from the question. So you can reduce the size of the data by a factor of roughly 10. Compressing dynamic data on the web server can reduce the size further.
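The lookup-index idea can be sketched in Python as follows. The data is hypothetical and merely shaped like the question's payload (30,000 GUID references over 500 unique employees), and `build_indexed_payload` is an illustrative name, not part of any library.

```python
import json
import uuid

# Hypothetical data: 60 associations, each listing the same 500 employees,
# for 30,000 GUID references in total.
employee_guids = [str(uuid.UUID(int=i + 1)) for i in range(500)]
associations = [
    {"association_id": str(uuid.UUID(int=10_000 + a)), "employees": employee_guids}
    for a in range(60)
]


def build_indexed_payload(associations):
    """Replace each employee GUID with a small integer and ship the mapping once."""
    index_of = {}  # GUID -> small integer, assigned in order of first appearance
    compact = []
    for assoc in associations:
        ids = [index_of.setdefault(g, len(index_of) + 1) for g in assoc["employees"]]
        compact.append({"association_id": assoc["association_id"], "employees": ids})
    lookup = {number: guid for guid, number in index_of.items()}
    return {"lookup": lookup, "associations": compact}


plain_json = json.dumps(associations, separators=(",", ":"))
indexed_json = json.dumps(build_indexed_payload(associations), separators=(",", ":"))

print("plain JSON bytes:  ", len(plain_json))
print("indexed JSON bytes:", len(indexed_json))
```

On the client, each integer is resolved back to its GUID through the `lookup` table, so the saving costs one extra lookup per reference.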
In any case, you should examine why your page is so slow. If the program runs on a LAN, then transferring even 1 MB of data can be quick enough. The page is probably slow while the data is being placed on the web page. I mean the following: if you modify an element on the page, the positions of all existing elements have to be recalculated. If you work with disconnected DOM objects first and then place the whole portion of data on the page at once, you can improve the performance dramatically. You didn't say in the question which technology you use in your web application, so I haven't included any examples. If you use jQuery, for example, I could give an example that makes clearer what I mean.
The lookup index you propose is nothing other than a "custom" compression scheme. As amdmax stated, this will improve your performance if you have a lot of repeated GUIDs, but so will gzip.
IMHO, the extra effort of writing the custom coding will not be worth it.
Oleg correctly states that it might be worth fetching the data only when the user needs it. But this of course depends on your specific requirements.
if simply gzipping the response will compress it enough that this extra effort is not worth it?
The answer is: Yes, it will.
Compressing the data will remove redundant parts as well as possible (depending on the algorithm), and they are restored on decompression.
To be sure, just generate the data both uncompressed and compressed and compare the results. You can count the duplicate GUIDs to calculate how big your data block would be with the dictionary compression method. But I guess gzip will do better because it can also compress the syntactic elements like braces, colons, etc. inside your data object.
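A quick way to run that comparison is sketched below in Python. The data is randomly generated and hypothetical (500 employees, 60 associations with 500 memberships each); real payloads will compress differently, so treat the printed numbers as illustrative only.

```python
import gzip
import json
import random
import uuid

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical data: 500 employees, 60 associations of 500 random members each.
employees = [str(uuid.UUID(int=random.getrandbits(128))) for _ in range(500)]
associations = [
    {
        "association_id": str(uuid.UUID(int=random.getrandbits(128))),
        "employees": random.choices(employees, k=500),
    }
    for _ in range(60)
]

raw = json.dumps(associations, separators=(",", ":")).encode("utf-8")
zipped = gzip.compress(raw)

print("uncompressed JSON bytes:", len(raw))
print("gzipped JSON bytes:     ", len(zipped))
print("ratio: %.1fx" % (len(raw) / len(zipped)))
```

Running the same measurement on a payload that already uses a lookup index shows whether the custom scheme still gains anything once gzip is applied.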
So what you are trying to accomplish is Dictionary compression, right?
http://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression
Instead of Guids, which are 16 bytes long, you will get ints, which are 4 bytes long. And you will get a dictionary full of key-value pairs that associates each guid with some int value, right?
It will decrease your transfer time when there are many objects with the same id. But it will cost CPU time before the transfer to compress and after the transfer to decompress. So what is the amount of data you transfer? Is it MB, GB, or TB? And is there any good reason to compress it before sending?
I do not know how dynamic your data is, but I would:
on a first call, send two directories/dictionaries mapping short ids to long GUIDs, one for your associations and one for your employees, e.g. {1: AssoGUID1, 2: AssoGUID2,...} and {1: EmpGUID1, 2: EmpGUID2,...}. These directories may also contain additional information on the Association and Employee instances; I suspect you do not simply display GUIDs
on subsequent calls, just send the index of Employees per Association, { 1: [2,4,5], 3: [2,4], ...}, the key being the short id of the association and the array value holding the short ids of its employees. Given your description, building the reverse index (Employee to Associations) may give a better result size-wise (but with more processing)
Then it's all down to associative array manipulation, which is straightforward in JS.
Again, if your data is (very) dynamic server side, the two directories will soon be obsolete and maintaining synchronization may cost you a lot.
I would start by answering the following questions:
What are the performance requirements? Are there size requirements? Speed requirements? What is the minimum performance that is truly needed?
What are the current performance metrics? How far are you from the requirements?
You characterized the data as possibly being mostly repeats. Is that the normal case? If not, what is?
The 2 options you listed above sound reasonable and trivial to implement. Try creating a look-up table and see what performance gains you get on actual queries. Try zipping the results (with look-ups and without), and see what gains you get.
In my experience, if you're not TOO far from the goal, meeting performance requirements is often a matter of trial and error.
If those options don't get you close to the requirements, I would take a step back and see if the requirements are reasonable in the time you have to solve the problem.
What you do next depends on which performance goals are not being met. If it is size, your options start to be limited if you're required to send the entire association list every time. Is that truly a requirement? Can you send the entire list once, and then just updates?
I have built a simple project which uses the "Winsock" control.
When I receive any data I put it in a variable, because I can't put it in a textbox:
it is a file, not text.
But if I send a big file, I get an error:
"Overflow"
Is there any way to fix this problem?
A VB variable-length string can, in theory, be up to 2 GB in size; its actual maximum size depends on available virtual memory, which is also limited to 2 GB for the entire application. And since VB stores strings in Unicode format, a string can only hold about 1 GB of text.
(maximum length for string in VB6)
If this is your problem, try splitting the incoming data across several strings.
Are you handling the SendComplete event properly before sending more data?
Otherwise you will get a buffer overflow from the WinSock control.
You need to split your data into smaller packets (around 2-5k each should do it) and send each packet individually, then reconstruct the data from the packets at the other end. You could add a unique character at the end of the data, say Chr(0), so that the receiving end knows that all the data for that transmission has been received.
This is quite a simplified solution to this problem - a better method would be to devise a simple protocol for data handshaking so you know each packet has been received.
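The packet-splitting idea can be sketched as follows, in Python rather than VB6 and without any real sockets. Note the swapped-in technique: since the data here is a file, a Chr(0) terminator could also occur inside the data itself, so this sketch sends the total length up front instead. `CHUNK_SIZE`, `split_into_packets` and `Reassembler` are hypothetical names chosen for illustration.

```python
import struct

CHUNK_SIZE = 4096  # assumed packet size, within the 2-5k range suggested above


def split_into_packets(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Prefix the total length, then cut the stream into fixed-size packets."""
    framed = struct.pack(">Q", len(data)) + data  # 8-byte big-endian length header
    return [framed[i:i + chunk_size] for i in range(0, len(framed), chunk_size)]


class Reassembler:
    """Collects packets until the announced number of bytes has arrived."""

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, packet: bytes):
        """Add one packet; return the full payload once complete, else None."""
        self._buffer.extend(packet)
        if len(self._buffer) < 8:
            return None  # length header not complete yet
        (expected,) = struct.unpack(">Q", bytes(self._buffer[:8]))
        if len(self._buffer) - 8 < expected:
            return None  # still waiting for more data
        return bytes(self._buffer[8:8 + expected])
```

The sender would transmit each packet in turn (waiting for the previous send to complete), and the receiver would call `feed` from its data-arrival handler until it returns the reassembled bytes.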