How to handle sequences longer than 512 tokens in layoutLMV3? - transformer-model

How to work with sequences longer than 512 tokens. I don't wanted to use truncates =True. But actually wanted to handle the longer sequences

You can use the stride with max length parameter to handle the larger documents.
encoding = processor(images, words, boxes=boxes, word_labels=word_labels, truncation=True,padding="max_length", max_length = 512, stride = 128, return_overflowing_tokens = True,return_offsets_mapping = True)
This would help to handle the larger files.
Let me know if this is useful.

Related

How do you determine the length of the string in an OUTPUT_DEBUG_STRING_INFO?

The documentation for the OUTPUT_DEBUG_STRING_INFO structure doesn't explain, how to determine the length (or size) of the string value it points to. Specifically, the documentation for nDebugStringLength is confusing:
The lower 16 bits of the length of the string in bytes. As nDebugStringLength is of type WORD, this does not always contain the full length of the string in bytes.
For example, if the original output string is longer than 65536 bytes, this field will contain a value that is less than the actual string length in bytes.
As I understand it, the true size can be any value that's a solution to the equation:
size = nDebugStringLength + (n * 65536)
for any n in [0..65536).
Question:
How do I determine the correct size of the string? Unless I'm overlooking something, the documentation appears to be insufficient in this regard.
initially the debug event comes in the form DBGUI_WAIT_STATE_CHANGE
if use WaitForDebugEvent[Ex] api - it internally convert DBGUI_WAIT_STATE_CHANGE to DEBUG_EVENT by using DbgUiConvertStateChangeStructure[Ex]
the DbgExceptionStateChang ( in NewState) event with DBG_PRINTEXCEPTION_WIDE_C and DBG_PRINTEXCEPTION_C (in ExceptionCode) converted to OUTPUT_DEBUG_STRING_INFO. the nDebugStringLength is taken from Exception.ExceptionRecord.ExceptionInformation[0] or from ExceptionInformation[3] (in case DBG_PRINTEXCEPTION_C and api version without Ex ). but because nDebugStringLength is only 16 bit length, when original value is 32/64 bit length - it truncated - only low 16 bit of ExceptionInformation[0] (or [3]) is used.
note that ExceptionInformation[0] (and [3] in case DBG_PRINTEXCEPTION_WIDE_C ) containing string length in characters, including terminating 0.
in contrast nDebugStringLength in bytes (if we using WaitForDebugEventEx and DBG_PRINTEXCEPTION_WIDE_C exception - nDebugStringLength = (WORD)(ExceptionInformation[0] * sizeof(WCHAR))

How to continuously read a binary file in Crystal and get Bytes out of it?

Reading binary files in Crystal is supposed to be done with Bytes.new(size) and File#read, but... what if you don't know how many bytes you'll read in advance, and you want to keep reading chunks at a time?
Here's an example, reading 3 chunks from an imaginary file format that specifies the length of data chunks with an initial byte:
file = File.open "something.bin", "rb"
The following doesn't work, since Bytes can't be concatenated (as it's really a Slice(UInt8), and slices can't be concatenated):
data = Bytes.new(0)
3.times do
bytes_to_read = file.read_byte.not_nil!
chunk = Bytes.new(bytes_to_read)
file.read(chunk)
data += chunk
end
The best thing I've come up with is to use an Array(UInt8) instead of Bytes, and call to_a on all the bytes read:
data = [] of UInt8
3.times do
bytes_to_read = file.read_byte.not_nil!
chunk = Bytes.new(bytes_to_read)
file.read(chunk)
data += chunk.to_a
end
However, there's then seemingly no way to turn that back into Bytes (Array#to_slice was removed), which is needed for many applications and recommended by the authors to be the type of all binary data.
So... how do I keep reading from a file, concatenating to the end of previous data, and get Bytes out of it?
One solution would be to copy the data to a resized Bytes on every iteration. You could also collect the Bytes instances in a container (e.g. Array) and merge them at the end, but that would all mean additional copy operations.
The best solution would probably be to use a buffer that is large enough to fit all data that could possibly be read - or at least be very likely to (resize if necessary).
If the maximum size is just 3 * 255 bytes this is a no-brainer. You can size down at the end if the buffer is too large.
data = Bytes.new 3 * UInt8::MAX
bytes_read = 0
3.times do
bytes_to_read = file.read_byte.not_nil!
file.read_fully(data + bytes_read)
bytes_read += bytes_to_read
end
# resize to actual size at the end:
data = data[0, bytes_read]
Note: As the data format tells how many bytes to read, you should use read_fully instead of read which would silently ignore if there are actually less bytes to read.
EDIT: Since the number of chunks and thus the maximum size is not known in advance (per comment), you should use a dynamically resizing buffer. This can be easily implemented using IO::Memory, which will take care of resizing the buffer accordingly if necessary.
io = IO::Memory.new
loop do
bytes_to_read = file.read_byte
break if bytes_to_read.nil?
IO.copy(file, io, bytes_to_read)
end
data = io.to_slice

Hashing a long integer ID into a smaller string

Here is the problem, where I need to transform an ID (defined as a long integer) to a smaller alfanumeric identifier. The details are the following:
Each individual on the problem as an unique ID, a long integer of size 13 (something like 123123412341234).
I need to generate a smaller representation of this unique ID, a alfanumeric string, something like A1CB3X. The problem is that 5 or 6 character length will not be enough to represent such a large integer.
The new ID (eg A1CB3X) should be valid in a context where we know that only a small number of individuals are present (less than 500). The new ID should be unique within that small set of individuals.
The new ID (eg A1CB3X) should be the result of a calculation made over the original ID. This means that taking the original ID elsewhere and applying the same calculation, we should get the same new ID (eg A1CB3X).
This calculation should occur when the individual is added to the set, meaning that not all individuals belonging to that set will be know at that time.
Any directions on how to solve such a problem?
Assuming that you don't need a formula that goes in both directions (which is impossible if you are reducing a 13-digit number to a 5 or 6-character alphanum string):
If you can have up to 6 alphanumeric characters that gives you 366 = 2,176,782,336 possibilities, assuming only numbers and uppercase letters.
To map your larger 13-digit number onto this space, you can take a modulo of some prime number slightly smaller than that, for example 2,176,782,317, the encode it with base-36 encoding.
alphanum_id = base36encode(longnumber_id % 2176782317)
For a set of 500, this gives you a
2176782317P500 / 2176782317500 chance of a collision
(P is permutation)
Best option is to change the base to 62 using case sensitive characters
If you want it to be shorter, you can add unicode characters. See below.
Here is javascript code for you: https://jsfiddle.net/vewmdt85/1/
function compress(n) {
var symbols = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïð'.split('');
var d = n;
var compressed = '';
while (d >= 1) {
compressed = symbols[(d - (symbols.length * Math.floor(d / symbols.length)))] + compressed;
d = Math.floor(d / symbols.length);
}
return compressed;
}
$('input').keyup(function() {
$('span').html(compress($(this).val()))
})
$('span').html(compress($('input').val()))
How about using some base-X conversion, for example 123123412341234 becomes 17N644R7CI in base-36 and 9999999999999 becomes 3JLXPT2PR?
If you need a mapping that works both directions, you can simply go for a larger base.
Meaning: using base 16, you can reduce 1 to 16 to a single character.
So, base36 is the "maximum" that allows for shorter strings (when 1-1 mapping is required)!

how bytes are used to store information in protobuf

i am trying to understand the protocol buffer here is the sample , what i am not be able to understand is how bytes are being used in following messages. i dont know what this number
1 2 3 is used for.
message Point {
required int32 x = 1;
required int32 y = 2;
optional string label = 3;
}
message Line {
required Point start = 1;
required Point end = 2;
optional string label = 3;
}
message Polyline {
repeated Point point = 1;
optional string label = 2;
}
i read following paragraph in google protobuf but not able to understand what is being said here , can anyone help me in understanding how bytes are being used to store info.
The " = 1", " = 2" markers on each element identify the unique "tag" that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional element.
The general form of a protobuf message is that it is a sequence of pairs of the form:
field header
payload
For your question, we can largely forget about the payload - that isn't the bit that relates to the 1/2/3 and the <=16 restriction - all of that is in the field header. The field header is a "varint" encoded integer; "varint" uses the most-significant-bit as an optional continuation bit, so small values (<=127, assuming unsigned and not zig-zag) require one byte to encode - larger values require multiple bytes. Or in other words, you get 7 useful bits to play with before you need to set the continuation bit, requiring at least 2 bytes.
However! The field header itself is composed of two things:
the wire-type
the field-number / "tag"
The wire-type is the first 3 bits, and indicates the fundamental format of the payload - "length-delimited", "64-bit", "32-bit", "varint", "start-group", "end-group". That means that of the 7 useful bits we had, only 4 are left; 4 bits is enough to encode numbers <= 16. This is why field-numbers <= 16 are suggested (as an optimisation) for your most common elements.
In your question, the 1 / 2 / 3 is the field-number; at the time of encoding this is left-shifted by 3 and composed with the payload's wire-type; then this composed value is varint-encoded.
Protobuf stores the messages like a map from an id (the =1, =2 which they call tags) to the actual value. This is to be able to more easily extend it than if it would transfer data more like a struct with fixed offsets. So a message Point for instance would look something like this on a high level:
1 -> 100,
2 -> 500
Which then is interpreted as x=100, y=500 and label=not set. On a lower level, protobuf serializes this tag-value mapping in a highly compact format, which among other things, stores integers with variable-length encoding. The paragraph you quoted just highlights exactly this in the case of tags, which can be stored more compactly if they are < 16, but the same for instance holds for integer values in your protobuf definition.

Last byte in Huffman compression

I am wondering about what is the best way to handle the last byte in Huffman Copression. I have some nice code in C++, that can compress text files very well, but currently I must write to my coded file also number of coded chars (well, it equal to input file size), because of no idea how to handle last byte better.
For example, last char to compress is 'a', which code is 011 and I am just starting new byte to write, so the last byte will look like:
011 + some 5 bits of trash, I am making them zeros for example at the end.
And when I am encoding this coded file, it may happen that code 00000 (or with less zeros) is code for some char, so I will have some trash char at the end of my encoded file.
As I wrote in first paragraph, I am avoiding this by saving numbers of chars of input file in coded file, and while encoding, I am reading the coded file to reach that number (not to EndOfFile, to don't get to those example 5 zeros).
It's not really efficient, size of coded file is increased for long number.
How can I handle this in better way?
Your approach (write the number of encoded bytes the to the file) is a perfectly reasonable approach. If you want to try a different avenue, you could consider inventing a new "pseudo-EOF" character that marks the end of the input (I'll denote it as &square;). Whenever you want to compress a string s, you instead compress the string s&square;. This means that when you build up your encoding tree, you would include one copy of the &square; character so that you have a unique encoding for &square;. Then, when you write out the string to the file, you would write out the bits characters of the string as normal, then write out the bit pattern for &square;. If there are leftover bits, you can just leave them set arbitrarily.
The advantage to this approach is that as you decode the file, if at any point you find the &square; character, you can immediately stop decoding bits because you know that you have hit the end of the file. This does not require you to store the number of bytes that were written out anywhere - the encoding implicitly marks its own endpoint.
The disadvantage to this setup is that it might increase the length of the bit patterns used by certain characters, since you will need to assign a bit pattern to &square; in addition to all the other characters.
I teach an introductory programming course and we use Huffman encoding as one of our assignments. We have students use the above approach, since it's a bit easier than having to write out the number of bits or bytes before the file contents. For more details, you could take a look at this handout or these lecture slides from the course.
Hope this helps!
I know this is an old question, but still, there's an alternate, so it might help someone.
When you're writing your compressed file to output, you probably have some integer keeping track of where you are in the current byte (for bit shifting).
char c, p;
p = '\0';
int curr = 7;
while (infile.get(c))
{
std::string trav = GetTraversal(c);
for (int i = 0; i < trav.size(); i++)
{
if (trav[i] == '1')
p += (1 << curr);
if (--curr < 0)
{
outfile.put(p);
p = '\0';
curr = 7;
}
}
}
if (curr < 7)
outfile.put(p);
At the end of this block, (curr+1)%8 equals the number of trash bits in the last data byte. You can then store it at the end as a single extra byte, and just keep it in mind when you're decompressing.

Resources