How do I clear every other bit of a string in Ruby and convert it to a byte array? I understand that I need to AND every byte with the value 0x01010101, but the difficulty is converting the string to binary correctly. Ideally it should be fast and allocate as little as possible.
Later I will need to pass this value to Digest::MD5.hexdigest.
Firstly, note that 0x is for base 16, 0b is for base 2:
0b11111111.to_s(2) #=> "11111111"
0x11111111.to_s(2) #=> "10001000100010001000100010001"
As you are converting bits within bytes you want to use 0b... for your mask.
Next,
0b01010101.to_s(2) #=> "1010101"
showing that, as with all integers, leading zeroes are dropped, meaning you can include them or not. Consider,
0b11111111 & 0 #=> 0
It is seen that, as a mask, zero is treated as having 7 leading bits of zero. We see that
(0b11111111 &
0b1010101).to_s(2) #=> "1010101"
So, we can define your bitwise mask as
MASK = 0b1010101
We can now use String#unpack with format string "C*" to convert the string to an array of 8-bit unsigned integers, which we then bitwise and with MASK (using &):
str = "Let's party, now!"
str.unpack("C*").map { |u| u & MASK }
#=> [68, 69, 84, 5, 81, 0, 80, 65, 80, 84, 81, 4, 0, 68, 69, 85, 1]
The "C" in "C*" is the format directive for an 8-bit unsigned integer and is applied to the first byte; "*" means to repeat "C" for all subsequent bytes.
See also Integer#&.
I see from @DavidKling's answer that one could alternatively write
str.bytes.map { |u| u & MASK }
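Since the result is eventually passed to Digest::MD5.hexdigest, which takes a string, the masked integers have to be packed back into a binary string first. Here is a minimal sketch of two ways to do that. The helper names mask_via_array and mask_in_place are hypothetical, not from the answers here. The second variant overwrites one binary copy with String#setbyte, which should allocate less than building an intermediate array, although that is not benchmarked here:

```ruby
require "digest"

# Variant 1: go through an intermediate array of integers, then pack it back.
def mask_via_array(str, mask = 0b01010101)
  str.bytes.map { |b| b & mask }.pack("C*")
end

# Variant 2: make one binary copy and overwrite each byte in place.
def mask_in_place(str, mask = 0b01010101)
  out = str.b # binary (ASCII-8BIT) copy; the caller's string is untouched
  out.bytesize.times { |i| out.setbyte(i, out.getbyte(i) & mask) }
  out
end

masked = mask_in_place("Let's party, now!")
p masked.bytes                      # same integers as the map-based version
puts Digest::MD5.hexdigest(masked)  # 32-character hex digest of the masked bytes
```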
You can use String#bytes to get an array of the string's byte values (as integers). For ASCII-only strings these are the same as the characters' code points.
'Roman'.bytes # [82, 111, 109, 97, 110]
I am trying to convert the following byte array to hexadecimal;
[1, 1, 65, -50, 6, 104, -91, -70, -100, 119, -100, 123, 52, -109, -33, 45, -14, 86, -105, -97, -115, 16]
The result should be;
010141CE0668A5BA9C779C7B3493DF2DF256979F8D10
Here is my current attempt;
item.getProperties["Mapi-Conversation-Index"].to_a.map {|s| s.to_s(16)}.join()
But my output is: 010141-320668-5b-46-6477-647b34-6d-212d-e56-69-61-7310
arr = [1, 1, 65, -50, 6, 104, -91, -70, -100, 119, -100, 123, 52, -109, -33, 45, -14, 86]
arr.pack("c*").unpack("H*").first
#=> "010141ce0668a5ba9c779c7b3493df2df256"
See Array#pack and String#unpack.
The argument "c" for pack specifies an 8-bit signed integer. The argument "H" for unpack specifies "hex string (high nibble first)". The asterisk at the end of each directive specifies that "c" applies to all elements of arr and "H" applies to all characters of the string produced by pack.
Note that
arr.pack("c*")
#=> "\x01\x01A\xCE\x06h\xA5\xBA\x9Cw\x9C{4\x93\xDF-\xF2V"
and
arr.pack("c*").unpack("H*")
#=> ["010141ce0668a5ba9c779c7b3493df2df256"]
which is why first is needed to extract the string.
This works:
[1, 1, 65, -50].map { |n| '%02X' % (n & 0xFF) }.join
The %02X format specifier makes a 2-character-wide hex number, padded with 0 digits. The & 0xFF is necessary to convert your negative numbers into the standard 0 through 255 range that people usually use when talking about byte values.
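The original attempt fails because Integer#to_s(16) keeps the sign and does not pad, so -50 becomes "-32" and 6 becomes "6". As a rough sketch, both approaches above can be wrapped up and run against the full 22-element array from the question. The method names are hypothetical, and String#unpack1 assumes Ruby 2.4 or later:

```ruby
# pack/unpack route: "c*" packs signed 8-bit integers, "H*" reads them back as hex.
def signed_bytes_to_hex(bytes)
  bytes.pack("c*").unpack1("H*").upcase
end

# format route: mask each value into 0..255, then print two hex digits per byte.
def signed_bytes_to_hex_fmt(bytes)
  bytes.map { |n| format("%02X", n & 0xFF) }.join
end

arr = [1, 1, 65, -50, 6, 104, -91, -70, -100, 119, -100, 123,
       52, -109, -33, 45, -14, 86, -105, -97, -115, 16]

puts signed_bytes_to_hex(arr)      # 010141CE0668A5BA9C779C7B3493DF2DF256979F8D10
puts signed_bytes_to_hex_fmt(arr)  # same string
```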
I have a string represented like this in Ruby: "\x00\x00\xff". How can I get an array of the integers? I'm a bit confused about how to represent bytes properly.
For example, how can I transform that into an array like this?
[ 0, 0, 255 ]
Update
I've tried the examples below, and this is where I'm having trouble, like "\x00\x00\xff".bytes should work but I get this:
[92, 120, 52, 48, 92, 120, 102, 102, 92, 120, 53, 53]
It looks like each character is returning its byte code instead of the escapes being recognized as separate bytes. How do I prevent the string "\x00\x00\xff" from being interpreted literally?
s = "\x00\x00\xff"
s.bytes # => [0, 0, 255]
Use String#each_byte for this:
"\x00\x00\xff".each_byte.to_a
# => [0, 0, 255]
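The output in the update suggests the string held literal backslash characters (as a single-quoted literal, or text read from a file or user input, would) rather than three raw bytes. A small sketch of the difference, with a hypothetical helper unescape_hex that turns such escaped text back into byte values, assuming the text only contains \xNN escapes:

```ruby
real    = "\x00\x00\xff" # double quotes: the escapes are interpreted, 3 bytes
escaped = '\x00\x00\xff' # single quotes: 12 literal characters

p real.bytes             # => [0, 0, 255]
p escaped.bytes.length   # => 12

# Pull out every \xNN escape and convert its two hex digits to an integer.
def unescape_hex(text)
  text.scan(/\\x(\h\h)/).map { |(hex)| hex.to_i(16) }
end

p unescape_hex(escaped)  # => [0, 0, 255]
```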
This is effectively log base 2, but I do not have access to this functionality in the environment I'm in. Manually walking through the bits to verify them is unacceptably slow. If it were just 4 bits, I could probably index it and waste some space in an array, but with 64 bits it is not viable.
Any clever constant time method to find which bit is set ? (The quantity is a 64-bit number).
EDIT: To clarify, there is a single bit set in the number.
I assume you want the position of the most significant bit that is set. Do a binary search. If the entire value is 0, no bits are set. If the top 32 bits are 0, then the bit is in the bottom 32 bits; else it is in the high half. Then recurse on the two 16-bit halves of the appropriate 32 bits. Recurse until you are down to a 4-bit value and use your look-up table. (Or recurse down to a 1-bit value.) You just need to keep track of which half you used at each recursion level.
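The binary search described above can be written as a loop instead of a recursion. This is only a sketch in Ruby (the question's environment is unknown), with a hypothetical method name, and it narrows all the way down to a single bit rather than stopping at a 4-bit lookup table:

```ruby
# Position (0-based) of the most significant set bit of a 64-bit value,
# or nil when no bit is set. Each step halves the width still to be searched.
def highest_set_bit(x)
  return nil if x.zero?

  pos = 0
  [32, 16, 8, 4, 2, 1].each do |shift|
    next if (x >> shift).zero? # upper half is empty: stay in the lower half

    x >>= shift                # continue the search in the upper half
    pos += shift               # remember which half was taken
  end
  pos
end

p highest_set_bit(1 << 40) # => 40
p highest_set_bit(0b1010)  # => 3
```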
The fastest method I know of uses a De Bruijn sequence.
Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup
Note that in lg(N), N is the number of bits, not the number of the highest set bit. So it's constant time for any N-bit number.
If you know that the number is an exact power of 2 (i.e. there is only 1 bit set), there is an even faster method just below that.
That hack is for 32 bits. I seem to recall seeing a 64 bit example somewhere, but can't track it down at the moment. Worst case, you run it twice: once for the high 32 bits and once for the low 32 bits.
If your numbers are powers of 2 and you have a bit count instruction you could do:
bitcount(x-1)
e.g.
x      x-1    bitcount(x-1)
b100   b011   2
b001   b000   0
Note this will not work if the numbers are not powers of 2.
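As an illustration of the bitcount(x-1) trick, here is a sketch in Ruby. Ruby has no built-in population count, so the bitcount helper below is a stand-in for a hardware instruction, and both method names are hypothetical:

```ruby
# Count the 1 bits of a non-negative integer via its binary string.
def bitcount(n)
  n.to_s(2).count("1")
end

# Position of the single set bit, assuming x is an exact power of two.
def single_bit_position(x)
  raise ArgumentError, "not a power of two" unless x > 0 && (x & (x - 1)).zero?

  bitcount(x - 1) # x - 1 has exactly as many 1 bits as x has trailing zeros
end

p single_bit_position(0b100) # => 2
p single_bit_position(0b001) # => 0
p 0b100.bit_length - 1       # => 2, Ruby's Integer#bit_length gives the same answer
```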
EDIT
Here is a 64-bit version of the De Bruijn method:
static const int log2_table[64] = {0, 1, 2, 7, 3, 13, 8, 19, 4, 25, 14, 28, 9, 34,
                                   20, 40, 5, 17, 26, 38, 15, 46, 29, 48, 10, 31,
                                   35, 54, 21, 50, 41, 57, 63, 6, 12, 18, 24, 27,
                                   33, 39, 16, 37, 45, 47, 30, 53, 49, 56, 62, 11,
                                   23, 32, 36, 44, 52, 55, 61, 22, 43, 51, 60, 42, 59, 58};

int fastlog2(unsigned long long x) {
    return log2_table[(x * 0x218a392cd3d5dbfULL) >> 58];
}
Test code:
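#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;
    for (i = 0; i < 64; i++) {
        unsigned long long x = 1ULL << i;
        /* %llx so the value after the "0x" prefix really is hexadecimal */
        printf("0x%llx -> %d\n", x, fastlog2(x));
    }
    return 0;
}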
The magic 64-bit number is an order-6 binary De Bruijn sequence.
Multiplying by a power of 2 is equivalent to shifting this number up by a certain number of places.
This means that the top 6 bits of the multiplication result correspond to a different subsequence of 6 digits for each input number. The De Bruijn sequence has the property that each subsequence is unique, so we can construct an appropriate lookup table to turn back from subsequence to position of the set bit.
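To make the table construction concrete, the lookup table can be rebuilt from the magic number by shifting it once per bit position. This is a Ruby sketch of that idea, not a port that is known to be faster than anything. The method signatures are made up for illustration, and because Ruby integers do not overflow, the 64-bit wraparound of the C multiply is emulated with a mask:

```ruby
# For each bit position i, the top 6 bits of (magic << i), truncated to
# 64 bits, give the slot of the table that must hold i.
def build_log2_table(magic)
  table = Array.new(64)
  64.times do |i|
    index = ((magic << i) & 0xFFFF_FFFF_FFFF_FFFF) >> 58
    table[index] = i
  end
  table
end

# Same lookup as the C version; x is assumed to have exactly one bit set.
def fastlog2(x, table, magic)
  table[((x * magic) & 0xFFFF_FFFF_FFFF_FFFF) >> 58]
end

magic = 0x218a392cd3d5dbf
table = build_log2_table(magic)
p table                           # should reproduce log2_table from the C code
p fastlog2(1 << 40, table, magic) # position of the single set bit
```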
If you use a modern Intel CPU, you can use the hardware-supported POPCNT ("POPulation CouNT") assembly instruction:
http://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT
With gcc on Unix, you can use the intrinsic:
#include <smmintrin.h>
uint64_t x;
int c = _mm_popcnt_u64(x);
I'm trying to figure out if there's a way to split a string that contains numbers with different digit counts without having to use if/else statements. Is there a direct method for doing so? Here is an example string:
"123456789101112131415161718192021222324252627282930"
So that it would be split into an array containing 1-9 and 10-30 without having to first split the array into single digits, separate it, find the 9, and iterate through combining every 2 elements after the 9.
Here is the current way I would go about doing this to clarify:
single_digits, double_digits = [], []
string = "123456789101112131415161718192021222324252627282930".split('')
single_digits.concat(string.slice!(0, 9)) # concat rather than <<, so the array is not nested
single_digits.map! { |e| e.to_i }
string.each_slice(2) { |num| double_digits << num.join.to_i }
This would give me:
single_digits = [1,2,3,4,5,6,7,8,9]
double_digits = [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]
As long as you can be sure that every number is greater than its predecessor and greater than zero, and every length of number from a single digit to the maximum is represented at least once, you could write this
def split_numbers(str)
  numbers = []
  current = 0
  str.each_char do |ch|
    current = current * 10 + ch.to_i
    if numbers.empty? || current > numbers.last
      numbers << current
      current = 0
    end
  end
  numbers << current if current > 0
  numbers
end
p split_numbers('123456789101112131415161718192021222324252627282930')
output
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
For Anon's example of 192837453572 we get
[1, 9, 28, 37, 45, 357, 2]
Go through each character of the string, collecting single 'digits', until you find a 9 (set a controlling value and increment it by 1), then continue on collecting two digits, until you find 2 consecutive 9's, and continue on.
This can then be written to handle any sequence of numbers such as your example string.
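The idea described above might be sketched like this: read numbers of the current width and widen by one digit each time a number made up entirely of 9s (9, 99, 999, ...) has been read. The method name is hypothetical, and the sketch assumes the string really is the consecutive integers 1, 2, 3, ... written back to back:

```ruby
def split_by_growing_width(str)
  numbers = []
  width = 1 # how many digits the next number has
  pos = 0   # current offset into the string
  while pos < str.length
    n = str[pos, width].to_i
    numbers << n
    pos += width
    width += 1 if n == 10**width - 1 # just read 9, 99, 999, ...: widen
  end
  numbers
end

p split_by_growing_width("123456789101112131415161718192021222324252627282930")
# prints the integers 1 through 30 as an array
```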
You could do this:
str = "123456789101112131415161718192021222324252627282930"
result = str[0..8].split('').map {|e| e.to_i }
result += str[9..-1].scan(/../).map {|e| e.to_i }
It's essentially the same solution as yours, but slightly cleaner (no need to combine the pairs of digits). But yeah, if you want a generalizable solution to an arbitrary length string (including more than just 2 digits), that's a different question than what you seem to be asking.
UPDATE:
Well, I haven't been able to get this question out of my mind, because it seems like there could be a simple, generalizable solution. So here's my attempt. The basic idea is to keep a counter so that you know how many digits the number you want to slice out of the string is.
str = "123456789101112131415161718192021222324252627282930"
result = []
i = 1
str_copy = str.dup # work on a real copy so that slice! does not empty str
until str_copy.empty?
  # the i-th number has as many digits as i itself
  result << str_copy.slice!(0, i.to_s.size).to_i
  i += 1
end
puts result
This generates the desired output, and generalizes to any string of consecutive positive integers starting with 1. I'd be very interested to see other people's improvements to this; it's not super succinct.
I want to split data into chunks of, let's say, 8154 bytes:
data = Zlib::Deflate.deflate(some_very_long_string)
What would be the best way to do that?
I tried to use this:
chunks = data.scan /.{1,8154}/
...but data was lost! data had a size of 11682, but when looping through every chunk and summing up the size I ended up with a total size of 11677. 5 bytes were lost! Why?
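Regexps are not a good way to parse binary data. Without the /m flag, . does not match a newline, so scan silently skips every "\n" byte in data, which is most likely where the 5 bytes went. Use bytes and each_slice to operate on bytes, and use pack 'C*' to convert them back into strings for output or debugging: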
irb> data = File.open("sample.gif", "rb", &:read)
=> "GIF89a\r\x00\r........."
irb> data.bytes.each_slice(10){ |slice| p slice, slice.pack("C*") }
[71, 73, 70, 56, 57, 97, 13, 0, 13, 0]
"GIF89a\r\x00\r\x00"
[247, 0, 0, 0, 0, 0, 0, 0, 51, 0]
"\xF7\x00\x00\x00\x00\x00\x00\x003\x00"
...........
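The byte loss in the question can be reproduced on a tiny input. The following sketch uses a made-up six-byte string in place of the deflated data and shows that the default . skips newline bytes, while the /m flag keeps them:

```ruby
data = "ab\ncd\n".b               # binary string containing two newline bytes

lossy = data.scan(/.{1,4}/)       # "." does not match "\n" by default
p lossy                           # => ["ab", "cd"]
p lossy.sum(&:bytesize)           # => 4, two bytes short

intact = data.scan(/.{1,4}/m)     # with /m, "." matches "\n" as well
p intact                          # => ["ab\nc", "d\n"]
p intact.sum(&:bytesize)          # => 6
```

Even with /m, a regexp still has to walk the whole string, so slicing by byte offsets remains the more direct tool for binary data.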
The accepted answer works, but creates unneeded arrays and is extremely slow for big files.
This alternative works fine and is much faster (500x for a 1MB file and 10kB chunks!):
def get_binary_chunks(string, size)
  # bytesize rather than length, so strings with multibyte characters are chunked completely
  Array.new((string.bytesize + size - 1) / size) { |i| string.byteslice(i * size, size) }
end
For the given example, you'd use it this way:
chunks = get_binary_chunks(data, 8154)
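A quick round trip is an easy way to confirm that byteslice-based chunking loses nothing. The sketch below uses a compact variant built on Range#step (the name byte_chunks is hypothetical) and random bytes as a stand-in for the deflated data, with the sizes from the question:

```ruby
# Split a string into chunks of at most `size` bytes, by byte offset.
def byte_chunks(data, size)
  (0...data.bytesize).step(size).map { |offset| data.byteslice(offset, size) }
end

data = Random.new(1).bytes(11_682) # stand-in for Zlib::Deflate.deflate(...)
chunks = byte_chunks(data, 8154)

p chunks.map(&:bytesize)           # => [8154, 3528]
p chunks.join == data              # => true, nothing was lost
```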