How to convert an octet string to a number? - ruby

I am trying to convert an octet string into an Integer in Ruby.
For now I have this solution:
def octet_string_to_i(str)
  str.bytes.map { |v| format('%08b', v) }.join.to_i(2)
end
It feels like total overkill. I found other questions about converting it to a string but none about an Integer. Is there any way to achieve this without first going through a string as in my solution?
Example values and results:
foobar 112628796121458
barbaz 108170670399866
123456 54091677185334

Using the following code:
str.bytes.reverse_each.with_index.sum {|elem, idx| elem << (idx * 8) }
Basically, we reduce the bytes by adding them to a sum, shifting each byte to its proper position using their index. This avoids having to first convert to a string and back, and achieves the same result.
Credit goes to Sergio Tulentsev for the code.
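The same shift-and-add idea can also be written as a left-to-right fold, which avoids reversing and indexing. This is only a minimal sketch; the method name is borrowed from the question, and nothing here goes through an intermediate string.

```ruby
# Fold the bytes from left to right: shift the running value up by one
# octet, then put the next byte into the freed low 8 bits.
def octet_string_to_i(str)
  str.bytes.inject(0) { |acc, byte| (acc << 8) | byte }
end

puts octet_string_to_i('foobar') # 112628796121458, as in the question
```

An empty string yields 0, because the fold starts from 0.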

Related

`sum` function not working as expected in case of large strings

In Ruby, #sum is used to calculate
Sum of array
Sum of an array based on a function or condition
Sum of ASCII codepoints (ord) in a string (not char array) i.e. 'abcd'.sum # => 394
The problem with the third one is the following
For the string below,
AwotIJHOAIJSRoieJHOjasOIADaoiHAOHJAOIJGOIajdOIQWJTOIGJDOINCOIASORIOGIMAOIMEORIQEMOIGMEOIFMASKDJQOWJGOJOASJOIQWOGIMASOIDMOQWIROQIGJOIAMSFOAIJGIHIWUNVNZMXCNXCKJQOWRIEOGSDGSPOKSDLAMKMROQIJRDFLKMZXOIAJSQPIRKLMAdglkaSFAJOIAJFOIQWJEOIQJKAMCLKACMALKSDLAKWEQANLEIRJRQFIJAOIVAWOTIJHOAIJSROIEJHOJASOIADAOIHAOHJAOIJGOIAJDOIQWJTOIGJDOINCOIASORIOGIMAOIMEORIQEMOIGMASODLQWKEJOIFJLKMALSKQIOWELKMZLXKMFALSFJQOIWEAOISFWIDHGPSODRJAWOPIJHOIDJOIAJTGIOJAORAJWOIJHOFMAOIFMOIPDMOAIPWJTOPIJDOIFjawoiRJOIpjmaioGJIGHAIJRHQHQIUEIvnaksJDNWIORQIOPEGHIDVNAJKNASIPHRQEUITHIUHDNAJSNWIHJQIWJQEOIGOIDVNAKOSDNAOPWPJQOPIWTJQEOIPGDPJFNASPJNQWOIRQWIOTOIVNAKSFNAIOAWOTIJHOAIJSROIEJHOJASOIADAOIHAOHJAOIJGOIAJDOIQWJTOIGJDOINCOIASORIOGIMAOIMEORIQEMOIGMASODLQWKEJOIFJLKMALSKQIOWELKMZLXKMFALSFJQOIWEAOISFWIDHGPSODRJAWOPIJHOIDJoiajTGIOJAORAJWOIJHOFMAOIFMOIPDMOAIPWJTOPIJDOIFJAWOIRJOIPJMAIOGJIGHAIJRHQHQIUEIVNAKSJDNWIORQIOPEGHIDVIPNWIHJQIWJQEOIGOIDVNAKOSDNAOPWPJQOPIWTJqeoIPGDPJFNASPJNQWJQWOIRJgonasKFAWOEJQWOIJOGALKFNASLFKqeqOFIJAOISFJAOISFJAWOI
which is large, (of 1000 characters), the following program doesn't work
putc gets.upcase.sum/~/$/
It works for all other strings of lesser size. The output of the above must be K. But it shows \9
But if I do this
putc gets.upcase.chars.sum(&:ord)/~/$/
It shows K. The former gives the correct output for all the other strings, just not for large ones like this.
What is wrong here?
Sum of ASCII codepoints (ord) in a string (not char array) i.e. 'abcd'.sum # => 394
I've actually never heard of String#sum before, despite being fairly knowledgeable in the language. So I looked it up:
Returns a basic n-bit checksum of the characters in str, where n is the optional Integer parameter, defaulting to 16. The result is simply the sum of the binary value of each byte in str modulo 2**n - 1. This is not a particularly good checksum.
And sure enough, using your example input string, that's why we get:
str.chars.map(&:ord).sum
# => 77090
str.sum
# => 11554
The values are different because String#sum defaults to a 16-bit checksum and 77090 > 2**16 - 1. Indeed, 77090 % 2**16 == 11554.
If you use a larger value for n, the (check)sum is what you expected:
str.sum(100)
#=> 77090
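To see the wrap-around in isolation, here is a small sketch with a made-up input (1000 copies of "Z", not the string from the question). It assumes Ruby 2.4 or later for Array#sum.

```ruby
str = 'Z' * 1000                  # 1000 bytes, each with value 90

full_sum = str.bytes.sum          # plain sum of all byte values: 90000
checksum = str.sum                # default 16-bit checksum: 90000 % 2**16 = 24464

puts full_sum
puts checksum
puts(checksum == full_sum % 2**16)   # true
puts(str.sum(32) == full_sum)        # true: 32 bits are enough to hold 90000
```

Any string whose byte values add up to 65536 or more is affected in the same way.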

How do I convert a spreadsheet "letternamed" column coordinate to an integer?

In spreadsheets I have cells named like "F14", "BE5" or "ALL1". I have the first part, the column coordinate, in a variable and I want to convert it to a 0-based integer column index.
How do I do it, preferably in an elegant way, in Ruby?
I can do it using a brute-force method: I can imagine looping through all letters, converting them to ASCII and adding to a result, but I feel there should be something more elegant/straightforward.
Edit: Example: To keep it simple, I am only talking about the column coordinate (letters). Therefore in the first case (F14) I have "F" as the input and I expect the result to be 5. In the second case I have "BE" as input and I expect to get 56; for "ALL" I want to get 999.
Not sure if this is any clearer than the code you already have, but it does have the advantage of handling an arbitrary number of letters:
class String
  def upcase_letters
    upcase.split(//)
  end
end
module Enumerable
  # Pair each element with its position counted from the end, last element first
  def reverse_with_index
    reverse_each.with_index.to_a
  end
  # Enumerable#sum is built in since Ruby 2.4; only define it on older versions
  unless method_defined?(:sum)
    def sum
      reduce(0, :+)
    end
  end
end
def indexFromColumnName(column_str)
  start = 'A'.ord - 1
  column_str.upcase_letters.map do |c|
    c.ord - start
  end.reverse_with_index.map do |value, digit_position|
    value * (26 ** digit_position)
  end.sum - 1
end
I've added some methods to String and Enumerable because I thought it made the code more readable, but you could inline these or define them elsewhere if you don't like that sort of thing.
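If you would rather not reopen core classes, the same positional idea can be sketched as a single fold. The name column_index is made up for this example; it assumes the input consists only of the letters A to Z (in either case).

```ruby
# Treat each letter as a digit from 1 to 26 of a base-26 number,
# then subtract 1 at the end to get a 0-based column index.
def column_index(column_str)
  one_based = column_str.upcase.each_char.reduce(0) do |acc, letter|
    acc * 26 + (letter.ord - 'A'.ord + 1)
  end
  one_based - 1
end

puts column_index('F')   # 5
puts column_index('BE')  # 56
puts column_index('ALL') # 999
```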
We can treat the column name as a number that uses the alphabet as
its digits. The last character gives the exact "position", and the
characters before it count how many full "laps" of the alphabet came
earlier, e.g.
def column_to_integer(column_name)
  letters = /[A-Z]+/.match(column_name).to_s.split("")
  # Every letter before the last one adds full laps of 26 columns
  laps = letters[0...-1].reduce(0) { |acc, c| acc * 26 + (c.ord - 'A'.ord + 1) } * 26
  position = letters.last.ord - 'A'.ord
  laps + position
end
Using decimal representation (ord) and the math tricks seems a neat
solution at first, but it has some pain points regarding the
implementation. We have magic numbers (26) and constants ('A'.ord) all
over.
One solution is to give our code better knowledge about our domain, i.e.
the alphabet. In that case, we can replace the ord arithmetic with the position of
each character in the alphabet (because it's already sorted in a zero-based array), e.g.
ALPHABET = ('A'..'Z').to_a
def column_to_integer(column_name)
  letters = /[A-Z]+/.match(column_name).to_s.split("")
  # Every letter before the last one adds full laps of the alphabet
  laps = letters[0...-1].reduce(0) { |acc, c| acc * ALPHABET.size + ALPHABET.index(c) + 1 } * ALPHABET.size
  position = ALPHABET.index(letters.last)
  laps + position
end
The final result:
> column_to_integer('F5')
=> 5
> column_to_integer('AK14')
=> 36
HTH. Best!
I have found a particularly neat way to do this conversion:
def index_from_column_name(colname)
  s = colname.size
  (colname.to_i(36) - (36**s - 1).div(3.5)).to_s(36).to_i(26) + (26**s - 1) / 25 - 1
end
Explanation why it works
(warning spoiler ;) ahead). Basically we are doing this
(colname.to_i(36)-('A'*colname.size).to_i(36)).to_s(36).to_i(26)+('1'*colname.size).to_i(26)-1
which in plain English means that we are interpreting colname as a base-26 number. Before we can do that, we need to interpret all A's as 1, B's as 2, etc. If only this were needed, it would be even simpler, namely
(colname.to_i(36) - ('9'*colname.size).to_i(36)).to_s(36).to_i(26)-1
Unfortunately there are Z characters present, which would need to be interpreted as 10 (base 26), so we need a little trick. We shift every digit by 1 more than needed and then add it back at the end (to every digit in the original colname).
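To make the trick easier to follow, here is a sketch that spells out each intermediate value for the input "BE". The helper name column_trick_steps and the hash keys are made up for illustration.

```ruby
# Returns every intermediate value of the base-36 / base-26 trick.
def column_trick_steps(colname)
  s = colname.size
  base36    = colname.to_i(36)             # "BE" -> 11 * 36 + 14 = 410
  all_as    = ('A' * s).to_i(36)           # "AA" -> 10 * 36 + 10 = 370
  shifted   = (base36 - all_as).to_s(36)   # 40 -> "14": each letter as its 0-based value
  as_base26 = shifted.to_i(26)             # "14" -> 1 * 26 + 4 = 30
  all_ones  = ('1' * s).to_i(26)           # "11" -> 27: adds 1 back to every digit
  { base36: base36, all_as: all_as, shifted: shifted,
    as_base26: as_base26, all_ones: all_ones,
    index: as_base26 + all_ones - 1 }      # 30 + 27 - 1 = 56
end

p column_trick_steps('BE')
```

The closed forms in the one-liner are the same constants: (36**s - 1).div(3.5) equals ('A' * s).to_i(36), and (26**s - 1) / 25 equals ('1' * s).to_i(26).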

Generating integer within range from unique string in ruby

I have code that should take a unique string (for example, "d86c52ec8b7e8a2ea315109627888fe6228d") from a client and return an integer greater than 2200000000 and less than 5800000000. It's important that the generated integer is not random: the same unique string must always map to the same integer. What is the best way to generate it without using a DB?
Now it looks like this:
did = "d86c52ec8b7e8a2ea315109627888fe6228d"
min_cid = 2200000000
max_cid = 5800000000
cid = did.hash.abs.to_s.split.last(10).to_s.to_i
if cid < min_cid
  cid += min_cid
else
  while cid > max_cid
    cid -= 1000000000
  end
end
Here's the problem: your range of numbers has only 3.6x10^9 possible values, whereas your sample unique string (which looks like a hex integer with 36 digits) has 16^36 possible values (i.e. many more). So when mapping your string into your integer range there will be collisions.
The mapping function itself can be pretty straightforward, I would do something such as below (also, consider using only a part of the input string for integer conversion, e.g. the first seven digits, if performance becomes critical):
def my_hash(str, min, max)
  range = (max - min).abs
  (str.to_i(16) % range) + min
end
my_hash(did, min_cid, max_cid) # => 2461595789
[Edit] If you are using Ruby 1.8 and your adjusted range can be represented as a Fixnum, just use the hash value of the input string object instead of parsing it as a big integer. Note that this strategy might not be safe in Ruby 1.9 (per the comment by @DataWraith) as object hash values may be randomized between invocations of the interpreter so you would not get the same hash number for the same input string when you restart your application:
def hash_range(obj, min, max)
  (obj.hash % (max-min).abs) + [min, max].min
end
hash_range(did, min_cid, max_cid) # => 3886226395
And, of course, you'll have to decide what to do about collisions. You'll likely have to persist a bucket of input strings which map to the same value and decide how to resolve the conflicts if you are looking up by the mapped value.
You could generate a 32-bit CRC, drop one bit, and add the result to 2.2 billion (your minimum). That gives you a max value of about 4.35 billion.
Alternately you could use all 32 bits of the CRC, but when the result is too large, append a zero to the input string and recalculate, repeating until you get a value in range.
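The retry variant could look roughly like the sketch below. It assumes Zlib.crc32 from Ruby's standard library as the CRC; the helper name cid_from_crc and the constants are made up for this example. Unlike String#hash, the CRC is stable across interpreter runs, but collisions are still possible.

```ruby
require 'zlib'

MIN_CID = 2_200_000_000
MAX_CID = 5_800_000_000

# Maps a string to an integer strictly between MIN_CID and MAX_CID.
def cid_from_crc(str)
  candidate = str
  loop do
    # Zlib.crc32 returns an unsigned 32-bit value, so cid is always above MIN_CID.
    cid = MIN_CID + 1 + Zlib.crc32(candidate)
    return cid if cid < MAX_CID
    # Result too large: append a zero to the input and recalculate.
    candidate += '0'
  end
end

did = 'd86c52ec8b7e8a2ea315109627888fe6228d'
puts cid_from_crc(did)
```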

Calculating the size of an Array pack struct format in Ruby

In the case of e.g. ddddd, d is the native format for the system, so I can't know exactly how big it will be.
In python I can do:
import struct
print struct.calcsize('ddddd')
Which will return 40.
How do I get this in Ruby?
I haven't found a built-in way to do this, but I've had success with this small function when I know I'm dealing with only numeric formats:
def calculate_size(format)
  # Only for numeric formats; String formats will raise a TypeError
  elements = 0
  # Each directive is a letter with an optional repeat count, e.g. "d" or "d12"
  format.scan(/[a-zA-Z](\d*)/) do |(count)|
    elements += count.empty? ? 1 : count.to_i
  end
  ([0] * elements).pack(format).length
end
This constructs an array of the proper number of zeros, calls pack() with your format, and returns the length (in bytes). Zeros work in this case because they're convertible to each of the numeric formats (integer, double, float, etc).
I don't know of a shortcut but you can just pack one and ask how long it is:
length_of_five_packed_doubles = 5 * [1.0].pack('d').length
By the way, a Ruby array combined with the pack method appears to be functionally equivalent to Python's struct module. Ruby pretty much copied Perl's pack and unpack, putting pack on the Array class and unpack on String.
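As a quick sketch of the "pack one and measure it" approach, the snippet below prints the packed size of a few numeric directives. The helper name packed_size is made up; only Array#pack and String#bytesize are used.

```ruby
# Pack a single zero with the given directive and measure the result in bytes.
def packed_size(directive)
  [0].pack(directive).bytesize
end

%w[c s l q f d].each do |directive|
  puts "#{directive}: #{packed_size(directive)} bytes"
end

# A repeat count packs several values with the same directive.
five_doubles = ([0.0] * 5).pack('d5')
puts five_doubles.bytesize # 5 times the size of one 'd'
```

On a platform where a native double is 8 bytes, the last line prints 40, matching the Python result in the question.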

How to find all brotherhood strings?

I have a string, and another text file which contains a list of strings.
We call 2 strings "brotherhood strings" when they're exactly the same after sorting alphabetically.
For example, "abc" and "cba" will be sorted into "abc" and "abc", so the original two are brotherhood. But "abc" and "aaa" are not.
So, is there an efficient way to pick out all brotherhood strings from the text file, according to the one string provided?
For example, we have "abc" and a text file which writes like this:
abc
cba
acb
lalala
then "abc", "cba", "acb" are the answers.
Of course, "sort & compare" is a nice try, but by "efficient" I mean a way to determine whether a candidate string is a brotherhood string of the original after processing it in one pass.
That is the most efficient way, I think. After all, you cannot tell the answer without even reading the candidate strings. Sorting usually needs more than one pass over the candidate string. So a hash table might be a good solution, but I have no idea what hash function to choose.
Most efficient algorithm I can think of:
Set up a hash table for the original string. Let each letter be the key, and the number of times the letter appears in the string be the value. Call this hash table inputStringTable.
Parse the input string, and each time you see a character, increment the value of its hash entry by one.
For each string in the file:
  Create a new hash table. Call this one brotherStringTable.
  For each character in the string, add one to its entry in brotherStringTable. If brotherStringTable[character] > inputStringTable[character], this string is not a brother (one character shows up too many times).
  Once the string is parsed, compare each inputStringTable value with the corresponding brotherStringTable value. If one is different, then this string is not a brother string. If all match, then the string is a brother string.
This will be O(nk), where n is the length of the input string (any strings longer than the input string can be discarded immediately) and k is the number of strings in the file. Any sort based algorithm will be O(nk lg n), so in certain cases, this algorithm is faster than a sort based algorithm.
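A compact Ruby sketch of the counting approach follows. It assumes Ruby 2.7 or later for Enumerable#tally and leaves out the early exit for brevity; the method name brother_strings is made up for this example.

```ruby
# Select every candidate whose letter counts match those of the key string.
def brother_strings(key, candidates)
  key_counts = key.each_char.tally # e.g. "abc" -> {"a"=>1, "b"=>1, "c"=>1}
  candidates.select do |candidate|
    # Strings of a different length can be discarded without counting.
    candidate.length == key.length && candidate.each_char.tally == key_counts
  end
end

p brother_strings('abc', %w[abc cba acb lalala]) # ["abc", "cba", "acb"]
```

Hash equality in Ruby ignores insertion order, so the two count tables can be compared directly.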
Sorting each string, then comparing it, works out to something like O(N*(k + S log S)), where N is the number of strings, k is the search key length, and S is the average string length.
It seems like counting the occurrences of each character might be a possible way to go here (assuming the strings are of a reasonable length). That gives you O(k+N*S). Whether that's actually faster than the sort & compare is obviously going to depend on the values of k, N, and S.
I think that in practice, the cache-thrashing effect of re-writing all the strings in the sorting case will kill performance, compared to any algorithm that doesn't modify the strings...
iterate, sort, compare. that shouldn't be too hard, right?
Let's assume your alphabet is from 'a' to 'z' and you can index an array based on the characters. Then, for each element in a 26 element array, you store the number of times that letter appears in the input string.
Then you go through the set of strings you're searching, and iterate through the characters in each string. You can decrement the count associated with each letter in (a copy of) the array of counts from the key string.
If you finish your loop through the candidate string without having to stop, and you have seen the same number of characters as there were in the input string, it's a match.
This allows you to skip the sorts in favor of a constant-time array copy and a single iteration through each string.
EDIT: Upon further reflection, this is effectively sorting the characters of the first string using a bucket sort.
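The decrementing variant with a 26-element array might be sketched like this. It assumes the key consists only of the letters 'a' to 'z'; the names letter_counts and brother? are made up for illustration.

```ruby
A_ORD = 'a'.ord

# Count how often each letter 'a'..'z' occurs in the key string.
def letter_counts(key)
  counts = Array.new(26, 0)
  key.each_byte { |b| counts[b - A_ORD] += 1 }
  counts
end

# Walk the candidate once, decrementing a copy of the key's counts.
def brother?(key_counts, key_length, candidate)
  return false unless candidate.bytesize == key_length
  counts = key_counts.dup
  candidate.each_byte do |b|
    idx = b - A_ORD
    return false unless idx.between?(0, 25) # not a letter from 'a' to 'z'
    counts[idx] -= 1
    return false if counts[idx].negative?   # letter used more often than in the key
  end
  true
end

key = 'abc'
counts = letter_counts(key)
p %w[abc cba acb lalala].select { |word| brother?(counts, key.bytesize, word) }
```

Because the lengths are equal and no count ever drops below zero, every count must be exactly zero at the end, so no final comparison is needed.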
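I think what will help you is a test of whether two strings are anagrams. Here is how you can do it. I am assuming for now that the strings contain only single-byte characters (256 possible values).
#include <stdbool.h>
#include <string.h>

#define NUM_ALPHABETS 256
int alphabets[NUM_ALPHABETS];

bool isAnagram(const char *src, const char *dest) {
    size_t len1 = strlen(src);
    size_t len2 = strlen(dest);
    size_t i;
    if (len1 != len2)
        return false;
    memset(alphabets, 0, sizeof(alphabets));
    /* Count each character of src; the cast keeps bytes above 127 in range. */
    for (i = 0; i < len1; i++)
        alphabets[(unsigned char)src[i]]++;
    /* Use up the counts with dest; a negative count means a mismatch. */
    for (i = 0; i < len2; i++) {
        if (--alphabets[(unsigned char)dest[i]] < 0)
            return false;
    }
    return true;
}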
This will run in O(mn) if you have 'm' strings in the file of average length 'n'
Sort your query string
Iterate through the Collection, doing the following:
Sort current string
Compare against query string
If it matches, this is a "brotherhood" match, save it/index/whatever you want
That's pretty much it. If you're doing lots of searching, presorting all of your collection will make the routine a lot faster (at the cost of extra memory). If you are doing this even more, you could pre-sort and save a dictionary (or some hashed collection) based off the first character, etc, to find matches much faster.
It's fairly obvious that each brotherhood string will have the same histogram of letters as the original. It is trivial to construct such a histogram, and fairly efficient to test whether the input string has the same histogram as the test string ( you have to increment or decrement counters for twice the length of the input string ).
The steps would be:
construct histogram of test string ( zero an array int histogram[128] and increment position for each character in test string )
for each input string
for each character in input string c, test whether histogram[c] is zero. If it is, it is a non-match and restore the histogram.
decrement histogram[c]
to restore the histogram, traverse the input string back to its start incrementing rather than decrementing
At most, it requires two increments/decrements of an array for each character in the input.
The most efficient answer will depend on the contents of the file. Any algorithm we come up with will have complexity proportional to N (number of words in file) and L (average length of the strings) and possibly V (variety in the length of strings)
If this were a real world situation, I would start with KISS and not try to overcomplicate it. Checking the length of the target string is simple but could help avoid lots of nlogn sort operations.
target = sort_characters("target string")
count = 0
foreach (word in inputfile){
    if target.len == word.len && target == sort_characters(word){
        count++
    }
}
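In Ruby, the pseudocode above could be sketched as follows. It collects the matching words instead of counting them, and it takes the words from an array rather than a file; both choices, along with the method names, are assumptions of this example.

```ruby
# Sort the characters of a string, e.g. "cba" -> "abc".
def sorted_chars(str)
  str.chars.sort.join
end

def brotherhood_matches(target, words)
  sorted_target = sorted_chars(target)
  words.select do |word|
    # The cheap length check comes first so most words are never sorted.
    word.length == target.length && sorted_chars(word) == sorted_target
  end
end

p brotherhood_matches('abc', %w[abc cba acb lalala]) # ["abc", "cba", "acb"]
```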
I would recommend:
for each string in text file :
compare size with "source string" (size of brotherhood strings should be equal)
compare hashes (CRC or default framework hash should be good)
in case of equality, do a finer compare with string sorted.
It's not the fastest algorithm but it will work for any alphabet/encoding.
Here's another method, which works if you have a relatively small set of possible "letters" in the strings, or good support for large integers. Basically consists of writing a position-independent hash function...
Assign a different prime number for each letter:
prime['a']=2;
prime['b']=3;
prime['c']=5;
Write a function that runs through a string, repeatedly multiplying the prime associated with each letter into a running product
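long long key(const char *string)
{
    long long product = 1;
    /* Multiply in the prime for each character, stopping at the terminator. */
    for (; *string; string++) {
        product *= prime[(unsigned char)*string];
    }
    return product;
}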
This function will return an integer that is unique for any collection of letters (as long as the product does not overflow), independent of the order in which they appear in the string. Once you've got the value for the "key", you can go through the list of strings to match, and perform the same operation.
Time complexity of this is O(N), of course. You can even re-generate the (sorted) search string by factoring the key. The disadvantage, of course, is that the keys do get large pretty quickly if you have a large alphabet.
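Ruby integers grow as needed, so the prime-product key can be sketched there without overflow concerns, although large keys get slower to multiply and compare. The sketch assumes lowercase letters 'a' to 'z' only; PRIME_FOR and anagram_key are made-up names.

```ruby
# The first 26 primes, one per letter from 'a' to 'z'.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101].freeze
PRIME_FOR = ('a'..'z').zip(PRIMES).to_h.freeze

# Order-independent key: the product of the primes of all letters.
# Raises KeyError for any character outside 'a'..'z'.
def anagram_key(str)
  str.each_char.reduce(1) { |product, ch| product * PRIME_FOR.fetch(ch) }
end

puts anagram_key('abc') # 2 * 3 * 5 = 30
puts anagram_key('cba') # 30 again, the order does not matter
puts anagram_key('aaa') # 2 * 2 * 2 = 8
```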
Here's an implementation. It creates a dict of the letters of the master, and a string version of the same as string comparisons will be done at C speed. When creating a dict of the letters in a trial string, it checks against the master dict in order to fail at the first possible moment - if it finds a letter not in the original, or more of that letter than the original, it will fail. You could replace the strings with integer-based hashes (as per one answer regarding base 26) if that proves quicker. Currently the hash for comparison looks like a3b1c2 for abacca; the letters are sorted so that equal dicts always produce equal strings.
This should work out O(N log( min(M,K) )) for N strings of length M and a reference string of length K, and requires the minimum number of lookups of the trial string.
master = "abc"
wordset = "def cba accb aepojpaohge abd bac ajghe aegage abc".split()

def dictmaster(s):
    # Count how often each letter occurs in the master string.
    charmap = {}
    for char in s:
        charmap[char] = charmap.get(char, 0) + 1
    return charmap

def dicttrial(s, mastermap):
    # Count the letters of the trial string, failing at the first letter
    # that is missing from the master or occurs more often than in the master.
    trialmap = {}
    for char in s:
        if char not in mastermap:
            return False
        trialmap[char] = trialmap.get(char, 0) + 1
        if trialmap[char] > mastermap[char]:
            return False
    return trialmap

def dicttostring(counts):
    if counts is False:
        return False
    # Sort the letters so that equal dicts always give the same string.
    return "".join(char + str(counts[char]) for char in sorted(counts))

def testtrial(s, master, mastermap, masterhashstring):
    if len(master) != len(s):
        return False
    return dicttostring(dicttrial(s, mastermap)) == masterhashstring

mastermap = dictmaster(master)
masterhashstring = dicttostring(mastermap)
for word in wordset:
    if testtrial(word, master, mastermap, masterhashstring):
        print(word)

Resources