More elegant, simpler way to convert code point to UTF-8


For this question I created the following Lua code that converts a Unicode code point to a UTF-8 character string. Is there a better way to do this (in Lua 5.1+)? "Better" in this case means "drastically more efficient, or—preferably—far fewer lines of code".
Note: I'm not really asking for a code review of this algorithm; I'm asking for a better algorithm (or built-in library).
do
local bytebits = {
{0x7F,{0,128}},
{0x7FF,{192,32},{128,64}},
{0xFFFF,{224,16},{128,64},{128,64}},
{0x1FFFFF,{240,8},{128,64},{128,64},{128,64}}
}
function utf8(decimal)
local charbytes = {}
for b,lim in ipairs(bytebits) do
if decimal<=lim[1] then
for i=b,1,-1 do
local prefix,max = lim[i+1][1],lim[i+1][2]
local mod = decimal % max
charbytes[i] = string.char( prefix + mod )
decimal = ( decimal - mod ) / max
end
break
end
end
return table.concat(charbytes)
end
end
c=utf8(0x24) print(c.." is "..#c.." bytes.") --> $ is 1 bytes.
c=utf8(0xA2) print(c.." is "..#c.." bytes.") --> ¢ is 2 bytes.
c=utf8(0x20AC) print(c.." is "..#c.." bytes.") --> € is 3 bytes.
c=utf8(0xFFFF) print(c.." is "..#c.." bytes.") --> is 3 bytes.
c=utf8(0x10000) print(c.." is "..#c.." bytes.") --> 𐀀 is 4 bytes.
c=utf8(0x24B62) print(c.." is "..#c.." bytes.") --> 𤭢 is 4 bytes.
I feel like there ought to be a way to get rid of the whole bytebits predefined table and loop just to find the matching entry. Looping from the back I could continually %64 and add 128 to form the continuation bytes until the value was below 128, but I can't figure out how to elegantly generate the 0/110/1110/11110 preamble to add on.
Edit: Here's a slightly better reworking, with a speed optimization. This is not an acceptable answer, though, since the algorithm is still basically the same idea and about the same amount of code.
do
local bytemarkers = { {0x7FF,192}, {0xFFFF,224}, {0x1FFFFF,240} }
function utf8(decimal)
if decimal<128 then return string.char(decimal) end
local charbytes = {}
for bytes,vals in ipairs(bytemarkers) do
if decimal<=vals[1] then
for b=bytes+1,2,-1 do
local mod = decimal%64
decimal = (decimal-mod)/64
charbytes[b] = string.char(128+mod)
end
charbytes[1] = string.char(vals[2]+decimal)
break
end
end
return table.concat(charbytes)
end
end

Lua 5.3 provides a basic UTF-8 library, among which the function utf8.char is what you are looking for:
Receives zero or more integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.
c = utf8.char(0x24) print(c.." is "..#c.." bytes.") --> $ is 1 bytes.
c = utf8.char(0xA2) print(c.." is "..#c.." bytes.") --> ¢ is 2 bytes.
c = utf8.char(0x20AC) print(c.." is "..#c.." bytes.") --> € is 3 bytes.
c = utf8.char(0xFFFF) print(c.." is "..#c.." bytes.") --> is 3 bytes.
c = utf8.char(0x10000) print(c.." is "..#c.." bytes.") --> 𐀀 is 4 bytes.
c = utf8.char(0x24B62) print(c.." is "..#c.." bytes.") --> 𤭢 is 4 bytes.

If we're talking about speed, the usage pattern in a real world scenario is very important. But here, we're in a vacuum, so let's proceed anyway.
This algorithm is probably what you're looking for when you say you think you ought to be able to get rid of bytebits:
do
local string_char = string.char
function utf8(cp)
if cp < 128 then
return string_char(cp)
end
local s = ""
local prefix_max = 32
while true do
local suffix = cp % 64
s = string_char(128 + suffix)..s
cp = (cp - suffix) / 64
if cp < prefix_max then
return string_char((256 - (2 * prefix_max)) + cp)..s
end
prefix_max = prefix_max / 2
end
end
end
It also includes some other optimizations which aren't particularly interesting, and for me is about 2x as fast as your optimized given code. (As a bonus, it should work all the way up to U+7FFFFFFF as well.)
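For readers who want to cross-check the loop above outside Lua, here is a minimal Python sketch of the same idea: peel off 6-bit continuation bytes from the right, and halve the limit for the lead byte after each one. The name `utf8_encode` is made up for this illustration, and the comparison against Python's built-in encoder only covers code points up to U+10FFFF (Python rejects anything larger, as well as surrogates).

```python
def utf8_encode(cp):
    """Encode one code point using the shrinking-prefix loop (illustrative only)."""
    if cp < 0x80:
        return bytes([cp])
    tail = []          # continuation bytes, collected last-to-first
    prefix_max = 32    # a 2-byte sequence leaves 5 payload bits in the lead byte
    while True:
        tail.append(0x80 | (cp & 0x3F))   # 10xxxxxx
        cp >>= 6
        if cp < prefix_max:
            lead = (256 - 2 * prefix_max) + cp   # 110xxxxx, 1110xxxx, 11110xxx, ...
            return bytes([lead] + tail[::-1])
        prefix_max //= 2

# Compare with the built-in encoder for the code points used in the question.
for cp in (0x24, 0xA2, 0x20AC, 0xFFFF, 0x10000, 0x24B62):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

The `256 - 2 * prefix_max` term is what generates the 110/1110/11110 preamble without a lookup table.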
If we want to micro-optimize even more, the loop can be unrolled to:
do
local string_char = string.char
function utf8_unrolled(cp)
if cp < 128 then
return string_char(cp)
end
local suffix = cp % 64
local c4 = 128 + suffix
cp = (cp - suffix) / 64
if cp < 32 then
return string_char(192 + cp, c4)
end
suffix = cp % 64
local c3 = 128 + suffix
cp = (cp - suffix) / 64
if cp < 16 then
return string_char(224 + cp, c3, c4)
end
suffix = cp % 64
cp = (cp - suffix) / 64
return string_char(240 + cp, 128 + suffix, c3, c4)
end
end
This is about 5x as fast as your optimized code, but wholly inelegant. I think the main gains are not having to store intermediate results on the heap and having fewer function calls.
However, the fastest (as far as I can find) approach is not to do the calculation at all:
do
local lookup = {}
for i=0,0x1FFFFF do
lookup[i]=calculate_utf8(i)
end
function utf8(cp)
return lookup[cp]
end
end
This is about 30x as fast as your optimized code which may qualify as "drastically more efficient" (although the memory usage is ridiculous). However, it is also not interesting. (A good compromise in some cases would be to use memoization.)
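The memoization compromise mentioned above could look roughly like the following Python sketch. It assumes the standard `functools.lru_cache` decorator and uses Python's built-in encoder as a stand-in for the hand-written calculation; `utf8_cached` is a hypothetical name. Only code points that are actually requested end up in memory, unlike the full table.

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # unbounded cache; pick a size limit if memory matters
def utf8_cached(cp):
    """Encode one code point; repeated code points are served from the cache."""
    return chr(cp).encode("utf-8")

utf8_cached(0x20AC)               # computed on first use
utf8_cached(0x20AC)               # answered from the cache
print(utf8_cached.cache_info())   # shows hit and miss counters
```

Whether this beats recomputing depends on how often code points repeat in the real workload, so it would need to be measured.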
Of course, any pure C implementation is likely to be faster than any calculation done in Lua.

Related

Performing checksum calculation on python bytes type

First time I need to work on raw data (with different endianness, 2's complement, ...) and thus finally figured out how to work with the bytes type.
I need to implement the following checksum algorithm. I understand the C code, but wonder how to gracefully do this in Python3...
I'm sure I could come up with something that works, but would be terribly inefficient or unreliable
The checksum algorithm used is the 8-bit Fletcher algorithm. This algorithm works as follows:
Buffer[N] is an array of bytes that contains the data over which the checksum is to be calculated.
The two CK_A and CK_B values are 8-bit unsigned integers, only! If implementing with larger-sized integer values, make sure to mask both
CK_A and CK_B with the value 0xff after both operations in the loop.
After the loop, the two U1 values contain the checksum, transmitted after the message payload, which concludes the frame.
CK_A = 0, CK_B = 0
For (I = 0; I < N; I++)
{
    CK_A = CK_A + Buffer[I]
    CK_B = CK_B + CK_A
}
My data structure is as follows:
source = b'\xb5b\x01<#\x00\x01\x00\x00\x00hUX\x17\xdd\xff\xff\xff^\xff\xff\xff\xff\xff\xff\xff\xa6\x00\x00\x00F\xee\x88\x01\x00\x00\x00\x00\xa5\xf5\xd1\x05d\x00\x00\x00d\x00\x00\x00j\x00\x00\x00d\x00\x00\x00\xcb\x86\x00\x00\x00\x00\x00\x007\x01\x00\x00\xcd\xa2'
I came up with a couple of ideas on how to do this but have issues.
The following is where I am now, I've added comments on how I think it would work (but doesn't).
for b in source[5:-2]:
    # The following results in "TypeError("can't concat int to bytes")"
    # So I take one element of a byte, then I would expect to get a single byte.
    # However, I get an int.
    # Should I convert the left part of the operation to an int first?
    # I suppose I could get this done in a couple of steps but it seems this can't be the "correct" way...
    CK_A[-1:] += b
    # I hoped the following would work as a bitmask,
    # (by keeping only the last byte) thus "emulating" an uint8_t
    # Might not be the correct/best assumption...
    CK_A = CK_A[-1:]
    CK_B[-1:] += CK_A
    CK_B = CK_B[-1:]
ret = CK_A + CK_B
Clearly, I do not completely grasp how this Bytes type works/should be used.

Seems I was making things too difficult...
CK_A = 0
CK_B = 0
for b in source:  # iterating over bytes yields ints, so plain integer arithmetic works
    CK_A += b
    CK_B += CK_A
    CK_A %= 0x100
    CK_B %= 0x100
ret = int.to_bytes(CK_A, 1, 'big') + int.to_bytes(CK_B, 1, 'big')
The %=0x100 works as a bit mask, leaving only the 8 LSB...
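As a small self-contained sketch of the same loop, the function below wraps it up and uses `& 0xFF` as the mask, as the specification text suggests. The name `fletcher8` is made up here, and it is left to the caller to pass only the bytes that the checksum is meant to cover (which part of the frame that is depends on the protocol and is not settled by this sketch).

```python
def fletcher8(data):
    """Return the two 8-bit Fletcher checksum bytes (CK_A, CK_B) for data."""
    ck_a = 0
    ck_b = 0
    for byte in data:                  # each element of a bytes object is an int 0..255
        ck_a = (ck_a + byte) & 0xFF    # keep only the low 8 bits, like a uint8
        ck_b = (ck_b + ck_a) & 0xFF
    return bytes([ck_a, ck_b])

print(fletcher8(b"\x01\x02"))   # b'\x03\x04'
```

Building the result with `bytes([ck_a, ck_b])` avoids the separate `to_bytes` calls, since both values are already in the range 0..255.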

Caesar's cypher encryption algorithm

Caesar's cypher is the simplest encryption algorithm. It adds a fixed value to the ASCII (unicode) value of each character of a text. In other words, it shifts the characters. Decrypting a text is simply shifting it back by the same amount, that is, it subtracts the same value from the characters.
My task is to write a function that:
accepts two arguments: the first is the character vector to be encrypted, and the second is the shift amount.
returns one output, which is the encrypted text.
needs to work with all the visible ASCII characters from space to ~ (ASCII codes of 32 through 126). If the shifted code goes outside of this range, it should wrap around. For example, if we shift ~ by 1, the result should be space. If we shift space by -1, the result should be ~.
This is my MATLAB code:
function [coded] = caesar(input_text, shift)
x = double(input_text); %converts char symbols to double format
for ii = 1:length(x) %go through each element
if (x(ii) + shift > 126) & (mod(x(ii) + shift, 127) < 32)
x(ii) = mod(x(ii) + shift, 127) + 32; %if the symbol + shift > 126, I make it 32
elseif (x(ii) + shift > 126) & (mod(x(ii) + shift, 127) >= 32)
x(ii) = mod(x(ii) + shift, 127);
elseif (x(ii) + shift < 32) & (126 + (x(ii) + shift - 32 + 1) >= 32)
x(ii) = 126 + (x(ii) + shift - 32 + 1);
elseif (x(ii) + shift < 32) & (126 + (x(ii) + shift - 32 + 1) < 32)
x(ii) = abs(x(ii) - 32 + shift - 32);
else x(ii) = x(ii) + shift;
end
end
coded = char(x); % converts double format back to char
end
I can't seem to make the wrapping conversions correctly (e.g. from 31 to 126, 30 to 125, 127 to 32, and so on). How should I change my code to do that?

Before you even start coding something like this, you should have a firm grasp of how to approach the problem.
The main obstacle you encountered is how to apply the modulus operation to your data, seeing how mod "wraps" inputs to the range of [0 modPeriod-1], while your own data is in the range [32 126]. To make mod useful in this case we perform an intermediate step of shifting of the input to the range that mod "likes", i.e. from some [minVal maxVal] to [0 modPeriod-1].
So we need to find two things: the size of the required shift, and the size of the period of the mod. The first one is easy, since this is just -minVal, which is the negative of the ASCII value of the first character, which is space (written as ' ' in MATLAB). As for the period of the mod, this is just the size of your "alphabet", which happens to be "1 larger than the maximum value, after shifting", or in other words - maxVal-minVal+1. Essentially, what we're doing is the following
input -> shift to 0-based ("mod") domain -> apply mod() -> shift back -> output
Now take a look how this can be written using MATLAB's vectorized notation:
function [coded] = caesar(input_text, shift)
FIRST_PRINTABLE = ' ';
LAST_PRINTABLE = '~';
N_PRINTABLE_CHARS = LAST_PRINTABLE - FIRST_PRINTABLE + 1;
coded = char(mod(input_text - FIRST_PRINTABLE + shift, N_PRINTABLE_CHARS) + FIRST_PRINTABLE);
Here are some tests:
>> caesar('blabla', 1)
ans =
'cmbcmb'
>> caesar('cmbcmb', -1)
ans =
'blabla'
>> caesar('blabla', 1000)
ans =
'5?45?4'
>> caesar('5?45?4', -1000)
ans =
'blabla'
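The same shift, mod, shift-back pipeline can be sketched in Python for comparison. This is only an illustration of the idea, not a replacement for the MATLAB function; it relies on Python's `%` returning a non-negative result for a positive modulus, which matches MATLAB's `mod` for negative shifts.

```python
def caesar(text, shift):
    """Shift printable ASCII (space..~) by shift, wrapping around inside that range."""
    first = ord(" ")
    n_printable = ord("~") - first + 1    # 95 characters in the alphabet
    return "".join(
        chr((ord(ch) - first + shift) % n_printable + first)
        for ch in text
    )

print(caesar("blabla", 1))      # cmbcmb
print(caesar("blabla", 1000))   # 5?45?4
```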

We can solve it using the idea of periodic functions:
A periodic function repeats itself every cycle (for sine, every cycle is 2π long).
Like a periodic function, our mapping repeats itself every 95 values:
the cycle = 126-32+1;
We add one because 32 itself is also part of the cycle.
So if the value of the character exceeds 126, we subtract 95,
i.e. if the value is 127 (bigger than 126) then it is equivalent to
127-95 = 32.
And if the value is less than 32, we add 95,
i.e. if the value is 31 (less than 32) then it is equivalent to
31+95 = 126.
Now we will translate that into code:
function out= caesar(string,shift)
value=string+shift;
for i=1:length(value)
while value(i)<32
value(i)=value(i)+95;
end
while value(i)>126
value(i)=value(i)-95;
end
end
out=char(value);

First I converted the output (shift + text_input) to char.
function coded= caesar(text_input,shift)
coded=char(text_input+shift);
for i=1:length(coded)
while coded(i)<32
coded(i)=coded(i)+95;
end
while coded(i)>126
coded(i)=coded(i)-95;
end
end

Here is one short version:
function coded = caesar(v,n)
C = 32:126;
v = double(v);
for i = 1:length(v)
x = find(C==v(i));
C = circshift(C,-n);
v(i) = C(x);
C = 32:126;
end
coded = char(v);
end

Can this be modified to run faster?

I'm creating a word list using python that hits every combination of characters, which is a monster of a calculation past 94^4. Before you ask where I'm getting 94, 94 covers ASCII characters 32 to 127. Understandably this function runs super slow, and I'm curious if there's a way to make it more efficient.
This is the meat and potatoes of my code.
def CreateTable(name,ASCIIList,size):
    f = open(name + '.txt','w')
    combo = itertools.product(ASCIIList, repeat = size)
    for x in combo:
        passwords = ''.join(x)
        f.write(str(passwords) + '\n')
    f.close()
I'm using this so that I can make lists to use in a brute force where I don't know the length of the passwords or what characters the password contains. Using a list like this I hit every possible combination of words, so I'm sure to hit the right one eventually. As stated earlier, this is a slow program; the list is also slow to read in and will not be my first choice for a brute force. It is more or less a last-ditch effort.
To give you an idea of how long that piece of code runs. I was creating all the combinations of size 5 and ran for 3 hours ending at a little over 50GB.

Warning : I have not tested this code.
I would convert combo to a list: combo_list = list(combo)
I would then break it into chunks:
# https://stackoverflow.com/a/312464/596841
def get_chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
# Change 1000 to whatever works.
chunks = get_chunks(combo_list, 1000)
Next, I would use multithreading to process each chunk:
import threading

class myThread(threading.Thread):
    def __init__(self, chunk_id, chunk):
        threading.Thread.__init__(self)
        self.chunk_id = str(chunk_id)  # used in the messages and in the file name
        self.chunk = chunk
    def run(self):
        print("Starting " + self.chunk_id)
        self.process_data()
        print("Exiting " + self.chunk_id)
    def process_data(self):
        # Each thread writes its own chunk to <chunk_id>.txt
        f = open(self.chunk_id + '.txt','w')
        for item in self.chunk:
            passwords = ''.join(item)
            f.write(str(passwords) + '\n')
        f.close()
I would then do something like this:
threads = []
for i, chunk in enumerate(chunks):
    thread = myThread(i, chunk)
    thread.start()
    threads.append(thread)
# Wait for all threads to complete
for t in threads:
    t.join()
You could then write another script to merge all the output files, if you need.

I did some testing on this, and I think the main problem is that you're writing in text mode.
Binary mode is faster, and you're only dealing with ASCII, so you might as well just spit out bytes rather than strings.
Here's my code:
import itertools
import time
def CreateTable(name,ASCIIList,size):
    f = open(name + '.txt','w')
    combo = itertools.product(ASCIIList, repeat = size)
    for x in combo:
        passwords = ''.join(x)
        f.write(str(passwords) + '\n')
    f.close()
def CreateTableBinary(name,ASCIIList,size):
    f = open(name + '.txt', 'wb')
    combo = itertools.product(ASCIIList, repeat = size)
    for x in combo:
        passwords = bytes(x)
        f.write(passwords)
        f.write(b'\n')
    f.close()
def CreateTableBinaryFast(name,first,last,size):
    f = open(name + '.txt', 'wb')
    x = bytearray(chr(first) * size, 'ASCII')
    while True:
        f.write(x)
        f.write(b'\n')
        i = size - 1
        while (x[i] == last) and (i > 0):
            x[i] = first
            i -= 1
        if i == 0 and x[i] == last:
            break
        x[i] += 1
    f.close()
def CreateTableTheoreticalMax(name,ASCIIList,size):
    f = open(name + '.txt', 'wb')
    combo = range(0, len(ASCIIList)**size)
    passwords = b'A' * size
    for x in combo:
        f.write(passwords)
        f.write(b'\n')
    f.close()
print("writing real file in text mode")
start = time.time()
chars = [chr(x) for x in range(32, 126)]
CreateTable("c:/temp/output", chars, 4)
print("that took ", time.time() - start, "seconds.")
print("writing real file in binary mode")
start = time.time()
chars = bytes(range(32, 126))
CreateTableBinary("c:/temp/output", chars, 4)
print("that took ", time.time() - start, "seconds.")
print("writing real file in fast binary mode")
start = time.time()
CreateTableBinaryFast("c:/temp/output", 32, 125, 4)
print("that took ", time.time() - start, "seconds.")
print("writing fake file at max speed")
start = time.time()
chars = [chr(x) for x in range(32, 126)]
CreateTableTheoreticalMax("c:/temp/output", chars, 4)
print("that took ", time.time() - start, "seconds.")
Output:
writing real file in text mode
that took 101.5869083404541 seconds.
writing real file in binary mode
that took 40.960529804229736 seconds.
writing real file in fast binary mode
that took 35.54869604110718 seconds.
writing fake file at max speed
that took 26.43029284477234 seconds.
So you can see a pretty big improvement just by switching to binary mode.
Also, there still seems to be some slack to take up, since omitting the itertools.product and writing hard-coded bytes is even faster. Maybe you could write your own version of product that directly output bytes-like objects. Not sure about that.
Edit: I had a go at a manual itertools.product working directly on a bytearray. It's a bit faster - see "fast binary mode" in the code.
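One further variation that might be worth timing is to cut down the number of `write` calls by joining many lines into one buffer first. The sketch below is untested for speed and is only an assumption about where the remaining overhead sits; `write_table_batched` and its `batch` parameter are names made up for this illustration, and it writes to any binary file object so it can be tried in memory.

```python
import io
import itertools

def write_table_batched(out, alphabet, size, batch=100_000):
    """Write every combination of alphabet (a bytes object), one per line, in batches."""
    combos = itertools.product(alphabet, repeat=size)   # tuples of ints 0..255
    while True:
        chunk = list(itertools.islice(combos, batch))
        if not chunk:
            break
        # One write call per batch instead of two per combination.
        out.write(b"\n".join(map(bytes, chunk)) + b"\n")

buf = io.BytesIO()
write_table_batched(buf, b"ab", 2, batch=3)
print(buf.getvalue())   # b'aa\nab\nba\nbb\n'
```

For a real run, `out` would be a file opened with `open(path, 'wb')`, and the batch size trades memory for fewer calls.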

How to optimize MATLAB bitwise operations

I have written my own SHA1 implementation in MATLAB, and it gives correct hashes. However, it's very slow (a string of 1000 a's takes 9.9 seconds on my Core i7-2760QM), and I think the slowness is a result of how MATLAB implements bitwise logical operations (bitand, bitor, bitxor, bitcmp) and bitwise shifts (bitshift, bitrol, bitror) of integers.
I especially wonder about the need to construct fixed-point numeric objects for bitrol and bitror using the fi command, because in Intel x86 assembly there are rol and ror instructions for both registers and memory operands of all sizes. However, bitshift is quite fast (it doesn't need any fixed-point numeric constructs, a regular uint64 variable works fine), which makes the situation stranger: why in MATLAB do bitrol and bitror need fixed-point numeric objects constructed with fi, whereas bitshift does not, when at the assembly level it all comes down to shl, shr, rol and ror?
So, before writing this function in C/C++ as a .mex file, I'd be happy to know if there is any way to improve the performance of this function. I know there are some specific optimizations for SHA1, but that's not the issue, if the very basic implementation of bitwise rotations is so slow.
Testing a little bit with tic and toc, it's evident that what makes it slow are the loops with bitrol and fi. There are two such loops:
%# Define some variables.
FFFFFFFF = uint64(hex2dec('FFFFFFFF'));
%# constants: K(1), K(2), K(3), K(4).
K(1) = uint64(hex2dec('5A827999'));
K(2) = uint64(hex2dec('6ED9EBA1'));
K(3) = uint64(hex2dec('8F1BBCDC'));
K(4) = uint64(hex2dec('CA62C1D6'));
W = uint64(zeros(1, 80));
... some other code here ...
%# First slow loop begins here.
for index = 17:80
W(index) = uint64(bitrol(fi(bitxor(bitxor(bitxor(W(index-3), W(index-8)), W(index-14)), W(index-16)), 0, 32, 0), 1));
end
%# First slow loop ends here.
H = sha1_handle_block_struct.H;
A = H(1);
B = H(2);
C = H(3);
D = H(4);
E = H(5);
%# Second slow loop begins here.
for index = 1:80
rotatedA = uint64(bitrol(fi(A, 0, 32, 0), 5));
if (index <= 20)
% alternative #1.
xorPart = bitxor(D, (bitand(B, (bitxor(C, D)))));
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(1);
elseif ((index >= 21) && (index <= 40))
% FIPS.
xorPart = bitxor(bitxor(B, C), D);
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(2);
elseif ((index >= 41) && (index <= 60))
% alternative #2.
xorPart = bitor(bitand(B, C), bitand(D, bitxor(B, C)));
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(3);
elseif ((index >= 61) && (index <= 80))
% FIPS.
xorPart = bitxor(bitxor(B, C), D);
xorPart = bitand(xorPart, FFFFFFFF);
temp = rotatedA + xorPart + E + W(index) + K(4);
else
error('error in the code of sha1_handle_block.m!');
end
temp = bitand(temp, FFFFFFFF);
E = D;
D = C;
C = uint64(bitrol(fi(B, 0, 32, 0), 30));
B = A;
A = temp;
end
%# Second slow loop ends here.
Measuring with tic and toc, the entire computation of SHA1 hash of message abc takes on my laptop around 0.63 seconds, of which around 0.23 seconds is passed in the first slow loop and around 0.38 seconds in the second slow loop. So is there some way to optimize those loops in MATLAB before writing a .mex file?

There's this DataHash from the MATLAB File Exchange that calculates SHA-1 hashes lightning fast.
I ran the following code:
x = 'The quick brown fox jumped over the lazy dog'; %# Just a short sentence
y = repmat('a', [1, 1e6]); %# A million a's
opt = struct('Method', 'SHA-1', 'Format', 'HEX', 'Input', 'bin');
tic, x_hashed = DataHash(uint8(x), opt), toc
tic, y_hashed = DataHash(uint8(y), opt), toc
and got the following results:
x_hashed = F6513640F3045E9768B239785625CAA6A2588842
Elapsed time is 0.029250 seconds.
y_hashed = 34AA973CD4C4DAA4F61EEB2BDBAD27316534016F
Elapsed time is 0.020595 seconds.
I verified the results with a random online SHA-1 tool, and the calculation was indeed correct. Also, the 10^6 a's were hashed ~1.5 times faster than the first sentence.
So how does DataHash do it so fast??? Using the java.security.MessageDigest library, no less!
If you're interested in a fast MATLAB-friendly SHA-1 function, this is the way to go.
However, if this is just an exercise for implementing fast bit-level operations, then MATLAB doesn't really handle them efficiently, and in most cases you'll have to resort to MEX.
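When testing a hand-written SHA-1, it helps to compare against a trusted implementation. As a sketch of such a cross-check outside MATLAB, Python's standard `hashlib` module can reproduce the digest of the million-a input quoted above, as well as the digest of the short message abc used as a timing case in the question.

```python
import hashlib

# One million 'a' characters, the same input as y in the MATLAB snippet above.
million_a = b"a" * 10**6
print(hashlib.sha1(million_a).hexdigest().upper())
# 34AA973CD4C4DAA4F61EEB2BDBAD27316534016F

# The three-byte message "abc".
print(hashlib.sha1(b"abc").hexdigest())
# a9993e364706816aba3e25717850c26c9cd0d89d
```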

why in MATLAB bitrol and bitror need fixed-point numeric objects constructed with fi, whereas bitshift does not
bitrol and bitror are not part of the set of bitwise logic functions that are applicable for uints. They are part of the fixed-point toolbox, which also contains variants of bitand, bitshift etc that apply to fixed-point inputs.
A bitrol could be expressed as two bitshifts, a bitand and a bitor if you want to try using only the uint-functions. That might be even slower though.
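To make the "two shifts, an AND and an OR" decomposition concrete, here is a minimal sketch in Python (chosen only for illustration; the same four operations map onto MATLAB's bitshift, bitand and bitor). The names `rotl32` and `MASK32` are made up for this example.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate the 32-bit value x left by n bits using shifts, OR and a mask."""
    x &= MASK32
    n %= 32
    left = x << n            # bits that move up (may overflow past bit 31)
    right = x >> (32 - n)    # bits that wrap around to the bottom
    return (left | right) & MASK32

print(hex(rotl32(0x12345678, 8)))   # 0x34567812
print(rotl32(0x80000000, 1))        # 1
```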

Like most MATLAB functions, bitand, bitor and bitxor are vectorized, so they run a lot faster if you give them vector input rather than calling them in a loop over each element.
Example:
%# create two sets of 10k random numbers
num = 10000;
hex = '0123456789ABCDEF';
A = uint64(hex2dec( hex(randi(16, [num 16])) ));
B = uint64(hex2dec( hex(randi(16, [num 16])) ));
%# compare loop vs. vectorized call
tic
C1 = zeros(size(A), class(A));
for i=1:numel(A)
C1(i) = bitxor(A(i),B(i));
end
toc
tic
C2 = bitxor(A,B);
toc
assert(isequal(C1,C2))
The timing was:
Elapsed time is 0.139034 seconds.
Elapsed time is 0.000960 seconds.
That's more than two orders of magnitude faster!
The problem is, and as far as I can tell, the SHA-1 computation cannot be well vectorized. So you might not be able to take advantage of such vectorization.
As an experiment, I implemented a pure MATLAB-based function to compute such bit operations:
function num = my_bitops(op,A,B)
%# operation to perform: not, and, or, xor
if ischar(op)
op = str2func(op);
end
%# integer class: uint8, uint16, uint32, uint64
clss = class(A);
depth = str2double(clss(5:end));
%# bit exponents
e = 2.^(depth-1:-1:0);
%# convert to binary
b1 = logical(dec2bin(A,depth)-'0');
if nargin == 3
b2 = logical(dec2bin(B,depth)-'0');
end
%# perform binary operation
if nargin < 3
num = op(b1);
else
num = op(b1,b2);
end
%# convert back to integer
num = sum(bsxfun(@times, cast(num,clss), cast(e,clss)), 2, 'native');
end
Unfortunately, this was even worse in terms of performance:
tic, C1 = bitxor(A,B); toc
tic, C2 = my_bitops('xor',A,B); toc
assert(isequal(C1,C2))
The timing was:
Elapsed time is 0.000984 seconds.
Elapsed time is 0.485692 seconds.
Conclusion: write a MEX function or search the File Exchange to see if someone already did :)

Number crunching in Ruby (optimisation needed)

Ruby may not be the optimal language for this but I'm sort of comfortable working with this in my terminal so that's what I'm going with.
I need to process the numbers from 1 to 666666 and pick out all the numbers that contain a 6 but don't contain 7, 8 or 9. The first number will be 6, the next 16, then 26 and so forth.
Then I needed it printed like this (6=6) (16=6) (26=6) and when I have ranges like 60 to 66 I need it printed like (60 THRU 66=6) (SPSS syntax).
I have this code and it works but it's neither beautiful nor very efficient so how could I optimize it?
(silly code may follow)
class Array
def to_ranges
array = self.compact.uniq.sort
ranges = []
if !array.empty?
# Initialize the left and right endpoints of the range
left, right = array.first, nil
array.each do |obj|
# If the right endpoint is set and obj is not equal to right's successor
# then we need to create a range.
if right && obj != right.succ
ranges << Range.new(left,right)
left = obj
end
right = obj
end
ranges << Range.new(left,right) unless left == right
end
ranges
end
end
write = ""
numbers = (1..666666).to_a
# split each number into an array containing its digits
numbers = numbers.map { |i| i.to_s.split(//) }
# delete the arrays that don't contain 6 and the ones that contain 6 but also 7, 8 or 9
numbers = numbers.delete_if { |i| !i.include?('6') }
numbers = numbers.delete_if { |i| i.include?('7') }
numbers = numbers.delete_if { |i| i.include?('8') }
numbers = numbers.delete_if { |i| i.include?('9') }
# join the ciphers back into the original numbers
numbers = numbers.map { |i| i.join }
numbers = numbers.map { |i| i = Integer(i) }
# rangify consecutive numbers
numbers = numbers.to_ranges
# edit the ranges that go from 1..1 into just 1
numbers = numbers.map do |i|
if i.first == i.last
i = i.first
else
i = i
end
end
# string stuff
numbers = numbers.map { |i| i.to_s.gsub(".."," thru ") }
numbers = numbers.map { |i| "(" + i.to_s + "=6)"}
numbers.each { |i| write << " " + i }
File.open('numbers.txt','w') { |f| f.write(write) }
As I said, it works for numbers even in the millions, but I'd like some advice on how to make it prettier and more efficient.

I deleted my earlier attempt to parlez-vous-ruby? and made up for that. I now have an optimized version of x3ro's excellent example.
$,="\n"
puts ["(0=6)", "(6=6)", *(1.."66666".to_i(7)).collect {|i| i.to_s 7}.collect do |s|
s.include?('6')? "(#{s}0 THRU #{s}6=6)" : "(#{s}6=6)"
end ]
Compared to x3ro's version
... It is down to three lines
... 204.2 x faster (to 66666666)
... has byte-identical output
It uses all my ideas for optimization
gen numbers based on modulo 7 digits (so base-7 numbers)
generate the last digit 'smart': this is what compresses the ranges
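The two ideas can be sketched in Python to make them easier to check on a small range. This is a hedged illustration rather than a port of the Ruby one-liner: `to_base7` and `spss_terms` are hypothetical names, the prefix is every digit except the last, and the bare term for 6 stands in for the empty prefix.

```python
def to_base7(n):
    """Return n written in base 7, so the digits 7, 8 and 9 can never appear."""
    digits = ""
    while n:
        n, d = divmod(n, 7)
        digits = str(d) + digits
    return digits or "0"

def spss_terms(prefix_digits):
    """Yield SPSS terms for all matching numbers with up to prefix_digits + 1 digits."""
    yield "(6=6)"                               # empty prefix: only the number 6
    for i in range(1, 7 ** prefix_digits):
        s = to_base7(i)
        if "6" in s:
            yield f"({s}0 THRU {s}6=6)"         # prefix has a 6: last digit is free (0..6)
        else:
            yield f"({s}6=6)"                   # otherwise the last digit must be 6

print(" ".join(list(spss_terms(1))))
# (6=6) (16=6) (26=6) (36=6) (46=6) (56=6) (60 THRU 66=6)
```

Because each prefix yields exactly one term, the ranges come out already compressed and no separate range-merging pass is needed.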
So... what are the timings? This was testing with 8 digits (to 66666666, or 823544 lines of output):
$ time ./x3ro.rb > /dev/null
real 8m37.749s
user 8m36.700s
sys 0m0.976s
$ time ./my.rb > /dev/null
real 0m2.535s
user 0m2.460s
sys 0m0.072s
Even though the performance is actually good, it isn't even close to the C optimized version I posted before: I couldn't run my.rb to 6666666666 (6x10) because of OutOfMemory. When running to 9 digits, this is the comparative result:
sehe@meerkat:/tmp$ time ./my.rb > /dev/null
real 0m21.764s
user 0m21.289s
sys 0m0.476s
sehe@meerkat:/tmp$ time ./t2 > /dev/null
real 0m1.424s
user 0m1.408s
sys 0m0.012s
The C version is still some 15x faster... which is only fair considering that it runs on the bare metal.
Hope you enjoyed it, and can I please have your votes if only for learning Ruby for the purpose :)
(Can you tell I'm proud? This is my first encounter with ruby; I started the ruby koans 2 hours ago...)
Edit by @johndouthat:
Very nice! The use of base7 is very clever and this is a great job for your first ruby trial :)
Here's a slight modification of your snippet that will let you test 10+ digits without getting an OutOfMemory error:
puts ["(0=6)", "(6=6)"]
(1.."66666666".to_i(7)).each do |i|
s = i.to_s(7)
puts s.include?('6') ? "(#{s}0 THRU #{s}6=6)" : "(#{s}6=6)"
end
# before:
real 0m26.714s
user 0m23.368s
sys 0m2.865s
# after
real 0m15.894s
user 0m13.258s
sys 0m1.724s

Exploiting patterns in the numbers, you can short-circuit lots of the loops, like this:
If you define a prefix as the 100s place and everything before it,
and define the suffix as everything in the 10s and 1s place, then, looping
through each possible prefix:
If the prefix is blank (i.e. you're testing 0-99), then there are 13 possible matches
elsif the prefix contains a 7, 8, or 9, there are no possible matches.
elsif the prefix contains a 6, there are 49 possible matches (a 7x7 grid)
else, there are 13 possible matches.
(the code doesn't yet exclude numbers that aren't specifically in the range, but it's pretty close)
number_range = (1..666_666)
prefix_range = ((number_range.first / 100)..(number_range.last / 100))
for p in prefix_range
ps = p.to_s
# TODO: if p == prefix_range.last or p == prefix_range.first,
# TODO: test to see if number_range.include?("#{ps}6".to_i), etc...
if ps == '0'
puts "(6=6) (16=6) (26=6) (36=6) (46=6) (56=6) (60 thru 66) "
elsif ps =~ /7|8|9/
# there are no candidate suffixes if the prefix contains 7, 8, or 9.
elsif ps =~ /6/
# If the prefix contains a 6, then there are 49 candidate suffixes
for i in (0..6)
print "(#{ps}#{i}0 thru #{ps}#{i}6) "
end
puts
else
# If the prefix doesn't contain 6, 7, 8, or 9, then there are only 13 candidate suffixes.
puts "(#{ps}06=6) (#{ps}16=6) (#{ps}26=6) (#{ps}36=6) (#{ps}46=6) (#{ps}56=6) (#{ps}60 thru #{ps}66) "
end
end
Which prints out the following:
(6=6) (16=6) (26=6) (36=6) (46=6) (56=6) (60 thru 66)
(106=6) (116=6) (126=6) (136=6) (146=6) (156=6) (160 thru 166)
(206=6) (216=6) (226=6) (236=6) (246=6) (256=6) (260 thru 266)
(306=6) (316=6) (326=6) (336=6) (346=6) (356=6) (360 thru 366)
(406=6) (416=6) (426=6) (436=6) (446=6) (456=6) (460 thru 466)
(506=6) (516=6) (526=6) (536=6) (546=6) (556=6) (560 thru 566)
(600 thru 606) (610 thru 616) (620 thru 626) (630 thru 636) (640 thru 646) (650 thru 656) (660 thru 666)
(1006=6) (1016=6) (1026=6) (1036=6) (1046=6) (1056=6) (1060 thru 1066)
(1106=6) (1116=6) (1126=6) (1136=6) (1146=6) (1156=6) (1160 thru 1166)
(1206=6) (1216=6) (1226=6) (1236=6) (1246=6) (1256=6) (1260 thru 1266)
(1306=6) (1316=6) (1326=6) (1336=6) (1346=6) (1356=6) (1360 thru 1366)
(1406=6) (1416=6) (1426=6) (1436=6) (1446=6) (1456=6) (1460 thru 1466)
(1506=6) (1516=6) (1526=6) (1536=6) (1546=6) (1556=6) (1560 thru 1566)
(1600 thru 1606) (1610 thru 1616) (1620 thru 1626) (1630 thru 1636) (1640 thru 1646) (1650 thru 1656) (1660 thru 1666)
etc...
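The per-prefix counts quoted above (13, 0, 49, 13) are easy to confirm by brute force. The following Python sketch counts, for a given prefix, how many two-digit suffixes 00 to 99 give a number that contains a 6 and none of 7, 8, 9; `count_matches` is a name made up for this check, and the blank prefix is modelled as an empty string with a zero-padded suffix.

```python
def count_matches(prefix):
    """Count suffixes 00..99 for which prefix+suffix has a 6 and no 7, 8 or 9."""
    total = 0
    for suffix in range(100):
        digits = f"{prefix}{suffix:02d}"
        if "6" in digits and not set(digits) & set("789"):
            total += 1
    return total

print(count_matches(""))     # 13 (6, 16, ..., 56 and 60..66)
print(count_matches("7"))    # 0
print(count_matches("6"))    # 49 (the 7x7 grid)
print(count_matches("1"))    # 13
```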

Note: I don't speak ruby, but I have since done a ruby version just for speed comparison :)
If you just iterate all numbers from 0 to 117648 (ruby <<< 'print "666666".to_i(7)') and print them in base-7 notation, you'll at least have discarded any numbers containing 7,8,9. This includes the optimization suggestion by MrE, apart from lifting the problem to simple int arithmetic instead of char-sequence manipulations.
All that remains is to check for the presence of at least one 6. This would make the algorithm skip at most 6 items in a row, so I deem it less important (the average number of skippable items on the total range is 40%).
Simple benchmark to 6666666666
(Note that this means outputting 222,009,073 (222M) lines of 6-y numbers)
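The figures quoted in this answer follow from simple counting, which the short Python check below reproduces. It assumes the only facts used are that a base-7 string of k digits has 7^k possibilities and that 6^k of them contain no 6.

```python
# Upper bound of the six-digit loop: "666666" read as a base-7 number.
print(int("666666", 7))         # 117648, which is 7**6 - 1

# Share of six-digit base-7 strings without any 6 (the "skippable" items).
print(round(6**6 / 7**6, 3))    # 0.397, i.e. roughly 40%

# Ten-digit benchmark: strings over digits 0..6 that contain at least one 6.
print(7**10 - 6**10)            # 222009073 lines of output
```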
Staying close to this idea, I wrote this quite highly optimized C code (I don't speak ruby) to demonstrate the idea. I ran it to 282475248 (which is 6666666666 read as a base-7 number) so it was more of a benchmark to measure: 0m26.5s
#include <stdio.h>
static char buf[11];
char* const bufend = buf+10;
char* genbase7(int n)
{
char* it = bufend; int has6 = 0;
do
{
has6 |= 6 == (*--it = n%7);
n/=7;
} while(n);
return has6? it : 0;
}
void asciify(char* rawdigits)
{
do { *rawdigits += '0'; }
while (++rawdigits != bufend);
}
int main()
{
*bufend = 0; // init
long i;
for (i=6; i<=282475248; i++)
{
char* b7 = genbase7(i);
if (b7)
{
asciify(b7);
puts(b7);
}
}
}
I also benchmarked another approach, which unsurprisingly ran in less than half the time because
this version directly manipulates the results in ascii string form, ready for display
this version shortcuts the has6 flag for deeper recursion levels
this version also optimizes the 'twiddling' of the last digit when it is required to be '6'
the code is simply shorter...
Running time: 0m12.8s
#include <stdio.h>
#include <string.h>
/* static: a plain C99 'inline' definition emits no external symbol, so the recursive call could fail to link */
static inline void recursive_permute2(char* const b, char* const m, char* const e, int has6)
{
    if (m<e)
        for (*m = '0'; *m<'7'; (*m)++)
            recursive_permute2(b, m+1, e, has6 || (*m=='6'));
    else
        if (has6)
            for (*e = '0'; *e<'7'; (*e)++)
                puts(b);
        else /* optimize for last digit must be 6 */
            puts((*e='6', b));
}
static inline void recursive_permute(char* const b, char* const e)
{
    recursive_permute2(b, b, e-1, 0);
}
int main()
{
    char buf[] = "0000000000";
    recursive_permute(buf, buf+sizeof(buf)/sizeof(*buf)-1);
}
Benchmarks measured with:
gcc -O4 t6.c -o t6
time ./t6 > /dev/null

$range_start = -1
$range_end = -1
$f = File.open('numbers.txt','w')
def output_number(i)
  if $range_end == i-1
    $range_end = i
  elsif $range_start < $range_end
    $f.puts "(#{$range_start} thru #{$range_end})"
    $range_start = $range_end = i
  else
    $f.puts "(#{$range_start}=6)" if $range_start > 0 # no range, print out previous number
    $range_start = $range_end = i
  end
end
'1'.upto('666') do |n|
  next unless n =~ /6/ # keep only numbers that contain 6
  next if n =~ /[789]/ # remove numbers that contain 7, 8 or 9
  output_number n.to_i
end
# flush whatever is still pending after the loop
if $range_start < $range_end
  $f.puts "(#{$range_start} thru #{$range_end})"
elsif $range_start > 0
  $f.puts "(#{$range_start}=6)"
end
$f.close
puts "Ruby is beautiful :)"

I came up with this piece of code, which I tried to keep more or less in FP style. It is probably not much more efficient. As has been said, with basic number logic you will be able to increase performance, for example by skipping from 19xx to 2000 directly, but that I will leave up to you :)
def check(n)
  n = n.to_s
  n.include?('6') and
    not n.include?('7') and
    not n.include?('8') and
    not n.include?('9')
end
def spss(ranges)
  ranges.each do |range|
    if range.first === range.last
      puts "(" + range.first.to_s + "=6)"
    else
      puts "(" + range.first.to_s + " THRU " + range.last.to_s + "=6)"
    end
  end
end
range = (1..666666)
range = range.select { |n| check(n) }
# start from an empty list; seeding with 0..0 would print a spurious "(0=6)"
range = range.inject([]) do |ranges, n|
  temp = ranges.last
  if temp and temp.last + 1 == n
    ranges.pop
    ranges.push(temp.first..n)
  else
    ranges.push(n..n)
  end
end
spss(range)

My first answer was trying to be too clever. Here is a much simpler version
class MutablePrintingCandidateRange < Struct.new(:first, :last)
  def to_s
    if self.first == nil and self.last == nil
      ''
    elsif self.first == self.last
      "(#{self.first}=6)"
    else
      "(#{self.first} thru #{self.last})"
    end
  end
  def <<(x)
    if self.first == nil and self.last == nil
      self.first = self.last = x
    elsif self.last == x - 1
      self.last = x
    else
      puts(self) # print the candidates
      self.first = self.last = x # reset the range
    end
  end
end
and how to use it:
number_range = (1..666_666)
current_range = MutablePrintingCandidateRange.new
for i in number_range
  candidate = i.to_s
  if candidate =~ /6/ and candidate !~ /7|8|9/
    # number contains a 6, but not a 7, 8, or 9
    current_range << i
  end
end
puts current_range

Basic observation: If the current number is (say) 1900 you know that you can safely skip up to at least 2000...
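A minimal Ruby sketch of that skipping idea, assuming the goal is still "contains a 6 but no 7, 8 or 9". The names skip_forbidden and candidates are hypothetical, not taken from any answer here.

```ruby
# Hypothetical helper: smallest m >= n whose decimal digits avoid 7, 8 and 9.
def skip_forbidden(n)
  loop do
    s = n.to_s
    i = s.index(/[789]/)
    return n if i.nil?
    # Zero the offending digit and everything to its right, then carry one
    # into the digit on its left (1900 becomes 2000, 67 becomes 70, then 100).
    power = 10**(s.length - i)
    n = (n / power + 1) * power
  end
end

# All candidates up to `limit`: numbers that contain a 6 but no 7, 8 or 9.
def candidates(limit)
  result = []
  n = skip_forbidden(1)
  while n <= limit
    result << n if n.to_s.include?('6')
    n = skip_forbidden(n + 1)
  end
  result
end

p skip_forbidden(1900)  # 2000
p candidates(70)        # [6, 16, 26, 36, 46, 56, 60, 61, 62, 63, 64, 65, 66]
```

The loop in skip_forbidden matters because a single jump can land on another forbidden digit, as in 6900 going to 7000 and then to 10000.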

(I didn't bother updating my C solution for formatting. Instead I went with x3ro's excellent ruby version and optimized that)
Undeleted:
I am still not sure whether the changed range notation isn't actually what the OP wants: because this version counts in base 7, it no longer breaks up ranges that are contiguous once the numbers containing 7, 8 or 9 are left out. I wouldn't be surprised if the OP actually expected
....
(555536=6)
(555546=6)
(555556 THRU 666666=6)
instead of
....
(666640 THRU 666646=6)
(666650 THRU 666656=6)
(666660 THRU 666666=6)
I'll let the OP decide. Here is the modified version, which runs in about 18% of the time of x3ro's version (3.2s instead of 17.0s when generating up to 6666666, i.e. seven 6s).
def check(n)
  n.to_s(7).include?('6')
end
def spss(ranges)
  ranges.each do |range|
    if range.first === range.last
      puts "(" + range.first.to_s(7) + "=6)"
    else
      puts "(" + range.first.to_s(7) + " THRU " + range.last.to_s(7) + "=6)"
    end
  end
end
range = (1..117648)
range = range.select { |n| check(n) }
# start from an empty list; seeding with 0..0 would print a spurious "(0=6)"
range = range.inject([]) do |ranges, n|
  temp = ranges.last
  if temp and temp.last + 1 == n
    ranges.pop
    ranges.push(temp.first..n)
  else
    ranges.push(n..n)
  end
end
spss(range)

My answer below is not complete, but just to show a path (I might come back and continue the answer):
There are only two cases:
1) All the digits besides the lowest one are either absent or not 6 (so the lowest digit itself must be 6)
6, 16, ...
2) At least one digit besides the lowest one is 6
60--66, 160--166, 600--606, ...
Cases in (1) do not include any continuous numbers because they all have 6 in the lowest digit, and are different from one another. Cases in (2) all appear as continuous ranges where the lowest digit continues from 0 to 6. Any single continuation in (2) is not continuous with another one in (2) or with anything from (1) because a number one less than xxxxx0 will be xxxxy9, and a number one more than xxxxxx6 will be xxxxxx7, and hence be excluded.
Therefore, the question reduces to the following:
3)
Get all strings between "" to "66666" that do not include "6"
For each of them ("xxx"), output the string "(xxx6=6)"
4)
Get all strings between "" to "66666" that include at least one "6"
For each of them ("xxx"), output the string "(xxx0 THRU xxx6=6)"
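Steps 3) and 4) can be sketched in Ruby as follows. This is only an illustration under my own assumptions: the name spss_lines is hypothetical, and the prefixes are enumerated by counting in base 7, which yields exactly the digit strings that use only 0 to 6.

```ruby
# Hypothetical sketch: build the output lines from the prefix strings alone.
# prefix_digits = 5 covers the prefixes "" to "66666", i.e. numbers up to 666666.
def spss_lines(prefix_digits)
  (0...7**prefix_digits).map do |k|
    # Counting in base 7 gives every prefix over the digits 0..6, in
    # increasing order; k = 0 stands for the empty prefix.
    prefix = k.zero? ? '' : k.to_s(7)
    if prefix.include?('6')
      "(#{prefix}0 THRU #{prefix}6=6)"  # case 4: the last digit runs from 0 to 6
    else
      "(#{prefix}6=6)"                  # case 3: the last digit has to be 6
    end
  end
end

puts spss_lines(1)
```

With prefix_digits = 1 this prints (6=6), (16=6) up to (56=6), and finally (60 THRU 66=6). No candidate number is ever tested individually, so the work is proportional to the number of output lines.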

The killer here is
numbers = (1..666666).to_a
Range supports iterations so you would be better off by going over the whole range and accumulating numbers that include your segments in blocks. When one block is finished and supplanted by another you could write it out.
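A minimal sketch of that streaming approach, assuming the same "contains a 6, no 7, 8 or 9" filter as the question. The method name each_run is made up for illustration.

```ruby
# Hypothetical sketch: walk the range without materialising it, and hand each
# finished block of consecutive candidates to the caller as (first, last).
def each_run(limit)
  return enum_for(:each_run, limit) unless block_given?
  first = last = nil
  (1..limit).each do |n|
    s = n.to_s
    next unless s.include?('6') && s !~ /[789]/
    if last && n == last + 1
      last = n                     # extend the current block
    else
      yield first, last if first   # the previous block is finished
      first = last = n             # start a new block
    end
  end
  yield first, last if first       # flush the final block
end

each_run(70) do |a, b|
  puts a == b ? "(#{a}=6)" : "(#{a} thru #{b})"
end
```

Only the current block's two endpoints are held in memory at any time, instead of an array of 666666 integers.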
