Identify characters in a series by a line number - algorithm

This is my first time on here. I searched and couldn't find anything relevant. I'm trying to work something out:
Where a=1, b=2, c=3 ... z=26
If you were to create a series that goes through every possible combination of letters of length 1, in numerical order, the total number of possible outcomes is 26 (26^1). You can easily figure out that "e" would be on line 5 of the series and "y" would be on line 25.
If you set the parameters to a 2 character length, the total number of combinations is 676 (26^2), "aa" would be line 1, "az" would be line 26, "ba" would be line 27, "zz" would be line 676. This is easily calculated, and can be done no matter what the character length is, you will always find what line it would be on in the series.
My question is how do you do it in reverse? Using the same parameters, 1 will obviously be "aa", 31 will be "be". How do you work out with a formula that 676 will be "zz"? 676, based on the parameters set, can only be "zz", it can't be any other set of characters. So there should be a way of calculating this, no matter how long the number is, as long as you know the parameters of the series.
If length of characters was 10, what characters would be on line 546,879,866, for example?
Is this even doable? Thanks so much in advance

It is enough to convert 546,879,866 into a base-26 number. For example in bash:
echo 'obase=26 ; 546879866' | bc
01 20 00 19 03 23 00
And if you prefer 10 characters, pad the number with leading zeros:
00 00 00 01 20 00 19 03 23 00
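Just note that the digit numbering starts from 0, which means a=00, b=01, … z=25. Because the question counts lines from 1 ("aa" is line 1), subtract 1 from the line number before converting if you want the string on that exact line.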
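As a rough sketch of the same base-26 idea in Python (the function names line_to_string and string_to_line are made up for this illustration), assuming lines are counted from 1 as in the question, so that 1 is subtracted before converting:

```python
import string

ALPHABET = string.ascii_lowercase  # digit values: a=0, b=1, ..., z=25


def line_to_string(line: int, length: int) -> str:
    """Return the string on the given 1-based line of the series."""
    total = len(ALPHABET) ** length
    if not 1 <= line <= total:
        raise ValueError(f"line must be between 1 and {total}")
    n = line - 1  # shift to 0-based so that line 1 maps to all 'a's
    letters = []
    for _ in range(length):
        # peel off base-26 digits, least significant first
        n, digit = divmod(n, len(ALPHABET))
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))


def string_to_line(text: str) -> int:
    """Inverse direction: the 1-based line on which text appears."""
    n = 0
    for ch in text:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    return n + 1


print(line_to_string(31, 2))             # be
print(line_to_string(676, 2))            # zz
print(line_to_string(546_879_866, 10))   # aaabuatdwz
```

Without the subtraction the digits are exactly the bc output above (01 20 00 19 03 23 00), which is the string one line further on.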

Related

Need help for finding optimal path which visits multiple sequences of nodes

Summary
Recently I have had a path-finding puzzle that has some complex constraints (currently, I don't have any solution for this one)
A 2D matrix represented the graph. The length of a path is the number of traversed cells.
One or more number sequences are to be found inside the matrix. Each sequence is scored with a value.
A maximum path length is given. The number of picked cells must not exceed this value.
At any given moment, you can only choose cells in a specific column or row.
On each turn, you need to switch between column and row and stay on the same line as the last cell you picked. You have to move at right angles. (The direction is like the Snake game).
Always start with picking the first cell from the top row, then go vertically down to pick the second cell, and then continue switching between column and row as usual.
You can't choose the same cell twice. The resulting path must not contain duplicated cells.
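To make the movement rules concrete, here is a small Python sketch of a path checker. The name is_valid_path and the (row, col) cell representation are assumptions for this illustration only, and it checks just the movement rules, not scoring or matrix bounds:

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col), both 0-based; an assumed representation


def is_valid_path(path: List[Cell], max_length: int) -> bool:
    """Check the movement rules listed above for a candidate path."""
    if not path or len(path) > max_length:
        return False
    if len(set(path)) != len(path):  # no cell may be picked twice
        return False
    if path[0][0] != 0:              # the first cell must be in the top row
        return False
    for i in range(1, len(path)):
        (r0, c0), (r1, c1) = path[i - 1], path[i]
        if i % 2 == 1:
            # 2nd, 4th, ... picks move vertically: same column as the last cell
            if c0 != c1:
                return False
        else:
            # 3rd, 5th, ... picks move horizontally: same row as the last cell
            if r0 != r1:
                return False
    return True


print(is_valid_path([(0, 1), (3, 1), (3, 3)], max_length=4))  # True
print(is_valid_path([(0, 1), (0, 3)], max_length=4))          # False
```

The second path fails because the second pick has to go down the column of the first cell rather than along the top row.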
For example:
The task is to find the shortest possible path in the graph that contains one or more sequences with the highest total score, with a path length that does not exceed the provided maximum length.
The picture below demonstrates the solved puzzle with the resulting path marked in red:
Here, we have a path 3A-10-9B. This path contains the given sequence 3A-10-9B, which earns 10pts. More complex graphs typically have longer paths containing various sequences at once.
More complex examples
Multiple Sequences
You can complete sequences in any order. The order in which the sequences are listed doesn't matter.
Wasted Moves
Sometimes we are forced to waste moves and choose different cells that don't belong to any sequence. Here are the rules:
You can waste 1 or 2 moves before the first sequence.
You can waste 1 or 2 moves between any neighboring sequences.
However, you cannot break sequences and waste moves in the middle of them.
Here, we must waste one move before the sequence 3A-9B and two moves between sequences 3A-9B and 72-D4. Also, notice how red lines between 3A and 9B as well as between 72 and D4 "cross" previously selected cells D4 and 9B, respectively. You can pick different cells from the same row or column multiple times.
Optimal Sequences
Sometimes it is not possible to find a path that contains all of the provided sequences. In this case, choose the path that achieves the highest score.
In the above example, we can complete either 9B-3A-72-D4 or 72-D4-3A but not both due to the maximum path length of 5 cells. We have chosen the sequence 9B-3A-72-D4 since it grants more score points than 72-D4-3A.
Unsolvable solution
The first sequence 3A-D4 can't be completed since the code matrix doesn't contain code D4 at all. The second sequence, 72-10, can't be completed for another reason: codes 72 and 10 aren't located in the same row or column anywhere in the matrix and, therefore, can't form a sequence.
Performance advice
One brute force way is to generate all possible paths in the code matrix, loop through them and choose the best one. This is the easiest but also the slowest approach. Solving larger matrices with larger maximum length of path might take dozens of minutes, if not hours.
Try to implement a faster algorithm that doesn’t iterate through all possible paths and can solve puzzles with the following parameters in less than 10 seconds:
Matrix size: 10x10
Number of sequences: 5
Average length of sequences: 4
Maximum path length: 12
At least one solution exists
For example:
Matrix:
41,0f,32,18,29,4b,55,3f,10,3a,
19,4f,57,43,3a,25,19,1e,5e,42,
13,5a,54,3c,1b,32,29,1c,15,30,
49,45,22,2e,25,51,2f,21,4c,37,
1a,5e,49,12,55,1e,49,19,43,2d,
34,26,53,48,49,60,32,3c,50,10,
0f,1e,30,3d,64,37,5b,5e,22,61,
4e,4f,15,5a,13,56,44,22,40,26,
43,2c,17,2b,1f,25,43,60,50,1f,
3c,2b,54,46,42,4d,32,46,30,24,
Sequences:
30, 26, 44, 32, 3c - 25pts
5a, 3c, 12, 1e, 4d - 10pts
1e, 5a, 12 - 10pts
4d, 1e - 5pts
32, 51, 2f, 49, 55, 42 - 30pts
Optimal solution
3f, 1c, 30, 26, 44, 32, 3c, 22, 5a, 12, 1e, 4d
Which contains
30, 26, 44, 32, 3c
5a, 12, 1e
1e, 4d
Conclusion
I am looking for any advice for this puzzle since I have no idea what keywords to look for. A pseudo-code or hints would be helpful for me, and I appreciate that. What has come to my mind is just Dijkstra:
For each sequence, since the order doesn't matter, I have to find all possible paths for every permutation, then find the highest-scoring path that contains other input sequences.
After that, choose the best of the best.
In this case, I doubt the performance will be the issue.
The first step is to check whether a required sequence exists.
- SET found FALSE
- LOOP C1 over cells in first row
    - CLEAR foundSequence
    - ADD C1 to foundSequence
    - LOOP C2 over cells in column containing C1
        - IF C2 value == first value in sequence
            - ADD C2 to foundSequence
            - SET found TRUE
            - break from LOOP C2
    - IF found
        - SET direction VERT
        - LOOP V over remaining values in sequence
            - TOGGLE direction
            - SET found FALSE
            - LOOP C2 over cells in same column or row ( depending on direction ) containing last cell in foundSequence
                - IF C2 value == V
                    - ADD C2 to foundSequence
                    - SET found TRUE
                    - break from LOOP C2
            - IF ! found
                - break out of LOOP V
    - IF foundSequence == required sequence
        - RETURN foundSequence
- RETURN failed
Note: this doesn't find sequences that are feasible with "wasted moves". I would implement this first and get it working. Then, using the same ideas, it can be extended to allow wasted moves.
You have not specified an input format! I suggest a space-delimited text file with lines beginning with 'm' containing matrix values and lines beginning with 's' containing sequences, like this:
m 3A 3A 10 9B
m 9B 72 3A 10
m 10 3A 3A 3A
m 3A 10 3A 9B
s 3A 10 9B
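For illustration, here is a Python sketch that reads the suggested format and searches for a sequence that starts in the first row. The names parse_puzzle and find_sequence are invented for this sketch. It is a backtracking variant (it tries other matching cells after a dead end), it does not handle wasted moves, and it is not a translation of the C++ code:

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col), 0-based


def parse_puzzle(text: str) -> Tuple[List[List[str]], List[List[str]]]:
    """Split 'm ...' lines into matrix rows and 's ...' lines into sequences."""
    matrix, sequences = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "m":
            matrix.append(parts[1:])
        elif parts[0] == "s":
            sequences.append(parts[1:])
    return matrix, sequences


def find_sequence(matrix: List[List[str]], seq: List[str]) -> Optional[List[Cell]]:
    """Find seq starting in the top row, alternating column and row moves."""
    if not matrix or not seq:
        return None
    height, width = len(matrix), len(matrix[0])

    def extend(path: List[Cell], vertical: bool) -> Optional[List[Cell]]:
        if len(path) == len(seq):
            return path
        row, col = path[-1]
        wanted = seq[len(path)]
        if vertical:
            candidates = [(r, col) for r in range(height)]
        else:
            candidates = [(row, c) for c in range(width)]
        for r, c in candidates:
            if (r, c) not in path and matrix[r][c] == wanted:
                found = extend(path + [(r, c)], not vertical)
                if found:
                    return found
        return None  # dead end: the caller tries its next candidate

    for col in range(width):
        if matrix[0][col] == seq[0]:
            found = extend([(0, col)], True)  # the second pick is always vertical
            if found:
                return found
    return None


SAMPLE = """\
m 3A 3A 10 9B
m 9B 72 3A 10
m 10 3A 3A 3A
m 3A 10 3A 9B
s 3A 10 9B
"""

matrix, sequences = parse_puzzle(SAMPLE)
for row, col in find_sequence(matrix, sequences[0]):
    print("row", row, "col", col, matrix[row][col])
```

On the sample input this prints the cells (0, 1), (3, 1) and (3, 3), one per line.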
I have implemented the sequence finder in C++
// Excerpt: pA (the matrix) and vSequence (the required sequences)
// are defined elsewhere in the application.
std::vector<int> findSequence()
{
    int w, h;
    pA->size(w, h);
    std::vector<int> foundSequence;
    bool found = false;
    bool vert = false;

    // loop over cells in first row
    for (int c = 0; c < w; c++)
    {
        foundSequence.clear();
        found = false;
        vert = false; // the first toggle below makes the first move vertical
        if (pA->cell(c, 0)->value == vSequence[0][0])
        {
            // found possible starting cell
            foundSequence.push_back(pA->cell(c, 0)->ID());
            found = true;
        }
        while (found)
        {
            // toggle search direction
            vert = (!vert);

            // start from last cell found
            auto pmCell = pA->cell(foundSequence.back());
            int col, row;
            pA->coords(col, row, pmCell);

            // look for next value in required sequence
            std::string nextValue = vSequence[0][foundSequence.size()];
            found = false;
            if (vert)
            {
                // loop over cells in column, below the first row
                for (int r2 = 1; r2 < h; r2++)
                {
                    if (pA->cell(col, r2)->value == nextValue)
                    {
                        foundSequence.push_back(pA->cell(col, r2)->ID());
                        found = true;
                        break;
                    }
                }
            }
            else
            {
                // loop over cells in row
                for (int c2 = 0; c2 < w; c2++)
                {
                    if (pA->cell(c2, row)->value == nextValue)
                    {
                        foundSequence.push_back(pA->cell(c2, row)->ID());
                        found = true;
                        break;
                    }
                }
            }
            if (!found)
            {
                // dead end - try starting from next cell in first row
                break;
            }
            if (foundSequence.size() == vSequence[0].size())
            {
                // success!!!
                return foundSequence;
            }
        }
    }
    std::cout << "Cannot find sequence\n";
    exit(1);
}
This outputs:
3A 3A 10 9B
9B 72 3A 10
10 3A 3A 3A
3A 10 3A 9B
row 0 col 1 3A
row 3 col 1 10
row 3 col 3 9B
You can check out the code for the complete application at https://github.com/JamesBremner/stackoverflow75410318
I have added the ability to find sequences that start elsewhere than the first row ( i.e. with "wasted moves" ). You can see the code in the github repo.
Here are the results of a timing profile run on a 10 by 10 matrix. The algorithm searches for the 5 sequences at an average of about 0.6 milliseconds per sequence
Searching
41 0f 32 18 29 4b 55 3f 10 3a
19 4f 57 43 3a 25 19 1e 5e 42
13 5a 54 3c 1b 32 29 1c 15 30
49 45 22 2e 25 51 2f 21 4c 37
1a 5e 49 12 55 1e 49 19 43 2d
34 26 53 48 49 60 32 3c 50 10
0f 1e 30 3d 64 37 5b 5e 22 61
4e 4f 15 5a 13 56 44 22 40 26
43 2c 17 2b 1f 25 43 60 50 1f
3c 2b 54 46 42 4d 32 46 30 24
for sequence 4d 1e
Cannot find sequence starting in 1st row, using wasted moves
row 9 col 5 4d
row 4 col 5 1e
for sequence 30 26 44 32 3c
Cannot find sequence starting in 1st row, using wasted moves
Cannot find sequence
for sequence 5a 3c 12 1e 4d
Cannot find sequence starting in 1st row, using wasted moves
row 2 col 1 5a
row 2 col 3 3c
row 4 col 3 12
row 4 col 5 1e
row 9 col 5 4d
for sequence 1e 5a 12
Cannot find sequence starting in 1st row, using wasted moves
row 6 col 1 1e
row 4 col 5 1e
row 4 col 3 12
for sequence 32 51 2f 49 55 42
Cannot find sequence starting in 1st row, using wasted moves
row 2 col 5 32
row 3 col 5 51
row 3 col 6 2f
row 4 col 6 49
row 4 col 4 55
row 9 col 4 42
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
5 0.00059034 0.0029517 findSequence

forvalues and xtile in Stata

What do the last two lines do? As far as I understand, these lines loop through the list h_nwave and calculate the weighted quantiles, if syear2digit == 'nwave' , i.e. calculate 5 quantiles for each year. But I'm not sure if my understanding is correct. Also is this equivalent to using group() function?
h_nwave "91 92 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15"
generate quantile_ip = .
forvalues number = 1(1)15 {
local nwave : word `number' of `h_nwave'
xtile quantile_ip_`nwave' = a_ip if syear2digit == `nwave' [ w = weight ], nq(5)
replace quantile_ip = quantile_ip_`nwave' if syear2digit == `nwave'
}
I am trying to convert this into R with a for loop, mutate, xtile (statar package required) and case_when. However, so far I cannot find a suitable way to get a similar result.
There is no source or context for this code.
Detail: The first command is truncated and presumably should have been
local h_nwave 91 92 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
Detail: The first list contains 25 values, presumably corresponding to years 1991 to 2015. But the second list implies 15 values, so we are only looking at 91 to 05.
Main idea: xtile assigns observations to quintile bins on variable a_ip, with weights. So the lowest 20% of observations (taking weighting into account) should be in bin 1, and so on. In practice observations with the same value must be assigned to the same bin, so 20-20-20-20-20 splits are not guaranteed, quite apart from the small print of whether sample size is a multiple of 5. So, the result is assignment to bins 1 to 5, and not quintiles themselves, or any other kind of quantiles.
This is done separately for each survey wave.
The xtile command is documented for everyone at https://www.stata.com/manuals/dpctile.pdf regardless of personal or workplace access to Stata.
In R, you may well be able to produce quintile bins for all survey years at once. I have no idea how to do that.
Otherwise put, the loop arises because xtile doesn't work on separate subsets in one command call. There are community-contributed Stata commands that allow that. This kind of topic is much discussed on Statalist.
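To illustrate the idea of weighted quintile bins computed separately per wave, here is a rough Python sketch. The function names weighted_bins and bins_by_group are made up, and the cutpoint rule used (the smallest value whose cumulative weight share reaches k/nq) is a simplifying assumption; it is not claimed to reproduce xtile's exact percentile definition, and it does not handle missing values:

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def weighted_bins(values: Sequence[float], weights: Sequence[float], nq: int = 5) -> List[int]:
    """Assign each value to a bin 1..nq using weighted cutpoints (a sketch)."""
    total = sum(weights)
    cutpoints = []
    cumulative = 0.0
    k = 1
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        # record a cutpoint each time the weight share reaches k/nq
        while k < nq and cumulative * nq >= k * total:
            cutpoints.append(value)
            k += 1
    # bin = 1 + number of cutpoints strictly below the value,
    # so observations with equal values always share a bin
    return [1 + sum(1 for cut in cutpoints if cut < v) for v in values]


def bins_by_group(groups: Sequence[Hashable], values: Sequence[float],
                  weights: Sequence[float], nq: int = 5) -> List[int]:
    """Apply weighted_bins separately within each group (e.g. survey wave)."""
    rows_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        rows_by_group[g].append(i)
    result = [0] * len(values)
    for rows in rows_by_group.values():
        bins = weighted_bins([values[i] for i in rows],
                             [weights[i] for i in rows], nq)
        for i, b in zip(rows, bins):
            result[i] = b
    return result


waves = ["91"] * 10 + ["92"] * 10
income = list(range(1, 11)) + list(range(101, 111))
print(bins_by_group(waves, income, [1] * 20))
# [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Grouping first and binning within each group plays the role of the loop over waves: each wave gets its own cutpoints.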

convert decimal to hexadecimal 2 digits at a time Ruby

I am trying to convert a string of decimal values to hex, grabbing two digits at a time
so for example, if i were to convert these decimals two digits at a time
01 67 15 06 01 76 61 73
this would be my result
01430F06014C3D49
i know that str.to_s(16) will convert decimal to hex but like i said I need this done two digits at a time so the output is correct, and i have no clue how to do this in Ruby
here is what i have tried
str.upcase.chars.each_slice(2).to_s.(16).join
You can use String#gsub with a regular expression and Kernel#sprintf:
"01 67 15 06 01 76 61 73".gsub(/\d{2} */) { |s| sprintf("%02x", s.to_i) }
#=> "01430f06014c3d49"
The regular expression /\d{2} */ matches two digits followed by zero or more spaces (note 73 is not followed by a space).
The result of the block calculation replaces the two or three characters that were matched by the regular expression.
sprintf's formatting directive forms a string containing 2 characters, padded to the left with '0's if necessary, converting the integer (parsed from its base-10 string by to_i) to its string representation in base 16 ('x').
Alternatively, one could use String#% (with sprintf's formatting directives):
"01 67 15 06 01 76 61 73".gsub(/\d{2} */) { |s| "%02x" % s.to_i }
#=> "01430f06014c3d49"
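For comparison, the same two-digits-at-a-time conversion can be sketched in Python. The function name decimal_pairs_to_hex and its upper flag are made up for this illustration; the flag is included because the expected result in the question uses uppercase hex digits:

```python
import re


def decimal_pairs_to_hex(text: str, upper: bool = False) -> str:
    """Convert each two-digit decimal group in text to two hex digits."""
    fmt = "{:02X}" if upper else "{:02x}"
    # \d{2} picks out the two-digit groups, with or without spaces between them
    return "".join(fmt.format(int(pair)) for pair in re.findall(r"\d{2}", text))


print(decimal_pairs_to_hex("01 67 15 06 01 76 61 73"))
# 01430f06014c3d49
print(decimal_pairs_to_hex("01 67 15 06 01 76 61 73", upper=True))
# 01430F06014C3D49
```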

Padding the message in SHA256

I am trying to understand SHA256. On the Wikipedia page it says:
append the bit '1' to the message
append k bits '0', where k is the minimum number >= 0 such that the resulting message
length (modulo 512 in bits) is 448.
append length of message (without the '1' bit or padding), in bits, as 64-bit big-endian
integer
(this will make the entire post-processed length a multiple of 512 bits)
So if my message is 01100001 01100010 01100011 I would first add a 1 to get
01100001 01100010 01100011 1
Then you would fill in 0s so that the total length is 448 mod 512:
01100001 01100010 01100011 10000000 0000 ... 0000
(So in this example, one would add 448 - 25 0s)
My question is: What does the last part mean? I would like to see an example.
It means the message length, padded to 64 bits, with the bytes appearing in order of significance. So if the message length is 37113, that's 90 f9 in hex; two bytes. There are two basic(*) ways to represent this as a 64-bit integer,
00 00 00 00 00 00 90 f9 # big endian
and
f9 90 00 00 00 00 00 00 # little endian
The former convention follows the way numbers are usually written out in decimal: one hundred and two is written 102, with the most significant part (the "big end") being written first, the least significant ("little end") last. The reason that this is specified explicitly is that both conventions are used in practice; internet protocols use big endian, Intel-compatible processors use little endian, so if they were decimal machines, they'd write one hundred and two as 201.
(*) Actually there are 8! = 40320 ways to represent a 64-bit integer if 8-bit bytes are the smallest units to be permuted, but two are in actual use.
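As a worked example, here is a minimal Python sketch of the padding steps from the question for whole-byte messages. The helper name sha256_pad is made up, and it only pads; it does not compute the hash. For the 24-bit message in the question (the ASCII string "abc") the appended length field is 24, i.e. hex 18:

```python
def sha256_pad(message: bytes) -> bytes:
    """Pad a whole-byte message as described in the question (a sketch)."""
    bit_length = len(message) * 8
    # 0x80 is the '1' bit followed by seven '0' bits; then add zero bytes
    # until the length is 56 bytes (448 bits) modulo 64 bytes (512 bits)
    zero_bytes = (55 - len(message)) % 64
    return (message + b"\x80" + b"\x00" * zero_bytes
            + bit_length.to_bytes(8, "big"))  # 64-bit big-endian length


padded = sha256_pad(b"abc")
print(len(padded) * 8)         # 512
print(padded[:4].hex(" "))     # 61 62 63 80
print(padded[-8:].hex(" "))    # 00 00 00 00 00 00 00 18

# The two byte orders from the answer, for a length of 37113:
print((37113).to_bytes(8, "big").hex(" "))     # 00 00 00 00 00 00 90 f9
print((37113).to_bytes(8, "little").hex(" "))  # f9 90 00 00 00 00 00 00
```

The 0x80 byte plus 52 zero bytes supply the '1' bit and the 448 - 25 = 423 zero bits from the question.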

Figuring out how to decode obfuscated URL parameters

I have a web-based system that uses encrypted GET parameters. I need to figure out what encryption is used and create a PHP function to recreate it. Any ideas?
Example URL:
...&watermark=ISpQICAK&width=IypcOysK&height=IypcLykK&...
You haven't provided nearly enough sample data for us to reliably guess even the alphabet used to encode it, much less what structure it might have.
What I can tell, from the three sample values you've provided, is:
There is quite a lot of redundancy in the data — compare e.g. width=IypcOysK and height=IypcLykK (and even watermark=ISpQICAK, though that might be just coincidence). This suggests that the data is neither random nor securely encrypted (which would make it look random).
The alphabet contains a fairly broad range of upper- and lowercase letters, from A to S and from c to y. Assuming that the alphabet consists of contiguous letter ranges, that means a palette of between 42 and 52 possible letters. Of course, we can't tell with any certainty from the samples whether other characters might also be used, so we can't even entirely rule out Base64.
This is not the output of PHP's base_convert function, as I first guessed it might be: that function only handles bases up to 36, and doesn't output uppercase letters.
That, however, is just about all. It would help to see some more data samples, ideally with the plaintext values they correspond to.
Edit: The id parameters you give in the comments are definitely in Base64. Besides the distinctive trailing = signs, they both decode to simple strings of nine printable ASCII characters followed by a line feed (hex 0A):
_Base64___________Hex____________________________ASCII_____
JiJQPjNfT0MtCg==  26 22 50 3e 33 5f 4f 43 2d 0a  &"P>3_OC-.
JikwPClUPENICg==  26 29 30 3c 29 54 3c 43 48 0a  &)0<)T<CH.
(I've replaced non-printable characters with a . in the ASCII column above.) On the assumption that all the other parameters are Base64 too, let's see what they decode to:
_Base64___Hex________________ASCII_
ISpQICAK  21 2a 50 20 20 0a  !*P  .
IypcOysK  23 2a 5c 3b 2b 0a  #*\;+.
IypcLykK  23 2a 5c 2f 29 0a  #*\/).
ISNAICAK  21 23 40 20 20 0a  !#@  .
IyNAPjIK  23 23 40 3e 32 0a  ##@>2.
IyNAKjAK  23 23 40 2a 30 0a  ##@*0.
ISggICAK  21 28 20 20 20 0a  !(   .
IikwICAK  22 29 30 20 20 0a  ")0  .
IilAPCAK  22 29 40 3c 20 0a  ")@< .
So there's definitely another encoding layer involved, but we can already see some patterns:
All decoded values consist of a constant number of printable ASCII characters followed by a trailing line feed character. This cannot be a coincidence.
Most of the characters are on the low end of the printable ASCII range (hex 20 – 7E). In particular, the lowest printable ASCII character, space = hex 20, is particularly common, especially in the watermark strings.
The strings in each URL resemble each other more than they resemble the corresponding strings from other URLs. (But there are resemblances between URLs too: for example, all the decoded watermark values begin with ! = hex 21.)
In fact, the highest numbered character that occurs in any of the strings is _ = hex 5F, while the lowest (excluding the line feeds) is space = hex 20. Their difference is hex 3F = decimal 63. Coincidence? I think not. I'll guess that the second encoding layer is similar to uuencoding: the data is split into 6-bit groups (as in Base64), and each group is mapped to an ASCII character simply by adding hex 20 to it.
In fact, it looks like the second layer might be uuencoding: the first bytes of each string have the right values to be uuencode length indicators. Let's see what we get if we try to decode them:
_Base64___________UUEnc______Hex________________ASCII___re-UUE____
JiJQPjNfT0MtCg==  &"P>3_OC-  0b 07 93 fe f8 cd  ......  &"P>3_OC-
JikwPClUPENICg==  &)0<)T<CH  25 07 09 d1 c8 e8  %.....  &)0<)T<CH
_Base64___UUEnc__Hex_______ASC__re-UUE____
ISpQICAK  !*P    2b        +    !*P``
IypcOysK  #*\;+  2b c6 cb  +..  #*\;+
IypcLykK  #*\/)  2b c3 c9  +..  #*\/)
ISNAICAK  !#@    0e        .    !#@``
IyNAPjIK  ##@>2  0e 07 92  ...  ##@>2
IyNAKjAK  ##@*0  0e 02 90  ...  ##@*0
ISggICAK  !(     20             !(```
IikwICAK  ")0    25 00     %.   ")0``
IilAPCAK  ")@<   26 07     &.   ")@<`
This is looking good:
Uudecoding and re-encoding the data (using Perl's unpack "u" and pack "u") produces the original string, except that trailing spaces are replaced with ` characters (which falls within acceptable variation between encoders).
The decoded strings are no longer printable ASCII, which suggests that we might be closer to the real data.
The watermark strings are now single characters. In two cases out of three, they're prefixes of the corresponding width and height strings. (In the third case, which looks a bit different, the watermark might perhaps have been added to the other values.)
One more piece of the puzzle — comparing the ID strings and corresponding numeric values you give in the comments, we see that:
The numbers all have six digits. The first two digits of each number are the same.
The uudecoded strings all have six bytes. The first two bytes of each string are the same.
Coincidence? Again, I think not. Let's see what we get if we write the numbers out as ASCII strings, and XOR them with the uudecoded strings:
_Num_____ASCII_hex___________UUDecoded_ID________XOR______________
406747   34 30 36 37 34 37   25 07 09 d1 c8 e8   11 37 3f e6 fc df
405174   34 30 35 31 37 34   25 07 0a d7 cb eb   11 37 3f e6 fc df
405273   34 30 35 32 37 33   25 07 0a d4 cb ec   11 37 3f e6 fc df
What is this 11 37 3f e6 fc df string? I have no idea — it's mostly not printable ASCII — but XORing the uudecoded ID with it yields the corresponding ID number in three cases out of three.
More to think about: you've provided two different ID strings for the value 405174: JiJQPjNfT0MtCg== and JikwPCpVXE9LCg==. These decode to 0b 07 93 fe f8 cd and 25 07 0a d7 cb eb respectively, and their XOR is 2e 00 99 29 33 26. The two URLs these ID strings came from have decoded watermarks of 0e and 20 respectively, whose XOR is 2e, which accounts for the first byte (and the second byte is the same in both, anyway). Where the differences in the remaining four bytes come from is still a mystery to me.
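The layers identified so far (Base64, then uuencoding, then an XOR for the ID values) can be sketched in Python as follows. The function names decode_layers and decode_id are made up, and the XOR key is only the constant observed for the three IDs in the table above, so treat this as an experiment aid rather than a confirmed description of the scheme:

```python
import base64
import binascii

# Constant observed when XORing the three known IDs with their uudecoded
# strings; whether it holds for other URLs is an open assumption.
OBSERVED_ID_KEY = bytes.fromhex("11373fe6fcdf")


def decode_layers(value: str) -> bytes:
    """Undo the Base64 layer, then the uuencode layer."""
    uu_line = base64.b64decode(value)   # e.g. b'&)0<)T<CH\n'
    return binascii.a2b_uu(uu_line)     # decodes one uuencoded line


def decode_id(value: str) -> str:
    """Additionally XOR with the observed key to recover the numeric ID."""
    raw = decode_layers(value)
    return bytes(a ^ b for a, b in zip(raw, OBSERVED_ID_KEY)).decode("ascii")


print(decode_layers("IypcOysK").hex(" "))   # 2b c6 cb
print(decode_id("JikwPClUPENICg=="))        # 406747
print(decode_id("JikwPCpVXE9LCg=="))        # 405174
```

The other ID string given for 405174, JiJQPjNfT0MtCg==, does not decode to the number with this key, which matches the unexplained difference noted above.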
That's going to be difficult. Even if you find the encryption method and keys, the original data is likely salted and the salt is probably varied with each record.
That's the point of encryption.
