Comparing csv files using field values

I have some csv files with the following format in the same folder:
Name - Value - Number - Key
I want to compare these files pairwise and give each pair a score as follows. If any name in the Name column of the first file is missing from the corresponding column of the second file, the score is 0.
Otherwise, the score is computed as shown in this example:
File1.csv
NameA, ValueA, NumberA, KeyA
Jack, 10, 9, 12
Alex, 30, 2, 16
Mark, 15, 3, 18
File2.csv
NameB, ValueB, NumberB, KeyB
Jack, 13, 4, 11
Alex, 22, 5, 18
Bill, 67, 6, 20
Mark, 18, 8, 26
Score = abs(11 - 10)/9 + abs(18 - 30)/2 + abs(26 - 15)/3
So it will be given by the summation of the abs(KeyB - ValueA)/NumberA scores, where abs is the absolute value of the subtraction.
How can I do this?

First of all, based on your formula, abs(KeyB - ValueA)/NumberA, you should have
Score = abs(11 - 10)/9 + abs(18 - 30)/2 + abs(26 - 15)/3 = 9.7778
instead of
Score = abs(11 - 10)/9 + abs(18 - 30)/2 + abs(26 - 15)/18
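The awk command can be:
awk -F, 'function abs(x) { return (x < 0) ? -x : x }
# Load Value and Number from file1.csv, keyed by Name; "> 0" stops at EOF and on a read error
BEGIN { while ((getline < "file1.csv") > 0) { f[$1] = $2; g[$1] = $3 } }
# For each file2.csv row whose Name was seen with a nonzero Number, add abs(Key - Value)/Number
{ if (g[$1] != 0) score += abs($4 - f[$1]) / g[$1] } END { print score }' file2.csv
Note that this does not implement the rule that a missing name gives a score of 0; names that appear in only one of the files are simply skipped.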
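For comparison, here is a minimal Python sketch of the same score that also applies the "missing name gives 0" rule from the question. The function names `read_rows` and `pair_score` are made up for illustration, the CSV text is passed in as strings rather than read from disk, and rows with a zero Number are skipped as in the awk command.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into {name: (value, number, key)}, skipping the header row."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # header line
    return {row[0]: tuple(float(x) for x in row[1:4]) for row in reader if row}


def pair_score(text_a, text_b):
    """Sum of abs(KeyB - ValueA) / NumberA, or 0 if a name from A is missing in B."""
    a, b = read_rows(text_a), read_rows(text_b)
    if any(name not in b for name in a):
        return 0.0
    return sum(abs(b[name][2] - value) / number
               for name, (value, number, _key) in a.items()
               if number != 0)


file1 = """NameA, ValueA, NumberA, KeyA
Jack, 10, 9, 12
Alex, 30, 2, 16
Mark, 15, 3, 18
"""
file2 = """NameB, ValueB, NumberB, KeyB
Jack, 13, 4, 11
Alex, 22, 5, 18
Bill, 67, 6, 20
Mark, 18, 8, 26
"""
print(round(pair_score(file1, file2), 4))  # 9.7778
```

Swapping the arguments gives 0, because Bill is in the second file but not in the first.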

Modulo algorithm proving elusive

I have a color wheel that maps a color to each hour on a 24-hour clock. Now, given the hour of day, I want to map those colors to a 12-hour clock such that the colors 5 hours before and 6 hours after the current hour are used. But it gets a bit tricky because the 0th index of the result always has to be the 0th color or the 12th color of the 24-color wheel.
For example, given colors24 as an array of 24 colors and an hour of 5, the final color12 array would map to colors24's indexes as:
{0,1,2,3,4,5,6,7,8,9,10,11}
If the hour is 3, then:
{0,1,2,3,4,5,6,7,8,9,22,23}
And if the hour is 9, then:
{12,13,14,15,4,5,6,7,8,9,10,11}
Bonus points if the algorithm can be generalized to any two arrays regardless of size so long as the first is evenly divisible by the second.

If hours is the total number of hours (24), length the number of colors displayed at a time (12), and hour is the current hour, then this is a generic algorithm to get the indexes into the color array:
result = [];
add = hour + hours - (length / 2) - (length % 2) + 1;
for (i = 0; i < length; i++) {
result[(add + i) % length] = (add + i) % hours;
}
Here is a Javascript implementation (generic, can be used with other ranges than 24/12):
function getColorIndexes(hour, hours, length) {
    var i, result, add;
    if (hours % length) throw "number of hours must be multiple of length";
    result = [];
    add = hour + hours - (length / 2) - (length % 2) + 1;
    for (i = 0; i < length; i++) {
        result[(add + i) % length] = (add + i) % hours;
    }
    return result;
}
console.log('hour=3: ' + getColorIndexes(3, 24, 12));
console.log('hour=5: ' + getColorIndexes(5, 24, 12));
console.log('hour=9: ' + getColorIndexes(9, 24, 12));
console.log('hour=23: ' + getColorIndexes(23, 24, 12));
As stated in the question, the number of hours (24) must be a multiple of the length of the array to return.

This can be done by first placing the numbers into a temporary array, then finding the location of 0 or 12 in it, and printing the results from that position on, treating the index as circular (i.e. modulo the array length)
Here is an example implementation:
int num[12];
// Populate the values that we are going to need
for (int i = 0 ; i != 12 ; i++) {
    // 19 is 24-5
    num[i] = (h+i+19) % 24;
}
int p = 0;
// Find p, the position of 0 or 12
while (num[p] != 0 && num[p] != 12) {
    p++;
}
// Print num[] array with offset of p
for (int i = 0 ; i != 12 ; i++) {
    printf("%d ", num[(p+i) % 12]);
}
Note: The first and the second loops can be combined. Add a check if the number you just set is zero or 12, and set the value of p when you find a match.
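The combined-loop variant mentioned in the note could look roughly like this in Python. It is only a sketch: the name `color_indexes` and its default arguments are illustrative, and it assumes `hours` is a multiple of `length`, so that any `length` consecutive hours contain exactly one multiple of `length` (0 or 12 in the 24/12 case).

```python
def color_indexes(hour, hours=24, length=12):
    """Return `length` indexes into a wheel of `hours` entries.

    The window starts (length // 2 - 1) hours before `hour` and is
    rotated so that the multiple of `length` lands at position 0.
    """
    before = length // 2 - 1           # 5 for a 12-entry window
    nums = []
    p = 0
    for i in range(length):
        value = (hour - before + i) % hours
        nums.append(value)
        if value % length == 0:        # 0 or 12 in the 24/12 case
            p = i                      # remember where the rotation starts
    # Read the window circularly, starting at position p
    return [nums[(p + i) % length] for i in range(length)]


print(color_indexes(3))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 22, 23]
print(color_indexes(9))  # [12, 13, 14, 15, 4, 5, 6, 7, 8, 9, 10, 11]
```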

Can you not get the colors straight away, i.e. from (C-Y/2+X+1)%X to (C+Y/2)%X, and then sort them?
(This is the same as looping (C+Z+X+1)%X from Z = -Y/2 to Z = Y/2-1):
for (i = 0, j = c+x+1, z = -y/2; z < y/2; z++) {
color[i++] = (z+j)%x;
}
For C=3, X=24 and Y=12, you get:
(C-12/2+24+1)%24 = 3-6+24+1 = 22, 23, 0, 1 .. 9
After sorting you get 0, 1 ...9, 22, 23 as requested.
Without sorting, you'd always get a sequence with the current hour smack in the middle (which could be good for some applications), while your 3 example has it shifted left two places.
You can do this by shifting instead of sorting. A shift is only needed when the window wraps around. If c is below y/2-1, the window starts at a negative index (c = 3 starts at -2, which becomes 22), and you shift by y/2-c-1 to the left (here 2, or 12-2 = 10 to the right using another modulus). If c >= x-y/2, the window ends beyond x-1: for c = 20, c+6 is 26, which gets rolled back to 2:
15 16 17 18 19 20 21 22 23 0 1 2
which gives a shift s of 2+1 = 3, or (c+y/2)%x+1 in general:
0 1 2 15 16 17 18 19 20 21 22 23
for (i = 0, j = c+x+1, z = -y/2; z < y/2; z++) {
color[(s+i++)%y] = (z+j)%x;
}
However, I think you've got a problem if x > 2*y; in that case you get some c values for which neither 0, nor x/2 are "in reach" of c. That is, "evenly divisible" must then mean that x must always be equal to y*2.

Here is a solution in JavaScript:
function f(h) {
var retval = [];
for (var i = h - 5; i <= h + 6; ++i)
retval.push((i+24) % 24);
return retval.sort(function(a,b){return a-b;}); // This is just a regular sort
}
https://repl.it/CWQf
For example,
f(5) // [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ]
f(3) // [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 22, 23 ]
f(9) // [ 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 ]

Hand tracing a pseudo code

I have this pseudo code that I need to hand trace:
begin
    count <- 1
    while count < 11
        t <- (count ^ 2) - 1
        output t
        count <- count + 1
    endwhile
end
I am unsure what <- means and I don't really understand what to do with the t. I also keep getting 1,1,1, etc. every time I go through. Any help would be appreciated!

First off the operator <- means "gets", as in an assignment. So:
count <- count + 1
Means to set the variable count to the value count + 1.
Second, the program will output the first 10 values of count^2 - 1, so:
t <- count^2 - 1
will evaluate to:
0, 3, 8, 15, 24, 35, 48, 63, 80, 99
for the values of count
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
respectively.
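If it helps to check a hand trace, the loop can be mirrored in a few lines of Python. This sketch assumes `^` in the pseudo code means "to the power of"; in Python that is written `**`, because `^` is bitwise XOR there. The function name `trace` is just for illustration.

```python
def trace():
    """Run the pseudo code and record (count, t) for every pass of the loop."""
    rows = []
    count = 1                  # count <- 1
    while count < 11:          # while count < 11
        t = count ** 2 - 1     # t <- (count ^ 2) - 1
        rows.append((count, t))
        count += 1             # count <- count + 1
    return rows


for count, t in trace():
    print(count, t)
```

Each printed line is one row of the trace table: the value of count on that pass and the value of t that gets output.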

Here is the code for it in C++; I hope it helps:
int count = 1;                       // count <- 1
int t;
while ( count < 11 ){                // while count < 11
    t = count * count - 1;           // t <- (count ^ 2) - 1
    std::cout<<t<<std::endl;         // output t
    count ++;                        // count <- count + 1
}                                    // endwhile
and as said in the previous answer:
count takes the values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
and t will take the values: 0, 3, 8, 15, 24, 35, 48, 63, 80, 99

Organising inconsistent values [closed]

No idea if this is ok to ask here since it's not programming but I have no idea where else to go:
I want to organise the following data in a consistent way. At the moment it's a mess, with only the first two columns (comma separated) consistent. The remaining columns can number anywhere from 1-9 and are usually different.
In other words, I want to sort it so the text matches (all of the value columns in a row, all of the recoil columns in a row, etc). Then I can remove the text and add a header, and it will still make sense.
bm_wp_upg_o_t1micro, sight, value = 3, zoom = 3, recoil = 1, spread_moving = -1
bm_wp_upg_o_marksmansight_rear, sight, value = 3, zoom = 1, recoil = 1, spread = 1
bm_wp_upg_o_marksmansight_front, extra, value = 1
bm_wp_m4_upper_reciever_edge, upper_reciever, value = 3, recoil = 1
bm_wp_m4_upper_reciever_round, upper_reciever, value = 1
bm_wp_m4_uupg_b_long, barrel, value = 4, damage = 1, spread = 1, spread_moving = -2, concealment = -2
Any suggestions (even on just where the right place is to actually ask this) would be great.
Context is just raw data ripped from a game file that I'm trying to organise.

I'm afraid regex isn't going to help you much here because of the irregular nature of your input (it would be possible to match it, but it would be a bear to get it all arranged one way or another). This could be done pretty easily with any programming language, but for stuff like this, I always go to awk.
Assuming your input is in a file called input.txt, put the following in a program called parse.awk:
BEGIN {
    FS=" *, *";
    formatStr = "%32s,%8s,%8s,%8s,%10s,%16s,%8s,%18s,%10s,%10s,%16s,%16s\n";
    # "upper_reciever" is spelled the way it appears in the input data
    printf( formatStr, "id", "sight", "value", "zoom", "recoil", "spread", "extra", "upper_reciever", "barrel", "damage", "spread_moving", "concealment" );
}
{
    split("",a);    # clear the array for this line
    for( i=2; i<=NF; i++ ) {
        if( split( $(i), kvp, " *= *" ) == 1 ) {
            a[kvp[1]] = "x";    # bare word (the second column): mark it as present
        } else {
            # gensub is a GNU awk extension
            a[kvp[1]] = gensub( /^\s*|\s*$/, "", "g", kvp[2] );
        }
    }
    printf( formatStr, $1, a["sight"], a["value"], a["zoom"], a["recoil"],
        a["spread"], a["extra"], a["upper_reciever"],
        a["barrel"], a["damage"], a["spread_moving"], a["concealment"] );
}
Run awk against it:
awk -f parse.awk input.txt
And get your output (column padding collapsed here):
id, sight, value, zoom, recoil, spread, extra, upper_reciever, barrel, damage, spread_moving, concealment
bm_wp_upg_o_t1micro, x, 3, 3, 1, , , , , , -1,
bm_wp_upg_o_marksmansight_rear, x, 3, 1, 1, 1, , , , , ,
bm_wp_upg_o_marksmansight_front, , 1, , , , x, , , , ,
bm_wp_m4_upper_reciever_edge, , 3, , 1, , , x, , , ,
bm_wp_m4_upper_reciever_round, , 1, , , , , x, , , ,
bm_wp_m4_uupg_b_long, , 4, , , 1, , , x, 1, -2, -2
Note that I chose to just use an 'x' for sight, which seems to be a present/absent thing. You can use whatever you want there.
If you're using Linux or a Macintosh, you should have awk available. If you're on Windows, you'll have to install it.

I made another awk version. I think this should be a little easier to read.
All values/columns are read from the file to make it as dynamic as possible.
awk -F, '
{
ID[$1]=$2 # use column 1 as index
for (i=3;i<=NF;i++ ) # loop through all fields from #3 to end
{
gsub(/ +/,"",$i) # remove space from field
split($i,a,"=") # split field in name and value a[1] and a[2]
COLUMN[a[1]]++ # store field name as column name
DATA[$1" "a[1]]=a[2] # store data value in DATA using field #1 and column name as index
}
}
END {
printf "%49s ","info" # print info
for (i in COLUMN)
{printf "%15s",i} # print column name
print ""
for (i in ID) # loop through all ID
{
printf "%32s %16s ",i, ID[i] # print ID and info
for (j in COLUMN)
{
printf "%14s ",DATA[i" "j]+0 # print value
}
print ""
}
}' file
Output
info spread recoil zoom concealment spread_moving damage value
bm_wp_m4_upper_reciever_round upper_reciever 0 0 0 0 0 0 1
bm_wp_m4_uupg_b_long barrel 1 0 0 -2 -2 1 4
bm_wp_upg_o_marksmansight_rear sight 1 1 1 0 0 0 3
bm_wp_upg_o_marksmansight_front extra 0 0 0 0 0 0 1
bm_wp_m4_upper_reciever_edge upper_reciever 0 1 0 0 0 0 3
bm_wp_upg_o_t1micro sight 0 1 3 0 -1 0 3

Stick with Ethan's answer — this is just me enjoying myself. (And yes, that makes me pretty weird!)
awk script
awk 'BEGIN {
# f_idx[field] holds the column number c for a field=value item
# f_name[c] holds the names
# f_width[c] holds the width of the widest value (or the field name)
# f_fmt[c] holds the appropriate format
FS = " *, *"; n = 2;
f_name[0] = "id"; f_width[0] = length(f_name[0])
f_name[1] = "type"; f_width[1] = length(f_name[1])
}
{
#-#print NR ":" $0
line[NR,0] = $1
len = length($1)
if (len > f_width[0])
f_width[0] = len
line[NR,1] = $2
len = length($2)
if (len > f_width[1])
f_width[1] = len
for (i = 3; i <= NF; i++)
{
split($i, fv, " = ")
#-#print "1:" fv[1] ", 2:" fv[2]
if (!(fv[1] in f_idx))
{
f_idx[fv[1]] = n
f_width[n++] = length(fv[1])
}
c = f_idx[fv[1]]
f_name[c] = fv[1]
gsub(/ /, "", fv[2])
len = length(fv[2])
if (len > f_width[c])
f_width[c] = len
line[NR,c] = fv[2]
#-#print c ":" f_name[c] ":" f_width[c] ":" line[NR,c]
}
}
END {
for (i = 0; i < n; i++)
f_fmt[i] = "%s%" f_width[i] "s"
#-#for (i = 0; i < n; i++)
#-# printf "%d: (%d) %s %s\n", i, f_width[i], f_name[i], f_fmt[i]
#-# pad = ""
for (j = 0; j < n; j++)
{
printf f_fmt[j], pad, f_name[j]
pad = ","
}
printf "\n"
for (i = 1; i <= NR; i++)
{
pad = ""
for (j = 0; j < n; j++)
{
printf f_fmt[j], pad, line[i,j]
pad = ","
}
printf "\n"
}
}' data
This script adapts to the data it finds in the file. It assigns the column heading 'id' to column 1 of the input, and 'type' to column 2. For each of the sets of values in columns 3..N, it splits up the data into key (in fv[1]) and value (in fv[2]). If the key has not been seen before, it is assigned a new column number, and the key is stored as the column name, and the width of key as the initial column width. Then the value is stored in the appropriate column within the line.
When all the data's read, the script knows what the column headings are going to be. It can then create a set of format strings. Then it prints the headings and all the rows of data. If you don't want fixed width output, then you can simplify the script considerably. There are some (mostly minor) simplifications that could be made to this script.
Data file
bm_wp_upg_o_t1micro, sight, value = 3, zoom = 3, recoil = 1, spread_moving = -1
bm_wp_upg_o_marksmansight_rear, sight, value = 3, zoom = 1, recoil = 1, spread = 1
bm_wp_upg_o_marksmansight_front, extra, value = 1
bm_wp_m4_upper_receiver_edge, upper_receiver, value = 3, recoil = 1
bm_wp_m4_upper_receiver_round, upper_receiver, value = 1
bm_wp_m4_uupg_b_long, barrel, value = 4, damage = 1, spread = 1, spread_moving = -2, concealment = -2
Output
id, type,value,zoom,recoil,spread_moving,spread,damage,concealment
bm_wp_upg_o_t1micro, sight, 3, 3, 1, -1, , ,
bm_wp_upg_o_marksmansight_rear, sight, 3, 1, 1, , 1, ,
bm_wp_upg_o_marksmansight_front, extra, 1, , , , , ,
bm_wp_m4_upper_receiver_edge,upper_receiver, 3, , 1, , , ,
bm_wp_m4_upper_receiver_round,upper_receiver, 1, , , , , ,
bm_wp_m4_uupg_b_long, barrel, 4, , , -2, 1, 1, -2
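The same column-discovery idea (assign a new column the first time a key is seen, then fill each row by key) can be sketched in Python as well. This is only an illustration: the function name `pivot` is made up, it returns plain lists instead of fixed-width text, and it assumes no key is literally called "id" or "type".

```python
def pivot(lines):
    """Turn 'id, type, key = value, ...' lines into a header row plus data rows."""
    columns = ["id", "type"]
    rows = []
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        row = {"id": parts[0], "type": parts[1]}
        for item in parts[2:]:
            key, _, value = item.partition("=")
            key = key.strip()
            if key not in columns:     # first time this key is seen: new column
                columns.append(key)
            row[key] = value.strip()
        rows.append(row)
    # Missing keys become empty strings so every row has the same width
    return [columns] + [[row.get(c, "") for c in columns] for row in rows]


data = [
    "bm_wp_upg_o_t1micro, sight, value = 3, zoom = 3, recoil = 1, spread_moving = -1",
    "bm_wp_upg_o_marksmansight_rear, sight, value = 3, zoom = 1, recoil = 1, spread = 1",
    "bm_wp_m4_uupg_b_long, barrel, value = 4, damage = 1, spread = 1, spread_moving = -2, concealment = -2",
]
for row in pivot(data):
    print(",".join(row))
```

Columns come out in order of first appearance, which is the ordering the awk script above produces.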

how to write bash script in ubuntu to normalize the index of text comparison

I have an input which is the result of a text comparison. It is in a very simple format. It has 3 columns: position, original text and new text.
But some of the records look like this:
4 ATCG ATCGC
10 1234 123
How can I write a short script to normalize it to
7 G GC
12 34 3
The whole original text and the whole new text are probably as below, respectively:
ACCATCGGA1234
ACCATCGCGA123
"Normalize" means "moving the position in the first column to the position where the change actually occurs", or "we remove the common prefix ATC and add its length, 3, to the first field; similarly, on line 2 the prefix we remove has length 2".

This script
awk '
    BEGIN {OFS = "\t"}
    # max_len and idx are local variables
    function common_prefix_length(str1, str2,    max_len, idx) {
        idx = 1
        if (length(str1) < length(str2))
            max_len = length(str1)
        else
            max_len = length(str2)
        # stop one short of max_len so neither string ends up empty
        while (substr(str1, idx, 1) == substr(str2, idx, 1) && idx < max_len)
            idx++
        return idx - 1
    }
    {
        len = common_prefix_length($2, $3)
        print $1 + len, substr($2, len + 1), substr($3, len + 1)
    }
' << END
4 ATCG ATCGC
10 1234 123
END
outputs
7 G GC
12 34 3
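For readers who prefer Python, the same prefix-trimming rule can be sketched as below. The function name `normalize` is hypothetical; like the awk script, it assumes at least one character should be kept in both strings, which is why the shared "G" of the first record is not trimmed.

```python
def normalize(pos, old, new):
    """Strip the common prefix of old/new, keeping at least one character of each."""
    limit = min(len(old), len(new)) - 1   # never consume the shorter string entirely
    n = 0
    while n < limit and old[n] == new[n]:
        n += 1
    return pos + n, old[n:], new[n:]


print(normalize(4, "ATCG", "ATCGC"))   # (7, 'G', 'GC')
print(normalize(10, "1234", "123"))    # (12, '34', '3')
```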

Levenshtein Distance: Inferring the edit operations from the matrix

I wrote the Levenshtein algorithm in C++.
If I input:
string s: democrat
string t: republican
I get the matrix D filled in, and the number of operations (the Levenshtein distance) can be read from D[10][8] = 8.
Beyond the filled matrix, I want to construct the optimal solution. What should this solution look like? I have no idea.
Please just show me what it should look like for this example.

The question is
Given the matrix produced by the Levenshtein algorithm, how can one find "the optimal solution"?
i.e. how can we find the precise sequence of string operations: inserts, deletes and substitution [of a single letter], necessary to convert the 's string' into the 't string'?
First, it should be noted that in many cases there are SEVERAL optimal solutions. While the Levenshtein algorithm supplies the minimum number of operations (8 in democrat/republican example) there are many sequences (of 8 operations) which can produce this conversion.
By "decoding" the Levenshtein matrix, one can enumerate ALL such optimal sequences.
The general idea is that the optimal solutions all follow a "path", from top left corner to bottom right corner (or in the other direction), whereby the matrix cell values on this path either remain the same or increase by one (or decrease by one in the reverse direction), starting at 0 and ending at the optimal number of operations for the strings in question (0 thru 8 democrat/republican case). The number increases when an operation is necessary, it stays the same when the letter at corresponding positions in the strings are the same.
It is easy to produce an algorithm which produces such a path (slightly more complicated to produce all possible paths), and from such path deduce the sequence of operations.
This path finding algorithm should start at the lower right corner and work its way backward. The reason for this approach is that we know for a fact that to be an optimal solution it must end in this corner, and to end in this corner, it must have come from one of the 3 cells either immediately to its left, immediately above it or immediately diagonally. By selecting a cell among these three cells, one which satisfies our "same value or decreasing by one" requirement, we effectively pick a cell on one of the optimal paths. By repeating the operation till we get on upper left corner (or indeed until we reach a cell with a 0 value), we effectively backtrack our way on an optimal path.
Illustration with the democrat - republican example
It should also be noted that one can build the matrix in one of two ways: with 'democrat' horizontally or vertically. This doesn't change the computation of the Levenshtein distance nor does it change the list of operations needed; it only changes the way we interpret the matrix, for example moving horizontally on the "path" either means inserting a character [from the t string] or deleting a character [off the s string] depending whether 'string s' is "horizontal" or "vertical" in the matrix.
I'll use the following matrix. The conventions are therefore (only going in the left-to-right and/or top-to-bottom directions)
a horizontal move is an INSERTION of a letter from the 't string'
a vertical move is a DELETION of a letter from the 's string'
a diagonal move is either:
    a no-operation (both letters at respective positions are the same); the number doesn't change
    a SUBSTITUTION (letters at respective positions are distinct); the number increases by one.
Levenshtein matrix for s = "democrat", t="republican"
        r  e  p  u  b  l  i  c  a  n
     0  1  2  3  4  5  6  7  8  9 10
  d  1  1  2  3  4  5  6  7  8  9 10
  e  2  2  1  2  3  4  5  6  7  8  9
  m  3  3  2  2  3  4  5  6  7  8  9
  o  4  4  3  3  3  4  5  6  7  8  9
  c  5  5  4  4  4  4  5  6  6  7  8
  r  6  5  5  5  5  5  5  6  7  7  8
  a  7  6  6  6  6  6  6  6  7  7  8
  t  8  7  7  7  7  7  7  7  7  8  8
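To reproduce this matrix yourself, a minimal Python sketch of the usual dynamic-programming fill is shown here. The function name `levenshtein_matrix` is illustrative; rows follow the 's string' and columns the 't string', matching the layout above.

```python
def levenshtein_matrix(s, t):
    """Return the (len(s)+1) x (len(t)+1) matrix of edit distances between prefixes."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all i letters of s
    for j in range(cols):
        d[0][j] = j                      # insert all j letters of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion (vertical move)
                          d[i][j - 1] + 1,         # insertion (horizontal move)
                          d[i - 1][j - 1] + cost)  # match or substitution (diagonal)
    return d


matrix = levenshtein_matrix("democrat", "republican")
for label, row in zip(" democrat", matrix):
    print(label, row)
print(matrix[-1][-1])  # 8
```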
The arbitrary approach I use to select one path among several possible optimal paths is loosely described below:
Starting at the bottom-rightmost cell, and working our way backward toward the top left:
For each "backward" step, consider the 3 cells directly adjacent to the current cell (in the left, top or left+top directions)
    if the value in the diagonal cell (going up+left) is smaller than or equal to the values found in the other two cells
    AND this value is the same as or 1 less than the value of the current cell
    then "take the diagonal cell"
        if the value of the diagonal cell is one less than the current cell:
            add a SUBSTITUTION operation (from the letters corresponding to the _current_ cell)
        otherwise: do not add an operation; this was a no-operation.
    elseif the value in the cell to the left is smaller than or equal to the value of the cell above the current cell
    AND this value is the same as or 1 less than the value of the current cell
    then "take the cell to the left", and
        add an INSERTION of the letter corresponding to the cell
    else
        take the cell above, and
        add a DELETION operation of the letter in 's string'
Following this informal pseudo-code, we get the following:
Start on the "n", "t" cell at bottom right.
Pick the [diagonal] "a", "a" cell as next destination since it is less than the other two (and satisfies the same or -1 condition).
Note that the new cell is one less than current cell
therefore the step 8 is substitute "t" with "n": democra N
Continue with "a", "a" cell,
Pick the [diagonal] "c", "r" cell as next destination...
Note that the new cell is same value as current cell ==> no operation needed.
Continue with "c", "r" cell,
Pick the [diagonal] "i", "c" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 7 is substitute "r" with "c": democ C an
Continue with "i", "c" cell,
Pick the [diagonal] "l", "o" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 6 is substitute "c" with "i": demo I can
Continue with "l", "o" cell,
Pick the [diagonal] "b", "m" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 5 is substitute "o" with "l": dem L ican
Continue with "b", "m" cell,
Pick the [diagonal]"u", "e" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 4 is substitute "m" with "b": de B lican
Continue with "u", "e" cell,
Note the "diagonal" cell doesn't qualify, because the "left" cell is less than it.
Pick the [left] "p", "e" cell as next destination...
therefore the step 3 is insert "u" after "e": de U blican
Continue with "p", "e" cell,
again the "diagonal" cell doesn't qualify
Pick the [left] "e", "e" cell as next destination...
therefore the step 2 is insert "p" after "e": de P ublican
Continue with "e", "e" cell,
Pick the [diagonal] "r", "d" cell as next destination...
Note that the new cell is same value as current cell ==> no operation needed.
Continue with "r", "d" cell,
Pick the [diagonal] "start" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 1 is substitute "d" with "r": R epublican
You've arrived at a cell whose value is 0: your work is done!
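As a sanity check, the eight steps found above can be replayed in forward order (step 1 first) to confirm that they really turn "democrat" into "republican". The sketch below is only a checking aid: the `apply_ops` helper and the 0-based positions in `steps` are my own encoding of the walkthrough, not something read off the matrix automatically.

```python
def apply_ops(word, ops):
    """Apply (kind, index, letter) edit operations to word, one after another."""
    chars = list(word)
    for kind, index, letter in ops:
        if kind == "substitute":
            chars[index] = letter
        elif kind == "insert":
            chars.insert(index, letter)
        elif kind == "delete":
            del chars[index]
        else:
            raise ValueError("unknown operation: " + kind)
    return "".join(chars)


# Steps 1 to 8 from the walkthrough; indexes refer to the word as it stands
# at the moment each step is applied.
steps = [
    ("substitute", 0, "r"),   # step 1: d -> r
    ("insert", 2, "p"),       # step 2: insert p after e
    ("insert", 3, "u"),       # step 3: insert u after p
    ("substitute", 4, "b"),   # step 4: m -> b
    ("substitute", 5, "l"),   # step 5: o -> l
    ("substitute", 6, "i"),   # step 6: c -> i
    ("substitute", 7, "c"),   # step 7: r -> c
    ("substitute", 9, "n"),   # step 8: t -> n
]
print(apply_ops("democrat", steps))  # republican
```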

The backtracking algorithm to infer the moves from the matrix implemented in python:
def _backtrack_string(matrix, output_word):
    '''
    Iteratively backtrack DP matrix to get optimal set of moves
    Inputs: DP matrix (list:list:int), rows follow the input word and
                columns follow the output word,
            Output word (str)
    Output: Optimal path (list), in forward order
    '''
    i = len(matrix) - 1
    j = len(matrix[0]) - 1
    optimal_path = []
    while i > 0 or j > 0:
        if i == 0:
            # Top border: only insertions are left
            j = j - 1
            optimal_path.append("Insert " + str(j) + ", " + str(output_word[j]))
            continue
        if j == 0:
            # Left border: only deletions are left
            i = i - 1
            optimal_path.append("Delete " + str(i))
            continue
        diagonal = matrix[i-1][j-1]
        vertical = matrix[i-1][j]
        horizontal = matrix[i][j-1]
        current = matrix[i][j]
        if diagonal <= vertical and diagonal <= horizontal and diagonal <= current:
            i = i - 1
            j = j - 1
            if diagonal == current - 1:
                optimal_path.append("Replace " + str(j) + ", " + str(output_word[j]))
            # otherwise the letters match and no move is recorded
        elif horizontal <= vertical and horizontal <= current:
            j = j - 1
            optimal_path.append("Insert " + str(j) + ", " + str(output_word[j]))
        else:
            i = i - 1
            optimal_path.append("Delete " + str(i))
    return list(reversed(optimal_path))
The output I get when I run the algorithm with original word "OPERATING" and desired word "CONSTANTINE" is the following
Insert 0, C
Replace 2, N
Replace 3, S
Replace 4, T
Insert 6, N
Replace 10, E
"" C O N S T A N T I N E
"" [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
<-- Insert 0, C
O [1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
\ Replace 2, N
P [2, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10]
\ Replace 3, S
E [3, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 9]
\ Replace 4, T
R [4, 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 10] No move
\ <-- Insert 6, N
A [5, 5, 5, 5, 5, 5, 4, 5, 6, 7, 8, 9]
\ No move
T [6, 6, 6, 6, 6, 5, 5, 5, 5, 6, 7, 8]
\ No move
I [7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 6, 7]
\ No move
N [8, 8, 8, 7, 8, 7, 7, 6, 7, 6, 5, 6]
\ Replace 10, E
G [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 6, 6]
Note that I had to add extra conditions if the element in the diagonal is the same as the current element. There could be a deletion or insertion depending on values in the vertical (up) and horizontal (left) positions. We only get a "no operation" or "replace" operation when the following occurs
# assume bottom right of a 2x2 matrix is the reference position
# and has value v
# the following is the situation where we get a replace operation
[v - 1 , v<]
[ v< , v]
# the following is the situation where we get a "no operation"
[v , v<]
[v<, v ]
I think this is where the algorithm described in the first answer could break. There could be other arrangements in the 2x2 matrix above where neither operation is correct. The example shown with input "OPERATING" and output "CONSTANTINE" breaks the algorithm unless this is taken into account.

It's been some time since I played with it, but it seems to me the matrix should look something like:
.  .  r  e  p  u  b  l  i  c  a  n
.  0  1  2  3  4  5  6  7  8  9 10
d  1  1  2  3  4  5  6  7  8  9 10
e  2  2  1  2  3  4  5  6  7  8  9
m  3  3  2  2  3  4  5  6  7  8  9
o  4  4  3  3  3  4  5  6  7  8  9
c  5  5  4  4  4  4  5  6  6  7  8
r  6  5  5  5  5  5  5  6  7  7  8
a  7  6  6  6  6  6  6  6  7  7  8
t  8  7  7  7  7  7  7  7  7  8  8
Don't take it for granted though.

Here is a VBA algorithm based on mjv's answer
(which is very well explained, but some cases were missing).
Sub TU_Levenshtein()
Call Levenshtein("democrat", "republican")
Call Levenshtein("ooo", "u")
Call Levenshtein("ceci est un test", "ceci n'est pas un test")
End Sub
Sub Levenshtein(ByVal string1 As String, ByVal string2 As String)
' Fill Matrix Levenshtein (-> array 'Distance')
Dim i As Long, j As Long
Dim string1_length As Long
Dim string2_length As Long
Dim distance() As Long
string1_length = Len(string1)
string2_length = Len(string2)
ReDim distance(string1_length, string2_length)
For i = 0 To string1_length
distance(i, 0) = i
Next
For j = 0 To string2_length
distance(0, j) = j
Next
For i = 1 To string1_length
For j = 1 To string2_length
If Asc(Mid$(string1, i, 1)) = Asc(Mid$(string2, j, 1)) Then
distance(i, j) = distance(i - 1, j - 1)
Else
distance(i, j) = Application.WorksheetFunction.min _
(distance(i - 1, j) + 1, _
distance(i, j - 1) + 1, _
distance(i - 1, j - 1) + 1)
End If
Next
Next
LevenshteinDistance = distance(string1_length, string2_length) ' for information only
' Write Matrix on VBA sheets (only for visuation, not used in calculus)
Cells.Clear
For i = 1 To UBound(distance, 1)
Cells(i + 2, 1).Value = Mid(string1, i, 1)
Next i
For i = 1 To UBound(distance, 2)
Cells(1, i + 2).Value = Mid(string2, i, 1)
Next i
For i = 0 To UBound(distance, 1)
For j = 0 To UBound(distance, 2)
Cells(i + 2, j + 2) = distance(i, j)
Next j
Next i
' One solution
current_posx = UBound(distance, 1)
current_posy = UBound(distance, 2)
Do
cc = distance(current_posx, current_posy)
Cells(current_posx + 1, current_posy + 1).Interior.Color = vbYellow ' visualisation again
' Manage border case
If current_posy - 1 < 0 Then
MsgBox ("deletion. " & Mid(string1, current_posx, 1))
current_posx = current_posx - 1
current_posy = current_posy
GoTo suivant
End If
If current_posx - 1 < 0 Then
MsgBox ("insertion. " & Mid(string2, current_posy, 1))
current_posx = current_posx
current_posy = current_posy - 1
GoTo suivant
End If
' Middle cases
cc_L = distance(current_posx, current_posy - 1)
cc_U = distance(current_posx - 1, current_posy)
cc_D = distance(current_posx - 1, current_posy - 1)
If (cc_D <= cc_L And cc_D <= cc_U) And (cc_D = cc - 1 Or cc_D = cc) Then
If (cc_D = cc - 1) Then
MsgBox "substitution. " & Mid(string1, current_posx, 1) & " by " & Mid(string2, current_posy, 1)
current_posx = current_posx - 1
current_posy = current_posy - 1
GoTo suivant
Else
MsgBox "no operation"
current_posx = current_posx - 1
current_posy = current_posy - 1
GoTo suivant
End If
ElseIf cc_L <= cc_D And cc_L = cc - 1 Then
MsgBox ("insertion. " & Mid(string2, current_posy, 1))
current_posx = current_posx
current_posy = current_posy - 1
GoTo suivant
Else
MsgBox ("deletion." & Mid(string1, current_posy, 1))
current_posx = current_posx
current_posy = current_posy - 1
GoTo suivant
End If
suivant:
Loop While Not (current_posx = 0 And current_posy = 0)
End Sub

I've done some work with the Levenshtein distance algorithm's matrix recently. I needed to produce the operations which would transform one list into another. (This will work for strings too.)
Do the following (vows) tests show the sort of functionality that you're looking for?
, "lev - complex 2"
: { topic
: lev.diff([13, 6, 5, 1, 8, 9, 2, 15, 12, 7, 11], [9, 13, 6, 5, 1, 8, 2, 15, 12, 11])
, "check actions"
: function(topic) { assert.deepEqual(topic, [{ op: 'delete', pos: 9, val: 7 },
{ op: 'delete', pos: 5, val: 9 },
{ op: 'insert', pos: 0, val: 9 },
]); }
}
, "lev - complex 3"
: { topic
: lev.diff([9, 13, 6, 5, 1, 8, 2, 15, 12, 11], [13, 6, 5, 1, 8, 9, 2, 15, 12, 7, 11])
, "check actions"
: function(topic) { assert.deepEqual(topic, [{ op: 'delete', pos: 0, val: 9 },
{ op: 'insert', pos: 5, val: 9 },
{ op: 'insert', pos: 9, val: 7 }
]); }
}
, "lev - complex 4"
: { topic
: lev.diff([9, 13, 6, 5, 1, 8, 2, 15, 12, 11, 16], [13, 6, 5, 1, 8, 9, 2, 15, 12, 7, 11, 17])
, "check actions"
: function(topic) { assert.deepEqual(topic, [{ op: 'delete', pos: 0, val: 9 },
{ op: 'insert', pos: 5, val: 9 },
{ op: 'insert', pos: 9, val: 7 },
{ op: 'replace', pos: 11, val: 17 }
]); }
}

Here is some Matlab code. Is it correct in your opinion? It seems to give the right results :)
clear all
s = char('democrat');
t = char('republican');
% Edit Matrix
m=length(s);
n=length(t);
mat=zeros(m+1,n+1);
for i=1:1:m
mat(i+1,1)=i;
end
for j=1:1:n
mat(1,j+1)=j;
end
for i=1:m
for j=1:n
if (s(i) == t(j))
mat(i+1,j+1)=mat(i,j);
else
mat(i+1,j+1)=1+min(min(mat(i+1,j),mat(i,j+1)),mat(i,j));
end
end
end
% Edit Sequence
s = char('democrat');
t = char('republican');
i = m+1;
j = n+1;
display([s ' --> ' t])
while(i ~= 1 && j ~= 1)
temp = min(min(mat(i-1,j-1), mat(i,j-1)), mat(i-1,j));
if(mat(i-1,j) == temp)
i = i - 1;
t = [t(1:j-1) s(i) t(j:end)];
disp(strcat(['iinsertion: i=' int2str(i) ' , j=' int2str(j) ' ; ' s ' --> ' t]))
elseif(mat(i-1,j-1) == temp)
if(mat(i-1,j-1) == mat(i,j))
i = i - 1;
j = j - 1;
disp(strcat(['uunchanged: i=' int2str(i) ' , j=' int2str(j) ' ; ' s ' --> ' t]))
else
i = i - 1;
j = j - 1;
t(j) = s(i);
disp(strcat(['substition: i=' int2str(i) ' , j=' int2str(j) ' ; ' s ' --> ' t]))
end
elseif(mat(i,j-1) == temp)
j = j - 1;
t(j) = [];
disp(strcat(['dddeletion: i=' int2str(i) ' , j=' int2str(j) ' ; ' s ' --> ' t]))
end
end

C# implementation of JackIsJack's answer, with some changes:
Operations are output in 'forward' order (JackIsJack outputs them in reverse order);
The last 'else' clause in the original answer worked incorrectly (it looks like a copy-paste error) and is fixed here.
Console application code:
class Program
{
static void Main(string[] args)
{
Levenshtein("1", "1234567890");
Levenshtein( "1234567890", "1");
Levenshtein("kitten", "mittens");
Levenshtein("mittens", "kitten");
Levenshtein("kitten", "sitting");
Levenshtein("sitting", "kitten");
Levenshtein("1234567890", "12356790");
Levenshtein("12356790", "1234567890");
Levenshtein("ceci est un test", "ceci n'est pas un test");
Levenshtein("ceci n'est pas un test", "ceci est un test");
}
static void Levenshtein(string string1, string string2)
{
Console.WriteLine("Levenshtein '" + string1 + "' => '" + string2 + "'");
var string1_length = string1.Length;
var string2_length = string2.Length;
int[,] distance = new int[string1_length + 1, string2_length + 1];
for (int i = 0; i <= string1_length; i++)
{
distance[i, 0] = i;
}
for (int j = 0; j <= string2_length; j++)
{
distance[0, j] = j;
}
for (int i = 1; i <= string1_length; i++)
{
for (int j = 1; j <= string2_length; j++)
{
if (string1[i - 1] == string2[j - 1])
{
distance[i, j] = distance[i - 1, j - 1];
}
else
{
distance[i, j] = Math.Min(distance[i - 1, j] + 1, Math.Min(
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + 1));
}
}
}
var LevenshteinDistance = distance[string1_length, string2_length];// for information only
Console.WriteLine($"Levenshtein distance: {LevenshteinDistance}");
// List of operations
var current_posx = string1_length;
var current_posy = string2_length;
var stack = new Stack<string>(); // for outputting messages in forward direction
while (current_posx != 0 || current_posy != 0)
{
var cc = distance[current_posx, current_posy];
// edge cases
if (current_posy - 1 < 0)
{
stack.Push("Delete '" + string1[current_posx - 1] + "'");
current_posx--;
continue;
}
if (current_posx - 1 < 0)
{
stack.Push("Insert '" + string2[current_posy - 1] + "'");
current_posy--;
continue;
}
// Middle cases
var cc_L = distance[current_posx, current_posy - 1];
var cc_U = distance[current_posx - 1, current_posy];
var cc_D = distance[current_posx - 1, current_posy - 1];
if ((cc_D <= cc_L && cc_D <= cc_U) && (cc_D == cc - 1 || cc_D == cc))
{
if (cc_D == cc - 1)
{
stack.Push("Substitute '" + string1[current_posx - 1] + "' by '" + string2[current_posy - 1] + "'");
current_posx--;
current_posy--;
}
else
{
stack.Push("Keep '" + string1[current_posx - 1] + "'");
current_posx--;
current_posy--;
}
}
else if (cc_L <= cc_D && cc_L == cc - 1)
{
stack.Push("Insert '" + string2[current_posy - 1] + "'");
current_posy--;
}
else
{
stack.Push("Delete '" + string1[current_posx - 1]+"'");
current_posx--;
}
}
while(stack.Count > 0)
{
Console.WriteLine(stack.Pop());
}
}
}
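For cross-checking the backtrace above, here is a minimal Python sketch of the same idea: build the distance matrix, walk back from the bottom-right cell, and reverse the collected operations so they read in forward order. The function name `edit_script` is hypothetical, and the tie-breaking order is an assumption of this sketch, so it may pick a different (equally short) script than the C# code does.

```python
from typing import List


def edit_script(a: str, b: str) -> List[str]:
    """Return one minimal edit script turning a into b, in forward order."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    ops = []
    i, j = len(a), len(b)
    # Walk back to (0, 0); the border cases fall out of the i > 0 / j > 0 guards.
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            ops.append(f"Keep '{a[i - 1]}'")
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(f"Substitute '{a[i - 1]}' by '{b[j - 1]}'")
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ops.append(f"Insert '{b[j - 1]}'")
            j -= 1
        else:
            ops.append(f"Delete '{a[i - 1]}'")
            i -= 1
    ops.reverse()  # collected back-to-front, so flip into forward order
    return ops


if __name__ == "__main__":
    for op in edit_script("kitten", "sitting"):
        print(op)
```

The number of entries that are not `Keep` equals the Levenshtein distance, which gives a quick sanity check against either implementation.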

The following code collects all edit paths, given the edit matrix, the source, and the target. Please leave a comment if you find any bugs. Thanks a lot!
import copy
from typing import List, Union


def edit_distance(source: Union[List[str], str],
                  target: Union[List[str], str],
                  return_distance: bool = False):
    """get the edit matrix
    """
    edit_matrix = [[i + j for j in range(len(target) + 1)] for i in range(len(source) + 1)]
    for i in range(1, len(source) + 1):
        for j in range(1, len(target) + 1):
            if source[i - 1] == target[j - 1]:
                d = 0
            else:
                d = 1
            edit_matrix[i][j] = min(edit_matrix[i - 1][j] + 1,
                                    edit_matrix[i][j - 1] + 1,
                                    edit_matrix[i - 1][j - 1] + d)
    if return_distance:
        return edit_matrix[len(source)][len(target)]
    return edit_matrix


def get_edit_paths(matrix: List[List[int]],
                   source: Union[List[str], str],
                   target: Union[List[str], str]):
    """get all the valid edit paths
    """
    all_paths = []

    def _edit_path(i, j, optimal_path):
        if i > 0 and j > 0:
            diagonal = matrix[i - 1][j - 1]  # the diagonal value
            vertical = matrix[i - 1][j]  # the above value
            horizontal = matrix[i][j - 1]  # the left value
            current = matrix[i][j]  # current value
            # whether the source and target token are the same
            flag = False
            # compute the minimal value of the diagonal, vertical and horizontal
            minimal = min(diagonal, min(vertical, horizontal))
            # if the diagonal is the minimal
            if diagonal == minimal:
                new_i = i - 1
                new_j = j - 1
                path_ = copy.deepcopy(optimal_path)
                # if the diagonal value equals current - 1
                # it means a `replace` operation
                if diagonal == current - 1:
                    path_.append(f"Replace | {new_j} | {target[new_j]}")
                    _edit_path(new_i, new_j, path_)
                # if the diagonal value equals the current value
                # and the corresponding source and target tokens are equal
                # it means this is the current best path
                elif source[new_i] == target[new_j]:
                    flag = True
                    # path_.append(f"Keep | {new_i}")
                    _edit_path(new_i, new_j, path_)
            # if the position doesn't have a best path
            # we need to consider other situations
            if not flag:
                # if the vertical value equals the minimal
                # it means delete the corresponding source value
                if vertical == minimal:
                    new_i = i - 1
                    new_j = j
                    path_ = copy.deepcopy(optimal_path)
                    path_.append(f"Delete | {new_i}")
                    _edit_path(new_i, new_j, path_)
                # if the horizontal value equals the minimal
                # it means insert the corresponding target value into source
                if horizontal == minimal:
                    new_i = i
                    new_j = j - 1
                    path_ = copy.deepcopy(optimal_path)
                    path_.append(f"Insert | {new_j} | {target[new_j]}")
                    _edit_path(new_i, new_j, path_)
        else:
            # reached as soon as i == 0 or j == 0; any inserts or deletes
            # still needed along that border are not recorded
            all_paths.append(list(reversed(optimal_path)))

    # get the rows and columns of the edit matrix
    row_len = len(matrix) - 1
    col_len = len(matrix[0]) - 1
    _edit_path(row_len, col_len, optimal_path=[])
    return all_paths


if __name__ == "__main__":
    source = "BBDEF"
    target = "ABCDF"
    matrix = edit_distance(source, target)
    print("print paths")
    paths = get_edit_paths(matrix, source=list(source), target=list(target))
    for path in paths:
        print(path)
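Since the answer asks for bug reports: the recursion above stops as soon as either index reaches 0, so inputs such as an empty source paired with a non-empty target yield a path with no operations. The following is a hedged alternative sketch, not the author's method. It follows every predecessor cell whose cost accounts for the current value and keeps walking along the borders until it reaches cell (0, 0). The names `build_matrix` and `all_edit_paths` are hypothetical; the operation strings mimic the format used above.

```python
from typing import List


def build_matrix(source: str, target: str) -> List[List[int]]:
    """Standard Levenshtein matrix: m[i][j] = distance(source[:i], target[:j])."""
    m = [[i + j for j in range(len(target) + 1)] for i in range(len(source) + 1)]
    for i in range(1, len(source) + 1):
        for j in range(1, len(target) + 1):
            d = 0 if source[i - 1] == target[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1, m[i][j - 1] + 1, m[i - 1][j - 1] + d)
    return m


def all_edit_paths(source: str, target: str) -> List[List[str]]:
    """Enumerate every minimal edit path from source to target."""
    matrix = build_matrix(source, target)
    paths: List[List[str]] = []

    def walk(i: int, j: int, ops: List[str]) -> None:
        if i == 0 and j == 0:
            # ops were collected back-to-front
            paths.append(list(reversed(ops)))
            return
        current = matrix[i][j]
        if i > 0 and j > 0:
            same = source[i - 1] == target[j - 1]
            if same and matrix[i - 1][j - 1] == current:
                walk(i - 1, j - 1, ops)  # keep: nothing recorded
            elif not same and matrix[i - 1][j - 1] + 1 == current:
                walk(i - 1, j - 1, ops + [f"Replace | {j - 1} | {target[j - 1]}"])
        if i > 0 and matrix[i - 1][j] + 1 == current:
            walk(i - 1, j, ops + [f"Delete | {i - 1}"])
        if j > 0 and matrix[i][j - 1] + 1 == current:
            walk(i, j - 1, ops + [f"Insert | {j - 1} | {target[j - 1]}"])

    walk(len(source), len(target), [])
    return paths


if __name__ == "__main__":
    for path in all_edit_paths("BBDEF", "ABCDF"):
        print(path)
```

Because each step is taken only when it explains exactly one unit of cost (or zero for a kept character), every returned path contains as many operations as the edit distance. Note that the number of paths can grow quickly for long inputs.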
