Comparison of 2 CSV files having the same column names with different data - shell

I have two CSV files, each with the same two columns. 1.csv was generated first and 2.csv was generated one hour later. I want to see the profit % increase or decrease for each business unit compared to the previous report. For example, business unit B has an increase of 50% (((15-10)/10)*100), while C has a decrease of 50%. Some new business units (AG & JK) appear only in the later report and should be treated as new. A few business units (D) are gone from the later report and can be treated as no longer required.
So basically, how can I compare the two files and extract this data?
1.csv:
Business Profit %
A 0
B 10
C 10
D 0
E 0
F 1615
G 0

2.csv (one hour later):
Business Profit %
A 0
B 15
C 5
AG 5
E 0
F 1615
G 0
JK 10
Updated requirement:
Business  Profit % (old)  Profit % (new)  Variation
A         0               0               0
B         10              15              50%
C         10              5               -50%
D         0               -               cleared
AG        -               5               New
E         0               0               0
F         1615            1615            0%
G         0               0               0%
JK        -               10              New

I'd use awk for the job, something like this:
$ awk 'NR==FNR{                                 # process file1 (the earlier report)
    a[$1]=$2                                    # hash second column, key is the first column
    next                                        # then read the next record of file1
}
{                                               # process file2 (the later report)
    if(!($1 in a))                              # if the business unit is not found in hash a
        p="new"                                 # it must be new
    else
        p=($2-a[$1])/(a[$1]==0?1:a[$1])*100     # otherwise calculate the variation %
                                                # (divide by 1 when the old value is 0 to avoid division by zero)
    print $1,p                                  # output the business unit and p%
}' file1 file2
A 0
B 50
C -50
AG new
E 0
F 0
G 0
JK new
One-liner version with appropriate semicolons:
$ awk 'NR==FNR{a[$1]=$2;next}{if(!($1 in a))p="new";else p=($2-a[$1])/(a[$1]==0?1:a[$1])*100;print $1,p}' file1 file2
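If you also want the old/new columns and the "cleared"/"New" markers from the updated requirement, the same idea can be extended along these lines (a sketch only; it assumes the files are named 1.csv and 2.csv and that each file's header line should be skipped):
$ awk '
    FNR==1  { next }                              # skip the header line in both files
    NR==FNR { old[$1]=$2; next }                  # 1.csv: remember the old profit per unit
    {
        seen[$1]=1
        if (!($1 in old))  { print $1, "-", $2, "New"; next }
        if (old[$1]==0)    { print $1, old[$1], $2, ($2==0 ? 0 : "n/a"); next }   # no baseline to divide by
        printf "%s %s %s %.0f%%\n", $1, old[$1], $2, ($2-old[$1])/old[$1]*100
    }
    END { for (b in old) if (!(b in seen)) print b, old[b], "-", "cleared" }      # units that disappeared
  ' 1.csv 2.csv
Note that the order of the END lines (the "cleared" units) is not guaranteed.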

Related

Can AWK array be used to get largest cluster in correlation matrix?

I have a matrix that describes correlation between items A-K, where 1=correlated and 0=uncorrelated.
Is there an easy way to extract the largest cluster from the data? In other words, the cluster with the most correlated elements. Below is some sample data:
# A B C D E F G H I J K
A 1 1 1 1 1 1 1 1 1 1 1
B 1 1 1 1 1 1 1 1 1 1 1
C 1 1 1 1 1 1 1 1 1 1 1
D 1 1 1 1 0 1 0 0 0 1 1
E 1 1 1 0 1 0 0 1 1 0 0
F 1 1 1 1 0 1 0 0 1 0 1
G 1 1 1 0 0 0 1 0 0 0 0
H 1 1 1 0 1 0 0 1 1 1 1
I 1 1 1 0 1 1 0 1 1 0 0
J 1 1 1 1 0 0 0 1 0 1 0
K 1 1 1 1 0 1 0 1 0 0 1
Swapping a few columns/rows by eye, the expected result would be the top left of the matrix, which is a cluster of size 6 that contains: {A, B, C, D, F, K}
I know awk isn't the most user-friendly for this application, but I'm keen on using awk since this will integrate into a larger awk script. That being said, I'm not completely immovable on the language.
Not sure where to start but here's a more complex version of what I'm thinking in python:
https://stats.stackexchange.com/questions/138325/clustering-a-correlation-matrix
Assumptions:
all matrices are symmetric (ie, square and equal to their transpose: matrix[x,y]=matrix[y,x])
matrix[x,x]=1 for all x
all matrix entries are 0 or 1
not interested in 1-element clusters
not interested in permutations of the same cluster (ie, A,B is the same as B,A)
since we don't have to worry about permutations we can focus on processing elements in the order in which they show up in the matrix (eg, we process A,B,C and ignore the equivalents of A,C,B, B,A,C, B,C,A, C,A,B and C,B,A); this allows us to focus on processing just the top/right half of the matrix (above the identity/diagonal) and in order from left to right; this will greatly reduce the number of permutations we need to evaluate
as demonstrated in the question, elements that make up a cluster can be shifted up/left in the matrix so as to fill the top/left of the matrix with 1's (this comes into play during processing where for each new element we merely need to test the equivalent of the new column/row added to this top/left portion of the matrix)
Regarding the last assumption ... assume we have cluster A,D and we now want to test A,D,F; we just need to test the new column/row entries (?):
Current Cluster            New Cluster ?

    A  D                       A  D  F
A   1  1                   A   1  1  ?      # if matrix is symmetric then only need to test
D   1  1                   D   1  1  ?      # the new column *OR* the new row, not both;
                           F   ?  ?  1      # bottom/right == 1 == matrix[F,F] per earlier assumption
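In awk terms, that per-candidate test boils down to checking the one new column against the members already collected; a small isolated sketch (the names m, stack and extends are illustrative and not part of the full script that follows):
# m[row][col] holds the 0/1 matrix (GNU awk array of arrays),
# stack[] holds the indexes already accepted into the current cluster
function extends(j,    k) {        # k is local to the function
    for (k in stack)
        if (m[stack[k]][j] == 0)   # symmetric matrix: testing the new column is enough
            return 0               # one 0 means candidate j does not extend the cluster
    return 1
}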
One idea using a recursive function and two GNU awk features: a) arrays of arrays (aka multi-dimensional arrays) and b) PROCINFO["sorted_in"] for custom ordering of the clusters written to stdout:
awk '
######
# load matrix into memory
FNR==1 { n=(NF-1) # number of elements to be processed
for (i=2;i<=NF;i++)
label[i-1]=$i # save labels
next
}
{ for (i=2;i<=NF;i++)
m[FNR-1][i-1]=$i # populate matrix array m[row#][column#]
}
######
# define our recursive function
function find_cluster(cluster, i, clstrcount, stackseq, j, k, corrcount) {
# cluster : current working cluster (eg, "A,B,C")
# i : index of latest element (eg, for "A,B,C" the latest element is "C" so i = 3)
# clstrcount : number of elements in current cluster
# stackseq : sequence number of stack[] array
# : stack[] contains list of indexes for current cluster (for "A,B,C" stack = "1,2,3")
# j,k,corrcount : declaring additional variables as "local" to this invocation of the function
clstrcount++ # number of elements to be processed at this call/level
for (j=i+1;j<=n;j++) { # process all elements/indexes greater than i
corrcount=1 # reset correlation count; always start with 1 since m[j][j]=1
# check the new column/row added to the top/left of the matrix to see if it extends the current cluster (ie, all entries are "1")
for (k in stack) { # loop through element/indexes in stack
if (m[stack[k]][j]) # check column entries
corrcount++
if (m[j][stack[k]]) # check row entries; not necessary if matrix is symmetric but we will add it here to show the m[][] references
corrcount++
}
if (corrcount == (stackseq*2 +1) ) { # if we have all "1"s we have a new cluster of size clstrcount
stack[++stackseq]=j # "push" current element/index on stack; increment stack seq/index
cluster=cluster "," label[j] # add current element/label to cluster
max= (clstrcount>max) ? clstrcount : max # new max(cluster count) ?
clusters[clstrcount][++clsterseq]=cluster # add new cluster to our master list: clusters[cluster_count][seq]
find_cluster(cluster, j, clstrcount, stackseq) # recursive call to check for next element(s)
delete stack[stackseq--] # back from recursive call so "pop" current element (j) from stack
gsub(/[,][^,]+$/,"",cluster) # remove current element/label from cluster to make way for next element/label to be tested
}
}
}
######
# start looking for clusters of size 2+
END { max=2 # not interested in clusters of "1"
for (i=1;i<n;i++) { # loop through list of elements
clstrcount=1 # init cluster count = 1
clstrseq=0 # init clusters[...][seq] sequence seed
cluster=label[i] # reset cluster to current element/label
stackseq=1 # reset stack[seq] sequence seed
stack[stackseq]=i # "push" current element on stack
find_cluster(cluster, i, clstrcount, stackseq) # start recursive calls looking for next element in cluster
}
######
# for now just display clusters with size > 2; adjust this next line to add/remove cluster sizes from stdout
if (max>2) # print list of clusters with length > 2
for (i=max;i>2;i--) { # print from largest to smallest and ...
PROCINFO["sorted_in"]="#val_str_asc" # in alphabetical order
printf "####### clusters of size %s:\n", i
for (j in clusters[i]) # loop through all entries for clusters of size "i"
print clusters[i][j]
}
}
' matrix.dat
NOTE: The current version is (admittedly) a bit verbose ... the result of jotting down a first-pass solution as I was working through the details; with some further analysis it may be possible to reduce the code. Having said that, the time it takes to find all 2+ sized clusters in this 11-element matrix isn't too bad:
real 0m0.084s
user 0m0.031s
sys 0m0.046s
This generates:
####### clusters of size 6:
A,B,C,D,F,K
A,B,C,E,H,I
####### clusters of size 5:
A,B,C,D,F
A,B,C,D,J
A,B,C,D,K
A,B,C,E,H
A,B,C,E,I
A,B,C,F,I
A,B,C,F,K
A,B,C,H,I
A,B,C,H,J
A,B,C,H,K
A,B,D,F,K
A,B,E,H,I
A,C,D,F,K
A,C,E,H,I
B,C,D,F,K
B,C,E,H,I
####### clusters of size 4:
A,B,C,D
A,B,C,E
A,B,C,F
A,B,C,G
A,B,C,H
A,B,C,I
A,B,C,J
A,B,C,K
A,B,D,F
A,B,D,J
A,B,D,K
A,B,E,H
A,B,E,I
A,B,F,I
A,B,F,K
A,B,H,I
A,B,H,J
A,B,H,K
A,C,D,F
A,C,D,J
A,C,D,K
A,C,E,H
A,C,E,I
A,C,F,I
A,C,F,K
A,C,H,I
A,C,H,J
A,C,H,K
A,D,F,K
A,E,H,I
B,C,D,F
B,C,D,J
B,C,D,K
B,C,E,H
B,C,E,I
B,C,F,I
B,C,F,K
B,C,H,I
B,C,H,J
B,C,H,K
B,D,F,K
B,E,H,I
C,D,F,K
C,E,H,I
####### clusters of size 3:
A,B,C
A,B,D
A,B,E
A,B,F
A,B,G
A,B,H
A,B,I
A,B,J
A,B,K
A,C,D
A,C,E
A,C,F
A,C,G
A,C,H
A,C,I
A,C,J
A,C,K
A,D,F
A,D,J
A,D,K
A,E,H
A,E,I
A,F,I
A,F,K
A,H,I
A,H,J
A,H,K
B,C,D
B,C,E
B,C,F
B,C,G
B,C,H
B,C,I
B,C,J
B,C,K
B,D,F
B,D,J
B,D,K
B,E,H
B,E,I
B,F,I
B,F,K
B,H,I
B,H,J
B,H,K
C,D,F
C,D,J
C,D,K
C,E,H
C,E,I
C,F,I
C,F,K
C,H,I
C,H,J
C,H,K
D,F,K
E,H,I

gsub many columns simultaneously based on different gsub conditions?

I have a file with the following data-
Input-
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
If any cell in rows 2 onward has the same letter as the corresponding cell in row 1, it should be changed to 1. Basically, I'm trying to find out how similar each of the rows is to the first row.
Desired Output-
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
The first row has become all 1 since it is identical to itself (obviously). In the second row, the first and second columns are identical to the first row (A B) and hence they become 1 1. And so on for the other rows.
I have written the following code which does this transformation-
for seq in {1..1} ; #Iterate over the rows (in this case just row 1)
do
for position in {1..6} ; #Iterate over the columns
do
#Define the letter in the first row with which I'm comparing the rest of the rows
aa=$(awk -v pos=$position -v line=$seq 'NR == line {print $pos}' f)
#If it matches, gsub it to 1
awk -v var=$aa -v pos=$position '{gsub (var, "1", $pos)} 1' f > temp
#Save this intermediate file and now act on this
mv temp f
done
done
As you can imagine, this is really slow because that nested loop is expensive. My real data is a 60x10000 matrix and it takes about 2 hours for this program to run on that.
I was hoping you could help me get rid of the inner loop so that I can do all 6 gsubs in a single step. Maybe putting them in an array of their own? My awk skills aren't that great yet.
You can use this simpler awk command to do the job, which will be much faster since it avoids the nested shell loops and the repeated awk invocations:
awk '{for (i=1; i<=NF; i++) {if (NR==1) a[i]=$i; if (a[i]==$i) $i=1} } 1' file
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
EDIT:
As per the comments below, here is how you can also get, for each row, the sum of the converted values (i.e. how many cells match row 1):
awk '{sum=0; for (i=1; i<=NF; i++) { if (NR==1) a[i]=$i; if (a[i]==$i) $i=1; sum+=$i}
print $0, sum}' file
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3
Input
$ cat f
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
Desired o/p
$ awk 'FNR==1{split($0,a)}{for(i=1;i<=NF;i++)if (a[i]==$i) $i=1}1' f
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
Explanation
FNR==1{ .. }
When awk reads first record of current file, do things inside braces
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces
in array and the separator strings in the seps array.
split($0,a)
split the current record or row ($0) into pieces by fieldsep (default whitespace, as
we have not supplied 3rd argument) and store the pieces in array a
So array a contains data from first row
a[1] = A
a[2] = B
a[3] = C
a[4] = D
a[5] = E
a[6] = F
for(i=1;i<=NF;i++)
Loop through all the fields of each record of the file, till end of file.
if (a[i]==$i) $i=1
if the first row's value at the current index (i) is equal to
the current row's value in that column, set the current column value to 1 (meaning modify the field in place)
Having modified the fields, all that remains is to print the modified row
}1
1 always evaluates to true, so awk performs the default action, {print $0}
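The same `1` idiom in isolation (a minimal demo, separate from the answer's command):
$ printf 'x\ny\n' | awk '1'    # the bare pattern 1 is always true, so the default action prints every line
x
y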
For the update requested in a comment:
Same question here, I have a second part of the program that adds up
the numbers in the rows. I.e. You would get 6, 2, 4, 2, 2, 3 for this
output. Can your program be tweaked to get these values out at this
step itself?
$ awk 'FNR==1{split($0,a)}{s=0;for(i=1;i<=NF;i++)if(a[i]==$i)s+=$i=1;print $0,s}' f
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3

Subsetting Data with GREP

I have a very large text file (16GB) that I want to subset as fast as possible.
Here is a sample of the data involved
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
2 M 15 1
2 0 Q 0 17143989 4219157,1841361,853923,1720163,1912374,1755325,4454730 65548702,4975721 197782,39086 54375043,4396765 31589696,3091097 6876504,851594 3374640,455375 13274885,1354902 31585771,3091016 61234218,4723345 31583582,3091014
2 27 C 0 31589696
The first number on every line is a sessionID, and any line with an 'M' denotes the start of a session (the data is grouped by session). The number following the 'M' is a day and the second number is a userID; a user can have multiple sessions.
I want to extract all lines related to a specific user which for each session include all of the lines up until the next 'M' line is encountered (can be any number of lines). As a second task I also want to extract all session lines related to a specific day.
For example with the above data, to extract the records for userid '0' the output would be:
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
To extract the records for day 7 the output would be:
1 M 7 0
1 0 Q 0 17143989
I believe there is a much more elegant and simple solution to what I have achieved so far and it would be great to get some feedback and suggestions. Thank you.
What I have tried
I tried to use pcregrep -M to apply this pattern directly (matching data between two 'M' lines) but struggled to get it working across the line breaks. I still suspect this may be the fastest option so any guidance on whether this may be possible would be great.
The next part is quite scattered and it is not necessary to read on if you already have an idea for a better solution!
Failing the above, I split the problem into two parts:
Part 1: Isolating all 'M' lines to obtain a list of sessions belonging to that user/day
grep method is fast (then need to figure out how to use this data)
time grep -c "M\t.*\t$user_id" trainSample.txt >> sessions.txt
awk method to create an array is slow
time myarr=$(awk '/M\t.*\t$user_id/ {print $1}' trainSample.txt)
Part 2: Extracting all lines belonging to a session on the list created in part 1
Continuing from the awk method, I ran grep for each but this is WAY too slow (days to complete 16GB)
for i in "${!myarr[@]}";
do
grep "^${myarr[$i]}\t" trainSample.txt >> sessions.txt
echo -ne "Session $i\r"
done
Instead of running grep once per session ID as above, putting them all into one grep command is MUCH faster (I ran it with 8 session IDs in a [1|2|3|..|8] format and it took the same time as a single one did, i.e. 8X faster). However, I then need to figure out how to build that combined pattern dynamically.
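For the record, one way to build such a combined pattern dynamically from the bash array is sketched below (it assumes GNU grep with -P so that \t and the alternation are understood; myarr and the file names are as in the attempts above):
pattern=$(IFS='|'; echo "${myarr[*]}")            # join the session IDs with |, e.g. 12|34|56
grep -P "^(${pattern})\t" trainSample.txt >> sessions.txt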
Update
I have actually established a working solution which only takes seconds to complete, but it is somewhat messy and inflexible bash code which I have yet to extend to the second case (isolating by day).
I want to extract all lines related to a specific user which for each session include all of the lines up until the next 'M' line is encountered (can be any number of lines).
$ awk '$2=="M"{p=$4==0}p' file
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
As a second task I also want to extract all session lines related to a specific day.
$ awk '$2=="M"{p=$3==7}p' file
1 M 7 0
1 0 Q 0 17143989
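To avoid hard-coding the user or the day, the same one-liners can take them as awk variables (a small sketch; the variable names user and day are illustrative):
$ awk -v user=0 '$2=="M"{p=($4==user)} p' file    # sessions belonging to a given user
$ awk -v day=7  '$2=="M"{p=($3==day)} p' file     # sessions belonging to a given day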

How to separate lines depending on the value in column 1

I have a text file that contains the following (a b c d etc... contains some random values):
1 a
1 b
2 c
2 d
2 e
2 f
6 g
6 h
6 i
12 j
12 k
Is there a way to insert a separator line made of some characters whenever the value in the first column changes, knowing that those numbers will always be increasing but may jump by varying amounts? The separation would happen each time the first field increases, going from 1 to 2, then from 2 to 6, etc.
The output would be like this (here I would like to use ---------- as a separation):
1 a
1 b
----------
2 c
2 d
2 e
2 f
----------
6 g
6 h
6 i
----------
12 j
12 k
awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }'
If it isn't the first line and the value in old isn't the same as in $1, print the separator. Then unconditionally print the current line, and record the value of $1 in old so that we remember for next time. Repeat until done.
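If the separator string needs to change, it can be passed in as a variable instead of being hard-coded (a small sketch; sep is an illustrative name):
$ awk -v sep='----------' 'NR>1 && old != $1 { print sep } { print; old = $1 }' file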

Unix / Shell Add a range of columns to file

So I've been trying the same problem for the last few days, and I'm at a formatting road block.
I have a program that will only run if it's working on an equal number of columns in every row. I know the total column count and the number of columns that need to be added with a filler value of 0, but I am not sure how to do this. Is there some type of range option with awk or sed for this?
Input:
A B C D E
A B C D E 1 1 1 1
Output:
A B C D E 0 0 0 0
A B C D E 1 1 1 1
The alphabet columns are always present (with different values), but this "fill in the blank" function is eluding me. I can't use R for this due to the data file size.
One way using awk:
$ awk 'NF!=n{for(i=NF+1;i<=n;i++)$i=0}1' n=9 file
A B C D E 0 0 0 0
A B C D E 1 1 1 1
Just set n to the number of columns you want to pad up to.
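If the target column count is not known in advance, a two-pass variant (a sketch; it simply reads the same file twice) can discover the widest row first and then pad:
$ awk 'NR==FNR { if (NF>n) n=NF; next }          # pass 1: find the maximum number of columns
       NF<n    { for (i=NF+1; i<=n; i++) $i=0 }  # pass 2: pad shorter rows with 0
       1' file file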
