Is there a way to iterate over values of a column then check if it's present elsewhere? - shell

I have generated 2 .csv files, one containing the original md5sums of some files in a directory and one containing the md5sums calculated at a specific moment.
md5_original.csv
----------
$1 $2 $3
7815696ecbf1c96e6894b779456d330e,,s1.txt
912ec803b2ce49e4a541068d495ab570,,s2.txt
040b7cf4a55014e185813e0644502ea9,,s64.txt
8a0b67188083b924d48ea72cb187b168,,b43.txt
etc.
md5_$current_date.csv
----------
$1 $2 $3
7815696ecbf1c96e6894b779456d330e,,s1.txt
4d4046cae9e9bf9218fa653e51cadb08,,s2.txt
3ff22b3585a0d3759f9195b310635c29,,b43.txt
etc.
* some files could be deleted when calculating current md5sums
I am looking to iterate over the values of column $3 in md5_$current_date.csv and, for each value of that column, check whether it exists in md5_original.csv; if it does, compare the corresponding values of $1.
Output should be:
s2.txt hash changed from 912ec803b2ce49e4a541068d495ab570 to 4d4046cae9e9bf9218fa653e51cadb08.
b43.txt hash changed from 8a0b67188083b924d48ea72cb187b168 to 3ff22b3585a0d3759f9195b310635c29.
I have written the script that builds these two .csv files, but I am struggling with the awk part where I have to do what I asked above. I don't know if there is a better way to do this; I am a newbie.

I would use GNU AWK for this task in the following way. Let md5_original.csv content be
7815696ecbf1c96e6894b779456d330e {BLANK_COLUMN} s1.txt
912ec803b2ce49e4a541068d495ab570 {BLANK_COLUMN} s2.txt
040b7cf4a55014e185813e0644502ea9 {BLANK_COLUMN} s64.txt
8a0b67188083b924d48ea72cb187b168 {BLANK_COLUMN} b43.txt
and md5_current.csv content be
7815696ecbf1c96e6894b779456d330e {BLANK_COLUMN} s1.txt
4d4046cae9e9bf9218fa653e51cadb08 {BLANK_COLUMN} s2.txt
3ff22b3585a0d3759f9195b310635c29 {BLANK_COLUMN} b43.txt
then
awk 'FNR==NR{arr[$3]=$1;next}($3 in arr)&&($1 != arr[$3]){print $3 " hash changed from " arr[$3] " to " $1}' md5_original.csv md5_current.csv
output
s2.txt hash changed from 912ec803b2ce49e4a541068d495ab570 to 4d4046cae9e9bf9218fa653e51cadb08
b43.txt hash changed from 8a0b67188083b924d48ea72cb187b168 to 3ff22b3585a0d3759f9195b310635c29
Explanation: FNR is the row number within the current file, while NR is the global row number; these are equal only while processing the 1st file. When processing the 1st file I create the array arr with filenames as keys and the corresponding hashes as values; next causes GNU AWK to go to the next line, i.e. no other action is undertaken, so the rest of the program applies only to all files but the first. ($3 in arr) is the condition: is the current $3 one of the keys of arr? If both conditions hold, I print the concatenation of the current $3 (the filename), the string " hash changed from ", the value stored under key $3 in arr (the old hash), the string " to ", and $1 (the current hash). If a given filename is not present in arr, no action is undertaken.
Edit: added an exclusion for hashes which have not changed, as suggested in a comment.
(tested in gawk 4.2.1)
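Since the question notes that some files may be deleted between the two runs, the same idiom extends to reporting those as well. A sketch (the "is missing" message wording is my own choice, not from the question):

```shell
# Recreate the answer's sample inputs ({BLANK_COLUMN} stands in for the empty field)
printf '%s\n' \
  '7815696ecbf1c96e6894b779456d330e {BLANK_COLUMN} s1.txt' \
  '912ec803b2ce49e4a541068d495ab570 {BLANK_COLUMN} s2.txt' \
  '040b7cf4a55014e185813e0644502ea9 {BLANK_COLUMN} s64.txt' \
  '8a0b67188083b924d48ea72cb187b168 {BLANK_COLUMN} b43.txt' > md5_original.csv
printf '%s\n' \
  '7815696ecbf1c96e6894b779456d330e {BLANK_COLUMN} s1.txt' \
  '4d4046cae9e9bf9218fa653e51cadb08 {BLANK_COLUMN} s2.txt' \
  '3ff22b3585a0d3759f9195b310635c29 {BLANK_COLUMN} b43.txt' > md5_current.csv

# As before, but also remember which filenames appear in the current file,
# then report any original filename that never showed up.
awk 'FNR==NR{arr[$3]=$1; next}
     {seen[$3]=1}
     ($3 in arr) && ($1 != arr[$3]) {print $3 " hash changed from " arr[$3] " to " $1}
     END{for (f in arr) if (!(f in seen)) print f " is missing from the current run"}' \
  md5_original.csv md5_current.csv
```

With the sample data this reports the two changed hashes plus s64.txt, which exists only in the original file.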

Update column in file based on associative array value in bash

So I have a file named testingFruits.csv with the following columns:
name,value_id,size
apple,1,small
mango,2,small
banana,3,medium
watermelon,4,large
I also have an associative array that stores the following data:
fruitSizes[apple] = xsmall
fruitSizes[mango] = small
fruitSizes[banana] = medium
fruitSizes[watermelon] = xlarge
Is there any way I can update the 'size' column within the file, based on the data within the associative array, for each value in the 'name' column?
I've tried using awk but I had no luck. Here's a sample of what I tried to do:
awk -v t="${fruitSizes[*]}" 'BEGIN{n=split(t,arrayval,""); ($1 in arrayval) {$3=arrayval[$1]}' "testingFruits.csv"
I understand this command would get the bash defined array fruitSizes, do a split on all the values, then check if the first column (name) is within the fruitSizes array. If it is, then it would update the third column (size) with the value found in fruitSizes for that specific name.
Unfortunately this gives me the following error:
Argument list too long
This is the expected output I'd like in the same testingFruits.csv file:
name,value_id,size
apple,1,xsmall
mango,2,small
banana,3,medium
watermelon,4,xlarge
One edge case I'd like to handle is the presence of duplicate values in the name column with different values for the value_id and size columns.
If you want to stick to an awk script, pass the array via stdin to avoid running into ARG_MAX issues.
Since your array is associative, listing only the values ${fruitSizes[@]} is not sufficient. You also need the keys ${!fruitSizes[@]}. pr -2 can pair the keys and values in one line.
This assumes that ${fruitSizes[@]} and ${!fruitSizes[@]} expand in the same order, and that your keys and values are free of the field separator (, in this case).
printf %s\\n "${!fruitSizes[@]}" "${fruitSizes[@]}" | pr -t -2 -s, |
awk -F, -v OFS=, 'NR==FNR {a[$1]=$2; next} $1 in a {$3=a[$1]} 1' - testingFruits.csv
However, I'm wondering where the array fruitSizes comes from. If you read it from a file or something like that, it would be easier to leave out the array altogether and do everything in awk.
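For completeness, a sketch of that all-awk route, assuming the pairs already live in a file (sizes.csv is a hypothetical name, not from the question):

```shell
# Hypothetical input: the fruit/size pairs kept in a file instead of a bash array
printf '%s\n' 'apple,xsmall' 'mango,small' 'banana,medium' 'watermelon,xlarge' > sizes.csv
printf '%s\n' 'name,value_id,size' 'apple,1,small' 'mango,2,small' \
              'banana,3,medium' 'watermelon,4,large' > testingFruits.csv

# Same NR==FNR idiom: load sizes.csv into a, then rewrite column 3 of testingFruits.csv.
# The header passes through untouched because "name" is not a key in a.
awk -F, -v OFS=, 'NR==FNR {a[$1]=$2; next} $1 in a {$3=a[$1]} 1' sizes.csv testingFruits.csv
```

Redirect to a temporary file and move it over testingFruits.csv to update in place.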

Add column from one file to another based on multiple matches while retaining unmatched

So I am really new to this kind of stuff (seriously, sorry in advance) but I figured I would post this question since it is taking me some time to solve it and I'm sure it's a lot more difficult than I am imagining.
I have the file small.csv:
id,name,x,y,id2
1,john,2,6,13
2,bob,3,4,15
3,jane,5,6,17
4,cindy,1,4,18
and another file big.csv:
id3,id4,name,x,y
100,{},john,2,6
101,{},bob,3,4
102,{},jane,5,6
103,{},cindy,1,4
104,{},alice,7,8
105,{},jane,0,3
106,{},cindy,1,7
The problem with this is I am attempting to put id2 of the small.csv into the id4 column of the big.csv only if the name AND x AND y match. I have tried using different awk and join commands in Git Bash but am coming up short. Again I am sorry for the newbie perspective on all of this but any help would be awesome. Thank you in advance.
EDIT: Sorry, this is what the final desired output should look like:
id3,id4,name,x,y
100,{13},john,2,6
101,{15},bob,3,4
102,{17},jane,5,6
103,{18},cindy,1,4
104,{},alice,7,8
105,{},jane,0,3
106,{},cindy,1,7
And one of the latest trials I did was the following:
$ join -j 1 -o 1.5,2.1,2.2,2.3,2.4,2.5 <(sort -k2 small.csv) <(sort -k2 big.csv)
But I received this error:
join: /dev/fd/63: No such file or directory
Probably not trivial to solve with join but fairly easy with awk:
awk -F, -v OFS=, ' # set input and output field separators to comma
# create lookup table from lines of small.csv
NR==FNR {
# ignore header
# map columns 2/3/4 to column 5
if (NR>1) lut[$2,$3,$4] = $5
next
}
# process lines of big.csv
# if lookup table has mapping for columns 3/4/5, update column 2
v = lut[$3,$4,$5] {
$2 = "{" v "}"
}
# print (possibly-modified) lines of big.csv
1
' small.csv big.csv >bignew.csv
Code assumes small.csv contains only one line for each distinct column 2/3/4.
NR==FNR { ...; next } is a way to process contents of the first file argument. (FNR is less than NR when processing lines from second and subsequent file arguments. next skips execution of the remaining awk commands.)
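As a quick smoke test (nothing beyond the answer's code), the script can be run against the sample files from the question:

```shell
# Sample inputs from the question
printf '%s\n' 'id,name,x,y,id2' '1,john,2,6,13' '2,bob,3,4,15' \
              '3,jane,5,6,17' '4,cindy,1,4,18' > small.csv
printf '%s\n' 'id3,id4,name,x,y' '100,{},john,2,6' '101,{},bob,3,4' '102,{},jane,5,6' \
              '103,{},cindy,1,4' '104,{},alice,7,8' '105,{},jane,0,3' '106,{},cindy,1,7' > big.csv

awk -F, -v OFS=, '
NR==FNR { if (NR>1) lut[$2,$3,$4] = $5; next }  # build name/x/y -> id2 table from small.csv
v = lut[$3,$4,$5] { $2 = "{" v "}" }            # match found: fill in id4
1                                               # print every big.csv line
' small.csv big.csv > bignew.csv
cat bignew.csv
```

The four matched rows get their id2 filled in; alice and the non-matching jane/cindy rows keep the empty {}.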

Initialize an Array inside AWK Command and use the Array to Print using AWK

I'm trying to do a comparison of the data in 2 files and print certain output from it.
My objective here is mainly to initialize an array containing some values inside the same awk statement and use it for printing.
Below is the command I am using, which I feel has some syntax error.
Please help with the awk part: how I should define the array, and how I can use it inside the statement.
Command tried -
paste -d "|" filedata.txt tabdata.txt | awk -F '|' '{array=("RE_LOG_ID" "FILE_RUN_ID" "FH_RECORDTYPE" "FILECATEGORY")}' '{c=NF/2;for(i=1;i<=c;i++)if($i!=$(i+c))printf "%s|%s|%s|%s\n",$1,${array[i]},$i,$(i+c)}'
SAMPLE INPUT FILE
filedata.txt
A|1|2|3
B|2|3|4
tabdata.txt
A|1|4|3
B|2|3|7
So the output I want is:
A|FH_RECORDTYPE|2|4
B|FILECATEGORY|4|7
The output comprises the differences:
PRIMARYKEY|COLUMNNAME|FILE1DATA|FILE2DATA
I want the array to be initialized inside the awk statement as array=("RE_LOG_ID" "FILE_RUN_ID" "FH_RECORDTYPE" "FILECATEGORY"), corresponding to the column names.
The condition for fetching the column name from the array will be ($i!=$(i+c)): for whichever i-th position mismatches, I will print the i-th element from the array.
Finding the differences works perfectly if I remove the array part from my command, but my ask is to initialize an array containing the column names and print from it within the awk statement.
I just need help with how to incorporate the array part within awk.
Unfortunately, arrays in awk cannot be assigned with a literal as you expect. As an alternative, you can use the split function:
split("RE_LOG_ID FILE_RUN_ID FH_RECORDTYPE FILECATEGORY", array, " ")
(The otherwise-optional third argument " " is needed here because FS has been overwritten to |.)
Then your command will look like:
paste -d "|" filedata.txt tabdata.txt | awk -F '|' '
BEGIN {split("RE_LOG_ID FILE_RUN_ID FH_RECORDTYPE FILECATEGORY", array, " ")}
{
c= NF/2;
for(i=1; i<=c; i++)
if ($i != $(i+c))
printf "%s|%s|%s|%s\n", $1, array[i], $i, $(i+c);
}'
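Run against the sample files from the question, this produces exactly the two expected difference lines:

```shell
# Sample inputs from the question
printf '%s\n' 'A|1|2|3' 'B|2|3|4' > filedata.txt
printf '%s\n' 'A|1|4|3' 'B|2|3|7' > tabdata.txt

# paste joins corresponding lines, so fields 1..c come from filedata.txt
# and fields c+1..NF from tabdata.txt; mismatches are reported by column name.
paste -d "|" filedata.txt tabdata.txt | awk -F '|' '
BEGIN {split("RE_LOG_ID FILE_RUN_ID FH_RECORDTYPE FILECATEGORY", array, " ")}
{
  c = NF/2
  for (i=1; i<=c; i++)
    if ($i != $(i+c))
      printf "%s|%s|%s|%s\n", $1, array[i], $i, $(i+c)
}'
```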

Match a single column entry in one file to a column entry in a second file that consists of a list

I need to match a single column entry in one file to a column entry in a second file that consists of a list (in shell). The awk command I've used only matches to the first word of the list, and doesn't scan through the entire list in the column field.
File 1 looks like this:
chr1:725751 LOC100288069
rs3131980 LOC100288069
rs28830877 LINC01128
rs28873693 LINC01128
rs34221207 ATP4A
File 2 looks like this:
Annotation Total Genes With Ann Your Genes With Ann) Your Genes No Ann) Genome With Ann) Genome No Ann) ln
1 path hsa00190 Oxidative phosphorylation 55 55 1861 75 1139 5.9 9.64 0 0 ATP12A ATP4A ATP5A1 ATP5E ATP5F1 ATP5G1 ATP5G2 ATP5G3 ATP5J ATP5O ATP6V0A1 ATP6V0A4 ATP6V0D2 ATP6V1A ATP6V1C1 ATP6V1C2 ATP6V1D ATP6V1E1 ATP6V1E2 ATP6V1G3 ATP6V1H COX10 COX17 COX4I1 COX4I2 COX5A COX6B1 COX6C COX7A1 COX7A2 COX7A2L COX7C COX8A NDUFA5 NDUFA9 NDUFB3 NDUFB4 NDUFB5 NDUFB6 NDUFS1 NDUFS3 NDUFS4 NDUFS5 NDUFS6 NDUFS8 NDUFV1 NDUFV3 PP PPA2 SDHA SDHD TCIRG1 UQCRC2 UQCRFS1 UQCRH
Expected output:
rs34221207 ATP4A hsa00190
(please excuse the formatting - all the columns are tab-delimited until the column of gene names, $14, called Genome...)
My command is this:
awk 'NR==FNR{a[$14]=$3; next}a[$2]{print $0 "\t" a[$2]}' file2 file1
All help will be much appreciated!
You need to process files in the other order, and loop over your list:
awk 'NR==FNR{a[$2]=$1; next} {for(i=15;i<=NF;++i)if(a[$i]){print a[$i] "\t" $i "\t" $3}}' file1 file2
Explanation:
NR is a global "record number" counter that awk increments for each line read from each file. FNR is a per-file "record number" that awk resets to 1 on the first line of each file. So the NR==FNR condition is true for lines in the first file and false for lines in subsequent files; it is an awk idiom for picking out just the first file's lines. In this case, a[$2]=$1 stores the first field's text keyed by the second field's text. The next tells awk to stop short on the current line and to read and continue processing normally with the next line.

A next at the end of the first action clause like this works like an ELSE condition on the remaining code, if awk had such a syntax (which it doesn't): NR==FNR{a[$2]=$1} ELSE {for.... Clearer, and only slightly less time-efficient, would have been to write NR==FNR{a[$2]=$1}NR!=FNR{for....

Now to the second action clause. No condition preceding it means awk will run it for every line not short-circuited by the preceding next — that is, all lines in files other than the first, which here means file2 only. Your file2 has a list of potential keys starting in field #15 and extending to the last field. The awk built-in variable for the last field number is NF (number of fields), so the for loop simply iterates over those field numbers. For each such number i we ask whether the text in field $i is a known key from the first file — that is, whether a[$i] is set and evaluates to a non-empty (non-false) string. If so, we have our file1 first field in a[$i], the matching file1 second field in $i, and the file2 field of interest in $3 (the text of the current file2 3rd field), and we print them tab-separated.

A next here would be an efficiency-only measure that stops all processing of the file2 record once a match is found. If your file2 key list might contain duplicates and you want duplicate output lines on a match against such a duplicate, you must leave that next out.
Actually now that I look again, you probably do want to find any multiple matches even on non-duplicates, so I have removed the 2nd next from the code.
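As a quick check, here is the command against the sample data, with file2's header and gene list trimmed for brevity (the real line continues as shown in the question):

```shell
printf '%s\n' \
  'chr1:725751 LOC100288069' \
  'rs3131980 LOC100288069' \
  'rs28830877 LINC01128' \
  'rs28873693 LINC01128' \
  'rs34221207 ATP4A' > file1
# file2's data line keeps the 14 leading fields; gene list shortened for the demo
printf '%s\n' \
  'Annotation Total Genes ...' \
  '1 path hsa00190 Oxidative phosphorylation 55 55 1861 75 1139 5.9 9.64 0 0 ATP12A ATP4A COX10' > file2

awk 'NR==FNR{a[$2]=$1; next} {for(i=15;i<=NF;++i)if(a[$i]){print a[$i] "\t" $i "\t" $3}}' file1 file2
```

Only ATP4A (field 16 of the data line) is a known key, so the single output line is the expected rs34221207, ATP4A, hsa00190 triple.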

make math operation from multiple files with shell scripting

I have multiple files, let's say
fname1 contains:
red=5
green=10
yellow=2
fname2 contains:
red=10
green=2
yellow=2
fname3 contains:
red=1
green=7
yellow=4
I want to write script that read from these files, sum the numbers for each colour,
and redirect the sums into new file.
New file contains:
red=16
green=19
yellow=8
awk is your friend:
awk 'BEGIN{FS="=";}
{color[$1]+=$2}
END{
for(var in color)
printf "%s=%s\n",var,color[var]
}' fname1 fname2 fname3 >result
should do it.
Demystifying the above:
Anything that is included inside '' is the awk program.
Stuff inside BEGIN will be executed only once, ie in the beginning
FS is an awk built-in variable which stands for field separator.
Setting FS to = means awk will use = to delimit the fields/columns.
By default awk considers each line as a record.
In that case you have two fields denoted by $1 and $2 in each record having = as the delimiter.
{color[$1]+=$2} creates (if it does not already exist) an associative array element with the color name as the key, and += adds the value of field 2 to this array element. Remember, awk array elements are initialized to zero (or the empty string) at the time of creation.
This is repeated for the three files fname1, fname2, fname3 fed into awk
Anything inside END{} will be executed only at last, ie just before exit.
for(var in color) is the style of for loop used to iterate over an associative array.
Here var will take each key in turn, and color[var] is the corresponding value.
printf "%s=%s\n",var,color[var] is self explained.
Note
If all the filenames start with fname you can even put fname* instead of fname1 fname2 fname3
This assumes that there are no blank lines in any file
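Putting it together with the sample files (note that for-in iteration order is unspecified in awk, so the lines of result may come out in any order):

```shell
# Sample inputs from the question
printf '%s\n' 'red=5' 'green=10' 'yellow=2' > fname1
printf '%s\n' 'red=10' 'green=2' 'yellow=2' > fname2
printf '%s\n' 'red=1' 'green=7' 'yellow=4' > fname3

awk 'BEGIN{FS="=";}
{color[$1]+=$2}
END{
for(var in color)
  printf "%s=%s\n",var,color[var]
}' fname1 fname2 fname3 > result

sort result   # sort only to get a deterministic display order
```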
Because your source files are valid shell code, you can just source them (if they are from a trusted source) and accumulate the values using Shell Arithmetic.
#!/bin/bash
sum_red=0
sum_green=0
sum_yellow=0
for file in "$@"; do
. "${file}"
let sum_red+=red
let sum_green+=green
let sum_yellow+=yellow
done
echo "red=$sum_red
green=$sum_green
yellow=$sum_yellow"
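A quick run of the sourcing approach against the sample files (sum_colors.sh is a name of my choosing; note the loop must use "$@" to receive the filename arguments):

```shell
# Sample inputs from the question
printf '%s\n' 'red=5' 'green=10' 'yellow=2' > fname1
printf '%s\n' 'red=10' 'green=2' 'yellow=2' > fname2
printf '%s\n' 'red=1' 'green=7' 'yellow=4' > fname3

cat > sum_colors.sh <<'EOF'
#!/bin/bash
sum_red=0
sum_green=0
sum_yellow=0
for file in "$@"; do
  # Sourcing sets red, green, yellow from the current file
  . "${file}"
  let sum_red+=red
  let sum_green+=green
  let sum_yellow+=yellow
done
echo "red=$sum_red
green=$sum_green
yellow=$sum_yellow"
EOF

bash sum_colors.sh fname1 fname2 fname3 > result2
cat result2
```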