Match a single column entry in one file to a column entry in a second file that consists of a list - shell

I need to match a single column entry in one file to a column entry in a second file that consists of a list (in shell). The awk command I've used only matches to the first word of the list, and doesn't scan through the entire list in the column field.
File 1 looks like this:
chr1:725751 LOC100288069
rs3131980 LOC100288069
rs28830877 LINC01128
rs28873693 LINC01128
rs34221207 ATP4A
File 2 looks like this:
Annotation Total Genes With Ann Your Genes With Ann) Your Genes No Ann) Genome With Ann) Genome No Ann) ln
1 path hsa00190 Oxidative phosphorylation 55 55 1861 75 1139 5.9 9.64 0 0 ATP12A ATP4A ATP5A1 ATP5E ATP5F1 ATP5G1 ATP5G2 ATP5G3 ATP5J ATP5O ATP6V0A1 ATP6V0A4 ATP6V0D2 ATP6V1A ATP6V1C1 ATP6V1C2 ATP6V1D ATP6V1E1 ATP6V1E2 ATP6V1G3 ATP6V1H COX10 COX17 COX4I1 COX4I2 COX5A COX6B1 COX6C COX7A1 COX7A2 COX7A2L COX7C COX8A NDUFA5 NDUFA9 NDUFB3 NDUFB4 NDUFB5 NDUFB6 NDUFS1 NDUFS3 NDUFS4 NDUFS5 NDUFS6 NDUFS8 NDUFV1 NDUFV3 PP PPA2 SDHA SDHD TCIRG1 UQCRC2 UQCRFS1 UQCRH
Expected output:
rs34221207 ATP4A hsa00190
(please excuse the formatting - all the columns are tab-delimited until the column of gene names, $14, called Genome...)
My command is this:
awk 'NR==FNR{a[$14]=$3; next}a[$2]{print $0 "\t" a[$2]}' file2 file1
All help will be much appreciated!

You need to process files in the other order, and loop over your list:
awk 'NR==FNR{a[$2]=$1; next} {for(i=15;i<=NF;++i)if(a[$i]){print a[$i] "\t" $i "\t" $3}}' file1 file2
Explanation:
NR is a global "record number" counter that awk increments for each line read from each file. FNR is a per-file record number that awk resets to 1 on the first line of each file. So the NR==FNR condition is true for lines in the first file and false for lines in subsequent files; it is an awk idiom for picking out just the first file's info. In this case, a[$2]=$1 stores the first field's text keyed by the second field's text. The next tells awk to stop short on the current line and to read and continue processing normally with the next line. A next at the end of the first action clause like this is functionally like an ELSE condition on the remaining code, if awk had such a syntax (which it doesn't): NR==FNR{a[$2]=$1} ELSE {for.... Clearer, and only slightly less time-efficient, would have been to write NR==FNR{a[$2]=$1}NR!=FNR{for.... instead.
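A minimal standalone illustration of the idiom (keys.txt and data.txt are hypothetical file names):
awk 'NR==FNR { seen[$1]++; next } $1 in seen' keys.txt data.txt
This prints just those lines of data.txt whose first field appeared as a first field somewhere in keys.txt.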
Now to the second action clause. No condition preceding it means awk will run it for every line that is not short-circuited by the preceding next, that is, all lines in files other than the first -- file2 only in this case. Your file2 has a list of potential keys starting in field #15 and extending to the last field. The awk built-in variable for the last field number is NF (number of fields). The for loop is pretty self-explanatory then, looping over just those field numbers. For each of those numbers i we want to know whether the text in that field, $i, is a known key from the first file -- whether a[$i] is set, that is, evaluates to a non-empty (non-false) string. If so, then we've got our file1 first field in a[$i], our matching file1 second field in $i, and our file2 field of interest in $3 (the text of the current file2 3rd field). Print them tab-separated. A next after the print would be an efficiency-only measure that stops all processing of the file2 record once a match is found; but if your file2 key list might contain duplicates and you want duplicate output lines when there is a match on such a duplicate, that next must be left out.
Actually, now that I look again, you probably do want to find all matches, not just the first, so the code above omits that second next.
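If you do want at most one match reported per file2 line, the variant discussed above simply restores a next after the print:
awk 'NR==FNR{a[$2]=$1; next} {for(i=15;i<=NF;++i)if(a[$i]){print a[$i] "\t" $i "\t" $3; next}}' file1 file2
With the sample files above, either version prints the expected line: rs34221207 ATP4A hsa00190.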

Is there a way to iterate over values of a column then check if it's present elsewhere?

I have generated 2 .csv files, one containing the original md5sums of some files in a directory and one containing the md5sums calculated at a specific moment.
md5_original.csv
----------
$1 $2 $3
7815696ecbf1c96e6894b779456d330e,,s1.txt
912ec803b2ce49e4a541068d495ab570,,s2.txt
040b7cf4a55014e185813e0644502ea9,,s64.txt
8a0b67188083b924d48ea72cb187b168,,b43.txt
etc.
md5_$current_date.csv
----------
$1 $2 $3
7815696ecbf1c96e6894b779456d330e,,s1.txt
4d4046cae9e9bf9218fa653e51cadb08,,s2.txt
3ff22b3585a0d3759f9195b310635c29,,b43.txt
etc.
* some files could be deleted when calculating current md5sums
I am looking to iterate over the values of column $3 in md5_$current_date.csv and, for each value of that column, check whether it exists in md5_original.csv and, if so, compare the corresponding $1 values.
Output should be:
s2.txt hash changed from 912ec803b2ce49e4a541068d495ab570 to 4d4046cae9e9bf9218fa653e51cadb08.
b43.txt hash changed from 8a0b67188083b924d48ea72cb187b168 to 3ff22b3585a0d3759f9195b310635c29.
I have written the script that builds these two .csv files, but I am struggling with the awk part that has to do what I asked above. I don't know if there is a better way to do this; I am a newbie.
I would use GNU AWK for this task in the following way. Let md5_original.csv content be
7815696ecbf1c96e6894b779456d330e {BLANK_COLUMN} s1.txt
912ec803b2ce49e4a541068d495ab570 {BLANK_COLUMN} s2.txt
040b7cf4a55014e185813e0644502ea9 {BLANK_COLUMN} s64.txt
8a0b67188083b924d48ea72cb187b168 {BLANK_COLUMN} b43.txt
and md5_current.csv content be
7815696ecbf1c96e6894b779456d330e {BLANK_COLUMN} s1.txt
4d4046cae9e9bf9218fa653e51cadb08 {BLANK_COLUMN} s2.txt
3ff22b3585a0d3759f9195b310635c29 {BLANK_COLUMN} b43.txt
then
awk 'FNR==NR{arr[$3]=$1;next}($3 in arr)&&($1 != arr[$3]){print $3 " hash changed from " arr[$3] " to " $1}' md5_original.csv md5_current.csv
output
s2.txt hash changed from 912ec803b2ce49e4a541068d495ab570 to 4d4046cae9e9bf9218fa653e51cadb08
b43.txt hash changed from 8a0b67188083b924d48ea72cb187b168 to 3ff22b3585a0d3759f9195b310635c29
Explanation: FNR is the row number within the current file; NR is the row number globally; they are equal only while processing the 1st file. When processing the 1st file I create the array arr whose keys are filenames and whose values are the corresponding hash values; next causes GNU AWK to go to the next line, i.e. no other action is undertaken, so the rest applies only to all files but the first. ($3 in arr) asks: is the current $3 one of the keys of arr? If it holds true and the hash differs ($1 != arr[$3]), I print the concatenation of the current $3 (the filename), the string " hash changed from ", the value stored under key $3 in arr (the old hash), the string " to ", and $1 (the current hash). If a given filename is not present in arr, no action is undertaken.
Edit: added an exclusion for hashes which have not changed, as suggested in a comment.
(tested in gawk 4.2.1)
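The sample contents above use a {BLANK_COLUMN} placeholder and awk's default whitespace splitting; for the comma-delimited .csv files actually shown in the question, the same logic should only need a comma field separator (a sketch, untested against the real files):
awk -F',' 'FNR==NR{arr[$3]=$1;next}($3 in arr)&&($1 != arr[$3]){print $3 " hash changed from " arr[$3] " to " $1 "."}' md5_original.csv "md5_$current_date.csv"
With an empty second column, $1 is still the hash and $3 the filename; the trailing period matches the output format requested in the question.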

Add column from one file to another based on multiple matches while retaining unmatched

So I am really new to this kind of stuff (seriously, sorry in advance) but I figured I would post this question since it is taking me some time to solve it and I'm sure it's a lot more difficult than I am imagining.
I have the file small.csv:
id,name,x,y,id2
1,john,2,6,13
2,bob,3,4,15
3,jane,5,6,17
4,cindy,1,4,18
and another file big.csv:
id3,id4,name,x,y
100,{},john,2,6
101,{},bob,3,4
102,{},jane,5,6
103,{},cindy,1,4
104,{},alice,7,8
105,{},jane,0,3
106,{},cindy,1,7
The problem: I am attempting to put id2 of small.csv into the id4 column of big.csv, but only where the name AND x AND y match. I have tried various awk and join commands in Git Bash but am coming up short. Again, I am sorry for the newbie perspective on all of this, but any help would be awesome. Thank you in advance.
EDIT: Sorry, this is what the final desired output should look like:
id3,id4,name,x,y
100,{13},john,2,6
101,{15},bob,3,4
102,{17},jane,5,6
103,{18},cindy,1,4
104,{},alice,7,8
105,{},jane,0,3
106,{},cindy,1,7
And one of the latest trials I did was the following:
$ join -j 1 -o 1.5,2.1,2.2,2.3,2.4,2.5 <(sort -k2 small.csv) <(sort -k2 big.csv)
But I received this error:
join: /dev/fd/63: No such file or directory
Probably not trivial to solve with join but fairly easy with awk:
awk -F, -v OFS=, '                # set input and output field separators to comma
    # create lookup table from lines of small.csv
    NR==FNR {
        # ignore header
        # map columns 2/3/4 to column 5
        if (NR>1) lut[$2,$3,$4] = $5
        next
    }
    # process lines of big.csv
    # if lookup table has mapping for columns 3/4/5, update column 2
    v = lut[$3,$4,$5] {
        $2 = "{" v "}"
    }
    # print (possibly-modified) lines of big.csv
    1
' small.csv big.csv >bignew.csv
Code assumes small.csv contains only one line for each distinct column 2/3/4.
NR==FNR { ...; next } is a way to process contents of the first file argument. (FNR is less than NR when processing lines from second and subsequent file arguments. next skips execution of the remaining awk commands.)
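A side note on the lut[$2,$3,$4] subscript: awk joins the comma-separated expressions with SUBSEP into a single string key, so the store and the lookup must list the fields in the same order. A tiny standalone illustration (the keys here are made up):
awk 'BEGIN {
  k["a","b"] = 1                        # stored under "a" SUBSEP "b"
  if (("a","b") in k) print "a,b found"
  if (("b","a") in k) print "b,a found" # order matters: this never prints
}'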

extract each line followed by a line with a different value in column two

Given the following file structure,
9.975 1.49000000 0.295 0 0.4880 0.4929 0.5113 0.5245 2.016726 1.0472 -30.7449 1
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
9.975 1.50000000 0.295 0 0.5145 0.4984 0.4873 0.5019 2.002143 1.0854 -30.3044 2
is there a way to extract each line in which the value in column two is not equal to the value in column two in the following line?
I.e. from these three lines I would like to extract the second one, since 1.49 is not equal to 1.50.
Maybe with sed or awk?
This is how I do this in MATLAB:
myline = 1;
mynewline = 1;
while myline < length(myfile)
    if myfile(myline,2) ~= myfile(myline+1,2)
        mynewfile(mynewline,:) = myfile(myline,:);
        mynewline = mynewline+1;
        myline = myline+1;
    else
        myline = myline+1;
    end
end
However, my files are so large now that I would prefer to carry out this extraction in terminal before transferring them to my laptop.
Awk should do.
<data awk 'NR > 1 && $2 != prev { print line } { line = $0; prev = $2 }'
A brief intro to awk: an awk program consists of a set of condition {code} blocks. It operates line by line. When no condition is given, the block is executed for every line. A BEGIN condition is executed before the first line. Each line is split into fields, which are accessible as $1, $2, and so on; the full line is in $0.
Here I compare the second field to the previous value; if it does not match, I print the whole previous line. In all cases I store the current line in line and the second field in prev. (The NR > 1 guard stops the very first line from printing the empty, not-yet-set line.)
And if you really want it right, be careful with float comparisons: treat two values as equal when something like abs($2 - prev) < eps holds (there is no built-in abs in awk, so you need to define it yourself, and eps is some small enough number). Note that awk does compare two fields numerically when both look like numbers, so equal values formatted differently (1.5 vs 1.50) already compare equal; the eps test only matters when values can differ by rounding noise.
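A sketch of that tolerant variant, with abs defined inline and an assumed eps of 1e-9:
<data awk -v eps=1e-9 '
  function abs(x) { return x < 0 ? -x : x }
  NR > 1 && abs($2 - prev) >= eps { print line }
  { line = $0; prev = $2 }'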
This might work for you (GNU sed):
sed -r 'N;/^((\S+)\s+){2}.*\n\S+\s+\2/!P;D' file
Read two lines at a time. Pattern match on the first two columns and only print the first line when the second column does not match.
Try the following command:
awk '$2 != field && field { print line } { field = $2; line = $0 }' infile
It saves the previous line and second field, comparing them against the current line's values on the next iteration. The && field check is there to avoid a blank line at the beginning of the file, where $2 != field would otherwise match because the variable is still empty.
It yields:
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1

count the max number of _ and add additional semi-colon if some are missing

I have several files with fields like below
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme_Fort_Email_am;04/02/2015;Deme_Fort_Postal
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme_faible_Email_am;18/02/2015;deme_Faible_Email_Relance_am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi_Fort_Email_am;23/02/2015;trav_Fort_Email_am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav_Faible_Email_pm;18/02/2015;trav_Faible_Email_Relance_pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav_Fort_Email_am;12/02/2015;Trav_Fort_Postal
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya_Faible_Email_am;29/01/2015;voya_Faible_Email_Relance_am
The aim is to have this:
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
The idea is to count the maximum number of underscores after the 7th field over all lines, change every underscore to a semicolon, and then pad each line with extra semicolons up to that maximum count.
I thought about using awk for this, but with the command line below I only convert the underscores from the 7th field onward; it does not add the extra semicolons:
awk 'BEGIN{FS=OFS=";"} {for (i=7;i<=NF;i++) gsub(/_/,";", $i) } 1' file
Thanks.
Awk way
awk -F';' -vOFS=';' '{y=0;for(i=8;i<=NF;i++)y+=gsub(/_/,";",$i)
x=x<y?y:x;NF=NF+(x-y)}NR!=FNR' file{,}
Output
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
Explanation
awk -F';' -vOFS=';'
This sets the Field Separator and Output Field separator to ;.
y=0;
Initialise y to 0 on each line.
for(i=8;i<=NF;i++)y+=gsub(/_/,";",$i)
For each field from field 8 up to the number of fields on the line (NF), substitute every _ with a ; and increment y by the number of substitutions gsub performed.
x=x<y?y:x
Check whether x is less than y; if it is, set x to y, else leave it unchanged. x thus tracks the maximum substitution count seen so far.
NF=NF+(x-y)
Set the number of fields to the current number of fields plus the difference between x and y; the new fields are empty, so this appends the missing semicolons.
NR!=FNR
This means: if the total record number (NR) is not equal to the file's record number (FNR), then print. Effectively it prints anything that isn't the first file read, i.e. only on the second pass, once x holds the global maximum.
file{,}
Expands to file file so the file is read twice.
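You can see the expansion directly in the shell:
echo file{,}
file file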
Resources
https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

How to use unix grep and the output together?

I am new to unix commands. I have a file named server.txt, which has 100 fields; the first line of the file is a header.
I want to look at fields 99 and 100 only.
Field 99 is just some numbers; field 100 is a string.
The delimiter between fields is a space.
My goal is to skip the first 1000 lines of records, extract every token in the string (field 100) using grep and a regex, and then output field 99 together with each token extracted from the string.
----server.txt--
... ... ,field99,field100
... ... 5,"hi are"
... ... 3,"how is"
-----output.txt
header1,header2
5,hi
5,are
3,how
3,is
So I have some ideas, but I don't know how to combine all the pieces.
Here is some of my thinking:
sed 1000d server.txt cut -f99,100 -d' ' >output.txt
grep | /[A-Za-z]+/|
Sounds more like a job for awk.
awk -F, 'NR <= 1000 { next; }
  { gsub(/^"|"$/, "", $100); n = split($100, a, / /);
    for (v=1; v<=n; ++v) print $99, a[v]; }' server.txt >output.txt
The general form of an awk program is a sequence of condition { action } expressions. The first line has the condition NR <= 1000, where NR is the current line number. If the condition is true, the next action skips to the next input line. Otherwise, we fall through to the next expression, which has no condition; so it's unconditional, for all input lines which reach it. It first cleans out the double quotes around the 100th field value, and then splits it on spaces into the array a (split returns the element count, saved in n; this is more portable than gawk's length(a)). The for loop then loops over this array, printing the 99th field value and the vth element of the array, starting with v=1 and continuing up through the end of the array.
The input file format is sort of cumbersome. The gsub and split stuff could be avoided with a slightly more sane input format. If you are new to awk, you should probably go look for a tutorial.
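For illustration, a sketch that sidesteps the hard-coded field numbers by assuming the two columns of interest are the last two fields, comma-separated as in the sample (that assumption may not hold for the real 100-field file):
awk -F, 'NR <= 1000 { next }
  { gsub(/^"|"$/, "", $NF); n = split($NF, a, / /)
    for (v = 1; v <= n; ++v) print $(NF-1) "," a[v] }' server.txt >output.txt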
If you only want to learn one scripting language, I would suggest Perl or Python over awk, but it depends on your plans and orientation.
