joining 2 files and taking the first file as priority - shell

I'm looking for help on joining (at the UNIX level) two files (file1 and file2), picking values from file1 as a priority over the values in file2.
If a srcValue exists in file1, that should be taken instead of file2's tmpValue. If there is no srcValue in file1, then pick up this value from file2's tmpValue.
Sample data:
file1:
id name srcValue
1 a s123
2 b s456
3 c
file2:
id tmpValue
1 Tva
3 TVb
4 Tvm
Desired output:
ID Name FinalValue
1 a s123
2 b s456
3 c TVb

I would approach this problem with an awk script; it is fairly powerful and flexible. The general approach here is to load the values from file2 first, then loop through file1 and substitute them as needed.
awk 'BEGIN { print "ID Name FinalValue" }
     FNR == NR && FNR > 1 { tmpValue[$1] = $2 }        # first file (file2): remember tmpValue by id
     FNR != NR && FNR > 1 {                            # second file (file1), skipping its header
         if (NF == 2) { print $1, $2, tmpValue[$1] }   # no srcValue: fall back to file2
         else         { print $1, $2, $3 }             # srcValue present: take it
     }' file2 file1
The BEGIN block is executed before any files are read; its only job is to output the new header.
The FNR == NR && FNR > 1 condition is true for the first filename ("file2" here) and also skips the first line of that file (FNR > 1), since it's a header line. The "action" block for that condition simply fills an associative array with the id and tmpValue from file2.
The FNR != NR && FNR > 1 corresponds to the second filename ("file1" here) and also skips the first (header) line. In this block of code, we check to see if there's a srcValue; if so, print those three values back out; if not, substitute in the saved value (assuming there is one; otherwise, it'll be blank).
I assume that the <br> bits in the question are attempts at formatting, and that column 3 in file1 would actually be empty if there was no value there.
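As an aside, the FNR/NR trick is easier to see in isolation. Here is a minimal sketch (assuming two hypothetical two-line files named a and b): FNR resets at the start of each input file, while NR keeps counting across all of them, so FNR == NR holds only while the first file is being read.
awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' a b
# a NR=1 FNR=1
# a NR=2 FNR=2
# b NR=3 FNR=1
# b NR=4 FNR=2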

Related

Add location to duplicate names in a CSV file using Bash

Using Bash, create user logins from a CSV file, adding the location when the name is duplicated. The location should be added to the original occurrence of the name, as well as to the duplicates.
id,location,name,login
1,KP,Lacie,
2,US,Pamella,
3,CY,Korrie,
4,NI,Korrie,
5,BT,Queenie,
6,AW,Donnie,
7,GP,Pamella,
8,KP,Pamella,
9,LC,Pamella,
10,GM,Ericka,
The result should look like this:
id,location,name,login
1,KP,Lacie,lacie#mail.com
2,US,Pamella,uspamella#mail.com
3,CY,Korrie,cykorrie#mail.com
4,NI,Korrie,nikorrie#mail.com
5,BT,Queenie,queenie#mail.com
6,AW,Donnie,donnie#mail.com
7,GP,Pamella,gppamella#mail.com
8,KP,Pamella,kppamella#mail.com
9,LC,Pamella,lcpamella#mail.com
10,GM,Ericka,ericka#mail.com
I used AWK to process the csv file.
cat data.csv | awk 'BEGIN { FS = OFS = "," }
NR > 1 {
    split($3, name)
    $4 = tolower($3)
    split($4, login)
    for (k in login) {
        !a[login[k]]++ ? sub(login[k], login[k]"#mail.com", $4) : sub(login[k], tolower($2)login[k]"#mail.com", $4)
    }
}
1' > data_new.csv
The script adds location values only to further duplicates.
id,location,name,login
1,KP,Lacie,lacie#mail.com
2,US,Pamella,pamella#mail.com
3,CY,Korrie,korrie#mail.com
4,NI,Korrie,nikorrie#mail.com
5,BT,Queenie,queenie#mail.com
6,AW,Donnie,donnie#mail.com
7,GP,Pamella,gppamella#mail.com
8,KP,Pamella,kppamella#mail.com
9,LC,Pamella,lcpamella#mail.com
10,GM,Ericka,ericka#mail.com
How do I add location to the initial one?
A common solution is to have Awk process the same file twice if you need to know whether there are duplicates down the line.
Notice also that this requires Awk to open the file by name (so it can be read twice), which means you get to avoid the useless use of cat.
awk 'BEGIN {FS=OFS=","};
NR == FNR { ++seen[$3]; next }
FNR > 1 { $4 = (seen[$3] > 1 ? tolower($2) : "") tolower($3) "#mail.com" }
1' data.csv data.csv >data_new.csv
NR==FNR is true when you read the file the first time. We simply count the number of occurrences of $3 in seen for the second pass.
Then in the second pass, we can just look at the current entry in seen to figure out whether or not we need to add the prefix.
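The counting half of the idiom can be demonstrated on its own. A minimal sketch, again reading data.csv twice, that merely lists the names occurring more than once:
awk -F, 'NR == FNR { ++seen[$3]; next }   # first pass: count names
         FNR > 1 && seen[$3] > 1 { print $3 }' data.csv data.csv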

Awk if else with conditions

I am trying to make a script (with a loop) that extracts matching lines and prints them into a new file. There are two conditions. First, if the 2nd column of the map file matches the 4th column of the test file, I need to print the ID from the test file (its 2nd column) together with the 4th column of the map file. Second, when there is no match, I want to print the 2nd column of the test file and a zero in the second column.
My test file is made this way:
8 8:190568 0 190568
8 8:194947 0 194947
8 8:197042 0 197042
8 8:212894 0 212894
My map file is made this way:
8 190568 0.431475 0.009489
8 194947 0.434984 0.009707
8 19056880 0.395066 112.871160
8 101908687 0.643861 112.872348
1st attempt:
for chr in {21..22};
do
awk 'NR==FNR{a[$2]; next} {if ($4 in a) print $2, $4 in a; else print $2, $4 == "0"}' map_chr$chr.txt test_chr$chr.bim > position.$chr;
done
Result:
8:190568 1
8:194947 1
8:197042 0
8:212894 0
My second script is:
for chr in {21..22}; do
awk 'NR == FNR { ++a[$4]; next }
$4 in a { print a[$2], $4; ++found[$2] }
END { for(k in a) if (!found[k]) print a[k], 0 }' \
"test_chr$chr.bim" "map_chr$chr.txt" >> "position.$chr"
done
And the result is:
1 0
1 0
1 0
1 0
The result I need is:
8:190568 0.009489
8:194947 0.009707
8:197042 0
8:212894 0
This awk should work for you:
awk 'FNR==NR {map[$2]=$4; next} {print $2, map[$4]+0}' mapfile testfile
8:190568 0.009489
8:194947 0.009707
8:197042 0
8:212894 0
This awk command processes mapfile first and stores $2 as the key with $4 as the value in an associative array named map.
Later, when it processes testfile in the 2nd block, we print $2 from the 2nd file together with the value stored in map, looked up with $4 as the key. We add 0 to the stored value to make sure that we get 0 when $4 is not present in map.
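The +0 coercion is worth seeing on its own: looking up a missing key yields the empty string, and adding 0 turns that into a numeric 0. A minimal sketch:
awk 'BEGIN {
    map["a"] = 0.5
    print map["a"] + 0    # prints 0.5
    print map["zz"] + 0   # missing key: "" + 0 prints 0
}'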

Awk - Count Each Unique Value and Match Values Between 2 Files

I have two files. I am trying to get the count of each unique value in column 8 of file 1, and then match that value against the 6th column of the 2nd file.
So essentially, I am trying to take each unique value and its count from column 8 of File1, if there is a match in column 6 of File2.
File1:
2020-12-23 23:59:12,235911688,\N,34,20201223233739,797495497,404,819,\N,
2020-12-23 23:59:12,235911419,\N,34,265105814,718185263,200,819,\N,
2020-12-23 23:59:12,235912029,\N,34,20201223233739,748362773,404,819,\N,
2020-12-23 23:59:12,235911839,\N,34,20201223233738,745662697,404,400,\N,
2020-12-23 23:59:12,235911839,\N,34,20201223233738,745662697,404,400,\N,
2020-12-24 23:59:12,235911839,\N,34,20201223233738,745662697,404,400,\N,
File2:
public static String status_code = "819";
public static String DeActivate = "400";
Expected output:
total count of status_code,819 : 3
total count of DeActivate,400 : 3
My code:
awk 'NR==FNR{a[$8]++}NR!=FNR{gsub(/"/,"",$6);b[$6]=$0}END{for( i in b){printf "Total count of %s,%d : %d\n",gensub(/^([^ ]+).*/,"\\1","1",b[i]),i,a[i]}}' File1 File2
Algorithm:
1. Take the 8th field from the 1st file (e.g. 819).
2. Count how many times that unique field value (819) occurs in the file (based on date).
3. Take the corresponding value for 819 from the 4th field of File2.
4. Print the output together.
I believe I should be able to do this with awk, but for some reason I am really struggling with this.
(It is something like SQL JOINing two relational database tables on File1's $8 being equal to File2's $6.)
awk '
NR==FNR { # For the first file
a[$8]++; # count each $8
}
NF&&NR!=FNR { # For non empty lines of file 2
gsub(/[^0-9]/,"",$6); # remove non-digits from $6
b[$6]=$4 # save name of constant to b
}
END{
for(i in b){ # for constants occurring in File2
if(a[i]) { # if File1 had non zero count
printf( "Total count of %s,%d : %d\n",b[i],i,a[i]);
#print data
}
}
}' "FS=," File1 FS=" " File2
The above code works with your sample input. It produces the following output:
Total count of DeActivate,400 : 3
Total count of status_code,819 : 3
I think the main problem is that you do not specify comma as the field separator for File1. See Processing two files with different field separators in awk.
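The mechanism behind that command line is that var=value arguments are evaluated in order, so each FS= assignment takes effect for the files listed after it. A minimal sketch, with csvfile and spacefile as hypothetical file names:
awk '{ print FILENAME, NF }' FS="," csvfile FS=" " spacefile
# NF is computed with FS="," for csvfile and with FS=" " for spacefile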
A shorter, more efficient, way without the second array and for loop:
$ cat demo.awk
NR == FNR {
a[$8]++
next
}
{
gsub(/[^0-9]/,"",$6)
printf "Total count of %s,%d : %d\n", $4, $6, a[$6]
}
$ awk -f demo.awk FS="," file1 FS=" " file2
Total count of status_code,819 : 3
Total count of DeActivate,400 : 3
$

losing data when comparing a column with awk

I have a text file and all I want to do is compare the third column and see if it's equal to 1 or 0, so I just simply used
awk '$3 == 1 { print $0 }' input > output1
awk '$3 == 0 { print $0 }' input > output2
This is part of a bash script and I'm certain there is a more elegant approach to this, but the code above should get the job done, only it does not. input has 425 rows of text, the third column in input is always a 1 or 0, therefore the total number of rows in output1 + output2 should be 425. But I get 417 rows.
Here is a sample of input (all of it is just one row, and there are 425 such rows):
out_first.dat 1 1 0.000000 265075.000000 6.000000e-01 1.005205e-03 9.000000e-01 9.000000e-01 2.889631e+00 -2.423452e+00 3.730018e+00 -1.532915e+00
If $3 is 1 or 0, it is equal to its own square, so the line is printed to output1 or output2 ("output"(2-$3) gives output1 when $3 is 1 and output2 when $3 is 0). If not, the line is printed to other for inspection.
awk '$3*$3==$3{print > "output"(2-$3); next} {print > "other"}' file
If $3*$3==$3 is confusing, change it to $3==0 || $3==1.
For the curious: $3==0 || $3==1 can be written as $3*($3-1)==0, from which the above follows.
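As a quick standalone sanity check of that identity (not part of the original answer):
awk 'BEGIN { for (v = -1; v <= 2; v++) print v, (v*v == v ? "kept" : "other") }'
# -1 other
# 0 kept
# 1 kept
# 2 other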

Display only lines in which 1 column is equal to another, and a second column is in a range in AWK and Bash

I have two files. The first file looks like this:
1 174392
1 230402
2 4933400
3 39322
4 42390021
5 80022392
6 3818110
and so on
the second file looks like this:
chr1 23987 137011
chr1 220320 439292
chr2 220320 439292
chr2 2389328 3293292
chr3 392329 398191
chr4 421212 3292393
and so on.
I want to return the whole line, provided that the first column in FILE1 = the first column in FILE2, as a string match, AND the 2nd column in FILE1 is greater than column 2 in FILE2 but less than column 3 in FILE2.
So in the above example, the line
1 230402
in FILE1 and
chr1 220320 439292
in FILE2 would satisfy the conditions because 230402 is between 220320 and 439292 and 1 would be equal to chr1 after I make the strings match, therefore that line in FILE2 would be printed.
The code I wrote was this:
#!/bin/bash
$F1="FILE1.txt"
read COL1 COL2
do
grep -w "chr$COL1" FILE2.tsv \
| awk -v C2=$COL2 '{if (C2>$1 && C2<$2); print $0}'
done < "$F1"
I have tried many variations of this. I do not care if the code is entirely in awk, entirely in bash, or a mixture.
Can anyone help?
Thank you!
Here is one way using awk:
awk '
NR==FNR {
$1 = "chr" $1
seq[$1,$2]++;
next
}
{
for(key in seq) {
split(key, tmp, SUBSEP);
if(tmp[1] == $1 && $2 <= tmp[2] && tmp[2] <= $3 ) {
print $0
}
}
}' file1 file2
chr1 220320 439292
We read the first file into an array, using columns 1 and 2 as the key. We prepend the string "chr" to column 1 while building the key, for easy comparison later on.
When we process file 2, we iterate over our array and split each key.
We compare the first piece of the key to column 1 and check whether the second piece of the key is in the range given by the second and third columns.
If it satisfies our condition, we print the line.
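SUBSEP is what makes this work: a[x,y] is actually stored as a[x SUBSEP y], so split(key, tmp, SUBSEP) recovers the two parts. A minimal sketch:
awk 'BEGIN {
    seq["chr1", 230402]          # same as seq["chr1" SUBSEP 230402]
    for (k in seq) {
        split(k, tmp, SUBSEP)
        print tmp[1], tmp[2]     # prints: chr1 230402
    }
}'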
awk 'BEGIN {i = 0}
FNR == NR { chr[i] = "chr" $1; test[i++] = $2 }
FNR < NR { for (c in chr) {
if ($1 == chr[c] && test[c] > $2 && test[c] < $3) { print }
}
}' FILE1.txt FILE2.tsv
FNR is the line number within the current file, NR is the line number within all the input. So the first block processes the first file, collecting all the lines into arrays. The second block processes any remaining files, searching through the array of chrN values looking for a match, and comparing the other two numbers to the number from the first file.
Thanks very much!
These answers work and are very helpful.
Also at long last I realized I should have had:
awk -v C2=$COL2 '{if (C2>$1 && C2<$2) print $0}'
without the stray semicolon after the if condition, and I would have been fine.
At any rate, thank you very much!
