Join two files based on a column when there is no one-to-one correspondence in a bash script (awk, grep, sed) - bash

file1.txt
112|9305|/inst.exe
112|9305|/lkj.exe
112|9305|/dje.jar
112|9305|/ind.pdf
112|9306|/ma.exe
112|9306|/ngg.pdf
112|9307|/jhhh.dat
112|9312|/ee.dat
112|9312|/qwq.dll
file2.txt
117|9305|www.gahan.com
117|9306|www.google.com
117|9312|www.mihan.com
117|9307|translate.com
expected output
112|9305|www.gahan.com/inst.exe
112|9305|www.gahan.com/lkj.exe
112|9305|www.gahan.com/dje.jar
112|9305|www.gahan.com/ind.pdf
112|9306|www.google.com/ma.exe
112|9306|www.google.com/ngg.pdf
112|9307|translate.com/jhhh.dat
112|9312|www.mihan.com/ee.dat
112|9312|www.mihan.com/qwq.dll
I want to prepend the third column of file2.txt to the third column of file1.txt, matching on the value of the second column. In other words, I want to join the files on the second column, but there is no one-to-one correspondence between their lines. How can I do this with awk, grep, or sed in a shell script?

You can use awk like this:
awk 'BEGIN{FS=OFS="|"} FNR==NR{a[$2]=$3; next} $2 in a{$3=a[$2] $3} 1' file2.txt file1.txt
112|9305|www.gahan.com/inst.exe
112|9305|www.gahan.com/lkj.exe
112|9305|www.gahan.com/dje.jar
112|9305|www.gahan.com/ind.pdf
112|9306|www.google.com/ma.exe
112|9306|www.google.com/ngg.pdf
112|9307|translate.com/jhhh.dat
112|9312|www.mihan.com/ee.dat
112|9312|www.mihan.com/qwq.dll
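For readers new to the two-file awk idiom, the one-liner above can be unpacked into a commented script. This is only a sketch on a shortened copy of the sample data; the temporary directory, the array name host, and the extra 9999 row (which has no match in file2.txt) are assumptions added for illustration.

```shell
#!/usr/bin/env bash
# Shortened sample data; the 9999 row is hypothetical and has no match.
tmp=$(mktemp -d)
printf '%s\n' '112|9305|/inst.exe' '112|9306|/ma.exe' '112|9999|/orphan.bin' > "$tmp/file1.txt"
printf '%s\n' '117|9305|www.gahan.com' '117|9306|www.google.com' > "$tmp/file2.txt"

joined=$(awk '
  BEGIN { FS = OFS = "|" }
  # While reading the first file (file2.txt), FNR equals NR:
  # remember the host stored in column 3 under the key in column 2.
  FNR == NR { host[$2] = $3; next }
  # In file1.txt, prepend the host when the key is known.
  $2 in host { $3 = host[$2] $3 }
  # Print every line, modified or not.
  { print }
' "$tmp/file2.txt" "$tmp/file1.txt")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Because many file1.txt lines can share one key, the lookup table is built from file2.txt, where each key appears once. Lines whose key is unknown pass through unchanged.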

Related

bash: using 2 variables from same file and sed

I have a 2 files:
file1.txt
rs142159069:45000079:TACTTCTTGGACATTTCC:T 45000079
rs111285978:45000103:A:AT 45000103
rs190363568:45000168:C:T 45000168
file2.txt
rs142159069:45000079:TACTTCTTGGACATTTCC:T rs142159069
rs111285978:45000103:A:AT rs111285978
rs190363568:45000168:C:T rs190363568
Using file2.txt, I want to replace the names in column 1 of file1.txt (which match column 1 of file2.txt) with the entry in column 2 of file2.txt. The output file would then be:
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168
I have tried reading in the columns of file2.txt, but without success:
while read -r a b
do
cat file1.txt | sed s'/$a/$b/'
done < file2.txt
I am quite new to bash. Also, not sure how to write an output file with my command. Any help would be deeply appreciated.
In your case, using awk or perl would be easier, if you are willing to accept an answer without sed:
awk '(NR==FNR){out[$1]=$2;next}{out[$1]=out[$1]" "$2}END{for (i in out){print out[i]} }' file2.txt file1.txt > output.txt
output.txt :
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168
Note: this assumes that all symbols in column 1 are unique and that they are all present in both files.
explanation:
(NR==FNR){out[$1]=$2;next} : while you are parsing the first file, create a map with the name from the first column as key
{out[$1]=out[$1]" "$2} : append the value from the second column
END{for (i in out){print out[i]} } : print all the values in the map
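awk does not define the order in which for (i in out) visits keys, so the END loop above may print lines in a different order than the input. A possible variant, sketched here on the sample data, prints each line as file1.txt is read and so keeps its order; the array name name and the fallback for unmatched IDs are assumptions.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
printf '%s\n' \
  'rs142159069:45000079:TACTTCTTGGACATTTCC:T 45000079' \
  'rs111285978:45000103:A:AT 45000103' \
  'rs190363568:45000168:C:T 45000168' > "$tmp/file1.txt"
printf '%s\n' \
  'rs142159069:45000079:TACTTCTTGGACATTTCC:T rs142159069' \
  'rs111285978:45000103:A:AT rs111285978' \
  'rs190363568:45000168:C:T rs190363568' > "$tmp/file2.txt"

renamed=$(awk '
  # First file (file2.txt): map the long ID to the short name.
  NR == FNR { name[$1] = $2; next }
  # Second file (file1.txt): print the short name if known, else keep the ID.
  { id = ($1 in name) ? name[$1] : $1; print id, $2 }
' "$tmp/file2.txt" "$tmp/file1.txt")

printf '%s\n' "$renamed"
rm -rf "$tmp"
```

To write the result to a file instead, redirect the awk command itself, for example with > output.txt.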
Apparently $2 of file2 is part of $1 of file1, so you could use awk and redefine FS:
$ awk -F"[: ]" '{print $1,$NF}' file1
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168

shell - compare files and update matching string awk/sed/diff/grep/csv

I need to compare 2 csv files and make modifications to the second column. I wrote out the logic of how I would want to achieve this; however, it seemed to confuse the thread more than I wanted it to, so I'll just write out the example.
Any help would be appreciated. Thanks in advance.
file1
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName3
user4,distinguishedName4
user5,distinguishedName5
file2
user1,distinguishedName1
user3,distinguishedName13
user5,distinguishedName12
user6,distinguishedName4
desired outcome:
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName13
user4,distinguishedName4
user5,distinguishedName12
user6,distinguishedName4
The solution using join command combined with awk command:
join -t',' -j1 -a1 -a2 file1 file2 | awk -F',' '{if(NF==3) $0=$1FS$3}1'
The output:
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName13
user4,distinguishedName4
user5,distinguishedName12
user6,distinguishedName4
Explanation:
-- for join command:
-t',' - defines field separator
-j1 - join on field 1 of both files
-a FILENUM - print unpairable lines coming from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2
-- for awk command:
NF - contains a total number of fields
FS - field separator(i.e. ,)
if(NF==3) $0=$1FS$3 - if the joined line has a third field (the key was present in both files), rebuild the line from the key and the value that came from file2
https://linux.die.net/man/1/join
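join expects both inputs to be sorted on the join field, which the sample files happen to be. Below is a hedged sketch for unsorted input, with the sort step made explicit; the temporary file names and the use of LC_ALL=C (so that sort and join agree on ordering) are assumptions, not part of the original answer.

```shell
#!/usr/bin/env bash
export LC_ALL=C   # make sort and join use the same collation
tmp=$(mktemp -d)
# Sample data from the question, deliberately shuffled.
printf '%s\n' 'user3,distinguishedName3' 'user1,distinguishedName1' \
  'user5,distinguishedName5' 'user2,distinguishedName2' \
  'user4,distinguishedName4' > "$tmp/file1"
printf '%s\n' 'user6,distinguishedName4' 'user3,distinguishedName13' \
  'user1,distinguishedName1' 'user5,distinguishedName12' > "$tmp/file2"

# Sort both files on the first comma-separated field.
sort -t, -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
sort -t, -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# Full outer join; when both files have the key, keep the value from file2.
merged=$(join -t, -a1 -a2 "$tmp/file1.sorted" "$tmp/file2.sorted" |
  awk -F, 'NF == 3 { $0 = $1 FS $3 } 1')

printf '%s\n' "$merged"
rm -rf "$tmp"
```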
awk to the rescue!
awk -F, '!a[$1]++' file2 file1
user1,distinguishedName1
user3,distinguishedName13
user5,distinguishedName12
user6,distinguishedName4
user2,distinguishedName2
user4,distinguishedName4
This order follows the record order of file2 and then file1; if you want sorted output, just pipe to sort:
awk ... | sort

Joining two text files based on a common field (ip address)

File1
abcd-efg|random1||abcd|10.10.1.1||
bcde-ab|random2||bc|10.1.2.2||
efgh-bd|ramdom3||fgh|10.2.1.1||
ijkl|random4||mno|10.3.2.3||
File2
10.10.1.1| yes
10.1.2.2| no
10.2.1.1| yes
10.3.2.3| no
Output should be
abcd-efg|random1||abcd|10.10.1.1||yes
bcde-ab|random2||bc|10.1.2.2||no
efgh-bd|ramdom3||fgh|10.2.1.1||yes
ijkl|random4||mno|10.3.2.3||no
I was trying to join both text files on the IP address using awk and join, but somehow I am not able to get the right output.
Could you help me get the right output? Thanks in advance.
$ awk -F'|' 'FNR==NR{a[$1]=$2; next} {print $0 a[$5]}' file2 file1
abcd-efg|random1||abcd|10.10.1.1|| yes
bcde-ab|random2||bc|10.1.2.2|| no
efgh-bd|ramdom3||fgh|10.2.1.1|| yes
ijkl|random4||mno|10.3.2.3|| no
This approach will work even if the IPs are in the files in different orders.
How it works
-F'|'
Set the field separator on input to |.
FNR==NR{a[$1]=$2; next}
When reading the first file, file2, save the second field as a value in associative array a under the key of the first field. Skip remaining commands and jump to the next line.
print $0 a[$5]
If we get here, we are working on the second file, file1. Print the line followed by the value of a for this IP.
BSD/OSX
On BSD (OSX) awk, try:
awk -F'|' 'FNR==NR{a[$1]=$2; next;} {print $0 a[$5];}' file2 file1
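The output above keeps the space that follows the | in file2, whereas the requested output has none. One way to drop it, sketched on the sample data, is to trim leading blanks from the value before storing it; the variable name v and the temporary directory are assumptions.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
printf '%s\n' 'abcd-efg|random1||abcd|10.10.1.1||' 'bcde-ab|random2||bc|10.1.2.2||' \
  'efgh-bd|ramdom3||fgh|10.2.1.1||' 'ijkl|random4||mno|10.3.2.3||' > "$tmp/file1"
printf '%s\n' '10.10.1.1| yes' '10.1.2.2| no' '10.2.1.1| yes' '10.3.2.3| no' > "$tmp/file2"

result=$(awk -F'|' '
  # First file (file2): strip leading blanks from the flag, then store it by IP.
  FNR == NR { v = $2; sub(/^[ \t]+/, "", v); a[$1] = v; next }
  # Second file (file1): append the flag for the IP in field 5.
  { print $0 a[$5] }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```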
The Unix join command can also be used for this:
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 file1 file2
Explanation of options:
-t\| : Field separator is '|' (escaped)
-j1 5 -j2 1 : Join based on 5th field of file1 and 1st field of file2
-o1.1,1.2,1.3,1.4,1.5,1.6,2.2 : Output the 6 fields from file1 and 2nd field from file2
If the input files are not sorted, they need to be sorted first, like below
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 <(sort -t'|' -k5 file1) <(sort -t'|' -k1 file2)
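A sketch of the same join written with the -1 and -2 options, which name the join field of each file, run on explicitly sorted copies. The temporary file names, LC_ALL=C, and the sed step that removes the space after | in file2 are assumptions added here. Note that the result comes out ordered by the IP string, not in the original file1 order.

```shell
#!/usr/bin/env bash
export LC_ALL=C   # make sort and join use the same collation
tmp=$(mktemp -d)
printf '%s\n' 'abcd-efg|random1||abcd|10.10.1.1||' 'bcde-ab|random2||bc|10.1.2.2||' \
  'efgh-bd|ramdom3||fgh|10.2.1.1||' 'ijkl|random4||mno|10.3.2.3||' > "$tmp/file1"
printf '%s\n' '10.10.1.1| yes' '10.1.2.2| no' '10.2.1.1| yes' '10.3.2.3| no' > "$tmp/file2"

# Sort file1 on field 5; strip the space after | in file2 and sort it on field 1.
sort -t'|' -k5,5 "$tmp/file1" > "$tmp/file1.sorted"
sed 's/| */|/' "$tmp/file2" | sort -t'|' -k1,1 > "$tmp/file2.sorted"

# Join field 5 of file1 with field 1 of file2 and pick the output fields.
joined=$(join -t'|' -1 5 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,2.2 \
  "$tmp/file1.sorted" "$tmp/file2.sorted")

printf '%s\n' "$joined"
rm -rf "$tmp"
```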
Assuming both files list the IP addresses in the same order, as in the OP's example:
paste -d'\0' file1 <(cut -d' ' -f2 file2)
cut -d' ' -f2 file2 selects the second column of file2; the column separator is the space character, given by the delimiter option -d' '
Using process substitution, the output of the cut command is passed to the paste command as if it were a file
The paste command then combines file1 and the output of cut column-wise, without any character in between (reference: paste without delimiter)

match records of a file with field values of another file in awk

I want to sum up values in a field of a CSV file
where matching should be checked by reading another file,
say we have CSV_file:
adam,18
denis,19
julie,17
adam,15
max,20
julie,19
and a simple txt file containing:
adam
julie
all I need is to sum up 18,15,17,19
how could I easily do that with awk?
awk 'NR==FNR{ s[$1]+= $2; next} {t+=s[$1]} END{ print t}' FS=, csv-file names.txt
Assuming names.txt is:
adam
julie
And values.txt is:
adam,18
denis,19
julie,17
adam,15
max,20
julie,19
Then you can make use of grep's -f flag, which reads patterns from a file, one pattern per line, and returns all lines from values.txt that match any pattern. Then we just use awk to parse out the numbers and sum:
grep -f names.txt values.txt | \
awk 'BEGIN{FS=",";total=0}{total+=$2}END{print total}'
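grep -f treats each line of names.txt as a pattern that may match anywhere in a line, so a name such as adam would also match a row like adamson,100. If exact matches on the first field are needed, a sketch in awk alone could look like this; the adamson row and the array name want are hypothetical additions.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
printf '%s\n' adam julie > "$tmp/names.txt"
# Sample data from the question plus one hypothetical near-miss row.
printf '%s\n' 'adam,18' 'denis,19' 'julie,17' 'adam,15' 'max,20' 'julie,19' \
  'adamson,100' > "$tmp/values.txt"

total=$(awk -F, '
  # First file (names.txt): record each wanted name as an array key.
  NR == FNR { want[$1]; next }
  # Second file (values.txt): add the value only for exact name matches.
  $1 in want { sum += $2 }
  # Adding 0 prints 0 instead of an empty line when nothing matched.
  END { print sum + 0 }
' "$tmp/names.txt" "$tmp/values.txt")

printf '%s\n' "$total"
rm -rf "$tmp"
```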

How to check whether rows of a file occur within the rows of another file

I am new to shell scripting and Bash. I have file1 with one column and about 5000 rows, and file2 with five columns and about 240k rows. How can I check whether each of the 5000 values in file1 appears in the second column of file2?
$ wc -l file1
5188
$ wc -l file2
240,888
You can do this with awk, something like this:
awk 'NR == FNR {a[$1]; next} $2 in a {print $2}' file1 file2
Basically you read the first file in and store its contents in an array "a". Then you read the second file and check if the second field of each line is contained within array "a" and print it if it is.
My answer assumes your fields are separated by white space; if they are not, you will have to change the separator. For example, if your fields are separated by commas, you will need:
awk -F, .....
The above syntax does work. The same check can also be written with file2 read first; this prints each value of file1 that occurs in the second column of file2:
awk 'FNR==NR{a[$2]; next} $1 in a' file2 file1
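To list the file1 values that never occur in the second column of file2, one possible sketch loads the small file into memory and streams the large one. The three-line sample files and the array name seen are assumptions; whitespace-separated fields are assumed, as in the answer above.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
# Tiny hypothetical stand-ins for the real 5188-row and 240k-row files.
printf '%s\n' A1 B2 C3 > "$tmp/file1"
printf '%s\n' 'x A1 1 2 3' 'y C3 4 5 6' 'z D4 7 8 9' > "$tmp/file2"

missing=$(awk '
  # First file (file1): every value starts out as not seen.
  NR == FNR { seen[$1] = 0; next }
  # Second file (file2): mark values that appear in column 2.
  $2 in seen { seen[$2] = 1 }
  # Report the values that were never marked.
  END { for (k in seen) if (!seen[k]) print k }
' "$tmp/file1" "$tmp/file2" | sort)

printf '%s\n' "$missing"
rm -rf "$tmp"
```

The final sort is there only because awk visits array keys in an unspecified order.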
