shell - compare files and update matching string awk/sed/diff/grep/csv - shell

I need to compare 2 csv files and make modifications to the second column. I wrote out the logic out of how I would want to achieve this however, it seems to confuse the thread a lot more than I wanted too so I'll just write out the example.
Any help would be appreciated. Thanks in advance.
file1
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName3
user4,distinguishedName4
user5,distinguishedName5
file2
user1,distinguishedName1
user3,distinguishedName13
user5,distinguishedName12
user6,distinguishedName4
desired outcome:
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName13
user4,distinguishedName4
user5,distinguishedName12
user6,distinguishedName4

The solution using join command combined with awk command:
join -t',' -j1 -a1 -a2 file1 file2 | awk -F',' '{if(NF==3) $0=$1FS$3}1'
The output:
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName13
user4,distinguishedName4
user5,distinguishedName12
user6,distinguishedName4
Explanation:
-- for join command:
-t',' - defines field separator
-j1 - tells to join on first field 1
-a FILENUM - print unpairable lines coming from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2
-- for awk command:
NF - contains a total number of fields
FS - field separator(i.e. ,)
if(NF==3) $0=$1FS$3 - the condition, checks if there's a complement third field(as result of joining the files on lines with common first field) to perform the replacement
https://linux.die.net/man/1/join

awk to the rescue!
awk -F, '!a[$1]++' file2 file1
user1,distinguishedName1
user3,distinguishedName13
user5,distinguishedName12
user6,distinguishedName4
user2,distinguishedName2
user4,distinguishedName4
this order is based on file2 and file1 record order, if you want sorted order just pipe to sort
awk ... | sort

Related

Extracting unique values between 2 files with awk

I need to get uniq lines when comparing 2 files. These files containing field separator ":" which should be treated as the end of line while comparing strings.
The file1 contains these lines
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
The file2 contains these lines
orange:nice
banana:long
The output file should be (2 occurrences of orange and 2 occurrences of banana deleted)
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
So the only strings before : should be compared
Is it possible to complete this task in 1 command ?
I tried to complete the task in such way but field separator does not work in that situation.
awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2 > outputfile
You basically had it, but $0 refers to the whole line when you want to deal with only the first field, which is $1.
Also you need to take care with the order of the input files. To use the values from file2 for deciding which lines to include from file1, process file2 first:
$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
One comment: awk is very ineffective with arrays. In real life with big files, better use something like:
comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u) | grep -h -f /dev/stdin f1 f2

Joining two text files based on a common field (ip address)

File1
abcd-efg|random1||abcd|10.10.1.1||
bcde-ab|random2||bc|10.1.2.2||
efgh-bd|ramdom3||fgh|10.2.1.1||
ijkl|random4||mno|10.3.2.3||
File2
10.10.1.1| yes
10.1.2.2| no
10.2.1.1| yes
10.3.2.3| no
Output should be
abcd-efg|random1||abcd|10.10.1.1||yes
bcde-ab|random2||bc|10.1.2.2||no
efgh-bd|ramdom3||fgh|10.2.1.1||yes
ijkl|random4||mno|10.3.2.3||no
I was trying to join both text files based on ip address using awk and joins but some how not able to get the right output.
Could you help me get through the right output.Thanks in advance
$ awk -F'|' 'FNR==NR{a[$1]=$2; next} {print $0 a[$5]}' file2 file1
abcd-efg|random1||abcd|10.10.1.1|| yes
bcde-ab|random2||bc|10.1.2.2|| no
efgh-bd|ramdom3||fgh|10.2.1.1|| yes
ijkl|random4||mno|10.3.2.3|| no
This approach will work even if the IPs are in the files in different orders.
How it works
-F'|'
Set the field separator on input to |.
FNR==NR{a[$1]=$2; next}
When reading the first file, file2, save the second field as a value in associative array a under the key of the first field. Skip remaining commands and jump to the next line.
print $0 a[$5]
If we get here, we are working on the second file, file1. Print the line followed by the value of a for this IP.
BSD/OSX
On BSD (OSX) awk, try:
awk -F'|' 'FNR==NR{a[$1]=$2; next;} {print $0 a[$5];}' file2 file1
Unix join command can be used for this
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 file1 file2
Explanation of options:
-t\| : Field separator is '|' (escaped)
-j1 5 -j2 1 : Join based on 5th field of file1 and 1st field of file2
-o1.1,1.2,1.3,1.4,1.5,1.6,2.2 : Output the 6 fields from file1 and 2nd field from file2
If the input files are not sorted, they need to be sorted first, like below
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 <(sort -t'|' -k5 file1) <(sort -t'|' -k1 file2)
Assuming both files have IP address in same order as shown in OP's example
paste -d'\0' file1 <(cut -d' ' -f2 file2)
cut -d' ' -f2 file2 select second column of file2, column separation is space character specified by delimiter -d' '
Using process substitution, output of cut command is passed as file input to paste command
paste command then combines file1 and output of cut column wise without any character in between (reference: paste without delimiter)

Match and merge lines based on the first column

I have 2 files:
File1
123:dataset1:dataset932
534940023023:dataset:dataset039302
49930:dataset9203:dataset2003
File2
49930:399402:3949304:293000232:30203993
123:49030:1204:9300:293920
534940023023:49993029:3949203:49293904:29399
and I would like to create
Desired result:
49930:399402:3949304:293000232:30203993:dataset9203:dataset2003
534940023023:49993029:3949203:49293904:29399:dataset:dataset039302
etc
where the result contains one line for each pair of input lines that have identical first column (with : as the column separator).
The join command is your friend here. You'll likely need to sort the inputs (either pre-sort the files, or use a process substitution if available - e.g. with bash).
Something like:
join -t ':' <(sort file2) <(sort file1) >file3
When you do not want to sort files, play with grep:
while IFS=: read key others; do
echo "${key}:${others}:$(grep "^${key}:" file1 | cut -d: -f2-)"
done < file2

join two file based on column when there is no one by one corespondness in bash script (awk, grep , sed)

file1.txt
112|9305|/inst.exe
112|9305|/lkj.exe
112|9305|/dje.jar
112|9305|/ind.pdf
112|9306|/ma.exe
112|9306|/ngg.pdf
112|9307|/jhhh.dat
112|9312|/ee.dat
112|9312|/qwq.dll
file2.txt
117|9305|www.gahan.com
117|9306|www.google.com
117|9312|www.mihan.com
117|9307|translate.com
expected output
112|9305|www.gahan.com/inst.exe
112|9305|www.gahan.com/lkj.exe
112|9305|www.gahan.com/dje.jar
112|9305|www.gahan.com/ind.pdf
112|9306|www.google.com/ma.exe
112|9306|www.google.com/ngg.pdf
112|9307|translate.com/jhhh.dat
112|9312|www.mihan.com/ee.dat
112|9312|www.mihan.com/qwq.dll
I want to add third column of file2.txt to third column of file1.txt based on second column values. In fact I want join them based on second column but there is no one bye one correspondence between them. How can I do these with awk or grep or sed in shell script.
You can use awk like this:
awk 'BEGIN{FS=OFS="|"} FNR==NR{a[$2]=$3; next} $2 in a{$3=a[$2] $3} 1' file2.txt file1.txt
112|9305|www.gahan.com/inst.exe
112|9305|www.gahan.com/lkj.exe
112|9305|www.gahan.com/dje.jar
112|9305|www.gahan.com/ind.pdf
112|9306|www.google.com/ma.exe
112|9306|www.google.com/ngg.pdf
112|9307|translate.com/jhhh.dat
112|9312|www.mihan.com/ee.dat
112|9312|www.mihan.com/qwq.dll

Search and merge multiple files in UNIX

I have multiple .dat files, which i will get to know dynamically. dat file will have two columns seperated by comma. Will be like key value pairs. For example:
File1:
entry1,100
entry2,200
entry3,300
File2:
entry1,500
entry3,750
Now I want the output as
File3:
entry1,100,500
entry2,200
entry3,300,750
Assuming you want to include unpairable lines from both files - such as entry2,200 from File2 the following command should work:
join -t, -a1 -a2 file1 file2
-t, instructs join to use comma as a delimiter, -a1 -a2 instructs join to include unpairable lines from each file.
You can use awk when input files are not already sorted:
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1]=$2; next} {print $0 (($1 in a)?OFS a[$1]:"")}' f2.dat f1.dat
entry1,100,500
entry2,200
entry3,300,750

Resources