Extracting unique values between 2 files with awk - bash

I need to get unique lines when comparing 2 files. These files use ":" as a field separator, and everything from the ":" onward should be ignored (treated as the end of line) when comparing strings.
The file1 contains these lines
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
The file2 contains these lines
orange:nice
banana:long
The output file should be (the 2 orange lines and the 2 banana lines removed):
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
So only the strings before the : should be compared.
Is it possible to complete this task in one command?
I tried to do it this way, but the field separator does not work in that situation:
awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2 > outputfile

You basically had it, but $0 refers to the whole line, while you want to compare only the first field, which is $1.
You also need to take care with the order of the input files. To use the values from file2 to decide which lines to keep from file1, process file2 first:
$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green

One comment: awk can be inefficient when it has to hold large arrays in memory. In real life, with big files, it is better to use something like:
comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u) | grep -h -f /dev/stdin f1 f2
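A quick trace with the sample data above (f1 and f2 standing for the question's file1 and file2): the comm part of the pipeline produces the keys that appear in only one of the files:
$ comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u)
apple
grape
kiwi
Feeding those keys to grep -h -f /dev/stdin f1 f2 then prints the five apple/kiwi/grape lines from the desired output. Keep in mind the patterns grep reads here are unanchored, so a key like apple would also match a hypothetical pineapple:... line; anchor the patterns if that matters for your data.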

Related

How to merge two files into one in bash, line by line [duplicate]

What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure, it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with less code, maybe a one-liner using Unix tools?
paste -d '\n' file1 file2
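With the sample files above, this one-liner already produces the desired interleaving:
$ paste -d '\n' file1 file2
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
If one file is longer than the other, paste pads the exhausted side with blank lines.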
Here's a solution using awk:
awk '{print; if(getline < "file2") print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if(getline < "file2") print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 is at least as long as file2.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if(getline < "file2") print; else print ""}' file1
or
awk '{print "1: "$0; if(getline < "file2") print "2: "$0; else print"2: "}' file1
#Sujoy's answer points in a useful direction. You can add line numbers, sort, and strip the line numbers:
(cat -n file1 ; cat -n file2 ) | sort -n | cut -f2-
Note (of interest to me) this needs a little more work to get the ordering right if instead of static files you use the output of commands that may run slower or faster than one another. In that case you need to add/sort/remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -n | cut -f2- | sort -n | cut -f2-
With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.
cat file1 file2 | sort -t. -k 2.1
Here the separator is specified as "." and we sort on the first character of the second field.

Replacing a specific field using awk, sed or any POSIX tool

I have a huge file1, which has values as follows:
a 1
b 2
c 3
d 4
e 5
I have another huge file2, which is colon delimited with seven fields as follows:
a:2543:2524:2542:252:536365:54654
c:5454:5454:654:54:87:54
d:87:65:1:98:32:87
I want to look up each key from file1 and replace the value in the 7th column of the matching line in file2, so the output should be as follows:
a:2543:2524:2542:252:536365:1
c:5454:5454:654:54:87:3
d:87:65:1:98:32:4
Maybe this awk will work, assuming the files are as posted (split on either a space or a colon, and rebuild the output with colons):
awk -F'[: ]' -v OFS=: 'NR==FNR{a[$1]=$2; next} $1 in a{$7=a[$1]} 1' file1 file2
a:2543:2524:2542:252:536365:1
c:5454:5454:654:54:87:3
d:87:65:1:98:32:4
So I came up with a solution; it ended up being a few lines of code. Maybe there is a better way to do it, but this works!
while read -r line ; do
    var1=$(echo "$line" | awk '{print $1}')
    var2=$(echo "$line" | awk '{print $2}')
    awk -v var1="$var1" -v var2="$var2" -F ':' 'BEGIN { OFS = ":" } $1==var1 {sub(".*",var2,$7)} {print}' file2 > file2.tmp
    mv file2.tmp file2
done < file1
cat file2
This should work. It assumes both files are sorted on the first column. I'd be very interested in any performance comparison with your solution for very large files, both speed and memory. cut --complement is Linux-specific (GNU coreutils) but easily replaced if not available.
file0=$1  # file with the single replacement values (the key/value pairs)
file1=$2  # file whose 7th field is to be replaced
# normalize on a colon delimiter
tr ' ' : <"$file0" |
# join on the first field
join -t: "$file1" - |
# delete the old 7th column
cut --complement -d: -f7
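Saved as a script (say, a hypothetical replace7.sh) and run against the sample files, with the key/value file as the first argument and the colon-delimited file as the second, it would print:
$ bash replace7.sh file1 file2
a:2543:2524:2542:252:536365:1
c:5454:5454:654:54:87:3
d:87:65:1:98:32:4
Both sample inputs happen to be sorted on the first field, which join requires.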

Joining two text files based on a common field (ip address)

File1
abcd-efg|random1||abcd|10.10.1.1||
bcde-ab|random2||bc|10.1.2.2||
efgh-bd|ramdom3||fgh|10.2.1.1||
ijkl|random4||mno|10.3.2.3||
File2
10.10.1.1| yes
10.1.2.2| no
10.2.1.1| yes
10.3.2.3| no
Output should be
abcd-efg|random1||abcd|10.10.1.1||yes
bcde-ab|random2||bc|10.1.2.2||no
efgh-bd|ramdom3||fgh|10.2.1.1||yes
ijkl|random4||mno|10.3.2.3||no
I was trying to join both text files based on the IP address using awk and join but somehow was not able to get the right output.
Could you help me get to the right output? Thanks in advance.
$ awk -F'|' 'FNR==NR{a[$1]=$2; next} {print $0 a[$5]}' file2 file1
abcd-efg|random1||abcd|10.10.1.1|| yes
bcde-ab|random2||bc|10.1.2.2|| no
efgh-bd|ramdom3||fgh|10.2.1.1|| yes
ijkl|random4||mno|10.3.2.3|| no
This approach will work even if the IPs are in the files in different orders.
How it works
-F'|'
Set the field separator on input to |.
FNR==NR{a[$1]=$2; next}
When reading the first file, file2, save the second field as a value in associative array a under the key of the first field. Skip remaining commands and jump to the next line.
print $0 a[$5]
If we get here, we are working on the second file, file1. Print the line followed by the value of a for this IP.
BSD/OSX
On BSD (OSX) awk, try:
awk -F'|' 'FNR==NR{a[$1]=$2; next;} {print $0 a[$5];}' file2 file1
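The output shown above keeps the space that follows | in file2 (| yes rather than |yes). If you want exactly the desired output, one option (a small variant of the same idea, not part of the original answer) is to strip leading blanks from the value before storing it:
awk -F'|' 'FNR==NR{sub(/^ +/,"",$2); a[$1]=$2; next} {print $0 a[$5]}' file2 file1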
The Unix join command can be used for this:
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 file1 file2
Explanation of options:
-t\| : Field separator is '|' (escaped)
-j1 5 -j2 1 : Join based on 5th field of file1 and 1st field of file2
-o1.1,1.2,1.3,1.4,1.5,1.6,2.2 : Output the first 6 fields from file1 and the 2nd field from file2
If the input files are not sorted, they need to be sorted first, like below
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 <(sort -t'|' -k5 file1) <(sort -t'|' -k1 file2)
Assuming both files list the IP addresses in the same order, as in the OP's example:
paste -d'\0' file1 <(cut -d' ' -f2 file2)
cut -d' ' -f2 file2 selects the second column of file2; the column separator is the space character, specified by -d' '.
Using process substitution, the output of the cut command is passed as a file to the paste command.
The paste command then combines file1 and the output of cut column-wise without any character in between (reference: paste without delimiter).
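With the sample files, this prints the desired output directly:
$ paste -d'\0' file1 <(cut -d' ' -f2 file2)
abcd-efg|random1||abcd|10.10.1.1||yes
bcde-ab|random2||bc|10.1.2.2||no
efgh-bd|ramdom3||fgh|10.2.1.1||yes
ijkl|random4||mno|10.3.2.3||no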

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
If a line of file 1 is present in file 2, it should return 1, otherwise 0, in a tab-separated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
The above code is not giving me the output I am looking for.
Kindly have a look and suggest a correction.
Thank you.
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that the line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on whether the line was seen in file2.
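The question asks for a tab between the word and the flag; a minimal tweak of the same one-liner (a sketch, and still case-sensitive like the original) would be:
awk 'NR==FNR{seen[$0]=1; next} {print $0 "\t" (seen[$0] ? 1 : 0)}' file2 file1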
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
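For example, the swapped form described above, which labels each line of file2 according to whether it appears in file1, would be:
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file1 file2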
The following code should do it.
Take a close look at the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
    awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
grep -xF -f file1 file2 | sed $'s/$/\t1/'
grep -vxF -f file1 file2 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while read -r; do
    if [[ $REPLY = $'\t'* ]] ; then
        # comm -2 indents lines present in both files with a single tab
        printf '%s\t1\n' "${REPLY#?}"
    else
        # unindented lines appear only in file1
        printf '%s\t0\n' "$REPLY"
    fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have Python installed.
If you're familiar with Python, the idea is simple; only a little output formatting is needed.
#!/usr/bin/python
f1 = [line.rstrip('\n') for line in open('file1')]
f2 = [line.rstrip('\n') for line in open('file2')]
f1_in_f2 = [int(x in f2) for x in f1]
for n, c in zip(f1, f1_in_f2):
    print(n, c)

find difference between two text files with one item per line [duplicate]

This question already has answers here: How to remove the lines which appear on file B from another file A?
I have two files:
file 1
dsf
sdfsd
dsfsdf
file 2
ljljlj
lkklk
dsf
sdfsd
dsfsdf
I want to display what is in file 2 but not in file 1, so file 3 should look like
ljljlj
lkklk
grep -Fxvf file1 file2
What the flags mean:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
-x, --line-regexp
Select only those matches that exactly match the whole line.
-v, --invert-match
Invert the sense of matching, to select non-matching lines.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing.
You can try
grep -f file1 file2
to see the lines of file2 that match patterns from file1, or
grep -v -F -x -f file1 file2
to get the lines of file2 that do not appear as whole lines in file1.
You can use the comm command to compare two sorted files
comm -13 <(sort file1) <(sort file2)
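Here -13 suppresses column 1 (lines only in file1) and column 3 (lines in both), leaving just the lines unique to file2. With the sample files (note the result comes back in sorted order):
$ comm -13 <(sort file1) <(sort file2)
ljljlj
lkklk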
I successfully used
diff "${file1}" "${file2}" | grep "<" | sed 's/^<//g' > "${diff_file}"
Outputting the difference to a file.
If you are expecting them in a certain order, you can just use diff:
diff file1 file2 | grep ">"
join -v 2 <(sort file1) <(sort file2)
I tried a slight variation on Luca's answer and it worked for me.
diff file1 file2 | grep ">" | sed 's/^> //g' > diff_file
Note that the searched pattern in sed is a > followed by a space.
file1
m1
m2
m3
file2
m2
m4
m5
$ awk 'NR == FNR {file1[$0]++; next} !($0 in file1)' file1 file2
m4
m5
$ awk 'NR == FNR {file1[$0]++; next} ($0 in file1)' file1 file2
m2
And what is the awk command to get 'm1 and m3', i.e. the lines in file1 but not in file2? Swap the file arguments:
$ awk 'NR == FNR {file2[$0]++; next} !($0 in file2)' file2 file1
m1
m3
If you want to use loops, you can try something like this (diff and comm are much more efficient):
while read -r line
do
    flag=0
    while read -r line2
    do
        if [ "$line" = "$line2" ]
        then
            flag=1
        fi
    done < file1
    if [ "$flag" -eq 0 ]
    then
        echo "$line" >> file3
    fi
done < file2
Note: the program is only meant to provide a basic insight into what can be done if you don't want to use external tools such as diff and comm.
An awk answer:
awk 'NR == FNR {file1[$0]++; next} !($0 in file1)' file1 file2
With GNU sed:
sed 's#[^^]#[&]#g;s#\^#\\^#g;s#^#/^#;s#$#$/d#' file1 | sed -f- file2
How it works:
The first sed produces an output like this:
/^[d][s][f]$/d
/^[s][d][f][s][d]$/d
/^[d][s][f][s][d][f]$/d
Then it is used as a sed script by the second sed.
