Compare column1 in File1 with column1 in File2, output the lines of File2 whose column1 does not exist in File1 - bash

Below is my file 1 content:
123|yid|def|
456|kks|jkl|
789|mno|vsasd|
and this is my file 2 content
123|abc|def|
456|ghi|jkl|
789|mno|pqr|
134|rst|uvw|
The only thing I want to compare between File 1 and File 2 is column 1. Based on the files above, the output should be only:
134|rst|uvw|
Line-by-line comparisons are not the answer, since columns 2 and 3 contain different things; only column 1 contains exactly the same values in both files.
How can I achieve this?
Currently I'm using this in my code:
#sort FILEs first before comparing
sort $FILE_1 > $FILE_1_sorted
sort $FILE_2 > $FILE_2_sorted
for oid in $(cat $FILE_1_sorted |awk -F"|" '{print $1}');
do
echo "output oid $oid"
#for every oid in FILE 1, compare it with oid FILE 2 and output the difference
grep -v diff "^${oid}|" $FILE_1 $FILE_2 | grep \< | cut -d \ -f 2 > $FILE_1_tmp

You can do this in Awk very easily!
awk 'BEGIN{FS=OFS="|"}FNR==NR{unique[$1]; next}!($1 in unique)' file1 file2
Awk works by processing input lines one at a time. Awk also provides the special clauses BEGIN{} and END{}, which enclose actions to run before and after the processing of the input.
So the part BEGIN{FS=OFS="|"} runs before any lines are read, and FS and OFS are special variables in Awk that stand for the input and output field separators. Since your files are delimited by |, you need to parse them with FS="|", and to print them back with |, you set OFS="|".
The main part of the command comes after the BEGIN clause. The condition FNR==NR restricts a block to the first file given on the command line, because FNR is the line number within the current file while NR is the line number across all files combined, so the two are equal only while the first file is being read. Each $1 of the first file is stored as a key in the array unique, and next skips the rest of the program for that line. When the second file is processed, the condition !($1 in unique) prints only the lines whose $1 is not among the stored keys, i.e. it drops the lines whose key also exists in the first file.
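Run against the sample files above (with file1 first and file2 second on the command line), this prints exactly the expected line:
$ awk 'BEGIN{FS=OFS="|"}FNR==NR{unique[$1]; next}!($1 in unique)' file1 file2
134|rst|uvw|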

Here is another one liner that uses join, sort and grep
join -t"|" -j 1 -a 2 <(sort -t"|" -k1,1 file1) <(sort -t"|" -k1,1 file2) |\
grep -E -v '.*\|.*\|.*\|.*\|'
join does two things here. It pairs all lines from both files with matching keys and, with the -a 2 option, also prints the unmatched lines from file2.
Since join requires input files to be sorted, we sort them.
Finally, grep -v removes the joined (matched) lines, which contain more than three | separators, leaving only the lines that exist solely in file2.
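With the sample files, the joined stream (before the grep) should look something like this; only the line unique to file2 still has just three | separators:
123|yid|def||abc|def|
134|rst|uvw|
456|kks|jkl||ghi|jkl|
789|mno|vsasd||mno|pqr|
so the final output is 134|rst|uvw|, as required.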

Related

Bash: Remove unique and keep duplicate

I have a large file with 100k lines and about 22 columns. I would like to remove all lines in which the content in column 15 appears only once. So as far as I understand, it's the reverse of
sort -u file.txt
After the lines that are unique in column 15 are removed, I would like to shuffle all lines again, so nothing is sorted. For this I would use
shuf file.txt
The resulting file should include only lines that have at least one duplicate (in column 15) but are in a random order.
I have tried to work around sort -u, but it only sorts out unique lines and discards the extra duplicate lines I actually need. However, not only do I need the unique lines removed, I also want to keep every line of a duplicate, not just one representative per duplicate.
Thank you.
Use uniq -d to get a list of all the duplicate values, then filter the file so only those lines are included.
awk -F'\t' 'NR==FNR { dup[$0]; next; }
$15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt > newfile.txt
awk -F'\t' '{print $15}' file.txt | sort | uniq -d returns a list of all the values that appear more than once in column 15.
The NR==FNR block in the main awk script reads that list (fed in as the first input via process substitution) into an associative array.
The second line processes file.txt and prints any lines where column 15 is in the array.
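To also get the random order the question asks for, shuffle the filtered file afterwards, for example (shuffled.txt is just an example output name):
shuf newfile.txt > shuffled.txt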

How to ignore headers when merging single column of multiple CSV files?

I need to merge a single column from multiple CSV files whilst disregarding the headers.
file 1:
id,backer_uid,fname,lname
123,uj2uj2,JOHN,SMITH
file 2:
id,backer_uid,fname,lname
124,uj2uh3,BRIAN,DOOLEY
Output:
JOHN
BRIAN
Currently, I am using:
# Merge 3rd column from all csv files
awk -F "\"*,\"*" '{print $3}' *.csv > merged.csv
But how do I ignore the headers?
You can do it with awk, nearly as you have already done, by adding a condition on the FNR (the record number per file):
awk -F, 'FNR > 1 {print $3}' *.csv > merged.csv
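Assuming the two samples are saved as file1.csv and file2.csv, this should give exactly the desired output:
$ awk -F, 'FNR > 1 {print $3}' *.csv
JOHN
BRIAN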
Use tail and cut:
tail -q -n +2 *.csv | cut -f3 -d, > merged.csv
tail -n +2 prints all lines of files starting from line number 2
-q suppresses printing of file names
cut -f3 -d, extracts the third field, treating , as the delimiter
Try the following if you have to read only 2 files:
awk -F, 'FNR>1{print $(NF-1)}' file[12]
Here I set the field separator to a comma, then check whether the line number within the current file is greater than 1 and, if so, print the second-to-last field. Note that the glob file[12] only reads files named file1 and file2; if you have more files than that, use file* instead.

checking that the rows in a file have the same number of columns

I have a number of tsv files, and I want to check that each file is correctly formatted. primarily, I want to check that each row has the right number of columns. is there a way to do this? I'd love a command line solution if there is one.
Adding this here because the other answers were all close but didn't quite work for me: in my case I needed to specify the field separator for awk.
The following should return a single line containing the number of columns (if every row has the same number of columns).
$ awk -F'\t' '{print NF}' test.tsv | sort -nu
8
-F is used to specify the field separator for awk
NF is the number of fields
-nu orders the field count for each row numerically and returns only the unique ones
If you get more than one row returned, then there are some rows of your .tsv with more columns than others.
To check that the .tsv is correctly formatted, with every row having the same number of fields, the following should return 1 (as kmace commented on the accepted answer); again, I needed to add the -F'\t':
$ awk -F'\t' '{print NF}' test.tsv | sort -nu | wc -l
awk '{print NF}' test | sort -nu | head -n 1
This gives you the lowest number of columns in the file on any given row.
awk '{print NF}' test | sort -nu | tail -n 1
This gives you the highest number of columns in the file on any given row.
If every row has the same number of columns, the two results will be the same.
Note: this gives me an error on OS X, but not on Debian... maybe use gawk.
(I'm assuming that by "tsv", you mean a file whose columns are separated with tab characters.)
You can do this simply with awk, as long as the file doesn't have quoted fields containing tab characters.
If you know how many columns you expect, the following will work:
awk -F '\t' -v NCOLS=42 'NF!=NCOLS{printf "Wrong number of columns at line %d\n", NR}'
(Of course, you need to change the 42 to the correct value.)
You could also automatically pick up the number of columns from the first line:
awk -F '\t' 'NR==1{NCOLS=NF};NF!=NCOLS{printf "Wrong number of columns at line %d\n", NR}'
That will work (with a lot of noise) if the first line has the wrong number of columns, but it would fail to detect a file where all the lines have the same wrong number of columns. So you're probably better off with the first version, which forces you to specify the column count.
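For instance, a quick run of the fixed-count version against a small made-up file (sample.tsv is hypothetical; its second line has only two columns) would look like:
$ printf 'a\tb\tc\nd\te\nf\tg\th\n' > sample.tsv
$ awk -F '\t' -v NCOLS=3 'NF!=NCOLS{printf "Wrong number of columns at line %d\n", NR}' sample.tsv
Wrong number of columns at line 2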
Just cleaning up snd's answer above:
number_uniq_row_lengths=$(awk '{print NF}' "$pclFile" | sort -nu | wc -l)
if [ "$number_uniq_row_lengths" -eq 1 ] 2>/dev/null; then
    echo "$pclFile is clean"
fi
awk is a good candidate for this. If your columns are separated by tabs (I guess that is what tsv means) and if you know how many of them you should have, say 17, you can try:
awk -F'\t' 'NF != 17 {print}' file.tsv
This will print all lines in file.tsv that do not have exactly 17 tab-separated columns. If my guess is incorrect, please edit your question and add the missing information (column separators, number of columns, ...). Note that the tsv (and csv) formats are trickier than they seem: fields can contain the field separator, and records can span several lines. If that is your case, do not try to reinvent the wheel; use an existing tsv parser.

grep matching specific position in lines using words from other file

I have 2 file
file1:
12342015010198765hello
12342015010188765hello
12342015010178765hello
where each line contains fields at fixed positions; for example, positions 13 - 17 hold the account_id
file2:
98765
88765
which contains a list of account_ids.
In Korn Shell, I want to print the lines from file1 whose positions 13 - 17 match one of the account_ids in file2.
I can't do
grep -f file2 file1
because an account_id from file2 can also match other fields at other positions.
I have tried putting this pattern in file2:
^.{12}98765.*
but it did not work.
Using awk
$ awk 'NR==FNR{a[$1]=1;next;} substr($0,13,5) in a' file2 file1
12342015010198765hello
12342015010188765hello
How it works
NR==FNR{a[$1]=1;next;}
FNR is the number of lines read so far from the current file and NR is the total number of lines read so far. Thus, if FNR==NR, we are reading the first file which is file2.
Each ID in file2 is saved in the array a. Then, we skip the rest of the commands and jump to the next line.
substr($0,13,5) in a
If we reach this command, we are working on the second file, file1.
This condition is true if the 5 character long substring that starts at position 13 is in array a. If the condition is true, then awk performs the default action which is to print the line.
Using grep
You mentioned trying a pattern like
^.{12}98765.*
That uses extended regex syntax, which means that -E is required. Also, there is no value in matching .* at the end: it always matches. Thus, try:
$ grep -E '^.{12}98765' file1
12342015010198765hello
To get both lines:
$ grep -E '^.{12}[89]8765' file1
12342015010198765hello
12342015010188765hello
This works because [89]8765 just happens to match the IDs of interest in file2. The awk solution, of course, provides more flexibility in what IDs to match.
Using sed with extended regex:
sed -r 's#.*#/^.{12}&/p#' file2 |sed -nr -f- file1
Using Basic regex:
sed 's#.*#/^.\\{12\\}&/p#' file2 | sed -n -f- file1
Explanation:
sed -r 's#.*#/^.{12}&/p#' file2
will generate this output:
/^.{12}98765/p
/^.{12}88765/p
which is then used as a sed script by the second sed after the pipe, which in turn outputs:
12342015010198765hello
12342015010188765hello
Using Grep
The most convenient approach is to put each alternative pattern on a separate line of a file and pass that file with grep -f, as sketched below.
You can look at this question:
grep multiple patterns single file argument list too long
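A minimal sketch of that idea, assuming ksh93/bash process substitution and grep -E (the anchored patterns are built from file2 on the fly):
$ grep -E -f <(sed 's/^/^.{12}/' file2) file1
12342015010198765hello
12342015010188765hello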

BASH Substracting Files on Key line by line

I just want to subtract one CSV file from another, but not by checking whether whole lines are identical. Instead of comparing complete lines, I'd like to check whether the lines match in one field.
e.g. the first file
EMAIL;NAME;SALUTATION;ID
foo#bar.com;Foo;Mr;1
bar#foo.com;Bar;Ms;2
and the second file
EMAIL;NAME
foo#bar.com;Foo
the resultfile should be
EMAIL;NAME;SALUTATION;ID
bar#foo.com;Bar;Ms;2
I think you know what I mean ;)
How is that possible in bash? It's easy for me to do this in Java, but I'd really like to learn how to do it in bash. I can also subtract by comparing whole lines, using sort:
#!/bin/bash
echo "Subtracting Files..."
sort "/tmp/list1.csv" "/tmp/list2.csv" "/tmp/list2.csv" | uniq -u >> /tmp/subList.csv
echo "Files successfully subtracted."
But the lines aren't identical tuples, so I have to compare the lines by key.
Any suggestions? Thanks a lot.. Nils
One possible solution that comes to mind is this one (it works in bash):
grep -v -f <(cut -d ";" -f1 /tmp/list2.csv) /tmp/list1.csv
That means:
cut -d ";" -f1 /tmp/list2.csv: Extract the first column of the second file.
grep -f some_file: Use a file as pattern source.
<(some_command): This is a process substitution. It executes the command and feeds the output to a named pipe which then can be used as file input to grep -f.
grep -v: Print only the lines not matching the pattern(s).
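With the sample files, the output of that command should be:
bar#foo.com;Bar;Ms;2
Note that the header line of list1 is filtered out too, because the pattern EMAIL taken from list2's header also matches it; the join-based update below keeps the header.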
Update: the solution to the question, via join and awk.
join --header -1 1 -2 1 -t";" --nocheck-order -v 1 1.csv 2.csv | awk 'NR==1 {print gensub(";[^;]+$","","g"); next} 1'
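With the sample files saved as 1.csv and 2.csv (gensub requires gawk), the output should be:
EMAIL;NAME;SALUTATION;ID
bar#foo.com;Bar;Ms;2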
These were the answers to the inverse problem (keeping only the matching lines):
$ join -1 1 -2 1 -t";" --nocheck-order -o 1.1,1.2,1.3,1.4 1.csv 2.csv
EMAIL;NAME;SALUTATION;ID
foo#bar.com;Foo;Mr;1
join to the rescue.
Or, skipping the NAME field without -o:
$ join -1 1 -2 1 -t";" --nocheck-order 1.csv 2.csv | awk 'BEGIN {FS=";" ; OFS=";"} {$NF=""; print }'
(But it still prints an extra ; after the last field.)
HTH
