count patterns in a csv file from another csv file in bash

I have two csv files
File A
ID
1
2
3
File B
ID
1
1
1
1
3
2
3
What I want to do is to count how many times each ID in File A shows up in File B, and save the result in a new file C (also in csv format). For example, 1 in File A shows up 4 times in File B. So in the new file C, I should have something like
File C
ID,Count
1,4
2,1
3,2
Originally I was thinking of using "grep -f", but it seems like it only works with .txt files. Unfortunately, File A and B are both in csv format. So now, I am thinking maybe I could use a for loop to get the IDs from File A individually and use grep -c to count each one of them. Any ideas would be helpful.
Thanks in advance!
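(For reference, the loop approach sketched in the question could look like the following; it works, but rereads File B once per ID, so the answers below are much faster on large files. The file names and the header handling here are assumptions.)
# Hypothetical sketch: count exact whole-line matches of each fileA ID in fileB,
# skipping the header rows with tail -n +2
echo "ID,Count" > fileC
tail -n +2 fileA | while read -r id; do
    count=$(tail -n +2 fileB | grep -cx "$id")
    echo "$id,$count" >> fileC
done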

You can use this awk command, which skips each file's header line (FNR==1), stores the IDs from fileA as keys of array a, and then counts in freq how often each of those IDs appears in fileB:
awk -v OFS=, 'FNR==1{next} FNR==NR{a[$1]; next} $1 in a{freq[$1]++}
END{print "ID", "Count"; for (i in freq) print i, freq[i]}' fileA fileB
ID,Count
1,4
2,1
3,2
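One note (an addition to the answer above): awk's for (i in freq) loop iterates in an unspecified order, so if file C must be sorted by ID you could print the header separately and sort the data rows:
{ echo "ID,Count"
  awk -v OFS=, 'FNR==1{next} FNR==NR{a[$1]; next} $1 in a{freq[$1]++}
                END{for (i in freq) print i, freq[i]}' fileA fileB | sort -t, -k1,1n
} > fileC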

You could use join, sort, uniq and process substitution <(command) creatively:
$ join -2 2 <(sort A) <(sort B | uniq -c) | sort -n > C
$ cat C
ID 1
1 4
2 1
3 2
And if you really really want the header to be ID Count, you could replace that 1 with Count using sed before writing to file C, by adding:
... | sed 's/\(ID \)1/\1Count/' > C
to get
ID Count
1 4
2 1
3 2
and if you really really want commas as separators instead of spaces, replace the spaces with commas using tr by also adding:
... | tr \ , > C
to get
ID,Count
1,4
2,1
3,2
You could of course ditch the tr and use the sed like this instead:
... | sed 's/\(ID \)1/\1Count/;s/ /,/' > C
And the output would be like above.
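Putting the pieces above together, the complete pipeline would then be:
join -2 2 <(sort A) <(sort B | uniq -c) | sort -n |
    sed 's/\(ID \)1/\1Count/;s/ /,/' > C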

Related

Using bash comm command on columns but returning the entire line

I have two files, each with two columns and sorted only by the second column, such as:
File 1:
176 AAATC
6 CCGTG
80 TTTCG
File 2:
20 AAATC
77 CTTTT
50 TTTTT
I would like to use comm command using options -13 and -23 to get two different files reporting the different lines between the two files with the corresponding count number, but only comparing the second columns (i.e. the strings). What I tried so far was something like:
comm -23 <(cut -d$'\t' -f2 file1.txt) <(cut -d$'\t' -f2 file2.txt)
But I could only have the strings in output, without the numbers:
CCGTG
TTTCG
While what I want would be:
6 CCGTG
80 TTTCG
Any suggestion?
Thanks!
You can use join instead of comm:
join -1 2 -2 2 File1 File2 -a 1 -o 1.1,1.2,2.2
It will output the matching lines, too, but you can remove them with
| grep -v '[ACTG] [ACTG]'
Explanation:
-1 2 use the second column in file 1 for joining;
-2 2 similarly, use the second column in file 2;
-a 1 show also non-matching lines from file 1 - these are the ones you want in the end;
-o specifies the output format, here we want columns 1 and 2 from file 1 and column 2 from file 2 (this is just arbitrary, you can use column 1 as well, but the second command would be different: | grep -v '[ACTG] [0-9]').
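Worth noting (an addition to this answer): join expects its inputs to be sorted on the join field. The sample files already are, but for unsorted input you could sort on the fly with process substitution; the grep -v filter above still applies afterwards:
join -1 2 -2 2 -a 1 -o 1.1,1.2,2.2 <(sort -k2 file1) <(sort -k2 file2)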
comm is not the right tool for this job, and while join will work, you would need to run join twice and then further filter the results with some other command (eg, grep).
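A sketch of that double-join idea (assuming blank-separated input; join's -v option keeps only the unpairable lines, which avoids the extra grep, and awk swaps the columns back into "count string" order):
# lines whose field-2 string occurs only in file1
join -1 2 -2 2 -v 1 <(sort -k2 file1) <(sort -k2 file2) | awk '{print $2, $1}' > file-23
# lines whose field-2 string occurs only in file2
join -1 2 -2 2 -v 2 <(sort -k2 file1) <(sort -k2 file2) | awk '{print $2, $1}' > file-13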
One awk idea that requires a single pass through each input file:
awk 'BEGIN  { FS=OFS="\t" }
FNR==NR     { f1[$2]=$1; next }      # save 1st file entries
$2 in f1    { delete f1[$2]; next }  # 2nd file: if $2 in f1[] then delete f1[] entry and skip this line, else ...
            { f2[$2]=$1 }            # save 2nd file entries
END {
    # at this point:
    # f1[] contains rows where field #2 only exists in the 1st file
    # f2[] contains rows where field #2 only exists in the 2nd file
    PROCINFO["sorted_in"]="#ind_str_asc"
    for (i in f1) print f1[i],i > "file-23"
    for (i in f2) print f2[i],i > "file-13"
}
' file1 file2
NOTE: the PROCINFO["sorted_in"] line requires GNU awk; without this line we cannot guarantee the order of writes to the final output files, and OP would then need to add more (awk) code to maintain the ordering, or use another OS-level utility (eg, sort) to sort the final files
This generates:
$ cat file-23
6 CCGTG
80 TTTCG
$ cat file-13
77 CTTTT
50 TTTTT
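If GNU awk is not available, one fallback (an addition, not part of the original answer) is to delete the PROCINFO line and sort the two output files afterwards on the string column:
# -o lets sort write the result back to its own input file
sort -k2 -o file-23 file-23
sort -k2 -o file-13 file-13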

Bash - Compare rows then print just original rows

I've got files which look like this (there can be more columns or rows):
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
dif-2-3-4-5.com 1 1 1
And I want to compare these numbers:
1 1 1
1 1 2
1 2 1
2 1 1
1 1 1
And print only those rows which do not repeat, so I get this:
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Another simple approach is sort with uniq, using a KEYDEF for fields 2-4 with sort and skipping field 1 with uniq, e.g.
$ sort file.txt -k 2,4 | uniq -f1
Example Use/Output
$ sort file.txt -k 2,4 | uniq -f1
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Keep a running record of the triples already seen and only print the first time they appear:
$ awk '!(($2,$3,$4) in seen) {print; seen[$2,$3,$4]}' file
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Try the following awk code too:
awk '!a[$2,$3,$4]++' Input_file
Explanation:
Create an array named a with $2,$3,$4 as its index. The condition !a[$2,$3,$4]++ is true only when a line's $2,$3,$4 are NOT yet present in array a, and it does 2 things:
It increments that specific index's value, so the condition will NOT be true the next time the same $2,$3,$4 index appears.
No action is specified (awk works in condition-then-action mode), so the default action is to print the current line. This goes on for all the lines in Input_file; the last line is not printed because its $2,$3,$4 are already present in array a.
I hope this helps.
This works with POSIX and GNU awk:
$ awk '{s=""
        for (i=2; i<=NF; i++)
            s = s $i "|"}
       s in seen { next }
       ++seen[s]' file
Which can be shortened to:
$ awk '{s=""; for (i=2;i<=NF; i++) s=s $i "|"} !seen[s]++' file
Also supports a variable number of columns; the "|" appended after each field keeps the keys unambiguous (so fields 1,12 and 11,2 do not collide).
If you want a sort/uniq solution that also respects file order (i.e. the first of the set of duplicates is printed, not the later ones) you need to use a decorate, sort, undecorate approach.
You can:
use cat -n to decorate the file with line numbers;
sort -k3 -k1n to sort first on all the fields from the 3rd through the end of the line, then numerically on the added line number;
use uniq -f2 to skip the line number and the first field and only keep the first in each group of dups;
finally use sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//' to remove the added line numbers:
cat -n file | sort -k3 -k1n | uniq -f2 | sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//'
Awk is easier and faster in this case.

Move Second Column of Each Row to a New Next Row

I can move every second row into the second column of the previous row by:
awk '{printf "%s%s",$0,(NR%2?FS:RS)}' file > newfile
But I can't do it the other way around. What I have is as below:
1 a
2 b
3 c
I need
1
a
2
b
3
c
I have checked several similar column-row shifting questions, but couldn't figure out my case. Thanks!
You can use this awk command with OFS='\n' to set the output field separator to a newline, after forcing awk to rewrite each record with the $1=$1 trick (assigning to a field makes awk rebuild $0 using OFS):
awk '{$1=$1} 1' OFS='\n' file
1
a
2
b
3
c
You can also use grep -o:
grep -Eo '\w+' file
1
a
2
b
3
c
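A caveat (not in the original answer): \w+ matches only letters, digits and underscores, so input containing other characters (like the dots and dashes in dif-1-2-3-4.com) would be split further. Matching runs of non-blanks is more faithful to the original columns:
grep -Eo '[^[:space:]]+' file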
Just use xargs with 1 argument at a time:
xargs -n1 <file
1
a
2
b
3
c
From the xargs man page:
-n max-args, --max-args=max-args
    Use at most max-args arguments per command line. Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the -x option is given, in which case xargs will exit.
You can use tr:
cat file | tr ' ' '\n'
or sed:
sed -r 's/ /\n/g' file
to get
1
a
2
b
3
c

unix utility to compare lists and perform a set operation

I believe what I'm asking for is a sort of set operation. I need help trying to create a list of the following:
List1 contains:
1
2
3
A
B
C
List2 contains:
1
2
3
4
5
A
B
C
D
E
The final list I need would be these (4) items:
4
5
D
E
So obviously List2 contains more elements than List1.
The final list I need contains the elements in List2 that are NOT in List1.
Which linux utility can I use to accomplish this? I have looked at sort and comm, but I'm unsure how to do this correctly. Thanks for the help
Using awk with straightforward logic: while reading file1 (where FNR==NR holds) store each whole line as a key in array a; while reading file2, print only the lines that are not keys in a.
awk 'FNR==NR{a[$0]; next}!($0 in a)' file1 file2
4
5
D
E
Using the GNU comm utility, where according to the comm man page,
comm -3 file1 file2
prints lines in file1 not in file2, and vice versa (note that comm expects its inputs to be sorted, which both lists here already are).
Using it for your example:
comm -3 file2 file1
4
5
D
E
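Since only the lines unique to List2 are wanted here, you could go one step further and suppress columns 1 and 3 (a small addition to the answer above), printing only the lines that are in list2 but not in list1:
comm -13 list1 list2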
You can do it with a simple grep command inverting the match with -v and reading the search terms from list1 with -f, e.g. grep -v -f list1 list2. Example use:
$ grep -v -f list1 list2
4
5
D
E
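One caveat worth adding: by default grep -f treats each line of list1 as a regular expression and matches substrings, so an ID like 1 would also match a line 10. The -F (fixed strings) and -x (whole-line match) flags make the comparison exact:
grep -Fvxf list1 list2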
Linux provides a number of different ways to skin this cat.
You can try this:
$ diff list1.txt list2.txt | egrep '>|<' | awk '{ print $2 }' | sort -u
4
5
D
E
I hope this helps.

Shell: put data together

I'm writing a shell script to put data together.
I have 2 files with different columns.
One of the columns is the same in both files.
Like:
File 1:
a 5
c 7
d 9
b 5
File 2:
c 1
d 8
a 6
b 3
For the moment my script puts the data in the same file with
paste -d ' ' 'file1' 'file2' > "file3"
I would like to know if it's possible to match the 2 columns together, in order, like:
a 5 6
b 5 3
c 7 1
d 9 8
Thanks
sort file1 > file1.tmp
sort file2 > file2.tmp
join -t " " -j 1 file1.tmp file2.tmp
This assumes that the characters and numbers are separated by a SPACE.
Using process substitution you can sort the files and join them in a single command.
join -t " " -j 1 <(sort file1) <(sort file2)
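With the sample files this prints:
a 5 6
b 5 3
c 7 1
d 9 8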
Use sort to sort both files, and then join to join on the first field.
