Count repeated columns of a line, print all lines and their count - bash

I want:
$ cat file
ABCDEFG, XXX
ABCDEFG, YYY
ABCDEFG, ZZZ
AAAAAAA, XZY
BBBBBBB, XYZ
CCCCCCC, YXZ
DDDDDDD, YZX
CDEFGHI, ZYX
CDEFGHI, XZY
$ cat file | magic
3 ABCDEFG, XXX
3 ABCDEFG, YYY
3 ABCDEFG, ZZZ
1 AAAAAAA, XZY
1 BBBBBBB, XYZ
1 CCCCCCC, YXZ
1 DDDDDDD, YZX
2 CDEFGHI, ZYX
2 CDEFGHI, XZY
So: a pre-sorted file goes in, identify repeats in the first column, count the number of lines in each repeated group, and print that count in front of every line of the group, including whatever is in column 2, which can be anything and is not relevant to the count.
Two problems:
1) get the effect of uniq -c, but without deleting the duplicates.
My really "hacky" sed -e solution after searching online was this:
cat file | cut -d',' -f1 | uniq -c | sed -E -e 's/([0-9][0-9]*) (.*)/echo $(yes \1 \2 | head -\1)/;e' | sed -E 's/ ([0-9])/;\1/g' | tr ';' '\n'
I was surprised to see things like head -\1 working, but well great. However, I feel like there should be a much simpler solution to the problem.
2) The above gets rid of the second column. I could just run my code first and then paste the result back next to the second column of the original file, but the file is massive and I want things to be as efficient as possible.
Any suggestions?
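For reference, the paste-back idea from problem 2 could look something like this with process substitution (only a sketch of the approach I described, repeating each uniq -c count once per line of its group and gluing it onto the original file, though it still makes two passes over the data):
paste -d' ' <(cut -d',' -f1 file | uniq -c | awk '{for(i=0;i<$1;i++) print $1}') file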

One in awk. Pretty tired so not fully tested. I hope it works, good night:
$ awk -F, '
$1!=p {                      # first field changed: flush the buffered group
    for(i=1;i<c;i++)
        print c-1,a[i]       # c-1 is the number of lines in the group
    c=1
}
{
    a[c++]=$0                # buffer the current line
    p=$1                     # remember the current key
}
END {                        # flush the final group
    for(i=1;i<c;i++)
        print c-1,a[i]
}' file
Output:
3 ABCDEFG, XXX
3 ABCDEFG, YYY
3 ABCDEFG, ZZZ
1 AAAAAAA, XZY
1 BBBBBBB, XYZ
1 CCCCCCC, YXZ
1 DDDDDDD, YZX
2 CDEFGHI, ZYX
2 CDEFGHI, XZY

Here's one way using awk that passes the file twice. On the first pass, use an associative array to store the counts of the first column. On the second pass, print the array value and the line itself:
awk -F, 'FNR==NR { a[$1]++; next } { print a[$1], $0 }' file{,}
Results:
3 ABCDEFG, XXX
3 ABCDEFG, YYY
3 ABCDEFG, ZZZ
1 AAAAAAA, XZY
1 BBBBBBB, XYZ
1 CCCCCCC, YXZ
1 DDDDDDD, YZX
2 CDEFGHI, ZYX
2 CDEFGHI, XZY
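Note that file{,} is just bash brace expansion that passes the same filename twice, so awk reads the file in two passes; FNR==NR is true only during the first pass. Written out without the brace expansion, the equivalent command is:
awk -F, 'FNR==NR { a[$1]++; next } { print a[$1], $0 }' file file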

Related

Duplicates in column: randomly keep one

I have a file (input.txt) with a structure similar to this:
abc 1
bcd a
cde 1
def 4
efg a
fgh 3
I want to remove duplicates in column 2, so that only unique strings are left in that column (independently of what is in column 1). But the line that is kept should be selected at random. The output could for example be:
bcd a
cde 1
def 4
fgh 3
I tried to create a file listing the duplicates (using awk '{print $2}' input.txt | sort | uniq -D | uniq) but then I only managed to remove them all with awk '!A[$2]++' instead of randomly keeping one of the duplicates.
Pre-process the input to randomize it:
shuf input.txt | awk '!A[$2]++'
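A note on why this works: !A[$2]++ keeps only the first line seen for each value of column 2, and since shuf has already randomized the line order, that first occurrence is effectively a random pick. If output sorted like the example is wanted, a sort can be appended (optional, not required by the question):
shuf input.txt | awk '!A[$2]++' | sort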
With GNU awk for true multi-dimensional arrays:
$ awk '{a[$2][++cnt[$2]]=$0} END{srand(); for (k in a) print a[k][int(rand()*cnt[k])+1]}' file
efg a
cde 1
fgh 3
def 4
With other awks:
$ awk '{keys[$2]; a[$2,++cnt[$2]]=$0} END{srand(); for (k in keys) print a[k,int(rand()*cnt[k])+1]}' file
bcd a
abc 1
fgh 3
def 4
With perl
$ perl -MList::Util=shuffle -e 'print grep { !$seen{(split)[1]}++ } shuffle <>' input.txt
def 4
fgh 3
bcd a
abc 1
-MList::Util=shuffle to import the shuffle function from the List::Util module
shuffle <>: here <> reads all input lines as an array, which then gets shuffled
grep { !$seen{(split)[1]}++ } to keep only the first line seen for each 2nd field (split on whitespace)
With ruby
$ ruby -e 'puts readlines.shuffle.uniq {|s| s.split[1]}' input.txt
abc 1
bcd a
fgh 3
def 4
readlines to get all lines from the input file as an array
shuffle to randomize the order of the elements
uniq to keep only the first element for each key
{|s| s.split[1]} to use the 2nd field as the key, with whitespace as separator
puts to print the array elements

How to sort one column and add values later

I would like to first sort a specific column, which I do using sort -k2 <file>. Then, after it is sorted by the values in the second column, I would like to add up all the values from column 1 for rows sharing the same column 2 value, delete the duplicates, and keep the summed value in column 1.
Example:
2 AAAAAA
3 BBBBBB
1 AAAAAA
2 BBBBBB
1 CCCCCC
sort -k2 <file> does this:
2 AAAAAA
1 AAAAAA
3 BBBBBB
2 BBBBBB
1 CCCCCC
I know uniq -c will remove duplicates and output how many times each occurred, however I don't want to know how many times it occurred, I just need column 1 to be added up and displayed. So that I would get:
3 AAAAAA
5 BBBBBB
1 CCCCCC
I came up with a solution using two for loops:
The first loop iterates over all the different strings in the file (test.txt); for each one, the second loop finds all the matching numbers in the original file and adds them up. After adding all the numbers we echo the total and the string.
for chars in `sort -k2 test.txt | uniq -f 1 | cut -d' ' -f 2`; do
    total=0
    for nr in `grep "$chars" test.txt | cut -d' ' -f 1`; do
        total=$(($total+$nr))
    done
    echo $total $chars
done
uniq -c is your enemy; you explicitly said you do not want the count. Here is my suggestion:
sort -k2 <file> | uniq -f1 > file2
which gives me
cat file2
1 AAAAAA
2 BBBBBB
1 CCCCCC
If you want only column 2 in file2, then use awk:
sort -k2 <file>| uniq -f1 |awk '{print $2}' > file2
leading to
AAAAAA
BBBBBB
CCCCCC
Now I get it at last. If you want to sum column 1, then just use awk; of course you cannot make a grouped sum with uniq:
awk '{array[$2]+=$1} END { for (i in array) {print array[i], i}}' file |sort -k2
which leads to your solution (even if I sorted afterwards):
3 AAAAAA
5 BBBBBB
1 CCCCCC
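For readability, here is the same one-liner again with comments (just a restatement of the awk above, not a different approach):
awk '{ array[$2] += $1 }                          # sum column 1 per column-2 key
     END { for (i in array) print array[i], i }   # print each total and its key
' file | sort -k2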

all pairs of consecutive lines sharing a field, using awk

I would like to process a multi-line, multi-field input file so that I get a file with all pairs of consecutive lines, ONLY IF they have the same value in field #1.
That is, for each line the output would contain the line itself + the next line, and would omit pairs of lines with different values in field #1.
It's better explained with an example.
Given this input:
1 this
1 that
1 nye
2 more
2 sit
I want to produce something like:
1 this 1 that
1 that 1 nye
2 more 2 sit
So far I've got this:
awk 'NR % 2 == 1 { i=$0 ; next } { print i,$0 } END { if ( NR % 2 == 1 ) { print i } }' input.txt
My output:
1 this 1 that
1 nye 2 more
2 sit
As you can see, my code is blind to field #1 value, and also (and more importantly) it omits "intermediate" results like 1 that 1 nye (once it's done with a line, it jumps to the next pair of lines).
Any ideas? My preferred language is awk/gawk, but if it can be done using unix bash it's ok as well.
Thanks in advance!
You can use this awk:
awk 'NR>1 && ($1 in a){print a[$1], $0} {a[$1]=$0}' file
1 this 1 that
1 that 1 nye
2 more 2 sit
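The same one-liner spelled out with comments (a restatement, nothing new): a[$1] always holds the most recent line seen for that key, so every line after the first with a given key is printed next to its predecessor.
awk 'NR>1 && ($1 in a) { print a[$1], $0 }   # pair the stored previous line for this key with the current one
     { a[$1] = $0 }                          # remember the current line, keyed by field 1
' file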
You can do it with simple commands. Assuming your input file is "test.txt" with content:
1 this
1 that
1 nye
2 more
2 sit
The following commands give the requested output:
sort -n test.txt > tmp1
(echo; cat tmp1) | paste -d' ' - tmp1 | egrep '^([0-9])+ *[^ ]* *\1'
Just for fun
paste -d" " filename <(sed 1d filename) | awk '$1==$3'

Bash/Shell: analyse tab-separated CSV for lines with data in n-th column

I have a tab-separated CSV, too big to download and open locally.
I want to show any lines with data in the n-th column, that is, those lines with anything other than a tab right before the n-th tab of that line.
I'd post what I've tried so far, but my sed knowledge is merely enough to assume that it can be done with sed.
edit1:
sample
id num name title
1 1 foo foo
2 2 bar
3 3 baz baz
If n=3 (name), then I want to output the rows 1+3.
If n=4 (title), then I want to output all the lines.
edit 2:
I found this possible solution:
awk -F '","' 'BEGIN {OFS=","} { if (toupper($5) == "STRING 1") print }' file1.csv > file2.csv
source: https://unix.stackexchange.com/questions/97070/filter-a-csv-file-based-on-the-5th-column-values-of-a-file-and-print-those-reco
But trying
awk -F '"\t"' 'BEGIN {OFS="\t"} { if (toupper($72) != "") print }' data.csv > data-tmp.csv
did not work (the result file was empty), so I probably got the \t wrong? (copy & paste without understanding awk)
I'm not exactly sure I understand your desired behaviour. Is this it?
$ cat file
id num name title
1 1 foo foo
2 2 bar
3 3 baz baz
$ awk -v n=3 -F$'\t' 'NR>1&&$n!=""' file
1 1 foo foo
3 3 baz baz
$ awk -v n=4 -F$'\t' 'NR>1&&$n!=""' file
1 1 foo foo
2 2 bar
3 3 baz baz
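If the $'\t' syntax is unfamiliar: it is bash's ANSI-C quoting for a literal tab character. Letting awk interpret the escape itself works just as well, for example:
awk -v n=3 -F'\t' 'NR>1 && $n != ""' file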
I'll assume you have enough space on the remote machine:
1) use cut to get the desired column N (the delimiter is tab by default)
cut -f N data.csv > tempfile
2) get the line numbers of the non-empty lines only
grep -vn '^$' tempfile | sed 's/:.*//' > linesfile
3) use sed to extract those lines from the original file
while read linenumber ; do
    sed -n "${linenumber}p" data.csv >> newdatafile
done < linesfile
Unfortunately the line number cannot be extracted by piping the cut output to grep, but I am pretty sure there are more elegant solutions.

how to "reverse" the output bash

I have an output that looks like this: (number of occurrences of the word, and the word)
3 I
2 come
2 from
1 Slovenia
But I want it to look like this:
I 3
come 2
from 2
Slovenia 1
I got my output with:
cut -d' ' -f1 "file" | uniq -c | sort -nr
I tried different things with additional pipes:
cut -d' ' -f1 "file" | uniq -c | sort -nr | cut -d' ' -f8 ...?
which is a good start, because now the words come first... but I have no access to the number of occurrences?
AWK and SED are not allowed!
EDIT:
Alright, let's say the file looks like this:
I ....
come ...
from ...
Slovenia ...
I ...
I ....
come ...
from ....
I is repeated 3 times, come twice, from twice, Slovenia once. They are at the beginning of each line.
AWK and SED are not allowed!
Starting with this:
$ cat file
3 I
2 come
2 from
1 Slovenia
The order can be reversed with this:
$ while read count word; do echo "$word $count"; done <file
I 3
come 2
from 2
Slovenia 1
Complete pipeline
Let us start with:
$ cat file2
I ....
come ...
from ...
Slovenia ...
I ...
I ....
come ...
from ....
Using your pipeline (with two changes) combined with the while loop:
$ cut -d' ' -f1 "file2" | sort | uniq -c | sort -snr | while read count word; do echo "$word $count"; done
I 3
come 2
from 2
Slovenia 1
The first change that I made to the pipeline was to put a sort before uniq -c; this is because uniq -c assumes that its input is sorted. The second change was to add the -s option to the second sort so that the alphabetical order of the words with the same count is not lost.
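To see why the extra sort matters: uniq -c only merges adjacent duplicates, so without it the counts for file2 would come out fragmented, roughly like this:
$ cut -d' ' -f1 "file2" | uniq -c
1 I
1 come
1 from
1 Slovenia
2 I
1 come
1 from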
You can just pipe an awk after your first try:
$ cat so.txt
3 I
2 come
2 from
1 Slovenia
$ cat so.txt | awk '{ print $2 " " $1}'
I 3
come 2
from 2
Slovenia 1
If perl is allowed:
$ cat testfile
I ....
come ...
from ...
Slovenia ...
I ...
I ....
come ...
from ....
$ perl -e 'my %list;
while(<>){
    chomp;                 # strip \n from the end
    s/^ *([^ ]*).*/$1/;    # keep only the 1st word
    $list{$_}++;           # increment the count
}
foreach (keys %list){
    print "$_ $list{$_}\n";
}' < testfile
come 2
Slovenia 1
I 3
from 2
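Perl hash iteration order is arbitrary, which is why the words come out unsorted above. If the count-descending order of the original output matters, the same script (compressed to one line here) can simply be piped through sort on the second field; this is an optional extra step, not part of the answer itself:
$ perl -e 'while(<>){chomp; s/^ *([^ ]*).*/$1/; $list{$_}++} print "$_ $list{$_}\n" for keys %list' testfile | sort -k2,2nr
I 3
come 2
from 2
Slovenia 1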
