I have a file (input.txt) with a structure similar to this:
abc 1
bcd a
cde 1
def 4
efg a
fgh 3
I want to remove duplicates in column 2, so that only unique strings remain in that column (regardless of what is in column 1), but the line that is kept should be chosen at random. The output could for example be:
bcd a
cde 1
def 4
fgh 3
I tried to create a file listing the duplicates (using awk '{print $2}' input.txt | sort | uniq -D | uniq), but with awk '!A[$2]++' I only managed to keep the first occurrence of each value instead of a randomly chosen one.
Pre-process the input to randomize it:
shuf input.txt | awk '!A[$2]++'
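One possible run (which line survives for each duplicated value, and the output order, will differ from run to run, since shuf randomizes the whole file before awk keeps the first line it sees for each column-2 value):
$ shuf input.txt | awk '!A[$2]++'
efg a
def 4
fgh 3
abc 1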
With GNU awk for true multi-dimensional arrays:
$ awk '{a[$2][++cnt[$2]]=$0} END{srand(); for (k in a) print a[k][int(rand()*cnt[k])+1]}' file
efg a
cde 1
fgh 3
def 4
With other awks:
$ awk '{keys[$2]; a[$2,++cnt[$2]]=$0} END{srand(); for (k in keys) print a[k,int(rand()*cnt[k])+1]}' file
bcd a
abc 1
fgh 3
def 4
With perl
$ perl -MList::Util=shuffle -e 'print grep { !$seen{(split)[1]}++ } shuffle <>' input.txt
def 4
fgh 3
bcd a
abc 1
-MList::Util=shuffle imports the shuffle function from the List::Util module
shuffle <> here <> reads all input lines into a list, which shuffle then randomizes
grep { !$seen{(split)[1]}++ } filters the shuffled lines, keeping only the first line seen for each value of the 2nd whitespace-separated field
With ruby
$ ruby -e 'puts readlines.shuffle.uniq {|s| s.split[1]}' input.txt
abc 1
bcd a
fgh 3
def 4
readlines reads all lines of the input file into an array
shuffle to randomize the elements
uniq to get unique elements
{|s| s.split[1]} based on 2nd field value, using whitespace as separator
puts to print the array elements
I want:
$ cat file
ABCDEFG, XXX
ABCDEFG, YYY
ABCDEFG, ZZZ
AAAAAAA, XZY
BBBBBBB, XYZ
CCCCCCC, YXZ
DDDDDDD, YZX
CDEFGHI, ZYX
CDEFGHI, XZY
$ cat file | magic
3 ABCDEFG, XXX
3 ABCDEFG, YYY
3 ABCDEFG, ZZZ
1 AAAAAAA, XZY
1 BBBBBBB, XYZ
1 CCCCCCC, YXZ
1 DDDDDDD, YZX
2 CDEFGHI, ZYX
2 CDEFGHI, XZY
So: a pre-sorted file goes in; identify repeats in the first column, count how many lines each first-column value has, and print that count in front of every one of those lines, keeping whatever is in column 2 (which can be anything and is not relevant to the count).
Two problems:
1) get the effect of uniq -c, but without deleting the duplicates.
My really "hacky" sed -e solution after searching online was this:
cat file | cut -d',' -f1 | uniq -c | sed -E -e 's/([0-9][0-9]*) (.*)/echo $(yes \1 \2 | head -\1)/;e' | sed -E 's/ ([0-9])/;\1/g' | tr ';' '\n'
I was surprised to see things like head -\1 working, but well great. However, I feel like there should be a much simpler solution to the problem.
2) The above gets rid of the second column. I could just run my code first, and then paste it against the second column of the original file, but the file is massive and I want things to be as speed efficient as possible.
Any suggestions?
One in awk. Pretty tired so not fully tested. I hope it works, good night:
$ awk -F, '
$1!=p {
    for(i=1;i<c;i++)
        print c-1,a[i]
    c=1
}
{
    a[c++]=$0
    p=$1
}
END {
    for(i=1;i<c;i++)
        print c-1,a[i]
}' file
Output:
3 ABCDEFG, XXX
3 ABCDEFG, YYY
3 ABCDEFG, ZZZ
1 AAAAAAA, XZY
1 BBBBBBB, XYZ
1 CCCCCCC, YXZ
1 DDDDDDD, YZX
2 CDEFGHI, ZYX
2 CDEFGHI, XZY
Here's one way using awk that passes the file twice (file{,} is brace expansion for file file, so the same file is read twice). On the first pass, use an associative array to store the counts of the first column. On the second pass, print the array value and the line itself:
awk -F, 'FNR==NR { a[$1]++; next } { print a[$1], $0 }' file{,}
Results:
3 ABCDEFG, XXX
3 ABCDEFG, YYY
3 ABCDEFG, ZZZ
1 AAAAAAA, XZY
1 BBBBBBB, XYZ
1 CCCCCCC, YXZ
1 DDDDDDD, YZX
2 CDEFGHI, ZYX
2 CDEFGHI, XZY
My text file would read as:
111
111
222
222
222
333
333
My resulting file would look like:
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Or the resulting file could alternatively look like the following:
1
2
1
2
3
1
2
I've specified a comma as the delimiter here, but it doesn't matter what the delimiter is; I can modify that at a future date. In reality, I don't even need the original text file contents, just the line numbers, because I can just paste the line numbers against the original text file.
I am just not sure how I can go through numbering the lines based on repeated entries.
All items in list are duplicated at least once. There are no single occurrences of a line in the file.
$ awk -v OFS=',' '{print ++cnt[$0], $0}' file
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Use a variable to save the previous line, and compare it to the current line. If they're the same, increment the counter, otherwise set it back to 1.
awk '{if ($0 == prev) counter++; else counter = 1; prev=$0; print counter}'
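Since this prints only the counter column, the counters can then be pasted back against the original data to get the first form of the desired output. A minimal sketch, assuming the input is in file:
awk '{if ($0 == prev) counter++; else counter = 1; prev=$0; print counter}' file | paste -d, - file
Here paste -d, - file reads the counters from stdin and joins them, comma-separated, with the corresponding lines of file, giving 1,111, 2,111, 1,222 and so on.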
Perl solution:
perl -lne 'print ++$c{$_}' file
-n reads the input line by line
-l strips the trailing newline from each input line and appends one to each print
++$c{$_} increments the value assigned to the contents of the current line $_ in the hash table %c.
Software tools method, given textfile as input:
uniq -c textfile | cut -d' ' -f7 | xargs -L 1 seq 1
Shell loop-based variant of the above:
uniq -c textfile | while read a b ; do seq 1 $a ; done
Output (of either method):
1
2
1
2
3
1
2
I have a data file that looks like the following:
123456, 1623326
123456, 2346525
123457, 2435466
123458, 2564252
123456, 2435145
The first column is the "ID" -- a string variable. The second column does not matter to me. I want to end up with
123456, 3
123457, 1
123458, 1
where the second column now counts how many entries there are in the original file that correspond with the unique "ID" in the first column.
Any solution in bash or perl would be fantastic. Even Stata would be good, but I figure this is harder to do in Stata. Please let me know if anything is unclear.
In Stata this is just
contract ID
cut -d',' -f1 in.txt | sort | uniq -c | awk '{print $2 ", " $1}'
gives:
123456, 3
123457, 1
123458, 1
This counts the number of lines with the same first six characters:
$ sort file | uniq -c -w6
3 123456, 1623326
1 123457, 2435466
1 123458, 2564252
Documentation
From man uniq:
-w, --check-chars=N
compare no more than N characters in lines
Split off the number in the first field and use it as a hash key, increasing its count each time
use warnings;
use strict;
my $file = 'data_cnt.txt';
open my $fh, '<', $file or die "Can't open $file: $!";
my %cnt;
while (<$fh>) {
    $cnt{(/^(\d+)/)[0]}++;
}
print "$_, $cnt{$_}\n" for keys %cnt;
The regex captures consecutive digits at the beginning of a line. As the match is returned as a list, we index into it to get the number, (/.../)[0], which is then used as a hash key. When a number is seen for the first time it is added to the hash as a key and ++ sets its value to 1. When a number that already exists as a key is seen, ++ increments its value. This is a typical frequency counter.
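As a quick illustration of the (/.../)[0] list-slice idiom, applied to one of the sample lines (the string is hard-coded here just for the demo):
$ perl -E 'say +("123456, 1623326" =~ /^(\d+)/)[0]'
123456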
With your numbers in file data_cnt.txt this outputs
123457, 1
123456, 3
123458, 1
The output can be sorted by hash values, if you need that
say "$_, $cnt{$_}" for sort { $cnt{$b} <=> $cnt{$a} } (keys %cnt);
Prints
123456, 3
123457, 1
123458, 1
This can fit into a one-liner, if preferred for some reason
perl -nE '
$cnt{(/^(\d+)/)[0]}++;
}{ say "$_, $cnt{$_}" for sort { $cnt{$b} <=> $cnt{$a} } keys %cnt
' data_cnt.txt
It should be entered as one line at a terminal. The }{ is short for the END { } block. The code is the same as in the short script above. The -E is the same as -e, but it also enables features such as say.
You can use awk:
awk 'BEGIN{FS=OFS=", "} counts[$1]++{} END{for (i in counts) print i, counts[i]}' file
123456, 3
123457, 1
123458, 1
FS=OFS=", " Sets input & output field separator as ", "
counts[$1]++{} increments a counter in the counts array, keyed by the first column, for every line; the trailing {} is an empty, do-nothing action block
In the END block we iterate through the counts array and print each unique ID and its count
cut, sort, uniq, sed version
cut -d',' -f1 | sort | uniq -c | sed 's/^ *\([^ ]*\) \(.*\)/\2, \1/'
or a simple Perl version, sorted by the first column
perl -F',' -anE'$s{$F[0]}++}{say"$_, $s{$_}"for sort keys%s'
or sorted by count descending and then by the first column
perl -F',' -anE'$s{$F[0]}++}{say"$_, $s{$_}"for sort{$s{$b}<=>$s{$a}or$a cmp$b}keys%s'
or in the order in which each key first appears
perl -F',' -anE'push@a,$F[0]if!$s{$F[0]}++}{say"$_, $s{$_}"for@a'
or just in pseudorandom order
perl -F',' -anE'$s{$F[0]}++}{say"$_, $s{$_}"for keys%s'
and so on.
Perl one-liner:
perl -naE '$h{$F[0]}++}{for(sort keys %h){say "$_ $h{$_}"}' file.txt
123456, 3
123457, 1
123458, 1
-n loops over each line in the file
-a splits each line on whitespace and populates the @F array with the fields
}{ denotes an END block, which allows us to iterate over the hash after all lines in the file have been processed
In Perl
$ perl -MData::Dump -ne "++@n{/(\d+)/}; END {dd \%n}" data.txt
{ 123456 => 3, 123457 => 1, 123458 => 1 }
With datamash (-W splits fields on whitespace, -s sorts the input first, -g1 groups by the first field, and count 1 counts the rows in each group):
datamash -W -s -g1 count 1 < data
Output:
123456, 3
123457, 1
123458, 1
I have a tab-separated CSV, too big to download and open locally.
I want to show any lines with data in the n-th column, that is, those lines that have something other than a tab immediately before the n-th tab of that line.
I'd post what I've tried so far, but my sed knowledge is only just enough to assume that it can be done with sed.
edit1:
sample
id   num  name  title
1    1    foo   foo
2    2          bar
3    3    baz   baz
If n=3 (name), then I want to output rows 1 and 3.
If n=4 (title), then I want to output all the lines.
edit 2:
I found this possible solution:
awk -F '","' 'BEGIN {OFS=","} { if (toupper($5) == "STRING 1") print }' file1.csv > file2.csv
source: https://unix.stackexchange.com/questions/97070/filter-a-csv-file-based-on-the-5th-column-values-of-a-file-and-print-those-reco
But trying
awk -F '"\t"' 'BEGIN {OFS="\t"} { if (toupper($72) != "") print }' data.csv > data-tmp.csv
did not work (result file empty), so I probably got the \t wrong? (copy&paste without understanding awk)
I'm not exactly sure I understand your desired behaviour. Is this it?
$ cat file
id   num  name  title
1    1    foo   foo
2    2          bar
3    3    baz   baz
$ awk -v n=3 -F$'\t' 'NR>1&&$n!=""' file
1    1    foo   foo
3    3    baz   baz
$ awk -v n=4 -F$'\t' 'NR>1&&$n!=""' file
1    1    foo   foo
2    2          bar
3    3    baz   baz
I'll assume you have enough space on the remote machine:
1) use cut to get the desired column N (the delimiter is tab by default):
cut -f N data.csv > tempfile
2) get the line numbers of the non-empty lines only:
grep -vn '^$' tempfile | sed 's/:.*//' > linesfile
3) use sed to extract those lines from the original file:
while read linenumber ; do
    sed -n "${linenumber}p" data.csv >> newdatafile
done < linesfile
Unfortunately the line number cannot be extracted by piping the cut output to grep, but I am pretty sure there are more elegant solutions.
I have a text file that looks like this:
abc
bcd
abc
efg
bcd
abc
And the expected output is this:
3 abc
2 bcd
1 efg
I know there is an existing solution for this:
sort -k2 < inFile |
awk '!z[$1]++{a[$1]=$0;} END {for (i in a) print z[i], a[i]}' |
sort -rn -k1 > outFile
The code sorts, removes duplicates, and sorts again, and prints the expected output.
However, is there a simpler way to express the !z[$1]++{a[$1]=$0;} part? More "basic", I mean.
More basic:
$ sort inFile | uniq -c
3 abc
2 bcd
1 efg
More basic awk
When one is used to awk's idioms, the expression !z[$1]++{a[$1]=$0;} is clear and concise. For those used to programming in other languages, other forms might be more familiar, such as:
awk '{if (z[$1]++ == 0) a[$1]=$0;} END {for (i in a) print z[i], a[i]}'
Or,
awk '{if (z[$1] == 0) a[$1]=$0; z[$1]+=1} END {for (i in a) print z[i], a[i]}'
If your input file contains billions of lines and you want to avoid sort, then you can just do:
awk '{a[$0]++} END{for(x in a) print a[x],x}' file.txt
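Note that for (x in a) visits the keys in an unspecified order; if you want the output ordered by count, as in the expected output above, one option is to pipe the result through sort, for example:
$ awk '{a[$0]++} END{for(x in a) print a[x],x}' file.txt | sort -rn
3 abc
2 bcd
1 efg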