Add line numbers for duplicate lines in a file - bash

My text file would read as:
111
111
222
222
222
333
333
My resulting file would look like:
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Or the resulting file could alternatively look like the following:
1
2
1
2
3
1
2
I've specified a comma as a delimiter here, but it doesn't matter what the delimiter is --- I can modify that at a future date. In reality, I don't even need the original text file contents, just the line numbers, because I can just paste the line numbers against the original text file.
I am just not sure how to go about numbering the lines based on repeated entries.
All items in the list are duplicated at least once. There are no single occurrences of a line in the file.

$ awk -v OFS=',' '{print ++cnt[$0], $0}' file
1,111
2,111
1,222
2,222
3,222
1,333
2,333

Use a variable to save the previous line, and compare it to the current line. If they're the same, increment the counter, otherwise set it back to 1.
awk '{if ($0 == prev) counter++; else counter = 1; prev=$0; print counter}'
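Since the question mentions pasting the numbers back against the original file, this counter-only output combines naturally with paste; a minimal sketch on the sample file:
$ awk '{if ($0 == prev) counter++; else counter = 1; prev=$0; print counter}' file | paste -d, - file
1,111
2,111
1,222
2,222
3,222
1,333
2,333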

Perl solution:
perl -lne 'print ++$c{$_}' file
-n reads the input line by line
-l strips the trailing newline from each input line and appends one to every print
++$c{$_} increments the count stored for the current line $_ in the hash %c and returns the new value, which print outputs.
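If you want the comma-delimited form instead, the same hash works; a sketch (the comma is just the delimiter from the question):
perl -lne 'print ++$c{$_} . ",$_"' file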

Software tools method, given textfile as input:
uniq -c textfile | cut -d' ' -f7 | xargs -L 1 seq 1
Shell loop-based variant of the above:
uniq -c textfile | while read a b ; do seq 1 $a ; done
Output (of either method):
1
2
1
2
3
1
2
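Note that the cut -d' ' -f7 in the first pipeline relies on uniq -c right-aligning single-digit counts with exactly six leading spaces; a variant that avoids that assumption (a sketch) lets awk do the field splitting:
uniq -c textfile | awk '{print $1}' | xargs -L 1 seq 1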

Related

Using bash comm command on columns but returning the entire line

I have two files, each with two columns and sorted only by the second column, such as:
File 1:
176 AAATC
6 CCGTG
80 TTTCG
File 2:
20 AAATC
77 CTTTT
50 TTTTT
I would like to use comm command using options -13 and -23 to get two different files reporting the different lines between the two files with the corresponding count number, but only comparing the second columns (i.e. the strings). What I tried so far was something like:
comm -23 <(cut -d$'\t' -f2 file1.txt) <(cut -d$'\t' -f2 file2.txt)
But I could only have the strings in output, without the numbers:
CCGTG
TTTCG
While what I want would be:
6 CCGTG
80 TTTCG
Any suggestion?
Thanks!
You can use join instead of comm:
join -1 2 -2 2 File1 File2 -a 1 -o 1.1,1.2,2.2
It will output the matching lines, too, but you can remove them with
| grep -v '[ACTG] [ACTG]'
Explanation:
-1 2 use the second column in file 1 for joining;
-2 2 similarly, use the second column in file 2;
-a 1 show also non-matching lines from file 1 - these are the ones you want in the end;
-o specifies the output format, here we want columns 1 and 2 from file 1 and column 2 from file 2 (this is just arbitrary, you can use column 1 as well, but the second command would be different: | grep -v '[ACTG] [0-9]').
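For intuition, here is what the two steps produce on the sample files (assuming blank-separated input sorted on the second column, as shown, and join's default behavior of printing missing fields as empty):
$ join -1 2 -2 2 File1 File2 -a 1 -o 1.1,1.2,2.2
176 AAATC AAATC
6 CCGTG
80 TTTCG
$ join -1 2 -2 2 File1 File2 -a 1 -o 1.1,1.2,2.2 | grep -v '[ACTG] [ACTG]'
6 CCGTG
80 TTTCG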
comm is not the right tool for this job, and while join will work, you would also need to run join twice and then further filter the results with some other command (e.g., grep).
One awk idea that requires a single pass through each input file:
awk 'BEGIN {FS=OFS="\t"}
FNR==NR { f1[$2]=$1; next } # save 1st file entries
$2 in f1 { delete f1[$2]; next } # 2nd file: if $2 in f1[] then delete f1[] entry and skip this line else ..
{ f2[$2]=$1 } # save 2nd file entries
END { # at this point:
# f1[] contains rows where field #2 only exists in the 1st file
# f2[] contains rows where field #2 only exists in the 2nd file
PROCINFO["sorted_in"]="#ind_str_asc"
for (i in f1) print f1[i],i > "file-23"
for (i in f2) print f2[i],i > "file-13"
}
' file1 file2
NOTE: the PROCINFO["sorted_in"] line requires GNU awk; without this line we cannot guarantee the order of writes to the final output files, and OP would then need to add more (awk) code to maintain the ordering or use another OS-level utility (eg, sort) to sort the final files
This generates:
$ cat file-23
6 CCGTG
80 TTTCG
$ cat file-13
77 CTTTT
50 TTTTT
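If GNU awk isn't available, a portable alternative (a sketch, assuming the tab-separated two-column output shown above) is to drop the PROCINFO line and sort the result files afterwards:
sort -t$'\t' -k2,2 file-23 -o file-23
sort -t$'\t' -k2,2 file-13 -o file-13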

piping commands of awk and sed is too slow! any ideas on how to make it work faster?

I am trying to convert a file containing a column with scaffold numbers and another one with corresponding individual sites into a bed file which lists sites in ranges. For example, this file ($indiv.txt):
SCAFF SITE
1 1
1 2
1 3
1 4
1 5
3 1
3 2
3 34
3 35
3 36
should be converted into $indiv.bed:
SCAFF SITE-START SITE-END
1 1 5
3 1 2
3 34 36
Currently, I am using the following code but it is super slow so I wanted to ask if anybody could come up with a quicker way??
COMMAND:
for scaff in $(awk '{print $1}' $indiv.txt | uniq)
do
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt | awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' | sed "s/^/$scaff\t/" >> $indiv.bed
done
DESCRIPTION:
awk '{print $1}' $indiv.txt | uniq #outputs a list with the unique scaffold numbers
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt #extracts the values from column 2 if the value in the first column equals the variable $scaff
awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' #converts the list of sequential numbers into ranges as described here: https://stackoverflow.com/questions/26809668/collapse-sequential-numbers-to-ranges-in-bash
sed "s/^/$scaff\t/" >> $indiv.bed #adds a column with the respective scaffold number and then outputs the file into $indiv.bed
Thanks a lot in advance!
Calling several programs for each line of the input must be slow. It's usually better to find a way to process all the lines in one call.
I'd reach for Perl:
tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| perl -ane 'END {print " $F[1]"}
next if $p[0] == $F[0] && $F[1] == $p[1] + 1;
print " $p[1]\n@F";
} continue { @p = @F;' > indiv.bed
The first two lines sort the input so that the groups are always adjacent (might be unnecessary if your input is already sorted that way); Perl then reads the lines, -a splits each line into the @F array, and the @p array keeps the previous line: if the current line has the same first element and its second element is greater by 1, we go to the continue section, which just stores the current line into @p. Otherwise, we print the last element of the previous section followed by the current line. The END block is responsible for printing the last element of the last section.
The output is different from yours for sections that have only a single member.
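For comparison, the same single-pass idea can be written in awk; a sketch, assuming the input is grouped and sorted as in the example and carries the one-line header (add a BEGIN {OFS="\t"} block if tab-separated output is needed):
awk 'NR==1 {next}                                 # skip the SCAFF SITE header
     $1!=s || $2!=p+1 { if (NR>2) print s, f, p   # close the previous range
                        s=$1; f=$2 }              # open a new one
     { p=$2 }
     END { print s, f, p }' indiv.txt > indiv.bed
On the sample this prints 1 1 5, 3 1 2 and 3 34 36, matching the expected ranges (minus the header line).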

Get lengths of zeroes (interrupted by ones)

I have a long column of ones and zeroes:
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
1
....
I can easily get the average number of zeroes between ones (just total/ones):
ones=$(grep -c 1 file.txt)
lines=$(wc -l < file.txt)
echo "$lines / $ones" | bc -l
But how can I get the length of strings of zeroes between the ones? In the short example above it would be:
3
5
5
2
I'd include uniq for a more easily read approach:
uniq -c file.txt | awk '/ 0$/ {print $1}'
Edit: fixed for the case where the last line is a 0
Easy in awk:
awk '/1/{print NR-prev-1; prev=NR;}END{if (NR>prev)print NR-prev;}'
Not so difficult in bash, either:
i=0
for x in $(<file.txt); do
if ((x)); then echo $i; i=0; else ((++i)); fi
done
((i)) && echo $i
Using awk, I would use the fact that a field with the value 0 evaluates as False:
awk '!$1{s++; next} {if (s) print s; s=0} END {if (s) print s}' file
This returns:
3
5
5
2
Also, note the END block to print any "remaining" zeroes appearing after the last 1.
Explanation
!$1{s++; next} if the field is not True, that is, if the field is 0, increment the counter. Then, skip to the next line.
{if (s) print s; s=0} otherwise, print the value of the counter and reset it, but just if it contains some value (to avoid printing 0 if the file starts with a 1).
END {if (s) print s} print the remaining value of the counter after processing the file, but just if it wasn't printed before.
If your file.txt is just a column of ones and zeros, you can use awk and change the record separator to "1\n". This makes each "record" a sequence of "0\n", and the count of 0's in the record is the length of the record divided by 2. Counts will be correct for leading and trailing ones and zeros.
awk 'BEGIN {RS="1\n"} { print length/2 }' file.txt
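One caveat worth checking: POSIX awk only honors the first character of RS, so the two-character separator relies on an implementation with regex RS support, such as GNU awk:
gawk 'BEGIN {RS="1\n"} { print length/2 }' file.txt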
This seems to be a pretty popular question today. Joining the party late, here is another short gnu-awk command to do the job:
awk -F '\n' -v RS='(1\n)+' 'NF{print NF-1}' file
3
5
5
2
How it works:
-F '\n' # set input field separator as \n (newline)
-v RS='(1\n)+' # set input record separator to one or more repetitions of "1" followed by a newline
NF # execute the block only if at least one field is found
print NF-1 # print the number of fields minus 1 to get the count of 0s
Pure bash:
sum=0
while read n ; do
if ((n)) ; then
echo $sum
sum=0
else
((++sum))
fi
done < file.txt
((sum)) && echo $sum # Don't forget to output the last number if the file ended in 0.
Another way:
perl -lnE 'if(m/1/){say $.-1;$.=0}' < file
"reset" the line counter when 1.
prints
3
5
5
2
You can use awk:
awk '$1=="0"{s++} $1=="1"{if(s)print s;s=0} END{if(s)print(s)}'
Explanation:
The special variable $1 contains the value of the first field (column) of a line of text. Unless you specify the field delimiter using the -F command line option it defaults to whitespace - meaning $1 will contain 0 or 1 in your example.
If the value of $1 equals 0 a variable called s will get incremented but if $1 is equal to 1 the current value of s gets printed (if greater than zero) and re-initialized to 0. (Note that awk initializes s with 0 before the first increment operation)
The END block gets executed after the last line of input has been processed. If the file ends with 0(s), the number of 0s between the file's end and the last 1 will get printed. (Without the END block they wouldn't be printed.)
Output:
3
5
5
2
if you can use perl:
perl -lne 'BEGIN{$counter=0;} if ($_ == 1){ print $counter; $counter=0; next} $counter++' file
3
5
5
2
It actually looks better in awk, with the same logic:
awk '$1{print c; c=0} !$1{c++}' file
3
5
5
2
My attempt. Not so pretty but.. :3
grep -n 1 test.txt | gawk '{y=$1-x; print y-1; x=$1}' FS=":"
Out:
3
5
5
2
A funny one, in pure Bash:
while read -d 1 -a u || ((${#u[@]})); do
echo "${#u[@]}"
done < file
This tells read to use 1 as a delimiter, i.e., to stop reading as soon as a 1 is encountered; read stores the 0's in the fields of the array u. Then we only need to count the number of fields in u with ${#u[@]}. The || ((${#u[@]})) is here just in case your file doesn't end with a 1.
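A quick sanity check of the mechanism, feeding a short sequence on stdin:
$ printf '0\n0\n0\n1\n0\n0\n1\n' | while read -d 1 -a u || ((${#u[@]})); do echo "${#u[@]}"; done
3
2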
More strange (and not fully correct) way:
perl -0x31 -laE 'say @F+0' <file
prints
3
5
5
2
0
It
reads the file with the record separator set to the character 1 (that is the -0x31)
with autosplit -a (which splits each record into the array @F)
and prints the number of elements in @F, e.g. say @F+0 (or equivalently say scalar @F).
Unfortunately, after the final 1 (as record separator) it reads an empty record - and therefore prints the trailing 0.
It is an incorrect solution, shown only as an alternative curiosity.
Expanding erickson's excellent answer, you can say:
$ uniq -c file | awk '!$2 {print $1}'
3
5
5
2
From man uniq we see that the purpose of uniq is to:
Filter adjacent matching lines from INPUT (or standard input), writing
to OUTPUT (or standard output).
So uniq groups the numbers. Using the -c option we get a prefix with the number of occurrences:
$ uniq -c file
3 0
1 1
5 0
1 1
5 0
1 1
2 0
1 1
Then it is a matter of printing the counters that appear next to a 0. For this we can use awk like: awk '!$2 {print $1}'. That is: print the first field when the second field is 0.
Another approach uses sed together with awk, like this:
sed -n '$bp;/0/{:r;N;/0$/{h;br}};/1/{x;bp};:p;/.\+/{s/\n//g;p}' input.txt \
| awk '{print length}'
Explanation:
The sed command separates the 0s and creates output like this:
000
00000
00000
00
Piping that into awk '{print length}' gives the count of 0s for each interval:
Output:
3
5
5
2

Counting equal lines in two files

Say, I have two files and want to find out how many equal lines they have. For example, file1 is
1
3
2
4
5
0
10
and file2 contains
3
10
5
64
15
In this case the answer should be 3 (common lines are '3', '10' and '5').
This, of course, is done quite simply with python, for example, but I got curious about doing it from bash (with some standard utils or extra things like awk or whatever). This is what I came up with:
cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
It does seem too complicated for the task, so I'm wondering is there a simpler or more elegant way to achieve the same result.
P.S. Outputting the percentage of common part to the number of lines in each file would also be nice, though is not necessary.
UPD: Files do not have duplicate lines
To find the lines common to your 2 files, using awk:
awk 'a[$0]++' file1 file2
Will output 3, 10 and 5.
Now, just pipe this to wc to get the number of common lines:
awk 'a[$0]++' file1 file2 | wc -l
Will output 3.
Explanation:
Here, a works like a dictionary with default value of 0. When you write a[$0]++, you will add 1 to a[$0], but this instruction returns the previous value of a[$0] (see difference between a++ and ++a). So you will have 0 ( = false) the first time you encounter a certain string and 1 ( or more, still = true) the next times.
By default, awk 'condition' file is a syntax for outputting all the lines where condition is true.
Be also aware that the a[] array will expand every time you encounter a new key. At the end of your script, the size of the array will be the number of unique values you have throughout all your input files (in OP's example, it would be 9).
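The a++ vs ++a distinction is easy to verify directly (a minimal check):
$ awk 'BEGIN { print a++, a; print ++b, b }'
0 1
1 1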
Note: this solution counts duplicates, i.e if you have:
file1 | file2
1 | 3
2 | 3
3 | 3
awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3
If this is a behaviour you don't want, you can use the following code to filter out duplicates :
awk '++a[$0] == 2' file1 file2 | wc -l
With your input example, this works too, but if the files are huge, I prefer the awk solutions by others:
grep -cFwf file2 file1
with your input files, the above line outputs
3
Here's one without awk that instead uses comm:
comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l
comm compares two sorted files. The options -1 and -2 (combined here as -12) suppress the lines unique to file1 and file2 respectively, so only the lines common to both remain.
The output is the lines they have in common, on separate lines. wc -l counts the number of lines.
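For intuition, running comm without suppressing any columns on the sorted sample files shows the three columns (lines unique to file1, lines unique to file2, common lines; indentation marks the columns):
$ comm <(sort file1.txt) <(sort file2.txt)
0
1
        10
    15
2
        3
4
        5
    64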
Output without wc -l:
10
3
5
And when counting (obviously):
3
You can also use the comm command. Remember that you will have to sort the files that you need to compare first:
[gc@slave ~]$ sort a > sorted_1
[gc@slave ~]$ sort b > sorted_2
[gc@slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5
From man pages for comm command:
comm - compare two sorted files line by line
Options:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do it all with awk:
awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2
To get the percentage, something like this works:
awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2
and outputs
file1 0.428571
file2 0.6
In your solution, you can get rid of one cat:
sort file1 file2| uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
How about keeping it nice and simple...
This is all that's needed:
cat file1 file2 | sort -n | uniq -d | wc -l
3
man sort:
-n, --numeric-sort -- compare according to string numerical value
man uniq:
-d, --repeated -- only print duplicate lines
man wc:
-l, --lines -- print the newline counts
Hope this helps.
EDIT - one fewer process (credit martin):
sort file1 file2 | uniq -d | wc -l
One way using awk:
awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2
Output:
3
The first answer by Aserre using awk is good but may have the undesirable effect of counting duplicates - even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.
I believe this edit will return only the unique lines that exist in BOTH files.
awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2
If duplicates are desired, but only if they exist in both files, I believe this next version will work, but it will only report duplicates in the second file that exist in the first file. (If duplicates exist in the first file, only those that also exist in file2 will be reported, so file order matters.)
awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!
UPDATE 1: the new version ensures intra-file duplicates are excluded from the count, so only cross-file duplicates show up in the final stats:
mawk '
BEGIN { _*= FS = "^$"
} FNR == NF { split("",___)
} ___[$_]++<NF { __[$_]++
} END { split("",___)
for (_ in __) {
___[__[_]]++ } printf(RS)
for (_ in ___) {
printf(" %\04715.f %s\n",_,___[_]) }
printf(RS) }' \
<( jot - 1 999 3 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 2 1024 7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )
3 3
2 67
1 413
===========================================
this is probably waaay overkill, but i wrote something similar to this to supplement uniq -c :
measuring the frequency of frequencies
it's like uniq -c | uniq -c without wasting time sorting. The summation and % parts are trivial from here, with 47 overlapping lines in this example. It avoids spending any time on per-row processing, since the current setup only shows the summarized stats.
If you need the actual duplicated rows, they're also available right there, serving as the hash key for the 1st array.
gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) |
mawk '
BEGIN { _*= FS = "^$"
} { __[$_]++
} END { printf(RS)
for (_ in __) { ___[__[_]]++ }
for (_ in ___) {
printf(" %\04715.f %s\n",
_,___[_]) } printf(RS) }'
2 47
1 386
add another file, and the results reflect the changes (I added <( jot - 5 1295 5 ) ):
3 9
2 115
1 482

Unix - randomly select lines based on column values

I have a file with ~1000 lines that looks like this:
ABC C5A 1
CFD D5G 4
E1E FDF 3
CFF VBV 1
FGH F4R 2
K8K F9F 3
... etc
I would like to select 100 random lines, but with 10 of each third column value (so random 10 lines from all lines with value "1" in column 3, random 10 lines from all lines with value "2" in column 3, etc).
Is this possible using bash?
First grep all the lines ending with a certain number, shuffle them, and pick the first 10 using shuf -n 10.
for i in {1..10}; do
grep " ${i}$" file | shuf -n 10
done > randomFile
If you don't have shuf, use sort -R to randomly sort them instead:
for i in {1..10}; do
grep " ${i}$" file | sort -R | head -10
done > randomFile
If you can use awk, you can do the same with a one-liner
sort -R file | awk '{if (count[$3] < 10) {count[$3]++; print $0}}'
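If shuf is available, the same idea reads even more tersely; the condition is true only for the first 10 shuffled lines of each column-3 value:
shuf file | awk 'count[$3]++ < 10'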
