Bash - Compare rows then print just original rows

I've got files which look like this, (there can be more columns or rows):
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
dif-2-3-4-5.com 1 1 1
And I want to compare these numbers:
1 1 1
1 1 2
1 2 1
2 1 1
1 1 1
And print only those rows which do not repeat, so I get this:
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1

Another simple approach is to combine sort and uniq, using a KEYDEF for fields 2-4 with sort and skipping field 1 with uniq, e.g.
$ sort file.txt -k 2,4 | uniq -f1
Example Use/Output
$ sort file.txt -k 2,4 | uniq -f1
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1

Keep a running record of the triples already seen and only print the first time they appear:
$ awk '!(($2,$3,$4) in seen) {print; seen[$2,$3,$4]}' file
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1

Try the following awk code too:
awk '!a[$2,$3,$4]++' Input_file
Explanation:
Create an array named a with $2,$3,$4 as its index. The condition !a[$2,$3,$4]++ is true only when that combination of $2,$3,$4 has NOT been seen in array a before, and it does two things:
It increments the value stored at that index, so the condition will NOT be true the next time the same $2,$3,$4 combination appears.
No action is specified (awk works in condition-then-action mode), so the default action is to print the current line. This goes on for all the lines in Input_file; the last line is not printed because its $2,$3,$4 are already present in array a.
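For illustration, here is the same idiom written out long-hand (a sketch equivalent to the one-liner above, not a different method):
awk '{
    key = $2 SUBSEP $3 SUBSEP $4     # the same composite index as a[$2,$3,$4]
    if (!(key in a)) print $0        # print only the first time this triple appears
    a[key]++                         # remember the triple for later lines
}' Input_file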
I hope this helps.

This works with POSIX and gnu awk:
$ awk '{s=""
for (i=2;i<=NF; i++)
s=s $i "|"}
s in seen { next }
++seen[s]' file
Which can be shortened to:
$ awk '{s=""; for (i=2;i<=NF; i++) s=s $i "|"} !seen[s]++' file
Also supports a variable number of columns.
If you want a sort uniq solution that also respects file order (i.e. the first of the set of duplicates is printed, not the later ones) you need to do a decorate, sort, undecorate approach.
You can:
use cat -n to decorate the file with line numbers;
sort -k3 -k1n to sort first on all the fields from field 3 through the end of the line, and then numerically on the added line number;
add -u if your version of sort supports that, or use uniq -f2, to keep only the first line in each group of duplicates;
finally use sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//' to remove the added line numbers:
cat -n file | sort -k3 -k1n | uniq -f2 | sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//'
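If you also want the surviving lines back in their original file order, one variant (a sketch along the same lines) is to re-sort on the added line number before stripping it:
cat -n file | sort -k3 -k1n | uniq -f2 | sort -k1n | sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//'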
Awk is easier and faster in this case.

Related

piping commands of awk and sed is too slow! any ideas on how to make it work faster?

I am trying to convert a file containing a column with scaffold numbers and another one with corresponding individual sites into a bed file which lists sites in ranges. For example, this file ($indiv.txt):
SCAFF SITE
1 1
1 2
1 3
1 4
1 5
3 1
3 2
3 34
3 35
3 36
should be converted into $indiv.bed:
SCAFF SITE-START SITE-END
1 1 5
3 1 2
3 34 36
Currently, I am using the following code but it is super slow so I wanted to ask if anybody could come up with a quicker way??
COMMAND:
for scaff in $(awk '{print $1}' $indiv.txt | uniq)
do
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt | awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' | sed "s/^/$scaff\t/" >> $indiv.bed
done
DESCRIPTION:
awk '{print $1}' $indiv.txt | uniq #outputs a list with the unique scaffold numbers
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt #extracts the values from column 2 if the value in the first column equals the variable $scaff
awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' #converts the list of sequential numbers into ranges as described here: https://stackoverflow.com/questions/26809668/collapse-sequential-numbers-to-ranges-in-bash
sed "s/^/$scaff\t/" >> $indiv.bed #adds a column with the respective scaffold number and then outputs the file into $indiv.bed
Thanks a lot in advance!
Calling several programs for every group of the input is bound to be slow. It's usually better to find a way to process all the lines in one pass.
I'd reach for Perl:
tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| perl -ane 'END {print " $F[1]"}
next if $p[0] == $F[0] && $F[1] == $p[1] + 1;
print " $p[1]\n#F";
} continue { #p = #F;' > indiv.bed
The first two lines sort the input so that the groups are always adjacent (this might be unnecessary if your input is already sorted that way); Perl then reads the lines: -a splits each line into the @F array, and the @p array is used to keep the previous line. If the current line has the same first element and its second element is greater by 1, we go to the continue section, which just stores the current line into @p. Otherwise, we print the last element of the previous section and the first line of the current one. The END block is responsible for printing the last element of the last section.
The output is different from yours for sections that have only a single member.
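For comparison, the same single-pass idea can be sketched in awk (my sketch, not part of the original answer; it assumes the same sorted, header-stripped input and, like the Perl version, does not print the SCAFF SITE-START SITE-END header):
tail -n+2 indiv.txt | sort -u -nk1,1 -nk2,2 | awk '
NR == 1 || $1 != scaff || $2 != last + 1 {   # new scaffold or a gap in sites: close the previous range
    if (NR > 1) print scaff, first, last
    scaff = $1; first = $2
}
{ last = $2 }                                # remember the most recent site
END { if (NR) print scaff, first, last }     # close the final range
' > indiv.bed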

all pairs of consecutive lines sharing a field, using awk

I would like to process a multi-line, multi-field input file so that I get a file with all pairs of consecutive lines ONLY IF they have the same value as field #1.
That is, for each line, the output would contain the line itself plus the next line, and would omit combinations of lines with different values in field #1.
It's better explained with an example.
Given this input:
1 this
1 that
1 nye
2 more
2 sit
I want to produce something like:
1 this 1 that
1 that 1 nye
2 more 2 sit
So far I've got this:
awk 'NR % 2 == 1 { i=$0 ; next } { print i,$0 } END { if ( NR % 2 == 1 ) { print i } }' input.txt
My output:
1 this 1 that
1 nye 2 more
2 sit
As you can see, my code is blind to field #1 value, and also (and more importantly) it omits "intermediate" results like 1 that 1 nye (once it's done with a line, it jumps to the next pair of lines).
Any ideas? My preferred language is awk/gawk, but if it can be done using unix bash it's ok as well.
Thanks in advance!
You can use this awk:
awk 'NR>1 && ($1 in a){print a[$1], $0} {a[$1]=$0}' file
1 this 1 that
1 that 1 nye
2 more 2 sit
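For readability, here is the same one-liner expanded with comments (just a restatement of what it does, not a different approach):
awk '
NR > 1 && ($1 in a) { print a[$1], $0 }   # a previous line stored for this key: print the pair
{ a[$1] = $0 }                            # always remember the current line under its key
' file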
You can do it with simple commands. Assuming your input file is "test.txt" with content:
1 this
1 that
1 nye
2 more
2 sit
the following commands give the requested output:
sort -n test.txt > tmp1
(echo; cat tmp1) | paste tmp1 - | egrep '^([0-9])+ *[^ ]* *\1'
Just for fun
paste -d" " filename <(sed 1d filename) | awk '$1==$3'

How to sort elements alphabetically in a list that have same numeric values

First of all, I mustn't use awk or sed!
I counted the number of times a command was used in bash history. I've used the following code:
cat ${!#} | cut -f 1 -d " " | sort | uniq -c | sort -n | sort -r
Note that cat ${!#} is history.txt by default and I used cut to get rid of the numbered lines. Also I had to use sort -r at the end, otherwise I'd get the output ascending rather than descending (why is that?).
Ok my main question is sorting the commands with the same number of repetitions. Currently I'm getting this output:
seq 3
ps 2
echo 2
cut 2
uname 1
nl 1
cmp 1
But I need to get the following output:
seq 3
cut 2
echo 2
ps 2
cmp 1
diff 1
nl 1
uname 1
So yeah. Sort the commands by number of repetitions and if any of them are the same, sort them alphabetically.
I tried to tackle this with another sort, but haven't had much success thus far.
Also is it just a coincidence that commands in my output are reversed from the desired output? Or are they sorted descending by default?
Try this with GNU sort:
sort -k2nr -k1,1 file
Output:
seq 3
cut 2
echo 2
ps 2
cmp 1
nl 1
uname 1
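To fold that into the original pipeline without awk or sed (a sketch, assuming history.txt has the same layout as in the question, so the command name ends up in the first cut field):
cut -f 1 -d " " history.txt | sort | uniq -c | sort -k1,1nr -k2,2 |
while read count cmd; do
    echo "$cmd $count"    # put the command first, as in the desired output
done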

Get lengths of zeroes (interrupted by ones)

I have a long column of ones and zeroes:
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
1
....
I can easily get the average number of zeroes between ones (just total/ones):
ones=$(grep -c 1 file.txt)
lines=$(wc -l < file.txt)
echo "$lines / $ones" | bc -l
But how can I get the length of strings of zeroes between the ones? In the short example above it would be:
3
5
5
2
I'd include uniq for a more easily read approach:
uniq -c file.txt | awk '/ 0$/ {print $1}'
Edit: fixed for the case where the last line is a 0
Easy in awk:
awk '/1/{print NR-prev-1; prev=NR;}END{if (NR>prev)print NR-prev;}'
Not so difficult in bash, either:
i=0
for x in $(<file.txt); do
if ((x)); then echo $i; i=0; else ((++i)); fi
done
((i)) && echo $i
Using awk, I would use the fact that a field with the value 0 evaluates as False:
awk '!$1{s++; next} {if (s) print s; s=0} END {if (s) print s}' file
This returns:
3
5
5
2
Also, note the END block to print any "remaining" zeroes appearing after the last 1.
Explanation
!$1{s++; next} if the field is not True, that is, if the field is 0, increment the counter. Then, skip to the next line.
{if (s) print s; s=0} otherwise, print the value of the counter and reset it, but just if it contains some value (to avoid printing 0 if the file starts with a 1).
END {if (s) print s} print the remaining value of the counter after processing the file, but just if it wasn't printed before.
If your file.txt is just a column of ones and zeros, you can use awk and change the record separator to "1\n". This makes each "record" a sequence of "0\n", and the count of 0's in the record is the length of the record divided by 2. Counts will be correct for leading and trailing ones and zeros. (Note that a multi-character RS is an extension supported by gawk; POSIX leaves its behaviour unspecified.)
awk 'BEGIN {RS="1\n"} { print length/2 }' file.txt
This seems to be a pretty popular question today. Joining the party late, here is another short gnu-awk command to do the job:
awk -F '\n' -v RS='(1\n)+' 'NF{print NF-1}' file
3
5
5
2
How it works:
-F '\n' # set input field separator as \n (newline)
-v RS='(1\n)+' # set the input record separator to one or more repetitions of 1 followed by a newline
NF # execute the block if at least one field is found
print NF-1 # print num of field -1 to get count of 0
Pure bash:
sum=0
while read n ; do
if ((n)) ; then
echo $sum
sum=0
else
((++sum))
fi
done < file.txt
((sum)) && echo $sum # Don't forget to output the last number if the file ended in 0.
Another way:
perl -lnE 'if(m/1/){say $.-1;$.=0}' < file
"reset" the line counter when 1.
prints
3
5
5
2
You can use awk:
awk '$1=="0"{s++} $1=="1"{if(s)print s;s=0} END{if(s)print(s)}'
Explanation:
The special variable $1 contains the value of the first field (column) of a line of text. Unless you specify the field delimiter using the -F command line option, it defaults to whitespace, meaning $1 will contain 0 or 1 in your example.
If the value of $1 equals 0 a variable called s will get incremented but if $1 is equal to 1 the current value of s gets printed (if greater than zero) and re-initialized to 0. (Note that awk initializes s with 0 before the first increment operation)
The END block gets executed after the last line of input has been processed. If the file ends with 0(s), the number of 0s between the file's end and the last 1 will get printed. (Without the END block they wouldn't be printed.)
Output:
3
5
5
2
If you can use perl:
perl -lne 'BEGIN{$counter=0;} if ($_ == 1){ print $counter; $counter=0; next} $counter++' file
3
5
5
2
The same logic actually looks better in awk:
awk '$1{print c; c=0} !$1{c++}' file
3
5
5
2
My attempt. Not so pretty but.. :3
grep -n 1 test.txt | gawk '{y=$1-x; print y-1; x=$1}' FS=":"
Out:
3
5
5
2
A funny one, in pure Bash:
while read -d 1 -a u || ((${#u[@]})); do
echo "${#u[@]}"
done < file
This tells read to use 1 as a delimiter, i.e., to stop reading as soon as a 1 is encountered; read stores the 0's in the fields of the array u. Then we only need to count the number of fields in u with ${#u[@]}. The || ((${#u[@]})) is here just in case your file doesn't end with a 1.
A stranger (and not fully correct) way:
perl -0x31 -laE 'say @F+0' <file
prints
3
5
5
2
0
It:
reads the file with the record separator set to the character 1 (the -0x31)
with autosplit -a (splits each record into the array @F)
and prints the number of elements in @F, e.g. say @F+0, or equivalently say scalar @F
Unfortunately, after the final 1 (as record separator) it reads one more empty record - therefore it prints the trailing 0.
It is incorrect solution, showing it only as alternative curiosity.
Expanding erickson's excellent answer, you can say:
$ uniq -c file | awk '!$2 {print $1}'
3
5
5
2
From man uniq we see that the purpose of uniq is to:
Filter adjacent matching lines from INPUT (or standard input), writing
to OUTPUT (or standard output).
So uniq groups the numbers. Using the -c option we get a prefix with the number of occurrences:
$ uniq -c file
3 0
1 1
5 0
1 1
5 0
1 1
2 0
1 1
Then it is a matter of printing the counters that appear before a 0. For this we can use awk like: awk '!$2 {print $1}'. That is: print the first field (the count) when the second field is 0.
The simplest solution would be to use sed together with awk, like this:
sed -n '$bp;/0/{:r;N;/0$/{h;br}};/1/{x;bp};:p;/.\+/{s/\n//g;p}' input.txt \
| awk '{print length}'
Explanation:
The sed command separates the 0s and creates output like this:
000
00000
00000
00
Piped into awk '{print length}', this gives the count of 0s for each interval:
Output:
3
5
5
2

generating frequency table from file

Given an input file containing one single number per line, how could I get a count of how many times an item occurred in that file?
cat input.txt
1
2
1
3
1
0
desired output (=>[1,3,1,1]):
cat output.txt
0 1
1 3
2 1
3 1
It would be great if the solution could also be extended to floating-point numbers.
You mean you want a count of how many times an item appears in the input file? First sort it (using -n if the input is always numbers as in your example) then count the unique results.
sort -n input.txt | uniq -c
Another option:
awk '{n[$1]++} END {for (i in n) print i,n[i]}' input.txt | sort -n > output.txt
At least some of that can be done with
sort output.txt | uniq -c
But the count comes before the number, which is the reverse of the requested order. This will fix that problem:
sort test.dat | uniq -c | awk '{print $2, $1}'
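Applied to the question's input.txt, that pipeline should give the requested value/count layout (a quick check of mine, not part of the original answer):
$ sort input.txt | uniq -c | awk '{print $2, $1}'
0 1
1 3
2 1
3 1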
Using maphimbu from the Debian stda package:
# use 'jot' to generate 100 random numbers between 1 and 5
# and 'maphimbu' to print sorted "histogram":
jot -r 100 1 5 | maphimbu -s 1
Output:
1 20
2 21
3 20
4 21
5 18
maphimbu also works with floating point:
jot -r 100.0 10 15 | numprocess /%10/ | maphimbu -s 1
Output:
1 21
1.1 17
1.2 14
1.3 18
1.4 11
1.5 19
In addition to the other answers, you can use Perl:
perl -lne '$h{$_}++; END{for $n (sort {$a <=> $b} keys %h) {print "$n\t$h{$n}"}}' input.txt
Loop over each line with -n
Each $_ number increments hash %h
Once the END of input.txt has been reached,
sort the keys numerically with sort {$a <=> $b}
Print the number $n and the frequency $h{$n}
Similar code which works on floating point:
perl -lne '$h{int($_)}++; END{for $n (sort {$a <=> $b} keys %h) {print "$n\t$h{$n}"}}' float.txt
float.txt
1.732
2.236
1.442
3.162
1.260
0.707
output:
0 1
1 3
2 1
3 1
I had a similar problem as described, but across gigabytes of gzip'd log files. Because many of these solutions necessitated waiting until all the data was parsed, I opted to write rare to quickly parse and aggregate data based on a regexp.
In the case above, it's as simple as passing in the data to the histogram function:
rare histo input.txt
# OR
cat input.txt | rare histo
# Outputs:
1 3
0 1
2 1
3 1
But it can also handle more complex cases via regex/expressions, such as:
rare histo --match "(\d+)" --extract "{1}" input.txt
