all pairs of consecutive lines sharing a field, using awk - bash

I would like to process a multi-line, multi-field input file so that I get a file with all pairs of consecutive lines ONLY IF they have the same value as field #1.
That is, for each line, the output would contain the line itself + the next line, and would omit combinations of lines with different values at field #1.
It's better explained with an example.
Given this input:
1 this
1 that
1 nye
2 more
2 sit
I want to produce something like:
1 this 1 that
1 that 1 nye
2 more 2 sit
So far I've got this:
awk 'NR % 2 == 1 { i=$0 ; next } { print i,$0 } END { if ( NR % 2 == 1 ) { print i } }' input.txt
My output:
1 this 1 that
1 nye 2 more
2 sit
As you can see, my code is blind to field #1 value, and also (and more importantly) it omits "intermediate" results like 1 that 1 nye (once it's done with a line, it jumps to the next pair of lines).
Any ideas? My preferred language is awk/gawk, but if it can be done using unix bash it's ok as well.
Thanks in advance!

You can use this awk:
awk 'NR>1 && ($1 in a){print a[$1], $0} {a[$1]=$0}' file
1 this 1 that
1 that 1 nye
2 more 2 sit
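Note that the array remembers the last line seen for each key, so on input that is not grouped by field #1 (keys 1, 2, 1, say) it would also pair lines that are not adjacent. If only strictly consecutive lines should be paired, a minimal sketch could compare against the previous line only. The sample input and the names prev_key and prev_line are my own, not from the question:

```shell
# Made-up input where key 1 reappears after key 2 (not grouped).
input='1 this
1 that
2 more
1 nye'

# Pair a line with the one directly before it only when field 1 matches.
out=$(printf '%s\n' "$input" |
  awk 'NR > 1 && $1 == prev_key { print prev_line, $0 }
       { prev_key = $1; prev_line = $0 }')

printf '%s\n' "$out"
```

Here only "1 this 1 that" is printed, because "1 nye" directly follows a line with key 2.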

You can do it with simple commands. Assuming your input file is "test.txt" with content:
1 this
1 that
1 nye
2 more
2 sit
the following commands give the requested output:
sort -n test.txt > tmp1
(echo; cat tmp1) | paste tmp1 - | egrep '^([0-9])+ *[^ ]* *\1'

Just for fun
paste -d" " filename <(sed 1d filename) | awk '$1==$3'

Related

piping commands of awk and sed is too slow! any ideas on how to make it work faster?

I am trying to convert a file containing a column with scaffold numbers and another one with corresponding individual sites into a bed file which lists sites in ranges. For example, this file ($indiv.txt):
SCAFF SITE
1 1
1 2
1 3
1 4
1 5
3 1
3 2
3 34
3 35
3 36
should be converted into $indiv.bed:
SCAFF SITE-START SITE-END
1 1 5
3 1 2
3 34 36
Currently, I am using the following code but it is super slow so I wanted to ask if anybody could come up with a quicker way??
COMMAND:
for scaff in $(awk '{print $1}' $indiv.txt | uniq)
do
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt | awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' | sed "s/^/$scaff\t/" >> $indiv.bed
done
DESCRIPTION:
awk '{print $1}' $indiv.txt | uniq #outputs a list with the unique scaffold numbers
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt #extracts the values from column 2 if the value in the first column equals the variable $scaff
awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' #converts the list of sequential numbers into ranges as described here: https://stackoverflow.com/questions/26809668/collapse-sequential-numbers-to-ranges-in-bash
sed "s/^/$scaff\t/" >> $indiv.bed #adds a column with the respective scaffold number and then outputs the file into $indiv.bed
Thanks a lot in advance!
Calling several programs for each scaffold and re-reading the whole input every time is bound to be slow. It's usually better to find a way to process all the lines in one call.
I'd reach for Perl:
tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| perl -ane 'END {print " $F[1]"}
next if $p[0] == $F[0] && $F[1] == $p[1] + 1;
print " $p[1]\n@F";
} continue { @p = @F;' > indiv.bed
The first two lines sort the input so that the groups are always adjacent (this might be unnecessary if your input is already sorted that way). Perl then reads the lines; -a splits each line into the @F array, and the @p array keeps the previous line. If the current line has the same first element and its second element is greater by 1, we go to the continue section, which just stores the current line in @p. Otherwise, we print the last element of the previous section and the first line of the current one. The END block is responsible for printing the last element of the last section.
The output is different from yours for sections that have only a single member.
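For comparison, the same single-pass idea can be sketched in awk alone. This assumes the file starts with a header line and is already sorted by scaffold and then by site; the variable names scaff, first and last are my own. In this sketch a single-member section is printed with identical start and end values.

```shell
# Sample data copied from the question (header line included).
input='SCAFF SITE
1 1
1 2
1 3
1 4
1 5
3 1
3 2
3 34
3 35
3 36'

out=$(printf '%s\n' "$input" | awk '
  NR == 1 { next }                                         # skip the header
  NR > 2 && $1 == scaff && $2 == last + 1 { last = $2; next }  # range continues
  NR > 2 { print scaff, first, last }                      # range ended, emit it
  { scaff = $1; first = $2; last = $2 }                    # start a new range
  END { if (NR > 1) print scaff, first, last }             # emit the final range
')

printf '%s\n' "$out"
```

The file is read once, no matter how many scaffolds it contains.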

Bash: replacing a column by another and using AWK to print specific order

I have a dummy file that looks like so:
a ID_1 S1 S2
b SNP1 1 0
c SNP2 2 1
d SNP3 1 0
I want to replace the contents of column 2 by the corresponding line number. My file would then look like so:
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
I can do this with the following command:
cut -f 1,3-4 -d " " file.txt | awk '{print $1 " " FNR " " $2,$3}'
My question is, is there a better way of doing this? In particular, the real file I am working on has 2303 columns. Obviously I don't want to have to write:
cut -f 1,3-2303 -d " " file.txt | awk '{print $1 " " FNR " " $2,$3,$4,$5 ETC}'
Is there a way to tell awk to print from column 2 to the last column without having to write all the names?
Thanks
I think this should do
$ awk '{$2=FNR} 1' file.txt
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
This changes the second column and prints the modified record. The default OFS is a single space, which is what you need here.
The above command is the idiomatic way to write
awk '{$2=FNR} {print $0}' file.txt
You can think of a simple awk program as awk 'cond1{action1} cond2{action2} ...'
action1 is executed only if cond1 evaluates to true, and so on. If the action is omitted, awk prints the input record by default. 1 is simply one way to write an always-true condition.
See Idiomatic awk mentioned in https://stackoverflow.com/tags/awk/info for more such idioms
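One caveat that may be worth a quick sketch: assigning to a field makes awk rebuild the record with OFS between the fields. So for a tab-delimited file (an assumption on my part, the file in the question is space-delimited) you would probably want to set both -F and OFS:

```shell
# Tab-delimited variant of the sample data (hypothetical).
out=$(printf 'a\tID_1\tS1\tS2\nb\tSNP1\t1\t0\n' |
  awk -F '\t' -v OFS='\t' '{ $2 = FNR } 1')

printf '%s\n' "$out"
```

Without -v OFS='\t' the rebuilt lines would come out space-separated.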
The following awk may also help here.
awk '{sub(/.*/,FNR,$2)} 1' Input_file
Output will be as follows.
a 1 S1 S2
b 2 1 0
c 3 2 1
d 4 1 0
Explanation: the sub function replaces everything in $2 (the second field) with FNR, the built-in awk variable that holds the current line number of the current input file. The trailing 1 then prints the modified line.

Bash - Compare rows then print just original rows

I've got files which look like this, (there can be more columns or rows):
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
dif-2-3-4-5.com 1 1 1
And I want to compare these numbers:
1 1 1
1 1 2
1 2 1
2 1 1
1 1 1
And print only those rows which do not repeat, so I get this:
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Another simple approach is sort with uniq using a KEYDEF for fields 2-4 with sort and skipping field 1 with uniq, e.g.
$ sort file.txt -k 2,4 | uniq -f1
Example Use/Output
$ sort file.txt -k 2,4 | uniq -f1
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Keep a running record of the triples already seen and only print the first time they appear:
$ awk '!(($2,$3,$4) in seen) {print; seen[$2,$3,$4]}' file
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Try, the following awk code too:
awk '!a[$2,$3,$4]++' Input_file
Explanation:
Create an array named a indexed by $2,$3,$4. The condition !a[$2,$3,$4]++ is true only when this combination of $2,$3,$4 has not been seen before (its count in a is still 0). Two things happen:
The post-increment raises the count for that index, so the condition will NOT be true the next time the same $2,$3,$4 appears.
No action is specified, so whenever the condition is true awk applies the default action and prints the current line. This goes on for all the lines in Input_file, and the last line is not printed because its $2,$3,$4 are already present in array a.
I hope this helps.
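Spelled out in long form, the one-liner behaves roughly like the sketch below. The names key and seen are mine; SUBSEP is the separator awk itself uses when it builds the index for a[$2,$3,$4].

```shell
# Sample data copied from the question.
input='dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
dif-2-3-4-5.com 1 1 1'

out=$(printf '%s\n' "$input" | awk '{
    key = $2 SUBSEP $3 SUBSEP $4   # same key awk builds for a[$2,$3,$4]
    if (seen[key] == 0)            # first time this triple shows up
        print
    seen[key]++                    # remember it for later lines
}')

printf '%s\n' "$out"
```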
This works with POSIX and gnu awk:
$ awk '{s=""
for (i=2;i<=NF; i++)
s=s $i "|"}
s in seen { next }
++seen[s]' file
Which can be shortened to:
$ awk '{s=""; for (i=2;i<=NF; i++) s=s $i "|"} !seen[s]++' file
Also supports a variable number of columns.
If you want a sort uniq solution that also respects file order (i.e. the first of the set of duplicates is printed, not the later ones) you need to do a decorate, sort, undecorate approach.
You can:
use cat -n to decorate the file with line numbers;
sort -k3 -k1n to sort first on all the fields from the 3rd through the end of the line, then numerically on the added line number;
add -u if your version of sort supports that or use uniq -f2 (skipping the added line number and the name) to only keep the first in the group of dups;
finally use sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//' to remove the added line numbers:
cat -n file | sort -k3 -k1n | uniq -f2 | sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//'
Awk is easier and faster in this case.
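For completeness, here is a sketch of the full decorate, sort, undecorate round trip on a small made-up input in which the first duplicate group does not sort first. It assumes space-delimited data as in the question. Two choices are my own: uniq -f2 skips the added line number and the name, and a final numeric sort puts the survivors back into file order before cut strips the numbers (cat -n separates the number from the line with a tab).

```shell
# Made-up input: x.com and z.com share the triple "2 1 1".
input='x.com 2 1 1
y.com 1 1 1
z.com 2 1 1'

out=$(printf '%s\n' "$input" |
  cat -n |                      # decorate with line numbers
  LC_ALL=C sort -k3 -k1,1n |    # group equal triples, earliest line first
  uniq -f2 |                    # keep the first line of each group
  LC_ALL=C sort -k1,1n |        # restore the original file order
  cut -f2-)                     # undecorate: drop the line number

printf '%s\n' "$out"
```

This keeps x.com and y.com, in that order, and drops z.com.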

Get lengths of zeroes (interrupted by ones)

I have a long column of ones and zeroes:
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
1
....
I can easily get the average number of zeroes between ones (just total/ones):
ones=$(grep -c 1 file.txt)
lines=$(wc -l < file.txt)
echo "$lines / $ones" | bc -l
But how can I get the length of strings of zeroes between the ones? In the short example above it would be:
3
5
5
2
I'd include uniq for a more easily read approach:
uniq -c file.txt | awk '/ 0$/ {print $1}'
Edit: fixed for the case where the last line is a 0
Easy in awk:
awk '/1/{print NR-prev-1; prev=NR;}END{if (NR>prev)print NR-prev;}'
Not so difficult in bash, either:
i=0
for x in $(<file.txt); do
if ((x)); then echo $i; i=0; else ((++i)); fi
done
((i)) && echo $i
Using awk, I would use the fact that a field with the value 0 evaluates as False:
awk '!$1{s++; next} {if (s) print s; s=0} END {if (s) print s}' file
This returns:
3
5
5
2
Also, note the END block to print any "remaining" zeroes appearing after the last 1.
Explanation
!$1{s++; next} if the field is not True, that is, if the field is 0, increment the counter. Then, skip to the next line.
{if (s) print s; s=0} otherwise, print the value of the counter and reset it, but just if it contains some value (to avoid printing 0 if the file starts with a 1).
END {if (s) print s} print the remaining value of the counter after processing the file, but just if it wasn't printed before.
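To see the two guards at work, here is a small run on a made-up input that starts with a 1 and ends with a 0:

```shell
# Input lines: 1 0 0 1 0 (one value per line).
out=$(printf '%s\n' 1 0 0 1 0 |
  awk '!$1 { s++; next } { if (s) print s; s = 0 } END { if (s) print s }')

printf '%s\n' "$out"   # prints 2, then 1
```

The leading 1 prints nothing because the counter is still empty, and the trailing 0 is only reported thanks to the END block.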
If your file.txt is just a column of ones and zeros, you can use awk and change the record separator to "1\n". This makes each "record" a sequence of "0\n", and the count of 0's in the record is the length of the record divided by 2. Counts will be correct for leading and trailing ones and zeros.
awk 'BEGIN {RS="1\n"} { print length/2 }' file.txt
This seems to be pretty popular question today. Joining the party late, here is another short gnu-awk command to do the job:
awk -F '\n' -v RS='(1\n)+' 'NF{print NF-1}' file
3
5
5
2
How it works:
-F '\n' # set input field separator as \n (newline)
-v RS='(1\n)+' # set input record separator as one or more repetitions of 1 followed by newline
NF # execute the block if minimum one field is found
print NF-1 # print num of field -1 to get count of 0
Pure bash:
sum=0
while read n ; do
if ((n)) ; then
echo $sum
sum=0
else
((++sum))
fi
done < file.txt
((sum)) && echo $sum # Don't forget to output the last number if the file ended in 0.
Another way:
perl -lnE 'if(m/1/){say $.-1;$.=0}' < file
"reset" the line counter when 1.
prints
3
5
5
2
You can use awk:
awk '$1=="0"{s++} $1=="1"{if(s)print s;s=0} END{if(s)print(s)}'
Explanation:
The special variable $1 contains the value of the first field (column) of a line of text. Unless you specify the field delimiter using the -F command line option, it defaults to whitespace, meaning $1 will contain 0 or 1 in your example.
If the value of $1 equals 0 a variable called s will get incremented but if $1 is equal to 1 the current value of s gets printed (if greater than zero) and re-initialized to 0. (Note that awk initializes s with 0 before the first increment operation)
The END block gets executed after the last line of input has been processed. If the file ends with 0(s), the number of 0s between the file's end and the last 1 will get printed. (Without the END block they wouldn't be printed.)
Output:
3
5
5
2
if you can use perl:
perl -lne 'BEGIN{$counter=0;} if ($_ == 1){ print $counter; $counter=0; next} $counter++' file
3
5
5
2
It actually looks better with awk same logic:
awk '$1{print c; c=0} !$1{c++}' file
3
5
5
2
My attempt. Not so pretty but.. :3
grep -n 1 test.txt | gawk '{y=$1-x; print y-1; x=$1}' FS=":"
Out:
3
5
5
2
A funny one, in pure Bash:
while read -d 1 -a u || ((${#u[@]})); do
echo "${#u[@]}"
done < file
This tells read to use 1 as a delimiter, i.e., to stop reading as soon as a 1 is encountered; read stores the 0's in the fields of the array u. Then we only need to count the number of fields in u with ${#u[@]}. The || ((${#u[@]})) is here just in case your file doesn't end with a 1.
More strange (and not fully correct) way:
perl -0x31 -laE 'say @F+0' <file
prints
3
5
5
2
0
It
reads the file with the record separator set to the character 1 (the -0x31 option),
autosplits each record into the array @F (the -a option),
and prints the number of elements in @F with say @F+0 (say scalar @F would also work).
Unfortunately, after the final 1 (as record separator) it reads one more, empty record, and therefore prints the trailing 0.
It is an incorrect solution; I am showing it only as an alternative curiosity.
Expanding erickson's excellent answer, you can say:
$ uniq -c file | awk '!$2 {print $1}'
3
5
5
2
From man uniq we see that the purpose of uniq is to:
Filter adjacent matching lines from INPUT (or standard input), writing
to OUTPUT (or standard output).
So uniq groups the numbers. Using the -c option we get a prefix with the number of occurrences:
$ uniq -c file
3 0
1 1
5 0
1 1
5 0
1 1
2 0
1 1
Then it is a matter of printing the counters that come before a 0. For this we can use awk like: awk '!$2 {print $1}'. That is: print the first field if the second field is 0.
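Put together on the sample column from the question, the pipeline can be sketched as follows. I wrote the test as $2 == 0 instead of !$2 purely for readability; on this data the two behave the same.

```shell
# The 19 sample values from the question, one per line.
out=$(printf '%s\n' 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 |
  uniq -c |                     # collapse runs into "count value" pairs
  awk '$2 == 0 { print $1 }')   # keep only the counts of the zero runs

printf '%s\n' "$out"
```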
The simplest solution would be to use sed together with awk, like this:
sed -n '$bp;/0/{:r;N;/0$/{h;br}};/1/{x;bp};:p;/.\+/{s/\n//g;p}' input.txt \
| awk '{print length}'
Explanation:
The sed command separates the 0s and creates output like this:
000
00000
00000
00
Piped into awk '{print length}' you can get the count of 0 for each interval:
Output:
3
5
5
2

Grouping elements by two fields on a space delimited file

I have this ordered data by column 2 then 3 and then 1 in a space delimited file (i used linux sort to do that):
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
I want to create a new file (leaving the old file as is)
0 2 0,1,2
1 4 1,2
Basically put the fields 2 and 3 first and group the elements of field 1 (as a comma separated list) by them. Is there a way to do that by an awk, sed, bash one liner, so to avoid writing a Java, C++ app for that?
Since the file is already ordered, you can print the line as they change:
awk '
seen==$2 FS $3 { line=line "," $1; next }
{ if(seen) print seen, line; seen=$2 FS $3; line=$1 }
END { print seen, line }
' file
0 2 0,1,2
1 4 1,2
This will preserve the order of output.
with your input and output this line may help:
awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}
{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' file
test:
kent$ cat f
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
kent$ awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' f
0 2 0,1,2
1 4 1,2
awk 'a[$2, $3]++ { p = p "," $1; next } p { print p } { p = $2 FS $3 FS $1 } END { if (p) print p }' file
Output:
0 2 0,1,2
1 4 1,2
The solution assumes data on second and third column is sorted.
Using awk:
awk '{k=$2 OFS $3} !(k in a){a[k]=$1; b[++n]=k; next} {a[k]=a[k] "," $1}
END{for (i=1; i<=n; i++) print b[i],a[b[i]]}' file
0 2 0,1,2
1 4 1,2
Yet another take:
awk -v SUBSEP=" " '
{group[$2,$3] = group[$2,$3] $1 ","}
END {
for (g in group) {
sub(/,$/,"",group[g])
print g, group[g]
}
}
' file > newfile
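The SUBSEP variable is the string awk uses to join the subscripts of a multi-dimensional index such as group[$2,$3] into the single key of its one-dimensional array. Setting it to a space makes each key print as "$2 $3". Note that for (g in group) does not guarantee any particular output order.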
http://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html#Multidimensional
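A tiny sketch of what that means in practice: with the default SUBSEP, the key stored for group[$2, $3] is one string that can be split back into its two subscripts. The names key, part and n are only for illustration.

```shell
out=$(printf '0 0 2\n' | awk '{
    group[$2, $3] = $1                    # stored under the key $2 SUBSEP $3
    for (key in group) {
        n = split(key, part, SUBSEP)      # recover the individual subscripts
        print n, part[1], part[2]
    }
}')

printf '%s\n' "$out"   # prints: 2 0 2
```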
This might work for you (GNU sed):
sed -r ':a;$!N;/(. (. .).*)\n(.) \2.*/s//\1,\3/;ta;s/(.) (.) (.)/\2 \3 \1/;P;D' file
This appends the first column of the subsequent record to the first record until the second and third keys change. Then the fields in the first record are re-arranged and printed out.
This uses the data presented but can be adapted for more complex data.
