awk super slow processing many rows but not many columns - performance

While looking into this question, the challenge was to take this matrix:
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
And turn it into:
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2 # top of next 2 columns...
6 7
4 2
... and so on: N elements at a time from each row of the matrix -- in this example, N=2 ...
3 4
4 1
d f
5 9
q w # last element is lower left of matrix
The OP stated the input was 'much bigger' than the example without specifying the shape of the actual input (millions of rows? millions of columns? or both?).
I assumed (mistakenly) that the file had millions of rows (it was later specified to have millions of columns).
BUT the interesting thing is that most of the awks written ran at perfectly acceptable speed IF the shape of the data was millions of columns.
Example: @glennjackman posted a perfectly usable awk so long as the long end was in columns, not in rows.
You can use his Perl to generate an example matrix of rows x columns. Here is that Perl:
perl -E '
my $cols = 2**20;   # 1,048,576 columns - the long end
my $rows = 2**3;    # 8 rows
my @alphabet = ( "a".."z", 0..9 );
my $size = scalar @alphabet;

for ($r=1; $r <= $rows; $r++) {
    for ($c = 1; $c <= $cols; $c++) {
        my $idx = int rand $size;
        printf "%s ", $alphabet[$idx];
    }
    printf "\n";
}' >file
Here are some candidate scripts that turn file (from that Perl script) into the output of 2 columns taken from the front of each row:
This Python is the speed champ regardless of the shape of the input:
$ cat col.py
import sys

cols = int(sys.argv[2])
offset = 0
delim = "\t"

# slurp the whole file into a list of per-line field lists
with open(sys.argv[1], "r") as f:
    dat = [line.split() for line in f]

# walk the columns in groups of 'cols', printing every row for each group
while offset <= len(dat[0]) - cols:
    for sl in dat:
        print(delim.join(sl[offset:offset + cols]))
    offset += cols
Here is a Perl script (run with perl -lan; it appears as columnize.pl in the timings below) that is also quick enough regardless of the shape of the data:
$ cat col.pl
push @rows, [@F];

END {
    my $delim = "\t";
    my $cols_per_group = 2;
    my $col_start = 0;
    while ( 1 ) {
        for my $row ( @rows ) {
            print join $delim, @{$row}[ $col_start .. ($col_start + $cols_per_group - 1) ];
        }
        $col_start += $cols_per_group;
        last if ($col_start + $cols_per_group - 1) > $#F;
    }
}
Here is an alternate awk that is slower but runs at a consistent speed (the number of lines in the file needs to be pre-calculated and passed in as nl):
$ cat col3.awk
function join(start, end,    result, i) {
    for (i=start; i<=end; i++)
        result = result $i (i==end ? ORS : FS)
    return result
}
{
    col_offset = 0
    for (i=1; i<=NF; i+=cols) {
        s = join(i, i+cols-1)
        col[NR + col_offset*nl] = s
        col_offset++
        ++cnt
    }
}
END {
    for (i=1; i<=cnt; i++)
        printf "%s", col[i]
}
And Glenn Jackman's awk (not to pick on him since ALL the awks had the same bad result with many rows):
function join(start, end,    result, i) {
    for (i=start; i<=end; i++)
        result = result $i (i==end ? ORS : FS)
    return result
}
{
    c = 0
    for (i=1; i<NF; i+=n) {
        c++
        col[c] = col[c] join(i, i+n-1)
    }
}
END {
    for (i=1; i<=c; i++)
        printf "%s", col[i]    # the value already ends with newline
}
Here are the timings with many columns (i.e., in the Perl script that generates file above, my $cols = 2**20 and my $rows = 2**3):
echo 'glenn jackman awk'
time awk -f col1.awk -v n=2 file >file1
echo 'glenn jackman gawk'
time gawk -f col1.awk -v n=2 file >file5
echo 'perl'
time perl -lan columnize.pl file >file2
echo 'dawg Python'
time python3 col.py file 2 >file3
echo 'dawg awk'
time awk -f col3.awk -v nl=$(awk '{cnt++} END{print cnt}' file) -v cols=2 file >file4
Prints:
# 2**20 COLUMNS; 2**3 ROWS
glenn jackman awk
real 0m4.460s
user 0m4.344s
sys 0m0.113s
glenn jackman gawk
real 0m4.493s
user 0m4.379s
sys 0m0.109s
perl
real 0m3.005s
user 0m2.774s
sys 0m0.230s
dawg Python
real 0m2.871s
user 0m2.721s
sys 0m0.148s
dawg awk
real 0m11.356s
user 0m11.038s
sys 0m0.312s
But transpose the shape of the data by setting my $cols = 2**3 and my $rows = 2**20 and run the same timings:
# 2**3 COLUMNS; 2**20 ROWS
glenn jackman awk
real 23m15.798s
user 16m39.675s
sys 6m35.972s
glenn jackman gawk
real 21m49.645s
user 16m4.449s
sys 5m45.036s
perl
real 0m3.605s
user 0m3.348s
sys 0m0.228s
dawg Python
real 0m3.157s
user 0m3.065s
sys 0m0.080s
dawg awk
real 0m11.117s
user 0m10.710s
sys 0m0.399s
So, the question:
What would cause the first awk to be 100x slower if the data are transposed to millions of rows vs millions of columns?
It is the same number of elements processed and the same total data. The join function is called the same number of times.

String concatenation being saved in a variable is one of the slowest operations in awk (IIRC it's slower than I/O), as you're constantly having to find a new memory location to hold the result of the concatenation. There's more of that happening in those scripts as the saved strings get longer, so it's probably all of the string concatenation in the posted solutions that's causing the slowdown.
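To see that effect in isolation, here's a toy sketch (mine, not from any of the posted solutions; the file name concat_vs_array.awk is made up) that either appends every input record to one ever-growing string or stores each record in its own array element:
$ cat concat_vs_array.awk
# mode=concat: append each record to a single growing string; every append has
#              to copy everything accumulated so far, so the cost grows as the
#              string grows.
# mode=array:  store each record in its own array element; the cost per record
#              stays roughly constant.
mode == "concat" { big = big $0 ORS }
mode == "array"  { line[NR] = $0 }
END { printf "processed %d records in mode %s\n", NR, mode }
Running time awk -v mode=concat -f concat_vs_array.awk file against the 2**20-row input and comparing it with -v mode=array should show the concatenation cost dominating, if the explanation above is right.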
Something like this should be fast and shouldn't be dependent on how many fields there are vs how many records:
$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    for (i=1; i<=numVals; i+=2) {
        valNr = i + ((i-1) * NF)    # <- not correct, fix it!
        print vals[valNr], vals[valNr+1]
    }
}
I don't have time right now to figure out the correct math to calculate the index for the single-loop approach above (see the comment in the code), so here's a working version with 2 loops that doesn't require as much thought and shouldn't run much, if any, slower (one possible version of that index math is sketched after the output below):
$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    inc = NF - 1
    for (i=0; i<NF; i+=2) {
        for (j=1; j<=NR; j++) {
            valNr = i + j + ((j-1) * inc)
            print vals[valNr], vals[valNr+1]
        }
    }
}
$ awk -f tst.awk file
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2
6 7
4 2
6 2
7 1
9 0
a 2
3 2
9 8
9 5
4 2
5 s
2 2
5 6
3 4
1 4
4 8
4 g
5 3
3 4
4 1
d f
5 9
q w
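For the record, here is one way the single-loop index math left open in the first script could be filled in. This is my sketch rather than Ed Morton's own fix, with n=2 hardcoded as in his example and all rows assumed to have the same number of fields:
$ cat tst1loop.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    for (p=0; p < numVals/2; p++) {
        grp = int(p / NR)            # which pair of columns (0, 1, 2, ...)
        row = p % NR                 # which input row within that column group
        valNr = row*NF + grp*2 + 1   # index of the first value of the pair
        print vals[valNr], vals[valNr+1]
    }
}
awk -f tst1loop.awk file should produce the same output as the two-loop version above.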

A play with strings:
$ awk '
{
    a[NR] = $0    # hash rows to a
    c[NR] = 1     # index pointer
}
END {
    b = 4               # buffer size for match
    i = NR              # row count
    n = length(a[1])    # process til the first row is done
    while (c[1] < n)
        for (j=1; j<=i; j++) {                            # of each row
            match(substr(a[j],c[j],b), /([^ ]+ ?){2}/)    # read 2 fields and separators
            print substr(a[j], c[j], RLENGTH)             # output them
            c[j] += RLENGTH                               # increase index pointer
        }
}' file
b=4 is a buffer size that is optimal for 2 single-digit fields and 2 single-space separators (a b ), as given in the original question, but if the data is real-world data, b should be set to something more suitable. Omitting it (so that the match line becomes match(substr(a[j],c[j]),/([^ ]+ ?){2}/)) kills the performance for data with lots of columns.
I got times around 8 seconds for datasets of sizes 2**20 x 2**3 and 2**3 x 2**20.
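If b needs to be sized for real-world data, one option is a quick pre-pass that finds the widest field and sets b to roughly 2*(max width + 1); a hypothetical helper, not part of the answer above (the hard-coded b=4 would then be replaced by a -v b=... variable):
$ awk '{ for (i=1; i<=NF; i++) if (length($i) > w) w = length($i) }
       END { print 2 * (w + 1) }' file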

Ed Morton's approach did fix the speed issue.
Here is the awk I wrote that supports variable columns:
$ cat col.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    for (col_offset=0; col_offset + cols <= NF; col_offset+=cols) {
        for (i=1; i<=numVals; i+=NF) {
            for (j=0; j<cols; j++) {
                printf "%s%s", vals[i+j+col_offset], (j<cols-1 ? FS : ORS)
            }
        }
    }
}
$ time awk -f col.awk -v cols=2 file >file.cols
real 0m5.810s
user 0m5.468s
sys 0m0.339s
This is about 6 seconds for datasets of sizes 2**20 x 2**3 and 2**3 x 2**20.
But MAN it sure is nice to have strong support of arrays of arrays (such as in Perl or Python...)
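For what it's worth, GNU awk 4+ does have true arrays of arrays; here is a sketch of the same column-grouping idea written that way (mine, gawk-only, not timed):
$ cat col_aoa.awk
# GNU awk 4+ only: keep each field in a real 2-D array rows[record][field],
# then walk the array one column group at a time.
{
    for (i=1; i<=NF; i++) rows[NR][i] = $i
}
END {
    for (start=1; start+cols-1 <= NF; start+=cols) {
        for (r=1; r<=NR; r++) {
            for (j=0; j<cols; j++)
                printf "%s%s", rows[r][start+j], (j<cols-1 ? FS : ORS)
        }
    }
}
Run it as gawk -f col_aoa.awk -v cols=2 file.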

Related

mean and standard deviation of multiple files or every three lines in a single line in linux

My operating system is Windows 10; I use Bash on Windows for executing Linux commands. I have a file with 96 lines, and multiple files that each cover three lines of this file, and I want to write the mean and standard deviation of each group of three into a single output file, line by line.
Single file
1 31.31
2 32.24
3 32.11
4 20.97
5 20.93
6 20.91
7 22.58
8 22.46
9 22.52
10 20.71
11 20.25
12 20.51
File 1
1 31.31
2 32.24
3 32.11
File 2
4 20.97
5 20.93
6 20.91
File 3
7 22.58
8 22.46
9 22.52
First of all, I tried to split the file (in verbose mode) into multiple files with
grep -i 'Sample' Sample3.txt | awk '{print $5, $6}' | sed 's/\,/\./g' >> Sample4.txt | split -l3 Sample4.txt --verbose
Can tcsh commands like foreach and awk be used for bash scripting? Can we do this in a single text file, or do we have to split that single file into multiple files?
for example output can be:
output.txt
mean stand.D.
31.88667 0.50362 ----- first three rows mean and sd
20.93667 0.030 ----- second three rows mean and sd
22.52 0.06 ----- third three rows mean and sd
etc etc etc
How about using this awk script?
BEGIN {
    avg = 0; j = 0
    fname = "file_output.txt"
    printf "mean\t stand.D\n" > fname
}
{
    avg = avg + $2
    values[j] = $2
    j = j + 1
    if (NR % 3 == 0) {
        printf "%f\t", avg/3 > fname
        sd = 0
        for (k = 0; k < 3; k++) {
            sd = sd + (values[k] - avg/3) * (values[k] - avg/3)
        }
        printf "%f\n", sqrt(sd/3) > fname
        avg = 0; j = 0
    }
}
Output:
mean stand.D
31.8867 0.411204
20.9367 0.0249444
22.52 0.0489898
20.49 0.188326
"Bash script" (foo.sh):
#!/bin/bash
# data.txt is Single File
awk -F " " -f script.awk data.txt

calculate mean on even and odd lines independently in bash

I have a file with 2 columns and many rows. I would like to calculate the mean for each column for odd and even lines independently, so that in the end I would have a file with 4 values: 2 columns with the odd and even means.
My file looks like this:
2 4
4 4
6 8
3 5
6 9
2 1
In the end I would like to obtain a file with the mean of 2,6,6 and 4,3,2 in the first column and the mean of 4,8,9 and 4,5,1 in the second column, that is:
4.66 7
3 3.33
If anyone could give me some advice I'd really appreciate it; for the moment I'm only able to calculate the mean for all rows (not even and odd). Thank you very much in advance!
This is a hardcoded awk example, but you get the point:
awk 'NR%2{e1+=$1;e2+=$2;c++;next}
{o1+=$1;o2+=$2;d++}
END{print e1/c"\t"e2/c"\n"o1/d"\t"o2/d}' your_file
4.66667 7
3 3.33333
A more generalized version of Juan Diego Godoy's answer; it relies on GNU awk:
gawk '
{
    parity = NR % 2 == 1 ? "odd" : "even"
    for (i=1; i<=NF; i++) {
        sum[parity][i] += $i
        count[parity][i] += 1
    }
}
function result(parity) {
    for (i=1; i<=NF; i++)
        printf "%g\t", sum[parity][i] / count[parity][i]
    print ""
}
END { result("odd"); result("even") }
'
This answer uses Bash and bc. It assumes that the input file consists of only integers and that there is an even number of lines.
#!/bin/bash
while read -r oddcol1 oddcol2; read -r evencol1 evencol2
do
    (( oddcol1sum += oddcol1 ))
    (( oddcol2sum += oddcol2 ))
    (( evencol1sum += evencol1 ))
    (( evencol2sum += evencol2 ))
    (( count++ ))
done < inputfile

cat <<EOF | bc -l
scale=2
print "Odd Column 1 Mean: "; $oddcol1sum / $count
print "Odd Column 2 Mean: "; $oddcol2sum / $count
print "Even Column 1 Mean: "; $evencol1sum / $count
print "Even Column 2 Mean: "; $evencol2sum / $count
EOF
It could be modified to use arrays to make it more flexible.

Find smallest missing integer in an array

I'm writing a bash script which requires searching for the smallest available integer in an array and piping it into a variable.
I know how to identify the smallest or the largest integer in an array but I can't figure out how to identify the 'missing' smallest integer.
Example array:
1
2
4
5
6
In this example I would need 3 as a variable.
Using sed for this would be silly. With GNU awk you could do
array=(1 2 4 5 6)
echo "${array[#]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }'
...which remembers all numbers, then counts from 1 until it finds one that it doesn't remember and prints that. You can then remember this number in bash with
array=(1 2 4 5 6)
number=$(echo "${array[@]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }')
However, if you're already using bash, you could just do the same thing in pure bash:
#!/bin/bash
array=(1 2 4 5 6)
declare -a seen
for i in ${array[@]}; do
seen[$i]=1
done
for((number = 1; seen[number] == 1; ++number)); do true; done
echo $number
You can iterate from the minimal to the maximal number and take the first non-existing element:
use List::Util qw( first );
my @arr = sort {$a <=> $b} qw(1 2 4 5 6);
my $min = $arr[0];
my $max = $arr[-1];
my %seen;
@seen{@arr} = ();
my $first = first { !exists $seen{$_} } $min .. $max;
This code will do as you ask. It can easily be accelerated by using a binary search, but it is clearest stated in this way.
The first element of the array can be any integer, and the subroutine returns the first value that isn't in the sequence. It returns undef if the complete array is contiguous.
use strict;
use warnings;
use 5.010;
my @data = qw/ 1 2 4 5 6 /;
say first_missing(@data);
@data = ( 4 .. 99, 101 .. 122 );
say first_missing(@data);
sub first_missing {
    my $start = $_[0];
    for my $i ( 1 .. $#_ ) {
        my $expected = $start + $i;
        return $expected unless $_[$i] == $expected;
    }
    return;
}
output
3
100
Here is a Perl one liner:
$ echo '1 2 4 5 6' | perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;'
If you want to switch from a pipeline to a file input:
$ perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;' file
Since it is sorted in the process, input can be in arbitrary order.
$ cat tst.awk
BEGIN {
    split("1 2 4 5 6", a)
    for (i=1; a[i+1]==a[i]+1; i++) ;
    print a[i]+1
}
$ awk -f tst.awk
3
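The same idea reading sorted numbers one per line from a file, instead of hard-coding them in BEGIN, might look like this (a sketch; it assumes ascending input and, like the version above, reports max+1 when nothing is missing):
$ cat missing.awk
NR > 1 && $1 != prev + 1 { print prev + 1; found = 1; exit }
{ prev = $1 }
END { if (!found) print prev + 1 }
$ printf '%s\n' 1 2 4 5 6 | awk -f missing.awk
3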
Having fun with @Borodin's excellent answer:
#!/usr/bin/env perl
use 5.020; # why not?
use strict;
use warnings;
sub increasing_stream {
    my $start = int($_[0]);
    return sub {
        $start += 1 + (rand(1) > 0.9);
    };
}
my $stream = increasing_stream(rand(1000));
my $first = $stream->();
say $first;
while (1) {
    my $next = $stream->();
    say $next;
    last unless $next == ++$first;
    $first = $next;
}
say "Skipped: $first";
Output:
$ ./tyu.pl
381
382
383
384
385
386
387
388
389
390
391
392
393
395
Skipped: 394
Here's one bash solution (assuming the numbers are in a file, one per line):
sort -n numbers.txt | grep -n . |
grep -v -m1 '\([0-9]\+\):\1' | cut -f1 -d:
The first part sorts the numbers and then adds a sequence number to each one, and the second part finds the first sequence number which doesn't correspond to the number in the array.
Same thing, using sort and awk (bog-standard, no extensions in either):
sort -n numbers.txt | awk '$1!=NR{print NR;exit}'
Here is a slight variation on the theme set by other answers. Values coming in are not necessarily pre-sorted:
$ cat test
sort -nu <<END-OF-LIST |
1
5
2
4
6
END-OF-LIST
awk 'BEGIN { M = 1 } M > $1 { next } M == $1 { M++; next }
M < $1 { exit } END { print M }'
$ sh test
3
Notes:
If numbers are pre-sorted, do not bother with the sort.
If there are no missing numbers, the next higher number is output.
In this example, a here document supplies numbers, but one can use a file or pipe.
M may start greater than the smallest to ignore missing numbers below a threshold.
To auto-start the search at the lowest number, change BEGIN { M = 1 } to NR == 1 { M = $1 }.

How can I read first n and last n lines from a file?

How can I read the first n lines and the last n lines of a file?
For n=2, I read online that (head -n2 && tail -n2) would work, but it doesn't.
$ cat x
1
2
3
4
5
$ cat x | (head -n2 && tail -n2)
1
2
The expected output for n=2 would be:
1
2
4
5
head -n2 file && tail -n2 file
Chances are you're going to want something like:
... | awk -v OFS='\n' '{a[NR]=$0} END{print a[1], a[2], a[NR-1], a[NR]}'
or if you need to specify a number, and taking into account @Wintermute's astute observation that you don't need to buffer the whole file, something like this is what you really want:
... | awk -v n=2 'NR<=n{print;next} {buf[((NR-1)%n)+1]=$0}
END{for (i=1;i<=n;i++) print buf[((NR+i-1)%n)+1]}'
I think the math is correct on that - hopefully you get the idea: use a rotating buffer indexed by NR modded by the size of the buffer, adjusted to use indices in the range 1-n instead of 0-(n-1).
To help with comprehension of the modulus operator used in the indexing above, here is an example with intermediate print statements to show the logic as it executes:
$ cat file
1
2
3
4
5
6
7
8
.
$ cat tst.awk
BEGIN {
print "Populating array by index ((NR-1)%n)+1:"
}
{
buf[((NR-1)%n)+1] = $0
printf "NR=%d, n=%d: ((NR-1 = %d) %%n = %d) +1 = %d -> buf[%d] = %s\n",
NR, n, NR-1, (NR-1)%n, ((NR-1)%n)+1, ((NR-1)%n)+1, buf[((NR-1)%n)+1]
}
END {
print "\nAccessing array by index ((NR+i-1)%n)+1:"
for (i=1;i<=n;i++) {
printf "NR=%d, i=%d, n=%d: (((NR+i = %d) - 1 = %d) %%n = %d) +1 = %d -> buf[%d] = %s\n",
NR, i, n, NR+i, NR+i-1, (NR+i-1)%n, ((NR+i-1)%n)+1, ((NR+i-1)%n)+1, buf[((NR+i-1)%n)+1]
}
}
$
$ awk -v n=3 -f tst.awk file
Populating array by index ((NR-1)%n)+1:
NR=1, n=3: ((NR-1 = 0) %n = 0) +1 = 1 -> buf[1] = 1
NR=2, n=3: ((NR-1 = 1) %n = 1) +1 = 2 -> buf[2] = 2
NR=3, n=3: ((NR-1 = 2) %n = 2) +1 = 3 -> buf[3] = 3
NR=4, n=3: ((NR-1 = 3) %n = 0) +1 = 1 -> buf[1] = 4
NR=5, n=3: ((NR-1 = 4) %n = 1) +1 = 2 -> buf[2] = 5
NR=6, n=3: ((NR-1 = 5) %n = 2) +1 = 3 -> buf[3] = 6
NR=7, n=3: ((NR-1 = 6) %n = 0) +1 = 1 -> buf[1] = 7
NR=8, n=3: ((NR-1 = 7) %n = 1) +1 = 2 -> buf[2] = 8
Accessing array by index ((NR+i-1)%n)+1:
NR=8, i=1, n=3: (((NR+i = 9) - 1 = 8) %n = 2) +1 = 3 -> buf[3] = 6
NR=8, i=2, n=3: (((NR+i = 10) - 1 = 9) %n = 0) +1 = 1 -> buf[1] = 7
NR=8, i=3, n=3: (((NR+i = 11) - 1 = 10) %n = 1) +1 = 2 -> buf[2] = 8
This might work for you (GNU sed):
sed -n ':a;N;s/[^\n]*/&/2;Ta;2p;$p;D' file
This keeps a window of 2 lines (replace the 2's with n), prints the first 2 lines, and at end of file prints the window, i.e. the last 2 lines.
Here's a GNU sed one-liner that prints the first 10 and last 10 lines:
gsed -ne'1,10{p;b};:a;$p;N;21,$D;ba'
If you want to print a '--' separator between them:
gsed -ne'1,9{p;b};10{x;s/$/--/;x;G;p;b};:a;$p;N;21,$D;ba'
If you're on a Mac and don't have GNU sed, you can't condense as much:
sed -ne'1,9{' -e'p;b' -e'}' -e'10{' -e'x;s/$/--/;x;G;p;b' -e'}' -e':a' -e'$p;N;21,$D;ba'
Explanation
gsed -ne' invoke sed without automatic printing pattern space
-e'1,9{p;b}' print the first 9 lines
-e'10{x;s/$/--/;x;G;p;b}' print line 10 with an appended '--' separator
-e':a;$p;N;21,$D;ba' print the last 10 lines
If you are using a shell that supports process substitution, another way to accomplish this is to write to multiple processes, one for head and one for tail. Suppose for this example your input comes from a pipe feeding you content of unknown length. You want to use just the first 5 lines and the last 10 lines and pass them on to another pipe:
cat | { tee >(head -5) >(tail -10) 1>/dev/null; } | cat
The use of {} collects the output from inside the group (there will be two different programs writing to stdout inside the process substitutions). The 1>/dev/null is to get rid of the extra copy tee will try to write to its own stdout.
That demonstrates the concept and all the moving parts, but it can be simplified a little in practice by using the STDOUT stream of tee instead of discarding it. Note that the command grouping is still necessary here to pass the output on through the next pipe!
cat | { tee >(head -5) | tail -15; } | cat
Obviously replace cat in the pipeline with whatever you are actually doing. If the program producing your input can write the same content to multiple files, you could eliminate the use of tee entirely, as well as the monkeying with STDOUT. Say you have a command that accepts multiple -o output file name flags:
{ mycommand -o >(head -5) -o >(tail -10); } | cat
awk -v n=4 'NR<=n; {b = b "\n" $0} NR>=n {sub(/[^\n]*\n/,"",b)} END {print b}'
The first n lines are covered by NR<=n;. For the last n lines, we just keep track of a buffer holding the latest n lines, repeatedly adding one to the end and removing one from the front (after the first n).
It's possible to do it more efficiently, with an array of lines instead of a single buffer, but even with gigabytes of input, you'd probably waste more in brain time writing it out than you'd save in computer time by running it.
ETA: Because the above timing estimate provoked some discussion in (now deleted) comments, I'll add anecdata from having tried that out.
With a huge file (100M lines, 3.9 GiB, n=5) it took 454 seconds, compared to @EdMorton's lined-buffer solution, which executed in only 30 seconds. With more modest inputs ("mere" millions of lines) the ratio is similar: 4.7 seconds vs. 0.53 seconds.
Almost all of that additional time in this solution seems to be spent in the sub() function; a tiny fraction also does come from string concatenation being slower than just replacing an array member.
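For reference, the array-of-lines version alluded to above would look essentially like the rotating buffer @EdMorton posted earlier in this thread; a sketch (mine), which assumes the file has at least n lines:
awk -v n=4 '
    NR <= n                    # print the first n lines as they arrive
    { buf[NR % n] = $0 }       # keep only the latest n lines in a rotating array
    END {
        for (i = NR+1; i <= NR+n; i++)
            print buf[i % n]   # replay the last n lines in order
    }' file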
Use GNU parallel. To print the first three lines and the last three lines:
parallel {} -n 3 file ::: head tail
Based on dcaswell's answer, the following sed script prints the first and last 10 lines of a file:
# Make a test file first
testit=$(mktemp -u)
seq 1 100 > $testit
# This sed script:
sed -n ':a;1,10h;N;${x;p;i\
-----
;x;p};11,$D;ba' $testit
rm $testit
Yields this:
1
2
3
4
5
6
7
8
9
10
-----
90
91
92
93
94
95
96
97
98
99
100
Here is another AWK script. Assuming there might be overlap of head and tail.
File script.awk
BEGIN { range = 3 }      # Define the head and tail range
NR <= range { print }    # Output the head; for the first lines in range
{ arr[NR % range] = $0 } # Store the current line in a rotating array
END {                                               # Last line reached
    for (row = NR - range + 1; row <= NR; row++) {  # Reread the last range lines from the array
        print arr[row % range]
    }
}
Running the script
seq 1 7 | awk -f script.awk
Output
1
2
3
5
6
7
For overlapping head and tail:
seq 1 5 |awk -f script.awk
1
2
3
3
4
5
Print the first and last n lines
For n=1:
seq 1 10 | sed '1p;$!d'
Output:
1
10
For n=2:
seq 1 10 | sed '1,2P;$!N;$!D'
Output:
1
2
9
10
For n>=3, use the generic regex:
':a;$q;N;(n+1),(n*2)P;(n+1),$D;ba'
For n=3:
seq 1 10 | sed ':a;$q;N;4,6P;4,$D;ba'
Output:
1
2
3
8
9
10

Moving columns from the back of a file to the front

I have a file with a very large number of columns (basically several thousand sets of threes) with three special columns (Chr and Position and Name) at the end.
I want to move these final three columns to the front of the file, so that the columns become Name Chr Position, and then the file continues with the trios.
I think this might be possible with awk, but I don't know enough about how awk works to do it!
Sample input:
Gene1.GType Gene1.X Gene1.Y ....ending in GeneN.Y Chr Position Name
Desired Output:
Name Chr Position (Gene1.GType Gene1.X Gene1.Y ) x n samples
I think the below example does more or less what you want.
$ cat file
A B C D E F G Chr Position Name
1 2 3 4 5 6 7 8 9 10
$ cat process.awk
{
    printf $(NF-2) " " $(NF-1) " " $NF
    for (i=1; i<NF-2; i++)
    {
        printf " " $i
    }
    print " "
}
$ awk -f process.awk file
Chr Position Name A B C D E F G
8 9 10 1 2 3 4 5 6 7
NF in awk denotes the number of fields on a row.
One-liner:
awk '{ Chr=$(NF-2) ; Position=$(NF-1) ; Name=$NF ; $(NF-2)=$(NF-1)=$NF="" ; print Name, Chr, Position, $0 }' file
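One caveat with that one-liner: emptying the three fields leaves their separators behind, so each output line ends with a few trailing blanks. A variant that prints the remaining fields explicitly avoids that (my sketch, not part of the original answer):
awk '{
    chr = $(NF-2); pos = $(NF-1); name = $NF     # grab the trailing three columns
    printf "%s %s %s", name, chr, pos            # Name Chr Position first
    for (i = 1; i <= NF-3; i++) printf " %s", $i # then the remaining columns in order
    print ""
}' file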
