I have a file with a very large number of columns (basically several thousand sets of three) plus three special columns (Chr, Position and Name) at the end.
I want to move these final three columns to the front of the file, so that the columns become Name Chr Position and the file then continues with the trios.
I think this might be possible with awk, but I don't know enough about how awk works to do it!
Sample input:
Gene1.GType Gene1.X Gene1.Y ....ending in GeneN.Y Chr Position Name
Desired Output:
Name Chr Position (Gene1.GType Gene1.X Gene1.Y ) x n samples
I think the below example does more or less what you want.
$ cat file
A B C D E F G Chr Position Name
1 2 3 4 5 6 7 8 9 10
$ cat process.awk
{
printf $(NF-2)" "$(NF-1)" "$NF
for( i=1; i<NF-2; i++)
{
printf " "$i
}
print " "
}
$ awk -f process.awk file
Chr Position Name A B C D E F G
8 9 10 1 2 3 4 5 6 7
NF in awk denotes the number of fields on a row.
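As a quick illustration of NF and the $(NF-k) references counting back from the last field (my own example, not part of the original answer):
$ echo 'a b c d' | awk '{print NF, $NF, $(NF-1), $(NF-2)}'
4 d c b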
One-liner:
awk '{ Chr=$(NF-2) ; Position=$(NF-1) ; Name=$NF ; $(NF-2)=$(NF-1)=$NF="" ; print Name, Chr, Position, $0 }' file
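One caveat worth noting (my own addition, not part of the original answer): assigning "" to the last three fields makes awk rebuild $0 with the now-empty fields still joined by OFS, so the printed line ends with trailing blanks. If that matters, they can be stripped before printing, e.g.:
$ awk '{ Chr=$(NF-2); Position=$(NF-1); Name=$NF; $(NF-2)=$(NF-1)=$NF=""; sub(/ +$/,""); print Name, Chr, Position, $0 }' file
Name Chr Position A B C D E F G
10 8 9 1 2 3 4 5 6 7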
Related
While looking into this question, the challenge was to take this matrix:
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
And turn into:
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2 # top of next 2 columns...
6 7
4 2
... each N elements from each row of the matrix -- in this example, N=2...
3 4
4 1
d f
5 9
q w # last element is lower left of matrix
The OP stated the input was 'much bigger' than the example, without specifying the shape of the actual input (millions of rows? millions of columns? or both?).
I assumed (mistakenly) that the file had millions of rows (it was later specified to have millions of columns).
BUT the interesting thing is that most of the awks written ran at a perfectly acceptable speed IF the shape of the data was millions of columns.
Example: @glennjackman posted a perfectly usable awk, so long as the long end was in columns, not in rows.
You can use his Perl to generate an example matrix of rows x columns:
perl -E '
my $cols = 2**20; # 1,048,576 columns - the long end
my $rows = 2**3; # 8 rows
my @alphabet=( 'a'..'z', 0..9 );
my $size = scalar @alphabet;
for ($r=1; $r <= $rows; $r++) {
for ($c = 1; $c <= $cols; $c++) {
my $idx = int rand $size;
printf "%s ", $alphabet[$idx];
}
printf "\n";
}' >file
Here are some candidate scripts that turn file (from that Perl script) into the output of 2 columns taken from the front of each row:
This is the speed champ regardless of the shape of input in Python:
$ cat col.py
import sys
cols=int(sys.argv[2])
offset=0
delim="\t"
with open(sys.argv[1], "r") as f:
    dat=[line.split() for line in f]
while offset<=len(dat[0])-cols:
    for sl in dat:
        print(delim.join(sl[offset:offset+cols]))
    offset+=cols
Here is a Perl that is also quick enough regardless of the shape of the data:
$ cat col.pl
push @rows, [@F];
END {
my $delim = "\t";
my $cols_per_group = 2;
my $col_start = 0;
while ( 1 ) {
for my $row ( @rows ) {
print join $delim, @{$row}[ $col_start .. ($col_start + $cols_per_group - 1) ];
}
$col_start += $cols_per_group;
last if ($col_start + $cols_per_group - 1) > $#F;
}
}
Here is an alternate awk that is slower but runs at a consistent speed (the number of lines in the file needs to be pre-calculated and passed in as nl):
$ cat col3.awk
function join(start, end, result, i) {
for (i=start; i<=end; i++)
result = result $i (i==end ? ORS : FS)
return result
}
{ col_offset=0
for(i=1;i<=NF;i+=cols) {
col[NR+col_offset*nl]=join(i,i+cols-1)
col_offset++
++cnt
}
}
END { for(i=1;i<=cnt;i++) printf "%s", col[i]
}
And Glenn Jackman's awk (not to pick on him since ALL the awks had the same bad result with many rows):
function join(start, end, result, i) {
for (i=start; i<=end; i++)
result = result $i (i==end ? ORS : FS)
return result
}
{
c=0
for (i=1; i<NF; i+=n) {
c++
col[c] = col[c] join(i, i+n-1)
}
}
END {
for (i=1; i<=c; i++)
printf "%s", col[i] # the value already ends with newline
}
Here are the timings with many columns (i.e., in the Perl script that generates file above, my $cols = 2**20 and my $rows = 2**3):
echo 'glenn jackman awk'
time awk -f col1.awk -v n=2 file >file1
echo 'glenn jackman gawk'
time gawk -f col1.awk -v n=2 file >file5
echo 'perl'
time perl -lan columnize.pl file >file2
echo 'dawg Python'
time python3 col.py file 2 >file3
echo 'dawg awk'
time awk -f col3.awk -v nl=$(awk '{cnt++} END{print cnt}' file) -v cols=2 file >file4
Prints:
# 2**20 COLUMNS; 2**3 ROWS
glenn jackman awk
real 0m4.460s
user 0m4.344s
sys 0m0.113s
glenn jackman gawk
real 0m4.493s
user 0m4.379s
sys 0m0.109s
perl
real 0m3.005s
user 0m2.774s
sys 0m0.230s
dawg Python
real 0m2.871s
user 0m2.721s
sys 0m0.148s
dawg awk
real 0m11.356s
user 0m11.038s
sys 0m0.312s
But transpose the shape of the data by setting my $cols = 2**3 and my $rows = 2**20 and run the same timings:
# 2**3 COLUMNS; 2**20 ROWS
glenn jackman awk
real 23m15.798s
user 16m39.675s
sys 6m35.972s
glenn jackman gawk
real 21m49.645s
user 16m4.449s
sys 5m45.036s
perl
real 0m3.605s
user 0m3.348s
sys 0m0.228s
dawg Python
real 0m3.157s
user 0m3.065s
sys 0m0.080s
dawg awk
real 0m11.117s
user 0m10.710s
sys 0m0.399s
So question:
What would cause the first awk to be 100x slower if the data are transposed to millions of rows vs millions of columns?
It is the same number of elements processed and the same total data. The join function is called the same number of times.
String concatenation being saved in a variable is one of the slowest operations in awk (IIRC it's slower than I/O), because awk constantly has to find a new memory location to hold the result of the concatenation. There's more of that work to do as the accumulated strings get longer, so it's probably all of the string concatenation in the posted solutions that's causing the slowdown.
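A rough way to see this effect in isolation (my own quick test, not from the original discussion; timings will vary by machine and awk version) is to compare growing one ever-larger string against storing each record in its own array element; the first form slows down dramatically as the line count grows:
$ time seq 100000 | awk '{ s = s $0 ORS } END{ print length(s) }'
$ time seq 100000 | awk '{ a[NR] = $0 } END{ print NR }'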
Something like this should be fast and shouldn't be dependent on how many fields there are vs how many records:
$ cat tst.awk
{
for (i=1; i<=NF; i++) {
vals[++numVals] = $i
}
}
END {
for (i=1; i<=numVals; i+=2) {
valNr = i + ((i-1) * NF) # <- not correct, fix it!
print vals[valNr], vals[valNr+1]
}
}
I don't have time right now to figure out the correct math to calculate the index for the single-loop approach above (see the comment in the code), so here's a working version with 2 loops that doesn't require as much thought and shouldn't run much, if any, slower:
$ cat tst.awk
{
for (i=1; i<=NF; i++) {
vals[++numVals] = $i
}
}
END {
inc = NF - 1
for (i=0; i<NF; i+=2) {
for (j=1; j<=NR; j++) {
valNr = i + j + ((j-1) * inc)
print vals[valNr], vals[valNr+1]
}
}
}
$ awk -f tst.awk file
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2
6 7
4 2
6 2
7 1
9 0
a 2
3 2
9 8
9 5
4 2
5 s
2 2
5 6
3 4
1 4
4 8
4 g
5 3
3 4
4 1
d f
5 9
q w
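To make the valNr arithmetic concrete (my own worked example, not part of the answer): with the 8x8 sample, NF=8 and inc=7, and vals[] holds row 1 in slots 1-8, row 2 in slots 9-16, and so on. For the first column pair (i=0), j=1 gives valNr = 0+1+0 = 1 (row 1, fields 1-2) and j=2 gives valNr = 0+2+7 = 9 (row 2, fields 1-2). For the second pair (i=2), j=1 gives 2+1+0 = 3 (row 1, fields 3-4) and j=2 gives 2+2+7 = 11 (row 2, fields 3-4), which is exactly the walking order shown in the output above.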
A play with strings:
$ awk '
{
a[NR]=$0 # hash rows to a
c[NR]=1 # index pointer
}
END {
b=4 # buffer size for match
i=NR # row count
n=length(a[1]) # process til the first row is done
while(c[1]<n)
for(j=1;j<=i;j++) { # of each row
match(substr(a[j],c[j],b),/([^ ]+ ?){2}/) # read 2 fields and separators
print substr(a[j],c[j],RLENGTH) # output them
c[j]+=RLENGTH # increase index pointer
}
}' file
b=4 is a buffer size that is optimal for 2 single-digit fields and 2 single-space separators (a b ), as in the original question; if the data is real-world data, b should be set to something more suitable. Omitting it, so that the match line becomes match(substr(a[j],c[j]),/([^ ]+ ?){2}/), kills the performance for data with lots of columns.
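To see concretely what the match() call consumes on the first row of the sample (my own illustration): the matched length covers two fields plus their trailing separators, which is what gets printed and added to the index pointer:
$ echo '4 5 6 2 9 8 4 8' | awk '{ match(substr($0,1,4), /([^ ]+ ?){2}/); print RSTART, RLENGTH, "[" substr($0,1,RLENGTH) "]" }'
1 4 [4 5 ]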
I got times around 8 seconds for datasets of sizes 2**20 x 2**3 and 2**3 x 2**20.
Ed Morton's approach did fix the speed issue.
Here is the awk I wrote that supports variable columns:
$ cat col.awk
{
for (i=1; i<=NF; i++) {
vals[++numVals] = $i
}
}
END {
for(col_offset=0; col_offset + cols <= NF; col_offset+=cols) {
for (i=1; i<=numVals; i+=NF) {
for(j=0; j<cols; j++) {
printf "%s%s", vals[i+j+col_offset], (j<cols-1 ? FS : ORS)
}
}
}
}
$ time awk -f col.awk -v cols=2 file >file.cols
real 0m5.810s
user 0m5.468s
sys 0m0.339s
This is about 6 seconds for datasets of sizes 2**20 x 2**3 and 2**3 x 2**20.
But MAN it sure is nice to have strong support of arrays of arrays (such as in Perl or Python...)
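As an aside (my own gawk-only sketch, not one of the timed answers above): GNU awk 4+ does have true arrays of arrays, which keeps the indexing trivial, though I haven't benchmarked it against the solutions above:
$ gawk '{ for (i=1; i<=NF; i++) m[NR][i] = $i }     # store the matrix as m[row][col]
        END { for (i=1; i<=NF; i+=2)                # walk 2-column chunks
                  for (r=1; r<=NR; r++)
                      print m[r][i], m[r][i+1] }' file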
Consider the following (sorted) file test.txt where in the first column a occurs 3 times, b occurs once, c occurs 2 times and d occurs 4 times.
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
I would like to split this file into smaller files with a maximum of 4 lines each. However, I need to retain the groups, meaning that all lines that start with the same value in column $1 need to be in the same file. The size of a group is, in this example, never larger than the desired output length.
The expected output would be:
file1:
a 1
a 2
a 1
b 1
file2:
c 1
c 1
file3:
d 2
d 1
d 2
d 1
From the expected output, you can see that if two or more groups together have no more than the maximum number of lines (here 4), they should go into the same file.
Therefore: a + b together have 4 entries and they can go into the same file. However, c + d together have 6 entries, so c has to go into its own file.
I am aware of this Awk oneliner:
awk '{print>$1".test"}' test.txt
But this results in a separate file for each group. This would not make much sense in the real-world problem that I am facing since it would lead to a lot of files being transferred to the HPC and back and making the overhead too intense.
A bash solution would be preferred. But it could also be Python.
Another awk. Had a busy day and this is only tested with your sample data so anything could happen. It creates files named filen.txt, where n>0:
$ awk -v n=4 '
BEGIN {
fc=1 # file numbering initialized
}
{
if($1==p||FNR==1) # when $1 remains same
b=b (++cc==1?"":ORS) $0 # keep buffering
else {
if(n-(cc+cp)>=0) { # if room in previous file
print b >> sprintf("file%d.txt",fc) # append to it
cp+=cc
} else { # if it just won't fit
close(sprintf("file%d.txt",fc))
print b > sprintf("file%d.txt",++fc) # create new
cp=cc
}
b=$0
cc=1
}
p=$1
}
END { # same as the else above
if(n-(cc+cp)>=0)
print b >> sprintf("file%d.txt",fc)
else {
close(sprintf("file%d.txt",fc))
print b > sprintf("file%d.txt",++fc)
}
}' file
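A quick sanity check of the result (my own verification step, run in a clean directory with the question's sample saved as file): the per-file line counts should come out as 4, 2 and 4:
$ wc -l file?.txt
 4 file1.txt
 2 file2.txt
 4 file3.txt
10 total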
I hope I have understood your requirement correctly. Could you please try the following, written and tested with GNU awk.
awk -v count="1" '
FNR==NR{
max[$1]++
if(!a[$1]++){
first[++count2]=$1
}
next
}
FNR==1{
for(i in max){
maxtill=(max[i]>maxtill?max[i]:maxtill)
}
prev=$1
}
{
if(!b[$1]++){++count1};
c[$1]++
if(prev!=$1 && prev){
if((maxtill-currentFill)<max[$1]){count++}
else if(maxtill==max[$1]) {count++}
}
else if(prev==$1 && c[$1]==maxtill && count1<count2){
count++
}
else if(c[$1]==maxtill && prev==$1){
if(max[first[count1+1]]>(maxtill-c[$1])){ count++ }
}
prev=$1
outputFile="outfile"count
print > (outputFile)
currentFill=currentFill==maxtill?1:++currentFill
}
' Input_file Input_file
Testing the above solution with the OP's sample Input_file:
cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
It will create 3 output files named outfile1, outfile2 and outfile3, as follows.
cat outfile1
a 1
a 2
a 1
b 1
cat outfile2
c 1
c 1
cat outfile3
d 2
d 1
d 2
d 1
Second test (with my own custom sample): let's say the following is Input_file.
cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
d 4
d 5
When I run the above solution, 2 output files named outfile1 and outfile2 will be created, as follows.
cat outfile1
a 1
a 2
a 1
b 1
c 1
c 1
cat outfile2
d 2
d 1
d 2
d 1
d 4
d 5
This might work for you (GNU sed, bash and csplit):
f(){
local g=$1
shift
while (( $#>1))
do
(($#==2)) && echo $2 && break
(($2-$1==$g)) && echo $2 && shift && continue
(($3-$1==$g)) && echo $3 && shift 2 && continue
(($2-$1<$g)) && (($3-$1>$g)) && echo $2 && shift && continue
set -- $1 ${@:3}
done
}
csplit file $(f 4 $(sed -nE '1=;N;/^(\S+\s).*\n\1/!=;D' file))
This will split file into separate files named xxnn where nn is 00,01,02,...
The sed command produces a list of line numbers that splits the file on change of key.
The function f then rewrites these numbers, grouping them into lengths of 4 or less.
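To see the two stages on the question's test.txt (my own trace; it assumes the function f above has already been defined in the shell): the sed emits the line number of each change of key, and f thins that list down to split points whose groups fit in 4 lines:
$ sed -nE '1=;N;/^(\S+\s).*\n\1/!=;D' test.txt
1
4
5
7
$ f 4 1 4 5 7
5
7
So the final call is equivalent to csplit test.txt 5 7, giving xx00 (lines 1-4), xx01 (lines 5-6) and xx02 (lines 7-10).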
I have a huge file and I need to retrieve specific columns from file1, which has ~200000 rows and ~1000 columns, if they match the list in file2. (I prefer Bash over R.)
for example my dummy data files are as follows,
file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
and file2
sample
s4
s3
s7
s8
My desired output is
gene s3 s4
a 1 2
b 2 3
c 1 1
d 2 2
Likewise, I have 3 different versions of file2, and I have to pick different samples from the same file1 into a new file each time.
I would be very grateful if you guys can provide me with your valuable suggestions.
P.S.: I am a biologist; I have very little coding experience.
Regards
Ateeq
$ cat file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
$ cat file2
gene
s4
s3
s8
s7
$ cat a
awk '
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
' $*
$ ./a file2 file1 | column -t
gene s4 s3 s8 s7
a 2 1 N/A N/A
b 3 2 N/A N/A
c 1 1 N/A N/A
d 2 2 N/A N/A
The above should get you on your way. It's an extremely optimistic program and no negative testing was performed.
Awk is a tool that applies a set of commands to every line of every file that matches an expression. In general, the awk script has the form:
<pattern> <command>
There are three such pairs above. Each needs a little explanation:
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
NR == FNR is an awk'ism. NR is the record number and FNR is the record number in the current file. NR is always increasing but FNR resets to 1 when awk parses the next file. NR==FNR is an idiom that is only true when parsing the first file.
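A tiny demonstration of NR versus FNR across two files (my own example, using two throwaway files):
$ printf 'x\ny\n' > f1; printf 'p\nq\n' > f2
$ awk '{ print FILENAME, NR, FNR }' f1 f2
f1 1 1
f1 2 2
f2 3 1
f2 4 2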
I've designed the awk program to read the columns file first (you are calling this file2). File2 has a list of columns to output. As you can see, we are storing each line in the first file (file2) into an array called columns. We are also printing the columns out as we read them. In order to avoid newlines after each column name (since we want all the column headers to be on the same line), we use printf which doesn't output a newline (as opposed to print which does).
The 'next' at the end of the stanza tells awk to read the next line in the file without processing any of the other stanzas. After all, we just want to read the first file.
In summary, the first stanza remembers the column names (and order) and prints them out on a single line (without a newline).
The second "stanza":
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
FNR==1 will match on the first line of any file. Due to the next in the previous stanza, we'll only hit this stanza when we are on the first line of the second file (file1). The first print "" statement adds the newline that was missing from the first stanza. Now the line with the column headers is complete.
The split command takes the first parameter, $0, the current line and splits it according to whitespace. We know the current line is the first line and has the column headers in it. The split command writes to an array named in the second parameter , headers. Now headers[1] = "gene" and headers[2] = "s4" , headers[3] = "s3", etc.
We're going to need to map the column names to the column numbers. The next bit of code takes each header value and creates an aheaders entry. aheaders is an associative array that maps column header names to the column number.
aheaders["gene"] = 1
aheaders["s1"] = 2
aheaders["s2"] = 3
aheaders["s3"] = 4
aheaders["s4"] = 5
aheaders["s5"] = 6
When we're done making the aheaders array, the next command tells awk to skip to the next line of the input. From this point on, only the third stanza is going to have a true condition.
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
The third stanza has no explicit <pattern>. Awk treats a missing pattern as always true, so this last <command> is executed for every line of the second file.
At this point, we want to print the columns that are specified in the columns array. We walk through each element of the array in order. The first time through the loop, columns[1] = "gene". This gives us:
printf "%s\t" , $aheaders[ "gene" ]
And since aheaders["gene"] = 1 this gives us:
printf "%s\t" , $1
And awk understands $1 to be the first field (or column) in the input line. Thus the first column is passed to printf which outputs the value with a tab (\t) appended.
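In the actual program the 1 comes from an expression ($aheaders[ columns[x] ]), and awk happily applies $ to any expression, not just a literal number. A minimal illustration of that indirection (my own example):
$ echo 'a b c' | awk '{ n = 3; print $n, $(n-1) }'
c b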
The loop then executes another time with x=2 and columns[2]="s4". This results in the following print executing:
printf "%s\t" , $5
This prints the fifth column followed by a tab. The next iteration:
columns[3] = "s3"
aheaders["s3"] = 4
Which results in:
printf "%s\t" , $4
That is, the fourth field is output.
The next iteration we hit a failure situation:
columns[4] = "s8"
aheaders["s8"] = ''
In this case, the length( aheaders[ columns[x] ] ) == 0 is true so we just print out a placeholder - something to tell the operator their input may be invalid:
printf "N/A\t"
The same is output when we process the last columns[x] value "s7".
Now, since there are no more entries in columns, the loop exits and we hit the final print:
print ""
The empty string is provided to print because print by itself defaults to print $0 - the entire line.
At this point, awk reads the next line out of file1 and hits the third stanza again (and only that one). Thus awk continues until the second file is completely read.
Say I've got a file.txt
Position name1 name2 name3
2 A G F
4 G S D
5 L K P
7 G A A
8 O L K
9 E A G
and I need to get the output:
name1 name2 name3
2 2 7
4 7 9
7 9
It outputs each name, and the position numbers where there is an A or G
In file.txt, the name1 column has an A in position 2, G's in positions 4 and 7... therefore in the output file: 2,4,7 is listed under name1
...and so on
Strategy I've devised so far (not very efficient): read each column one at a time, and output the position number when a match occurs. Then I'd get the result for each column and cbind them together using R.
I'm fairly certain there's a better way using awk or bash... ideas appreciated.
$ cat tst.awk
NR==1 {
for (nameNr=2;nameNr<=NF;nameNr++) {
printf "%5s%s", $nameNr, (nameNr<NF?OFS:ORS)
}
next
}
{
for (nameNr=2;nameNr<=NF;nameNr++) {
if ($nameNr ~ /^[AG]$/) {
hits[nameNr,++numHits[nameNr]] = $1
maxHits = (numHits[nameNr] > maxHits ? numHits[nameNr] : maxHits)
}
}
}
END {
for (hitNr=1; hitNr<=maxHits; hitNr++) {
for (nameNr=2;nameNr<=NF;nameNr++) {
printf "%5s%s", hits[nameNr,hitNr], (nameNr<NF?OFS:ORS)
}
}
}
$ awk -f tst.awk file
name1 name2 name3
2 2 7
4 7 9
7 9
Save the below script:
#!/bin/bash
gawk '{if( NR == 1 ) {print $2 >>"name1"; print $3 >>"name2"; print $4>>"name3";}}
{if($2=="A" || $2=="G"){print $1 >> "name1"}}
{if($3=="A" || $3=="G"){print $1 >> "name2"}}
{if($4=="A" || $4=="G"){print $1 >> "name3"}}
END{system("paste name*;rm name*")}' $1
as finder. Make finder executable (using chmod) and then run:
./finder file.txt
Note : I have used three temporary files name1, name2 and name3. You could change the file names at your convenience. Also, these files will be deleted at the end.
Edit : Removed the BEGIN part of the gawk.
How can I read the first n lines and the last n lines of a file?
For n=2, I read online that (head -n2 && tail -n2) would work, but it doesn't.
$ cat x
1
2
3
4
5
$ cat x | (head -n2 && tail -n2)
1
2
The expected output for n=2 would be:
1
2
4
5
head -n2 file && tail -n2 file
Chances are you're going to want something like:
... | awk -v OFS='\n' '{a[NR]=$0} END{print a[1], a[2], a[NR-1], a[NR]}'
or, if you need to specify a number, and taking into account @Wintermute's astute observation that you don't need to buffer the whole file, something like this is what you really want:
... | awk -v n=2 'NR<=n{print;next} {buf[((NR-1)%n)+1]=$0}
END{for (i=1;i<=n;i++) print buf[((NR+i-1)%n)+1]}'
I think the math is correct on that - hopefully you get the idea: use a rotating buffer indexed by NR modded by the size of the buffer, adjusted to use indices in the range 1-n instead of 0-(n-1).
To help with comprehension of the modulus operator used in the indexing above, here is an example with intermediate print statements to show the logic as it executes:
$ cat file
1
2
3
4
5
6
7
8
.
$ cat tst.awk
BEGIN {
print "Populating array by index ((NR-1)%n)+1:"
}
{
buf[((NR-1)%n)+1] = $0
printf "NR=%d, n=%d: ((NR-1 = %d) %%n = %d) +1 = %d -> buf[%d] = %s\n",
NR, n, NR-1, (NR-1)%n, ((NR-1)%n)+1, ((NR-1)%n)+1, buf[((NR-1)%n)+1]
}
END {
print "\nAccessing array by index ((NR+i-1)%n)+1:"
for (i=1;i<=n;i++) {
printf "NR=%d, i=%d, n=%d: (((NR+i = %d) - 1 = %d) %%n = %d) +1 = %d -> buf[%d] = %s\n",
NR, i, n, NR+i, NR+i-1, (NR+i-1)%n, ((NR+i-1)%n)+1, ((NR+i-1)%n)+1, buf[((NR+i-1)%n)+1]
}
}
$
$ awk -v n=3 -f tst.awk file
Populating array by index ((NR-1)%n)+1:
NR=1, n=3: ((NR-1 = 0) %n = 0) +1 = 1 -> buf[1] = 1
NR=2, n=3: ((NR-1 = 1) %n = 1) +1 = 2 -> buf[2] = 2
NR=3, n=3: ((NR-1 = 2) %n = 2) +1 = 3 -> buf[3] = 3
NR=4, n=3: ((NR-1 = 3) %n = 0) +1 = 1 -> buf[1] = 4
NR=5, n=3: ((NR-1 = 4) %n = 1) +1 = 2 -> buf[2] = 5
NR=6, n=3: ((NR-1 = 5) %n = 2) +1 = 3 -> buf[3] = 6
NR=7, n=3: ((NR-1 = 6) %n = 0) +1 = 1 -> buf[1] = 7
NR=8, n=3: ((NR-1 = 7) %n = 1) +1 = 2 -> buf[2] = 8
Accessing array by index ((NR+i-1)%n)+1:
NR=8, i=1, n=3: (((NR+i = 9) - 1 = 8) %n = 2) +1 = 3 -> buf[3] = 6
NR=8, i=2, n=3: (((NR+i = 10) - 1 = 9) %n = 0) +1 = 1 -> buf[1] = 7
NR=8, i=3, n=3: (((NR+i = 11) - 1 = 10) %n = 1) +1 = 2 -> buf[2] = 8
This might work for you (GNU sed):
sed -n ':a;N;s/[^\n]*/&/2;Ta;2p;$p;D' file
This keeps a window of 2 lines (replace the 2's with n), then prints the first 2 lines and, at end of file, prints the window, i.e. the last 2 lines.
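Applied to the question's 5-line sample (my own run of the answer as written), it should produce:
$ seq 5 | sed -n ':a;N;s/[^\n]*/&/2;Ta;2p;$p;D'
1
2
4
5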
Here's a GNU sed one-liner that prints the first 10 and last 10 lines:
gsed -ne'1,10{p;b};:a;$p;N;21,$D;ba'
If you want to print a '--' separator between them:
gsed -ne'1,9{p;b};10{x;s/$/--/;x;G;p;b};:a;$p;N;21,$D;ba'
If you're on a Mac and don't have GNU sed, you can't condense as much:
sed -ne'1,9{' -e'p;b' -e'}' -e'10{' -e'x;s/$/--/;x;G;p;b' -e'}' -e':a' -e'$p;N;21,$D;ba'
Explanation
gsed -ne' invoke sed without automatic printing pattern space
-e'1,9{p;b}' print the first 9 lines
-e'10{x;s/$/--/;x;G;p;b}' print line 10 with an appended '--' separator
-e':a;$p;N;21,$D;ba' print the last 10 lines
If you are using a shell that supports process substitution, another way to accomplish this is to write to multiple processes, one for head and one for tail. Suppose for this example your input comes from a pipe feeding you content of unknown length. You want to use just the first 5 lines and the last 10 lines and pass them on to another pipe:
cat | { tee >(head -5) >(tail -10) 1>/dev/null; } | cat
The use of { } collects the output from inside the group (there will be two different programs writing to stdout inside the process substitutions). The 1>/dev/null gets rid of the extra copy tee would otherwise write to its own stdout.
That demonstrates the concept and all the moving parts, but it can be simplified a little in practice by using the STDOUT stream of tee instead of discarding it. Note the command grouping is still necessary here to pass the output on through the next pipe!
cat | { tee >(head -5) | tail -15; } | cat
Obviously replace cat in the pipeline with whatever you are actually doing. If your command can handle writing the same content to multiple files, you could eliminate the use of tee entirely, as well as the monkeying with STDOUT. Say you have a command that accepts multiple -o output file name flags:
{ mycommand -o >(head -5) -o >(tail -10); } | cat
awk -v n=4 'NR<=n; {b = b "\n" $0} NR>=n {sub(/[^\n]*\n/,"",b)} END {print b}'
The first n lines are covered by NR<=n;. For the last n lines, we just keep track of a buffer holding the latest n lines, repeatedly adding one to the end and removing one from the front (after the first n).
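For a concrete check (my own example): with n=4 on a 10-line input, the output should be lines 1-4 followed by lines 7-10:
$ seq 10 | awk -v n=4 'NR<=n; {b = b "\n" $0} NR>=n {sub(/[^\n]*\n/,"",b)} END {print b}'
1
2
3
4
7
8
9
10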
It's possible to do it more efficiently, with an array of lines instead of a single buffer, but even with gigabytes of input, you'd probably waste more in brain time writing it out than you'd save in computer time by running it.
ETA: Because the above timing estimate provoked some discussion in (now deleted) comments, I'll add anecdata from having tried that out.
With a huge file (100M lines, 3.9 GiB, n=5) it took 454 seconds, compared to @EdMorton's line-buffer solution, which executed in only 30 seconds. With more modest inputs ("mere" millions of lines) the ratio is similar: 4.7 seconds vs. 0.53 seconds.
Almost all of that additional time in this solution seems to be spent in the sub() function; a tiny fraction also comes from string concatenation being slower than just replacing an array member.
Use GNU parallel. To print the first three lines and the last three lines:
parallel {} -n 3 file ::: head tail
Based on dcaswell's answer, the following sed script prints the first and last 10 lines of a file:
# Make a test file first
testit=$(mktemp -u)
seq 1 100 > $testit
# This sed script:
sed -n ':a;1,10h;N;${x;p;i\
-----
;x;p};11,$D;ba' $testit
rm $testit
Yields this:
1
2
3
4
5
6
7
8
9
10
-----
90
91
92
93
94
95
96
97
98
99
100
Here is another awk script. It assumes there might be an overlap of head and tail.
File script.awk
BEGIN {range = 3} # Define the head and tail range
NR <= range {print} # Output the head; for the first lines in range
{ arr[NR % range] = $0} # Store the current line in a rotating array
END { # Last line reached
for (row = NR - range + 1; row <= NR; row++) { # Reread the last range lines from array
print arr[row % range];
}
}
Running the script
seq 1 7 | awk -f script.awk
Output
1
2
3
5
6
7
For overlapping head and tail:
seq 1 5 |awk -f script.awk
1
2
3
3
4
5
Print the first and last n lines
For n=1:
seq 1 10 | sed '1p;$!d'
Output:
1
10
For n=2:
seq 1 10 | sed '1,2P;$!N;$!D'
Output:
1
2
9
10
For n>=3, use this generic sed expression (with n substituted):
':a;$q;N;(n+1),(n*2)P;(n+1),$D;ba'
For n=3:
seq 1 10 | sed ':a;$q;N;4,6P;4,$D;ba'
Output:
1
2
3
8
9
10