Bash tool to get nth line from a file - bash

Is there a "canonical" way of doing that? I've been using head -n | tail -1 which does the trick, but I've been wondering if there's a Bash tool that specifically extracts a line (or a range of lines) from a file.
By "canonical" I mean a program whose main function is doing that.

head and pipe with tail will be slow for a huge file. I would suggest sed like this:
sed 'NUMq;d' file
Where NUM is the number of the line you want to print; so, for example, sed '10q;d' file to print the 10th line of file.
Explanation:
NUMq will quit immediately when the line number is NUM.
d will delete the line instead of printing it; this is inhibited on the last line because the q causes the rest of the script to be skipped when quitting.
If you have NUM in a variable, you will want to use double quotes instead of single:
sed "${NUM}q;d" file

sed -n '2p' < file.txt
will print 2nd line
sed -n '2011p' < file.txt
2011th line
sed -n '10,33p' < file.txt
line 10 up to line 33
sed -n '1p;3p' < file.txt
1st and 3th line
and so on...
For adding lines with sed, you can check this:
sed: insert a line in a certain position

I have a unique situation where I can benchmark the solutions proposed on this page, and so I'm writing this answer as a consolidation of the proposed solutions with included run times for each.
Set Up
I have a 3.261 gigabyte ASCII text data file with one key-value pair per row. The file contains 3,339,550,320 rows in total and defies opening in any editor I have tried, including my go-to Vim. I need to subset this file in order to investigate some of the values that I've discovered only start around row ~500,000,000.
Because the file has so many rows:
I need to extract only a subset of the rows to do anything useful with the data.
Reading through every row leading up to the values I care about is going to take a long time.
If the solution reads past the rows I care about and continues reading the rest of the file it will waste time reading almost 3 billion irrelevant rows and take 6x longer than necessary.
My best-case-scenario is a solution that extracts only a single line from the file without reading any of the other rows in the file, but I can't think of how I would accomplish this in Bash.
For the purposes of my sanity I'm not going to be trying to read the full 500,000,000 lines I'd need for my own problem. Instead I'll be trying to extract row 50,000,000 out of 3,339,550,320 (which means reading the full file will take 60x longer than necessary).
I will be using the time built-in to benchmark each command.
Baseline
First let's see how the head tail solution:
$ time head -50000000 myfile.ascii | tail -1
pgm_icnt = 0
real 1m15.321s
The baseline for row 50 million is 00:01:15.321, if I'd gone straight for row 500 million it'd probably be ~12.5 minutes.
cut
I'm dubious of this one, but it's worth a shot:
$ time cut -f50000000 -d$'\n' myfile.ascii
pgm_icnt = 0
real 5m12.156s
This one took 00:05:12.156 to run, which is much slower than the baseline! I'm not sure whether it read through the entire file or just up to line 50 million before stopping, but regardless this doesn't seem like a viable solution to the problem.
AWK
I only ran the solution with the exit because I wasn't going to wait for the full file to run:
$ time awk 'NR == 50000000 {print; exit}' myfile.ascii
pgm_icnt = 0
real 1m16.583s
This code ran in 00:01:16.583, which is only ~1 second slower, but still not an improvement on the baseline. At this rate if the exit command had been excluded it would have probably taken around ~76 minutes to read the entire file!
Perl
I ran the existing Perl solution as well:
$ time perl -wnl -e '$.== 50000000 && print && exit;' myfile.ascii
pgm_icnt = 0
real 1m13.146s
This code ran in 00:01:13.146, which is ~2 seconds faster than the baseline. If I'd run it on the full 500,000,000 it would probably take ~12 minutes.
sed
The top answer on the board, here's my result:
$ time sed "50000000q;d" myfile.ascii
pgm_icnt = 0
real 1m12.705s
This code ran in 00:01:12.705, which is 3 seconds faster than the baseline, and ~0.4 seconds faster than Perl. If I'd run it on the full 500,000,000 rows it would have probably taken ~12 minutes.
mapfile
I have bash 3.1 and therefore cannot test the mapfile solution.
Conclusion
It looks like, for the most part, it's difficult to improve upon the head tail solution. At best the sed solution provides a ~3% increase in efficiency.
(percentages calculated with the formula % = (runtime/baseline - 1) * 100)
Row 50,000,000
00:01:12.705 (-00:00:02.616 = -3.47%) sed
00:01:13.146 (-00:00:02.175 = -2.89%) perl
00:01:15.321 (+00:00:00.000 = +0.00%) head|tail
00:01:16.583 (+00:00:01.262 = +1.68%) awk
00:05:12.156 (+00:03:56.835 = +314.43%) cut
Row 500,000,000
00:12:07.050 (-00:00:26.160) sed
00:12:11.460 (-00:00:21.750) perl
00:12:33.210 (+00:00:00.000) head|tail
00:12:45.830 (+00:00:12.620) awk
00:52:01.560 (+00:40:31.650) cut
Row 3,338,559,320
01:20:54.599 (-00:03:05.327) sed
01:21:24.045 (-00:02:25.227) perl
01:23:49.273 (+00:00:00.000) head|tail
01:25:13.548 (+00:02:35.735) awk
05:47:23.026 (+04:24:26.246) cut

With awk it is pretty fast:
awk 'NR == num_line' file
When this is true, the default behaviour of awk is performed: {print $0}.
Alternative versions
If your file happens to be huge, you'd better exit after reading the required line. This way you save CPU time See time comparison at the end of the answer.
awk 'NR == num_line {print; exit}' file
If you want to give the line number from a bash variable you can use:
awk 'NR == n' n=$num file
awk -v n=$num 'NR == n' file # equivalent
See how much time is saved by using exit, specially if the line happens to be in the first part of the file:
# Let's create a 10M lines file
for ((i=0; i<100000; i++)); do echo "bla bla"; done > 100Klines
for ((i=0; i<100; i++)); do cat 100Klines; done > 10Mlines
$ time awk 'NR == 1234567 {print}' 10Mlines
bla bla
real 0m1.303s
user 0m1.246s
sys 0m0.042s
$ time awk 'NR == 1234567 {print; exit}' 10Mlines
bla bla
real 0m0.198s
user 0m0.178s
sys 0m0.013s
So the difference is 0.198s vs 1.303s, around 6x times faster.

According to my tests, in terms of performance and readability my recommendation is:
tail -n+N | head -1
N is the line number that you want. For example, tail -n+7 input.txt | head -1 will print the 7th line of the file.
tail -n+N will print everything starting from line N, and head -1 will make it stop after one line.
The alternative head -N | tail -1 is perhaps slightly more readable. For example, this will print the 7th line:
head -7 input.txt | tail -1
When it comes to performance, there is not much difference for smaller sizes, but it will be outperformed by the tail | head (from above) when the files become huge.
The top-voted sed 'NUMq;d' is interesting to know, but I would argue that it will be understood by fewer people out of the box than the head/tail solution and it is also slower than tail/head.
In my tests, both tails/heads versions outperformed sed 'NUMq;d' consistently. That is in line with the other benchmarks that were posted. It is hard to find a case where tails/heads was really bad. It is also not surprising, as these are operations that you would expect to be heavily optimized in a modern Unix system.
To get an idea about the performance differences, these are the number that I get for a huge file (9.3G):
tail -n+N | head -1: 3.7 sec
head -N | tail -1: 4.6 sec
sed Nq;d: 18.8 sec
Results may differ, but the performance head | tail and tail | head is, in general, comparable for smaller inputs, and sed is always slower by a significant factor (around 5x or so).
To reproduce my benchmark, you can try the following, but be warned that it will create a 9.3G file in the current working directory:
#!/bin/bash
readonly file=tmp-input.txt
readonly size=1000000000
readonly pos=500000000
readonly retries=3
seq 1 $size > $file
echo "*** head -N | tail -1 ***"
for i in $(seq 1 $retries) ; do
time head "-$pos" $file | tail -1
done
echo "-------------------------"
echo
echo "*** tail -n+N | head -1 ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
time tail -n+$pos $file | head -1
done
echo "-------------------------"
echo
echo "*** sed Nq;d ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
time sed $pos'q;d' $file
done
/bin/rm $file
Here is the output of a run on my machine (ThinkPad X1 Carbon with an SSD and 16G of memory). I assume in the final run everything will come from the cache, not from disk:
*** head -N | tail -1 ***
500000000
real 0m9,800s
user 0m7,328s
sys 0m4,081s
500000000
real 0m4,231s
user 0m5,415s
sys 0m2,789s
500000000
real 0m4,636s
user 0m5,935s
sys 0m2,684s
-------------------------
*** tail -n+N | head -1 ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:49 tmp-input.txt
500000000
real 0m6,452s
user 0m3,367s
sys 0m1,498s
500000000
real 0m3,890s
user 0m2,921s
sys 0m0,952s
500000000
real 0m3,763s
user 0m3,004s
sys 0m0,760s
-------------------------
*** sed Nq;d ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:50 tmp-input.txt
500000000
real 0m23,675s
user 0m21,557s
sys 0m1,523s
500000000
real 0m20,328s
user 0m18,971s
sys 0m1,308s
500000000
real 0m19,835s
user 0m18,830s
sys 0m1,004s

Wow, all the possibilities!
Try this:
sed -n "${lineNum}p" $file
or one of these depending upon your version of Awk:
awk -vlineNum=$lineNum 'NR == lineNum {print $0}' $file
awk -v lineNum=4 '{if (NR == lineNum) {print $0}}' $file
awk '{if (NR == lineNum) {print $0}}' lineNum=$lineNum $file
(You may have to try the nawk or gawk command).
Is there a tool that only does the print that particular line? Not one of the standard tools. However, sed is probably the closest and simplest to use.

Save two keystrokes, print Nth line without using bracket:
sed -n Np <fileName>
^ ^
\ \___ 'p' for printing
\______ '-n' for not printing by default
For example, to print 100th line:
sed -n 100p foo.txt

This question being tagged Bash, here's the Bash (≥4) way of doing: use mapfile with the -s (skip) and -n (count) option.
If you need to get the 42nd line of a file file:
mapfile -s 41 -n 1 ary < file
At this point, you'll have an array ary the fields of which containing the lines of file (including the trailing newline), where we have skipped the first 41 lines (-s 41), and stopped after reading one line (-n 1). So that's really the 42nd line. To print it out:
printf '%s' "${ary[0]}"
If you need a range of lines, say the range 42–666 (inclusive), and say you don't want to do the math yourself, and print them on stdout:
mapfile -s $((42-1)) -n $((666-42+1)) ary < file
printf '%s' "${ary[#]}"
If you need to process these lines too, it's not really convenient to store the trailing newline. In this case use the -t option (trim):
mapfile -t -s $((42-1)) -n $((666-42+1)) ary < file
# do stuff
printf '%s\n' "${ary[#]}"
You can have a function do that for you:
print_file_range() {
# $1-$2 is the range of file $3 to be printed to stdout
local ary
mapfile -s $(($1-1)) -n $(($2-$1+1)) ary < "$3"
printf '%s' "${ary[#]}"
}
No external commands, only Bash builtins!

You may also used sed print and quit:
sed -n '10{p;q;}' file # print line 10

You can also use Perl for this:
perl -wnl -e '$.== NUM && print && exit;' some.file

As a followup to CaffeineConnoisseur's very helpful benchmarking answer... I was curious as to how fast the 'mapfile' method was compared to others (as that wasn't tested), so I tried a quick-and-dirty speed comparison myself as I do have bash 4 handy. Threw in a test of the "tail | head" method (rather than head | tail) mentioned in one of the comments on the top answer while I was at it, as folks are singing its praises. I don't have anything nearly the size of the testfile used; the best I could find on short notice was a 14M pedigree file (long lines that are whitespace-separated, just under 12000 lines).
Short version: mapfile appears faster than the cut method, but slower than everything else, so I'd call it a dud. tail | head, OTOH, looks like it could be the fastest, although with a file this size the difference is not all that substantial compared to sed.
$ time head -11000 [filename] | tail -1
[output redacted]
real 0m0.117s
$ time cut -f11000 -d$'\n' [filename]
[output redacted]
real 0m1.081s
$ time awk 'NR == 11000 {print; exit}' [filename]
[output redacted]
real 0m0.058s
$ time perl -wnl -e '$.== 11000 && print && exit;' [filename]
[output redacted]
real 0m0.085s
$ time sed "11000q;d" [filename]
[output redacted]
real 0m0.031s
$ time (mapfile -s 11000 -n 1 ary < [filename]; echo ${ary[0]})
[output redacted]
real 0m0.309s
$ time tail -n+11000 [filename] | head -n1
[output redacted]
real 0m0.028s
Hope this helps!

The fastest solution for big files is always tail|head, provided that the two distances:
from the start of the file to the starting line. Lets call it S
the distance from the last line to the end of the file. Be it E
are known. Then, we could use this:
mycount="$E"; (( E > S )) && mycount="+$S"
howmany="$(( endline - startline + 1 ))"
tail -n "$mycount"| head -n "$howmany"
howmany is just the count of lines required.
Some more detail in https://unix.stackexchange.com/a/216614/79743

All the above answers directly answer the question. But here's a less direct solution but a potentially more important idea, to provoke thought.
Since line lengths are arbitrary, all the bytes of the file before the nth line need to be read. If you have a huge file or need to repeat this task many times, and this process is time-consuming, then you should seriously think about whether you should be storing your data in a different way in the first place.
The real solution is to have an index, e.g. at the start of the file, indicating the positions where the lines begin. You could use a database format, or just add a table at the start of the file. Alternatively create a separate index file to accompany your large text file.
e.g. you might create a list of character positions for newlines:
awk 'BEGIN{c=0;print(c)}{c+=length()+1;print(c+1)}' file.txt > file.idx
then read with tail, which actually seeks directly to the appropriate point in the file!
e.g. to get line 1000:
tail -c +$(awk 'NR=1000' file.idx) file.txt | head -1
This may not work with 2-byte / multibyte characters, since awk is "character-aware" but tail is not.
I haven't tested this against a large file.
Also see this answer.
Alternatively - split your file into smaller files!

If you got multiple lines by delimited by \n (normally new line). You can use 'cut' as well:
echo "$data" | cut -f2 -d$'\n'
You will get the 2nd line from the file. -f3 gives you the 3rd line.

Using what others mentioned, I wanted this to be a quick & dandy function in my bash shell.
Create a file: ~/.functions
Add to it the contents:
getline() {
line=$1
sed $line'q;d' $2
}
Then add this to your ~/.bash_profile:
source ~/.functions
Now when you open a new bash window, you can just call the function as so:
getline 441 myfile.txt

Lots of good answers already. I personally go with awk. For convenience, if you use bash, just add the below to your ~/.bash_profile. And, the next time you log in (or if you source your .bash_profile after this update), you will have a new nifty "nth" function available to pipe your files through.
Execute this or put it in your ~/.bash_profile (if using bash) and reopen bash (or execute source ~/.bach_profile)
# print just the nth piped in line
nth () { awk -vlnum=${1} 'NR==lnum {print; exit}'; }
Then, to use it, simply pipe through it. E.g.,:
$ yes line | cat -n | nth 5
5 line

To print nth line using sed with a variable as line number:
a=4
sed -e $a'q:d' file
Here the '-e' flag is for adding script to command to be executed.

After taking a look at the top answer and the benchmark, I've implemented a tiny helper function:
function nth {
if (( ${#} < 1 || ${#} > 2 )); then
echo -e "usage: $0 \e[4mline\e[0m [\e[4mfile\e[0m]"
return 1
fi
if (( ${#} > 1 )); then
sed "$1q;d" $2
else
sed "$1q;d"
fi
}
Basically you can use it in two fashions:
nth 42 myfile.txt
do_stuff | nth 42

This is not a bash solution, but I found out that top choices didn't satisfy my needs, eg,
sed 'NUMq;d' file
was fast enough, but was hanging for hours and did not tell about any progress. I suggest compiling this cpp program and using it to find the row you want. You can compile it with g++ main.cpp, where main.cpp is the file with the content below. I got a.out and executed it with ./a.out
#include <iostream>
#include <string>
#include <fstream>
using namespace std;
int main() {
string filename;
cout << "Enter filename ";
cin >> filename;
int needed_row_number;
cout << "Enter row number ";
cin >> needed_row_number;
int progress_line_count;
cout << "Enter at which every number of rows to monitor progress ";
cin >> progress_line_count;
char ch;
int row_counter = 1;
fstream fin(filename, fstream::in);
while (fin >> noskipws >> ch) {
int ch_int = (int) ch;
if (row_counter == needed_row_number) {
cout << ch;
}
if (ch_int == 10) {
if (row_counter == needed_row_number) {
return 0;
}
row_counter++;
if (row_counter % progress_line_count == 0) {
cout << "Progress: line " << row_counter << endl;
}
}
}
return 0;
}

To get an nth line (single line)
If you want something that you can later customize without having to deal with bash you can compile this c program and drop the binary in your custom binaries directory. This assumes that you know how to edit the .bashrc file
accordingly (only if you want to edit your path variable), If you don't know, this is a helpful link.
To run this code use (assuming you named the binary "line").
line [target line] [target file]
example
line 2 somefile.txt
The code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main(int argc, char* argv[]){
if(argc != 3){
fprintf(stderr, "line needs a line number and a file name");
exit(0);
}
int lineNumber = atoi(argv[1]);
int counter = 0;
char *fileName = argv[2];
FILE *fileReader = fopen(fileName, "r");
if(fileReader == NULL){
fprintf(stderr, "Failed to open file");
exit(0);
}
size_t lineSize = 0;
char* line = NULL;
while(counter < lineNumber){
getline(&line, &linesize, fileReader);
counter++
}
getline(&line, &lineSize, fileReader);
printf("%s\n", line);
fclose(fileReader);
return 0;
}
EDIT: removed the fseek and replaced it with a while loop

I've put some of the above answers into a short bash script that you can put into a file called get.sh and link to /usr/local/bin/get (or whatever other name you prefer).
#!/bin/bash
if [ "${1}" == "" ]; then
echo "error: blank line number";
exit 1
fi
re='^[0-9]+$'
if ! [[ $1 =~ $re ]] ; then
echo "error: line number arg not a number";
exit 1
fi
if [ "${2}" == "" ]; then
echo "error: blank file name";
exit 1
fi
sed "${1}q;d" $2;
exit 0
Ensure it's executable with
$ chmod +x get
Link it to make it available on the PATH with
$ ln -s get.sh /usr/local/bin/get

UPDATE 1 : found much faster method in awk
just 5.353 secs to obtain a row above 133.6 mn :
rownum='133668997'; ( time ( pvE0 < ~/master_primelist_18a.txt |
LC_ALL=C mawk2 -F'^$' -v \_="${rownum}" -- '!_{exit}!--_' ) )
in0: 5.45GiB 0:00:05 [1.02GiB/s] [1.02GiB/s] [======> ] 71%
( pvE 0.1 in0 < ~/master_primelist_18a.txt |
LC_ALL=C mawk2 -F'^$' -v -- ; ) 5.01s user
1.21s system 116% cpu 5.353 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
===============================================
I'd like to contest the notion that perl is faster than awk :
so while my test file isn't nearly quite as many rows, it's also twice the size, at 7.58 GB -
I even gave perl some built-in advantageous - like hard-coding in the row number, and also going second, thus gaining any potential speedups from OS caching mechanism, if any
f="$( grealpath -ePq ~/master_primelist_18a.txt )"
rownum='133668997'
fg;fg; pv < "${f}" | gwc -lcm
echo; sleep 2;
echo;
( time ( pv -i 0.1 -cN in0 < "${f}" |
LC_ALL=C mawk2 '_{exit}_=NR==+__' FS='^$' __="${rownum}"
) ) | mawk 'BEGIN { print } END { print _ } NR'
sleep 2
( time ( pv -i 0.1 -cN in0 < "${f}" |
LC_ALL=C perl -wnl -e '$.== 133668997 && print && exit;'
) ) | mawk 'BEGIN { print } END { print _ } NR' ;
fg: no current job
fg: no current job
7.58GiB 0:00:28 [ 275MiB/s] [============>] 100%
148,110,134 8,134,435,629 8,134,435,629 <<<< rows, chars, and bytes
count as reported by gnu-wc
in0: 5.45GiB 0:00:07 [ 701MiB/s] [=> ] 71%
( pv -i 0.1 -cN in0 < "${f}" | LC_ALL=C mawk2 '_{exit}_=NR==+__' FS='^$' ; )
6.22s user 2.56s system 110% cpu 7.966 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
in0: 5.45GiB 0:00:17 [ 328MiB/s] [=> ] 71%
( pv -i 0.1 -cN in0 < "${f}" | LC_ALL=C perl -wnl -e ; )
14.22s user 3.31s system 103% cpu 17.014 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
I can re-run the test with perl 5.36 or even perl-6 if u think it's gonna make a difference (haven't installed either), but a gap of
7.966 secs (mawk2) vs. 17.014 secs (perl 5.34)
between the two, with the latter more than double the prior, it seems clear which one is indeed meaningfully faster to fetch a single row way deep in ASCII files.
This is perl 5, version 34, subversion 0 (v5.34.0) built for darwin-thread-multi-2level
Copyright 1987-2021, Larry Wall
mawk 1.9.9.6, 21 Aug 2016, Copyright Michael D. Brennan

Related

How to extract the third largest file size in Unix [duplicate]

Is there a "canonical" way of doing that? I've been using head -n | tail -1 which does the trick, but I've been wondering if there's a Bash tool that specifically extracts a line (or a range of lines) from a file.
By "canonical" I mean a program whose main function is doing that.
head and pipe with tail will be slow for a huge file. I would suggest sed like this:
sed 'NUMq;d' file
Where NUM is the number of the line you want to print; so, for example, sed '10q;d' file to print the 10th line of file.
Explanation:
NUMq will quit immediately when the line number is NUM.
d will delete the line instead of printing it; this is inhibited on the last line because the q causes the rest of the script to be skipped when quitting.
If you have NUM in a variable, you will want to use double quotes instead of single:
sed "${NUM}q;d" file
sed -n '2p' < file.txt
will print 2nd line
sed -n '2011p' < file.txt
2011th line
sed -n '10,33p' < file.txt
line 10 up to line 33
sed -n '1p;3p' < file.txt
1st and 3th line
and so on...
For adding lines with sed, you can check this:
sed: insert a line in a certain position
I have a unique situation where I can benchmark the solutions proposed on this page, and so I'm writing this answer as a consolidation of the proposed solutions with included run times for each.
Set Up
I have a 3.261 gigabyte ASCII text data file with one key-value pair per row. The file contains 3,339,550,320 rows in total and defies opening in any editor I have tried, including my go-to Vim. I need to subset this file in order to investigate some of the values that I've discovered only start around row ~500,000,000.
Because the file has so many rows:
I need to extract only a subset of the rows to do anything useful with the data.
Reading through every row leading up to the values I care about is going to take a long time.
If the solution reads past the rows I care about and continues reading the rest of the file it will waste time reading almost 3 billion irrelevant rows and take 6x longer than necessary.
My best-case-scenario is a solution that extracts only a single line from the file without reading any of the other rows in the file, but I can't think of how I would accomplish this in Bash.
For the purposes of my sanity I'm not going to be trying to read the full 500,000,000 lines I'd need for my own problem. Instead I'll be trying to extract row 50,000,000 out of 3,339,550,320 (which means reading the full file will take 60x longer than necessary).
I will be using the time built-in to benchmark each command.
Baseline
First let's see how the head tail solution:
$ time head -50000000 myfile.ascii | tail -1
pgm_icnt = 0
real 1m15.321s
The baseline for row 50 million is 00:01:15.321, if I'd gone straight for row 500 million it'd probably be ~12.5 minutes.
cut
I'm dubious of this one, but it's worth a shot:
$ time cut -f50000000 -d$'\n' myfile.ascii
pgm_icnt = 0
real 5m12.156s
This one took 00:05:12.156 to run, which is much slower than the baseline! I'm not sure whether it read through the entire file or just up to line 50 million before stopping, but regardless this doesn't seem like a viable solution to the problem.
AWK
I only ran the solution with the exit because I wasn't going to wait for the full file to run:
$ time awk 'NR == 50000000 {print; exit}' myfile.ascii
pgm_icnt = 0
real 1m16.583s
This code ran in 00:01:16.583, which is only ~1 second slower, but still not an improvement on the baseline. At this rate if the exit command had been excluded it would have probably taken around ~76 minutes to read the entire file!
Perl
I ran the existing Perl solution as well:
$ time perl -wnl -e '$.== 50000000 && print && exit;' myfile.ascii
pgm_icnt = 0
real 1m13.146s
This code ran in 00:01:13.146, which is ~2 seconds faster than the baseline. If I'd run it on the full 500,000,000 it would probably take ~12 minutes.
sed
The top answer on the board, here's my result:
$ time sed "50000000q;d" myfile.ascii
pgm_icnt = 0
real 1m12.705s
This code ran in 00:01:12.705, which is 3 seconds faster than the baseline, and ~0.4 seconds faster than Perl. If I'd run it on the full 500,000,000 rows it would have probably taken ~12 minutes.
mapfile
I have bash 3.1 and therefore cannot test the mapfile solution.
Conclusion
It looks like, for the most part, it's difficult to improve upon the head tail solution. At best the sed solution provides a ~3% increase in efficiency.
(percentages calculated with the formula % = (runtime/baseline - 1) * 100)
Row 50,000,000
00:01:12.705 (-00:00:02.616 = -3.47%) sed
00:01:13.146 (-00:00:02.175 = -2.89%) perl
00:01:15.321 (+00:00:00.000 = +0.00%) head|tail
00:01:16.583 (+00:00:01.262 = +1.68%) awk
00:05:12.156 (+00:03:56.835 = +314.43%) cut
Row 500,000,000
00:12:07.050 (-00:00:26.160) sed
00:12:11.460 (-00:00:21.750) perl
00:12:33.210 (+00:00:00.000) head|tail
00:12:45.830 (+00:00:12.620) awk
00:52:01.560 (+00:40:31.650) cut
Row 3,338,559,320
01:20:54.599 (-00:03:05.327) sed
01:21:24.045 (-00:02:25.227) perl
01:23:49.273 (+00:00:00.000) head|tail
01:25:13.548 (+00:02:35.735) awk
05:47:23.026 (+04:24:26.246) cut
With awk it is pretty fast:
awk 'NR == num_line' file
When this is true, the default behaviour of awk is performed: {print $0}.
Alternative versions
If your file happens to be huge, you'd better exit after reading the required line. This way you save CPU time See time comparison at the end of the answer.
awk 'NR == num_line {print; exit}' file
If you want to give the line number from a bash variable you can use:
awk 'NR == n' n=$num file
awk -v n=$num 'NR == n' file # equivalent
See how much time is saved by using exit, specially if the line happens to be in the first part of the file:
# Let's create a 10M lines file
for ((i=0; i<100000; i++)); do echo "bla bla"; done > 100Klines
for ((i=0; i<100; i++)); do cat 100Klines; done > 10Mlines
$ time awk 'NR == 1234567 {print}' 10Mlines
bla bla
real 0m1.303s
user 0m1.246s
sys 0m0.042s
$ time awk 'NR == 1234567 {print; exit}' 10Mlines
bla bla
real 0m0.198s
user 0m0.178s
sys 0m0.013s
So the difference is 0.198s vs 1.303s, around 6x times faster.
According to my tests, in terms of performance and readability my recommendation is:
tail -n+N | head -1
N is the line number that you want. For example, tail -n+7 input.txt | head -1 will print the 7th line of the file.
tail -n+N will print everything starting from line N, and head -1 will make it stop after one line.
The alternative head -N | tail -1 is perhaps slightly more readable. For example, this will print the 7th line:
head -7 input.txt | tail -1
When it comes to performance, there is not much difference for smaller sizes, but it will be outperformed by the tail | head (from above) when the files become huge.
The top-voted sed 'NUMq;d' is interesting to know, but I would argue that it will be understood by fewer people out of the box than the head/tail solution and it is also slower than tail/head.
In my tests, both tails/heads versions outperformed sed 'NUMq;d' consistently. That is in line with the other benchmarks that were posted. It is hard to find a case where tails/heads was really bad. It is also not surprising, as these are operations that you would expect to be heavily optimized in a modern Unix system.
To get an idea about the performance differences, these are the number that I get for a huge file (9.3G):
tail -n+N | head -1: 3.7 sec
head -N | tail -1: 4.6 sec
sed Nq;d: 18.8 sec
Results may differ, but the performance head | tail and tail | head is, in general, comparable for smaller inputs, and sed is always slower by a significant factor (around 5x or so).
To reproduce my benchmark, you can try the following, but be warned that it will create a 9.3G file in the current working directory:
#!/bin/bash
readonly file=tmp-input.txt
readonly size=1000000000
readonly pos=500000000
readonly retries=3
seq 1 $size > $file
echo "*** head -N | tail -1 ***"
for i in $(seq 1 $retries) ; do
time head "-$pos" $file | tail -1
done
echo "-------------------------"
echo
echo "*** tail -n+N | head -1 ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
time tail -n+$pos $file | head -1
done
echo "-------------------------"
echo
echo "*** sed Nq;d ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
time sed $pos'q;d' $file
done
/bin/rm $file
Here is the output of a run on my machine (ThinkPad X1 Carbon with an SSD and 16G of memory). I assume in the final run everything will come from the cache, not from disk:
*** head -N | tail -1 ***
500000000
real 0m9,800s
user 0m7,328s
sys 0m4,081s
500000000
real 0m4,231s
user 0m5,415s
sys 0m2,789s
500000000
real 0m4,636s
user 0m5,935s
sys 0m2,684s
-------------------------
*** tail -n+N | head -1 ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:49 tmp-input.txt
500000000
real 0m6,452s
user 0m3,367s
sys 0m1,498s
500000000
real 0m3,890s
user 0m2,921s
sys 0m0,952s
500000000
real 0m3,763s
user 0m3,004s
sys 0m0,760s
-------------------------
*** sed Nq;d ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:50 tmp-input.txt
500000000
real 0m23,675s
user 0m21,557s
sys 0m1,523s
500000000
real 0m20,328s
user 0m18,971s
sys 0m1,308s
500000000
real 0m19,835s
user 0m18,830s
sys 0m1,004s
Wow, all the possibilities!
Try this:
sed -n "${lineNum}p" $file
or one of these depending upon your version of Awk:
awk -vlineNum=$lineNum 'NR == lineNum {print $0}' $file
awk -v lineNum=4 '{if (NR == lineNum) {print $0}}' $file
awk '{if (NR == lineNum) {print $0}}' lineNum=$lineNum $file
(You may have to try the nawk or gawk command).
Is there a tool that only does the print that particular line? Not one of the standard tools. However, sed is probably the closest and simplest to use.
Save two keystrokes, print Nth line without using bracket:
sed -n Np <fileName>
^ ^
\ \___ 'p' for printing
\______ '-n' for not printing by default
For example, to print 100th line:
sed -n 100p foo.txt
This question being tagged Bash, here's the Bash (≥4) way of doing: use mapfile with the -s (skip) and -n (count) option.
If you need to get the 42nd line of a file file:
mapfile -s 41 -n 1 ary < file
At this point, you'll have an array ary the fields of which containing the lines of file (including the trailing newline), where we have skipped the first 41 lines (-s 41), and stopped after reading one line (-n 1). So that's really the 42nd line. To print it out:
printf '%s' "${ary[0]}"
If you need a range of lines, say the range 42–666 (inclusive), and say you don't want to do the math yourself, and print them on stdout:
mapfile -s $((42-1)) -n $((666-42+1)) ary < file
printf '%s' "${ary[#]}"
If you need to process these lines too, it's not really convenient to store the trailing newline. In this case use the -t option (trim):
mapfile -t -s $((42-1)) -n $((666-42+1)) ary < file
# do stuff
printf '%s\n' "${ary[#]}"
You can have a function do that for you:
print_file_range() {
# $1-$2 is the range of file $3 to be printed to stdout
local ary
mapfile -s $(($1-1)) -n $(($2-$1+1)) ary < "$3"
printf '%s' "${ary[#]}"
}
No external commands, only Bash builtins!
You may also used sed print and quit:
sed -n '10{p;q;}' file # print line 10
You can also use Perl for this:
perl -wnl -e '$.== NUM && print && exit;' some.file
As a followup to CaffeineConnoisseur's very helpful benchmarking answer... I was curious as to how fast the 'mapfile' method was compared to others (as that wasn't tested), so I tried a quick-and-dirty speed comparison myself as I do have bash 4 handy. Threw in a test of the "tail | head" method (rather than head | tail) mentioned in one of the comments on the top answer while I was at it, as folks are singing its praises. I don't have anything nearly the size of the testfile used; the best I could find on short notice was a 14M pedigree file (long lines that are whitespace-separated, just under 12000 lines).
Short version: mapfile appears faster than the cut method, but slower than everything else, so I'd call it a dud. tail | head, OTOH, looks like it could be the fastest, although with a file this size the difference is not all that substantial compared to sed.
$ time head -11000 [filename] | tail -1
[output redacted]
real 0m0.117s
$ time cut -f11000 -d$'\n' [filename]
[output redacted]
real 0m1.081s
$ time awk 'NR == 11000 {print; exit}' [filename]
[output redacted]
real 0m0.058s
$ time perl -wnl -e '$.== 11000 && print && exit;' [filename]
[output redacted]
real 0m0.085s
$ time sed "11000q;d" [filename]
[output redacted]
real 0m0.031s
$ time (mapfile -s 11000 -n 1 ary < [filename]; echo ${ary[0]})
[output redacted]
real 0m0.309s
$ time tail -n+11000 [filename] | head -n1
[output redacted]
real 0m0.028s
Hope this helps!
The fastest solution for big files is always tail|head, provided that the two distances:
from the start of the file to the starting line. Lets call it S
the distance from the last line to the end of the file. Be it E
are known. Then, we could use this:
mycount="$E"; (( E > S )) && mycount="+$S"
howmany="$(( endline - startline + 1 ))"
tail -n "$mycount"| head -n "$howmany"
howmany is just the count of lines required.
Some more detail in https://unix.stackexchange.com/a/216614/79743
All the above answers directly answer the question. But here's a less direct solution but a potentially more important idea, to provoke thought.
Since line lengths are arbitrary, all the bytes of the file before the nth line need to be read. If you have a huge file or need to repeat this task many times, and this process is time-consuming, then you should seriously think about whether you should be storing your data in a different way in the first place.
The real solution is to have an index, e.g. at the start of the file, indicating the positions where the lines begin. You could use a database format, or just add a table at the start of the file. Alternatively create a separate index file to accompany your large text file.
e.g. you might create a list of character positions for newlines:
awk 'BEGIN{c=0;print(c)}{c+=length()+1;print(c+1)}' file.txt > file.idx
then read with tail, which actually seeks directly to the appropriate point in the file!
e.g. to get line 1000:
tail -c +$(awk 'NR=1000' file.idx) file.txt | head -1
This may not work with 2-byte / multibyte characters, since awk is "character-aware" but tail is not.
I haven't tested this against a large file.
Also see this answer.
Alternatively - split your file into smaller files!
If you got multiple lines by delimited by \n (normally new line). You can use 'cut' as well:
echo "$data" | cut -f2 -d$'\n'
You will get the 2nd line from the file. -f3 gives you the 3rd line.
Using what others mentioned, I wanted this to be a quick & dandy function in my bash shell.
Create a file: ~/.functions
Add to it the contents:
getline() {
line=$1
sed $line'q;d' $2
}
Then add this to your ~/.bash_profile:
source ~/.functions
Now when you open a new bash window, you can just call the function as so:
getline 441 myfile.txt
Lots of good answers already. I personally go with awk. For convenience, if you use bash, just add the below to your ~/.bash_profile. And, the next time you log in (or if you source your .bash_profile after this update), you will have a new nifty "nth" function available to pipe your files through.
Execute this or put it in your ~/.bash_profile (if using bash) and reopen bash (or execute source ~/.bach_profile)
# print just the nth piped in line
nth () { awk -vlnum=${1} 'NR==lnum {print; exit}'; }
Then, to use it, simply pipe through it. E.g.,:
$ yes line | cat -n | nth 5
5 line
To print nth line using sed with a variable as line number:
a=4
sed -e $a'q:d' file
Here the '-e' flag is for adding script to command to be executed.
After taking a look at the top answer and the benchmark, I've implemented a tiny helper function:
function nth {
if (( ${#} < 1 || ${#} > 2 )); then
echo -e "usage: $0 \e[4mline\e[0m [\e[4mfile\e[0m]"
return 1
fi
if (( ${#} > 1 )); then
sed "$1q;d" $2
else
sed "$1q;d"
fi
}
Basically you can use it in two fashions:
nth 42 myfile.txt
do_stuff | nth 42
This is not a bash solution, but I found out that top choices didn't satisfy my needs, eg,
sed 'NUMq;d' file
was fast enough, but was hanging for hours and did not tell about any progress. I suggest compiling this cpp program and using it to find the row you want. You can compile it with g++ main.cpp, where main.cpp is the file with the content below. I got a.out and executed it with ./a.out
#include <iostream>
#include <string>
#include <fstream>
using namespace std;
int main() {
string filename;
cout << "Enter filename ";
cin >> filename;
int needed_row_number;
cout << "Enter row number ";
cin >> needed_row_number;
int progress_line_count;
cout << "Enter at which every number of rows to monitor progress ";
cin >> progress_line_count;
char ch;
int row_counter = 1;
fstream fin(filename, fstream::in);
while (fin >> noskipws >> ch) {
int ch_int = (int) ch;
if (row_counter == needed_row_number) {
cout << ch;
}
if (ch_int == 10) {
if (row_counter == needed_row_number) {
return 0;
}
row_counter++;
if (row_counter % progress_line_count == 0) {
cout << "Progress: line " << row_counter << endl;
}
}
}
return 0;
}
To get an nth line (single line)
If you want something that you can later customize without having to deal with bash you can compile this c program and drop the binary in your custom binaries directory. This assumes that you know how to edit the .bashrc file
accordingly (only if you want to edit your path variable), If you don't know, this is a helpful link.
To run this code use (assuming you named the binary "line").
line [target line] [target file]
example
line 2 somefile.txt
The code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main(int argc, char* argv[]){
if(argc != 3){
fprintf(stderr, "line needs a line number and a file name");
exit(0);
}
int lineNumber = atoi(argv[1]);
int counter = 0;
char *fileName = argv[2];
FILE *fileReader = fopen(fileName, "r");
if(fileReader == NULL){
fprintf(stderr, "Failed to open file");
exit(0);
}
size_t lineSize = 0;
char* line = NULL;
while(counter < lineNumber){
getline(&line, &linesize, fileReader);
counter++
}
getline(&line, &lineSize, fileReader);
printf("%s\n", line);
fclose(fileReader);
return 0;
}
EDIT: removed the fseek and replaced it with a while loop
I've put some of the above answers into a short bash script that you can put into a file called get.sh and link to /usr/local/bin/get (or whatever other name you prefer).
#!/bin/bash
if [ "${1}" == "" ]; then
echo "error: blank line number";
exit 1
fi
re='^[0-9]+$'
if ! [[ $1 =~ $re ]] ; then
echo "error: line number arg not a number";
exit 1
fi
if [ "${2}" == "" ]; then
echo "error: blank file name";
exit 1
fi
sed "${1}q;d" $2;
exit 0
Ensure it's executable with
$ chmod +x get
Link it to make it available on the PATH with
$ ln -s get.sh /usr/local/bin/get
UPDATE 1 : found much faster method in awk
just 5.353 secs to obtain a row above 133.6 mn :
rownum='133668997'; ( time ( pvE0 < ~/master_primelist_18a.txt |
LC_ALL=C mawk2 -F'^$' -v \_="${rownum}" -- '!_{exit}!--_' ) )
in0: 5.45GiB 0:00:05 [1.02GiB/s] [1.02GiB/s] [======> ] 71%
( pvE 0.1 in0 < ~/master_primelist_18a.txt |
LC_ALL=C mawk2 -F'^$' -v -- ; ) 5.01s user
1.21s system 116% cpu 5.353 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
===============================================
I'd like to contest the notion that perl is faster than awk :
so while my test file isn't nearly quite as many rows, it's also twice the size, at 7.58 GB -
I even gave perl some built-in advantageous - like hard-coding in the row number, and also going second, thus gaining any potential speedups from OS caching mechanism, if any
f="$( grealpath -ePq ~/master_primelist_18a.txt )"
rownum='133668997'
fg;fg; pv < "${f}" | gwc -lcm
echo; sleep 2;
echo;
( time ( pv -i 0.1 -cN in0 < "${f}" |
LC_ALL=C mawk2 '_{exit}_=NR==+__' FS='^$' __="${rownum}"
) ) | mawk 'BEGIN { print } END { print _ } NR'
sleep 2
( time ( pv -i 0.1 -cN in0 < "${f}" |
LC_ALL=C perl -wnl -e '$.== 133668997 && print && exit;'
) ) | mawk 'BEGIN { print } END { print _ } NR' ;
fg: no current job
fg: no current job
7.58GiB 0:00:28 [ 275MiB/s] [============>] 100%
148,110,134 8,134,435,629 8,134,435,629 <<<< rows, chars, and bytes
count as reported by gnu-wc
in0: 5.45GiB 0:00:07 [ 701MiB/s] [=> ] 71%
( pv -i 0.1 -cN in0 < "${f}" | LC_ALL=C mawk2 '_{exit}_=NR==+__' FS='^$' ; )
6.22s user 2.56s system 110% cpu 7.966 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
in0: 5.45GiB 0:00:17 [ 328MiB/s] [=> ] 71%
( pv -i 0.1 -cN in0 < "${f}" | LC_ALL=C perl -wnl -e ; )
14.22s user 3.31s system 103% cpu 17.014 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
I can re-run the test with perl 5.36 or even perl-6 if u think it's gonna make a difference (haven't installed either), but a gap of
7.966 secs (mawk2) vs. 17.014 secs (perl 5.34)
between the two, with the latter more than double the prior, it seems clear which one is indeed meaningfully faster to fetch a single row way deep in ASCII files.
This is perl 5, version 34, subversion 0 (v5.34.0) built for darwin-thread-multi-2level
Copyright 1987-2021, Larry Wall
mawk 1.9.9.6, 21 Aug 2016, Copyright Michael D. Brennan

while loops in parallel with input from splited file

I am stuck on that. So I have this while-read loop within my code that is taking so long and I would like to run it in many processors. But, I'd like to split the input file and run 14 loops (because I have 14 threads), one for each splited file, in parallel. Thing is that I don't know how to tell the while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file in 14 files and run 14 while loops in parallel, one for each splited file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But neither I was able to tell which file would be the input to the loop so parallel would understand "take each of those xa* files and place into individual loops in parallel"
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequences alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800Kb) with DNA sequence acessions, reads it line-by-line and look for each correspondent full taxonomy for those acessions in a databank (~45000 lines). Then, it does a few sums/divisions. Output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreads to thousands. Only columns 1 and 2 matters here. Column1 is the name of the sequence and col2 is the name of all sequences it matched in the databnk. It looks like that:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies correspondent to the DNA sequences. Each taxonomy is in one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, given sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output whis has all the accessions to sequences that it matched. What the loop does is that it finds the taxonomies for all those sequences and calculates the "statistics" I mentionded previously. Code does it for all the DNA-sequences-accesions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing so it takes about ~12h running in one processor for doing it to all the 45000 DNA sequence accessions. The, I would like to split input_file and do it in all the processors I have (14) because it would the time spend in that.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
That said I think your real problem is spawning too many programs: If you rewrite doit in Perl or Python I will expect you will see a major speedup.
As an alternative I threw together a quick test.
#! /bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits and if I nest 3 loops of that I get 36^3 or 46,656 lines which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for You, I am not familiar with parallel instead using native bash spawning processes &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait

Get a percentage of randomly chosen lines from a text file

I have a text file (bigfile.txt) with thousands of rows. I want to make a smaller text file with 1 % of the rows which are randomly chosen. I tried the following
output=$(wc -l bigfile.txt)
ds1=$(0.01*output)
sort -r bigfile.txt|shuf|head -n ds1
It give the following error:
head: invalid number of lines: ‘ds1’
I don't know what is wrong.
Even after you fix your issues with your bash script, it cannot do floating point arithmetic. You need external tools like Awk which I would use as
randomCount=$(awk 'END{print int((NR==0)?0:(NR/100))}' bigfile.txt)
(( randomCount )) && sort -r file | shuf | head -n "$randomCount"
E.g. Writing a file with with 221 lines using the below loop and trying to get random lines,
tmpfile=$(mktemp /tmp/abc-script.XXXXXX)
for i in {1..221}; do echo $i; done >> "$tmpfile"
randomCount=$(awk 'END{print int((NR==0)?0:(NR/100))}' "$tmpfile")
If I print the count, it would return me a integer number 2 and using that on the next command,
sort -r "$tmpfile" | shuf | head -n "$randomCount"
86
126
Roll a die (with rand()) for each line of the file and get a number between 0 and 1. Print the line if the die shows less than 0.01:
awk 'rand()<0.01' bigFile
Quick test - generate 100,000,000 lines and count how many get through:
seq 1 100000000 | awk 'rand()<0.01' | wc -l
999308
Pretty close to 1%.
If you want the order random as well as the selection, you can pass this through shuf afterwards:
seq 1 100000000 | awk 'rand()<0.01' | shuf
On the subject of efficiency which came up in the comments, this solution takes 24s on my iMac with 100,000,000 lines:
time { seq 1 100000000 | awk 'rand()<0.01' > /dev/null; }
real 0m23.738s
user 0m31.787s
sys 0m0.490s
The only other solution that works here, heavily based on OP's original code, takes 13 minutes 19s.

What's an easy way to read random line from a file?

What's an easy way to read random line from a file in a shell script?
You can use shuf:
shuf -n 1 $FILE
There is also a utility called rl. In Debian it's in the randomize-lines package that does exactly what you want, though not available in all distros. On its home page it actually recommends the use of shuf instead (which didn't exist when it was created, I believe). shuf is part of the GNU coreutils, rl is not.
rl -c 1 $FILE
Another alternative:
head -$((${RANDOM} % `wc -l < file` + 1)) file | tail -1
sort --random-sort $FILE | head -n 1
(I like the shuf approach above even better though - I didn't even know that existed and I would have never found that tool on my own)
This is simple.
cat file.txt | shuf -n 1
Granted this is just a tad slower than the "shuf -n 1 file.txt" on its own.
perlfaq5: How do I select a random line from a file? Here's a reservoir-sampling algorithm from the Camel Book:
perl -e 'srand; rand($.) < 1 && ($line = $_) while <>; print $line;' file
This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
using a bash script:
#!/bin/bash
# replace with file to read
FILE=tmp.txt
# count number of lines
NUM=$(wc - l < ${FILE})
# generate random number in range 0-NUM
let X=${RANDOM} % ${NUM} + 1
# extract X-th line
sed -n ${X}p ${FILE}
Single bash line:
sed -n $((1+$RANDOM%`wc -l test.txt | cut -f 1 -d ' '`))p test.txt
Slight problem: duplicate filename.
Here's a simple Python script that will do the job:
import random, sys
lines = open(sys.argv[1]).readlines()
print(lines[random.randrange(len(lines))])
Usage:
python randline.py file_to_get_random_line_from
Another way using 'awk'
awk NR==$((${RANDOM} % `wc -l < file.name` + 1)) file.name
A solution that also works on MacOSX, and should also works on Linux(?):
N=5
awk 'NR==FNR {lineN[$1]; next}(FNR in lineN)' <(jot -r $N 1 $(wc -l < $file)) $file
Where:
N is the number of random lines you want
NR==FNR {lineN[$1]; next}(FNR in lineN) file1 file2
--> save line numbers written in file1 and then print corresponding line in file2
jot -r $N 1 $(wc -l < $file) --> draw N numbers randomly (-r) in range (1, number_of_line_in_file) with jot. The process substitution <() will make it look like a file for the interpreter, so file1 in previous example.
#!/bin/bash
IFS=$'\n' wordsArray=($(<$1))
numWords=${#wordsArray[#]}
sizeOfNumWords=${#numWords}
while [ True ]
do
for ((i=0; i<$sizeOfNumWords; i++))
do
let ranNumArray[$i]=$(( ( $RANDOM % 10 ) + 1 ))-1
ranNumStr="$ranNumStr${ranNumArray[$i]}"
done
if [ $ranNumStr -le $numWords ]
then
break
fi
ranNumStr=""
done
noLeadZeroStr=$((10#$ranNumStr))
echo ${wordsArray[$noLeadZeroStr]}
Here is what I discovery since my Mac OS doesn't use all the easy answers. I used the jot command to generate a number since the $RANDOM variable solutions seems not to be very random in my test. When testing my solution I had a wide variance in the solutions provided in the output.
RANDOM1=`jot -r 1 1 235886`
#range of jot ( 1 235886 ) found from earlier wc -w /usr/share/dict/web2
echo $RANDOM1
head -n $RANDOM1 /usr/share/dict/web2 | tail -n 1
The echo of the variable is to get a visual of the generated random number.
Using only vanilla sed and awk, and without using $RANDOM, a simple, space-efficient and reasonably fast "one-liner" for selecting a single line pseudo-randomly from a file named FILENAME is as follows:
sed -n $(awk 'END {srand(); r=rand()*NR; if (r<NR) {sub(/\..*/,"",r); r++;}; print r}' FILENAME)p FILENAME
(This works even if FILENAME is empty, in which case no line is emitted.)
One possible advantage of this approach is that it only calls rand() once.
As pointed out by #AdamKatz in the comments, another possibility would be to call rand() for each line:
awk 'rand() * NR < 1 { line = $0 } END { print line }' FILENAME
(A simple proof of correctness can be given based on induction.)
Caveat about rand()
"In most awk implementations, including gawk, rand() starts generating numbers from the same starting number, or seed, each time you run awk."
-- https://www.gnu.org/software/gawk/manual/html_node/Numeric-Functions.html

Shell command to sum integers, one per line?

I am looking for a command that will accept (as input) multiple lines of text, each line containing a single integer, and output the sum of these integers.
As a bit of background, I have a log file which includes timing measurements. Through grepping for the relevant lines and a bit of sed reformatting I can list all of the timings in that file. I would like to work out the total. I can pipe this intermediate output to any command in order to do the final sum. I have always used expr in the past, but unless it runs in RPN mode I do not think it is going to cope with this (and even then it would be tricky).
How can I get the summation of integers?
Bit of awk should do it?
awk '{s+=$1} END {print s}' mydatafile
Note: some versions of awk have some odd behaviours if you are going to be adding anything exceeding 2^31 (2147483647). See comments for more background. One suggestion is to use printf rather than print:
awk '{s+=$1} END {printf "%.0f", s}' mydatafile
Paste typically merges lines of multiple files, but it can also be used to convert individual lines of a file into a single line. The delimiter flag allows you to pass a x+x type equation to bc.
paste -s -d+ infile | bc
Alternatively, when piping from stdin,
<commands> | paste -s -d+ - | bc
The one-liner version in Python:
$ python -c "import sys; print(sum(int(l) for l in sys.stdin))"
I would put a big WARNING on the commonly approved solution:
awk '{s+=$1} END {print s}' mydatafile # DO NOT USE THIS!!
that is because in this form awk uses a 32 bit signed integer representation: it will overflow for sums that exceed 2147483647 (i.e., 2^31).
A more general answer (for summing integers) would be:
awk '{s+=$1} END {printf "%.0f\n", s}' mydatafile # USE THIS INSTEAD
Plain bash:
$ cat numbers.txt
1
2
3
4
5
6
7
8
9
10
$ sum=0; while read num; do ((sum += num)); done < numbers.txt; echo $sum
55
With jq:
seq 10 | jq -s 'add' # 'add' is equivalent to 'reduce .[] as $item (0; . + $item)'
dc -f infile -e '[+z1<r]srz1<rp'
Note that negative numbers prefixed with minus sign should be translated for dc, since it uses _ prefix rather than - prefix for that. For example, via tr '-' '_' | dc -f- -e '...'.
Edit: Since this answer got so many votes "for obscurity", here is a detailed explanation:
The expression [+z1<r]srz1<rp does the following:
[ interpret everything to the next ] as a string
+ push two values off the stack, add them and push the result
z push the current stack depth
1 push one
<r pop two values and execute register r if the original top-of-stack (1)
is smaller
] end of the string, will push the whole thing to the stack
sr pop a value (the string above) and store it in register r
z push the current stack depth again
1 push 1
<r pop two values and execute register r if the original top-of-stack (1)
is smaller
p print the current top-of-stack
As pseudo-code:
Define "add_top_of_stack" as:
Remove the two top values off the stack and add the result back
If the stack has two or more values, run "add_top_of_stack" recursively
If the stack has two or more values, run "add_top_of_stack"
Print the result, now the only item left in the stack
To really understand the simplicity and power of dc, here is a working Python script that implements some of the commands from dc and executes a Python version of the above command:
### Implement some commands from dc
registers = {'r': None}
stack = []
def add():
stack.append(stack.pop() + stack.pop())
def z():
stack.append(len(stack))
def less(reg):
if stack.pop() < stack.pop():
registers[reg]()
def store(reg):
registers[reg] = stack.pop()
def p():
print stack[-1]
### Python version of the dc command above
# The equivalent to -f: read a file and push every line to the stack
import fileinput
for line in fileinput.input():
stack.append(int(line.strip()))
def cmd():
add()
z()
stack.append(1)
less('r')
stack.append(cmd)
store('r')
z()
stack.append(1)
less('r')
p()
Pure and short bash.
f=$(cat numbers.txt)
echo $(( ${f//$'\n'/+} ))
perl -lne '$x += $_; END { print $x; }' < infile.txt
My fifteen cents:
$ cat file.txt | xargs | sed -e 's/\ /+/g' | bc
Example:
$ cat text
1
2
3
3
4
5
6
78
9
0
1
2
3
4
576
7
4444
$ cat text | xargs | sed -e 's/\ /+/g' | bc
5148
I've done a quick benchmark on the existing answers which
use only standard tools (sorry for stuff like lua or rocket),
are real one-liners,
are capable of adding huge amounts of numbers (100 million), and
are fast (I ignored the ones which took longer than a minute).
I always added the numbers of 1 to 100 million which was doable on my machine in less than a minute for several solutions.
Here are the results:
Python
:; seq 100000000 | python -c 'import sys; print sum(map(int, sys.stdin))'
5000000050000000
# 30s
:; seq 100000000 | python -c 'import sys; print sum(int(s) for s in sys.stdin)'
5000000050000000
# 38s
:; seq 100000000 | python3 -c 'import sys; print(sum(int(s) for s in sys.stdin))'
5000000050000000
# 27s
:; seq 100000000 | python3 -c 'import sys; print(sum(map(int, sys.stdin)))'
5000000050000000
# 22s
:; seq 100000000 | pypy -c 'import sys; print(sum(map(int, sys.stdin)))'
5000000050000000
# 11s
:; seq 100000000 | pypy -c 'import sys; print(sum(int(s) for s in sys.stdin))'
5000000050000000
# 11s
Awk
:; seq 100000000 | awk '{s+=$1} END {print s}'
5000000050000000
# 22s
Paste & Bc
This ran out of memory on my machine. It worked for half the size of the input (50 million numbers):
:; seq 50000000 | paste -s -d+ - | bc
1250000025000000
# 17s
:; seq 50000001 100000000 | paste -s -d+ - | bc
3750000025000000
# 18s
So I guess it would have taken ~35s for the 100 million numbers.
Perl
:; seq 100000000 | perl -lne '$x += $_; END { print $x; }'
5000000050000000
# 15s
:; seq 100000000 | perl -e 'map {$x += $_} <> and print $x'
5000000050000000
# 48s
Ruby
:; seq 100000000 | ruby -e "puts ARGF.map(&:to_i).inject(&:+)"
5000000050000000
# 30s
C
Just for comparison's sake I compiled the C version and tested this also, just to have an idea how much slower the tool-based solutions are.
#include <stdio.h>
int main(int argc, char** argv) {
long sum = 0;
long i = 0;
while(scanf("%ld", &i) == 1) {
sum = sum + i;
}
printf("%ld\n", sum);
return 0;
}
 
:; seq 100000000 | ./a.out
5000000050000000
# 8s
Conclusion
C is of course fastest with 8s, but the Pypy solution only adds a very little overhead of about 30% to 11s. But, to be fair, Pypy isn't exactly standard. Most people only have CPython installed which is significantly slower (22s), exactly as fast as the popular Awk solution.
The fastest solution based on standard tools is Perl (15s).
Using the GNU datamash util:
seq 10 | datamash sum 1
Output:
55
If the input data is irregular, with spaces and tabs at odd places, this may confuse datamash, then either use the -W switch:
<commands...> | datamash -W sum 1
...or use tr to clean up the whitespace:
<commands...> | tr -d '[[:blank:]]' | datamash sum 1
If the input is large enough, the output will be in scientific notation.
seq 100000000 | datamash sum 1
Output:
5.00000005e+15
To convert that to decimal, use the the --format option:
seq 100000000 | datamash --format '%.0f' sum 1
Output:
5000000050000000
Plain bash one liner
$ cat > /tmp/test
1
2
3
4
5
^D
$ echo $(( $(cat /tmp/test | tr "\n" "+" ) 0 ))
BASH solution, if you want to make this a command (e.g. if you need to do this frequently):
addnums () {
local total=0
while read val; do
(( total += val ))
done
echo $total
}
Then usage:
addnums < /tmp/nums
You can using num-utils, although it may be overkill for what you need. This is a set of programs for manipulating numbers in the shell, and can do several nifty things, including of course, adding them up. It's a bit out of date, but they still work and can be useful if you need to do something more.
https://suso.suso.org/programs/num-utils/index.phtml
It's really simple to use:
$ seq 10 | numsum
55
But runs out of memory for large inputs.
$ seq 100000000 | numsum
Terminado (killed)
The following works in bash:
I=0
for N in `cat numbers.txt`
do
I=`expr $I + $N`
done
echo $I
I realize this is an old question, but I like this solution enough to share it.
% cat > numbers.txt
1
2
3
4
5
^D
% cat numbers.txt | perl -lpe '$c+=$_}{$_=$c'
15
If there is interest, I'll explain how it works.
Cannot avoid submitting this, it is the most generic approach to this Question, please check:
jot 1000000 | sed '2,$s/$/+/;$s/$/p/' | dc
It is to be found over here, I was the OP and the answer came from the audience:
Most elegant unix shell one-liner to sum list of numbers of arbitrary precision?
And here are its special advantages over awk, bc, perl, GNU's datamash and friends:
it uses standards utilities common in any unix environment
it does not depend on buffering and thus it does not choke with really long inputs.
it implies no particular precision limits -or integer size for that matter-, hello AWK friends!
no need for different code, if floating point numbers need to be added, instead.
it theoretically runs unhindered in the minimal of environments
sed 's/^/.+/' infile | bc | tail -1
Pure bash and in a one-liner :-)
$ cat numbers.txt
1
2
3
4
5
6
7
8
9
10
$ I=0; for N in $(cat numbers.txt); do I=$(($I + $N)); done; echo $I
55
Alternative pure Perl, fairly readable, no packages or options required:
perl -e "map {$x += $_} <> and print $x" < infile.txt
For Ruby Lovers
ruby -e "puts ARGF.map(&:to_i).inject(&:+)" numbers.txt
Here's a nice and clean Raku (formerly known as Perl 6) one-liner:
say [+] slurp.lines
We can use it like so:
% seq 10 | raku -e "say [+] slurp.lines"
55
It works like this:
slurp without any arguments reads from standard input by default; it returns a string. Calling the lines method on a string returns a list of lines of the string.
The brackets around + turn + into a reduction meta operator which reduces the list to a single value: the sum of the values in the list. say then prints it to standard output with a newline.
One thing to note is that we never explicitly convert the lines to numbers—Raku is smart enough to do that for us. However, this means our code breaks on input that definitely isn't a number:
% echo "1\n2\nnot a number" | raku -e "say [+] slurp.lines"
Cannot convert string to number: base-10 number must begin with valid digits or '.' in '⏏not a number' (indicated by ⏏)
in block <unit> at -e line 1
You can do it in python, if you feel comfortable:
Not tested, just typed:
out = open("filename").read();
lines = out.split('\n')
ints = map(int, lines)
s = sum(ints)
print s
Sebastian pointed out a one liner script:
cat filename | python -c"from fileinput import input; print sum(map(int, input()))"
The following should work (assuming your number is the second field on each line).
awk 'BEGIN {sum=0} \
{sum=sum + $2} \
END {print "tot:", sum}' Yourinputfile.txt
$ cat n
2
4
2
7
8
9
$ perl -MList::Util -le 'print List::Util::sum(<>)' < n
32
Or, you can type in the numbers on the command line:
$ perl -MList::Util -le 'print List::Util::sum(<>)'
1
3
5
^D
9
However, this one slurps the file so it is not a good idea to use on large files. See j_random_hacker's answer which avoids slurping.
One-liner in Racket:
racket -e '(define (g) (define i (read)) (if (eof-object? i) empty (cons i (g)))) (foldr + 0 (g))' < numlist.txt
C (not simplified)
seq 1 10 | tcc -run <(cat << EOF
#include <stdio.h>
int main(int argc, char** argv) {
int sum = 0;
int i = 0;
while(scanf("%d", &i) == 1) {
sum = sum + i;
}
printf("%d\n", sum);
return 0;
}
EOF)
My version:
seq -5 10 | xargs printf "- - %s" | xargs | bc
C++ (simplified):
echo {1..10} | scc 'WRL n+=$0; n'
SCC project - http://volnitsky.com/project/scc/
SCC is C++ snippets evaluator at shell prompt

Resources