What's the most elegant way to add Redis to /etc/services? - shell

So today, April 23rd 2015, the Internet Assigned Numbers Authority had decreed the use of port 6379 to Redis, a frabjous day indeed!
I wish to com·mem·o·rate this splendid occasion by adding the following line to my /etc/services file:
redis 6379/tcp
What would be the best way to go about it? By best I mean, of course, the following:
Needless to say, the new line should be inserted in its proper place (i.e.g. under the Assigned Numbers block, right after gnutella-rtr 6347/udp on my system)
I've considered the use of various text editors, but it feels out of place
Ideally, the solution should be a copy-pastable one-liner
I can envision the awk script that could do that but I'm looking for something more, a certain je ne sais quoi
Update re #Markus' sed proposal: I'm afraid the problem would be applying this "patch" on other systems that do not necessarily have the same /etc/services file so, expanding on point #1 above, the solution must ensure that regardless the specifics of to-be-preceding service in the file, order is kept.
Update 2: a few points that seem important to state - a) while not mandatory, the solution's length (or lack of rather) is certainly an important part of its elegance (similarly for external dependencies [i.e. lack of these]); b) I/we assumed that /etc/services is sorted, but it would be interesting to see what happens when it isn't; c) assume that you have root privileges and be careful with that rm / -rf command.

A single line sort that puts it in the right place:
echo -e "redis\t\t6379/tcp" | sort -k2 -n -o /etc/services -m - /etc/services

Nothing in the rules against answering my own question - this one uses Redis exclusively:
cat /etc/services | redis-cli -x SET services; redis-cli --raw EVAL 'local s = redis.call("GET", KEYS[1]) local b, p, name, port, proto; p = 1 repeat b, p, name, port, proto = string.find(s, "([%a%w\-]*)%s*(%d+)/(%w+)", p) if (p and tonumber(port) > tonumber(ARGV[2])) then s = string.sub(s, 1, b-1) .. ARGV[1] .. "\t\t" .. ARGV[2] .. "/" .. ARGV[3] .. "\t\t\t# " .. ARGV[4] .. "\n" .. string.sub(s, b, string.len(s)) return s end until not(p)' 1 services redis 6379 tcp "remote dictionary server" > /etc/services
Formatted Lua code:
local s = redis.call("GET", KEYS[1])
local b, p, name, port, proto
p = 1
repeat
b, p, name, port, proto = string.find(s, "([%a%w\-]*)%s*(%d+)/(%w+)", p)
if (p and tonumber(port) > tonumber(ARGV[2])) then
s = string.sub(s, 1, b-1) .. ARGV[1] .. "\t\t" .. ARGV[2]
.. "/" .. ARGV[3] .. "\t\t\t# " .. ARGV[4] .. "\n"
.. string.sub(s, b, string.len(s))
return s
end
until not(p)
Note: a similar challenge (https://gist.github.com/jorinvo/2e43ffa981a97bc17259#gistcomment-1440996) had inspired this answer. I chose a pure Lua script approach instead of leveraging Sorted Sets... although I could :)

An idempotent awk one-liner that inserts 6379 in order would be:
awk -v inserted=0 '/^[a-z]/ { if ($2 + 0 == 6379) { inserted=1 }; if (inserted == 0 && $2 + 0 > 6379) { print "redis\t\t6379/tcp"; inserted=1 }; print $0 }' /etc/services > /tmp/services && mv /tmp/services /etc/services

Real men don't use sed/awk :)
TMP_SERVICES=/tmp/services.$RANDOM
while read line
do
printf %b "$line\n" >> $TMP_SERVICES
if [[ $line == *"6347/udp"* ]]; then
printf %b "redis\t\t6379/tcp\n" >> $TMP_SERVICES
fi
done<"/etc/services"
mv -fb $TMP_SERVICES /etc/services

How about a python script, Itamar?
It works on the notion of extracting the port number (called the index in the code) and if we are above 6378 but have not yet printed our Redis line, print it, then mark that sentinel true and just print all lines (including the one we are on) after.
#!/usr/bin/python
lines = open("/etc/services").readlines()
printed=False
for line in lines:
if printed:
print line.rstrip()
continue
datafields = line.split()
if line[0] == "#":
print line.rstrip()
else:
datafields = line.split()
try:
try:
index,proto = datafields[1].split("/")
index = int(index)
except:
index,proto = datafields[0].split("/")
index = int(index)
if index > 6378:
if not printed:
print "redis 6379/tcp #Redis DSS"
printed = True
print line.rstrip()
except:
print datafields
raise
The relevant section on my file for comparison:
gnutella-rtr 6347/tcp # gnutella-rtr
# Serguei Osokine <osokin#paragraph.com>
# 6348-6381 Unassigned
redis 6379/tcp #Redis DSS
metatude-mds 6382/udp # Metatude Dialogue Server
metatude-mds 6382/tcp # Metatude Dialogue Server
Notice the line above Redis is a range. Short of breaking the range this is a workable solution for me. You could break the range but IMO this works just fine. Splitting the range seems a bit much for a simple, elegant script. Especially considering the likelihood most services files don't have the unassigned ranges listed (this is on OS X) - and that they are in a comment anyway.
UPDATE
If you don't care about the local file and it's comments, this gets you all currently assigned ports which are not Reserved or Discard-ed:
curl -s http://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.csv| awk -F',' '$4!~/(Discard|Unassigned|Reserved)/ && $1 && $2+0>0 && $1!~/FIX/ {printf "%-16s\t%s/%s\t#%s\n", $1,$2,$3,$4}' > /etc/services
The FIX test is because some of those lines have embedded newlines - which can be a pain in awk.

sed -i '/6347\/udp/a redis 6379\/tcp' /etc/services
good luck!
update
sed -i '/6347\/udp/a redis \t\t 6379\/tcp' /etc/services
looks better ...
update 2
lol
sed -i ''"$(echo $(echo $(grep -n $(awk {'print$2'} /etc/services | awk -F "/" '$1<6379'{'print$1'} | tail -1) /etc/services | awk -F ':' {'print$1'}|tail -1) + 1)|bc)"'i redis \t\t 6379\/tcp' /etc/services

I like the in-place subsitution of sed, so I combine it with an awk search of the next line in services
R="redis\t\t6379/tcp\t\t\t# data structure server" N=$(awk '{if($2+0 == 6379) exit(1);if ($2+0 > 6379) {print $0;exit(0)}}' /etc/services) && sed -i "s~$N~$R\n$N~" /etc/services

Related

How to extract the third largest file size in Unix [duplicate]

Is there a "canonical" way of doing that? I've been using head -n | tail -1 which does the trick, but I've been wondering if there's a Bash tool that specifically extracts a line (or a range of lines) from a file.
By "canonical" I mean a program whose main function is doing that.
head and pipe with tail will be slow for a huge file. I would suggest sed like this:
sed 'NUMq;d' file
Where NUM is the number of the line you want to print; so, for example, sed '10q;d' file to print the 10th line of file.
Explanation:
NUMq will quit immediately when the line number is NUM.
d will delete the line instead of printing it; this is inhibited on the last line because the q causes the rest of the script to be skipped when quitting.
If you have NUM in a variable, you will want to use double quotes instead of single:
sed "${NUM}q;d" file
sed -n '2p' < file.txt
will print 2nd line
sed -n '2011p' < file.txt
2011th line
sed -n '10,33p' < file.txt
line 10 up to line 33
sed -n '1p;3p' < file.txt
1st and 3th line
and so on...
For adding lines with sed, you can check this:
sed: insert a line in a certain position
I have a unique situation where I can benchmark the solutions proposed on this page, and so I'm writing this answer as a consolidation of the proposed solutions with included run times for each.
Set Up
I have a 3.261 gigabyte ASCII text data file with one key-value pair per row. The file contains 3,339,550,320 rows in total and defies opening in any editor I have tried, including my go-to Vim. I need to subset this file in order to investigate some of the values that I've discovered only start around row ~500,000,000.
Because the file has so many rows:
I need to extract only a subset of the rows to do anything useful with the data.
Reading through every row leading up to the values I care about is going to take a long time.
If the solution reads past the rows I care about and continues reading the rest of the file it will waste time reading almost 3 billion irrelevant rows and take 6x longer than necessary.
My best-case-scenario is a solution that extracts only a single line from the file without reading any of the other rows in the file, but I can't think of how I would accomplish this in Bash.
For the purposes of my sanity I'm not going to be trying to read the full 500,000,000 lines I'd need for my own problem. Instead I'll be trying to extract row 50,000,000 out of 3,339,550,320 (which means reading the full file will take 60x longer than necessary).
I will be using the time built-in to benchmark each command.
Baseline
First let's see how the head tail solution:
$ time head -50000000 myfile.ascii | tail -1
pgm_icnt = 0
real 1m15.321s
The baseline for row 50 million is 00:01:15.321, if I'd gone straight for row 500 million it'd probably be ~12.5 minutes.
cut
I'm dubious of this one, but it's worth a shot:
$ time cut -f50000000 -d$'\n' myfile.ascii
pgm_icnt = 0
real 5m12.156s
This one took 00:05:12.156 to run, which is much slower than the baseline! I'm not sure whether it read through the entire file or just up to line 50 million before stopping, but regardless this doesn't seem like a viable solution to the problem.
AWK
I only ran the solution with the exit because I wasn't going to wait for the full file to run:
$ time awk 'NR == 50000000 {print; exit}' myfile.ascii
pgm_icnt = 0
real 1m16.583s
This code ran in 00:01:16.583, which is only ~1 second slower, but still not an improvement on the baseline. At this rate if the exit command had been excluded it would have probably taken around ~76 minutes to read the entire file!
Perl
I ran the existing Perl solution as well:
$ time perl -wnl -e '$.== 50000000 && print && exit;' myfile.ascii
pgm_icnt = 0
real 1m13.146s
This code ran in 00:01:13.146, which is ~2 seconds faster than the baseline. If I'd run it on the full 500,000,000 it would probably take ~12 minutes.
sed
The top answer on the board, here's my result:
$ time sed "50000000q;d" myfile.ascii
pgm_icnt = 0
real 1m12.705s
This code ran in 00:01:12.705, which is 3 seconds faster than the baseline, and ~0.4 seconds faster than Perl. If I'd run it on the full 500,000,000 rows it would have probably taken ~12 minutes.
mapfile
I have bash 3.1 and therefore cannot test the mapfile solution.
Conclusion
It looks like, for the most part, it's difficult to improve upon the head tail solution. At best the sed solution provides a ~3% increase in efficiency.
(percentages calculated with the formula % = (runtime/baseline - 1) * 100)
Row 50,000,000
00:01:12.705 (-00:00:02.616 = -3.47%) sed
00:01:13.146 (-00:00:02.175 = -2.89%) perl
00:01:15.321 (+00:00:00.000 = +0.00%) head|tail
00:01:16.583 (+00:00:01.262 = +1.68%) awk
00:05:12.156 (+00:03:56.835 = +314.43%) cut
Row 500,000,000
00:12:07.050 (-00:00:26.160) sed
00:12:11.460 (-00:00:21.750) perl
00:12:33.210 (+00:00:00.000) head|tail
00:12:45.830 (+00:00:12.620) awk
00:52:01.560 (+00:40:31.650) cut
Row 3,338,559,320
01:20:54.599 (-00:03:05.327) sed
01:21:24.045 (-00:02:25.227) perl
01:23:49.273 (+00:00:00.000) head|tail
01:25:13.548 (+00:02:35.735) awk
05:47:23.026 (+04:24:26.246) cut
With awk it is pretty fast:
awk 'NR == num_line' file
When this is true, the default behaviour of awk is performed: {print $0}.
Alternative versions
If your file happens to be huge, you'd better exit after reading the required line. This way you save CPU time See time comparison at the end of the answer.
awk 'NR == num_line {print; exit}' file
If you want to give the line number from a bash variable you can use:
awk 'NR == n' n=$num file
awk -v n=$num 'NR == n' file # equivalent
See how much time is saved by using exit, specially if the line happens to be in the first part of the file:
# Let's create a 10M lines file
for ((i=0; i<100000; i++)); do echo "bla bla"; done > 100Klines
for ((i=0; i<100; i++)); do cat 100Klines; done > 10Mlines
$ time awk 'NR == 1234567 {print}' 10Mlines
bla bla
real 0m1.303s
user 0m1.246s
sys 0m0.042s
$ time awk 'NR == 1234567 {print; exit}' 10Mlines
bla bla
real 0m0.198s
user 0m0.178s
sys 0m0.013s
So the difference is 0.198s vs 1.303s, around 6x times faster.
According to my tests, in terms of performance and readability my recommendation is:
tail -n+N | head -1
N is the line number that you want. For example, tail -n+7 input.txt | head -1 will print the 7th line of the file.
tail -n+N will print everything starting from line N, and head -1 will make it stop after one line.
The alternative head -N | tail -1 is perhaps slightly more readable. For example, this will print the 7th line:
head -7 input.txt | tail -1
When it comes to performance, there is not much difference for smaller sizes, but it will be outperformed by the tail | head (from above) when the files become huge.
The top-voted sed 'NUMq;d' is interesting to know, but I would argue that it will be understood by fewer people out of the box than the head/tail solution and it is also slower than tail/head.
In my tests, both tails/heads versions outperformed sed 'NUMq;d' consistently. That is in line with the other benchmarks that were posted. It is hard to find a case where tails/heads was really bad. It is also not surprising, as these are operations that you would expect to be heavily optimized in a modern Unix system.
To get an idea about the performance differences, these are the number that I get for a huge file (9.3G):
tail -n+N | head -1: 3.7 sec
head -N | tail -1: 4.6 sec
sed Nq;d: 18.8 sec
Results may differ, but the performance head | tail and tail | head is, in general, comparable for smaller inputs, and sed is always slower by a significant factor (around 5x or so).
To reproduce my benchmark, you can try the following, but be warned that it will create a 9.3G file in the current working directory:
#!/bin/bash
readonly file=tmp-input.txt
readonly size=1000000000
readonly pos=500000000
readonly retries=3
seq 1 $size > $file
echo "*** head -N | tail -1 ***"
for i in $(seq 1 $retries) ; do
time head "-$pos" $file | tail -1
done
echo "-------------------------"
echo
echo "*** tail -n+N | head -1 ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
time tail -n+$pos $file | head -1
done
echo "-------------------------"
echo
echo "*** sed Nq;d ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
time sed $pos'q;d' $file
done
/bin/rm $file
Here is the output of a run on my machine (ThinkPad X1 Carbon with an SSD and 16G of memory). I assume in the final run everything will come from the cache, not from disk:
*** head -N | tail -1 ***
500000000
real 0m9,800s
user 0m7,328s
sys 0m4,081s
500000000
real 0m4,231s
user 0m5,415s
sys 0m2,789s
500000000
real 0m4,636s
user 0m5,935s
sys 0m2,684s
-------------------------
*** tail -n+N | head -1 ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:49 tmp-input.txt
500000000
real 0m6,452s
user 0m3,367s
sys 0m1,498s
500000000
real 0m3,890s
user 0m2,921s
sys 0m0,952s
500000000
real 0m3,763s
user 0m3,004s
sys 0m0,760s
-------------------------
*** sed Nq;d ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:50 tmp-input.txt
500000000
real 0m23,675s
user 0m21,557s
sys 0m1,523s
500000000
real 0m20,328s
user 0m18,971s
sys 0m1,308s
500000000
real 0m19,835s
user 0m18,830s
sys 0m1,004s
Wow, all the possibilities!
Try this:
sed -n "${lineNum}p" $file
or one of these depending upon your version of Awk:
awk -vlineNum=$lineNum 'NR == lineNum {print $0}' $file
awk -v lineNum=4 '{if (NR == lineNum) {print $0}}' $file
awk '{if (NR == lineNum) {print $0}}' lineNum=$lineNum $file
(You may have to try the nawk or gawk command).
Is there a tool that only does the print that particular line? Not one of the standard tools. However, sed is probably the closest and simplest to use.
Save two keystrokes, print Nth line without using bracket:
sed -n Np <fileName>
^ ^
\ \___ 'p' for printing
\______ '-n' for not printing by default
For example, to print 100th line:
sed -n 100p foo.txt
This question being tagged Bash, here's the Bash (≥4) way of doing: use mapfile with the -s (skip) and -n (count) option.
If you need to get the 42nd line of a file file:
mapfile -s 41 -n 1 ary < file
At this point, you'll have an array ary the fields of which containing the lines of file (including the trailing newline), where we have skipped the first 41 lines (-s 41), and stopped after reading one line (-n 1). So that's really the 42nd line. To print it out:
printf '%s' "${ary[0]}"
If you need a range of lines, say the range 42–666 (inclusive), and say you don't want to do the math yourself, and print them on stdout:
mapfile -s $((42-1)) -n $((666-42+1)) ary < file
printf '%s' "${ary[#]}"
If you need to process these lines too, it's not really convenient to store the trailing newline. In this case use the -t option (trim):
mapfile -t -s $((42-1)) -n $((666-42+1)) ary < file
# do stuff
printf '%s\n' "${ary[#]}"
You can have a function do that for you:
print_file_range() {
# $1-$2 is the range of file $3 to be printed to stdout
local ary
mapfile -s $(($1-1)) -n $(($2-$1+1)) ary < "$3"
printf '%s' "${ary[#]}"
}
No external commands, only Bash builtins!
You may also used sed print and quit:
sed -n '10{p;q;}' file # print line 10
You can also use Perl for this:
perl -wnl -e '$.== NUM && print && exit;' some.file
As a followup to CaffeineConnoisseur's very helpful benchmarking answer... I was curious as to how fast the 'mapfile' method was compared to others (as that wasn't tested), so I tried a quick-and-dirty speed comparison myself as I do have bash 4 handy. Threw in a test of the "tail | head" method (rather than head | tail) mentioned in one of the comments on the top answer while I was at it, as folks are singing its praises. I don't have anything nearly the size of the testfile used; the best I could find on short notice was a 14M pedigree file (long lines that are whitespace-separated, just under 12000 lines).
Short version: mapfile appears faster than the cut method, but slower than everything else, so I'd call it a dud. tail | head, OTOH, looks like it could be the fastest, although with a file this size the difference is not all that substantial compared to sed.
$ time head -11000 [filename] | tail -1
[output redacted]
real 0m0.117s
$ time cut -f11000 -d$'\n' [filename]
[output redacted]
real 0m1.081s
$ time awk 'NR == 11000 {print; exit}' [filename]
[output redacted]
real 0m0.058s
$ time perl -wnl -e '$.== 11000 && print && exit;' [filename]
[output redacted]
real 0m0.085s
$ time sed "11000q;d" [filename]
[output redacted]
real 0m0.031s
$ time (mapfile -s 11000 -n 1 ary < [filename]; echo ${ary[0]})
[output redacted]
real 0m0.309s
$ time tail -n+11000 [filename] | head -n1
[output redacted]
real 0m0.028s
Hope this helps!
The fastest solution for big files is always tail|head, provided that the two distances:
from the start of the file to the starting line. Lets call it S
the distance from the last line to the end of the file. Be it E
are known. Then, we could use this:
mycount="$E"; (( E > S )) && mycount="+$S"
howmany="$(( endline - startline + 1 ))"
tail -n "$mycount"| head -n "$howmany"
howmany is just the count of lines required.
Some more detail in https://unix.stackexchange.com/a/216614/79743
All the above answers directly answer the question. But here's a less direct solution but a potentially more important idea, to provoke thought.
Since line lengths are arbitrary, all the bytes of the file before the nth line need to be read. If you have a huge file or need to repeat this task many times, and this process is time-consuming, then you should seriously think about whether you should be storing your data in a different way in the first place.
The real solution is to have an index, e.g. at the start of the file, indicating the positions where the lines begin. You could use a database format, or just add a table at the start of the file. Alternatively create a separate index file to accompany your large text file.
e.g. you might create a list of character positions for newlines:
awk 'BEGIN{c=0;print(c)}{c+=length()+1;print(c+1)}' file.txt > file.idx
then read with tail, which actually seeks directly to the appropriate point in the file!
e.g. to get line 1000:
tail -c +$(awk 'NR=1000' file.idx) file.txt | head -1
This may not work with 2-byte / multibyte characters, since awk is "character-aware" but tail is not.
I haven't tested this against a large file.
Also see this answer.
Alternatively - split your file into smaller files!
If you got multiple lines by delimited by \n (normally new line). You can use 'cut' as well:
echo "$data" | cut -f2 -d$'\n'
You will get the 2nd line from the file. -f3 gives you the 3rd line.
Using what others mentioned, I wanted this to be a quick & dandy function in my bash shell.
Create a file: ~/.functions
Add to it the contents:
getline() {
line=$1
sed $line'q;d' $2
}
Then add this to your ~/.bash_profile:
source ~/.functions
Now when you open a new bash window, you can just call the function as so:
getline 441 myfile.txt
Lots of good answers already. I personally go with awk. For convenience, if you use bash, just add the below to your ~/.bash_profile. And, the next time you log in (or if you source your .bash_profile after this update), you will have a new nifty "nth" function available to pipe your files through.
Execute this or put it in your ~/.bash_profile (if using bash) and reopen bash (or execute source ~/.bach_profile)
# print just the nth piped in line
nth () { awk -vlnum=${1} 'NR==lnum {print; exit}'; }
Then, to use it, simply pipe through it. E.g.,:
$ yes line | cat -n | nth 5
5 line
To print nth line using sed with a variable as line number:
a=4
sed -e $a'q:d' file
Here the '-e' flag is for adding script to command to be executed.
After taking a look at the top answer and the benchmark, I've implemented a tiny helper function:
function nth {
if (( ${#} < 1 || ${#} > 2 )); then
echo -e "usage: $0 \e[4mline\e[0m [\e[4mfile\e[0m]"
return 1
fi
if (( ${#} > 1 )); then
sed "$1q;d" $2
else
sed "$1q;d"
fi
}
Basically you can use it in two fashions:
nth 42 myfile.txt
do_stuff | nth 42
This is not a bash solution, but I found out that top choices didn't satisfy my needs, eg,
sed 'NUMq;d' file
was fast enough, but was hanging for hours and did not tell about any progress. I suggest compiling this cpp program and using it to find the row you want. You can compile it with g++ main.cpp, where main.cpp is the file with the content below. I got a.out and executed it with ./a.out
#include <iostream>
#include <string>
#include <fstream>
using namespace std;
int main() {
string filename;
cout << "Enter filename ";
cin >> filename;
int needed_row_number;
cout << "Enter row number ";
cin >> needed_row_number;
int progress_line_count;
cout << "Enter at which every number of rows to monitor progress ";
cin >> progress_line_count;
char ch;
int row_counter = 1;
fstream fin(filename, fstream::in);
while (fin >> noskipws >> ch) {
int ch_int = (int) ch;
if (row_counter == needed_row_number) {
cout << ch;
}
if (ch_int == 10) {
if (row_counter == needed_row_number) {
return 0;
}
row_counter++;
if (row_counter % progress_line_count == 0) {
cout << "Progress: line " << row_counter << endl;
}
}
}
return 0;
}
To get an nth line (single line)
If you want something that you can later customize without having to deal with bash you can compile this c program and drop the binary in your custom binaries directory. This assumes that you know how to edit the .bashrc file
accordingly (only if you want to edit your path variable), If you don't know, this is a helpful link.
To run this code use (assuming you named the binary "line").
line [target line] [target file]
example
line 2 somefile.txt
The code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main(int argc, char* argv[]){
if(argc != 3){
fprintf(stderr, "line needs a line number and a file name");
exit(0);
}
int lineNumber = atoi(argv[1]);
int counter = 0;
char *fileName = argv[2];
FILE *fileReader = fopen(fileName, "r");
if(fileReader == NULL){
fprintf(stderr, "Failed to open file");
exit(0);
}
size_t lineSize = 0;
char* line = NULL;
while(counter < lineNumber){
getline(&line, &linesize, fileReader);
counter++
}
getline(&line, &lineSize, fileReader);
printf("%s\n", line);
fclose(fileReader);
return 0;
}
EDIT: removed the fseek and replaced it with a while loop
I've put some of the above answers into a short bash script that you can put into a file called get.sh and link to /usr/local/bin/get (or whatever other name you prefer).
#!/bin/bash
if [ "${1}" == "" ]; then
echo "error: blank line number";
exit 1
fi
re='^[0-9]+$'
if ! [[ $1 =~ $re ]] ; then
echo "error: line number arg not a number";
exit 1
fi
if [ "${2}" == "" ]; then
echo "error: blank file name";
exit 1
fi
sed "${1}q;d" $2;
exit 0
Ensure it's executable with
$ chmod +x get
Link it to make it available on the PATH with
$ ln -s get.sh /usr/local/bin/get
UPDATE 1 : found much faster method in awk
just 5.353 secs to obtain a row above 133.6 mn :
rownum='133668997'; ( time ( pvE0 < ~/master_primelist_18a.txt |
LC_ALL=C mawk2 -F'^$' -v \_="${rownum}" -- '!_{exit}!--_' ) )
in0: 5.45GiB 0:00:05 [1.02GiB/s] [1.02GiB/s] [======> ] 71%
( pvE 0.1 in0 < ~/master_primelist_18a.txt |
LC_ALL=C mawk2 -F'^$' -v -- ; ) 5.01s user
1.21s system 116% cpu 5.353 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
===============================================
I'd like to contest the notion that perl is faster than awk :
so while my test file isn't nearly quite as many rows, it's also twice the size, at 7.58 GB -
I even gave perl some built-in advantageous - like hard-coding in the row number, and also going second, thus gaining any potential speedups from OS caching mechanism, if any
f="$( grealpath -ePq ~/master_primelist_18a.txt )"
rownum='133668997'
fg;fg; pv < "${f}" | gwc -lcm
echo; sleep 2;
echo;
( time ( pv -i 0.1 -cN in0 < "${f}" |
LC_ALL=C mawk2 '_{exit}_=NR==+__' FS='^$' __="${rownum}"
) ) | mawk 'BEGIN { print } END { print _ } NR'
sleep 2
( time ( pv -i 0.1 -cN in0 < "${f}" |
LC_ALL=C perl -wnl -e '$.== 133668997 && print && exit;'
) ) | mawk 'BEGIN { print } END { print _ } NR' ;
fg: no current job
fg: no current job
7.58GiB 0:00:28 [ 275MiB/s] [============>] 100%
148,110,134 8,134,435,629 8,134,435,629 <<<< rows, chars, and bytes
count as reported by gnu-wc
in0: 5.45GiB 0:00:07 [ 701MiB/s] [=> ] 71%
( pv -i 0.1 -cN in0 < "${f}" | LC_ALL=C mawk2 '_{exit}_=NR==+__' FS='^$' ; )
6.22s user 2.56s system 110% cpu 7.966 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
in0: 5.45GiB 0:00:17 [ 328MiB/s] [=> ] 71%
( pv -i 0.1 -cN in0 < "${f}" | LC_ALL=C perl -wnl -e ; )
14.22s user 3.31s system 103% cpu 17.014 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
I can re-run the test with perl 5.36 or even perl-6 if u think it's gonna make a difference (haven't installed either), but a gap of
7.966 secs (mawk2) vs. 17.014 secs (perl 5.34)
between the two, with the latter more than double the prior, it seems clear which one is indeed meaningfully faster to fetch a single row way deep in ASCII files.
This is perl 5, version 34, subversion 0 (v5.34.0) built for darwin-thread-multi-2level
Copyright 1987-2021, Larry Wall
mawk 1.9.9.6, 21 Aug 2016, Copyright Michael D. Brennan

Batch Convert IP Addresses into Decimals?

I have a large file that contains 2 IPs per line - and there's about 3 million lines total.
Here's an example of the file:
1.32.0.0,1.32.255.255
5.72.0.0,5.75.255.255
5.180.0.0,5.183.255.255
222.127.228.22,222.127.228.23
222.127.228.24,222.127.228.24
I need to convert each IP to an IP Decimal, like this:
18874368,18939903
88604672,88866815
95682560,95944703
3732923414,3732923415
3732923416,3732923416
I'd prefer a way to do this strictly via command line. I'm okay with perl or python being used, as long as it doesn't require extra modules to be installed.
I thought I had come across a way that someone converted IPs like this using sed but can't seem to find that tutorial anymore. Any help would be appreciated.
If you have gnu awk installed (for the RT variable), you could use this one-liner:
awk -F. -v RS='[\n,]' '{printf "%d%s", (($1*256+$2)*256+$3)*256+$4, RT}' file
18874368,18939903
88604672,88866815
95682560,95944703
3732923414,3732923415
3732923416,3732923416
Here it is python solution, that use only standard modules (re, sys):
import re
import sys
def multiplier_generator():
""" Cyclic generator of powers of 256 (from 256**3 down to 256**0)
The mulitpliers tupple could be replaced by inline calculation
of power, but this approach has better performance.
"""
multipliers = (
256**3,
256**2,
256**1,
256**0,
)
idx = 0
while 1 == 1:
yield multipliers[idx]
idx = (idx + 1) % 4
def replacer(match_object):
"""re.sub replacer for ip group"""
multiplier = multiplier_generator()
res = 0
for i in xrange(1,5):
res += multiplier.next()*int(match_object.group(i))
return str(res)
if __name__ == "__main__":
std_in = ""
if len(sys.argv) > 1:
with open(sys.argv[1],'r') as f:
std_in = f.read()
else:
std_in = sys.stdin.read()
print re.sub(r"([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)", replacer, std_in )
This solution replace every ip address, that can be found in text from standard input or from file passed as first parameter, i.e:
python convert.py < input_file.txt, or
python convert.py file.txt, or
echo "1.2.3.4, 5.6.7.8" | python convert.py.
With bash:
ip2dec() {
set -- ${1//./ } # split $1 with "." to $1 $2 $3 $4
declare -i dec # set integer attribute
dec=$1*256*256*256+$2*256*256+$3*256+$4
echo -n $dec
}
while IFS=, read -r a b; do ip2dec $a; echo -n ,; ip2dec $b; echo; done < file
Output:
18874368,18939903
88604672,88866815
95682560,95944703
3732923414,3732923415
3732923416,3732923416
With bash and using shift (one CPU instruction) instead of multiply (a lot of instructions):
ip2dec() { local IFS=.
set -- $1 # split $1 with "." to $1 $2 $3 $4
printf '%s' "$(($1<<24+$2<<16+$3<<8+$4))"
}
while IFS=, read -r a b; do
printf '%s,%s\n' "$(ip2dec $a)" "$(ip2dec $b)"
done < file

Calculate difference between two number in a file

I want to know if it is possible to calculate the difference between two float number contained in a file in two distinct lines in one bash command line.
File content example :
Start at 123456.789
...
...
...
End at 123654.987
I would like to do an echo of 123654.987-123456.789
Is that possible? What is this magic command line ?
Thank you!
awk '
/Start/ { start = $3 } # 3rd field in line matching "Start"
/End/ {
end = $3; # 3rd field in line matching "End"
print end - start # Print the difference.
}
' < file
If you really want to do this on one line:
awk '/Start/ { start = $3 } /End/ { end = $3; print end - start }' < file
you can do this with this command:
start=`grep 'Start' FILENAME| cut -d ' ' -f 3`; end=`grep 'End' FILENAME | cut -d ' ' -f 3`; echo "$end-$start" | bc
You need the 'bc' program for this (for floating point math). You can install it with apt-get install bc, or yum, or rpm, zypper... OS specific :)
Bash doesn't support floating point operations. But you can split your numbers to parts and perform integer operations. Example:
#!/bin/bash
echo $(( ${2%.*} - ${1%.*} )).$(( ${2#*.} - ${1#*.} ))
Result:
./test.sh 123456.789 123654.987
198.198
EDIT:
Correct solution would be using not command line hack, but tool designed or performing fp operations. For example, bc:
echo 123654.987-123456.789 | bc
output:
198.198
Here's a weird way:
printf -- "-%s+%s\n" $(grep -oP '(Start|End) at \K[\d.]+' file) | bc

Conditional Sort using Awk or sort

Alright, so I asked a question a week or so ago about how I could use sed or awk to extract a block of text between two blank lines, as well as omit part of the extracted text. The answers I got pretty much satisfied my needs, but now I'm doing something extra for fun (and for OCD's sake).
I want to sort the output from awk in this round. I found this question & answer but it doesn't quite help me to solve the problem. I've also tried wrapping my head around a lot of awk documentation as well to try and figure out how I could do this, to no avail.
So here's the block of code in my script that does all the dirty work:
# This block of stuff fetches the nameservers as reported by the registrar and DNS zone
# Then it gets piped into awk to work some more formatting magic...
# The following is a step-for-step description since I can't put comments inside the awk block:
# BEGIN:
# Set the record separator to a blank line
# Set the input/output field separators to newlines
# FNR == 3:
# The third block of dig's output is the nameservers reported by the registrar
# Also blanks the last field & strips it since it's just a useless dig comment
dig +trace +additional $host | \
awk -v host="$host" '
BEGIN {
RS = "";
FS = "\n"
}
FNR == 3 {
print "Nameservers of",host,"reported by the registrar:";
OFS = "\n";
$NF = ""; sub( /[[:space:]]+$/, "" );
print
}
'
And here's the output if I pass google.com in as the value of $host (other hostnames may produce output of differing line counts):
Nameservers of google.com reported by the registrar:
google.com. 172800 IN NS ns2.google.com.
google.com. 172800 IN NS ns1.google.com.
google.com. 172800 IN NS ns3.google.com.
google.com. 172800 IN NS ns4.google.com.
ns2.google.com. 172800 IN A 216.239.34.10
ns1.google.com. 172800 IN A 216.239.32.10
ns3.google.com. 172800 IN A 216.239.36.10
ns4.google.com. 172800 IN A 216.239.38.10
The idea is, using either the existing block of awk, or piping awk's output into a combination of more awk, sort, or whatever else, sort that block of text using a conditional algorithm:
if ( column 4 == 'NS' )
sort by column 5
else // This will ensure that the col 1 sort includes A and AAAA records
sort by column 1
I've pretty much got the same preferences for answers as the previous question:
Most important of all, it must be portable since I've encountered different behaviour between OS X (my home system) and Fedora (what I use at work) when using sed (had to replace it with gsed on OS X) and grep's -m flag (used in another script)
An explanation of how the solution works would be very much appreciated, as a learning opportunity moreso than anything else. I already learned quite a bit from the awk solution already provided in the previous question.
If the solution can be implemented within the same block of awk, that would also be awesome
If not, then something simple and eloquent that I can pipe awk's output through would suffice
Here's a solution based on #shellter's idea. Pipe the output of your nameserver records to this:
awk '$4 == "NS" {print $1, $5, $0} $4 == "A" {print $1, $1, $0}' | sort | cut -f3- -d' '
Explanation:
With awk, we take only the NS and A records, and re-print the same line with prefix: primary search column + secondary search column
sort will sort the lines, thanks to the way we set the first and second column, the order should be as you wanted
With cut we get rid of the prefix that we used for sorting
I know you asked about awk solution, but since you tagged it with bash too, I thought I'd provide such a version. It should also be more portable than awk ;)
# the whole line
declare -a lines
# the key to use for sorting
declare -a keys
# insert into the arrays at the appropriate position
function insert
{
local key="$1"
local line="$2"
local count=${#lines[*]}
local i
# go from the end backwards
for((i=count; i>0; i-=1))
do
# if we have the insertion point, break
[[ "${keys[i-1]}" > "$key" ]] || break
# shift the current item to make room for the new one
lines[i]=${lines[i-1]}
keys[i]=${keys[i-1]}
done
# insert the new item
lines[i]=$line
keys[i]=$key
}
# This block of stuff fetches the nameservers as reported by the registrar and DNS zone
# The third block of dig's output is the nameservers reported by the registrar
# Also blanks the last field & strips it since it's just a useless dig comment
block=0
dig +trace +additional $host |
while read f1 f2 f3 f4 f5
do
# empty line begins new block
if [ -z "$f1" ]
then
# increment block counter
block=$((block+1))
# and read next line
continue
fi
# if we are not in block #3, read next line
[[ $block == 3 ]] || continue
# ;; ends the block
if [[ "$f1" == ";;" ]]
then
echo "Nameservers of $host reported by the registrar:"
# print the lines collected so far
for((i=0; i<${#lines[*]}; i+=1))
do
echo ${lines[i]}
done
# don't bother reading the rest
break
fi
# figure out what key to use for sorting
if [[ "$f4" == "NS" ]]
then
key=$f5
else
key=$f1
fi
# add the line to the arrays
insert "$key" "$f1 $f2 $f3 $f4 $f5"
done

Bash tool to get nth line from a file

Is there a "canonical" way of doing that? I've been using head -n | tail -1 which does the trick, but I've been wondering if there's a Bash tool that specifically extracts a line (or a range of lines) from a file.
By "canonical" I mean a program whose main function is doing that.
head and pipe with tail will be slow for a huge file. I would suggest sed like this:
sed 'NUMq;d' file
Where NUM is the number of the line you want to print; so, for example, sed '10q;d' file to print the 10th line of file.
Explanation:
NUMq will quit immediately when the line number is NUM.
d will delete the line instead of printing it; this is inhibited on the last line because the q causes the rest of the script to be skipped when quitting.
If you have NUM in a variable, you will want to use double quotes instead of single:
sed "${NUM}q;d" file
sed -n '2p' < file.txt
will print 2nd line
sed -n '2011p' < file.txt
2011th line
sed -n '10,33p' < file.txt
line 10 up to line 33
sed -n '1p;3p' < file.txt
1st and 3th line
and so on...
For adding lines with sed, you can check this:
sed: insert a line in a certain position
I have a unique situation where I can benchmark the solutions proposed on this page, and so I'm writing this answer as a consolidation of the proposed solutions with included run times for each.
Set Up
I have a 3.261 gigabyte ASCII text data file with one key-value pair per row. The file contains 3,339,550,320 rows in total and defies opening in any editor I have tried, including my go-to Vim. I need to subset this file in order to investigate some of the values that I've discovered only start around row ~500,000,000.
Because the file has so many rows:
I need to extract only a subset of the rows to do anything useful with the data.
Reading through every row leading up to the values I care about is going to take a long time.
If the solution reads past the rows I care about and continues reading the rest of the file it will waste time reading almost 3 billion irrelevant rows and take 6x longer than necessary.
My best-case-scenario is a solution that extracts only a single line from the file without reading any of the other rows in the file, but I can't think of how I would accomplish this in Bash.
For the purposes of my sanity I'm not going to be trying to read the full 500,000,000 lines I'd need for my own problem. Instead I'll be trying to extract row 50,000,000 out of 3,339,550,320 (which means reading the full file will take 60x longer than necessary).
I will be using the time built-in to benchmark each command.
Baseline
First let's see how the head tail solution:
$ time head -50000000 myfile.ascii | tail -1
pgm_icnt = 0
real 1m15.321s
The baseline for row 50 million is 00:01:15.321, if I'd gone straight for row 500 million it'd probably be ~12.5 minutes.
cut
I'm dubious of this one, but it's worth a shot:
$ time cut -f50000000 -d$'\n' myfile.ascii
pgm_icnt = 0
real 5m12.156s
This one took 00:05:12.156 to run, which is much slower than the baseline! I'm not sure whether it read through the entire file or just up to line 50 million before stopping, but regardless this doesn't seem like a viable solution to the problem.
AWK
I only ran the solution with the exit because I wasn't going to wait for the full file to run:
$ time awk 'NR == 50000000 {print; exit}' myfile.ascii
pgm_icnt = 0
real 1m16.583s
This code ran in 00:01:16.583, which is only ~1 second slower, but still not an improvement on the baseline. At this rate if the exit command had been excluded it would have probably taken around ~76 minutes to read the entire file!
Perl
I ran the existing Perl solution as well:
$ time perl -wnl -e '$.== 50000000 && print && exit;' myfile.ascii
pgm_icnt = 0
real 1m13.146s
This code ran in 00:01:13.146, which is ~2 seconds faster than the baseline. If I'd run it on the full 500,000,000 it would probably take ~12 minutes.
sed
The top answer on the board, here's my result:
$ time sed "50000000q;d" myfile.ascii
pgm_icnt = 0
real 1m12.705s
This code ran in 00:01:12.705, which is 3 seconds faster than the baseline, and ~0.4 seconds faster than Perl. If I'd run it on the full 500,000,000 rows it would have probably taken ~12 minutes.
mapfile
I have bash 3.1 and therefore cannot test the mapfile solution.
Conclusion
It looks like, for the most part, it's difficult to improve upon the head tail solution. At best the sed solution provides a ~3% increase in efficiency.
(percentages calculated with the formula % = (runtime/baseline - 1) * 100)
Row 50,000,000
00:01:12.705 (-00:00:02.616 = -3.47%) sed
00:01:13.146 (-00:00:02.175 = -2.89%) perl
00:01:15.321 (+00:00:00.000 = +0.00%) head|tail
00:01:16.583 (+00:00:01.262 = +1.68%) awk
00:05:12.156 (+00:03:56.835 = +314.43%) cut
Row 500,000,000
00:12:07.050 (-00:00:26.160) sed
00:12:11.460 (-00:00:21.750) perl
00:12:33.210 (+00:00:00.000) head|tail
00:12:45.830 (+00:00:12.620) awk
00:52:01.560 (+00:40:31.650) cut
Row 3,338,559,320
01:20:54.599 (-00:03:05.327) sed
01:21:24.045 (-00:02:25.227) perl
01:23:49.273 (+00:00:00.000) head|tail
01:25:13.548 (+00:02:35.735) awk
05:47:23.026 (+04:24:26.246) cut
With awk it is pretty fast:
awk 'NR == num_line' file
When this is true, the default behaviour of awk is performed: {print $0}.
Alternative versions
If your file happens to be huge, you'd better exit after reading the required line. This way you save CPU time See time comparison at the end of the answer.
awk 'NR == num_line {print; exit}' file
If you want to give the line number from a bash variable you can use:
awk 'NR == n' n=$num file
awk -v n=$num 'NR == n' file # equivalent
See how much time is saved by using exit, specially if the line happens to be in the first part of the file:
# Let's create a 10M lines file
for ((i=0; i<100000; i++)); do echo "bla bla"; done > 100Klines
for ((i=0; i<100; i++)); do cat 100Klines; done > 10Mlines
$ time awk 'NR == 1234567 {print}' 10Mlines
bla bla
real 0m1.303s
user 0m1.246s
sys 0m0.042s
$ time awk 'NR == 1234567 {print; exit}' 10Mlines
bla bla
real 0m0.198s
user 0m0.178s
sys 0m0.013s
So the difference is 0.198s vs 1.303s, around 6x times faster.
According to my tests, in terms of performance and readability my recommendation is:
tail -n+N | head -1
N is the line number that you want. For example, tail -n+7 input.txt | head -1 will print the 7th line of the file.
tail -n+N will print everything starting from line N, and head -1 will make it stop after one line.
The alternative head -N | tail -1 is perhaps slightly more readable. For example, this will print the 7th line:
head -7 input.txt | tail -1
When it comes to performance, there is not much difference for smaller sizes, but it will be outperformed by the tail | head (from above) when the files become huge.
The top-voted sed 'NUMq;d' is interesting to know, but I would argue that it will be understood by fewer people out of the box than the head/tail solution and it is also slower than tail/head.
In my tests, both tails/heads versions outperformed sed 'NUMq;d' consistently. That is in line with the other benchmarks that were posted. It is hard to find a case where tails/heads was really bad. It is also not surprising, as these are operations that you would expect to be heavily optimized in a modern Unix system.
To get an idea about the performance differences, these are the number that I get for a huge file (9.3G):
tail -n+N | head -1: 3.7 sec
head -N | tail -1: 4.6 sec
sed Nq;d: 18.8 sec
Results may differ, but the performance head | tail and tail | head is, in general, comparable for smaller inputs, and sed is always slower by a significant factor (around 5x or so).
To reproduce my benchmark, you can try the following, but be warned that it will create a 9.3G file in the current working directory:
#!/bin/bash
readonly file=tmp-input.txt
readonly size=1000000000
readonly pos=500000000
readonly retries=3
seq 1 $size > $file
echo "*** head -N | tail -1 ***"
for i in $(seq 1 $retries) ; do
time head "-$pos" $file | tail -1
done
echo "-------------------------"
echo
echo "*** tail -n+N | head -1 ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
time tail -n+$pos $file | head -1
done
echo "-------------------------"
echo
echo "*** sed Nq;d ***"
echo
seq 1 $size > $file
ls -alhg $file
for i in $(seq 1 $retries) ; do
time sed $pos'q;d' $file
done
/bin/rm $file
Here is the output of a run on my machine (ThinkPad X1 Carbon with an SSD and 16G of memory). I assume in the final run everything will come from the cache, not from disk:
*** head -N | tail -1 ***
500000000
real 0m9,800s
user 0m7,328s
sys 0m4,081s
500000000
real 0m4,231s
user 0m5,415s
sys 0m2,789s
500000000
real 0m4,636s
user 0m5,935s
sys 0m2,684s
-------------------------
*** tail -n+N | head -1 ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:49 tmp-input.txt
500000000
real 0m6,452s
user 0m3,367s
sys 0m1,498s
500000000
real 0m3,890s
user 0m2,921s
sys 0m0,952s
500000000
real 0m3,763s
user 0m3,004s
sys 0m0,760s
-------------------------
*** sed Nq;d ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:50 tmp-input.txt
500000000
real 0m23,675s
user 0m21,557s
sys 0m1,523s
500000000
real 0m20,328s
user 0m18,971s
sys 0m1,308s
500000000
real 0m19,835s
user 0m18,830s
sys 0m1,004s
Wow, all the possibilities!
Try this:
sed -n "${lineNum}p" $file
or one of these depending upon your version of Awk:
awk -vlineNum=$lineNum 'NR == lineNum {print $0}' $file
awk -v lineNum=4 '{if (NR == lineNum) {print $0}}' $file
awk '{if (NR == lineNum) {print $0}}' lineNum=$lineNum $file
(You may have to try the nawk or gawk command).
Is there a tool that only does the print that particular line? Not one of the standard tools. However, sed is probably the closest and simplest to use.
Save two keystrokes, print Nth line without using bracket:
sed -n Np <fileName>
^ ^
\ \___ 'p' for printing
\______ '-n' for not printing by default
For example, to print 100th line:
sed -n 100p foo.txt
This question being tagged Bash, here's the Bash (≥4) way of doing: use mapfile with the -s (skip) and -n (count) option.
If you need to get the 42nd line of a file file:
mapfile -s 41 -n 1 ary < file
At this point, you'll have an array ary the fields of which containing the lines of file (including the trailing newline), where we have skipped the first 41 lines (-s 41), and stopped after reading one line (-n 1). So that's really the 42nd line. To print it out:
printf '%s' "${ary[0]}"
If you need a range of lines, say the range 42–666 (inclusive), and say you don't want to do the math yourself, and print them on stdout:
mapfile -s $((42-1)) -n $((666-42+1)) ary < file
printf '%s' "${ary[#]}"
If you need to process these lines too, it's not really convenient to store the trailing newline. In this case use the -t option (trim):
mapfile -t -s $((42-1)) -n $((666-42+1)) ary < file
# do stuff
printf '%s\n' "${ary[#]}"
You can have a function do that for you:
print_file_range() {
# $1-$2 is the range of file $3 to be printed to stdout
local ary
mapfile -s $(($1-1)) -n $(($2-$1+1)) ary < "$3"
printf '%s' "${ary[#]}"
}
No external commands, only Bash builtins!
You may also used sed print and quit:
sed -n '10{p;q;}' file # print line 10
You can also use Perl for this:
perl -wnl -e '$.== NUM && print && exit;' some.file
As a followup to CaffeineConnoisseur's very helpful benchmarking answer... I was curious as to how fast the 'mapfile' method was compared to others (as that wasn't tested), so I tried a quick-and-dirty speed comparison myself as I do have bash 4 handy. Threw in a test of the "tail | head" method (rather than head | tail) mentioned in one of the comments on the top answer while I was at it, as folks are singing its praises. I don't have anything nearly the size of the testfile used; the best I could find on short notice was a 14M pedigree file (long lines that are whitespace-separated, just under 12000 lines).
Short version: mapfile appears faster than the cut method, but slower than everything else, so I'd call it a dud. tail | head, OTOH, looks like it could be the fastest, although with a file this size the difference is not all that substantial compared to sed.
$ time head -11000 [filename] | tail -1
[output redacted]
real 0m0.117s
$ time cut -f11000 -d$'\n' [filename]
[output redacted]
real 0m1.081s
$ time awk 'NR == 11000 {print; exit}' [filename]
[output redacted]
real 0m0.058s
$ time perl -wnl -e '$.== 11000 && print && exit;' [filename]
[output redacted]
real 0m0.085s
$ time sed "11000q;d" [filename]
[output redacted]
real 0m0.031s
$ time (mapfile -s 11000 -n 1 ary < [filename]; echo ${ary[0]})
[output redacted]
real 0m0.309s
$ time tail -n+11000 [filename] | head -n1
[output redacted]
real 0m0.028s
Hope this helps!
The fastest solution for big files is always tail|head, provided that the two distances:
from the start of the file to the starting line. Lets call it S
the distance from the last line to the end of the file. Be it E
are known. Then, we could use this:
mycount="$E"; (( E > S )) && mycount="+$S"
howmany="$(( endline - startline + 1 ))"
tail -n "$mycount"| head -n "$howmany"
howmany is just the count of lines required.
Some more detail in https://unix.stackexchange.com/a/216614/79743
All the above answers directly answer the question. But here's a less direct solution but a potentially more important idea, to provoke thought.
Since line lengths are arbitrary, all the bytes of the file before the nth line need to be read. If you have a huge file or need to repeat this task many times, and this process is time-consuming, then you should seriously think about whether you should be storing your data in a different way in the first place.
The real solution is to have an index, e.g. at the start of the file, indicating the positions where the lines begin. You could use a database format, or just add a table at the start of the file. Alternatively create a separate index file to accompany your large text file.
e.g. you might create a list of character positions for newlines:
awk 'BEGIN{c=0;print(c)}{c+=length()+1;print(c+1)}' file.txt > file.idx
then read with tail, which actually seeks directly to the appropriate point in the file!
e.g. to get line 1000:
tail -c +$(awk 'NR=1000' file.idx) file.txt | head -1
This may not work with 2-byte / multibyte characters, since awk is "character-aware" but tail is not.
I haven't tested this against a large file.
Also see this answer.
Alternatively - split your file into smaller files!
If you got multiple lines by delimited by \n (normally new line). You can use 'cut' as well:
echo "$data" | cut -f2 -d$'\n'
You will get the 2nd line from the file. -f3 gives you the 3rd line.
Using what others mentioned, I wanted this to be a quick & dandy function in my bash shell.
Create a file: ~/.functions
Add to it the contents:
getline() {
line=$1
sed $line'q;d' $2
}
Then add this to your ~/.bash_profile:
source ~/.functions
Now when you open a new bash window, you can just call the function as so:
getline 441 myfile.txt
Lots of good answers already. I personally go with awk. For convenience, if you use bash, just add the below to your ~/.bash_profile. And, the next time you log in (or if you source your .bash_profile after this update), you will have a new nifty "nth" function available to pipe your files through.
Execute this or put it in your ~/.bash_profile (if using bash) and reopen bash (or execute source ~/.bach_profile)
# print just the nth piped in line
nth () { awk -vlnum=${1} 'NR==lnum {print; exit}'; }
Then, to use it, simply pipe through it. E.g.,:
$ yes line | cat -n | nth 5
5 line
To print nth line using sed with a variable as line number:
a=4
sed -e $a'q:d' file
Here the '-e' flag is for adding script to command to be executed.
After taking a look at the top answer and the benchmark, I've implemented a tiny helper function:
function nth {
if (( ${#} < 1 || ${#} > 2 )); then
echo -e "usage: $0 \e[4mline\e[0m [\e[4mfile\e[0m]"
return 1
fi
if (( ${#} > 1 )); then
sed "$1q;d" $2
else
sed "$1q;d"
fi
}
Basically you can use it in two fashions:
nth 42 myfile.txt
do_stuff | nth 42
This is not a bash solution, but I found out that top choices didn't satisfy my needs, eg,
sed 'NUMq;d' file
was fast enough, but was hanging for hours and did not tell about any progress. I suggest compiling this cpp program and using it to find the row you want. You can compile it with g++ main.cpp, where main.cpp is the file with the content below. I got a.out and executed it with ./a.out
#include <iostream>
#include <string>
#include <fstream>
using namespace std;
int main() {
string filename;
cout << "Enter filename ";
cin >> filename;
int needed_row_number;
cout << "Enter row number ";
cin >> needed_row_number;
int progress_line_count;
cout << "Enter at which every number of rows to monitor progress ";
cin >> progress_line_count;
char ch;
int row_counter = 1;
fstream fin(filename, fstream::in);
while (fin >> noskipws >> ch) {
int ch_int = (int) ch;
if (row_counter == needed_row_number) {
cout << ch;
}
if (ch_int == 10) {
if (row_counter == needed_row_number) {
return 0;
}
row_counter++;
if (row_counter % progress_line_count == 0) {
cout << "Progress: line " << row_counter << endl;
}
}
}
return 0;
}
To get an nth line (single line)
If you want something that you can later customize without having to deal with bash you can compile this c program and drop the binary in your custom binaries directory. This assumes that you know how to edit the .bashrc file
accordingly (only if you want to edit your path variable), If you don't know, this is a helpful link.
To run this code use (assuming you named the binary "line").
line [target line] [target file]
example
line 2 somefile.txt
The code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main(int argc, char* argv[]){
if(argc != 3){
fprintf(stderr, "line needs a line number and a file name");
exit(0);
}
int lineNumber = atoi(argv[1]);
int counter = 0;
char *fileName = argv[2];
FILE *fileReader = fopen(fileName, "r");
if(fileReader == NULL){
fprintf(stderr, "Failed to open file");
exit(0);
}
size_t lineSize = 0;
char* line = NULL;
while(counter < lineNumber){
getline(&line, &linesize, fileReader);
counter++
}
getline(&line, &lineSize, fileReader);
printf("%s\n", line);
fclose(fileReader);
return 0;
}
EDIT: removed the fseek and replaced it with a while loop
I've put some of the above answers into a short bash script that you can put into a file called get.sh and link to /usr/local/bin/get (or whatever other name you prefer).
#!/bin/bash
if [ "${1}" == "" ]; then
echo "error: blank line number";
exit 1
fi
re='^[0-9]+$'
if ! [[ $1 =~ $re ]] ; then
echo "error: line number arg not a number";
exit 1
fi
if [ "${2}" == "" ]; then
echo "error: blank file name";
exit 1
fi
sed "${1}q;d" $2;
exit 0
Ensure it's executable with
$ chmod +x get
Link it to make it available on the PATH with
$ ln -s get.sh /usr/local/bin/get
UPDATE 1 : found much faster method in awk
just 5.353 secs to obtain a row above 133.6 mn :
rownum='133668997'; ( time ( pvE0 < ~/master_primelist_18a.txt |
LC_ALL=C mawk2 -F'^$' -v \_="${rownum}" -- '!_{exit}!--_' ) )
in0: 5.45GiB 0:00:05 [1.02GiB/s] [1.02GiB/s] [======> ] 71%
( pvE 0.1 in0 < ~/master_primelist_18a.txt |
LC_ALL=C mawk2 -F'^$' -v -- ; ) 5.01s user
1.21s system 116% cpu 5.353 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
===============================================
I'd like to contest the notion that perl is faster than awk :
so while my test file isn't nearly quite as many rows, it's also twice the size, at 7.58 GB -
I even gave perl some built-in advantageous - like hard-coding in the row number, and also going second, thus gaining any potential speedups from OS caching mechanism, if any
f="$( grealpath -ePq ~/master_primelist_18a.txt )"
rownum='133668997'
fg;fg; pv < "${f}" | gwc -lcm
echo; sleep 2;
echo;
( time ( pv -i 0.1 -cN in0 < "${f}" |
LC_ALL=C mawk2 '_{exit}_=NR==+__' FS='^$' __="${rownum}"
) ) | mawk 'BEGIN { print } END { print _ } NR'
sleep 2
( time ( pv -i 0.1 -cN in0 < "${f}" |
LC_ALL=C perl -wnl -e '$.== 133668997 && print && exit;'
) ) | mawk 'BEGIN { print } END { print _ } NR' ;
fg: no current job
fg: no current job
7.58GiB 0:00:28 [ 275MiB/s] [============>] 100%
148,110,134 8,134,435,629 8,134,435,629 <<<< rows, chars, and bytes
count as reported by gnu-wc
in0: 5.45GiB 0:00:07 [ 701MiB/s] [=> ] 71%
( pv -i 0.1 -cN in0 < "${f}" | LC_ALL=C mawk2 '_{exit}_=NR==+__' FS='^$' ; )
6.22s user 2.56s system 110% cpu 7.966 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
in0: 5.45GiB 0:00:17 [ 328MiB/s] [=> ] 71%
( pv -i 0.1 -cN in0 < "${f}" | LC_ALL=C perl -wnl -e ; )
14.22s user 3.31s system 103% cpu 17.014 total
77.37219=195591955519519519559551=0x296B0FA7D668C4A64F7F=
I can re-run the test with perl 5.36 or even perl-6 if u think it's gonna make a difference (haven't installed either), but a gap of
7.966 secs (mawk2) vs. 17.014 secs (perl 5.34)
between the two, with the latter more than double the prior, it seems clear which one is indeed meaningfully faster to fetch a single row way deep in ASCII files.
This is perl 5, version 34, subversion 0 (v5.34.0) built for darwin-thread-multi-2level
Copyright 1987-2021, Larry Wall
mawk 1.9.9.6, 21 Aug 2016, Copyright Michael D. Brennan

Resources