Reading groups of lines from a large text file - bash

I am looking to pull certain groups of lines from large (~870,000,000 line) text files. For example in a 50 line file I might want lines 3-6, 18-27, and 39-45.
From browsing Stack Overflow, I have found that the bash command:
tail -n +NUMstart file | head -n COUNT
is the fastest way to get a single line or group of lines starting at line NUMstart, where COUNT is the number of lines to keep (NUMend - NUMstart + 1 to stop at line NUMend). However, when reading multiple groups of lines this seems inefficient. Normally the technique wouldn't matter so much, but with files this large it makes a huge difference.
Is there a better way to go about this than using the above command for each group of lines? I am assuming the answer will most likely be a bash command but am really open to any language/tool that will do the job best.

To show lines 3-6, 18-27 and 39-45 with sed:
sed -n "3,6p;18,27p;39,45p" file
It is also possible to feed sed from a file.
Content of file foobar:
3,6p
18,27p
39,45p
Usage:
sed -n -f foobar file
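On a file this large it may be worth telling sed to stop after the last wanted line instead of reading to the end. A minimal sketch, with seq standing in for the real file (the 50-line input and the variable name out are assumptions for the demo):

```shell
# 45q quits right after line 45 has been printed, so later lines are never read
out=$(seq 50 | sed -n '3,6p;18,27p;39,45p;45q')
printf '%s\n' "$out"
```

The quit address has to be the last line of the last range, so it needs updating whenever the ranges change.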

awk to the rescue!
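awk -v lines='3-6,18-27,39-45' '
BEGIN {
    # split "3-6,18-27,39-45" into range starts (rs) and range ends (re)
    n = split(lines, a, ",")
    for (i = 1; i <= n; i++) {
        split(a[i], t, "-")
        rs[++c] = t[1]; re[c] = t[2]
    }
    s = 1    # index of the first range that has not been passed yet
}
{
    for (i = s; i <= c; i++)
        if (NR >= rs[i] && NR <= re[i]) { print; next }
        else if (NR > re[i]) s++
    if (s > c) exit    # every range is behind us, stop reading
}' file
This exits early, on the first line after the last requested one. There is no error checking, and the ranges must be given in increasing order.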

The problem with tail -n XX file | head -n YY for different ranges is that you are running it several times, hence the inefficiency. Otherwise, benchmarks suggest that they are the best solution.
For this specific case, you may want to use awk:
awk '(NR>=start1 && NR<=end1) || (NR>=start2 && NR<=end2) || ...' file
In your case:
awk '(NR>=3 && NR<=6) || (NR>=18 && NR<=27) || (NR>=39 && NR<=45)' file
That is, you group the ranges and let awk print the corresponding lines when they occur, looping through the file just once. It may also be useful to add a final NR==endX {exit} (endX being the closing line of the last range) so that awk stops processing once it has read the last interesting line.
In your case:
awk '(NR>=3 && NR<=6) || (NR>=18 && NR<=27) || (NR>=39 && NR<=45); NR==45 {exit}' file

Related

Fastest way -- Appending a line to a file only if it does not already exist

given this question Appending a line to a file only if it does not already exist
is there a faster way than the solution provided by @drAlberT?
grep -q -F 'string' foo.bar || echo 'string' >> foo.bar
I have implemented the above solution and have to run it against a 500k-line file (i.e. check whether a line is already in a set of 500k lines). Moreover, I have to run this process many times, maybe 10-50 million times. Needless to say it's kind of slow, as each run takes 25-30 ms on my server (so 3-10+ days of runtime in total).
EDIT: the flow is the following: I have a file with 500k lines, each time I run, I get maybe 10-30 new lines and I check if they are already there or not. If not I add them, then I repeat many times. The order of my 500k lines files is important as I'm going through it with another process.
EDIT2: the 500k lines file is always containing unique lines, and I only care about "full lines", no substrings.
Thanks a lot!
A few suggested improvements:
Try using awk instead of grep so that you can both detect the string and write it in one action;
If you do use grep don't use a Bash loop to feed each potential match to grep and then append that one word to the file. Instead, read all the potential lines into grep as matches (using -f file_name) and print the matches. Then invert the matches and append the inverted match. See last pipeline here;
Exit as soon as you see the string (for a single string) rather than continuing to loop over a big file;
Don't call the script millions of times with one or just a few lines -- organize the glue script (in Bash I suppose) so that the core script is called once or a few times with all the lines instead;
Perhaps use multicores since the files are not dependent on each other. Maybe with GNU Parallel (or you could use Python or Ruby or Perl that has support for threads).
Consider this awk for a single line to add:
$ awk -v line=line_to_append 'FNR==NR && line==$0{f=1; exit}
END{if (!f) print line >> FILENAME}' file
Or for multiple lines:
$ awk 'FNR==NR {lines[$0]; next}
$0 in lines{delete lines[$0]}
END{for (e in lines) print e >> FILENAME}' lines file
Some timings using a copy of the Unix words file (235,886 lines) with a five line lines file that has two overlaps:
$ echo "frob
knob
kabbob
stew
big slob" > lines
$ time awk 'FNR==NR {lines[$0]; next}
$0 in lines{delete lines[$0]}
END{for (e in lines) print e >> FILENAME}' lines words
real 0m0.056s
user 0m0.051s
sys 0m0.003s
$ tail words
zythum
Zyzomys
Zyzzogeton
frob
kabbob
big slob
Edit 2
Try this as being the best of both:
$ time grep -x -f lines words |
awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines >> words
real 0m0.012s
user 0m0.010s
sys 0m0.003s
Explanation:
grep -x -f lines words find the lines that ARE in words
awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines invert those into lines that are NOT in words
>> words append those to the file
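One caveat with the pipeline above: if grep finds no overlap at all, awk gets an empty stdin, FNR==NR is then still true while it reads lines, and nothing is appended. A possible alternative, sketched here on made-up sample data, is to let grep do the inversion itself by using the big file as the pattern list (-F for fixed strings, -x for whole lines, -v to invert). The temp directory and the file contents are assumptions for the demo:

```shell
tmp=$(mktemp -d)
printf '%s\n' apple knob stew zebra > "$tmp/words"
printf '%s\n' frob knob kabbob stew 'big slob' > "$tmp/lines"

# Lines of "lines" that do not occur as whole lines in "words", in input order.
# grep exits 1 when nothing is new, hence the "|| true".
new=$(grep -F -x -v -f "$tmp/words" "$tmp/lines" || true)

# Append only if something is actually new
if [ -n "$new" ]; then
    printf '%s\n' "$new" >> "$tmp/words"
fi
result=$(cat "$tmp/words")
printf '%s\n' "$result"
```

Capturing the new lines in a variable first means words is only written to after grep has finished reading it.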
Turning the millions of passes over the file into a script with millions of actions will save you a lot of overhead. Searching for a single label at each pass over the file is incredibly inefficient; you can search for as many labels as you can comfortably fit into memory in a single pass over the file.
Something along the following lines, perhaps.
awk 'NR==FNR { a[$0]++; next }
$0 in a { delete a[$0] }
1
END { for (k in a) print k }' strings bigfile >bigfile.new
If you can't fit strings in memory all at once, splitting that into suitable chunks will obviously allow you to finish this in as many passes as you have chunks.
On the other hand, if you have already (effectively) divided the input set into sets of 10-30 labels, you can obviously only search for those 10-30 in one pass. Still, this should provide you with a speed improvement on the order of 10-30 times.
This assumes that a "line" is always a full line. If the label can be a substring of a line in the input file, or vice versa, this will need some refactoring.
If duplicates are not valid in the file, just append them all and filter out the duplicates:
cat myfile mynewlines | awk '!n[$0]++' > mynewfile
This will allow appending millions of lines in seconds.
If order additionally doesn't matter and your files are more than a few gigabytes, you can use sort -u instead.
Have the script read new lines from stdin after consuming the original file. All lines are stored in an associative array (without any compression such as md5sum).
The 'x' prefix on the keys is meant to handle awkward inputs such as '-e', and it also keeps the key non-empty for blank lines; better ways probably exist.
#!/bin/bash
declare -A aa
# Load the existing lines, keyed with an "x" prefix
while IFS= read -r line; do
    aa["x$line"]=1
done < file.txt
# Append each new line from stdin unless it has already been seen
while IFS= read -r line; do
    if [[ -z ${aa["x$line"]} ]]; then
        aa["x$line"]=1
        printf '%s\n' "$line" >> file.txt
    fi
done

performance issues in shell script

I have a 200 MB tab-separated text file with millions of rows. One column holds locations such as US, UK, AU, etc.
I want to split this file on the basis of this column. The code below works, but it has a performance problem: it takes more than an hour to split the file into multiple files based on location. Here is the code:
#!/bin/bash
read -p "Please enter the file to split " file
read -p "Enter the Col No. to split " col_no
#set -x
header=`head -1 $file`
cnt=1
while IFS= read -r line
do
if [ $((cnt++)) -eq 1 ]
then
echo "$line" >> /dev/null
else
loc=`echo "$line" | cut -f "$col_no"`
f_name=`echo "file_"$loc".txt"`
if [ -f "$f_name" ]
then
echo "$line" >> "$f_name";
else
touch "$f_name";
echo "file $f_name created.."
echo "$line" >> "$f_name";
sed -i '1i '"$header"'' "$f_name"
fi
fi
done < $file
The logic applied here is that we are reading the entire file only once, and depending on the locations, we are creating and appending the data to it.
Please suggest necessary improvements in the code to enhance its performance.
Following is a sample data and is separated by colon instead of tab. The country code is in the 4th column:
ID1:ID2:ID3:ID4:ID5
100:abcd:TEST1:ZA:CCD
200:abcd:TEST2:US:CCD
300:abcd:TEST3:AR:CCD
400:abcd:TEST4:BE:CCD
500:abcd:TEST5:CA:CCD
600:abcd:TEST6:DK:CCD
312:abcd:TEST65:ZA:CCD
1300:abcd:TEST4153:CA:CCD
There are a couple of things to bear in mind:
Reading files using while read is slow
Creating subshells and executing external processes is slow
This is a job for a text processing tool, such as awk.
I would suggest that you use something like this:
# save first line
NR == 1 {
header = $0
next
}
{
filename = "file_" $col ".txt"
# if country code has changed
if (filename != prev) {
# close the previous file
close(prev)
# if we haven't seen this file yet
if (!(filename in seen)) {
print header > filename
}
seen[filename]
}
# print whole line to file
print >> filename
prev = filename
}
Run the script using something along the following lines:
awk -v col="$col_no" -f script.awk file
where $col_no is a shell variable containing the column number with the country codes.
If you don't have too many different country codes, you can get away with leaving all the files open, in which case you can remove the call to close(prev).
You can test the script on the sample provided in the question like this:
awk -F: -v col=4 -f script.awk file
Note that I've added -F: to change the input field separator to :.
I think Tom is on the right track, but I'd simplify this a little.
Awk is magical in some ways. One of those ways is that it will keep all its input and output file handles open unless you explicitly close them. So if you create a variable containing an output file name, you can simply redirect to your variable and trust that awk will send the data to the place you've specified and eventually close the output file when it runs out of input to process.
(N.B. an extension of this magic is that in addition to redirects, you can maintain multiple PIPES. Imagine if you were to cmd="gzip -9 > file_"$4".txt.gz"; print | cmd)
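The pipe idea from the note above might look like this in full. This is only a sketch: it assumes gzip is installed, that the country codes are plain alphanumerics (the field is pasted into a shell command, so anything else would be unsafe), and the sample rows and temp directory are made up for the demo:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'ID1:ID2:ID3:ID4:ID5' '100:abcd:TEST1:ZA:CCD' \
    '200:abcd:TEST2:US:CCD' '312:abcd:TEST65:ZA:CCD' > "$tmp/inp.txt"

( cd "$tmp" && awk -F: '
NR == 1 { h = $0; next }                     # capture the header
{
    cmd = "gzip -9 > file_" $4 ".txt.gz"     # one gzip process per country
    if (!(cmd in seen)) { print h | cmd; seen[cmd] }
    print | cmd
}
END { for (c in seen) close(c) }             # flush and wait for every gzip
' inp.txt )

za=$(gzip -dc "$tmp/file_ZA.txt.gz")
printf '%s\n' "$za"
```

Each distinct command string is its own open pipe, so the open-file limits discussed for plain output files apply to pipes as well.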
The following splits your file without adding a header to each output file.
awk -F: 'NR>1 {out="file_"$4".txt"; print > out}' inp.txt
If adding the header is important, a little more code is required. But not much.
awk -F: 'NR==1{h=$0;next} {out="file_"$4".txt"} !(out in files){print h > out; files[out]} {print > out}' inp.txt
Or, because this one-liner is now a bit long, we can split it out for explanation:
awk -F: '
NR==1 {h=$0;next} # Capture the header
{out="file_"$4".txt"} # Capture the output file
!(out in files){ # If we haven't seen this output file before,
print h > out; # print the header to it,
files[out] # and record the fact that we've seen it.
}
{print > out} # Finally, print our line of input.
' inp.txt
I tested these two scripts successfully on the input data you provided in your question. With this type of solution, there is no need to sort your input data -- your output in each file will be in the order in which that subset's records appeared in your input data.
Note: different versions of awk will permit you to open different numbers of open files. GNU awk (gawk) has a limit in the thousands -- significantly more than the number of countries you might have to deal with. BSD awk version 20121220 (in FreeBSD) appears to run out after 21117 files. BSD awk version 20070501 (in OS X El Capitan) is limited to 17 files.
If you're not confident in your potential number of open files, you can experiment with your version of awk using something like this:
mkdir -p /tmp/i
awk '{o="/tmp/i/file_"NR".txt"; print "hello" > o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
You can also test the number of open pipes:
awk '{o="cat >/dev/null; #"NR; print "hello" | o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
(If you have a /dev/yes or something that just spits out lines of text ad nauseam, that would be better than using /dev/random for input.)
I haven't previously come across this limit in my own awk programming because when I've needed to create many many output files, I've always used gawk. :-P

Shell scripting to find the delimiter

I have a file with three columns, which has pipe as a delimiter. Now some lines in the file can have a "," instead of "|", due to some error. I want to output all such erroneous rows.
You can also use grep, it is more complicated:
egrep "\|.*\|.*\|" input
echo No pipe
egrep "^[^\|]*$" input
echo One pipe
egrep "^[^\|]*\|[^\|]*$" input
echo 3+ pipe
egrep "\|[^\|]*\|[^\|]*\|" input
Before combining the greps, first introduce new variables
p (pipe) and n (no pipe)
p="\|"
n="[^\|]*"
echo "p=$p, n=$n"
echo No pipe
egrep "^$n$" input
echo One pipe
egrep "^$n$p$n$" input
echo 3+ pipe
egrep "$p$n$p$n$p" input
Now bring all together
egrep "^$n$|^$n$p$n$|$p$n$p$n$p" input
Edit: The comments and variable names were about "slashes", but they are pipes (with backslashes). That was a bit confusing.
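Since a valid row has exactly two pipes, the three cases can probably be collapsed into a single inverted match: print every line that is not "non-pipes, pipe, non-pipes, pipe, non-pipes". A sketch on made-up sample rows (the file name and contents are assumptions):

```shell
tmp=$(mktemp -d)
printf '%s\n' 'ABC|12345|EAR' 'ssdf|fdas,sdfsf' 'a,b,c' 'a|b|c|d' > "$tmp/input"

# -v inverts the match: keep the lines that do NOT have exactly two pipes
bad=$(grep -E -v '^[^|]*\|[^|]*\|[^|]*$' "$tmp/input")
printf '%s\n' "$bad"
```

Single quotes keep the shell from touching the backslashes, and inside a bracket expression the pipe needs no escaping.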
To count the number of columns with awk you can use the NF variable:
$ cat file
ABC|12345|EAR
PQRST|123|TWOEYES
ssdf|fdas,sdfsf
$ awk -F\| 'NF!=3' file
ssdf|fdas,sdfsf
However, this does not seem to cover all the possible ways the data could be corrupted based on the various revisions of the question and the comments.
A better approach would be to define the exact format that the data must follow. For example, assuming that a line is "correct" if it is three columns, with the first and third letters only, and the second numeric, you could write the following script to match all non conforming lines:
awk -F\| '!(NF==3 && $1$3 ~ /^[a-zA-Z]+$/ && $2+0==$2)' file
Test (notice that only the second line (which is conforming) does not get printed):
$ cat file
A,BC|12345|EAR
PQRST|123|TWOEYES
ssdf|fdas,sdfsf
ABC|3983|MAKE,
sf dl lfsdklf |kldsamfklmadkfmask |mfkmadskfmdslafmka
ABC|abs|EWE
sdf|123|123
$ awk -F\| '!(NF==3&&$1$3~/^[a-zA-Z]+$/&&$2+0==$2)' file
A,BC|12345|EAR
ssdf|fdas,sdfsf
ABC|3983|MAKE,
sf dl lfsdklf |kldsamfklmadkfmask |mfkmadskfmdslafmka
ABC|abs|EWE
sdf|123|123
You can adapt the above command to your specific needs, based on what you think is a valid input. For example, if you also wanted to restrict the length of each line to at most 50 characters, you could do
awk -F\| '!(NF==3 && $1$3 ~ /^[a-zA-Z]+$/ && $2+0==$2 && length($0)<=50)' file

Grep outputs multiple lines, need while loop

I have a script which uses grep to find lines in a text file (ics calendar to be specific)
My script finds a date match, then goes up and down a few lines to copy the summary and start time of the appointment into a separate variable. The problem I have is that I'm going to have multiple appointments at the same time, and I need to run through the whole process for each result in grep.
Example:
LINE=`grep -F -n 20130304T232200 /path/to/calendar.ics | cut -f1 -d:`
And it outputs only the lines, such as
86 89
Then it goes on to capture my other variables, as such:
SUMMARYLINE=$(( $LINE + 5 ))
SUMMARY=`sed -n "$SUMMARYLINE"p /path/to/calendar.ics`
My script runs fine with one result, but it obviously won't work with more than one, and I need it to. Should I send the grep results into an array? A separate text file to read from? I'm sure I'll need a while loop in here somehow. Need some help please.
You can call grep from a loop quite easily:
while IFS=':' read -r LINE notused # avoids the use of cut
do
# First field is now in $LINE
# Further processing
done < <(grep -F -n 20130304T232200 /path/to/calendar.ics)
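With the body filled in, the loop might look like the sketch below. It is a POSIX-flavoured variant that pipes into the loop instead of using process substitution; the +5 offset is the one from the question, and the sample calendar, the temp directory, and the variable names are assumptions for the demo:

```shell
tmp=$(mktemp -d)
ics="$tmp/calendar.ics"
printf '%s\n' \
    'BEGIN:VEVENT' 'DTSTART:20130304T232200' 'DTEND:20130304T235900' 'UID:1' \
    'STATUS:CONFIRMED' 'LOCATION:Office' 'SUMMARY:Dentist' 'END:VEVENT' \
    'BEGIN:VEVENT' 'DTSTART:20130304T232200' 'DTEND:20130304T235900' 'UID:2' \
    'STATUS:CONFIRMED' 'LOCATION:Home' 'SUMMARY:Call Bob' 'END:VEVENT' > "$ics"

# One iteration per matching line number; print the line five below each match
summaries=$(grep -F -n 20130304T232200 "$ics" | cut -d: -f1 |
    while read -r n; do
        sed -n "$((n + 5))p" "$ics"
    done)
printf '%s\n' "$summaries"
```

This still rereads the file once per match, so it only suits a handful of hits.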
However, if the file is not too large then it might be easier to read the whole file into an array and move around in that.
With your proposed solution, you are reading through the file several times. Using awk, you can do it in one pass:
awk -F: -v time=20130304T232200 '
$1 == "SUMMARY" {summary = substr($0,9)}
/^DTSTART/ {start = $2}
/^END:VEVENT/ && start == time {print summary}
' calendar.ics

Compare execution logs, ignoring the execution times

I'm new to the Linux OS and bash commands, and I think someone with more experience could help me. I want to compare two text files containing execution logs, but some lines (not all of them) begin with a time token like this:
12345 ps line 1 content
23456 ps line 2 content
line 3 content
345 ps line 4 content
Those tokens have different values in each log, but for this comparison I don't care about them; I just want to compare the line contents and ignore the tokens. I could use sed to generate new files without the tokens and then compare them, but I intend to do this repeatedly, so it would save me time if I could use just one command or one sh file. I've tried combining sed and diff, but without success. Would anyone please be able to help me?
You can use the following sed one-liner to remove the numbers from the beginning of each line:
sed 's/^[0-9]* ps//g' file1
To diff two such files (less timestamps) you can use process substitution.
diff <(sed 's/^[0-9]* ps//g' file1) <(sed 's/^[0-9]* ps//g' file2)
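Since the question asks for a single reusable command, the same idea could be wrapped in a small shell function. This is a sketch only: strip_ts and logdiff are hypothetical names, temp files replace process substitution so it also runs under plain sh, the sed expression is slightly stricter than the one above (at least one digit, and the space after "ps" is removed too), and the sample logs are made up:

```shell
# Remove a leading "<digits> ps " token from every line of a file
strip_ts() { sed 's/^[0-9][0-9]* ps //' "$1"; }

# Diff two logs with the tokens removed; returns the exit status of diff
logdiff() {
    ld_a=$(mktemp) ld_b=$(mktemp)
    strip_ts "$1" > "$ld_a"
    strip_ts "$2" > "$ld_b"
    ld_rc=0
    diff "$ld_a" "$ld_b" || ld_rc=$?
    rm -f "$ld_a" "$ld_b"
    return "$ld_rc"
}

tmp=$(mktemp -d)
printf '%s\n' '12345 ps line 1 content' 'line 3 content' '345 ps line 4 content' > "$tmp/log1"
printf '%s\n' '999 ps line 1 content' 'line 3 content' '7 ps line 4 content' > "$tmp/log2"
printf '%s\n' '999 ps line 1 content' 'line 3 content' '7 ps line 4 changed' > "$tmp/log3"
```

Here logdiff "$tmp/log1" "$tmp/log2" prints nothing and succeeds, because the two logs differ only in their time tokens, while comparing log1 with log3 reports the changed last line.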
Untested since you didn't show 2 input files and the expected output but from your description I THINK this would do what you want:
awk '
{ sub(/^[[:digit:]]+[[:space:]]*/,"") }
NR==FNR { file1[FNR] = $0; next }
{ print ($0 == file1[FNR] ? "==" : "!="), $0 }
' file1 file2
If that doesn't do it, post some small sample input and expected output.

Resources