grep of 50000 strings in a big file performance improvement


I have a file of about 200 MB with about 1.2 M lines in it; call it reading.txt. I have another file, input.txt,
with about 50000 lines. I want to take the string on each line of input.txt and grep for it in reading.txt. For every matching line
in reading.txt, I want to write the complete line to another file, output.txt.
At the moment I loop over every string in input.txt and grep for it in reading.txt. This approach takes more than an hour.
Is there any way to improve performance so that this process takes less time?
while read line
do
LC_ALL=C grep ${line} reading.txt 2>/dev/null
done<input.txt >> output.txt

man grep yields (among others):
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. If this option is used
multiple times or is combined with the -e (--regexp) option,
search for all patterns given. The empty file contains zero
patterns, and therefore matches nothing.

grep -f input.txt reading.txt > output.txt
...will write to 'output.txt' every line of 'reading.txt' that contains a substring matching a line of 'input.txt', in the order the lines appear in 'reading.txt'.
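If the lines of input.txt are literal strings rather than regular expressions (an assumption; the question does not say), adding -F makes grep treat them as fixed strings, which is typically faster with tens of thousands of patterns. A minimal sketch on made-up sample data:

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-ins for the real input.txt and reading.txt.
printf '%s\n' 'alpha' 'gam.ma' > input.txt
printf '%s\n' 'one alpha two' 'beta' 'gamXma' 'x gam.ma y' > reading.txt

# -F: patterns are fixed strings, so '.' matches only a literal dot.
# -f: read all patterns from input.txt; reading.txt is scanned once.
LC_ALL=C grep -F -f input.txt reading.txt > output.txt

cat output.txt
```

Without -F, the pattern gam.ma would be read as a regular expression and would also match the line gamXma.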
You don't specify this, but it may be relevant (you said 1.2 M lines in 'reading.txt'). Here is how to write a separate output file for every matching line:
#!/bin/sh
nl='
'
IFS=$nl
c=0
for i in $(grep -f input.txt reading.txt); do
c=$((c+1))
echo "$i" > output$c.txt
done
There are tidier ways of setting IFS to a newline, for example IFS=$'\n' in bash (where you can also use > output$((++c)).txt).
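A sketch of the same per-line split using a while read loop, which leaves IFS alone outside the loop and does not glob-expand matched lines that happen to contain characters such as *. The sample data and the intermediate file matches.tmp are made up for illustration; the outputN.txt naming follows the script above.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical miniature input files.
printf '%s\n' 'foo' 'bar' > input.txt
printf '%s\n' 'a foo *' 'nothing' 'the bar b' > reading.txt

# Collect the matches first; reading them from a file (not a pipe) keeps
# the counter $c visible after the loop ends.
grep -f input.txt reading.txt > matches.tmp

c=0
while IFS= read -r line; do
    c=$((c + 1))
    printf '%s\n' "$line" > "output$c.txt"
done < matches.tmp
rm -f matches.tmp

echo "wrote $c files"
```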

Related

echo last character of text file in Unix/Bash

I need to see the last characters of a bunch of text files (or, alternatively, test whether they are "}" and get a list of the files that test negative). Is there an easy way to do this from the command line?
(Ideally the solution works without reading the whole file from the start because, in addition to there being many files, they can also be quite large.)
P.S.: Any answer would be great, but I would really appreciate it if the function and syntax of everything in the answer could be fully explained.

It can be done fairly easily with tail and then string indexing in bash. For example, you obtain the last line in a file with, tail -n1 file. You will need to store the line in a variable using command-substitution, e.g.
lastln=$(tail -n1 file)
Then it is simply a matter of indexing the last characters, e.g.
echo ${lastln:(-1)}
(Note: when indexing from the end of the string, you must either put the offset in parentheses, e.g. (-1), or leave a space before it; echo ${lastln: -1} is also valid.)
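Since the question asks to avoid reading whole files, here is a sketch that asks tail for only the final bytes with -c. It swaps the bash substring syntax for a POSIX case pattern, so it is a variation on the approach above rather than the same technique; the function name ends_with_brace and the sample files are made up for illustration.

```shell
# Succeeds if the last non-newline character of file $1 is '}'.
ends_with_brace() {
    # Command substitution strips trailing newlines, so "}\n" becomes "}".
    last=$(tail -c 2 "$1")
    case $last in
        *'}') return 0 ;;
        *)    return 1 ;;
    esac
}

workdir=$(mktemp -d)
printf '{ "a": 1 }\n' > "$workdir/good.json"
printf '{ "a": 1\n'   > "$workdir/bad.json"

# List the files that do NOT end in '}'.
for f in "$workdir"/*.json; do
    ends_with_brace "$f" || echo "no closing brace: ${f##*/}"
done
```

Reading two bytes instead of one is what lets the check cope with files that end in "}" followed by a single newline.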

You can try this:
for file in file1 file2; do tail -n 1 "$file" | grep -q '}$' || echo "$file"; done
where you should replace file1 file2 with the list of files you want to analyze, e.g. * or the like. Now what happens here? The outer part
for file in file1 file2; do ...; done
is a simple loop over the files, where inside the loop, you can refer to the current file as $file. Then,
tail -n 1 "$file"
prints the last line of the given file and
| grep -q '}$'
pipes the output to grep (put into quiet mode with -q), which looks for '}' immediately followed by the end of the line ($). The exit status of this command can be used to chain another action: when grep returns non-zero (indicating failure, i.e., the pattern did not match), the last part
|| echo "$file"
is executed, resulting in the list of files you need.

How to loop a variable range in cut command

I have a file with 2 columns, and I want to use the values from the second column to set the range in the cut command to select a range of characters from another file. The range I want is the character at the position given by the value in the second column plus the next 10 characters. An example follows.
My files are something like that:
File with 2 columns and no blank lines between lines (file1.txt):
NAME1 10
NAME2 25
NAME3 48
NAME4 66
File from which I want to extract the variable ranges of characters (just one very long line with no spaces and no bold font) (file2.txt):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
...or, more literally (for copy/paste to test):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
Desired resulting file, one sequence per line (result.txt):
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
The resulting file would have the characters from 10-20, 25-35, 48-58 and 66-76, each range in a new line. So, it would always keep the range of 10, but in different start points and those start points are set by the values in the second column from the first file.
I tried the command:
for i in $(awk '{print $2}' file1.txt);
do
p1=$i;
p2=`expr "$1" + 10`
cut -c$p1-$2 file2.txt > result.txt;
done
I don't get any output or error message.
I also tried:
while read line; do
set $line
p2=`expr "$2" + 10`
cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt
This last command gives me an error message:
cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument

There's no need for cut here; dd can do the job of indexing into a file, and reading only the number of bytes you want. (Note that status=none is a GNUism; you may need to leave it out on other platforms and redirect stderr otherwise if you want to suppress informational logging).
while read -r name index _; do
dd if=file2.txt bs=1 skip="$index" count=10 status=none
printf '\n'
done <file1.txt >result.txt
This approach avoids excessive memory requirements (as present when reading the whole of file2 -- assuming it's large), and has bounded performance requirements (overhead is equal to starting one copy of dd per sequence to extract).
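A runnable sketch of the dd loop on the sample data from the question. It redirects dd's stderr instead of using status=none, so it does not depend on GNU dd; the scratch directory and the variable name rest are additions for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data copied from the question.
printf '%s\n' 'NAME1 10' 'NAME2 25' 'NAME3 48' 'NAME4 66' > file1.txt
printf '%s\n' 'GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC' > file2.txt

while read -r name index rest; do
    # skip counts bytes from 0, so skip=10 starts at the 11th character.
    dd if=file2.txt bs=1 skip="$index" count=10 2>/dev/null
    printf '\n'
done < file1.txt > result.txt

cat result.txt
```

The result matches the desired result.txt from the question, which shows that the column values are treated as 0-based offsets.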

Using awk
$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2 file1
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT

If file2.txt is not too large, then you can read it in memory,
and use Bash sub-strings to extract the desired ranges:
data=$(<file2.txt)
while read -r name index _; do
echo "${data:$index:10}"
done <file1.txt >result.txt
This will be much more efficient than running cut or another process for every single range definition.
(Thanks to @CharlesDuffy for the tip to read data without a useless cat, and the while loop.)

One way to solve it:
#!/bin/bash
while read line; do
pos=$(echo "$line" | cut -f2 -d' ')
x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
echo "$x"
done < file1.txt > result.txt
It's not the solution an experienced bash hacker would use, but it is very good for someone who is new to bash. It uses tools that are very versatile, although somewhat slow if you need high performance. Shell scripting is commonly used by people who rarely write shell scripts, but who know a few commands and just want to get the job done. That's why I'm including this solution, even if the other answers are superior for more experienced people.
The first line is pretty easy. It just extracts the numbers from file1.txt. The second line uses the very nice tools head and tail. Usually, they are used with lines instead of characters. Nevertheless, I print the first pos + 10 characters with head. The result is piped into tail which prints the last 10 characters.
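To see the head/tail combination in isolation, here is a small sketch on a made-up string (the alphabet stands in for file2.txt, and pos=5 is an arbitrary example value):

```shell
pos=5
# head keeps the first pos+10 = 15 characters: abcdefghijklmno
# tail keeps the last 10 of those:             fghijklmno
x=$(printf 'abcdefghijklmnopqrstuvwxyz' | head -c $((pos + 10)) | tail -c 10)
echo "$x"
```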
Thanks to @CharlesDuffy for improvements.

Delete everything after a certain line in bash

I was wondering if there was a way to delete everything after a certain line of a text file in bash. So say there's a text file with 10 lines, and I want to delete every line after line number 4, so only the first 4 lines remained, how would I go about doing that?

You can use GNU sed:
sed -i '5,$d' file.txt
That is, 5,$ means the range line 5 until the end, and d means to delete.
Only the first 4 lines will remain.
The -i flag tells sed to edit the file in-place.
If you have only BSD sed, then the -i flag requires a backup file suffix:
sed -i.bak '5,$d' file.txt
As @ephemient pointed out, while this solution is simple,
it's inefficient because sed will still read the input until the end of the file, which is unnecessary.
As @agc pointed out, the inverse logic of my first proposal might be actually more intuitive. That is, do not print by default (-n flag),
and explicitly print range 1,4:
sed -ni.bak 1,4p file.txt
Another simple alternative, assuming that the first 4 lines are not excessively long and so they easily fit in memory, and also assuming that the 4th line ends with a newline character,
you can read the first 4 lines into memory and then overwrite the file:
lines=$(head -n 4 file.txt)
echo "$lines" > file.txt
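To sidestep the difference between GNU and BSD -i altogether, one portable sketch writes to a temporary file and renames it over the original. The .tmp suffix and the test data are assumptions for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 10 > file.txt            # made-up test data: the numbers 1..10

# 4q prints lines 1-4 and then quits, so the rest of the file is not processed.
# mv replaces the original only if sed succeeded.
sed 4q file.txt > file.txt.tmp && mv file.txt.tmp file.txt

cat file.txt
```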

Minor refinements on Janos' answer, ephemient's answer, and cdark's comment:
Simpler (and faster) sed code:
sed -i 4q file
When a filter util can't directly edit a file, there's
sponge:
head -4 file | sponge file
Most efficient for Linux might be truncate -- coreutils sibling util to fallocate, which offers the same minimal I/O of ephemient's more portable, (but more complex), dd-based answer:
truncate -s `head -4 file | wc -c` file
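A sketch of the truncate approach on throwaway data. It assumes GNU coreutils truncate is installed; the arithmetic expansion is only there to strip any padding that some wc implementations print around the number.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 10 > file                # made-up test data: the numbers 1..10

# Byte length of the first 4 lines ("1\n2\n3\n4\n" is 8 bytes here).
size=$(( $(head -n 4 file | wc -c) ))

# Cut the file to that length in place; no data is copied.
truncate -s "$size" file

cat file
```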

The sed method that @janos proposed is simple but inefficient. It will read every line from the original file, even ones it could ignore (although that can be fixed using 4q), and -i actually creates a new file (which it renames to replace the original file). And there's the annoying bit where you need to use sed -i '5,$d' file.txt with GNU sed but sed -i '' '5,$d' file.txt with BSD sed in order to overwrite the existing file instead of leaving a backup.
Another method that performs less I/O:
dd bs=1 count=0 if=/dev/null of=file.txt \
seek=$(grep -b ^ file.txt | tail -n+5 | head -n1 | cut -d: -f1)
grep -b ^ file.txt prints out byte offsets on each line, e.g.
$ yes | grep -b ^
0:y
2:y
4:y
...
tail -n+5 skips the first 4 lines, outputting the 5th and subsequent lines
head -n1 takes only the next line (e.g. only the 5th line)
After head reads the one line, it will exit. This causes tail to exit because it has nowhere to output to anymore. This causes grep to exit for the same reason. Thus, the rest of file.txt does not need to be examined.
cut -d: -f1 takes only the first part before the : (the byte offset)
dd bs=1 count=0 if=/dev/null of=file.txt seek=N
using a block size of 1 byte, seek to block N of file.txt
copy 0 blocks of size 1 byte from /dev/null to file.txt
truncate file.txt here (because conv=notrunc was not given)
In short, this removes all data on the 5th and subsequent lines from file.txt.
On Linux there is a command named fallocate which can similarly extend or truncate a file, but that's not portable.
UNIX filesystems support efficiently truncating files in-place, and these commands are portable. The downside is that it's more work to write out.
(Also, dd will print some unnecessary stats to stderr, and will exit with an error if the file has fewer than 5 lines, although in that case it will leave the existing file contents in place, so the behavior is still correct. Those can be addressed also, if needed.)
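The two caveats above can be addressed along these lines. This is a sketch, assuming a grep that supports -b and a dd that truncates at the seek offset as described; it lets sed pick out the offset of line 5 and skips dd entirely when there is no such line.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 10 > file.txt            # made-up test data: the numbers 1..10

# Byte offset at which line 5 starts; empty if the file has fewer than 5 lines.
offset=$(grep -b '^' file.txt | sed -n '5{s/:.*//p;q;}')

if [ -n "$offset" ]; then
    # Copy nothing, but seek to $offset so dd truncates the file there.
    # Redirecting stderr hides dd's transfer statistics.
    dd if=/dev/null of=file.txt bs=1 count=0 seek="$offset" 2>/dev/null
fi

cat file.txt
```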

If I don't know the line number, merely the line content (and I know that there is nothing from the line containing 'knowntext' onward that I want to preserve), then I use
sed -i '/knowntext/,$d' inputfilename
to directly alter the file, or to be cautious
sed '/knowntext/,$d' inputfilename > outputfilename
where inputfilename is unaltered, and outputfilename contains the truncated version of the input. Note that the line containing 'knowntext' is itself deleted along with everything after it.
I am not competent to comment on the efficiency of this, but I know that files of 20kB or so are dealt with faster than I can blink.
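A small sketch on made-up data comparing the range delete with sed's q command, which is one way to keep the matching line when that is wanted. The file contents are hypothetical; the file name follows the answer above.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'keep 1' 'keep 2' 'knowntext marker' 'drop me' > inputfilename

# Delete from the first matching line to the end: the match itself goes too.
without_match=$(sed '/knowntext/,$d' inputfilename)

# Print up to and including the first match, then quit without reading further.
with_match=$(sed '/knowntext/q' inputfilename)

printf '%s\n---\n%s\n' "$without_match" "$with_match"
```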

Using GNU awk (v. 4.1.0+, see here). First we create a test file (NOTICE THE DISCLAIMER):
$ seq 1 10 > file # THIS WILL OVERWRITE FILE NAMED file WITH TEST DATA
Then the code and validation (WILL MODIFY THE ORIGINAL FILE NAMED file):
$ awk -i inplace 'NR<=4' file
$ cat file
1
2
3
4
Explained:
$ awk -i inplace ' # edit is targeted to the original file (try without -i ...)
NR<=4 # output first 4 records
' file # file
You could also exit at record 5 (NR==5), which is quicker if you redirect the program's output to a new file (remove the # to enable the redirection); this is equivalent to head -4 file > new_file:
$ awk 'NR==5{exit}1' file # > new_file
When testing, don't forget the seq part first.

Iterating grep over and over. How can I make my script faster?

I have to find a string of numbers from one file in another file.
My code is this:
#!/bin/sh
IFS="F"
while read f1 f2
do
LC_ALL=C fgrep -m 1 "$f1" BC_Tel.inp
done < telephonelist.txt
The string of numbers are located in telephonelist.txt. The format of this text file is as follows:
8901040000001304669F 370040000130466
8901040000001317380F 370040000131738
8901040000001330045F 370040000133004
8901040000001330052F 370040000133005
8901040000001330060F 370040000133006
I'm looking for the lines with the above numbers delimited by 'F' in BC_Tel.inp, which has the following format:
981040000030289765F1 655F370D1E86260ED550A2D6F80EFF96 01000045384136453332440303FFFFFFFFFFFFFFFF0000 01000037333643383234380303FFFFFFFFFFFFFFFF0000 083907400030289765 00000031323334FFFFFFFF030334303733323638310AFF 01000034383532FFFFFFFF030334333738333137320AFF 0020 01007F107FD2266C31249530FC531B474F6D44482C007F007F007F007F007F007F007F007F007F007F007F007F007F007F007F107F97AB34277D5378AEC893716281F99ABC007F007F007F007F007F007F007F007F007F007F007F007F007F007F007F107F6608B51E4378BE23072E843D6741A184007F007F007F007F007F007F007F007F007F007F007F007F007F007F 636C8D46973FAE4C1BD181BB4E0D4DA2A5E0455E86406CCF40F309F63470CE07 000003817826FF0187494010083A65626501586519104106 083907400030289765636C8D46973FAE4C1BD181BB4E0D4DA2 080900000000101003636C8D46973FAE4C1BD181BB4E0D4DA2 8901040000038279561 40732681
telephonelist.txt and BC_Tel.inp are huge files with over a million lines. The script works fine but I want to make it faster. I'm basically running over the txt file once, but I'm grepping over and over the .inp file. How do I go about making this process faster?
tl;dr
I want to optimize my code so it runs faster.

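A single grep will do it:
cut -d"F" -f1 telephonelist.txt | grep -F -f- BC_Tel.inp
The -f option to grep provides a filename containing the patterns. Here, we're using the filename - to indicate "stdin". Note that -m1 is left out: combined with -f it would make grep stop after the first matching line overall, not after the first match for each number.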
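A small sketch on made-up data showing what the single-pass pipeline prints, with and without -m1. The miniature file contents are hypothetical; only the file names and the F delimiter come from the question.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up miniature versions of the two files.
printf '%s\n' '111F 9991' '222F 9992' > telephonelist.txt
printf '%s\n' 'aa 111 bb' 'cc 333 dd' 'ee 222 ff' > BC_Tel.inp

# All lines containing any of the numbers, in a single pass over BC_Tel.inp.
all=$(cut -d F -f 1 telephonelist.txt | grep -F -f - BC_Tel.inp)

# With -m1, grep stops after the first matching line overall.
first=$(cut -d F -f 1 telephonelist.txt | grep -F -m1 -f - BC_Tel.inp)

printf 'all:\n%s\nfirst only:\n%s\n' "$all" "$first"
```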

Using sed to dynamically generate a file name

I have a CSV file that I'd like to split up based on a field in the file. Essentially, there can be two brands, GVA and HBVL. I'd like to split the file into a file for each brand before I import it into a database.
Sample of the CSV file
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0
Part of the problem is the size of the file. It's about 39mb. My original attempt at this looked like this:
while read line ; do
name=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\2/ p' | tr [:upper:] [:lower:] `
info=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\1\3/ p'`
echo "${info}" >> ${BASEDIR}/${today}/${name}.txt
done < ${file}
After about 2.5 hours, only about 1/2 of the file had been processed. I have another file that could potentially be up to 250 mb in size and I can't imagine how long that would take.
What I'd like to do is pull the brand out of the line and write the line to a file named after the brand. I can remove the brand, but I don't know how to use it to create a file. I've started in sed, but I'm not above using another language if it's more appropriate.

The original while loop with multiple commands per line is DIRE!
sed -e '/"GVA"/w gva.file' -e '/"HBVL"/w hbvl.file' -n $file
The sed script says:
write lines that match the GVA tag to gva.file
write lines that match the HBVL tag to hbvl.file
and don't print anything else ('-n')
Note that different versions of sed can handle different numbers of auxiliary files. If you need more than, say, twenty output files at once, you may need to look at other technology (but test what the limit is on your machine). If the file is sorted so that all the GVA records appear together followed by all the HBVL records, you could consider using csplit. Alternatively, a scripting language like Perl could handle more. If you exceed the number of file descriptors allowed to your process, it becomes hard to do the splitting in a single pass over the data file.
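When the set of brands is not known in advance, one option is to let awk build the file name from the brand field. This is a sketch, assuming the brand is always the fifth comma-separated field and that no earlier field contains a comma; the name brands.csv and the lowercase .txt naming are choices made here for illustration, and the rows are copied from the question. Unlike the original loop, it keeps the brand column in the output lines.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > brands.csv <<'EOF'
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0
EOF

# Field 5 holds the quoted brand; strip the quotes and lowercase it to get
# the output name. close() after each write keeps only one output file open
# at a time, trading some speed for independence from the open-file limit.
awk -F, '{
    brand = $5
    gsub(/"/, "", brand)
    out = tolower(brand) ".txt"
    print >> out
    close(out)
}' brands.csv

ls
```

Because the sketch appends with >>, the output files should not exist before the run.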

grep '"GVA"' "$file" >GVA.txt
grep '"HBVL"' "$file" >HBVL.txt

# awk -F"," '{o=$5;gsub(/\"/,"",o);print $0 > o}' OFS="," file
# more GVA
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
# more HBVL
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0
