Iterating grep over and over. How can I make my script faster? - performance

I have to find a string of numbers from one file in another file.
My code is this:
#!/bin/sh
IFS="F"
while read f1 f2
do
    LC_ALL=C fgrep -m 1 "$f1" BC_Tel.inp
done < telephonelist.txt
The strings of numbers are located in telephonelist.txt. The format of this text file is as follows:
8901040000001304669F 370040000130466
8901040000001317380F 370040000131738
8901040000001330045F 370040000133004
8901040000001330052F 370040000133005
8901040000001330060F 370040000133006
I'm looking for the lines with the above numbers delimited by 'F' in BC_Tel.inp, which has the following format:
981040000030289765F1 655F370D1E86260ED550A2D6F80EFF96 01000045384136453332440303FFFFFFFFFFFFFFFF0000 01000037333643383234380303FFFFFFFFFFFFFFFF0000 083907400030289765 00000031323334FFFFFFFF030334303733323638310AFF 01000034383532FFFFFFFF030334333738333137320AFF 0020 01007F107FD2266C31249530FC531B474F6D44482C007F007F007F007F007F007F007F007F007F007F007F007F007F007F007F107F97AB34277D5378AEC893716281F99ABC007F007F007F007F007F007F007F007F007F007F007F007F007F007F007F107F6608B51E4378BE23072E843D6741A184007F007F007F007F007F007F007F007F007F007F007F007F007F007F 636C8D46973FAE4C1BD181BB4E0D4DA2A5E0455E86406CCF40F309F63470CE07 000003817826FF0187494010083A65626501586519104106 083907400030289765636C8D46973FAE4C1BD181BB4E0D4DA2 080900000000101003636C8D46973FAE4C1BD181BB4E0D4DA2 8901040000038279561 40732681
telephonelist.txt and BC_Tel.inp are huge files with over a million lines each. The script works fine, but I want to make it faster. I run over the txt file only once, but I'm grepping the .inp file over and over. How do I go about making this process faster?
tl;dr
I want to optimize my code so it runs faster.

A single grep will do it:
cut -d"F" -f1 telephonelist.txt | grep -F -f- BC_Tel.inp
The -f option to grep names a file containing the patterns to search for; here we use the filename - to mean "read the patterns from stdin". This loads every number once and makes a single pass over BC_Tel.inp. (Note that -m1 would be wrong here: with a pattern file it stops after the first matching line overall, not after one match per pattern.)
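grep -F -f holds all million fixed strings in memory while scanning. If each number also appears as its own whitespace-separated field in BC_Tel.inp (as in the sample line above), an awk hash join is an alternative sketch:
awk 'NR==FNR { sub(/F.*/, ""); seen[$0]; next }   # first file: keep the digits before the F
     { for (i = 1; i <= NF; i++)                  # second file: test every field
         if ($i in seen) { print; next } }' telephonelist.txt BC_Tel.inp
Unlike the grep pipeline, this only finds exact field matches, not substrings.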

Related

Extracting a value from the same file in multiple directories

Directory names: F1, F2, F3, …, F120
Inside each directory is a file with the common name 'xyz.txt'.
Each xyz.txt holds a single value.
Example:
F1
  Xyz.txt
    3.345e-2
F2
  Xyz.txt
    2.345e-2
F3
  Xyz.txt
    1.345e-2
...
F120
  Xyz.txt
    0.345e-2
I want to extract these values and paste them into a single file, say 'new.txt', as one column, like:
New.txt
3.345e-2
2.345e-2
1.345e-2
...
0.345e-2
Any help please? Thank you so much.
If your files look very similar then you can use grep. For example:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.][0-9]{3}e-[0-9]$' > new.txt
This is a general pattern, since each digit can be anything. The regular expression says that the whole line must consist of: any digit [0-9], a dot character [.], three digits [0-9]{3}, the letter 'e', a minus sign, and any digit [0-9].
If your data is more regular, you can also try a simpler solution:
cat F{1..120}/xyz.txt | grep -E '^[0-9][.]345e-2$' > new.txt
In this solution only the first digit can be anything.
If your files might contain something other than that single line, but the line you want can be unambiguously extracted with a regex, you can use
sed -n '/^[0-9]\.[0-9]*e-*[0-9]*$/p' F*/Xyz.txt >new.txt
The same can be done with grep, but you have to tell it separately not to print the file names (-h). The -x option is a convenience that makes the regex match whole lines only, which simplifies it.
grep -h -x '[0-9]\.[0-9]*e-*[0-9]*' F*/Xyz.txt >new.txt
If you have some files which match the wildcard which should be excluded, try a more complex wildcard, or multiple wildcards which only match exactly the files you want, like maybe F[1-9]/Xyz.txt F[1-9][0-9]/Xyz.txt F1[0-9][0-9]/Xyz.txt
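Note that the F* wildcard expands in lexical order (F1, F10, F100, F11, ...), so the values in new.txt will not follow the F1..F120 order. If order matters and your shell is bash, brace expansion preserves it, e.g.:
sed -n '/^[0-9]\.[0-9]*e-*[0-9]*$/p' F{1..120}/Xyz.txt >new.txt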
This might work for you (GNU parallel and grep):
parallel -k grep -hE '^[0-9][.][0-9]{3}e-[0-9]$' F{}/xyz.txt ::: {1..120}
Process files in parallel but output results in order.
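If GNU parallel is not installed, a plain loop is a drop-in sketch (slower, but the output stays in order):
for i in {1..120}; do
    grep -hE '^[0-9][.][0-9]{3}e-[0-9]$' "F$i/xyz.txt"
done > new.txt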
If the files contain just one line, and you want the whole thing, you can use bash range expansion:
cat /path/to/F{1..120}/Xyz.txt > output.txt
(this keeps the order too).
If the files have more lines, and you need to actually extract the value, use grep -o (-o is not POSIX, but your grep probably has it), together with -h to suppress the file-name prefixes that grep adds when given multiple files:
grep -ho '[0-9][.]345e-2' /path/to/F{1..120}/Xyz.txt > output.txt

Executing a bash loop script from a file

I am trying to execute this in Unix. Let's say, for example, I have five files named after dates, and each of those files contains thousands of numerical values (six- to ten-digit numbers). Now let's say I also have a bunch of numerical values, and I want to know which value belongs to which file. I am doing it the hard way, as below, but how do I put all my values in a file and just loop over that?
FILES:
20170101
20170102
20170103
20170104
20170105
Code:
for i in 5555555 67554363 564324323 23454657 666577878 345576867; do
    echo $i; grep -l $i 201701*;
done
Or, why loop at all? If you have a file containing all your numbers (say numbers.txt), you can find which date file each is contained in, and on what line, with a simple
grep -nH -w -f numbers.txt 201701*
Where the -f option simply tells grep to use the values contained in the file numbers.txt as the patterns to search for in each of the files matching 201701*. The -n and -H options list the line number and filename for each match, respectively. And as Ed points out below, the -w option ensures grep only selects lines containing the whole word sought.
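For example, assuming the files hold one number per line and 5555555 sits on line 347 of 20170102, the output would contain a line like (values are hypothetical):
20170102:347:5555555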
You can also do it with a while loop, reading from the file if you create it as @Barmar suggested:
while read -r i; do
    ...
done < numbers.txt
Put the values in a file numbers.txt and do:
for i in $(cat numbers.txt); do
    ...
done

How to quickly check a .gz file without unzip? [duplicate]

How do I get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error:
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
    Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
    Decompress.
On some systems (e.g., Mac), you need to use gzcat.
On a Mac you need to use < with zcat:
zcat < CONN.20111109.0057.gz|head
If a contiguous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where the 5th through 10th lines (both inclusive) of file.gz are extracted into a new file, subFile; the 11q quits as soon as the range has been printed, so the rest of the file is not read. For sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p' > subFile
which prints the 1st line, skips the next 4, prints the 6th, and so on (the first~step address form is a GNU sed extension).
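If your sed lacks that extension, a portable awk equivalent (a sketch):
gunzip -c file.gz | awk 'NR % 5 == 1' > subFile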
If you want to use zcat, this will show the first 10 lines:
zcat your_filename.gz | head
Let's say you want the first 16 lines:
zcat your_filename.gz | head -n 16
This awk snippet will let you show not only the first few lines, but any range you specify. It also adds line numbers, which I needed for debugging an error message pointing to a certain line way down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (the number of records read so far), which is usually equivalent to the line number. The from and to variables are picked up from the command line via the -v options.
NR >= from {
    print NR, $0;
    if (NR >= to)
        exit 1
}
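Wrapped as a small reusable shell function (a sketch; the name gzrange is made up):
# Usage: gzrange FROM TO FILE.gz -- print numbered lines FROM..TO
gzrange() {
    gunzip -c "$3" | awk -v from="$1" -v to="$2" 'NR>=from { print NR, $0; if (NR>=to) exit }'
}
gzrange 10 20 file.gz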

My .gz/.zip file contains a huge text file; without saving that file unpacked to disk, how to extract its lines that match a regular expression?

I have a file.gz (not a .tar.gz!) or file.zip file. It contains one file (a 20 GB text file with tens of millions of lines) named 1.txt.
Without saving 1.txt to disk as a whole (this requirement is the same as in my previous question), I want to extract all its lines that match some regular expression and don't match another regex.
The resulting .txt files must not exceed a predefined limit, say, one million lines.
That is, if there are 3.5M lines in 1.txt that match those conditions, I want to get 4 output files: part1.txt, part2.txt, part3.txt, part4.txt (the latter will contain 500K lines), that's all.
I tried to make use of something like
gzip -c path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But the above code doesn't work. Maybe Bash can do it, as in my previous question, but I don't know how.
You can perhaps use zgrep.
zgrep [ grep_options ] [ -e ] pattern filename.gz ...
NOTE: zgrep is a wrapper script (installed with the gzip package) that essentially runs the same command internally as mentioned in other answers.
However, it reads better in a script and is easier to type by hand.
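Applied to the problem above, a sketch (both regexes are placeholders; this assumes your zgrep passes -P through to a grep built with PCRE support, and the chunks come out named partaa, partab, ...):
zgrep -P 'my regex' file.gz | grep -vP 'other regex' | split -l 1000000 - part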
I'm afraid it's impossible; to quote the gzip man page:
If you wish to create a single archive file with multiple members so
that members can later be extracted independently, use an archiver
such as tar or zip.
UPDATE: After the edit, since the .gz contains only one file, a one-step tool like awk should be fine:
gzip -cd path/to/test/file.gz | awk 'BEGIN{global=1} /my regex/ {count+=1; print $0 > ("part" global ".txt"); if (count==1000000) {count=0; global+=1}}'
split is also a good choice, but you will have to rename the files afterwards.
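A slightly fuller sketch of the same awk idea that also honors the requirement of not matching a second regex, closing each finished chunk as it goes (awk uses POSIX regexes rather than PCRE; both patterns are placeholders):
gzip -cd path/to/test/file.gz |
awk '/my regex/ && !/other regex/ {
    if (count % 1000000 == 0) {      # start a new chunk every million kept lines
        if (out != "") close(out)    # release the finished file
        out = "part" (++n) ".txt"
    }
    print $0 > out
    count++
}'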
Your solution is almost good. The problem is that you have to tell gzip what to do: to decompress, use -d. So try:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But with this you will get a bunch of files like xaa, xab, xac, ... I suggest using the PREFIX and numeric-suffix features of split to create better output:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -dl1000000 - file
In this case the result files will be named file00, file01, file02, etc.
If you also want to filter out lines matching some other Perl-style regex, you can try something like this:
gzip -dc path/to/test/file.gz | grep -P 'my regex' | grep -vP 'other regex' | split -dl1000000 - file
I hope this helps.

Using sed to dynamically generate a file name

I have a CSV file that I'd like to split up based on a field in the file. Essentially, there can be two brands, GVA and HBVL. I'd like to split the file into a file for each brand before I import it into a database.
Sample of the CSV file
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0
Part of the problem is the size of the file. It's about 39 MB. My original attempt at this looked like this:
while read line ; do
    name=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\2/ p' | tr [:upper:] [:lower:] `
    info=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\1\3/ p'`
    echo "${info}" >> ${BASEDIR}/${today}/${name}.txt
done < ${file}
After about 2.5 hours, only about half of the file had been processed. I have another file that could potentially be up to 250 MB in size, and I can't imagine how long that would take.
What I'd like to do is pull the brand out of the line and write the line to a file named after the brand. I can extract the brand, but I don't know how to use it to create a file. I've started in sed, but I'm not above using another language if it's more appropriate.
The original while loop with multiple commands per line is DIRE!
sed -e '/"GVA"/w gva.file' -e '/"HBVL"/w hbvl.file' -n $file
The sed script says:
write lines that match the GVA tag to gva.file
write lines that match the HBVL tag to hbvl.file
and don't print anything else ('-n')
Note that different versions of sed can handle different numbers of auxiliary files. If you need more than, say, twenty output files at once, you may need to look at other technology (but test what the limit is on your machine). If the file is sorted so that all the GVA records appear together followed by all the HBVL records, you could consider using csplit (see the sketch below). Alternatively, a scripting language like Perl could handle more. If you exceed the number of file descriptors allowed to your process, it becomes hard to do the splitting in a single pass over the data file.
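For that sorted case, a csplit sketch (assuming every GVA record precedes every HBVL record in $file):
csplit "$file" '/"HBVL"/'    # xx00 gets everything before the first HBVL line, xx01 the rest
mv xx00 gva.file
mv xx01 hbvl.file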
grep '"GVA"' $file >GVA.txt
grep '"HBVL"' $file >HBVL.txt
# awk -F"," '{o=$5;gsub(/\"/,"",o);print $0 > o}' OFS="," file
# more GVA
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
# more HBVL
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0

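If the number of distinct brands were ever large enough to hit the open-file limit discussed earlier, a variant of the awk one-liner that closes each file after writing is a possible sketch (slower, but it keeps only one descriptor open at a time):
awk -F"," '{o=$5; gsub(/\"/,"",o); print $0 >> o; close(o)}' file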