optimizing this script to match lines of one txt file with another - bash

Okay so I am at best a novice in bash scripting, but I wrote this very small script late last night to take the first 40 characters of each line of a fairly large text file (~300,000 lines), search through a much larger text file (~2.2 million lines) for matches, and then write all of the matching lines to a new text file.
So the script looks like this:
#!/bin/bash
while read -r line
do
    match=${line:0:40}
    grep "$match" large_list.txt
done < "small_list.txt"
and then I call the script like so:
$ bash my_script.sh > outputfile.txt &
and this gives me all the common elements between the 2 lists. Now this is all well and good and it slowly works, but I am running this on an m1.small EC2 instance and, fair enough, the processing power on it is terrible; I could spin up a larger instance to handle all this, or do it on my desktop and upload the file. However, I would rather learn a more efficient way of accomplishing the same task, but I can't quite seem to figure this out. Any tidbits on how to best go about this, or how to complete the task more efficiently, would be very much appreciated.
To give you an idea of how slow this is: I started the script about 10 hours ago and I am about 10% of the way through all the matches.
Also, I am not set on using bash, so scripts in other languages are fair game. I figure the pros on S.O. can easily improve on my rock-for-a-hammer approach.
edit: adding inputs and outputs and more information about the data
input: (small text file)
8E636C0B21E42A3FC6AA3C412B31E3C61D8DD062|Vice S01E09 HDTV XviD-FUM[ettv]|Video TV|http://bitsnoop.com/vice-s01e09-hdtv-xvid-fum-ettv-q49614889.html|http://torrage.com/torrent/36A02E282D49EB7D94ACB798654829493CA929CB.torrent
3B9403AD73124A84AAE12E83A2DE446149516AC3|Sons of Guns S04E08 HDTV XviD-FUM[ettv]|Video TV|http://bitsnoop.com/sons-of-guns-s04e08-hdtv-xvid-fum-e-q49613491.html|http://torrage.com/torrent/3B9403AD73124A84AAE12E83A2DE446149516AC3.torrent
C4ADF747050D1CF64E9A626CA2563A0B8BD856E7|Save Me S01E06 HDTV XviD-FUM[ettv]|Video TV|http://bitsnoop.com/save-me-s01e06-hdtv-xvid-fum-ettv-q49515711.html|http://torrage.com/torrent/C4ADF747050D1CF64E9A626CA2563A0B8BD856E7.torrent
B71EFF95502E086F4235882F748FB5F2131F11CE|Da Vincis Demons S01E08 HDTV x264-EVOLVE|Video TV|http://bitsnoop.com/da-vincis-demons-s01e08-hdtv-x264-e-q49515709.html|http://torrage.com/torrent/B71EFF95502E086F4235882F748FB5F2131F11CE.torrent
match against (large text file)
86931940E7F7F9C1A9774EA2EA41AE59412F223B|0|0
8E636C0B21E42A3FC6AA3C412B31E3C61D8DD062|4|2|20705|9550|21419
ADFA5DD6F0923AE641F97A96D50D6736F81951B1|0|0
CF2349B5FC486E7E8F48591EC3D5F1B47B4E7567|1|0|429|428|22248
290DF9A8B6EC65EEE4EC4D2B029ACAEF46D40C1F|1|0|523|446|14276
C92DEBB9B290F0BB0AA291114C98D3FF310CF0C3|0|0|21448
Output:
8E636C0B21E42A3FC6AA3C412B31E3C61D8DD062|4|2|20705|9550|21419
Additional clarification: basically there is a hash, which is the first 40 characters of each line of the input file (a file I have already reduced to about 15% of its original size). For each line in this file there is a hash in the larger text file (the one I am matching against) with some corresponding information, and it is that line of the larger file that I would like to write to a new file, so that in the end I have a 1:1 ratio of everything in the smaller text file to my output_file.txt.
In this case I am showing the first line of the input being matched (line 2 of the larger file) and then written to an output file.
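The reason the original script is so slow is that every iteration of the loop starts a fresh grep that scans all ~2.2 million lines, so the whole run makes roughly 300,000 passes over the large file; any approach that reads each file only once will be far faster. One single-pass sketch using only standard tools (it assumes GNU grep and that the 40-character hash always starts the line; with ~300k patterns, grep -F will need a fair amount of memory):
cut -c1-40 small_list.txt | grep -F -f - large_list.txt > outputfile.txt
Here -F treats the patterns as fixed strings and -f - reads them from the pipe, so large_list.txt is scanned exactly once.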

awk solution adopted from this answer:
awk -F"|" 'NR==FNR{a[$1]=$2;next}{if (a[$1]) print}' small.txt large.txt
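One caveat with that one-liner: if (a[$1]) treats an empty or "0" value as a miss, so it quietly relies on the second field of small.txt never being empty. A slightly more robust variant (same idea, just testing key membership instead of the stored value):
awk -F"|" 'NR==FNR{a[$1];next} $1 in a' small.txt large.txt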

some python to the rescue.
I created two text-files using the following snippet:
#!/usr/bin/env python
import random
import string
N = 2000000
for i in range(N):
    s = ''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(40))
    print s + '|4|2|20705|9550|21419'
one with 300k lines and one with 2M lines. This gives me the following files:
$ ll
-rwxr-xr-x 1 210 Jun 11 22:29 gen_random_string.py*
-rw-rw-r-- 1 119M Jun 11 22:31 large.txt
-rw-rw-r-- 1 18M Jun 11 22:29 small.txt
Then I appended a line from small.txt to the end of large.txt so that I had a matching pattern
Then some more python:
#!/usr/bin/env python
target = {}
with open("large.txt") as fd:
    for line in fd:
        target[line.split('|')[0]] = line.strip()
with open("small.txt") as fd:
    for line in fd:
        if line.split('|')[0] in target:
            print target[line.split('|')[0]]
Some timings:
$ time ./comp.py
3A8DW2UUJO3FYTE8C5ESE25IC9GWAEJLJS2N9CBL|4|2|20705|9550|21419
real 0m2.574s
user 0m2.400s
sys 0m0.168s
$ time awk -F"|" 'NR==FNR{a[$1]=$2;next}{if (a[$1]) print}' small.txt large.txt
3A8DW2UUJO3FYTE8C5ESE25IC9GWAEJLJS2N9CBL|4|2|20705|9550|21419
real 0m4.380s
user 0m4.248s
sys 0m0.124s
Update:
To conserve memory, do the dictionary-lookup the other way
#!/usr/bin/env python
target = {}
with open("small.txt") as fd:
    for line in fd:
        target[line.split('|')[0]] = line.strip()
with open("large.txt") as fd:
    for line in fd:
        if line.split('|')[0] in target:
            print line.strip()
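Either version can then be pointed at the real data and redirected into a file, just as with the original script, e.g. (comp.py is simply the script name used in the timings above):
python comp.py > output_file.txt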

Related

Is there a faster way to combine files in an ordered fashion than a for loop?

For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?
Step 1:
for i in {001..999}; do
    [[ ! -f ${FILENAME}.${i}.xyz ]] && break
    cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
    mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done
Step 2:
for ((j=0; j<=${NUM_CONF}; j++)); do
    let "n = 2 + (${j} * ${LINES_PER_CONF})"
    let "m = ${j} + 1"
    ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
    sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Line 3-Number of atoms: Molecular coordinates
Line (number of atoms +1): same as line 1
Line (number of atoms +2): Title 2
... continues on (where line 1 through Number of atoms is associated with conformer 1, and so on)
The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.
The goal is to make the energy for each conformer the title for that conformer in the combined file (i.e. the energy must become the title line of the specific conformer it belongs to).
If you know that at least one matching file exists, you should be able to do this:
cat -- ${FILENAME}.[0-9][0-9][0-9].xyz > ${COMBINED_FILE}
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
However, moving these files under new names into another directory does require a loop, or one of the less-than-portable pattern-based renaming utilities.
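Such a loop could look roughly like this (a sketch; it just pulls the three-digit index back out of each filename and reuses the destination naming from the question):
for f in "${FILENAME}".[0-9][0-9][0-9].xyz; do
    i=${f%.xyz}    # strip the .xyz extension
    i=${i##*.}     # keep only the three-digit index
    mv -f -- "$f" "${XYZDIR}/${JOB_BASENAME}_${i}.xyz"
done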
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- ${FILENAME}.[0-9][0-9][0-9].xyz ${XYZDIR}/${JOB_BASENAME}
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
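A sketch of that idea, assuming the same fixed spacing the original loop relies on (the title sits on line 2, 2 + LINES_PER_CONF, and so on), writing the result to a temporary file here just called tmp:
awk -v step="$LINES_PER_CONF" '
    NR == FNR { energy[NR] = $2; next }      # first file: ACT energies, indexed by line number
    FNR >= 2 && (FNR - 2) % step == 0 {      # title lines of the combined file
        print energy[(FNR - 2) / step + 1]
        next
    }
    { print }
' "$ACTFILE" "$COMBINED_FILE" > tmp && mv tmp "$COMBINED_FILE"
This reads each file once, instead of launching awk and sed once per conformer.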
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before concatenating and moving them away. Then you don't have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which is the record number in the current file; condition/action pairs can tell when processing has moved on to the next file.
GNU Awk also has extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the record and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
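A minimal sketch of that pattern (GNU Awk only; the per-file record count just stands in for whatever you actually accumulate):
gawk '
    BEGINFILE { n = 0 }                              # reset the accumulator for each file
              { n++ }                                # accumulate over the current file
    ENDFILE   { print FILENAME ": " n " records" }   # report, then start fresh on the next file
' ${FILENAME}.[0-9][0-9][0-9].xyz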
If you really want to materialize all the file names without globbing, you can always jot them (jot is like seq, but keeps more integer digits in its default mode before switching to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 |
mawk '_<(_+=(NR == +_)*__)' \_=17 __=91 # extracting fixed interval
# samples without modulo(%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926

Calling AWK inside Gnuplot produces two extra lines

I am using AWK to preprocess files for plotting and fitting in Gnuplot 5.2 on Windows 10, e.g. like this:
plot '<awk "{print}" file1.dat file2.dat'
While fit works fine, plot yields this message:
Bad data on line 1 of file <awk "{print}" file1.dat file2.dat
I took a look at the bad data with print system('awk "{print}" file1.dat file2.dat'), which shows that there are two extra lines in front of the data. I can even reproduce them with a minimal print system('awk ""'), which gives
fstat < 0: fd = 0
fstat < 0: fd = 2
Of course, if I just want to extract a number out of the AWK command, I can do something like
sum = real(substr(system('awk "{sum+=$2} END {print sum}" file1.dat'), 37,-1))
While this is annoying, it works. But I have not found any workaround for plot. Better still, I would like a solution that avoids the extra lines in the first place. Does anyone have an idea how to do that?
Here, I have two more test cases that might provide information:
If I run AWK in CMD, the extra lines are not there.
Other CMD commands also do not produce the lines in gnuplot, i.e. if I call print system('echo test')

Slight error when using awk to remove spaces from a CSV column

I have used the following awk command in my bash script to delete spaces in the 26th column of my CSV:
awk 'BEGIN{FS=OFS="|"} {gsub(/ /,"",$26)}1' original.csv > final.csv
Out of 400 rows, I have about 5 random rows that this doesn't work on even if I rerun the script on final.csv. Can anyone assist me with a method to take care of this? Thank you in advance.
EDIT: Here is a sample of the 26th column on original.csv vs final.csv respectively;
original.csv        final.csv
2212026837          2212026837
2256 41688 6        2256416886
2076113566          2076113566
2009 84517 7        2009845177
2067950476          2067950476
2057 90531 5        2057 90531 5
2085271676          2085271676
2095183426          2095183426
2347366235          2347366235
2200160434          2200160434
2229359595          2229359595
2045373466          2045373466
2053849895          2053849895
2300 81552 3        2300 81552 3
I see two possibilities.
The simplest is that you have some whitespace other than a space. You can fix that by using a more general regex in your gsub: instead of / /, use /[[:space:]]/.
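Concretely, that is the same command with only the regex changed:
awk 'BEGIN{FS=OFS="|"} {gsub(/[[:space:]]/,"",$26)}1' original.csv > final.csv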
If that solves your problem, great! You got lucky, move on. :)
The other possible problem is trickier. The CSV (or, in this case, pipe-SV) format is not as simple as it appears, since you can have quoted delimiters inside fields. This, for instance, is a perfectly valid 4-field line in a pipe-delimited file:
field 1|"field 2 contains some |pipe| characters"|field 3|field 4
If the first 4 fields on a line in your file looked like that, your gsub on $26 would actually operate on $24 instead, leaving $26 alone. If you have data like that, the only real solution is to use a scripting language with an actual CSV parsing library. Perl has Text::CSV, but it's not installed by default; Python's csv module is, so you could use a program like so:
import csv, fileinput as fi, re

for row in csv.reader(fi.input(), delimiter='|'):
    row[25] = re.sub(r'\s+', '', row[25])  # fields start at 0 instead of 1
    print '|'.join(row)
Save the above in a file like colfixer.py and run it with python colfixer.py original.csv >final.csv.
(If you tried hard enough, you could get that shoved into a -c option string and run it from the command line without creating a script file, but Python's not really built for that and it gets ugly fast.)
You can use the string function split, and iterate over the corresponding array to reassign the 26th field:
awk 'BEGIN{FS=OFS="|"} {
    n = split($26, a, /[[:space:]]+/)
    $26 = a[1]
    for (i = 2; i <= n; i++)
        $26 = $26 "" a[i]
}1' original.csv > final.csv

Improve speed for csv.reader/csv.writer in the csv module

I'm doing a pretty simple operation: opening a CSV file, deleting the first column, and writing out to a new file. The following code works fine, but it takes 50-60 seconds on my 700 MB file:
import csv
from time import time

# create empty output file
f = open('testnew.csv', "w")
f.close()

t = time()
with open('test.csv', "rt") as source:
    rdr = csv.reader(source)
    with open('testnew.csv', "a") as result:
        wtr = csv.writer(result)
        for r in rdr:
            del r[0]
            _ = wtr.writerow(r)
print(round(time()-t))
By contrast, the following shell script does the same thing in 7-8 seconds:
START_TIME=$SECONDS
cut -d',' -f2- < test.csv > testnew.csv
echo $(($SECONDS - $START_TIME))
Is there a way I can get comparable performance in Python?
If I understand correctly, the shell script simply splits each line at the first comma, regardless of whether it is inside quotes or not, and writes out the second part. (I do not know what the shell script does if there is no comma.) The csv method does much more, which is useless for you. To do just the same thing as the shell in Python, skip the csv module.
with open('test.csv', "rt") as source, open('testnew.csv', "w") as result:
    for line in source:
        parts = line.split(',', maxsplit=1)
        result.write(parts[len(parts)-1])
This passes lines without a comma through as-is. It leaves spaces after the comma (I do not know what cut does with those). If you do not want that, you can either use re.split instead of line.split, or add .lstrip() just before the closing ) on the last line.
Your bash script does not parse the CSV file, it only splits and cuts. So in Python we can do the same:
with open('test.csv', "r") as source:
    with open('testnew.csv', "w") as result:
        for l in source:
            _, tail = l.split(',', 1)
            result.write(tail)
My simple profiling (4 MB file):
bash - 193 ms
python csv parsing - 2391 ms
python string splitting - 620 ms
Python 2 is faster for some reason:
bash - 193 ms
python csv parsing - 1471 ms
python string splitting - 373 ms

Smart split file with gzipping each part?

I have a very long file with numbers, something like the output of this Perl program:
perl -le 'print int(rand() * 1000000) for 1..10'
but way longer - around hundreds of gigabytes.
I need to split this file into many others. For test purposes, let's assume 100 files, where the output file number is the number modulo 100.
With normal files, I can do it simply with:
perl -le 'print int(rand() * 1000000) for 1..1000' | awk '{z=$1%100; print > z}'
But I have a problem when I need to compress the split parts. Normally, I could:
... | awk '{z=$1%100; print | "gzip -c - > "z".txt.gz"}'
But when ulimit is configured to allow fewer open files than the number of "partitions", awk breaks with:
awk: (FILENAME=- FNR=30) fatal: can't open pipe `gzip -c - > 60.txt.gz' for output (Too many open files)
This doesn't break with normal file output, as GNU awk is apparently smart enough to recycle file handles.
Do you know any way (aside from writing my own stream-splitting program, implementing buffering and some sort of pool-of-filehandles management) to handle such a case, that is: splitting to multiple files, where access to the output files is random, and gzipping all output partitions on the fly?
I didn't write it in the question itself, but since the additional information goes together with the solution, I'll write it all here.
So - the problem was on Solaris. Apparently there is a limitation that no program using stdio on Solaris can have more than 256 open filehandles?!
It is described here in detail. The important point is that it is enough to set one environment variable before running my problematic program, and the problem is gone:
export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
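So on Solaris the original pipeline works once that variable is exported in the same shell first, along these lines (the splitting command is the one from the question):
export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
perl -le 'print int(rand() * 1000000) for 1..1000' | awk '{z=$1%100; print | "gzip -c - > "z".txt.gz"}'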
