Faster grep in many files from several strings in a file - performance

I have the following working script, which greps through a directory containing many files for a set of specific strings previously saved in a file.
I use the file extension to select the files, since their names are random. Note that every string from my strings file should be searched for in all of the files.
I also cut grep's output, since it returns 2 or 3 lines per matched file and I only want the specific part that shows the filename.
I might be doing something redundant. How could it be faster?
#!/bin/bash
#working but slow
cd /var/FILES_DIRECTORY
while read line
do
LC_ALL=C fgrep "$line" *.cps | cut -c1-27 >> /var/tmp/test_OUT.txt
done < "/var/tmp/test_STRINGS.txt"

grep -F -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27
Isn't this what you're looking for?

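This should speed up your script, because grep reads every file once instead of once per string:
#!/bin/bash
#working fast
cd /var/FILES_DIRECTORY
export LC_ALL=C
# -F keeps the strings literal, as fgrep did in the original loop
grep -F -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27 > /var/tmp/test_OUT.txt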
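
As a minimal sketch of why -F matters when the strings come from a file, the following builds a throwaway directory and compares the two forms. All file names and contents are made up for the demo, and -l is used on the assumption that only the names of matching files are needed.

```shell
#!/bin/bash
# Self-contained demo in a throwaway directory; every name here is made up.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

printf 'alpha\nb.ta\n' > strings.txt        # one fixed string per line
printf 'has alpha here\n' > one.cps
printf 'has beta here\n'  > two.cps
printf 'has b.ta here\n'  > three.cps

# -F: treat each line of strings.txt as a literal string, not a regex
# -l: print only the names of files that contain at least one match
fixed_hits=$(LC_ALL=C grep -l -F -f strings.txt -- *.cps)

# Without -F, "b.ta" is a regex in which "." matches any character,
# so two.cps ("beta") is reported as well.
regex_hits=$(LC_ALL=C grep -l -f strings.txt -- *.cps)

echo "with -F:";    echo "$fixed_hits"
echo "without -F:"; echo "$regex_hits"
```

If one line per matching file is enough, -l may also replace the cut step, since it prints the file name and nothing else.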

Related

How to use lines in a file as keyword for grep?

I've searched lots of questions on here and other sites, and people have suggested things that should fix my problem, but I think there's something wrong with my code that I just don't recognize.
I have 24 .fasta files from NGS sequencing with reads that are 150bp long. There are approximately 1M reads in each file. The reads are from targeted sequencing where we electroporated vectors with cDNA for genes of interest, and a unique barcode sequence. I need to look through the sequencing files for the presence or absence of the barcode sequence which corresponds to a specific gene.
I have a .txt list of the barcode sequences that I want to pass to grep to look for the barcodes in the .fasta files. I've tried so many variations of this command. I can give grep each barcode individually, but that's so time consuming. I know it's possible to give it the list of barcode sequences, search each .fasta for each of the barcodes, and record how many times each barcode is found in each file.
Here's my code where I give it each barcode individually:
# Barcode 33
mkdir --mode 755 $dir/BC33
FILES="*.fasta"
for f in $FILES; do
cat "$f" | tr -d "\n" | tr ">" "\n" | grep 'TATTAGAGTTTGAGAATAAGTAGT' > $dir/BC33/"$f"
done
I tried to adapt it so that I don't have to feed every barcode sequence in individually:
dir="/home/lozzib/AG_Barcode_Seq/"
cd $dir
FILES="*.fasta"
for f in $FILES; do
cat "$f" | tr -d "\n" | tr ">" "\n" | grep -c -f BarcodeScreenSeq.txt | sort > $dir/Results/"$f"
echo "Finished $f"
done
But it is not searching for the barcode sequences. This iteration just produces new files in the /Results directory that are empty. I also tried a nested loop, where I tried to make the barcode sequence a variable that changed like $FILES, but that just gave me a new file with the names of my .fasta files:
dir="/home/lozzib/AG_Barcode_Seq/"
cd $dir
FILES="*.fasta"
for f in $FILES; do
for b in `cat /home/lozzib/AG_Barcode_Seq/BarcodeScreenSeq.txt`; do
cat "$f" | grep -c "$b" | sort > $dir/"$f"_Barcode
done ;
done
I want an output .txt file that has:
<barcode sequence>: <# of times that bc was found>
for each .fasta file because I want to put all the samples together to make one large excel sheet which shows each barcode and how many times it was found in each sample.
Please help, I've tried everything I can think of.
EDIT
Here is what the BarcodeScreenSeq.txt file would look like. It's just a txt file where each line is a barcode sequence:
head BarcodeScreenSeq.txt
TATTATGAGAAAGTTGAATAGTAG
ATGAAAGTTAGAGTTTATGATAAG
AATAGATAAGATTGATTGTGTTTG
TGTTAAATGTATGTAGTAATTGAG
ATAGATTTAAGTGAAGAGAGTTAT
GAATGTTTGTAAATGTATAGATAG
AAATTGTGAAAGATTGTTTGTGTA
TGTAAGTGAAATAGTGAGTTATTT
GAATTGTATAAAGTATTAGATGTG
AGTGAGATTATGAGTATTGATTTA
EDIT
lozzib@gliaserver:~/AG_Barcode_Seq$ file BarcodeScreenSeq.txt
BarcodeScreenSeq.txt: ASCII text, with CRLF line terminators
Windows Line Endings
Your BarcodeScreenSeq.txt has Windows line endings. Each line ends with the special characters \r\n. Linux tools such as grep only treat \n as the line ending, so the \r stays part of each line, and grep interprets your file ...
TATTATG\r\n
ATGAAAG\r\n
...
to look for the patterns TATTATG\r, ATGAAAG\r, ... (note the \r at the end). Because of the \r there is no match.
Either: Convert your file once by running dos2unix BarcodeScreenSeq.txt or sed -i 's/\r//g' BarcodeScreenSeq.txt. This will change your file.
Or: replace every BarcodeScreenSeq.txt in the following scripts by <(tr -d '\r' < BarcodeScreenSeq.txt). This won't change the file, but creates more overhead as the file is converted over and over again.
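
A small sketch of the effect, using made-up barcodes and a made-up read in a throwaway directory. For the demo it writes a cleaned copy of the pattern file with tr rather than editing in place; the file names are assumptions for illustration only.

```shell
#!/bin/bash
# Throwaway files; the barcodes and the read below are made-up sample data.
work=$(mktemp -d)
printf 'TATTATG\r\nATGAAAG\r\n' > "$work/barcodes_crlf.txt"   # CRLF endings
printf 'CCTATTATGCC\n'          > "$work/reads.txt"

# With the trailing \r still part of each pattern, no line matches.
# (grep exits non-zero when nothing matches, hence "|| true".)
with_cr=$(grep -c -F -f "$work/barcodes_crlf.txt" "$work/reads.txt" || true)

# Strip the \r characters into a cleaned copy and search again.
tr -d '\r' < "$work/barcodes_crlf.txt" > "$work/barcodes_lf.txt"
without_cr=$(grep -c -F -f "$work/barcodes_lf.txt" "$work/reads.txt" || true)

echo "matching lines with CR: $with_cr, without CR: $without_cr"
rm -r "$work"
```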
Command
grep -c has only one counter. If you pass multiple search patterns at once (for instance using -f BarcodeScreenSeq.txt) you still get only one number for all patterns together.
To count the occurrences of each pattern individually you can use the following trick:
for file in *.fasta; do
grep -oFf BarcodeScreenSeq.txt "$file" |
sort | uniq -c |
awk '{print $2 ": " $1 }' > "Results/$file"
done
grep -o will print each match as a single line.
sort | uniq -c will count how often each line occurs.
awk is only there to change the format from #matches pattern to pattern: #matches.
Benefit: The command should be fairly fast.
Drawback: Patterns from BarcodeScreenSeq.txt that are not found in $file won't be listed at all. Your result will leave out lines of the form pattern: 0.
If you really need the lines of the form pattern: 0 you could use another trick:
for file in *.fasta; do
grep -oFf BarcodeScreenSeq.txt "$file" |
cat - BarcodeScreenSeq.txt |
sort | uniq -c |
awk '{print $2 ": " ($1 - 1) }' > "Results/$file"
done
cat - BarcodeScreenSeq.txt will insert the content of BarcodeScreenSeq.txt at the end of grep's output such that #matches is one bigger than it should be. The number is corrected by awk.
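
To see the correction at work, here is the same pipeline run on a tiny made-up data set (three invented barcodes and two invented reads). It is only a sketch under those assumptions.

```shell
#!/bin/bash
# Made-up sample data in a throwaway directory.
work=$(mktemp -d)
printf 'AAAT\nCCCG\nGGGT\n' > "$work/barcodes.txt"
printf '>r1\nTTAAATTT\n>r2\nAAATGGGT\n' > "$work/sample.fasta"

counts=$(
  grep -oFf "$work/barcodes.txt" "$work/sample.fasta" |  # one line per match
  cat - "$work/barcodes.txt" |                            # add every barcode once
  sort | uniq -c |                                        # count per barcode
  awk '{print $2 ": " ($1 - 1)}'                          # undo the extra one
)
echo "$counts"
rm -r "$work"
```

AAAT occurs twice and GGGT once, while CCCG never occurs, so it only appears through the appended barcode list and ends up with a count of 0.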
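You can read a text file one line at a time and process each line separately using a redirect. Print the barcode next to its count and redirect the output of the whole inner loop, so that each result file holds one line per barcode instead of only the last count (note that grep -c counts matching lines):
for f in *.fasta; do
while read -r seq; do
# "<barcode>: <number of lines containing it>"
printf '%s: %s\n' "${seq}" "$(grep -c -F -- "${seq}" "${f}")"
done < /home/lozzib/AG_Barcode_Seq/BarcodeScreenSeq.txt > "${dir}/${f}_Barcode"
done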

Can gzip in a BASH script give an exit status before a file is completely unzipped? How to prevent this?

I am trying to write a script to collect a few statistics on .fastq files sequentially. In my script, I unzip and re-zip each file in a loop (gzip / gzip -d), applying commands to collect the stats on each file whilst it is unzipped, e.g.
command "${file_name//.fastq.gz/.fastq}" .
When I do this, my script sometimes fails to collect statistics for certain files, (e.g. giving me line counts of zero). However, it does this seemingly randomly and running the script more than once it sometimes collects stats and sometimes doesn't for the same file.
I believe this is because the gzip is returning an exit status before the file is fully unzipped, meaning that my script continues and sometimes collects stats from half-zipped files. In support of this, I have seen different stats for the file sizes returned for the same file (but not for stats like header line counts, which seem to be either zero or the values I would expect).
Is there a best way to tell BASH to wait until the file is completely unzipped? I have tried using variants of until [ -f "${file_name//.fastq.gz/.fastq}" ] and until [ command "${file_name//.fastq.gz/.fastq}" != 0 ] but I still don't always get the right results this way (I have checked by unzipping the files and applying each command manually, then comparing the values to the script's output each time).
I've posted the script below and linked a picture showing output from two different runs of the script on four files to highlight the issue. For the record, I have been running this script on a sun grid engine and it has been returning no error messages to accompany this.
#!/bin/bash
#$ -cwd
#$ -l h_rt=01:00:00
#$ -l h_vmem=1G
#$ -o fastq_qc_stats_job_out_file.txt
#$ -e fastq_qc_stats_job_error_file.txt
#First, make a file to store the output
if echo "${PWD}/" | grep -iq "/[a-Z]*_[0-9]*/" # if in numbered batches (i.e. the data to be analysed is split across multiple numbered files)
then batch=`echo "${PWD}/" | grep -o "/[a-Z]*_[0-9]*/" | cut -d '_' -f 2 | cut -d '/' -f 1` # get the batch number to name the output file,
else batch=`basename $PWD`; fi # otherwise just use the final part of the directory name.
header_line=`echo {'FILE','RUN','SIZE','READ_COUNT','MEAN_READ_LENGTH'} | sed 's/ /,/g'` # make a header line
echo $header_line > "QC_FASTQ_stats_${batch}.csv" # make a .csv file with the header line (by batch)
#Now loop through the FASTQ files and add the following information for each of them
for file in `ls *.fastq.gz`
do gzip -d $file # unzip the file
f="${file//.fastq.gz/.fastq}"
accession=`echo ${f} | cut -d '.' -f 1 | cut -d '_' -f 1`
filesize=`du -h ${f} | awk '{print $1}'`
readcount=`grep -E '^@[EDS]RR' ${f} | grep -E ' length=' | wc -l`
averagelength=`grep ' length=' ${f} | awk '{print $NF}' | cut -d '=' -f 2 | awk '{ total += $1 } END { print total/NR }'` # calculates mean
filestats=`echo $file $accession $filesize $readcount $averagelength | sed 's/ /,/g'`
echo $filestats >> "QC_FASTQ_stats_${batch}.csv" # add stats for each .fastq file to the .csv file
gzip ${f} # re-zip the file
done
An example of the output variation when run twice for the same files - see 4th file
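
One way to take the unzip step out of the picture, sketched here as an assumption rather than a diagnosis, is to stream the decompressed data with gzip -dc and compute the stats from the pipe, so no unzipped copy ever exists on disk. fastq_stats is a hypothetical helper, and it assumes header lines start with @ followed by an ERR/DRR/SRR accession and carry a length=N field.

```shell
#!/bin/bash
# fastq_stats is a hypothetical helper: it prints "READ_COUNT,MEAN_READ_LENGTH"
# for one .fastq.gz file without writing the decompressed data to disk.
# Assumption: header lines look like "@SRR... length=N".
fastq_stats() {
  gzip -dc -- "$1" | awk '
    /^@[EDS]RR/ && / length=/ {
      n++
      sub(/.* length=/, "")     # keep only the number after "length="
      total += $0
    }
    END { if (n) printf "%d,%g\n", n, total / n; else print "0,NA" }'
}
```

Inside the loop this could be called as stats=$(fastq_stats "$file"), with the compressed size taken from the .fastq.gz itself.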

bash: cURL from a file, increment filename if duplicate exists

I'm trying to curl a list of URLs to aggregate the tabular data on them from a set of 7000+ URLs. The URLs are in a .txt file. My goal was to cURL each line and save them to a local folder after which I would grep and parse out the HTML tables.
Unfortunately, because of the format of the URLs in the file, duplicates exist (example.com/State/City.html). When I ran a short while loop, I got back fewer than 5500 files, so there are at least 1500 dupes in the list. As a result, I tried to grep the "/State/City.html" section of the URL and pipe it to sed to remove the leading / and substitute a hyphen for the other one, to use with curl -O. cURL was trying to grab
Here's a sample of what I tried:
while read line
do
FILENAME=$(grep -o -E '\/[A-z]+\/[A-z]+\.htm' | sed 's/^\///' | sed 's/\//-/')
curl $line -o '$FILENAME'
done < source-url-file.txt
It feels like I'm missing something fairly straightforward. I've scanned the man page because I worried I had confused -o and -O which I used to do a lot.
When I run the loop in the terminal, the output is:
Warning: Failed to create the file State-City.htm
I think you don't need several seds and a grep; a single sed should suffice:
urls=$(echo -e 'example.com/s1/c1.html\nexample.com/s1/c2.html\nexample.com/s1/c1.html')
for u in $urls
do
FN=$(echo "$u" | sed -E 's/^(.*)\/([^\/]+)\/([^\/]+)$/\2-\3/')
if [[ ! -f "$FN" ]]
then
touch "$FN"
echo "$FN"
fi
done
This script should work, and it also avoids downloading the same file multiple times.
Just replace the touch command with your curl one.
First: you didn't pass the URL to grep, so it has nothing to match against.
Second: try this line instead:
FILENAME=$(echo "$line" | grep -E -o '\/[^\/]+\/[^\/]+\.html' | sed 's/^\///' | sed 's/\//-/')
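
If duplicate names should be kept under an incremented name rather than skipped, something along the following lines might work. Both url_to_name and unique_name are hypothetical helpers written for this sketch, not curl features, and unique_name assumes every name has an extension.

```shell
#!/bin/bash
# url_to_name (hypothetical): ".../State/City.html" -> "State-City.html"
url_to_name() {
  printf '%s\n' "$1" | sed -E 's#^.*/([^/]+)/([^/]+)$#\1-\2#'
}

# unique_name (hypothetical): print NAME if DIR/NAME does not exist yet,
# otherwise NAME with -1, -2, ... inserted before the extension.
unique_name() {
  local dir=$1 name=$2
  local stem=${name%.*} ext=${name##*.}
  local candidate=$name n=1
  while [ -e "$dir/$candidate" ]; do
    candidate="$stem-$n.$ext"
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
}

# Possible use (left commented out because it needs network access):
# mkdir -p downloads
# while IFS= read -r url; do
#   out=$(unique_name downloads "$(url_to_name "$url")")
#   curl -sS -o "downloads/$out" "$url"
# done < source-url-file.txt
```

Note the double quotes around the output name in the curl call: inside single quotes, $out would not be expanded.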

How can I generate multiple output files in for loop

I am new to bash so this is a very basic question:
I am trying to use the for loop written below to perform a series of commands on several files in a directory, which should produce a new output (f:r.bw) for every single file.
Basically, I have files like chr1.gz, chr2.gz and so on that should end up as chr1.bw, chr2.bw ...
The way it is now, it seems to constantly overwrite the same output file and I cannot figure out what the correct syntax is.
$ for file in *.gz
do
zcat < $file | grep -v ^track > f:r.temp
wigToBigWig -clip -fixedSummaries -keepAllChromosomes f:r.temp hg19.chrom.sizes f:r.bw
rm f:r.temp
done
Thanks for help
Instead of using a fixed filename f:r.temp, base your destination name on $file:
for file in *.gz; do
zcat <"$file" | grep -v '^track' >"${file%.gz}.temp"
wigToBigWig -clip -fixedSummaries -keepAllChromosomes \
"${file%.gz}.temp" hg19.chrom.sizes "${file%.gz}.bw"
rm -f "${file%.gz}.temp"
done
${file%.gz} is a parameter expansion operation, which trims .gz off the end of the name; ${file%.gz}.bw, thus, trims the .gz and adds a .bw.
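
A quick way to check what the expansion produces, without running wigToBigWig at all, is a dry run like the one below. out_name is a hypothetical helper and the file names are examples only.

```shell
#!/bin/bash
# out_name (hypothetical): map an input name to its .bw output name.
out_name() {
  # ${1%.gz} removes the shortest trailing match of ".gz" from the argument
  printf '%s\n' "${1%.gz}.bw"
}

# Example names only; no files are read or written here.
for file in chr1.gz chr2.gz sample.tar.gz; do
  echo "$file -> $(out_name "$file")"
done
```

Only the final .gz is removed, so sample.tar.gz becomes sample.tar.bw.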
Even better, if wigToBigWig doesn't need a real (seekable) input file, you can give it a pipeline into the zcat | grep process directly and not need any temporary file:
for file in *.gz; do
wigToBigWig -clip -fixedSummaries -keepAllChromosomes \
<(zcat <"$file" | grep -v '^track') \
hg19.chrom.sizes \
"${file%.gz}.bw"
done

Trying to write a script to clean <script.aa=([].slice+'hjkbghkj') from multiple htm files, recursively

I am trying to modify a bash script to remove a glob of malicious code from a large number of files.
The community will benefit from this, so here it is:
#!/bin/bash
grep -r -l 'var createDocumentFragm' /home/user/Desktop/infected_site/* > /home/user/Desktop/filelist.txt
for i in $(cat /home/user/Desktop/filelist.txt)
do
cp -f $i $i.bak
done
for i in $(cat /home/user/Desktop/filelist.txt)
do
$i | sed 's/createDocumentFragm.*//g' > $i.awk
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
This is where the script bombs out with this message:
+ for i in '$(cat /home/user/Desktop/filelist.txt)'
+ sed 's/createDocumentFragm.*//g'
+ /home/user/Desktop/infected_site/index.htm
I get 2 errors and the script stops.
/home/user/Desktop/infected_site/index.htm: line 1: syntax error near unexpected token `<'
/home/user/Desktop/infected_site/index.htm: line 1: `<html><head><script>(function (){ '
I have the first 2 parts done.
The files containing createDocumentfragm have been enumerated in a text file correctly.
The files listed in filelist.txt have been duplicated in their original location with .bak added to their names, i.e. infected_site/some_directory/infected_file.htm and infected_file.htm.bak,
effectively making sure we have a backup.
All I need to do now is write an AWK command that will use the list of files in filelist.txt, use the entire glob of malicious text as a pattern, and remove it from the files. Using just the uppercase script tag as the starting point and the lowercase script tag as the end is too generic and could delete legitimate text.
I suspect this may help me, but I don't know how to use it correctly.
http://backreference.org/2010/03/13/safely-escape-variables-in-awk/
Once I have this part figured out, and after you have verified that the files weren't mangled you can do this to clean out the bak files:
for i in $(cat /home/user/Desktop/filelist.txt)
do
rm -f $i.bak
done
Several things:
You have:
$i | sed 's/var createDocumentFragm.*//g' > $i.awk
You probably meant this (keeping your use of cat, which we'll talk about in a moment):
cat $i | sed 's/var createDocumentFragm.*//g' > $i.awk
You're treating each file in your file list as if it was a command and not a file.
Now, about your use of cat. If you're using cat for almost anything but concatenating multiple files together, you probably are doing something not quite right. For example, you could have done this:
sed 's/var createDocumentFragm.*//g' "$i" > $i.awk
Note that I don't have to print out my file to STDOUT, then pipe that into sed. The sed command can take the file name directly.
I'm also a bit confused about the awk statement. Exactly what file are you using awk on? Your awk statement is using STDIN and STDOUT, so it's reading file names from the for loop and then printing the output on the screen. Is the sed statement supposed to feed into the awk statement?
You also want to avoid for loops over a list of files. That is very inefficient, and can cause problems with the command line getting overloaded. Not a big issue today, but can affect you when you least suspect it. What happens is that your $(cat /home/user/Desktop/filelist.txt) must execute first before the for loop can even start.
A little rewriting of your program:
cd ~/Desktop
grep -r -l 'var createDocumentFragm' infected_site/* > filelist.txt
while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" > "$file.awk"
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
done < filelist.txt
We can use one loop, and we made it a while loop. I could even feed the grep into that while loop:
grep -r -l 'var createDocumentFragm' infected_site/* | while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" > "$file.awk"
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
done
and then I don't even have to create a temporary file.
Let me know what's going on with the awk. I suspect you wanted something like this:
grep -r -l 'var createDocumentFragm' infected_site/* | while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" \
| awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p' > "$file.awk"
done
Also note I put quotes around file names. This helps prevent problems if file name has a space in it.
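
The difference between the two loop styles shows up as soon as a name contains a space. The sketch below uses a made-up file in a throwaway directory (assumed to have no spaces in its own path) and counts what each loop sees. IFS= and -r are added to read so that leading whitespace and backslashes in names survive.

```shell
#!/bin/bash
# Throwaway directory with a made-up file name that contains a space.
work=$(mktemp -d)
printf 'x\n' > "$work/my page.htm"
printf '%s\n' "$work/my page.htm" > "$work/filelist.txt"

# for + $(cat ...) splits the single name on whitespace into two words
for_count=0
for f in $(cat "$work/filelist.txt"); do
  for_count=$((for_count + 1))
done

# while + read keeps each line intact, so the file is found
read_count=0
while IFS= read -r f; do
  [ -f "$f" ] && read_count=$((read_count + 1))
done < "$work/filelist.txt"

echo "for loop saw $for_count words, read loop found $read_count file(s)"
rm -r "$work"
```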
