Bash - change the filename by changing the filename variable - bash

I want to save the results of a multiple grep in a .txt format. I do
for i in GO_*.txt; do
grep -o "GO:\w*" ${i} | grep -f - ../PFAM2GO.txt > ${i}_PFAM+GO.txt
done
The thing is that, obviously, the final filename also contains the original file extension, giving GO_*.txt_PFAM+GO.txt.
Now, I'd like to have only GO_*_PFAM+GO.txt. Is there a way to modify ${i} so as to drop the .txt, without having to perform a rename or a mv afterwards?
Note: the * part has variable length.

You can use parameter expansion to remove the extension from the filename:
for i in GO_*.txt; do
name="${i%.txt}"
grep -o "GO:\w*" "${i}" | grep -f - ../PFAM2GO.txt > "${name}_PFAM+GO.txt"
done
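For illustration, with a hypothetical input name such as GO_sample01.txt, the ${i%.txt} expansion strips only the trailing .txt, however long the * part is:
i=GO_sample01.txt
name="${i%.txt}"              # GO_sample01
echo "${name}_PFAM+GO.txt"    # GO_sample01_PFAM+GO.txt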

Related

pattern matching in the filename and change extension - bash script

I want to use name-last.txt files to call several other files in the parent directory whose names are parts of the filename string:
For example, for Perez-Castillo.txt, I want to use: (1) grep in Perez-Castillo.txt, (2) grep in Perez.list and (3) grep in Castillo.list.
I have this part:
for i in *.txt;
do
wc -l $i > out1.txt
grep -c "something" ../${i%-*}.list > out2.txt
grep -c "something" ../${i#*-}.list > out3.txt
done;
However, I fail to call, e.g., Castillo.list, as my script is calling Castillo.txt.list instead.
Any suggestion?
Bash doesn't let you nest two transformations into a single parameter expansion, so there is no way to delete both a prefix and a suffix with a parameter expansion.
So the simplest approach is to just remove the .txt extension at the beginning:
for i in *.txt; do
pfx=${i%.txt}
wc -l "${pfx}.txt" > out1.txt
grep -c "something" "../${pfx%-*}.list" > out2.txt
grep -c "something" "../${pfx#*-}.list" > out3.txt
done;
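As a quick check, using the Perez-Castillo.txt example from the question, the two expansions then act on the extension-free name:
pfx=Perez-Castillo            # after ${i%.txt}
echo "../${pfx%-*}.list"      # ../Perez.list
echo "../${pfx#*-}.list"      # ../Castillo.list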

bash: cURL from a file, increment filename if duplicate exists

I'm trying to curl a list of 7000+ URLs to aggregate the tabular data on them. The URLs are in a .txt file. My goal was to cURL each line and save the pages to a local folder, after which I would grep and parse out the HTML tables.
Unfortunately, because of the format of the URLs in the file, duplicates exist (e.g. example.com/State/City.html). When I ran a short while loop, I got back fewer than 5500 files, so there are at least 1500 dupes in the list. As a result, I tried to grep the "/State/City.html" section of the URL and pipe it to sed to remove the / and substitute a hyphen, for use with curl -O.
Here's a sample of what I tried:
while read line
do
FILENAME=$(grep -o -E '\/[A-z]+\/[A-z]+\.htm' | sed 's/^\///' | sed 's/\//-/')
curl $line -o '$FILENAME'
done < source-url-file.txt
It feels like I'm missing something fairly straightforward. I've scanned the man page because I worried I had confused -o and -O which I used to do a lot.
When I run the loop in the terminal, the output is:
Warning: Failed to create the file State-City.htm
I don't think you need a multitude of seds and greps; a single sed should suffice:
urls=$(echo -e 'example.com/s1/c1.html\nexample.com/s1/c2.html\nexample.com/s1/c1.html')
for u in $urls
do
FN=$(echo "$u" | sed -E 's/^(.*)\/([^\/]+)\/([^\/]+)$/\2-\3/')
if [[ ! -f "$FN" ]]
then
touch "$FN"
echo "$FN"
fi
done
This script should work and also takes care of not downloading the same file multiple times.
Just replace the touch command with your curl one, as shown below.
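For instance, the swap would be a one-line change inside the if block (a sketch; -s only silences the progress meter):
curl -s "$u" -o "$FN"    # instead of: touch "$FN"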
First: you didn't pass the url info to grep.
Second: try this line instead:
FILENAME=$(echo "$line" | egrep -o '\/[^\/]+\/[^\/]+\.html' | sed 's/^\///' | sed 's/\//-/')
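Putting these fixes together, the asker's loop might look something like this (a sketch only; note that the single quotes around $FILENAME in the original curl call would also stop the variable from expanding, so double quotes are used here):
while read -r line
do
# pull out /State/City.html, drop the leading slash, turn the remaining slash into a hyphen
FILENAME=$(echo "$line" | grep -E -o '/[^/]+/[^/]+\.html?' | sed 's/^\///' | sed 's/\//-/')
# skip names that have already been fetched instead of overwriting them
if [[ ! -f "$FILENAME" ]]
then
curl -s "$line" -o "$FILENAME"
fi
done < source-url-file.txt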

How can I generate multiple output files in for loop

I am new to bash so this is a very basic question:
I am trying to use the for loop below to perform a series of commands on several files in a directory, which should end in a new output file (f:r.bw) for every single input file.
Basically, I have files like chr1.gz, chr2.gz and so on that should end up as chr1.bw, chr2.bw ...
The way it is now, it seems to constantly overwrite the same output file and I cannot figure out what the correct syntax is.
for file in *.gz
do
zcat < $file | grep -v ^track > f:r.temp
wigToBigWig -clip -fixedSummaries -keepAllChromosomes f:r.temp hg19.chrom.sizes f:r.bw
rm f:r.temp
done
Thanks for the help
Instead of using a fixed filename f:r.temp, base your destination name on $file:
for file in *.gz; do
zcat <"$file" | grep -v '^track' >"${file%.gz}.temp"
wigToBigWig -clip -fixedSummaries -keepAllChromosomes \
"${file%.gz}.temp" hg19.chrom.sizes "${file%.gz}.bw"
rm -f "${file%.gz}.temp"
done
${file%.gz} is a parameter expansion operation, which trims .gz off the end of the name; ${file%.gz}.bw, thus, trims the .gz and adds a .bw.
Even better, if wigToBigWig doesn't need a real (seekable) input file, you can feed it the zcat | grep pipeline directly through process substitution and not need any temporary file:
for file in *.gz; do
wigToBigWig -clip -fixedSummaries -keepAllChromosomes \
<(zcat <"$file" | grep -v '^track') \
hg19.chrom.sizes \
"${file%.gz}.bw"
done
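As a usage note, the <( ... ) construct is process substitution: bash replaces it with a path such as /dev/fd/63 that the command can open and read like a file, which is why it only works when the tool reads its input once from start to end rather than seeking around in it. A quick way to see what actually gets passed:
echo <(true)               # prints something like /dev/fd/63
cat <(printf 'a\nb\n')     # cat reads the pipeline's output through that path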

Faster grep in many files from several strings in a file

I have the following working script to grep, in a directory with many files, for some specific strings previously saved into a file.
I grep all the files by their extension, since their names are random, and note that every string from my previously saved file should be searched for in all the files.
Also, I cut the grep output, since it returns 2 or 3 lines per matched file and I only want the specific part that shows the filename.
I might be doing something redundant; how could it be made faster?
#!/bin/bash
#working but slow
cd /var/FILES_DIRECTORY
while read line
do
LC_ALL=C fgrep "$line" *.cps | cut -c1-27 >> /var/tmp/test_OUT.txt
done < "/var/tmp/test_STRINGS.txt"
grep -F -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27
Isn't this what you're looking for?
This should speed up your script:
#!/bin/bash
#working fast
cd /var/FILES_DIRECTORY
export LC_ALL=C
grep -F -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27 > /var/tmp/test_OUT.txt

Trying to write a script to clean <script.aa=([].slice+'hjkbghkj') from multiple htm files, recursively

I am trying to modify a bash script to remove a glob of malicious code from a large number of files.
The community will benefit from this, so here it is:
#!/bin/bash
grep -r -l 'var createDocumentFragm' /home/user/Desktop/infected_site/* > /home/user/Desktop/filelist.txt
for i in $(cat /home/user/Desktop/filelist.txt)
do
cp -f $i $i.bak
done
for i in $(cat /home/user/Desktop/filelist.txt)
do
$i | sed 's/createDocumentFragm.*//g' > $i.awk
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
This is where the script bombs out with this message:
+ for i in '$(cat /home/user/Desktop/filelist.txt)'
+ sed 's/createDocumentFragm.*//g'
+ /home/user/Desktop/infected_site/index.htm
I get 2 errors and the script stops.
/home/user/Desktop/infected_site/index.htm: line 1: syntax error near unexpected token `<'
/home/user/Desktop/infected_site/index.htm: line 1: `<html><head><script>(function (){ '
I have the first 2 parts done.
The files containing createDocumentFragm have been enumerated in a text file correctly.
The files in filelist.txt have been duplicated in their original location with a .bak added to them, i.e. infected_site/some_directory/infected_file.htm and infected_file.htm.bak,
effectively making sure we have a backup.
All I need to do now is write an AWK command that will take the list of files in filelist.txt, use the entire glob of malicious text as a pattern, and remove it from the files, using just the uppercase </SCRIPT> as the starting point, since the lowercase </script> is too generic and could delete legitimate text.
I suspect this may help me, but I don't know how to use it correctly.
http://backreference.org/2010/03/13/safely-escape-variables-in-awk/
Once I have this part figured out, and after verifying that the files weren't mangled, you can do this to clean out the .bak files:
for i in $(cat /home/user/Desktop/filelist.txt)
do
rm -f $i.bak
done
Several things:
You have:
$i | sed 's/var createDocumentFragm.*//g' > $i.awk
You probably meant this (keeping your use of cat, which we'll talk about in a moment):
cat $i | sed 's/var createDocumentFragm.*//g' > $i.awk
You're treating each file in your file list as if it was a command and not a file.
Now, about your use of cat. If you're using cat for almost anything but concatenating multiple files together, you probably are doing something not quite right. For example, you could have done this:
sed 's/var createDocumentFragm.*//g' "$i" > $i.awk
I'm also a bit confused about the awk statement. Exactly what file are you using awk on? Your awk statement is using STDIN and STDOUT, so it's reading file names from the for loop and then printing the output on the screen. Is the sed statement supposed to feed into the awk statement?
Note that I don't have to print out my file to STDOUT, then pipe that into sed. The sed command can take the file name directly.
You also want to avoid for loops over a list of files. That is very inefficient, and can cause problems with the command line getting overloaded. Not a big issue today, but can affect you when you least suspect it. What happens is that your $(cat /home/user/Desktop/filelist.txt) must execute first before the for loop can even start.
A little rewriting of your program:
cd ~/Desktop
grep -r -l 'var createDocumentFragm' infected_site/* > filelist.txt
while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" > "$file.awk"
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
done < filelist.txt
We can use one loop, and we made it a while loop. I could even feed the grep into that while loop:
grep -r -l 'var createDocumentFragm' infected_site/* | while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" > "$file.awk"
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
done
and then I don't even have to create a temporary file.
Let me know what's going on with the awk. I suspect you wanted something like this:
grep -r -l 'var createDocumentFragm' infected_site/* | while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" \
| awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p' > "$file.awk"
done
Also note I put quotes around file names. This helps prevent problems if a file name has a space in it.
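As a possible follow-up (a sketch only, to be run after spot-checking a few of the cleaned copies): the cleaned output still sits next to each original as $file.awk, so a final pass could move it back into place and then drop the backup, much like the asker's own .bak cleanup loop:
grep -r -l 'var createDocumentFragm' infected_site/* | while read file
do
mv "$file.awk" "$file"    # replace the infected original with its cleaned copy
rm -f "$file.bak"         # drop the backup once satisfied
done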
