Trying to write a script to clean <script.aa=([].slice+'hjkbghkj') from multiple htm files, recursively - bash

I am trying to modify a bash script to remove a glob of malicious code from a large number of files.
The community will benefit from this, so here it is:
#!/bin/bash
grep -r -l 'var createDocumentFragm' /home/user/Desktop/infected_site/* > /home/user/Desktop/filelist.txt
for i in $(cat /home/user/Desktop/filelist.txt)
do
cp -f $i $i.bak
done
for i in $(cat /home/user/Desktop/filelist.txt)
do
$i | sed 's/createDocumentFragm.*//g' > $i.awk
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
This is where the script bombs out with this message:
+ for i in '$(cat /home/user/Desktop/filelist.txt)'
+ sed 's/createDocumentFragm.*//g'
+ /home/user/Desktop/infected_site/index.htm
I get 2 errors and the script stops.
/home/user/Desktop/infected_site/index.htm: line 1: syntax error near unexpected token `<'
/home/user/Desktop/infected_site/index.htm: line 1: `<html><head><script>(function (){ '
I have the first 2 parts done.
The files containing createDocumentfragm have been enumerated in a text file correctly.
The files listed in filelist.txt have been duplicated in their original location with .bak appended, i.e. infected_site/some_directory/infected_file.htm and infected_file.htm.bak,
effectively making sure we have a backup.
All I need to do now is write an AWK command that will use the list of files in filelist.txt, use the entire glob of malicious text as a pattern, and remove it from the files. Using just the uppercase SCRIPT as the starting point and the lowercase script as the end point is too generic and could delete legitimate text.
I suspect this may help me, but I don't know how to use it correctly.
http://backreference.org/2010/03/13/safely-escape-variables-in-awk/
Once I have this part figured out, and after you have verified that the files weren't mangled you can do this to clean out the bak files:
for i in $(cat /home/user/Desktop/filelist.txt)
do
rm -f $i.bak
done

Several things:
You have:
$i | sed 's/var createDocumentFragm.*//g' > $i.awk
You probably meant this (keeping your use of cat, which we'll talk about in a moment):
cat $i | sed 's/var createDocumentFragm.*//g' > $i.awk
You're treating each file in your file list as if it was a command and not a file.
Now, about your use of cat. If you're using cat for almost anything but concatenating multiple files together, you probably are doing something not quite right. For example, you could have done this:
sed 's/var createDocumentFragm.*//g' "$i" > $i.awk
I'm also a bit confused about the awk statement. Exactly what file are you using awk on? Your awk statement is using STDIN and STDOUT, so it's reading file names from the for loop and then printing the output on the screen. Is the sed statement supposed to feed into the awk statement?
Note that I don't have to print out my file to STDOUT, then pipe that into sed. The sed command can take the file name directly.
You also want to avoid for loops over the output of a command such as cat. The whole $(cat /home/user/Desktop/filelist.txt) must execute, and its output must be expanded onto the command line, before the for loop can even start. That is inefficient, can overflow the command line when the list is long, and splits file names on whitespace. Not a big issue today, but it can bite you when you least expect it.
A little rewriting of your program:
cd ~/Desktop
grep -r -l 'var createDocumentFragm' infected_site/* > filelist.txt
while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" > "$file.awk"
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
done < filelist.txt
We can use one loop, and we made it a while loop. I could even feed the grep into that while loop:
grep -r -l 'var createDocumentFragm' infected_site/* | while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" > "$file.awk"
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
done
and then I don't even have to create a temporary file.
Let me know what's going on with the awk. I suspect you wanted something like this:
grep -r -l 'var createDocumentFragm' infected_site/* | while read file
do
cp -f "$file" "$file.bak"
sed 's/var createDocumentFragm.*//g' "$file" \
| awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p' > "$file.awk"
done
Also note I put quotes around file names. This helps prevent problems if a file name has a space in it.
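To make the single-loop idea concrete, here is a minimal sketch that backs up each matching file and then cleans it from the backup. The demo tree under a mktemp directory and the file names in it are made up for illustration, and only the sed step is shown because the intent of the awk step is still unclear. The list is written outside the tree so that grep cannot pick up the .bak copies.

```shell
#!/bin/bash
# Hypothetical demo tree; in real use, point "site" at the infected directory.
work=$(mktemp -d)
site="$work/infected_site"
mkdir -p "$site/some dir"
printf '%s\n' '<html>' 'var createDocumentFragm = 1;' '</html>' > "$site/some dir/index.htm"
printf '%s\n' '<html>' '<p>clean</p>' '</html>' > "$site/ok.htm"

# Build the list outside the tree so grep never sees the backups.
grep -r -l 'var createDocumentFragm' "$site" > "$work/filelist.txt"

while IFS= read -r file
do
    cp -f "$file" "$file.bak"                                  # back up first
    sed 's/var createDocumentFragm.*//' "$file.bak" > "$file"  # clean, reading from the backup
done < "$work/filelist.txt"
```

Reading from the .bak copy and writing to the original avoids the read-and-write-the-same-file problem, and the quoted "$file" keeps the directory name with a space in one piece.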

Related

need to clean file via SED or GREP

I have these files
NotRequired.txt (having lines which need to be remove)
Need2CleanSED.txt (big file , need to clean)
Need2CleanGRP.txt (big file , need to clean)
content:
more NotRequired.txt
[abc-xyz_pqr-pe2_123]
[lon-abc-tkt_1202]
[wat-7600-1_414]
[indo-pak_isu-5_761]
I am reading above file and want to remove lines from Need2Clean???.txt, trying via SED and GREP but no success.
myFile="NotRequired.txt"
while IFS= read -r HKline
do
sed -i '/$HKline/d' Need2CleanSED.txt
done < "$myFile"
myFile="NotRequired.txt"
while IFS= read -r HKline
do
grep -vE \"$HKline\" Need2CleanGRP.txt > Need2CleanGRP.txt
done < "$myFile"
It looks as if the variable and the [] characters are causing the problem.
What you're doing is extremely inefficient and error prone. Just do this:
grep -vF -f NotRequired.txt Need2CleanGRP.txt > tmp &&
mv tmp Need2CleanGRP.txt
Thanks to grep -F the above treats each line of NotRequired.txt as a string rather than a regexp so you don't have to worry about escaping RE metachars like [ and you don't need to wrap it in a shell loop - that one command will remove all undesirable lines in one execution of grep.
Never do command file > file, by the way: the shell sets up the > file redirection, which empties file, before command gets a chance to read it. Always do command file > tmp && mv tmp file instead.
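A small sketch of that command on made-up data, assuming a scratch directory from mktemp (the file contents here are invented for the demo). It shows that the bracketed lines are matched as plain strings, so a line containing the same text without brackets is kept.

```shell
#!/bin/bash
work=$(mktemp -d)   # scratch directory for the demo
printf '%s\n' '[abc-xyz_pqr-pe2_123]' '[lon-abc-tkt_1202]' > "$work/NotRequired.txt"
printf '%s\n' 'keep this line' \
              'drop [lon-abc-tkt_1202] here' \
              'abc-xyz_pqr-pe2_123 without brackets' > "$work/Need2CleanGRP.txt"

# -F: fixed strings, -f: read the strings from a file, -v: keep non-matching lines.
grep -vF -f "$work/NotRequired.txt" "$work/Need2CleanGRP.txt" > "$work/tmp" &&
mv "$work/tmp" "$work/Need2CleanGRP.txt"

cat "$work/Need2CleanGRP.txt"
```

One caveat worth keeping in mind: grep exits with status 1 when it selects no lines at all, so with && the mv is skipped if every line of the big file is unwanted.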
Your assumption is correct. The [...] construct looks for any characters in that set, so you have to preface ("escape") them with \. The easiest way is to do that in your original file:
sed -i -e 's:\[:\\[:' -e 's:\]:\\]:' "${myFile}"
If you don't like that, you can probably put the sed command in where you're directing the file in:
done < <(sed -e 's:\[:\\[:' -e 's:\]:\\]:' "${myFile}")
Finally, you can use sed on each HKline variable:
HKline=$( printf '%s\n' "$HKline" | sed -e 's:\[:\\[:' -e 's:\]:\\]:' )
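Putting the per-line escaping together with the original loop might look like the sketch below. Two things are assumptions on my part rather than part of the answer above: the sed script is put in double quotes so the shell actually expands the variable (inside single quotes, $HKline is passed to sed literally), and the result goes through a temporary file instead of sed -i. The data and the escaped variable name are made up for the demo.

```shell
#!/bin/bash
work=$(mktemp -d)   # scratch directory for the demo
printf '%s\n' '[wat-7600-1_414]' > "$work/NotRequired.txt"
printf '%s\n' 'first line' 'x [wat-7600-1_414] y' 'wat-7600-1_414' > "$work/Need2CleanSED.txt"

while IFS= read -r HKline
do
    # Escape the brackets so sed treats them as literal characters, not a set.
    escaped=$(printf '%s\n' "$HKline" | sed -e 's:\[:\\[:' -e 's:\]:\\]:')
    # Double quotes: the shell expands $escaped before sed sees the script.
    sed "/$escaped/d" "$work/Need2CleanSED.txt" > "$work/tmp" &&
    mv "$work/tmp" "$work/Need2CleanSED.txt"
done < "$work/NotRequired.txt"

cat "$work/Need2CleanSED.txt"
```

This still runs sed once per unwanted line, so it is only reasonable for short lists; other regex metacharacters in the list (dots, slashes, asterisks) would need escaping as well.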
Try GNU sed:
sed -Ez 's/\n/\|/g;s!\[!\\[!g;s!\]!\\]!g; s!(.*).!/\1/d!' NotRequired.txt| sed -Ef - Need2CleanSED.txt
This chains two sed processes with a shell pipe.
The first sed uses -z to slurp NotRequired.txt all at once, replaces each \n with |, and escapes the [ and ] metacharacters as \[ and \]. The second sed reads that result as its script (-f -) and applies it to the input file, i.e. Need2CleanSED.txt. Output of the first process:
/\[abc-xyz_pqr-pe2_123\]|\[lon-abc-tkt_1202\]|\[wat-7600-1_414\]|\[indo-pak_isu-5_761\]/d
Add the -u (unbuffered) option if you want sed to process and write lines as they arrive rather than in batches.

What's wrong with this file renaming loop?

I'm trying to iterate through all the files in a directory and rename them from the prefix ABC to XYZ using the command below
while read file; do mv \"$file\" \"$(echo $file | sed -e s/ABC/XYZ/g)\" ; done < <(ls -1)
When I throw an echo in front of the mv, everything looks like it should work fine and copy/pasting the outputted command works fine but it won't execute correctly within the context of the loop giving me a usage error as if the command is malformed like below.
usage: mv [-f | -i | -n] [-v] source target
mv [-f | -i | -n] [-v] source ... directory
Even though the outputted command from the check with echo gives
mv "ABC Test1" "XYZ Test1"
which should be a valid command and works if I copy paste.
Any idea what is going on?
Replace:
while read file; do mv \"$file\" \"$(echo $file | sed -e s/ABC/XYZ/g)\" ; done < <(ls -1)
With:
for file in *
do
mv "$file" "${file//ABC/XYZ}"
done
Notes:
This is very important: Never parse ls. ls is only designed to produce human-friendly output.
To iterate over all files in a directory, use for file in *; do ...; done. This will work reliably for all manner of file names, including file names with newlines, blanks, or other difficult characters.
\" produces a literal character, not a syntactic character. Since we want the syntactic meaning of " here, we leave it unescaped.
There are times when one needs sed but this isn't one of them.
The shell is capable of doing simple substitutions without all the issues associated with command substitution. Thus, $(echo $file | sed -e s/ABC/XYZ/g) can be replaced with ${file//ABC/XYZ}.
The form ${var//old/new} is called pattern substitution and is documented in man bash.
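As a quick illustration of pattern substitution in a rename loop, here is a sketch that runs in a scratch directory with made-up file names. The name variable and the mktemp directory are only there for the demo; the directory part is stripped before substituting so that only the file name is rewritten.

```shell
#!/bin/bash
work=$(mktemp -d)   # scratch directory for the demo
touch "$work/ABC Test1" "$work/ABC Test2" "$work/other"

for file in "$work"/ABC*
do
    name=${file##*/}                      # strip the directory part
    mv "$file" "$work/${name//ABC/XYZ}"   # replace every ABC in the file name
done

ls "$work"
```

Use ${name/ABC/XYZ} (single slash) if only the first occurrence should change.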
Very stupid mistake. There was no need to escape the quotes in the mv command. Taking those out makes it work as expected. Escaping the quotes shows the correct output with echo but does not give intended behavior.
while read file; do mv "$file" "$(echo $file | sed -e s/ABC/XYZ/g)" ; done < <(ls -1)

Can envsubst not do in-place substitution?

I have a config file which contains some ENV_VARIABLE styled variables.
This is my file.
It might contain $EXAMPLES of text.
Now I want that variable replaced with a value which is saved in my actual environment variables. So I'm trying this:
export EXAMPLES=lots
envsubst < file.txt > file.txt
But it doesn't work when the input file and output file are identical. The result is an empty file of size 0.
There must be a good reason for this, some bash basics that I'm not aware of?
How do I achieve what I want to do, ideally without first outputting to a different file and then replacing the original file with it?
I know that I can do it easily enough with sed, but when I discovered the envsubst command I thought that it should be perfect for my use case, so I'd like to use that.
Here is the solution that I use:
originalfile="file.txt"
tmpfile=$(mktemp)
cp --attributes-only --preserve "$originalfile" "$tmpfile"
cat "$originalfile" | envsubst > "$tmpfile" && mv "$tmpfile" "$originalfile"
Be careful with other solutions that do not use a temporary file. Pipes are asynchronous, so the file will occasionally be read after it has already been truncated.
Redirects are handled by the shell, not the program being executed, and they are set up before the program is invoked.
The redirect >output.file has the effect of creating output.file if it doesn't exist and emptying it if it does. Either way, you end up with an empty file, and that is what the program's output is redirected to.
Programs like sed which are capable of "in-place" modification must take the filename as a command-line argument, not as a redirect.
In your case, I would suggest using a temporary file and then renaming it if all goes OK.
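The truncation is easy to reproduce with any filter. The sketch below uses tr as a stand-in for envsubst (an assumption made only so the demo runs without gettext installed) and a scratch file created with mktemp; the variable size_after_clobber is a name invented for the demo.

```shell
#!/bin/bash
work=$(mktemp -d)   # scratch directory for the demo
printf 'hello world\n' > "$work/file.txt"

# Same file as input and output: the shell empties it while setting up the
# redirections, so the filter has nothing left to read.
tr 'a-z' 'A-Z' < "$work/file.txt" > "$work/file.txt"
size_after_clobber=$(wc -c < "$work/file.txt")

# Temporary file, renamed only if the filter succeeded.
printf 'hello world\n' > "$work/file.txt"
tr 'a-z' 'A-Z' < "$work/file.txt" > "$work/file.tmp" &&
mv "$work/file.tmp" "$work/file.txt"

cat "$work/file.txt"
```

Swapping tr for envsubst gives the temp-file-then-rename pattern suggested above.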
envsubst < file.txt | tee file.txt
Another shortcut: write to a temporary file, then rename it to the original file.
envsubst < in.txt > out.txt && mv out.txt in.txt
To avoid creating a temporary file, use sponge not tee:
envsubst < file.txt | sponge file.txt
From https://linux.die.net/man/1/sponge:
sponge reads standard input and writes it out to the specified file. Unlike a shell redirect, sponge soaks up all its input before opening the output file. This allows constructing pipelines that read from and write to the same file.
You can achieve in-place substitution by calling envsubst from gnu sed with the "e" command:
EXAMPLES=lots sed -i 's/.*/echo & | envsubst/e' file.txt
It's worth noting that the mv solution won't maintain file permissions. Using cp -pf would be preferable in the case that you're modifying an executable file.
tmpfile=$(mktemp)
cat file.txt | envsubst > "$tmpfile" && cp -pf "$tmpfile" file.txt
rm -f "$tmpfile"
This answer was framed from two other answers. I guess this is the best solution.
originalFile=file.txt
tmpfile=$(mktemp)
cat "$originalFile" | envsubst > "$tmpfile" && cp -pf "$tmpfile" "$originalFile"
rm -f "$tmpfile"
Updated 20221011 - Using 1 sed command
sed -i -r 's/["`]|\$\(/\\&/g; s/.*/echo "&"/ e' ./input.txt
Updated 20221007 - Using 2 sed commands
sed -i -r 's/["`]|\$\(/\\&/g' input.txt
sed -i -r 's/.*/echo "&"/ e' input.txt
Do it without envsubst
envsubst_file () {
local original_file=$1
local temp_file=$(mktemp)
trap "rm -f '${temp_file}'" 0 2 3 15
cp -p "${original_file}" "${temp_file}"
cat "${original_file}" | sed -r 's/["`]|\$\(/\\&/g' | sed -r 's/.*/echo "&"/g' | sh > "${temp_file}"
mv "${temp_file}" "${original_file}"
}
envsubst_file 'input.txt'
First, sed escapes double quotes ("), backticks (`) and command substitutions $( by prefixing them with a backslash (\); then a second sed wraps each line in
echo "&"
Finally, the resulting shell script is executed and its output is redirected to ${temp_file}.
If you use bash, check this:
a=`<file.txt` && envsubst <<<"$a" >file.txt
Tested on 500mb file, works as expected.
In the end I found that using envsubst was too dangerous after all. My files might contain dollar signs in places where I don't want any substitution to happen, and envsubst will just replace them with empty strings if no corresponding environment variable is defined. Not cool.
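One way to limit that risk, as far as I know, is the optional SHELL-FORMAT argument of GNU envsubst, which restricts substitution to the variables named in it. The sketch below assumes GNU gettext's envsubst; the template text and the $PRICE placeholder are made up, and the fallback branch exists only so the demo still runs where envsubst is not installed.

```shell
#!/bin/bash
template='Price tag: $PRICE, examples: $EXAMPLES'
export EXAMPLES=lots

if command -v envsubst >/dev/null 2>&1; then
    # Only $EXAMPLES is listed, so $PRICE is left untouched instead of
    # being replaced with an empty string.
    result=$(printf '%s\n' "$template" | envsubst '$EXAMPLES')
else
    result="envsubst not installed"
fi

printf '%s\n' "$result"
```

Note the single quotes around '$EXAMPLES': the argument must reach envsubst as a literal variable reference, not as its expanded value.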

Remove Lines in Multiple Text Files that Begin with a Certain Word

I have hundreds of text files in one directory. For all files, I want to delete all the lines that begin with HETATM. I would need a csh or bash code.
I would think you would use grep, but I'm not sure.
Use sed like this:
sed -i -e '/^HETATM/d' *.txt
to process all files in place.
-i means "in place".
-e means to execute the command that follows.
/^HETATM/ means "find lines starting with HETATM", and the following d means "delete".
Make a backup first!
If you really want to do it with grep, you could do this:
#!/bin/bash
for f in *.txt
do
grep -v "^HETATM" "$f" > $$.tmp && mv $$.tmp "$f"
done
It makes a temporary file of the output from grep (in file $$.tmp) and only overwrites your original file if the command executes successfully.
Using the -v option of grep to get all the lines that do not match:
grep -v '^HETATM' input.txt > output.txt
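A variation of the grep loop, sketched on made-up files in a scratch directory. The one deliberate difference from the loop above is the status check: grep exits with 1 when it selects no lines, which happens for a file consisting only of HETATM lines, so here only a status above 1 is treated as a failure. The file contents and the status and tmp variable names are assumptions for the demo.

```shell
#!/bin/bash
work=$(mktemp -d)   # scratch directory for the demo
printf '%s\n' 'ATOM 1' 'HETATM 2' 'ATOM 3' > "$work/a.txt"
printf '%s\n' 'HETATM 1' 'HETATM 2' > "$work/b.txt"

for f in "$work"/*.txt
do
    tmp="$f.$$.tmp"
    status=0
    grep -v '^HETATM' "$f" > "$tmp" || status=$?
    # 0 = some lines kept, 1 = no lines kept (still valid), >1 = real error.
    if [ "$status" -le 1 ]; then
        mv "$tmp" "$f"
    else
        rm -f "$tmp"
    fi
done
```

With the plain && form, b.txt would have been left unchanged because grep reports "no lines selected" as a non-zero exit.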

sed delete not working with cat variable

I have a file named test-domain, the contents of which contain the line 100.am.
When I do this, the line with 100.am is deleted from the test-domain file, as expected:
for x in $(echo 100.am); do sed -i "/$x/d" test-domain; done
However, if instead of echo 100.am, I read each line from a file named unwanted-lines, it does NOT work.
for x in $(cat unwanted-lines); do sed -i "/$x/d" test-domain; done
This is even if the only contents of unwanted-lines is one line, with the exact contents 100.am.
Does anyone know why sed delete line works if you use echo in your variable, but not if you use cat?
fgrep -v -f unwanted-lines test-domain > /tmp/Buffer
mv /tmp/Buffer test-domain
sed is a poor fit here because the shell loop calls it once per unwanted line, which is inefficient and resource-hungry. You could still use sed by preloading the lines to delete and searching against that preloaded list, but that is much heavier than fgrep in this case.
Does anyone know why sed delete line works if you use echo in your
variable, but not if you use cat?
I believe that your file containing unwanted lines contains CR+LF line endings due to which it doesn't work when you use the file. You could strip the CR in your loop:
for x in $(cat unwanted-lines); do x="${x//$'\r'}"; sed -i "/$x/d" test-domain; done
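To check the CR+LF theory on a small case, here is a sketch with a deliberately Windows-style unwanted-lines file in a scratch directory. It uses a while read loop and a temporary file rather than for ... $(cat ...) and sed -i; those swaps are my own choices for the demo, not part of the answer above.

```shell
#!/bin/bash
work=$(mktemp -d)   # scratch directory for the demo
printf '100.am\r\n' > "$work/unwanted-lines"            # CR+LF line ending
printf '%s\n' 'example.com' '100.am' > "$work/test-domain"

while IFS= read -r x
do
    x="${x//$'\r'}"            # strip carriage returns
    [ -n "$x" ] || continue    # skip blank lines
    sed "/$x/d" "$work/test-domain" > "$work/tmp" &&
    mv "$work/tmp" "$work/test-domain"
done < "$work/unwanted-lines"

cat "$work/test-domain"
```

Without the stripping step, sed would be looking for 100.am followed by a carriage return, which never occurs in test-domain, so nothing would be deleted. Running od -c unwanted-lines is a quick way to see whether \r is really present.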
One better strategy than yours would be to use a genuine editor, e.g., ed, as so:
ed -s test-domain < <(
shopt -s extglob
while IFS= read -r l; do
[[ $l = *([[:space:]]) ]] && continue
l=${l//./\\.}
echo "g/$l/d"
done < unwanted-lines
echo "wq"
)
Caveat. You must make sure that the file unwanted-lines doesn't contain any character that could clash with ed's regexps and commands. I have already included a match for a period (i.e., replace . with \.).
This method is quite efficient, as you're not forking so many times on sed, writing temp files, renaming them, etc.
Another possibility would be to use grep, but then you won't have the editing option ed offers.
Remark. ed is the standard editor.
Why not just apply the sed command to your file?
sed -i '/.*100\.am/d' your_file

Resources