I'm going crazy over a script's performance. Basically, I have to replace 600 strings in more than 35,000 files.
I have something like this:
patterns=(
oldText1 newText1
oldText2 newText2
oldText3 newText3
)
pattern_count=${#patterns[*]}
files=(`find \. -name '*.js'`);
files_count=${#files[*]}
for ((i=0; i < $pattern_count ; i=i+2)); do
    search=${patterns[i]}
    replace=${patterns[i+1]}
    echo -en "\e[0K\r Status "$progress"%. Iteration: "$i" of "$pattern_count
    for ((j=0; j < $files_count; j++)); do
        command sed -i s#$search#$replace#g ${files[j]}
        progress=$(($i*100/$files_count))
        echo -en "\e[0K\r Inside the second loop: "$progress"%. File: "$j" of "$files_count
    done
    progress=$(($i*100/$pattern_count))
    echo -en "\e[0K\r Status "$progress"%. Iteration: "$i" of "$pattern_count
done
But this takes many minutes to run. Is there another solution? Probably using sed just once and not in a double loop?
Thanks a lot.
Create a proper sed script:
s/pattern1/replacement1/g
s/pattern2/replacement2/g
...
Run this script with sed -f script.sed file (or in whatever way is required).
You may create that sed script using your array:
printf 's/%s/%s/g\n' "${patterns[@]}" >script.sed
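With the sample patterns array from the question, the generated script.sed would contain:
s/oldText1/newText1/g
s/oldText2/newText2/g
s/oldText3/newText3/g
Note that this assumes none of the patterns or replacements contain characters special to sed (such as the / delimiter, & or backslashes); otherwise they would need escaping first.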
Applying it to the files:
find . -type f -name '*.js' -exec sed -i -f script.sed {} ';'
I don't quite know how GNU sed (which I assume you're using) is handling multiple files when you use -i, but you may also want to try
find . -type f -name '*.js' -exec sed -i -f script.sed {} +
which may potentially be much more efficient (executing as few sed commands as possible). As always, test on data that you can afford to throw away after testing.
For more information about using -exec with find, see https://unix.stackexchange.com/questions/389705
You don't need to run sed multiple times over one file. You can separate sed commands with ';'
You can execute multiple seds in parallel
For example:
patterns=(
oldText1 newText1
oldText2 newText2
oldText3 newText3
)
# construct a sed argument such as 's/old/new/g;s/old2/new2/g;...'
sedarg=$(
    for ((i = 0; i < ${#patterns[@]}; i += 2)); do
        echo -n "s/${patterns[i]}/${patterns[i+1]}/g;"
    done
)

# find all files named '*.js' and pass them, NUL-separated, to xargs,
# which will parse them:
#   -0         use NUL (zero byte) as the separator
#   --verbose  print each command line before executing it (i.e. sed -i .... file)
#   -n1        pass one argument/one file to each sed
#   -P8        run 8 seds simultaneously (experiment with that value; it depends on how fast your CPU and disks are)
find . -type f -name '*.js' -print0 | xargs -0 --verbose -n1 -P8 sed -i "$sedarg"
If you need the progress bar that much, I guess you can count the lines xargs --verbose returns or, better, use parallel --bar; see this post.
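As a rough sketch of that last suggestion (assuming GNU parallel is installed; untested, so try it on throwaway data first):
# same sedarg as above; -0 matches find's -print0, --bar draws a progress bar
find . -type f -name '*.js' -print0 | parallel -0 --bar sed -i "$sedarg" {}
By default parallel runs one job per CPU core, so there is no need for an explicit -P8 here.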
Related
I want to find all files with a certain name (Myfile.txt) that do not contain a certain string (my-wished-string), and then run sed to do a replacement in the found files. I tried:
find . -type f -name "Myfile.txt" -exec grep -H -E -L "my-wished-string" {} + | sed 's/similar-to-my-wished-string/my-wished-string/'
But this only lists the files with the wished name that are missing "my-wished-string"; it does not execute the replacement. Am I missing something here?
With a for loop, invoking a shell:
find . -type f -name "Myfile.txt" -exec sh -c '
for f; do
grep -H -E -L "my-wished-string" "$f" &&
sed -i "s/similar-to-my-wished-string/my-wished-string/" "$f"
done' sh {} +
You might want to add a -q to grep and -n to sed to silence the printing/output to stdout
You can do this by constructing two stacks; the first containing the files to search, and the second containing negative hits, which will then be iterated over to perform the replacement.
find . -type f -name "Myfile.txt" > stack1
while read -r line; do
    [ -z "$(sed -n '/my-wished-string/p' "${line}")" ] && echo "${line}" >> stack2
done < stack1

while read -r line; do
    sed -i "s/similar-to-my-wished-string/my-wished-string/" "${line}"
done < stack2
With some versions of sed, you can use -i to edit the file. But don't pipe the list of names to sed, just execute sed in the find:
find . -type f -name Myfile.txt -not -exec grep -q "my-wished-string" {} \; -exec sed -i 's/similar-to-my-wished-string/my-wished-string/g' {} \;
Note that any file which contains similar-to-my-wished-string also contains the string my-wished-string as a substring, so with these exact strings the command is a no-op, but I suppose your actual strings are different than these.
How can I chain multiple grep commands?
For example, if I want to search recursively for all PHP files that are publicly accessible, i.e. those which contain $_user_location = 'public; and then search for "SendQueue() inside all those files, what should I do?
A few of my failed attempts:
grep -rnw ./* -e "^.*user_location.*public" *.php | grep -i "^.*SendQueue().*" --color
grep -rnw ./* -e "^.*user_location.*public" *.php | xargs -0 -i "^.*SendQueue().*" --color
Print the grep results with the filename, extract the filenames, and pass those filenames to the second grep:
grep -H ..... | cut -d: -f1 | xargs -d'\n' grep ....
This works as long as there are no : characters in the filenames, and usually there are none.
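If you are worried about unusual characters in the file names and have GNU grep available, a NUL-separated variant of the same idea might look like this (a sketch, untested):
# -r recurse, -l list matching files, -Z terminate each name with a NUL byte;
# xargs -0 then reads those NUL-separated names safely
grep -rlZ "user_location.*public" --include='*.php' . | xargs -0 grep -n --color "SendQueue()"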
You could always do a plain old loop:
for i in *.php; do
if grep -q .... "$i"; then
grep .... "$i"
fi
done
Using awk:
$ awk '
    /SendQueue\(\)/ {               # store all records containing SendQueue()
        a[++i]=$0
    }
    /.*user_location.*public/ {     # if the condition is met, raise a flag
        f=1
    }
    END {
        if(f)                       # if the flag is up
            for(j=1;j<=i;j++)       # output all stored records
                print a[j]
    }' file
Testfile:
$_user_location = 'public;
SendQueue()
Since no expected output was given in the question, you only get:
SendQueue()
For multiple files:
$ for f in *.php ; do awk ... "$f" ; done
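If you would rather let awk handle all the files in a single invocation instead of an outer shell loop, GNU awk's BEGINFILE/ENDFILE blocks can take over the per-file bookkeeping (a sketch, assuming gawk):
gawk '
    BEGINFILE { delete a; i = 0; f = 0 }   # reset the per-file state
    /SendQueue\(\)/ { a[++i] = $0 }        # collect matching records
    /_user_location.*public/ { f = 1 }     # flag the file if the condition is met
    ENDFILE {
        if (f)
            for (j = 1; j <= i; j++)
                print FILENAME ": " a[j]
    }
' *.php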
If you add the -l option to your first grep, you'll get all the file names, which you can feed to your second grep, like:
grep -i "^.*SendQueue().*" --color $(grep -l ...)
assuming you don't have special characters in the file names.
Some alternatives that could be quicker...
1. Using sed
sed -s '/\(SendQueue()\|_user_location = \o47public\)/H;${ x;s/\n/ /g;/SendQueue.*_user_location\|_user_location.*SendQueue/F;};d' *.php
Which could be written:
sed -s '
    /\(SendQueue()\|_user_location = \o47public\)/H;
    ${
        x;
        s/\n/ /g;
        /SendQueue.*_user_location\|_user_location.*SendQueue/F;
    };
    d' *.php
Or with find:
find /path -type f -name '*.php' -exec sed -s '
    /\(SendQueue()\|_user_location = \o47public\)/H;
    ${
        x;
        s/\n/ /g;
        /SendQueue.*_user_location\|_user_location.*SendQueue/F;
    };
    d' {} +
2. Using grep
But reading each file only once:
grep -c "\(SendQueue()\|_user_location = 'public\)" *.php | grep :2$
or
grep -c "\(SendQueue()\|_user_location = 'public\)" *.txt | sed -ne 's/:2$//p'
Then
find /path -type f -name '*.php' -exec grep -c \
"\(SendQueue()\|_user_location = 'public\)" {} + |
sed -ne 's/:2$//p'
Of course, this works only if you're sure each string can be present only once per file.
Remark
To ensure that no commented line will pollute the result, you could replace the regex with
"^[^#/]*\(SendQueue()\|_user_location = 'public\)"
in all of the submitted alternatives.
I can mention two ways of doing this:
POSIX way
You can use find(1) in order to do recursive search. find is defined by POSIX and is most likely included in your system.
find . -type f -name '*.php' -exec grep -q "\$_user_location.*=.*'public" {} \; -exec grep 'SendQueue()' {} +
Here is the explanation for what this command does:
-type f Look for files
-name '*.php' With the suffix .php
-exec grep -q ... {} \; Run the first grep sequence individually.
-exec grep {} + Run the second grep sequence on the files that were matched previously.
Ripgrep way
ripgrep is a really fast recursive grep tool. This will take much less search time, but you will need to obtain it separately.
rg --glob '*.php' -l "\$_user_location.*=.*'public" | xargs rg 'SendQueue\(\)'
Here is the explanation for what this command does:
--glob '*.php' Only looks inside files with the suffix .php
-l Only lists files that match
We enter the first query and pipe all the matching files to xargs
xargs runs rg with the second query and adds the received files as arguments so that ripgrep only searches those files.
Which one to use
ripgrep really shines on huge directories, but it isn't really necessary for what you are asking here. Picking find is enough for most cases. The time you will spend obtaining ripgrep will probably be more than the time you will save by using it for this specific operation. ripgrep is a nice tool regardless.
EDIT:
The find command has two -exec variants:
-exec grep (...) {} \; This calls the grep command for each file match. This will run the following:
grep (query) file1.php
grep (query) file2.php
grep (query) file3.php
find tracks the command result for each file, and passes them to the next test if they succeed.
-exec grep (...) {} + This calls the command with all the files attached as arguments. This will expand as:
grep (query) file1.php file2.php file3.php
I have a garbage dump of a bunch of Wordpress files and I'm trying to convert them all to Markdown.
The script I wrote is:
htmlDocs=($(find . -print | grep -i '.*[.]html'))
for html in "${htmlDocs[#]}"
do
P_MD=${html}.markdown
echo "${html} \> ${P_MD}"
pandoc --ignore-args -r html -w markdown < "${html}" | awk 'NR > 130' | sed '/<div class="site-info">/,$d' > "${P_MD}"
done
As far as I understand, the first line should be making an array of all html files in all subdirectories, then the for loop has a line to create a variable with the Markdown name (followed by a debugging echo), then the actual pandoc command to do the conversion.
One at a time, this command works.
However, when I try to execute it, OSX gives me:
$ ./pandoc_convert.command
./pandoc_convert.command: line 1: : No such file or directory
./pandoc_convert.command: line 1: : No such file or directory
o_0
Help?
The script may fail for many reasons, starting with the fact that the way you create the array is incorrect:
htmlDocs=($(find . -print | grep -i '.*[.]html'))
Arrays are assigned in the form NAME=(VALUE1 VALUE2 ... ), where NAME is the name of the variable and VALUE1, VALUE2, and the rest are fields separated by the characters present in the $IFS (internal field separator) variable. If find returns a file name containing spaces, that name will be split into separate items in the array.
Another issue is that the expression doesn't handle globbing, i.e. file name generation based on the shell expansion of special characters such as *:
mkdir dir.html
touch \ *.html
touch a\ b\ c.html
a=($(find . -print | grep -i '.*[.]html'))
for html in "${a[#]}"; do echo ">>>${html}<<<"; done
Output
>>>./a<<<
>>>b<<<
>>>c.html<<<
>>>./<<<
>>>a b c.html<<<
>>>dir.html<<<
>>> *.html<<<
>>>./dir.html<<<
I know two ways to fix this behavior: 1) temporarily disable globbing, and 2) use the mapfile command.
Disabling Globbing
# Disable globbing, remember current -f flag value
[[ "$-" == *f* ]] || globbing_disabled=1
set -f
IFS=$'\n' a=($(find . -print | grep -i '.*[.]html'))
for html in "${a[#]}"; do echo ">>>${html}<<<"; done
# Restore globbing
test -n "$globbing_disabled" && set +f
Output
>>>./ .html<<<
>>>./a b c.html<<<
>>>./ *.html<<<
>>>./dir.html<<<
Using mapfile
mapfile was introduced in Bash 4. The command reads lines from the standard input into an indexed array:
mapfile -t a < <(find . -print | grep -i '.*[.]html')
for html in "${a[#]}"; do echo ">>>${html}<<<"; done
The find Options
The find command selects all types of nodes, including directories. You should use the -type option, e.g. -type f for files.
If you want to filter the result set with a regular expression, use the -regex option, or -iregex for case-insensitive matching:
mapfile -t a < <(find . -type f -iregex '.*\.html$')
for html in "${a[@]}"; do echo ">>>${html}<<<"; done
Output
>>>./ .html<<<
>>>./a b c.html<<<
>>>./ *.html<<<
echo vs. printf
Finally, don't use echo in new software. Use printf instead:
mapfile -t a < <(find . -type f -iregex '.*\.html$')
for html in "${a[@]}"; do printf '>>>%s<<<\n' "$html"; done
Alternative Approach
However, I would rather pipe a loop with a read:
find . -type f -iregex '.*\.html$' | while read -r line
do
printf '>>>%s<<<\n' "$line"
done
In this example, the read command reads a line from the standard input and stores the value in the line variable.
Although I like the mapfile feature, I find the code with the pipe more clear.
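If the file names might also contain newlines, a NUL-delimited variant of the same loop avoids the problem entirely (a sketch):
# -print0 and read -d '' use the NUL byte as the delimiter, so any file name is safe
find . -type f -iname '*.html' -print0 |
while IFS= read -r -d '' html; do
    printf '>>>%s<<<\n' "$html"
done
Keep in mind that the loop runs in a subshell because of the pipe, so variables set inside it are not visible afterwards; done < <(find ...) avoids that if it matters.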
Try adding a bash shebang and setting IFS to handle spaces in folders and file names:
#!/bin/bash
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
htmlDocs=($(find . -print | grep -i '.*[.]html'))
for html in "${htmlDocs[#]}"
do
P_MD=${html}.markdown
echo "${html} \> ${P_MD}"
pandoc --ignore-args -r html -w markdown < "${html}" | awk 'NR > 130' | sed '/<div class="site-info">/,$d' > "${P_MD}"
done
IFS=$SAVEIFS
for i in USER; do
find /home/$i/public_html/ -type f -iname '*.php' \
| xargs grep -A1 -l 'GLOBALS\|preg_replace\|array_diff_ukey\|gzuncompress\|gzinflate\|post_var\|sF=\|qV=\|_REQUEST'
done
It's ignoring the -A1. The end result is that I just want it to show me files that contain any of the matching words, but only on the first line of the script. If there is a better, more efficient, less resource-intensive way, that would be great as well, as this will be run on very large shared servers.
Use awk instead:
for i in USER; do
    find /home/$i/public_html/ -type f -iname '*.php' -exec \
        awk 'FNR == 1 && /GLOBALS|preg_replace|array_diff_ukey|gzuncompress|gzinflate|post_var|sF=|qV=|_REQUEST/ { print FILENAME }' {} +
done
This will print the current input file if the first line matches. It's not ideal, since it will read all of each file. If your version of awk supports it, you can use
awk '/GLOBALS|.../ { print FILENAME } {nextfile}'
The nextfile command executes on the first line, effectively skipping the rest of the file once awk has tested whether that line matches the regular expression.
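Putting the two together, the nextfile variant wired into the original find might look like this (a sketch, assuming GNU awk or another awk that supports nextfile):
for i in USER; do
    find /home/$i/public_html/ -type f -iname '*.php' -exec \
        awk '/GLOBALS|preg_replace|array_diff_ukey|gzuncompress|gzinflate|post_var|sF=|qV=|_REQUEST/ { print FILENAME } { nextfile }' {} +
done
Only the first line of each file is ever tested, because the unconditional { nextfile } rule fires right after the first rule has been evaluated for that line.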
The following code is untested:
for i in USER; do
find /home/$i/public_html/ -type f -iname '*.php' | while read -r; do
head -n1 "$REPLY" | grep -q 'GLOBALS\|preg_replace\|array_diff_ukey\|gzuncompress\|gzinflate\|post_var\|sF=\|qV=\|_REQUEST' \
&& echo "$REPLY"
done
done
The idea is to loop over each find result, explicitly test the first line, and print the filename if a match was found. I don't like it though because it feels so clunky.
for j in $(find /home/$i/public_html/ -type f -iname '*.php'); do
    result=$(head -n 1 "$j" | grep $stuff)
    [[ -n $result ]] && echo "$j: $result"
done
You'll need a little more effort to skip leading blank lines. fgrep will save resources. A little perl would bring a great improvement, but it's hard to type on a phone.
Edit:
On a less cramped keyboard, I have inserted the less terse solution above.
I had previously used a simple find command to delete tar files not accessed in the last x days (in this example, 3 days):
find /PATH/TO/FILES -type f -name "*.tar" -atime +3 -exec rm {} \;
I now need to improve this script by deleting in order of access date and my bash writing skills are a bit rusty. Here's what I need it to do:
check the size of a directory /PATH/TO/FILES
if size in 1) is greater than X size, get a list of the files by access date
delete files in order until size is less than X
The benefit here is that for cache and backup directories, I will only delete what I need to in order to keep them within a limit, whereas the simpler method might go over the size limit if one day's files are particularly large. I'm guessing I need to use stat and a bash for loop?
I improved brunner314's example and fixed the problems in it.
Here is a working script I'm using:
#!/bin/bash
DELETEDIR="$1"
MAXSIZE="$2" # in MB
if [[ -z "$DELETEDIR" || -z "$MAXSIZE" || "$MAXSIZE" -lt 1 ]]; then
echo "usage: $0 [directory] [maxsize in megabytes]" >&2
exit 1
fi
find "$DELETEDIR" -type f -printf "%T#::%p::%s\n" \
| sort -rn \
| awk -v maxbytes="$((1024 * 1024 * $MAXSIZE))" -F "::" '
BEGIN { curSize=0; }
{
curSize += $3;
if (curSize > maxbytes) { print $2; }
}
' \
| tac | awk '{printf "%s\0",$0}' | xargs -0 -r rm
# delete empty directories
find "$DELETEDIR" -mindepth 1 -depth -type d -empty -exec rmdir "{}" \;
Here's a simple, easy to read and understand method I came up with to do this:
DIRSIZE=$(du -s /PATH/TO/FILES | awk '{print $1}')
if [ "$DIRSIZE" -gt "$SOMELIMIT" ]
then
    for f in `ls -rt --time=atime /PATH/TO/FILES/*.tar`; do
        FILESIZE=`stat -c "%s" $f`
        FILESIZE=$(($FILESIZE/1024))
        rm -f "$f"                        # delete the oldest-accessed file first
        DIRSIZE=$(($DIRSIZE - $FILESIZE))
        if [ "$DIRSIZE" -lt "$LIMITSIZE" ]; then
            break
        fi
    done
fi
I didn't need to use loops, just some careful application of stat and awk. Details and explanation below, first the code:
find /PATH/TO/FILES -name '*.tar' -type f \
    | sed 's/ /\\ /g' \
    | xargs stat -f "%a::%z::%N" \
    | sort -r \
    | awk -v maxsize="$X_SIZE" '
        BEGIN { curSize=0; FS="::" }
        { curSize += $2 }
        curSize > maxsize { print $3 }
      ' \
    | sed 's/ /\\ /g' \
    | xargs rm
Note that this is one logical command line, but for the sake of sanity I split it up.
It starts with a find command based on the one above, without the parts that limit it to files older than 3 days. It pipes that to sed, to escape any spaces in the file names find returns, then uses xargs to run stat on all the results. The -f "%a::%z::%N" tells stat the format to use, with the time of last access in the first field, the size of the file in the second, and the name of the file in the third. I used '::' to separate the fields because it is easier to deal with spaces in the file names that way. Sort then sorts them on the first field, with -r to reverse the ordering.
Now we have a list of all the files we are interested in, in order from latest accessed to earliest accessed. Then the awk script adds up all the sizes as it goes through the list, and begins outputting them once it gets over $X_SIZE. The files that are not output this way are the ones kept; the other file names go to sed again to escape any spaces and then to xargs, which runs rm on them.
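The stat -f format used above is BSD/macOS syntax. With a GNU userland the same idea could be sketched roughly like this (untested; here $X_SIZE is assumed to be the limit in bytes, and file names containing newlines would still break it):
find /PATH/TO/FILES -type f -name '*.tar' -printf '%A@::%s::%p\n' \
    | sort -rn \
    | awk -F'::' -v maxbytes="$X_SIZE" '
        { curSize += $2 }                 # running total, most recently accessed first
        curSize > maxbytes { print $3 }   # everything past the limit gets printed
      ' \
    | tr '\n' '\0' \
    | xargs -0 -r rm --
GNU find's %A@ prints the last access time as seconds since the epoch, so sort -rn puts the most recently accessed files first and the awk script ends up printing the oldest-accessed files once the size limit is exceeded.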