Find and replace html code for multiple files within multiple directories - shell

I have a very basic understanding of shell scripting, but what I need to do requires more complex commands.
For one task, I need to find and replace html code within the index.html files on my server. These files are in multiple directories with a consistent naming convention. ([letter][3-digit number]) See the example below.
files: index.html
path: /www/mysite/board/today/[rsh][0-9]/
string to find: (div id="id")[code](/div)<--#include="(path)"-->(div id="id")[more code](/div)
string to replace with: (div id="id")<--include="(path)"-->(/div)
I hope you don't mind the pseudo-regex. The folders containing my target index.html files look similar to r099, s017, h123. Suffice it to say, the HTML code I'm trying to replace is relatively long, but it's still just a string.
The second task is similar to the first, only the filename changes as well.
files: [rsh][0-9].html
path: www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/
string: (div id="id")[code](/div)<--include="(path)"-->(div id="id")[more code](/div)
string to replace with: (div id="id")<--include="(path)"-->(/div)
I've seen other examples on SO and elsewhere on the net that simply show scripts modifying files under a single directory to find & replace a string without any special characters, but I haven't seen an example similar to what I'm trying to do just yet.
Any assistance would be greatly appreciated.
Thank You.

You have three separate sub-problems:
replacing text in a file
coping with special characters
selecting files to apply the transformation to
​1. The canonical text replacement tool is sed:
sed -e 's/PATTERN/REPLACEMENT/g' <INPUT_FILE >OUTPUT_FILE
If you have GNU sed (e.g. on Linux or Cygwin), pass -i to transform the file in place. You can act on more than one file in the same command line.
sed -i -e 's/PATTERN/REPLACEMENT/g' FILE OTHER_FILE…
If your sed doesn't have the -i option, you need to write to a different file and move that into place afterwards. (This is what GNU sed does behind the scenes.)
sed -e 's/PATTERN/REPLACEMENT/g' <FILE >FILE.tmp
mv FILE.tmp FILE
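As a rough sketch of that write-then-move approach, the two steps can be wrapped in a small function. The name replace_in_place and the use of mktemp are my own choices for illustration, not something from the question:

```shell
# Hypothetical helper: replace_in_place 's/PATTERN/REPLACEMENT/g' FILE
replace_in_place() {
    script=$1
    file=$2
    # Create the temporary file next to the target so mv stays on one filesystem.
    tmp=$(mktemp "$file.XXXXXX") || return 1
    if sed -e "$script" <"$file" >"$tmp"; then
        mv "$tmp" "$file"
    else
        # Leave the original untouched if sed failed.
        rm -f "$tmp"
        return 1
    fi
}

# Try it on a scratch file rather than on real data.
demo=$(mktemp)
printf 'hello world\n' >"$demo"
replace_in_place 's/world/there/g' "$demo"
cat "$demo"
```

Note that the file moved into place is the one mktemp created, so the original's permissions and ownership are not preserved; adjust them afterwards if that matters.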
2. If you want to replace a literal string by a literal string, you need to prefix all special characters by a backslash. For sed patterns, the special characters are .\[^$* plus the separator for the s command (usually /). For sed replacement text, the special characters are \, & and newlines, plus the same separator. You can use sed to turn a string into a suitable pattern or replacement text.
pattern=$(printf %s "$string_to_replace" | sed -e 's![.\[^$*/]!\\&!g')
replacement=$(printf %s "$replacement_string" | sed -e 's![\&/]!\\&!g')
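To check the escaping on something small, here is a sketch that wraps both steps in functions and runs them on a made-up string. The function names and test strings are only illustrative, and the replacement helper also escapes the / separator, which matters as soon as the replacement contains a slash:

```shell
# Hypothetical helpers; the names are illustrative.
escape_pattern() {
    # Escape BRE metacharacters and the / separator.
    printf '%s' "$1" | sed -e 's![.\[^$*/]!\\&!g'
}
escape_replacement() {
    # Escape backslash, ampersand and the / separator.
    printf '%s' "$1" | sed -e 's![\&/]!\\&!g'
}

pat=$(escape_pattern 'a.b*c/d')
rep=$(escape_replacement 'x&y/z')

# Only the first occurrence is a literal match; aXb*c/d must stay untouched.
result=$(printf '%s\n' 'a.b*c/d and aXb*c/d' | sed -e "s/$pat/$rep/g")
printf '%s\n' "$result"
```

If the escaping worked, the . and * in the search string are matched literally and the & in the replacement is inserted as a plain character.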
​3. To act on multiple files directly in one or more directories, use shell wildcards. Your requirements don't seem completely consistent; I think these are the patterns you're looking for, but be sure to review them.
/www/mysite/board/today/[rsh][0-9][0-9][0-9]/index.html
/www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9].html
This will match files like /www/mysite/board/today/r012/index.html and /www/mysite/person/4/5/6/card/2011/h7.html, but not /www/mysite/board/today/subdir/s012/index.html or /www/mysite/board/today/r1234/index.html.
If you need to act on files in subdirectories recursively, use find. It doesn't seem to be in your requirements and this answer is long enough already, so I'll stop here.
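For completeness, a minimal sketch of the recursive variant with find, run against a throwaway directory tree so nothing real is touched. The layout and the old/new strings are invented for the demonstration:

```shell
# Demo tree under a temporary directory (hypothetical layout).
root=$(mktemp -d)
mkdir -p "$root/today/r012" "$root/today/sub/s345" "$root/today/other"
printf 'old text\n' > "$root/today/r012/index.html"
printf 'old text\n' > "$root/today/sub/s345/index.html"
printf 'old text\n' > "$root/today/other/index.html"

# -path matches the whole path; unlike in the shell, * here also matches /.
# -i.bak keeps a backup copy of each file that sed rewrites.
find "$root/today" -type f -path '*/[rsh][0-9][0-9][0-9]/index.html' \
    -exec sed -i.bak -e 's/old/new/g' {} +

cat "$root/today/sub/s345/index.html"
```

Unlike the plain wildcard, this also reaches today/sub/s345/index.html, so be sure that is what you want before pointing it at the real tree.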
​4. Putting it all together:
string_to_replace='(div id="id")[code](/div)<--#include="(path)"-->(div id="id")[more code](/div)'
replacement_string='(div id="id")<--include="(path)"-->(/div)'
pattern=$(printf %s "$string_to_replace" | sed -e 's![.\[^$*/]!\\&!g')
replacement=$(printf %s "$replacement_string" | sed -e 's![\&/]!\\&!g')
sed -i -e "s/$pattern/$replacement/g" \
/www/mysite/board/today/[rsh][0-9][0-9][0-9]/index.html \
/www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9].html
Final note: you seem to be working on HTML with regular expressions. That's often not a good idea.

Finding the files can easily be done using find -regex:
find www/mysite/board/today -regex ".*[rsh][0-9][0-9][0-9]/index\.html"
find www/mysite/person -regex ".*[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9][0-9][0-9]\.html"
Due to the nature of HTML, replacing the content might not be very easy with sed, so I would suggest using an HTML or XML parsing library in a Perl script. Can you provide a short sample of an actual HTML file and the result of the replacements?

Related

Regex to match characters between two specific characters in shell script

I want to clean my file before/after saving, so I have to delete the unnecessary characters in it. Sadly, even though my regex works in Regex101, it does not work in the shell script I wrote.
I am getting my list from Kubernetes via
kubectl get pods -n $1 -o jsonpath='{range .items[*]}{#.spec.containers[*].image}{","}{#.status.containerStatuses[*].imageID}{"\n"}{end}'
Then I save it to a temp file and use sed to clean it: the regex should match, and sed should delete, every character between , and # (and the # itself). I am escaping them since they are special characters.
sed -i 's/(?<=\,)(.*?)(?<=\#)//g' temp
The problem is that this regex works fine (for example in Regex101) but does not work with the sed command. I even tried awk but got the same result.
awk '!/(?<=\,)(.*?)(?<=\#)/' temp
Am I missing something or is the regex acting differently somehow in Unix/shell?
Thanks for any input.
Example content of the file (for test):
docker.elastic.co/elasticsearch/elasticsearch:7.17.5,docker-pullable://docker.elastic.co/elasticsearch/elasticsearch#sha256:76344d5f89b13147743db0487eb76b03a7f9f0cd55abe8ab887069711f2ee27d
docker.io/bitnami/kafka:3.3.1-debian-11-r11,docker-pullable://bitnami/kafka#sha256:be29db0e37b6ab13df5fc14988a4aa64ee772c7f28b4b57898015cf7435ff662
docker.io/bitnami/mongodb:6.0.3-debian-11-r0,docker-pullable://bitnami/mongodb#sha256:e7438d7964481c0bcfcc8f31bca2d73022c0b7ba883143091a71ae01be6d9edb
docker.io/bitnami/postgresql:14.1.0-debian-10-r80,docker-pullable://bitnami/postgresql#sha256:6eb9c4ab3444e395df159e2cad21f283e4bf30802958467590c886f376dc9959
docker.io/bitnami/zookeeper:3.8.0-debian-11-r47,docker-pullable://bitnami/zookeeper#sha256:0f3169499c5ee02386c3cb262b2a0d3728998d9f0a94130a8161e389f61d1462
Expected output:
docker.elastic.co/elasticsearch/elasticsearch:7.17.5,sha256:76344d5f89b13147743db0487eb76b03a7f9f0cd55abe8ab887069711f2ee27d
docker.io/bitnami/kafka:3.3.1-debian-11-r11,sha256:be29db0e37b6ab13df5fc14988a4aa64ee772c7f28b4b57898015cf7435ff662
docker.io/bitnami/mongodb:6.0.3-debian-11-r0,sha256:e7438d7964481c0bcfcc8f31bca2d73022c0b7ba883143091a71ae01be6d9edb
docker.io/bitnami/postgresql:14.1.0-debian-10-r80,sha256:6eb9c4ab3444e395df159e2cad21f283e4bf30802958467590c886f376dc9959
docker.io/bitnami/zookeeper:3.8.0-debian-11-r47,sha256:0f3169499c5ee02386c3cb262b2a0d3728998d9f0a94130a8161e389f61d1462
You are trying to use Perl extensions which are not supported by more traditional regex tools like sed and Awk.
Perhaps see also Why are there so many different regular expression dialects? and the Stack Overflow regex tag info page.
If I can guess what you are trying to do, you want simply
sed -i 's/,[^#]*#/,/g' temp
The /g flag is unnecessary if you only expect one match per line.
Neither , nor # is a regex metacharacter; they do not require escaping.
Usually you would want to avoid using a temporary file or sed -i; perhaps simply
kubectl blah blah | sed 's/,[^#]*#/,/' > temp
to create the file, or remove the redirection if you want to pipe the results further.
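As a quick check, here is the same substitution applied to one of the sample lines from the question, without touching any file (the variable names are just for the demonstration):

```shell
# One sample line taken from the question.
line='docker.io/bitnami/kafka:3.3.1-debian-11-r11,docker-pullable://bitnami/kafka#sha256:be29db0e37b6ab13df5fc14988a4aa64ee772c7f28b4b57898015cf7435ff662'

# Replace the comma, everything up to the first #, and the # itself with a comma.
cleaned=$(printf '%s\n' "$line" | sed 's/,[^#]*#/,/')
printf '%s\n' "$cleaned"
```

[^#]* cannot run past the first #, which is what the non-greedy .*? was meant to achieve in the Perl-style regex.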

sed command to change names for few files in different directories at once

I have a few folders named S1S, S2S, S3S, ... In each of these folders there is a file1.
The file1 in each folder consists of lines like
1990.A.BHT_S1S.dat
1994.I.BHT_S1S.dat
1995.K.BHT_S1S.dat
Likewise, the S1S suffix changes according to the folder.
I'm trying to change these names into 1990.A.BHT type for all folders using this command
for dir in S*
do
cd $dir
sed -i 's/_${dir}\.dat//g' file1 > file2
cd ../
done
but I get an empty file for file2.
Can someone help me to figure out my mistake please?
This might work for you (GNU sed and parallel):
parallel sed 's/_{}\.dat//' {}/file1 \> {}/file2 ::: S*S
Create a new file file2 in each directory S1S S2S S3S ... from file1 with the string _SnS.dat removed (where SnS represents the current directory).
There are several problems here. First, as konsolebox said in a comment, sed -i modifies the original file rather than producing output that can be redirected with >, so you need to remove that option.
Second, variables don't expand in single-quoted strings, so 's/_${dir}\.dat//g' doesn't use the dir variable, it just treats that whole thing as a literal string.
The third is probably ok, but using cd in a script is dangerous, because if it fails for some reason the rest of the script will run in unexpected places, with possibly very bad results. It's generally better to use explicit paths, like sed ... "$dir/file1" instead of cding to $dir and then using sed ... file1.
Finally (again probably ok here) is that you should almost always put double-quotes around variable references, to avoid weird parsing of some characters.
So here's how I'd rewrite the script snippet:
for dir in S*
do
sed "s/_${dir}\.dat//g" "$dir/file1" > "$dir/file2"
done
p.s. shellcheck.net is good at spotting common mistakes in shell scripts; it spots three of the four problems I saw (all but the sed -i problem). I recommend running your scripts through it as a check.
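If you want to try the idea without touching your real data, here is a sketch that builds two sample folders in a temporary directory and runs the same kind of loop over them. The temporary directory and the sample lines are made up; because the loop walks full paths here, the folder name is taken from the last path component:

```shell
# Build a throwaway copy of the layout described in the question.
work=$(mktemp -d)
for d in S1S S2S; do
    mkdir "$work/$d"
    printf '1990.A.BHT_%s.dat\n1994.I.BHT_%s.dat\n' "$d" "$d" > "$work/$d/file1"
done

# Same approach as the rewritten loop, but using explicit paths throughout.
for path in "$work"/S*S; do
    dir=${path##*/}    # last path component, e.g. S1S
    sed "s/_${dir}\.dat//" "$path/file1" > "$path/file2"
done

cat "$work/S2S/file2"
```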

How to use sed to remove ./ between two characters in Unix shell

I am trying to remove ./ between two characters using sed but not getting the desired output.
Sample:
e2b66a3d84ee448c33d7f2a2f7e51c58 ./2017_06_10_0400.txt
I tried the below but it is not working as expected, even the . in the ".txt" is getting removed.
sed -i 's/[./,]//g'
Beware: don't even think of using the -i option until you know the code is working. You can screw things up big time!
Use:
sed -e 's%[.]/%%g'
You can choose the delimiter in a s/// command, and when the regular expressions involve /, it is sensible to choose something else — I often use % when it doesn't figure in the text. The -e is optional. Using [.] to detect an actual dot is one way; you can write \. if you prefer, but I'm allergic to avoidable backslashes (if you've never had to write 16 backslashes in a row to get troff to do what you want, you haven't suffered enough).
Be aware that the -i option behaves differently in GNU sed and BSD (macOS) sed. Using -i.bak works in both (for an arbitrary, non-empty string such as .bak). Otherwise, your code isn't portable (which may or may not matter to you now, but might well do later on).
You have:
sed -i 's/[./,]//g'
The trouble with this is that it looks for any of the characters ., / or , in isolation — so it removes the . in .txt as well as the . and / in ./. You need to look for consecutive characters — as in my suggested solution.
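A side-by-side run on the sample line makes the difference visible (the variable names are only for the demonstration):

```shell
sample='e2b66a3d84ee448c33d7f2a2f7e51c58 ./2017_06_10_0400.txt'

# Bracket expression: removes every ., / and , wherever it occurs.
wrong=$(printf '%s\n' "$sample" | sed 's/[./,]//g')

# Sequence: removes only a dot immediately followed by a slash.
right=$(printf '%s\n' "$sample" | sed 's%[.]/%%g')

printf '%s\n%s\n' "$wrong" "$right"
```

The first result loses the dot in .txt as well; the second keeps it.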
Try this:
echo "e2b66a3d84ee448c33d7f2a2f7e51c58 ./2017_06_10_0400.txt" | sed -e 's|\./||'
You need to use the escape character \:
's#\.\/##g'
:=>echo "e2b66a3d84ee448c33d7f2a2f7e51c58 ./2017_06_10_0400.txt" | sed 's#\.\/##g'
e2b66a3d84ee448c33d7f2a2f7e51c58 2017_06_10_0400.txt
:=>

Changing the prefix of a file with sed

I would like some advice on this script.
I'm trying to use sed (I didn't manage it with rename) to change a file that contains lines of the format (my test file name is sedtest):
COPY W:\Interfaces\Payments\Tameia\Unprocessed\X151008\E*.*
(that's not the only content of the file).
My goal is to replace the 151008 date part with a different date, I've tried to come up with a solution in sed using this:
sed -i -e "s/Unprocessed\X.*/Unprocessed\X'BLABLA'/" sedtest
but it doesn't seem to work: the line remains unchanged, as if the pattern isn't recognized because of the \. I've tried some alternative delimiters like #, but to no avail.
Thanks in advance for any advice.
There are a couple of issues with your sed command. I would suggest changing it to this:
sed -r 's/(Unprocessed\\X)[0-9]+/\1BLABLA/' file
Since your version of sed supports -i without requiring that you add a suffix to create a backup file, I assume you're using the GNU version, which also supports extended regular expressions with the -r switch. The command captures the part within the () and uses it in the replacement \1. Don't forget that backslashes must be escaped.
If you're going to use -i, I would recommend doing so like -i.bak, so a backup of your file is made to file.bak before it is overwritten.
You haven't shown the exact output you were looking for but I assumed that you wanted the line to become:
COPY W:\Interfaces\Payments\Tameia\Unprocessed\XBLABLA\E*.*
Remember that * is greedy, so .* would match everything up to the end of the line. That's why I changed it to [0-9]+, so that only the digits were replaced, leaving the rest of the line intact.
As you've mentioned using a variable in the replacement, you should use something like this:
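sed -r -i.bak "s/(Unprocessed\\\\X)[0-9]+/\1$var/" file
Inside double quotes the shell reduces each \\ to a single \, so four backslashes are needed for sed to receive the \\ that matches one literal backslash. This assumes that $var is safe to use, i.e. doesn't contain characters that will be interpreted by sed, like \, / or &. See this question for details on handling such cases reliably.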
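If $var might contain such characters, one possible approach is to escape it first. The sketch below uses a made-up value for var and -E, which both GNU and BSD sed accept for extended regular expressions, and it only prints the result instead of editing a file:

```shell
# Hypothetical replacement value containing characters sed would interpret.
var='2015&10/09'

# Escape backslash, ampersand and the / separator for use in a replacement.
safe=$(printf '%s' "$var" | sed -e 's![\&/]!\\&!g')

line='COPY W:\Interfaces\Payments\Tameia\Unprocessed\X151008\E*.*'

# In double quotes, \\\\ reaches sed as \\ (one literal backslash) and \\1 as \1.
result=$(printf '%s\n' "$line" | sed -E "s/(Unprocessed\\\\X)[0-9]+/\\1$safe/")
printf '%s\n' "$result"
```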

Replacing string consisting of mostly metacharacters

I have a file containing dozens of lines following the below format:
source ../foo/bar
source ../foo1/bar1
source ../foo2/bar2
etc
I've been puzzling through using sed to find the source ../foo/ part and replace it with nothing in order to delete it. I've been reading this very helpful post on the unix/linux Stack Exchange about escaping meta characters and the resulting regex is:
source \.\.\/\.\*?\/
Following the instruction from that post, my complete sed looks like this:
sed -i 's/source \.\.\/\.\*?\///' TARGETFILE
The command completes with no errors, but the file is untouched. I have also tried:
sed -i 's/source \.\.\/\.\*?\//''/' TARGETFILE
I'm sure I'm making an assumption on something or a syntax error, but I'm sure many of you can appreciate the difficulty of spotting the error.
Thank you.
sed -i 's+source \.*/[^/]*/++' TARGETFILE
That deletes the word source and the space after it, followed by any run of dots, a slash, a path component and a slash.
Using + instead of / as a delimiter makes it a bit simpler to match /.
\.* matches a sequence of dots. [^/] matches anything but /.
sed 's+source \.*/[^/]*/++' <<EXAMPLE
source ../foo/bar
source ..../baz/fop
source /buzz/this/that
EXAMPLE
results in
bar
fop
this/that
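As for why the original attempt left the file untouched: in sed's basic regular expressions, \. and \* are a literal dot and a literal asterisk, and a bare ? is a literal question mark, so the pattern only matches the text source ../.*?/ itself. A small sketch to confirm this (the second input line is contrived to contain that literal text):

```shell
attempt='s/source \.\.\/\.\*?\///'

# A normal line: the pattern does not match, so the line passes through unchanged.
unchanged=$(printf 'source ../foo/bar\n' | sed "$attempt")

# A contrived line containing the literal characters .*? does match.
matched=$(printf 'source ../.*?/bar\n' | sed "$attempt")

printf '%s\n%s\n' "$unchanged" "$matched"
```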
To delete those lines having pattern source ../.*/ you can just use d command in sed:
sed -i.bak '\~source \.\./[^/]*/~d' file
Using grep you can avoid using regex:
grep -vF 'source ../' file
