Removing punctuation using sed - bash

I am trying to write a script that removes punctuation from a text file.
I tried using sed; however, I am open to other suggestions (like awk).
This is my code so far:
declare -a marks=('\.' '\,' '\;' '\:')
for i in {0..3}
do
sed -i 's/${marks[i]}//g' test.txt
done
cat test.txt
I think my main problem is that I am not escaping the characters correctly.

The command tr is great for that:
tr -d '[:punct:]' < test.txt > tmp.txt && mv -f tmp.txt test.txt
-d stands for delete.
Choose a tmp.txt that doesn't already exist; one way to generate a temporary file name is mktemp -u.
Here is a small script which removes any punctuation in the files passed as arguments:
#! /bin/bash
t=$(mktemp -u)
for f ; do
tr -d '[:punct:]' < "$f" > "$t" && mv -f "$t" "$f"
done
for f is a shortcut for for f in "$@", which iterates over each argument without word splitting.

Using ed instead:
printf "%s\n" 'g/[[:punct:]]/s/[[:punct:]]//g' w | ed -s test.txt
removes all punctuation characters from a file and saves the remaining text.
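As for the question's own script: the main problem is not the escaping but the quoting. Single quotes keep the shell from expanding ${marks[i]}, so sed receives the literal text ${marks[i]}. A minimal fix of the original loop (a sketch, assuming GNU sed; the single tr or sed call above is still preferable):
declare -a marks=('\.' '\,' '\;' '\:')
for i in {0..3}
do
    # double quotes, not single quotes, so the shell expands ${marks[i]}
    sed -i "s/${marks[i]}//g" test.txt
done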

Related

Updating a config file based on the presence of a specific string

I want to be able to comment and uncomment lines which are "managed" using a bash script.
I am trying to write a script which will update all of the config lines that have the word #managed after them, removing the preceding # if it exists.
The rest of the config file needs to be left unchanged. The config file looks like this:
configFile.txt
#config1=abc #managed
#config2=abc #managed
config3=abc #managed
config3=abc
This is the script I have created so far. It iterates the file, finds lines which contain "#managed" and detects if they are currently commented.
I need to then write this back to the file, how do I do that?
manage.sh
#!/bin/bash
while read line; do
STR='#managed'
if grep -q "$STR" <<< "$line"; then
echo "debug - this is managed"
firstLetter=${line:0:1}
if [ "$firstLetter" = "#" ]; then
echo "Remove the initial # from this line"
fi
fi
echo "$line"
done < configFile.txt
With your approach, using grep and sed:
str='#managed$'
file=ConfigFile.txt
grep -q "^#.*$str" "$file" && sed "/^#.*$str/s/^#//" "$file"
Looping through files ending in *.txt
#!/usr/bin/env bash
str='#managed$'
for file in *.txt; do
grep -q "^#.*$str" "$file" &&
sed "/^#.*$str/s/^#//" "$file"
done
In-place editing with sed requires the -i option, but its syntax varies between versions: GNU sed accepts a bare -i, while BSD sed requires an argument such as -i '' (or -i.bak to keep a backup).
On a Mac, ed should be installed by default, so just replace the sed part with:
printf '%s\n' "g/^#.*$str/s/^#//" ,p Q | ed -s "$file"
Replace the Q with w to actually write back the changes to the file.
Remove the ,p if no output to stdout is needed/required.
On a side note, embedding grep and sed in a shell loop that reads a text file line by line is considered bad practice by shell users/developers/coders. If the file has 100k lines, then grep and sed each have to run 100k times too!
This sed one-liner should do the trick:
sed -i.orig '/#managed/s/^#//' configFile.txt
It deletes the # character at the beginning of the line if the line contains the string #managed.
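Applied to the sample configFile.txt above (with the original preserved in configFile.txt.orig), the result is:
config1=abc #managed
config2=abc #managed
config3=abc #managed
config3=abc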
I wouldn't do it in bash (because that would be slower than sed or awk, for instance), but if you want to stick with bash:
#! /bin/bash
while IFS= read -r line; do
if [[ $line = *'#managed'* && ${line:0:1} = '#' ]]; then
line=${line:1}
fi
printf '%s\n' "$line"
done < configFile.txt > configFile.tmp
mv configFile.txt configFile.txt.orig && mv configFile.tmp configFile.txt
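For completeness, the same logic as a single awk pass (a sketch; plain awk has no in-place editing, so it writes to a temp file first):
awk '/#managed/ { sub(/^#/, "") } { print }' configFile.txt > configFile.tmp &&
mv configFile.tmp configFile.txt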

Optimize shell script for multiple sed replacements

I have a file containing a list of replacement pairs (about 100 of them) which are used by sed to replace strings in files.
The pairs go like:
old|new
tobereplaced|replacement
(stuffiwant).*(too)|\1\2
and my current code is:
cat replacement_list | while read i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
done
I cannot help but think that there is a more optimal way of performing the replacements. I tried turning the loop around to run through lines of the file first but that turned out to be much more expensive.
Are there any other ways of speeding up this script?
EDIT
Thanks for all the quick responses. Let me try out the various suggestions before choosing an answer.
One thing to clear up: I also need subexpressions/groups functionality. For example, one replacement I might need is:
([0-9])U|\10 #the extra brackets and escapes were required for my original code
Some details on the improvements (to be updated):
Method: processing time
Original script: 0.85s
cut instead of awk: 0.71s
anubhava's method: 0.18s
chthonicdaemon's method: 0.01s
You can use sed to produce correctly formatted sed input:
sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
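To see the generated script, run the first sed on its own: each old|new line becomes a complete substitution command, which the second sed then reads from standard input via -f -. Given the replacement_list above:
$ sed -e 's/^/s|/; s/$/|g/' replacement_list
s|old|new|g
s|tobereplaced|replacement|g
s|(stuffiwant).*(too)|\1\2|g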
I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe, and a probably less widely known MySQL command-line utility, replace. Since replace is optimized for string replacements, it was almost an order of magnitude faster than sed. The results looked something like this (slowest first):
custom program > sed > LANG=C sed > perl > LANG=C perl > replace
If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.
From replace.c:
Replace strings in textfile
This program replaces strings in files or from stdin to stdout. It accepts a list of from-string/to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred before shorter matches.
...
The program makes a DFA state machine of the strings, and the speed isn't dependent on the count of replace-strings (only on the number of replaces). A line is assumed to end with \n or \0. There is no limit except memory on the length of strings.
More on sed: you can utilize multiple cores with sed by splitting your replacements into #cpus groups and then piping them through chained sed commands, something like this:
$ sed -e 's/A/B/g; ...' file.txt | \
sed -e 's/B/C/g; ...' | \
sed -e 's/C/D/g; ...' | \
sed -e 's/D/E/g; ...' > out
Also, if you use sed or perl and your system has a UTF-8 locale, placing LANG=C in front of the commands also boosts performance:
$ LANG=C sed ...
You can cut out the unnecessary awk invocations and use bash itself to split the name-value pairs:
while IFS='|' read -r old new; do
# echo "$old :: $new"
sed -i "s~$old~$new~g" file
done < replacement_list
IFS='|' enables read to populate the name and value into the two shell variables old and new.
This is assuming ~ is not present in your name-value pairs. If that is not the case then feel free to use an alternate sed delimiter.
Here is what I would try:
store your sed search-and-replace pairs in a Bash array;
build your sed command from this array using parameter expansion;
run the command.
patterns=(
old new
tobereplaced replacement
)
pattern_count=${#patterns[*]} # number of array elements (two per pair)
sedArgs=() # will hold the list of sed arguments
for (( i=0 ; i<$pattern_count ; i=i+2 )); do # don't need to loop on the replacement…
search=${patterns[i]};
replace=${patterns[i+1]}; # … here we got the replacement part
sedArgs+=(-e "s/$search/$replace/g") # append two array elements, not a string
done
sed "${sedArgs[@]}" file
This results in the following command:
sed -e s/old/new/g -e s/tobereplaced/replacement/g file
You can try this.
pattern=''
# read the file directly; a cat | while pipeline would run the loop
# in a subshell and lose $pattern before the final sed
while read -r i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
pattern=${pattern}"s/${old}/${new}/g;"
done < replacement_list
sed -r "$pattern" -i file
This will run the sed command only once on the file with all the replacements. You may also want to replace awk with cut; cut may be better optimized than awk, though I am not sure about that.
old=`echo "$i" | cut -d"|" -f1`
new=`echo "$i" | cut -d"|" -f2`
You might want to do the whole thing in awk:
awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file
Build up a list of old and new words from the first file. The next ensures that the rest of the script isn't run on the first file. For the second file, loop through the list of replacements and perform them one by one. The 1 at the end means the line is printed.
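Note that gsub treats each old[i] as a regular expression, which matches the question's needs. Since plain awk cannot edit a file in place (GNU awk 4.1+ has -i inplace), redirect to a temporary file and move it back, e.g.:
awk -F'|' 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file > file.new &&
mv file.new file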
{ cat replacement_list;echo "-End-"; cat YourFile; } | sed -n '1,/-End-/ s/$/³/;1h;1!H;$ {g
t again
:again
/^-End-³\n/ {s///;b done
}
s/^\([^|]*\)|\([^³]*\)³\(\n\)\(.*\)\1/\1|\2³\3\4\2/
t again
s/^[^³]*³\n//
t again
:done
p
}'
This one is more for fun, coded purely in sed. It may be worth timing, since it starts only one sed process, which works recursively.
It is written for POSIX sed (so use --posix with GNU sed).
Explanation:
copy the replacement list in front of the file content, with delimiters (³ at the end of each list line, -End- at the end of the list) for easier sed handling (it is hard to use \n inside a character class in POSIX sed)
load all lines into the buffer (the delimited replacement list and the -End- marker come first)
if the leading line is -End-³, remove it and go to the final print
replace each occurrence of the first pattern (group 1) found in the text with the second pattern (group 2)
if one was found, restart (t again)
otherwise remove the first list line
and restart the process (t again); t is used because b does not reset the test flag, so a later t would always be true
Thanks to @miku above;
I have a 100MB file with a list of 80k replacement-strings.
I tried various combinations of seds, run sequentially or in parallel, but couldn't see the projected runtime dropping below about 20 hours.
Instead I put my list into a sequence of scripts like "cat in | replace aold anew bold bnew cold cnew ... > out ; rm in ; mv out in".
I randomly picked 1000 replacements per file, so it all went like this:
# first, split my replace-list into manageable chunks (89 files in this case)
split -a 4 -l 1000 80kReplacePairs rep_
# next, make a 'replace' script out of each chunk
for F in rep_* ; do \
echo "create and make executable a scriptfile" ; \
echo '#!/bin/sh' > run_$F.sh ; chmod +x run_$F.sh ; \
echo "for each chunk-file line, strip line-ends," ; \
echo "then with sed, turn '{long list}' into 'cat in | {long list}' > out" ; \
cat $F | tr '\n' ' ' | sed 's/^/cat in | replace /;s/$/ > out/' >> run_$F.sh ;
echo "and append commands to switch in and out files, for next script" ; \
echo -e " && \\\\ \nrm in && mv out in\n" >> run_$F.sh ; \
done
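Each generated run_rep_*.sh then looks roughly like this (a sketch; the full 1000-pair argument list is elided):
#!/bin/sh
cat in | replace aold anew bold bnew ... > out && \
rm in && mv out in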
# put all the replace-scripts in sequence into a main script
ls ./run_rep_aa* > allrun.sh
# make it executable
chmod +x allrun.sh
# run it
nohup ./allrun.sh &
... which ran in under 5 minutes, a lot less than 20 hours!
Looking back, I could have used more pairs per script by working out how many lines fit under the argument-length limit.
xargs --show-limits </dev/null 2>&1 | grep --color=always "actually use:"
Maximum length of command we could actually use: 2090490
So just under 2MB; how many pairs would that be for my script?
head -c 2090490 80kReplacePairs | wc -l
76923
So it seems I could have used 2 * 40000-line chunks.
To expand on chthonicdaemon's solution:
#! /bin/sh
# build regex from text file
REGEX_FILE=some-patch.regex.diff
# test
# set these with "export key=val"
SOME_VAR_NAME=hello
ANOTHER_VAR_NAME=world
escape_b() {
echo "$1" | sed 's,/,\\/,g'
}
regex="$(
(echo; cat "$REGEX_FILE"; echo) \
| perl -p -0 -e '
s/\n#[^\n]*/\n/g;
s/\(\(SOME_VAR_NAME\)\)/'"$(escape_b "$SOME_VAR_NAME")"'/g;
s/\(\(ANOTHER_VAR_NAME\)\)/'"$(escape_b "$ANOTHER_VAR_NAME")"'/g;
s/([^\n])\//\1\\\//g;
s/\n-([^\n]+)\n\+([^\n]*)(?:\n\/([^\n]+))?\n/s\/\1\/\2\/\3;\n/g;
'
)"
echo "regex:"; echo "$regex" # debug
exec perl -00 -p -i -e "$regex" "$@"
Prefixing lines with -, + or / allows empty "plus" values, and protects leading whitespace from buggy text editors.
sample input: some-patch.regex.diff
# file format is similar to diff/patch
# this is a comment
# replace all "a/a" with "b/b"
-a/a
+b/b
/g
-a1|a2
+b1|b2
/sg
# this is another comment
-(a1).*(a2)
+b\1b\2b
-a\na\na
+b
-a1-((SOME_VAR_NAME))-a2
+b1-((ANOTHER_VAR_NAME))-b2
sample output
s/a\/a/b\/b/g;
s/a1|a2/b1|b2/sg;
s/(a1).*(a2)/b\1b\2b/;
s/a\na\na/b/;
s/a1-hello-a2/b1-world-b2/;
this regex format is compatible with sed and perl
since miku mentioned mysql replace:
replacing fixed strings with regex is non-trivial,
since you must escape all regex chars,
but you also must handle backslash escapes ...
naive escaper:
echo '\(\n' | perl -p -e 's/([.+*?()\[\]])/\\\1/g'
\\(\n

Using sed in a for loop with variables and regex

I'm trying to build a script where a portion of it utilizes 'sed' to tag the filename onto the end of each line in that file, then dumps the output to a master list.
The part of the script giving me trouble is sed here:
DIR=/var/www/flatuser
FILES=$DIR/*
for f in $FILES
do
echo "processing $f file...."
sed -i "s/$/:$f/" $f
cat $f >> $DIR/master.txt
done
The issue is that the 'sed' statement works fine outside of the for loop, but when I place it in the script, I believe it's having issues interpreting the dollar signs. I've tried nearly every combination of " and ' that I can think of to get it to interpret the variable, and it continuously either puts a literal "$f" at the end of each line or fails outright.
Thanks for any input!
You just need to escape the dollar sign:
sed -i "s/\$/:$f/" "$f"
so that the shell passes it literally to sed.
To expand on Charles Duffy's point about quoting variables:
DIR=/var/www/flatuser
for f in "$DIR"/*
do
echo "processing $f file...."
sed -i "s/\$/:${f##*/}/" "$f"
cat "$f" >> "$DIR/master.txt"
done
If any file names contain a space, it's too late to do anything about it if you assign the list of file names to $FILES; you can no longer distinguish between spaces that belong to file names and spaces that separate file names. You could use an array instead, but it's simpler to just put the glob directly in the for loop. Here's how you would use an array:
DIR=/var/www/flatuser
FILES=( "$DIR"/* )
for f in "${FILES[#]}"
do
echo "processing $f file...."
sed -i "s/\$/:${f##*/}/" "$f"
cat "$f" >> "$DIR/master.txt"
done
For versions of sed that don't support -i, here's a way to explicitly handle the temp file needed to simulate in-place editing (the temp file is only removed if the mv never happens):
t=$(mktemp sXXXX); sed "s/\$/:$f/" "$f" > "$t" && mv "$t" "$f" || rm -f "$t"
Personally, I'd do this like so:
dir=/var/www/flatuser
for f in "$dir"/*; do
[[ $f = */master.txt ]] && continue
while read -r; do printf '%s:%s\n' "$REPLY" "${f##*/}"; done <"$f"
done >/var/www/flatuser/master.txt
It doesn't modify your files in place the way sed -i does, so it's safe to run more than once (the sed -i version appends the filename to each line every time it runs, so you'd end up with multiple copies of the filename per line).
Also, sed -i isn't specified by POSIX, so not all operating systems will have it.
The problem is NOT the dollar sign. It's that the variable $f contains a "/" character, and sed is using that to separate expressions. Try using "#" as the separator.
DIR=/var/www/flatuser
FILES=$DIR/*
for f in $FILES
do
echo "processing $f file...."
sed -i s#"$"#:"$f"# $f
cat $f >> $DIR/master.txt
done
It's old, but maybe it helps someone.
Why not basename the file to get rid of leading directory
DIR=/var/www/flatuser
FILES=( "$DIR"/* )
for f in "${FILES[#]}"
do
echo "processing $f file...."
b=$(basename "$f")
sed -i "s/\$/:$b/" "$f"   # edit the original path; $b is just the name to append
cat "$f" >> "$DIR/master.txt"
done
not tested ...

Recursive BASH renaming

EDIT: Ok, I'm sorry, I should have specified that I was on Windows, and using win-bash, which is based on bash 1.14.2, along with the gnuwin32 tools. This means all of the solutions posted unfortunately didn't help out. It doesn't contain many of the advanced features. I have however figured it out finally. It's an ugly script, but it works.
#!/bin/bash
function readdir
{
cd "$1"
for infile in *
do
if [ -d "$infile" ]; then
readdir "$infile"
else
renamer "$infile"
fi
done
cd ..
}
function renamer
{
#replace " - " with a single underscore.
NEWFILE1=`echo "$1" | sed 's/\s-\s/_/g'`
#replace spaces with underscores
NEWFILE2=`echo "$NEWFILE1" | sed 's/\s/_/g'`
#replace "-" dashes with underscores.
NEWFILE3=`echo "$NEWFILE2" | sed 's/-/_/g'`
#remove exclamation points
NEWFILE4=`echo "$NEWFILE3" | sed 's/!//g'`
#remove commas
NEWFILE5=`echo "$NEWFILE4" | sed 's/,//g'`
#remove single quotes
NEWFILE6=`echo "$NEWFILE5" | sed "s/'//g"`
#replace & with _and_
NEWFILE7=`echo "$NEWFILE6" | sed "s/&/_and_/g"`
#remove single quotes
NEWFILE8=`echo "$NEWFILE7" | sed "s/’//g"`
mv "$1" "$NEWFILE8"
}
for infile in *
do
if [ -d "$infile" ]; then
readdir "$infile"
else
renamer "$infile"
fi
done
ls
I'm trying to create a bash script to recurse through a directory and rename files, to remove spaces, dashes and other characters. I've gotten the script working fine for what I need, except for the recursive part of it. I'm still new to this, so it's not as efficient as it should be, but it works. Anyone know how to make this recursive?
#!/bin/bash
for infile in *.*;
do
#replace " - " with a single underscore.
NEWFILE1=`echo $infile | sed 's/\s-\s/_/g'`;
#replace spaces with underscores
NEWFILE2=`echo $NEWFILE1 | sed 's/\s/_/g'`;
#replace "-" dashes with underscores.
NEWFILE3=`echo $NEWFILE2 | sed 's/-/_/g'`;
#remove exclamation points
NEWFILE4=`echo $NEWFILE3 | sed 's/!//g'`;
#remove commas
NEWFILE5=`echo $NEWFILE4 | sed 's/,//g'`;
mv "$infile" "$NEWFILE5";
done;
find is the command for walking a filesystem hierarchy. You can use it to execute a command on every file found, or pipe the results to xargs, which will handle the execution part.
Take care with file names containing whitespace: the unquoted expansions in your loop (echo $infile and the backtick substitutions) will mangle them, and for infile in *.* skips files without a dot. Check the -print0 option of find, coupled with the -0 option of xargs.
All those semicolons are superfluous, and there's no reason to use all those variables. If you want to put the sed commands on separate lines and intersperse detailed comments, you can still do that:
#!/bin/bash
find . | while read -r file
do
newfile=$(echo "$file" | sed '
#replace " - " with a single underscore.
s/\s-\s/_/g
#replace spaces with underscores
s/\s/_/g
#replace "-" dashes with underscores.
s/-/_/g
#remove exclamation points
s/!//g
#remove commas
s/,//g')
mv "$infile" "$newfile"
done
This is much shorter:
#!/bin/bash
find . | while read -r file
do
# replace " - " or space or dash with underscores
# remove exclamation points and commas
newfile=$(echo "$file" | sed 's/\s-\s/_/g; s/\s/_/g; s/-/_/g; s/!//g; s/,//g')
mv "$infile" "$newfile"
done
Shorter still:
#!/bin/bash
find . | while read -r file
do
# replace " - " or space or dash with underscores
# remove exclamation points and commas
newfile=$(echo "$file" | sed 's/\s-\s/_/g; s/[[:space:]-]/_/g; s/[!,]//g')
mv "$file" "$newfile"
done
In bash 4, setting the globstar option allows recursive globbing.
shopt -s globstar
for infile in **
...
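Fleshed out, a sketch reusing the renamer function from the question (the subshell cd keeps renamer working on bare file names, so directory separators are never rewritten):
shopt -s globstar
for infile in **; do
    if [ -f "$infile" ]; then
        # cd in a subshell so renamer sees only the file name
        ( cd "$(dirname "$infile")" && renamer "$(basename "$infile")" )
    fi
done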
Otherwise, use find.
while read infile
do
...
done < <(find ...)
or
find ... -exec ...
I've used 'find' in the past to locate files, then had it execute another application.
See '-exec'.
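For example, to run a (hypothetical) per-file rename script on every regular file:
find . -type f -exec ./renamer.sh {} \;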
rename 's/pattern/replacement/' glob_pattern
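With the Perl-based rename, the substitutions from this thread collapse into one expression; as a sketch, -n previews the renames without performing them:
rename -n 'y/ -/__/; s/[!,]//g' *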

Split string by newline and space in Bourne shell

I'm currently using the following to split a file into words. Is there some quicker way?
while read -r line
do
for word in $line
do
words="${words}\n${word}"
done
done
What about using tr?
tr -s '[:space:]' '\n' < myfile.txt
The -s squeezes multiple whitespace characters into one.
xargs -n 1 echo <myfile.txt
sed 's/[[:space:]]/\n/g' file.txt
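Note that \n in a sed replacement is a GNU extension. An awk alternative (a sketch) that works with any POSIX awk and prints one word per line, squeezing runs of whitespace by construction:
awk '{ for (i = 1; i <= NF; i++) print $i }' file.txt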
