How to search and replace text in an xml file with SED? - bash

I have to convert a list of xml files in a folder from UTF-16 to UTF-8, remove the BOM, and then replace the keyword inside the file from UTF-16 to UTF-8.
I'm using cygwin to run a bash shell script to accomplish this, but I've never worked with SED before today and I need help!
I found a SED one liner for removing the BOM, now I need another for replacing the text from UTF-16 to UTF-8 in the xml header.
This is what I have so far:
#!/bin/bash
mkdir -p outUTF8
#Convert files to unix format.
find -exec dos2unix {} \;
#Use a for loop to convert all the xml files.
for f in `ls -1 *.xml`; do
sed -i -e '1s/^\xEF\xBB\xBF//' FILE
iconv -f utf-16 -t utf-8 $f > outUTF8/$f
sed 's/UTF-16/UTF-8/g' $f > outUTF8/$f
echo $f
done
However, this line:
sed 's/UTF-16/UTF-8/g' $f > outUTF8/$f
is hanging the script. Any ideas as to the proper format for this?

Try something like this -
for filename in *.xml; do
sed -i".bak" -e '1s/^\xEF\xBB\xBF//' "$filename"
iconv -f utf-16 -t utf-8 "$filename" > outUTF8/"$filename"
sed -i 's/UTF-16/UTF-8/g' outUTF8/"$filename"
done
The first sed backs up each original file with a .bak extension before editing it in place. iconv then converts the file and writes the result into the outUTF8 directory (which must already exist) under the same filename. Finally, the second sed edits the converted copy in place, replacing the text UTF-16 with UTF-8.
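One point worth noting: the bytes EF BB BF are the UTF-8 byte order mark, while a UTF-16 file starts with FE FF or FF FE, so it may make more sense to strip the BOM after the conversion rather than before. A minimal sketch of that ordering, assuming iconv and GNU sed are available (convert_xml_to_utf8 is a hypothetical helper name, not part of the original answer):

```shell
# Hypothetical helper: convert one UTF-16 XML file to UTF-8 in another directory.
# Assumes iconv and GNU sed (for -i and the \xHH escapes) are installed.
convert_xml_to_utf8() {
  src=$1
  outdir=$2
  dest=$outdir/$(basename "$src")
  mkdir -p "$outdir" || return 1
  # Convert first; the source file itself is never modified.
  iconv -f UTF-16 -t UTF-8 "$src" > "$dest" || return 1
  # On the converted copy: drop a UTF-8 BOM if one is present, then fix the
  # encoding name in the XML declaration on line 1 only.
  sed -i -e '1s/^\xEF\xBB\xBF//' -e '1s/UTF-16/UTF-8/' "$dest"
}
```

It could be called from the loop as convert_xml_to_utf8 "$filename" outUTF8. Restricting the substitution to line 1 avoids touching any other occurrence of UTF-16 in the document body.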

Two things:
How big is your $f file? If it's really big, it may just take a long time to complete.
Oops, I see you have an echo $f at the bottom of your loop. Move it before the sed command so you can see whether there are any spaces in the filenames.
2a. Or just change all references to $f to "$f" to protect against spaces.
I hope this helps.
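To see why the quotes matter, here is a small illustration, assuming the default IFS (count_args is a hypothetical helper that only reports how many arguments it received):

```shell
# count_args is a hypothetical helper that prints how many arguments it received.
count_args() { echo "$#"; }

f='my file.xml'
count_args $f     # unquoted: the shell splits on the space and passes 2 arguments
count_args "$f"   # quoted: the name stays intact as 1 argument
```

With an unquoted $f, a command such as sed or iconv would receive my and file.xml as two separate file names.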

Related

sh script to replace text in multiple files

I am trying to replace every occurrence in a .prm file of the string "/net/origin/devdata1/slin" with "/tools/common/test/HATS" in over a hundred files using sed. I think I am having trouble with the proper syntax for a for loop that loops through different files in a directory(/home/AutoTest), and what/if I need as command line arguments. Thanks in advance.
OLD="/net/origin/devdata1/slin"
NEW="/toolscommon/test/HATS"
DIR="/home/AutoTest"
for f in $DIR
do
cp $f $f.bak
sed 's+$OLD+$NEW+g' $f.bak > $f
[ -f "$f" ]
rm -f $f.bak
done
Using sed
Try:
old="/net/origin/devdata1/slin"
new="/tools/common/test/HATS"
dir="/home/AutoTest"
sed -i "s+$old+$new+g" "$dir"/*
sed -i will update files in-place.
Also, it is best practice to use lower or mixed case names for your shell variables. The system uses all-caps names for its variables and you don't want to accidentally overwrite one of them.
There is a potential danger here if old or new contains any sed-active characters, such as the + delimiter, & or regex metacharacters. If so, arbitrary files could be deleted or mangled.
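One way to reduce that risk is to escape both strings before handing them to sed. The following is a sketch only, assuming basic regular expressions and / as the s delimiter; escape_sed_pattern and escape_sed_replacement are hypothetical helper names, and strings containing newlines are not handled:

```shell
# Hypothetical helpers: escape a literal string for use in a sed s/// command
# that uses "/" as its delimiter and basic regular expressions.
escape_sed_pattern() {
  # Backslash-escape ] [ \ / . * ^ $ so they match themselves.
  printf '%s\n' "$1" | sed -e 's|[][\\/.*^$]|\\&|g'
}
escape_sed_replacement() {
  # In the replacement part only \ / and & are special.
  printf '%s\n' "$1" | sed -e 's|[\\/&]|\\&|g'
}

old='/net/origin/devdata1/slin'
new='/tools/common/test/HATS'
printf '%s\n' 'path=/net/origin/devdata1/slin/x' |
  sed "s/$(escape_sed_pattern "$old")/$(escape_sed_replacement "$new")/g"
```

The same two command substitutions can be dropped into the sed -i call in place of the raw variables.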
Using awk
old="/net/origin/devdata1/slin"
new="/tools/common/test/HATS"
dir="/home/AutoTest"
awk -i inplace -v old="$old" -v new="$new" '{gsub(old, new)} 1' "$dir"/*
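Because awk receives old and new as data through -v rather than as part of the program text, this is safer than the sed version. Note that gsub still interprets old as a regular expression and & in new as the matched text, and that -i inplace requires GNU awk 4.1 or later.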
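If both strings should be treated as purely literal text, one option is to avoid gsub and search with index() instead. This is a sketch under that assumption, not the answer's method; replace_literal is a hypothetical name, and it filters standard input rather than editing files in place:

```shell
# Hypothetical helper: replace every literal occurrence of $1 with $2 on stdin.
# The strings are passed through the environment so awk never parses them as
# regular expressions or escape sequences.
replace_literal() {
  old=$1 new=$2 awk '
    BEGIN { old = ENVIRON["old"]; new = ENVIRON["new"]; n = length(old) }
    {
      out = ""
      # Copy text up to each match, append the replacement, continue after it.
      while (n > 0 && (i = index($0, old)) > 0) {
        out = out substr($0, 1, i - 1) new
        $0 = substr($0, i + n)
      }
      print out $0
    }'
}

printf '%s\n' 'root=/net/origin/devdata1/slin/data' |
  replace_literal '/net/origin/devdata1/slin' '/tools/common/test/HATS'
```

To update a file with it, the output would need to go to a temporary file that then replaces the original.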

Iterating through files in a folder with sed

I have a list of CSV files and would like to use a for loop to edit the content of each file. I'd like to do that with sed. I have this sed command, which works fine when testing it on one file:
sed 's/[ "-]//g'
So now I want to execute this command for each file in a folder. I've tried this but so far no luck:
for i in *.csv; do sed 's/[ "-]//g' > $i.csv; done
I would like it to overwrite each file with the edit performed by sed. The sed command removes all spaces, the " character and the - character.
Small changes,
for i in *.csv
do
sed -i 's/[ "-]//g' "$i"
done
Changes:
When you iterate with the for loop, you get the filenames in $i, for example one.csv, two.csv, etc. You can use these directly as input to the sed command.
-i is for in-place changes: sed does the substitution and updates the file for you. No output redirection is required.
In the code you wrote, I guess you forgot to give sed any input file.
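Where sed has no -i option, the same effect can be had with a temporary file. Redirecting sed's output straight back onto the file it is reading would truncate that file before sed reads it, so the edited copy has to be written elsewhere first. A minimal sketch, assuming mktemp is available (strip_csv_chars is a hypothetical helper name):

```shell
# Hypothetical helper: apply the sed expression to each file given as an
# argument, for systems where sed has no -i option.
strip_csv_chars() {
  for file in "$@"; do
    tmp=$(mktemp) || return 1
    # Write the edited copy first, then overwrite the original only on success.
    if sed 's/[ "-]//g' "$file" > "$tmp"; then
      cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
  done
}
```

It would be called as strip_csv_chars *.csv. Copying the contents back with cat, rather than moving the temporary file, keeps the original file's permissions.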
In my case I wanted to replace the first occurrence of a particular string in each line of several text files. I used the following:
# replace 16 with 1 in each file, first occurrence on each line only
sed -i 's/16/1/' *.txt
In your case, in a terminal you can try this:
sed 's/[ "-]//g' *.csv
In certain scenarios it might be worth considering finding the files and executing a command on them, as explained in this answer (as stated there, make sure echo $PATH doesn't contain .)
find /path/to/csv/ -type f -name '*.csv' -execdir sed -i 's/[ "-]//g' {} \;
Here we:
find all files (-type f) whose names end with .csv under the folder /path/to/csv/
run sed on the found files in place, i.e. we replace the original files with the changed version instead of creating additional csv files ($i.csv)
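A variation on the same idea, sketched here with -exec ... {} + so that sed is started once for many files rather than once per file. It assumes GNU sed (for -i without a suffix); clean_csv_tree is a hypothetical helper name:

```shell
# Hypothetical helper: clean every .csv file under the directory given as $1.
# Assumes GNU sed, where -i takes no mandatory suffix argument.
clean_csv_tree() {
  find "$1" -type f -name '*.csv' -exec sed -i 's/[ "-]//g' {} +
}
```

Called as clean_csv_tree /path/to/csv, it also reaches files in subdirectories, which the plain for loop over *.csv does not.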

Replacing Foreign Characters with English Equivalents in filenames with UNIX Bash Script

I'm trying to use sed to process a list of filenames and replace every foreign character in the file name with an English equivalent. E.g.
málaga.txt -> malaga.txt
My script is the following:
for f in *.txt
do
newf=$(echo $f | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/')
mv $f $newf
done
This currently has no effect on the filenames. However, if I use the same sed command to process a text file, e.g.
cat blah.txt | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/'
it works perfectly: all foreign characters are substituted with their English equivalents. Any help would be greatly appreciated. This is on Mac OS X in a UNIX shell.
This should do it:
for f in *.txt; do
newf=$(echo "$f" | iconv -f utf-8-mac -t utf-8 | sed 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/')
mv "$f" "$newf"
done
iconv -f utf-8-mac -t utf-8 converts the text from utf-8-mac to utf-8, which resolves the precomposed/decomposed problem discussed in the comments by @PavelGurkov and @ninjalj.

bash removing part of a file name

I have the following files in the following format:
$ ls CombinedReports_LLL-*'('*.csv
CombinedReports_LLL-20140211144020(Untitled_1).csv
CombinedReports_LLL-20140211144020(Untitled_11).csv
CombinedReports_LLL-20140211144020(Untitled_110).csv
CombinedReports_LLL-20140211144020(Untitled_111).csv
CombinedReports_LLL-20140211144020(Untitled_12).csv
CombinedReports_LLL-20140211144020(Untitled_13).csv
CombinedReports_LLL-20140211144020(Untitled_14).csv
CombinedReports_LLL-20140211144020(Untitled_15).csv
CombinedReports_LLL-20140211144020(Untitled_16).csv
CombinedReports_LLL-20140211144020(Untitled_17).csv
CombinedReports_LLL-20140211144020(Untitled_18).csv
CombinedReports_LLL-20140211144020(Untitled_19).csv
I would like this part removed:
20140211144020 (this is the timestamp the reports were run so this will vary)
and end up with something like:
CombinedReports_LLL-(Untitled_1).csv
CombinedReports_LLL-(Untitled_11).csv
CombinedReports_LLL-(Untitled_110).csv
CombinedReports_LLL-(Untitled_111).csv
CombinedReports_LLL-(Untitled_12).csv
CombinedReports_LLL-(Untitled_13).csv
CombinedReports_LLL-(Untitled_14).csv
CombinedReports_LLL-(Untitled_15).csv
CombinedReports_LLL-(Untitled_16).csv
CombinedReports_LLL-(Untitled_17).csv
CombinedReports_LLL-(Untitled_18).csv
CombinedReports_LLL-(Untitled_19).csv
I was thinking simply along the lines of the mv command, maybe something like this:
$ ls CombinedReports_LLL-*'('*.csv
but maybe a sed command or other would be better
rename is part of the perl package. It renames files according to perl-style regular expressions. To remove the dates from your file names:
rename 's/[0-9]{14}//' CombinedReports_LLL-*.csv
If rename is not available, sed+shell can be used:
for fname in Combined*.csv ; do mv "$fname" "$(echo "$fname" | sed -r 's/[0-9]{14}//')" ; done
The above loops over each of your files. For each file, it performs a mv command: mv "$fname" "$(echo "$fname" | sed -r 's/[0-9]{14}//')" where, in this case, sed is able to use the same regular expression as the rename command above. s/[0-9]{14}// tells sed to look for 14 digits in a row and replace them with an empty string.
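Before renaming many files it can help to preview the result and to refuse to overwrite an existing file. A sketch of the same sed-based loop with those two safeguards, assuming a sed that supports -E (remove_timestamp and the DRY_RUN variable are hypothetical names introduced here):

```shell
# Hypothetical helper: rename the files given as arguments by removing the
# first run of 14 digits. With DRY_RUN=1 it only prints what it would do.
remove_timestamp() {
  for fname in "$@"; do
    newname=$(printf '%s\n' "$fname" | sed -E 's/[0-9]{14}//')
    [ "$fname" = "$newname" ] && continue      # no timestamp found
    if [ -e "$newname" ]; then
      echo "skipping $fname: $newname already exists" >&2
      continue
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "mv $fname $newname"
    else
      mv -- "$fname" "$newname"
    fi
  done
}
```

In bash, DRY_RUN=1 remove_timestamp Combined*.csv would print the planned renames, and remove_timestamp Combined*.csv would then perform them.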
Without using any other tools like rename or sed and sticking strictly to bash alone:
for f in CombinedReports_LLL-*.csv
do
newName=${f/LLL-*\(/LLL-(}
mv -i "$f" "$newName"
done
Or, using substring offsets (the prefix is 20 characters long and the timestamp 14):
for f in CombinedReports_LLL-* ; do
b=${f:0:20}${f:34:500}
mv "$f" "$b"
done
You can try line by line on shell:
f="CombinedReports_LLL-20140211144020(Untitled_11).csv"
b=${f:0:20}${f:34:500}
echo $b
You can use the rename utility for this. It uses syntax much like sed to change filenames. The following example (from the rename man-page) shows how to remove the trailing '.bak' extension from a list of backup files in the local directory:
rename 's/\.bak$//' *.bak
I'm using the advice given in the top response and have put the following line into a shell script:
ls *.nii | xargs rename 's/[f_]{2}//' f_0*.nii
In terminal, this line works perfectly, but in my script it will not execute and reads * as a literal part of the file name.

sed command to fix filenames in a directory

I run a script which generated about 10k files in a directory. I just discovered that there is a bug in the script which causes some filenames to have a carriage return (presumably a '\n' character).
I want to run a sed command to remove the carriage return from the filenames.
Anyone knows which params to pass to sed to clean up the filenames in the manner described?
I am running Linux (Ubuntu)
I don't know how sed would do this, but this Python script should do the trick. This isn't sed, but I find Python a lot easier to use for things like this:
#!/usr/bin/env python
import os

# Strip carriage returns and newlines from every filename in the current directory.
for name in os.listdir('.'):
    cleaned = name.replace('\r', '').replace('\n', '')
    os.rename(name, cleaned)
    print('Processed ' + cleaned)
It strips any occurrences of both \r and \n from all of the filenames in a given directory.
To run it, save it somewhere, cd into your target directory (with the files to be processed), and run python /path/to/the/file.py.
Also, if you plan on doing more batch renaming, consider Métamorphose. It's a really nice and powerful GUI for this stuff. And, it's free!
Good luck!
Actually, try this: cd into the directory, type in python, and then just paste this in:
exec("import os\nfor file in os.listdir('.'):\n os.rename(file, file.replace('\\r', '').replace('\\n', ''))\n print('Processed ' + file.replace('\\r', '').replace('\\n', ''))")
It's a one-line version of the previous script, and you don't have to save it.
Version 2, with space replacement powers:
#!/usr/bin/env python
import os

# Strip carriage returns and newlines, and replace spaces with underscores.
for name in os.listdir('.'):
    cleaned = name.replace('\r', '').replace('\n', '').replace(' ', '_')
    os.rename(name, cleaned)
    print('Processed ' + cleaned)
And here's the one-liner:
exec("import os\nfor file in os.listdir('.'):\n os.rename(file, file.replace('\\r', '').replace('\\n', '').replace(' ', '_'))\n print('Processed ' + file.replace('\\r', '').replace('\\n', '').replace(' ', '_'))")
If there are no spaces in your filenames, you can do:
for f in *$'\n'; do mv "$f" $f; done
It won't work if the newlines are embedded, but it will work for trailing newlines.
If you must use sed:
for f in *$'\n'; do mv "$f" "$(echo "$f" | sed '/^$/d')"; done
Using the rename Perl script:
rename 's/\n//g' *$'\n'
or the util-linux-ng utility:
rename $'\n' '' *$'\n'
If the character is a return instead of a newline, change the \n or ^$ to \r in any places they appear above.
The reason you aren't getting any pure-sed answers is that fundamentally sed edits file contents, not file names; thus the answers that use sed all do something like echo the filename into a pipe (pseudo file), edit that with sed, then use mv to turn that back into a filename.
Since sed is out, here's a pure-bash version to add to the Perl, Python, etc scripts you have so far:
killpattern=$'[\r\n]' # remove both carriage returns and linefeeds
for f in *; do
if [[ "$f" == *$killpattern* ]]; then
mv "$f" "${f//$killpattern/}"
fi
done
...but since ${var//pattern/replacement} isn't available in plain sh (along with [[...]]), here's a version using sh-only syntax, and tr to do the character replacement:
for f in *; do
new="$(printf %s "$f" | tr -d "\r\n")"
if [ "$f" != "$new" ]; then
mv "$f" "$new"
fi
done
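With about 10k files, two names that differ only by a stray line break would collide after cleaning, and a plain mv would silently overwrite one of them. A sketch of the sh version above with a guard against that, taking the directory as an argument (strip_line_breaks is a hypothetical helper name):

```shell
# Hypothetical helper: remove CR and LF characters from the names of the
# entries in directory $1, skipping any rename that would clobber a file.
strip_line_breaks() {
  for f in "$1"/*; do
    [ -e "$f" ] || continue                 # directory empty: glob did not match
    name=${f##*/}
    new=$(printf %s "$name" | tr -d '\r\n')
    [ "$name" = "$new" ] && continue        # name is already clean
    if [ -z "$new" ] || [ -e "$1/$new" ]; then
      echo "skipping one entry: cleaned name is empty or already taken" >&2
      continue
    fi
    mv -- "$f" "$1/$new"
  done
}
```

Running strip_line_breaks . in the affected directory would process the current directory; skipped entries are reported on standard error so they can be handled by hand.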
EDIT: If you really want it with sed, take a look at this:
http://www.linuxquestions.org/questions/programming-9/merge-lines-in-a-file-using-sed-191121/
Something along these lines should work similar to the perl below:
for i in *; do echo mv "$i" `echo "$i"|sed ':a;N;s/\n//;ta'`; done
With perl, try something along these lines:
for i in *; do mv "$i" `echo "$i"|perl -pe 's/\n//g'`; done
This will rename all files in the current folder by removing all newline characters from them. If you need to go recursive, you can use find instead - be aware of the escaping in that case, though.
In fact there is a way to use sed:
carr='\n' # specify carriage return
files=( $(ls -f) ) # array of files in current dir
for i in ${files[@]}
do
if [[ -n $(echo "$i" | grep $carr) ]] # filenames with carriage return
then
mv "$i" "$(echo "$i" | sed 's/\\n//g')" # move!
fi
done
This actually works.

Resources