BASH: File sorting according to file name

I need to sort 12000 files into 1000 groups according to their names and, for each group, create a new folder containing the files of that group. The name of each file has several underscore-separated fields: the second field runs from 1 to 12 (the number of the part) and the last field runs from 1 to 1000 (the number of the system), meaning that 1000 different systems (last field) were each split into 12 separate parts (second field).
Here is an example for a small subset based on 3 systems divided into 12 parts each, 36 files in total.
7000_01_lig_cne_1.dlg
7000_02_lig_cne_1.dlg
7000_03_lig_cne_1.dlg
...
7000_12_lig_cne_1.dlg
7000_01_lig_cne_2.dlg
7000_02_lig_cne_2.dlg
7000_03_lig_cne_2.dlg
...
7000_12_lig_cne_2.dlg
7000_01_lig_cne_3.dlg
7000_02_lig_cne_3.dlg
7000_03_lig_cne_3.dlg
...
7000_12_lig_cne_3.dlg
I need to group these files by the last field of their names (the system number, 1 .. 1000), collecting the 12 parts (01, 02, 03 .. 12) of each system into one folder, thus creating 1000 folders of 12 files each, in the following manner:
Folder 1, name: 7000_lig_cne_1, contains 12 files: 7000_{01 to 12}_lig_cne_1.dlg
Folder 2, name: 7000_lig_cne_2, contains 12 files: 7000_{01 to 12}_lig_cne_2.dlg
...
Folder 1000, name: 7000_lig_cne_1000, contains 12 files: 7000_{01 to 12}_lig_cne_1000.dlg
Assuming that all *.dlg files are present within the same dir, I propose the bash loop workflow below, which only lacks some sorting function (sed, awk ??):
#set the name of folder with all DLG
home=$PWD
FILES=${home}/all_DLG/7000_CNE
# set the name of protein and ligand library to analyse
experiment="7000_CNE"
#name of the output
output=${home}/sub_folders_to_analyse
#now here all magic comes
rm -r ${output}
mkdir ${output}
# sed solution
for i in ${FILES}/*.dlg # define this better to suit your needs
do
n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
# move the file to proper dir
mkdir -p ${output}/"${experiment}_lig$n"
cp "$i" ${output}/"${experiment}_lig$n"
done
Note: here I set the beginning of each folder name to ${experiment} and append the number of the final field, $n, at the end. Would it instead be possible to build the name of each new folder automatically from the names of the copied files? Manually it could be achieved by skipping the second field in the folder name:
cp ./all_DLG/7000_*_lig_cne_987.dlg ./output/7000_lig_cne_987
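For illustration, here is a minimal sketch of deriving that folder name from one filename with shell parameter expansion; it assumes the fixed layout prefix_part_lig_cne_system.dlg shown above, and the part/system numbers are hypothetical:
f=7000_05_lig_cne_987.dlg      # example filename
prefix=${f%%_*}                # "7000" - everything before the first "_"
rest=${f#*_*_}                 # "lig_cne_987.dlg" - drop the first two fields
dir=${prefix}_${rest%.dlg}     # "7000_lig_cne_987"
echo mkdir -p "./output/$dir"
echo cp "./all_DLG/$f" "./output/$dir/"
The echo prefixes keep it a dry run; the answers below build the same idea into a loop.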

Iterate over files. Extract the destination directory name from the filename. Move the file.
for i in *.dlg; do
# extract last number with your favorite tool
n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
# move the file to proper dir
echo mkdir -p "folder$n"
echo mv "$i" "folder$n"
done
Notes:
Do not use upper case variables in your scripts. Use lower case variables.
Remember to quote variables expansions.
Check your scripts with http://shellcheck.net
Tested on repl
Update: for the OP's folder-naming convention:
for i in *.dlg; do
# drop the second field and the .dlg suffix, e.g. 7000_01_lig_cne_987.dlg -> 7000_lig_cne_987
foldername="$HOME/output/${i%%_*}_${i#*_*_}"
foldername="${foldername%.dlg}"
echo mkdir -p "$foldername"
echo mv "$i" "$foldername"
done

This might work for you (GNU parallel):
ls *.dlg |
parallel --dry-run 'd={=s/^(7000_).*(lig.*)\.dlg/$1$2/=};mkdir -p $d;mv {} $d'
Pipe the output of ls command listing files ending in .dlg to parallel, which creates directories and moves the files to them.
Run the solution as is, and when satisfied the output of the dry run is ok, remove the option --dry-run.
The solution could be one instruction:
parallel 'd={=s/^(7000_).*(lig.*)\.dlg/$1$2/=};mkdir -p $d;mv {} $d' ::: *.dlg

Using POSIX shell's built-in grammar only and sort:
#!/usr/bin/env sh
curdir=
# Create list of files with newline
# Safe since we know there are no special
# characters in the names
printf -- %s\\n *.dlg |
# Sort the list by 5th key with _ as field delimiter
sort -t_ -k5 |
# Iterate reading the _ delimited fields of the sorted list
while IFS=_ read -r _ _ c d e; do
# Compose the new directory name
newdir="${c}_${d}_${e%.dlg}"
# If we enter a new group / directory
if [ "$curdir" != "$newdir" ]; then
# Make the new directory current
curdir="$newdir"
# Create the new directory
echo mkdir -p "$curdir"
# Move all its files into it
echo mv -- *_"$curdir.dlg" "$curdir/"
fi
done
Optionally, as a sort and xargs argument stream:
printf -- %s\\n *.dlg |
sort -u -t_ -k5 |
xargs -n1 sh -c
'd="lig_cne_${0##*_}"
d="${d%.dlg}"
echo mkdir -p "$d"
echo mv -- *"_$d.dlg" "$d/"
'

Here is a very simple awk script that does the trick in a single sweep.
script.awk
BEGIN{FS="[_.]"} # make field separator "_" or "."
{ # for each filename
dirName=$1"_"$3"_"$4"_"$5; # compute the target dir name from fields
sysCmd = "mkdir -p " dirName"; cp "$0 " "dirName; # prepare bash command
system(sysCmd); # run bash command
}
running script.awk
ls -1 *.dlg | awk -f script.awk
oneliner awk script
ls -1 *.dlg | awk 'BEGIN{FS="[_.]"}{d=$1"_"$3"_"$4"_"$5;system("mkdir -p "d"; cp "$0 " "d);}'

Related

Is there a way to add a suffix to files where the suffix comes from a list in a text file?

So far the searches are coming up with single-word renaming solutions, where you define a (static) suffix within the code. I need to rename based on a text file list, so -
I have a list of files in /home/linux/test/ :
1000.ext
1001.ext
1002.ext
1003.ext
1004.ext
Then I have a txt file (labels.txt) containing the labels I want to use:
Alpha
Beta
Charlie
Delta
Echo
I want to rename the files to look like (example1):
1000 - Alpha.ext
1001 - Beta.ext
1002 - Charlie.ext
1003 - Delta.ext
1004 - Echo.ext
How would you write a script which renames all the files in /home/linux/test/ to match the list in example1?
Loop over the files while reading the matching label from labels.txt, which is redirected into the loop so the two lists advance in parallel. Split each filename into its prefix and extension, then combine everything to make the new filename. (A variant that literally uses paste is sketched after the loop.)
dir=/home/linux/test
for file in "$dir"/*.ext
do
read -r label
prefix=${file%.*} # remove everything from last .
ext=${file##*.} # remove everything before last .
mv "$file" "$prefix - $label.$ext"
done < labels.txt
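If you prefer to pair the two lists explicitly, here is a rough bash sketch that literally uses paste; it assumes the filenames and labels contain no tabs or newlines:
dir=/home/linux/test
paste <(printf '%s\n' "$dir"/*.ext) labels.txt |
while IFS=$'\t' read -r file label
do
prefix=${file%.*}      # path and name up to the last .
ext=${file##*.}        # extension after the last .
echo mv "$file" "$prefix - $label.$ext"
done
Drop the echo once the printed mv commands look right.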
I originally partly got the request wrong, although this step is still useful, because it gives you the filenames you need.
#!/bin/sh
count=1000
cp labels.txt stack
cat > ed1 <<EOF
1p
q
EOF
cat > ed2 <<EOF
1d
wq
EOF
next () {
[ -s stack ] && main
}
main () {
line="$(ed -s stack < ed1)"
echo "${count} - ${line}.ext" >> newfile
ed -s stack < ed2
count=$(($count+1))
next
}
next
Now we just need to move the files:-
cp newfile stack
for i in *.ext
do
newname="$(ed -s stack < ed1)"
mv -v "${i}" "${newname}"
ed -s stack < ed2
done
rm -v ./ed1
rm -v ./ed2
rm -v ./stack
rm -v ./newfile
On the possibility that you don't have exactly the same number of files as labels, I set it up to cycle a couple of arrays in pseudo-parallel.
$: cat script
#!/bin/env bash
lst=( *.ext ) # array of files to rename
mapfile -t labels < labels.txt # array of labels to attach
for ndx in ${!lst[@]} # for each filename's numeric index
do # assign the new name
new="${lst[ndx]/.ext/ - ${labels[ndx%${#labels[@]}]}.ext}"
# show the command to rename the file
echo "mv \"${lst[ndx]}\" \"$new\""
done
$: ls -1 *ext # I added an extra file
1000.ext
1001.ext
1002.ext
1003.ext
1004.ext
1005.ext
$: ./script # loops back if more files than labels
mv "1000.ext" "1000 - Alpha.ext"
mv "1001.ext" "1001 - Beta.ext"
mv "1002.ext" "1002 - Charlie.ext"
mv "1003.ext" "1003 - Delta.ext"
mv "1004.ext" "1004 - Echo.ext"
mv "1005.ext" "1005 - Alpha.ext"
$: ./script > do # use ./script to write ./do
$: ./do # use ./do to change the names
$: ls -1
'1000 - Alpha.ext'
'1001 - Beta.ext'
'1002 - Charlie.ext'
'1003 - Delta.ext'
'1004 - Echo.ext'
'1005 - Alpha.ext'
do
labels.txt
script
You can just remove the echo to have ./script rename the files directly.
I renamed labels to labels.txt to match your example.
If you aren't using bash this will need a call to something like sed or awk. Here's a short awk-based script that will do the same.
$: cat script2
#!/bin/env sh
printf "%s\n" *.ext > files.txt
awk 'NR==FNR{label[i++]=$0}
NR>FNR{ if (! label[i] ) { i=0 } cmd="mv \""$0"\" \""gensub(/[.]ext/, " - "label[i++]".ext", 1)"\"";
print cmd;
# system(cmd);
}' labels.txt files.txt
Uncomment the system line to make it actually do the renames as well.
It does assume your filenames don't have embedded newlines. Let us know if that's a problem.

Cat content of files to .txt files with common pattern name in bash

I have a series of .dat files and a series of .txt files that have a common matching pattern. I want to cat the content of the .dat files into each respective .txt file with the matching pattern in the file name, in a loop. Example files are:
xfile_pr_WRF_mergetime_regionA.nc.dat
xfile_pr_GFDL_mergetime_regionA.nc.dat
xfile_pr_RCA_mergetime_regionA.nc.dat
#
yfile_pr_WRF_mergetime_regionA.nc.dat
yfile_pr_GFDL_mergetime_regionA.nc.dat
yfile_pr_RCA_mergetime_regionA.nc.dat
#
pr_WRF_mergetime_regionA_final.txt
pr_GFDL_mergetime_regionA_final.txt
pr_RCA_mergetime_regionA_final.txt
What I have tried so far is the following (I am trying to cat the content of all files starting with "xfile" into the respective model's .txt file):
#
find -name 'xfile*' | sed 's/_mergetime_.*//' | sort -u | while read -r pattern
do
echo "${pattern}"*
cat "${pattern}"* >> "${pattern}".txt
done
Let me make some assumptions:
All filenames contain _mergetime_* substring.
The pattern is the portion such as pr_GFDL and is essential to
identify the file.
Then would you try the following:
declare -A map # create an associative array
for f in xfile_*.dat; do # loop over xfile_* files
pattern=${f%_mergetime_*} # remove _mergetime_* substring to extract pattern
pattern=${pattern#xfile_} # remove xfile_ prefix
map[$pattern]=$f # associate the pattern with the filename
done
for f in *.txt; do # loop over *.txt files
pattern=${f%_mergetime_*} # extract the pattern
[[ -f ${map[$pattern]} ]] && cat "${map[$pattern]}" >> "$f"
done
If I understood you correctly, you want the following:
- xfile_pr_WRF_mergetime_regionA.nc.dat
- yfile_pr_WRF_mergetime_regionA.nc.dat
----> pr_WRF_mergetime_regionA_final.txt
- xfile_pr_GFDL_mergetime_regionA.nc.dat
- yfile_pr_GFDL_mergetime_regionA.nc.dat
----> pr_GFDL_mergetime_regionA_final.txt
- xfile_pr_RCA_mergetime_regionA.nc.dat
- yfile_pr_RCA_mergetime_regionA.nc.dat
----> pr_RCA_mergetime_regionA_final.txt
So here's what you want to do in the script:
Get all .nc.dat files in the directory
Extract the pr_TYPE_mergetime_region part from the file name
Append the _final.txt part to the output file
Then actually pipe the cat output onto that file
So I ended up with the following code:
find *.dat | while read -r pattern
do
output=$(echo $pattern | sed -e 's![^(pr)]*!!' -e 's!.nc.dat!!')
cat $pattern >> "${output}_final.txt"
done
And here are the files I ended up with:
pr_GFDL_mergetime_regionA_final.txt
pr_RCA_mergetime_regionA_final.txt
pr_WRF_mergetime_regionA_final.txt
Kindly let me know in the comments if I misunderstood anything or missed anything.
Seems like what you ask for:
concatxy.sh:
#!/usr/bin/env bash
# do not return the pattern if no file matches
shopt -s nullglob
# Iterate all xfiles
for xfile in "xfile_pr_"*".nc.dat"; do
# Regex to extract the common filename part
[[ "$xfile" =~ ^xfile_(.*)\.nc\.dat$ ]]
# Compose the matching yfile name
yfile="yfile_${BASH_REMATCH[1]}.nc.dat"
# Compose the output text file name
txtfile="${BASH_REMATCH[1]}_final.txt"
# Perform the concatenation of xfile and yfile into the .txt file
cat "$xfile" "$yfile" >"$txtfile"
done
Creating populated test files:
preptest.sh:
#!/usr/bin/env bash
# Populating test files
echo "Content of xfile_pr_WRF_mergetime_regionA.nc.dat" >xfile_pr_WRF_mergetime_regionA.nc.dat
echo "Content of xfile_pr_GFDL_mergetime_regionA.nc.dat" >xfile_pr_GFDL_mergetime_regionA.nc.dat
echo "Content of xfile_pr_RCA_mergetime_regionA.nc.dat" >xfile_pr_RCA_mergetime_regionA.nc.dat
#
echo "Content of yfile_pr_WRF_mergetime_regionA.nc.dat" > yfile_pr_WRF_mergetime_regionA.nc.dat
echo "Content of yfile_pr_GFDL_mergetime_regionA.nc.dat" >yfile_pr_GFDL_mergetime_regionA.nc.dat
echo "Content of yfile_pr_RCA_mergetime_regionA.nc.dat" >yfile_pr_RCA_mergetime_regionA.nc.dat
#
#pr_WRF_mergetime_regionA_final.txt
#pr_GFDL_mergetime_regionA_final.txt
#pr_RCA_mergetime_regionA_final.txt
Running test
$ bash ./preptest.sh
$ bash ./concatxy.sh
$ ls -tr1
concatxy.sh
preptest.sh
yfile_pr_WRF_mergetime_regionA.nc.dat
yfile_pr_RCA_mergetime_regionA.nc.dat
yfile_pr_GFDL_mergetime_regionA.nc.dat
xfile_pr_WRF_mergetime_regionA.nc.dat
xfile_pr_RCA_mergetime_regionA.nc.dat
xfile_pr_GFDL_mergetime_regionA.nc.dat
pr_GFDL_mergetime_regionA_final.txt
pr_WRF_mergetime_regionA_final.txt
pr_RCA_mergetime_regionA_final.txt
$ cat pr_GFDL_mergetime_regionA_final.txt
Content of xfile_pr_GFDL_mergetime_regionA.nc.dat
Content of yfile_pr_GFDL_mergetime_regionA.nc.dat
$ cat pr_WRF_mergetime_regionA_final.txt
Content of xfile_pr_WRF_mergetime_regionA.nc.dat
Content of yfile_pr_WRF_mergetime_regionA.nc.dat
$ cat pr_RCA_mergetime_regionA_final.txt
Content of xfile_pr_RCA_mergetime_regionA.nc.dat
Content of yfile_pr_RCA_mergetime_regionA.nc.dat

How to sort files in paste command with 500 files csv

My question is similar to How to sort files in paste command?
- which has been solved.
I have 500 csv files (daily rainfall data) in a folder with naming convention chirps_yyyymmdd.csv. Each file has only 1 column (rainfall value) with 100,000 rows, and no header. I want to merge all the csv files into a single csv in chronological order.
When I tried this script, ls -v file_*.csv | xargs paste -d,, with only 100 csv files, it worked. But when I tried it with 500 csv files, I got this error: paste: chirps_19890911.csv: Too many open files
How can I handle the above error?
As a quick fix I could split the csv files across two folders and run the above script in each, but the real problem is that I have 100 folders, each containing 500 csv files.
Thanks
Sample data and expected result: https://www.dropbox.com/s/ndofxuunc1sm292/data.zip?dl=0
You can do it with gawk like this...
Simply read all the files in, one after the other and save them into an array. The array is indexed by two numbers, firstly the line number in the current file (FNR) and secondly the column, which I increment each time we encounter a new file in the BEGINFILE block.
Then, at the end, print out the entire array:
gawk 'BEGINFILE{ ++col } # New file, increment column number
{ X[FNR SUBSEP col]=$0; rows=FNR } # Save datum into array X, indexed by current record number and col
END { for(r=1;r<=rows;r++){
comma=","
for(c=1;c<=col;c++){
if(c==col)comma=""
printf("%s%s",X[r SEP c],comma)
}
printf("\n")
}
}' chirps*
SUBSEP is awk's built-in subscript separator; it keeps the row and column indices distinct inside the array key. I am using gawk because BEGINFILE is useful for incrementing the column number.
Save the above in your HOME directory as merge. Then start a Terminal and, just once, make it executable with the command:
chmod +x merge
Now change directory to where your chirps are with a command like:
cd subdirectory/where/chirps/are
Now you can run the script with:
$HOME/merge
The output will rush past on the screen. If you want it in a file, use:
$HOME/merge > merged.csv
First make one file without pasting, then change that file into a one-liner with tr:
cat */chirps_*.csv | tr "\n" "," > long.csv
If the goal is a file with 100,000 lines and 500 columns then something like this should work:
paste -d, chirps_*.csv > chirps_500_merge.csv
Additional code can be used to sort the chirps_... input files into any desired order before pasting, as sketched below.
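As a minimal sketch of that, assuming the filenames need no special quoting, the list can be sorted explicitly and handed to paste. Note the chirps_*.csv glob already expands in lexical order, which is chronological for yyyymmdd names, and this does not lift the open-files limit discussed in the next answer:
set -- $(printf '%s\n' chirps_*.csv | sort)
paste -d, "$@" > chirps_500_merge.csv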
The error comes from ulimit, from man ulimit:
-n or --file-descriptor-count The maximum number of open file descriptors
On my system ulimit -n returns 1024.
Happily we can paste the paste output, so we can chain it.
find . -type f -name 'file_*.csv' |
sort |
xargs -n$(ulimit -n) sh -c '
tmp=$(mktemp);
paste -d, "$#" >$tmp;
echo $tmp
' -- |
xargs sh -c '
paste -d, "$#"
rm "$#"
' --
Don't parse ls output.
Having moved from parsing ls output to good old find, we find all the files and sort them.
The first xargs takes up to 1024 files at a time, creates a temporary file, pastes its batch into that temporary file and prints the temporary file's name.
The second xargs does the same with the temporary files, but also removes all the temporaries.
As the count of files would be 100*500 = 50,000, which is smaller than 1024*1024, we can get away with one pass.
Tested against test data generated with:
seq 1 2000 |
xargs -P0 -n1 -t sh -c '
seq 1 1000 |
sed "s/^/ $RANDOM/" \
>"file_$(date --date="-${1}days" +%Y%m%d).csv"
' --
The problem seems to be much like foldl with a maximum chunk size per fold. Basically we want paste -d, <(paste -d, <(paste -d, <1024 files>) <1023 files>) <rest of files>, running kind-of-recursively. With a little fun I came up with the following:
func() {
paste -d, "$#"
}
files=()
tmpfilecreated=0
# read filenames...
while IFS= read -r line; do
files+=("$line")
# if the limit of 1024 files is reached
if ((${#files[@]} == 1024)); then
tmp=$(mktemp)
func "${files[@]}" >"$tmp"
# remove the last tmp file
if ((tmpfilecreated)); then
rm "${files[0]}"
fi
tmpfilecreated=1
# start with fresh files list
# with only the tmp file
files=("$tmp")
fi
done
func "${files[#]}"
# remember to clear tmp file!
if ((tmpfilecreated)); then
rm "${files[0]}"
fi
I guess readarray/mapfile could be faster, and result in a bit clearer code:
func() {
paste -d, "$#"
}
tmp=()
tmpfilecreated=0
while readarray -t -n1023 files && ((${#files[@]})); do
tmp=("$(mktemp)")
func "${tmp[@]}" "${files[@]}" >"$tmp"
if ((tmpfilecreated)); then
rm "${files[0]}"
fi
tmpfilecreated=1
done
func "${tmp[#]}" "${files[#]}"
if ((tmpfilecreated)); then
rm "${files[0]}"
fi
PS. I want to merge all the csv files into a single csv in chronological order. Wouldn't that be just cut? Right now each column represents one day.
You can try this Perl one-liner. It will work for any number of files matching *.csv under a directory.
$ ls -1 *csv
file_1.csv
file_2.csv
file_3.csv
$ cat file_1.csv
1
2
3
$ cat file_2.csv
4
5
6
$ cat file_3.csv
7
8
9
$ perl -e ' BEGIN { while($f=glob("*.csv")) { $i=0;open($FH,"<$f"); while(<$FH>){ chomp;@t=@{$kv{$i}}; push(@t,$_);$kv{$i++}=[@t];}} print join(",",@{$kv{$_}})."\n" for(0..$i) } '
1,4,7
2,5,8
3,6,9
$

Finding the file name in a directory with a pattern

I need to find the latest file - filename_YYYYMMDD in the directory DIR.
The below is not working, as the field position shifts each time because of the varying spaces in between (mostly around the file size field, which differs every time).
Please suggest if there is another way.
report=`ls -ltr $DIR/filename_* 2>/dev/null | tail -1 | cut -d " " -f9`
You can use awk to print the last field, like below.
report=`ls -ltr $DIR/filename_* 2>/dev/null | tail -1 | awk '{print $NF}'`
cut may not be a good option here.
If I understand correctly, you want to loop through each file in the directory and find the largest 'YYYYMMDD' value along with the filename associated with that value. You can use simple POSIX parameter expansion with substring removal to isolate the 'YYYYMMDD' part and compare it against a value initialized to zero, updating the latest variable to hold the largest 'YYYYMMDD' seen so far as you loop over all files in the directory, and storing the name of the file each time you find a larger 'YYYYMMDD'.
For example, you could do something like:
#!/bin/sh
name=
latest=0
for i in *; do
test "${i##*_}" -gt "$latest" && { latest="${i##*_}"; name="$i"; }
done
printf "%s\n" "$name"
Example Directory
$ ls -1rt
filename_20120615
filename_20120612
filename_20120115
filename_20120112
filename_20110615
filename_20110612
filename_20110115
filename_20110112
filename_20100615
filename_20100612
filename_20100115
filename_20100112
Example Use/Output
$ name=; latest=0; \
> for i in *; do \
> test "${i##*_}" -gt "$latest" && { latest="${i##*_}"; name="$i"; }; \
> done; \
> printf "%s\n" "$name"
filename_20120615
Where the script selects filename_20120615 as the file with the greatest 'YYYYMMDD' of all files in the directory.
Since you are using only tools provided by the shell itself, it doesn't need to spawn subshells for each pipe or utility it calls.
Give it a test and let me know if that is what you intended, if your intent was different, or if you have any further questions.

UNIX - Replacing variables in sql with matching values from .profile file

I am trying to write a shell script which will take an SQL file as input. Example SQL file:
SELECT *
FROM %%DB.TBL_%%TBLEXT
WHERE CITY = '%%CITY'
Now the script should extract all variables, which in this case means everything starting with %%. So the output file will be something like below:
%%DB
%%TBLEXT
%%CITY
Now I should be able to extract the matching values from the user's .profile file for these variables and create the SQL file with the proper values.
SELECT *
FROM tempdb.TBL_abc
WHERE CITY = 'Chicago'
As of now I am trying to generate file1, which will contain all the variables. Below is a code sample -
sed "s/[(),']//g" "T:/work/shell/sqlfile1.sql" | awk '/%%/{print $NF}' | awk '/%%/{print $NF}' > sqltemp2.sql
which only gets me to
%%DB.TBL_%%TBLEXT
%%CITY
Can someone help me in getting to file1 listing the variables?
You can use grep and sort to get a list of unique variables, as per the following transcript:
$ echo "SELECT *
FROM %%DB.TBL_%%TBLEXT
WHERE CITY = '%%CITY'" | grep -o '%%[A-Za-z0-9_]*' | sort -u
%%CITY
%%DB
%%TBLEXT
The -o flag to grep instructs it to only print the matching parts of lines rather than the entire line, and also outputs each matching part on a distinct line. Then sort -u just makes sure there are no duplicates.
In terms of the full process, here's a slight modification to a bash script I've used for similar purposes:
# Define all translations.
declare -A xlat
xlat['%%DB']='tempdb'
xlat['%%TBLEXT']='abc'
xlat['%%CITY']='Chicago'
# Check all variables in input file.
okay=1
for key in $(grep -o '%%[A-Za-z0-9_]*' input.sql | sort -u) ; do
if [[ "${xlat[$key]}" == "" ]] ; then
echo "Bad key ($key) in file:"
grep -n "${key}" input.sql | sed 's/^/ /'
okay=0
fi
done
if [[ ${okay} -eq 0 ]] ; then
exit 1
fi
# Process input file doing substitutions. Fairly
# primitive use of sed, must change to use sed -i
# at some point.
# Note we sort keys based on descending length so we
# correctly handle extensions like "NAME" and "NAMESPACE",
# doing the longer ones first makes it work properly.
cp input.sql output.sql
for key in $( (
for key in ${!xlat[@]} ; do
echo ${key}
done
) | awk '{print length($0)":"$0}' | sort -rnu | cut -d':' -f2) ; do
sed "s/${key}/${xlat[$key]}/g" output.sql >output2.sql
mv output2.sql output.sql
done
cat output.sql
It first checks that the input file doesn't contain any keys not found in the translation array. Then it applies sed substitutions to the input file, one per translation, to ensure all keys are substituted with their respective values.
This should be a good start, though there may be some edge cases such as if your keys or values contain characters sed would consider important (like / for example). If that is the case, you'll probably need to escape them such as changing:
xlat['%%UNDEFINED']='0/0'
into:
xlat['%%UNDEFINED']='0\/0'
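The question also asks to pull the values from the user's .profile rather than hard-coding them. Here is a rough sketch of how the xlat array might be populated instead; it assumes .profile contains plain NAME=value or export NAME=value lines with unquoted values for DB, TBLEXT and CITY:
declare -A xlat
while IFS='=' read -r key value ; do
key=${key#export }                         # tolerate "export NAME=value" lines
case $key in
DB|TBLEXT|CITY) xlat["%%$key"]=$value ;;   # keep only the variables we substitute
esac
done < "$HOME/.profile"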
