How to escape special characters? - shell

I'm trying to remove songs via a bash shell for loop yet removing a file like this
while read item; do rm "$item"; done < duplicates
keeps getting caught up on song name. Is it possible to get around this? My song titles might look like this:
/home/user/Music/Master List's Music/iTunes/iTunes\ Music/John\ Mayer/Room\ for\ Squares\ \[Aware\]/07\ 83.m4a
/home/user/Music/Master List's Music/bsg\ season\ 1\ \(Case\ Conflict\ 1\)/06\ A\ Good\ Lighter.mp3
/home/user/Music/Master List's Music/Nino\ Rota/The\ Godfather\ Pt.\ 3/14\ A\ Casa\ Amiche.m4a
as you can see, in order to remove an item I can have no %.()[] or anything else without being escaped unless it's the . before the file extension obviously. Is there a way I can escape special characters like this?
For instance, I used sed to turn the %20 into spaces:
cat duplicates | sed 's/%20/\\ /g' > clean_duplicates
The output I'm looking for looks like this:
/home/user/Music/Master\ List\'s\ Music/iTunes/iTunes Music/John\ Mayer/Room\ for\ Squares\ \[Aware\]/07\ 83.m4a
/home/user/Music/Master\ List\'s\ Music/bsg\ season\ 1\ \(Case\ Conflict\ 1\)/06\ A\ Good\ Lighter.mp3
/home/user/Music/Master\ List\'s\ Music/Nino\ Rota/The Godfather\ Pt\.\ 3\/14\ A\ Casa\ Amiche.m4a

Update To address the actual url-decoding (I missed it before):
while read line; do printf "$(echo -n $line | sed 's/\\/\\\\/g;s/\(%\)\([0-9a-fA-F][0-9a-fA-F]\)/\\x\2/g')\n"; done < input
Output:
/home/user/Music/Master List's Music/iTunes/iTunes Music/John Mayer/Room for Squares [Aware]/07 83.m4a
/home/user/Music/Master List's Music/bsg season 1 (Case Conflict 1)/06 A Good Lighter.mp3
/home/user/Music/Master List's Music/Nino Rota/The Godfather Pt. 3/14 A Casa Amiche.m4a
So in order to delete those files, e.g. redirect the cleaned output to a file:
while read line
do
printf "$(echo -n $line | sed 's/\\/\\\\/g;s/\(%\)\([0-9a-fA-F][0-9a-fA-F]\)/\\x\2/g')\n"
done < duplicates > cleaned_duplicates
while read file; do rm -v "$file"; done < cleaned_duplicates
If you prefer to store the names into a script files using explicit shell character escaping you could do
while read file; do printf "rm -v %q\n" "$file"; done < cleaned_duplicates > script.sh
Which should result in script.sh containing:
rm -v /home/user/Music/Master\ List\'s\ Music/iTunes/iTunes\ Music/John\ Mayer/R
rm -v /home/user/Music/Master\ List\'s\ Music/bsg\ season\ 1\ \(Case\ Conflict\
rm -v /home/user/Music/Master\ List\'s\ Music/Nino\ Rota/The\ Godfather\ Pt.\ 3/

Related

Speed up bash for loop which contains multiple sed commands

my bash for loop looks like:
for i in read_* ; do
cut -f1 $i | sponge $i
sed -i '1 s/^/>/g' $i
sed -i '3 s/^/>ref\n/g' $i
sed -i '4d' $i
sed -i '1h;2H;1,2d;4G' $i
mv $i $i.fasta
done
Are there any methods of speeding up this process, perhaps using GNU parallel?
EDIT: Added input and expected output.
Input:
sampleid 97 stuff 2086 42 213M = 3322 1431
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
Hopeful output:
>ref
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
>sampleid
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
I used the sed -i '1h;2H;1,2d;4G' $i command to swap lines 2 and 4.
If I read it right, this should create the same result, though it would probably help a LOT if I could see what your input and expected output look like...
awk '{$0=$1}
FNR==1{hd=">"$0; next}
FNR==2{hd=hd"\n"$0;next}
FNR==3{print ">ref\n"$0 > FILENAME".fasta"}
FNR==4{next}
FNR==5{print hd"\n"$0 > FILENAME".fasta"}
' read_*
My input files:
$: cat read_x
foo x
bar x
baz x
last x
curiosity x
$: cat read_y
FOO y
BAR y
BAZ y
LAST y
CURIOSITY y
and the resulting output files:
$: cat read_x.fasta
>ref
baz
>foo
bar
curiosity
$: cat read_y.fasta
>ref
BAZ
>FOO
BAR
CURIOSITY
This runs in one pass with no loop aside from awk's usual internals, and leaves the originals in place so you can check it first. If all is good, all that's left is to remove the originals. For that, I would use extended globbing.
$: shopt -s extglob; rm read_!(*.fasta)
That will clean up the original inputs but not the new outputs.
Same results, three commands, no loops.
I am, or course, making some assumptions about what you are meaning to do that might not be accurate. To get this format in a single sed call -
$: sed -e 's/[[:space:]].*//' -e '1{s/^/>/;h;d}' -e '2{H;s/.*/>ref/}' -e '4x' read_x
>ref
baz
>foo
bar
curiosity
but that's not the same commands you used, so maybe I'm misreading it.
To use this to in-place edit multiple files at a time (instead of calling it in a loop on each file), use -si so that the line numbers apply to each file rather than the stream of records they collectively produce.
DON'T use -is, though you could use -i -s.
$: sed -s -i -e 's/[[:space:]].*//' -e '1{s/^/>/;h;d}' -e '2{H;s/.*/>ref/}' -e '4x' read_*
This still leaves you with the issue of renaming each, but xargs makes that pretty easy in the given example.
printf "%s\n" read_* | xargs -I# mv # #.fasta
addendum
Using the file you gave in the OP, assuming every file is the same general structure and exactly 4 lines -
$: cat file_0 # I made files 0 through 7, but with same data
sampleid 97 stuff 2086 42 213M = 3322 1431
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
$: sed -Esi '1{s/^([^[:space:]]+).*/>\1/;h;s/.*/>ref/}; 3x;' file_?
$: cat file_0 # used a diff on each, worked on all at once
>ref
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
>sampleid
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
Breakout:
-Esi Extended pattern matching, separate file linecounts, in-place edits
1{...}; Collectively do these commands, in order, only on every line 1
s/^([^[:space:]]+).*/>\1/ add leading > but strip everything after any whitespace
h store the resulting >\1 line in the hold buffer
s/.*/>ref/ then replace the whole line with a literal >ref
`3x' swap line 3 with the value in the hold buffer from line 1
file_? I used a glob to supply the appropriate list of files all at once.
Doing same with awk:
$: awk 'FNR==1{id=">"$1; print ">ref" >FILENAME".fasta"; next} FNR==3{print id > FILENAME".fasta"; next} {print $0 > FILENAME".fasta"}' file_?
Then you can do file management as above with the xargs/mv for the sed or the shopt/rm for the awk - or we could add a little organizational work in awk if you like. Consider this:
awk 'BEGIN { system(" mkdir -p done ") }
FNR==1 { id=">"$1; print ">ref" > FILENAME".fasta"; next } # skip printing original
FNR==3 { print id > FILENAME".fasta"; next } # skip printing original
{ print $0 > FILENAME".fasta" } # every line NOT skipped
FNR==4 { close(FILENAME); close(FILENAME".fasta");
system("mv " FILENAME " done/")
}' file_?
Then if there are any problems, it's easy to delete the fasta's, move the originals back, adjust the code, and try again. If everything is ok, it's fast and easy to rm -fr done, yes?
Note that I really only added the mkdir inside a system call in the awk to show that you can, and to keep from having to manually do it separately if you have to run a few iterations or move it all into a wrapper script, etc.
The code in the question runs multiple subprocesses (cut, sponge, sed four times, and mv) for each file that is processed. Running subprocesses is relatively slow, so you can speed up the code significantly by reducing the number of them.
This Shellcheck-clean code is one way to do it:
#! /bin/bash -p
old_files=()
for f in read_* ; do
readarray -t lines <"$f"
printf '>ref\n%s\n>%s\n%s\n' \
"${lines[3]}" "${lines[0]%%[[:space:]]*}" "${lines[1]}" >"$f.fasta"
old_files+=( "$f" )
done
rm -- "${old_files[#]}"
This runs no subprocesses when processing individual files. It just reads the lines of the old file into an array using the built-in readarray command and writes to the new file using the built-in printf.
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of the %% in ${lines[0]%%[[:space:]]*}.
To avoid running rm for each file, the code keeps a list of files to be deleted and removes all of them at the end. If you try the code, consider commenting the rm line until you are very confident that the rest of the code is doing what you want.

Is there a way to add a suffix to files where the suffix comes from a list in a text file?

So currently the searches are coming up with a single word renaming solution, where you define the (static) suffix within the code. I need to rename based on a text based filelist and so -
I have a list of files in /home/linux/test/ :
1000.ext
1001.ext
1002.ext
1003.ext
1004.ext
Then I have a txt file (labels.txt) containing the labels I want to use:
Alpha
Beta
Charlie
Delta
Echo
I want to rename the files to look like (example1):
1000 - Alpha.ext
1001 - Beta.ext
1002 - Charlie.ext
1003 - Delta.ext
1004 - Echo.ext
How would you a script which renames all the files in /home/linux/test/ to the list in example1?
Use paste to loop through the two lists in parallel. Split the filenames into the prefix and extension, then combine everything to make the new filenames.
dir=/home/linux/test
for file in "$dir"/*.ext
do
read -r label
prefix=${file%.*} # remove everything from last .
ext=${file##*.} # remove everything before last .
mv "$file" "$prefix - $label.$ext"
done < labels.txt
I originally partly got the request wrong, although this step is still useful, because it gives you the filenames you need.
#!/bin/sh
count=1000
cp labels.txt stack
cat > ed1 <<EOF
1p
q
EOF
cat > ed2 <<EOF
1d
wq
EOF
next () {
[ -s stack ] && main
}
main () {
line="$(ed -s stack < ed1)"
echo "${count} - ${line}.ext" >> newfile
ed -s stack < ed2
count=$(($count+1))
next
}
next
Now we just need to move the files:-
cp newfile stack
for i in *.ext
do
newname="$(ed -s stack < ed1)"
mv -v "${i}" "${newname}"
ed -s stack < ed2
done
rm -v ./ed1
rm -v ./ed2
rm -v ./stack
rm -v ./newfile
On the possibility that you don't have exactly the same number of files as labels, I set it up to cycle a couple of arrays in pseudo-parallel.
$: cat script
#!/bin/env bash
lst=( *.ext ) # array of files to rename
mapfile -t labels < labels.txt # array of labels to attach
for ndx in ${!lst[#]} # for each filename's numeric index
do # assign the new name
new="${lst[ndx]/.ext/ - ${labels[ndx%${#labels[#]}]}.ext}"
# show the command to rename the file
echo "mv \"${lst[ndx]}\" \"$new\""
done
$: ls -1 *ext # I added an extra file
1000.ext
1001.ext
1002.ext
1003.ext
1004.ext
1005.ext
$: ./script # loops back if more files than labels
mv "1000.ext" "1000 - Alpha.ext"
mv "1001.ext" "1001 - Beta.ext"
mv "1002.ext" "1002 - Charlie.ext"
mv "1003.ext" "1003 - Delta.ext"
mv "1004.ext" "1004 - Echo.ext"
mv "1005.ext" "1005 - Alpha.ext"
$: ./script > do # use ./script to write ./do
$: ./do # use ./do to change the names
$: ls -1
'1000 - Alpha.ext'
'1001 - Beta.ext'
'1002 - Charlie.ext'
'1003 - Delta.ext'
'1004 - Echo.ext'
'1005 - Alpha.ext'
do
labels.txt
script
You can just remove the echo to have ./script rename the files there.
I renamed labels to labels.txt to match your example.
If you aren't using bash this will need a call to something like sed or awk. Here's a short awk-based script that will do the same.
$: cat script2
#!/bin/env sh
printf "%s\n" *.ext > files.txt
awk 'NR==FNR{label[i++]=$0}
NR>FNR{ if (! label[i] ) { i=0 } cmd="mv \""$0"\" \""gensub(/[.]ext/, " - "label[i++]".ext", 1)"\"";
print cmd;
# system(cmd);
}' labels.txt files.txt
Uncomment the system line to make it actually do the renames as well.
It does assume your filenames don't have embedded newlines. Let us know if that's a problem.

looping with grep over several files

I have multiple files /text-1.txt, /text-2.txt ... /text-20.txt
and what I want to do is to grep for two patterns and stitch them into one file.
For example:
I have
grep "Int_dogs" /text-1.txt > /text-1-dogs.txt
grep "Int_cats" /text-1.txt> /text-1-cats.txt
cat /text-1-dogs.txt /text-1-cats.txt > /text-1-output.txt
I want to repeat this for all 20 files above. Is there an efficient way in bash/awk, etc. to do this ?
#!/bin/sh
count=1
next () {
[[ "${count}" -lt 21 ]] && main
[[ "${count}" -eq 21 ]] && exit 0
}
main () {
file="text-${count}"
grep "Int_dogs" "${file}.txt" > "${file}-dogs.txt"
grep "Int_cats" "${file}.txt" > "${file}-cats.txt"
cat "${file}-dogs.txt" "${file}-cats.txt" > "${file}-output.txt"
count=$((count+1))
next
}
next
grep has some features you seem not to be aware of:
grep can be launched on lists of files, but the output will be different:
For a single file, the output will only contain the filtered line, like in this example:
cat text-1.txt
I have a cat.
I have a dog.
I have a canary.
grep "cat" text-1.txt
I have a cat.
For multiple files, also the filename will be shown in the output: let's add another textfile:
cat text-2.txt
I don't have a dog.
I don't have a cat.
I don't have a canary.
grep "cat" text-*.txt
text-1.txt: I have a cat.
text-2.txt: I don't have a cat.
grep can be extended to search for multiple patterns in files, using the -E switch. The patterns need to be separated using a pipe symbol:
grep -E "cat|dog" text-1.txt
I have a dog.
I have a cat.
(summary of the previous two points + the remark that grep -E equals egrep):
egrep "cat|dog" text-*.txt
text-1.txt:I have a dog.
text-1.txt:I have a cat.
text-2.txt:I don't have a dog.
text-2.txt:I don't have a cat.
So, in order to redirect this to an output file, you can simply say:
egrep "cat|dog" text-*.txt >text-1-output.txt
Assuming you're using bash.
Try this:
for i in $(seq 1 20) ;do rm -f text-${i}-output.txt ; grep -E "Int_dogs|Int_cats" text-${i}.txt >> text-${i}-output.txt ;done
Details
This one-line script does the following:
Original files are intended to have the following name order/syntax:
text-<INTEGER_NUMBER>.txt - Example: text-1.txt, text-2.txt, ... text-100.txt.
Creates a loop starting from 1 to <N> and <N> is the number of files you want to process.
Warn: rm -f text-${i}-output.txt command first will be run and remove the possible outputfile (if there is any), to ensure that a fresh new output file will be only available at the end of the process.
grep -E "Int_dogs|Int_cats" text-${i}.txt will try to match both strings in the original file and by >> text-${i}-output.txt all the matched lines will be redirected to a newly created output file with the relevant number of the original file. Example: if integer number in original file is 5 text-5.txt, then text-5-output.txt file will be created & contain the matched string lines (if any).

Using sed to find a string with wildcards and then replacing with same wildcards

So I am trying to remove new lines using sed, because it the only way I can think of to do it. I'm completely self taught so there may be a more efficient way that I just don't know.
The string I am searching for is \HF=-[0-9](newline character). The problem is the data it is searching through can look like (Note: there are actual new line characters in this data, which I think is causing a bit of the problem)
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1.
3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974
,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H
,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1
.3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\
C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14
97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,-
1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-4
61.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)]
\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.1_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-
1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.69
74,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.
\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.
,1.3948,3.1\C,0,0.,-1.3948,3.1\C,0,1.2079,0.6974,3.1\C,0,-1.2079,0.697
4,3.1\C,0,-1.2079,-0.6974,3.1\C,0,1.2079,-0.6974,3.1\H,0,0.,2.4822,3.1
\H,0,2.1497,1.2411,3.1\H,0,-2.1497,1.2411,3.1\H,0,-2.1497,-1.2411,3.1\
H,0,2.1497,-1.2411,3.1\H,0,0.,-2.4822,3.1\\Version=ES64L-G09RevD.01\St
ate=1-AG\HF=-461.4104442\MP2=-463.0062587\RMSD=3.651e-09\PG=D02H [SG"(
C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.3_Slide1.7\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,-0.3052,3.3\C,0,0.,-3.0948,3.3\C,0,1.2079,-1.0026,3.3\C,0,-1.2079,-
1.0026,3.3\C,0,-1.2079,-2.3974,3.3\C,0,1.2079,-2.3974,3.3\H,0,0.,0.782
2,3.3\H,0,2.1497,-0.4589,3.3\H,0,-2.1497,-0.4589,3.3\H,0,-2.1497,-2.94
11,3.3\H,0,2.1497,-2.9411,3.3\H,0,0.,-4.1822,3.3\\Version=ES64L-G09Rev
D.01\State=1-AG\HF=-461.436061\MP2=-463.0177441\RMSD=7.859e-09\PG=C02H
[SGH(C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.6_Slide0.9\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,0.4948,3.6\C,0,0.,-2.2948,3.6\C,0,1.2079,-0.2026,3.6\C,0,-1.2079,-0
.2026,3.6\C,0,-1.2079,-1.5974,3.6\C,0,1.2079,-1.5974,3.6\H,0,0.,1.5822
,3.6\H,0,2.1497,0.3411,3.6\H,0,-2.1497,0.3411,3.6\H,0,-2.1497,-2.1411,
3.6\H,0,2.1497,-2.1411,3.6\H,0,0.,-3.3822,3.6\\Version=ES64L-G09RevD.0
1\State=1-AG\HF=-461.4376969\MP2=-463.0163868\RMSD=7.263e-09\PG=C02H [
SGH(C4H4),X(C8H8)]\\#
Basically the number I am looking for can be broken up into two lines at any point based on character count. I need to get rid of the newline breaking up the number so that I can extract the entire value into a separate file. (I have no problems with the extraction to a new file, hence why it isn't included in the code)
Currently I am using this code
sed -i ':a;N;$!ba;s/HF=-*[0-9]*\n/HF=-*[0-9]*/g' $i &&
Which ALMOST works, expect it doesn't replace the wildcard values with the same values. It replaces it with the actual text [0-9] instead and doesn't always remove the new line character.
Important to the is that THERE ARE ACTUAL NEW LINE CHARACTERS in the output file and there is no way to change that without messing up the other 30 lines I am extracting from this output file.
What I want is to just get rid of the newline characters that occur when that string is found, regardless of how many digits there are in between the - sign and the newline character.
So the expected output would be something like
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1.
3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974
,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H
,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1
.3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\
C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14
97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,-
1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-461.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)]
\\#
These files are rather large and have over 1500 executions of this line of code, so the more efficient the better.
Everything else in the script this is in is using a combination of grep, awk, sed, and basic UNIX commands.
EDIT
After trying
sed -i -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/' $i &&
I still had no luck getting rid of those pesky new line characters.
If it has any effect on the answers at all here is the rest of the code to go with the one line that is causing problems
echo name HF MP2 mpdiff | cat > allE
for i in *.out
do echo name HF MP2 mpdiff | cat > $i.allE
grep "Slide" $i | cut -d "\\" -f2 | cat | tr -d '\n' > $i.name &&
grep "EUMP2" $i | cut -d "=" -f3 | cut -c 1-25 | tr '\n' ' ' | tr -s ' ' >> $i.mp &&
grep "EUMP2" $i | cut -d "=" -f2 | cut -c 1-25 | tr '\n' ' ' | tr -s ' ' >> $i.mpdiff &&
sed -i -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/' $i &&
grep '\\HF' $i | awk -F 'HF' '{print substr($2,2,14)}' | tr '\n' ' ' >> $i.hf &&
paste $i.name >> $i.energies &&
sed -i 's/ /0 /g' $i.hf &&
sed -i 's/\\/0/g' $i.hf &&
sed -i 's/[A-Z]/0/g' $i.hf &&
paste $i.hf >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mp &&
paste $i.mp >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mpdiff &&
paste $i.mpdiff >> $i.energies &&
transpose $i.energies >> $i.allE #temp.txt &&
#cat temp.txt > $i.energies
#echo $i is finished
done
echo see allE for energies
#rm *.energies #temp.txt
rm *.name
rm *.mp
rm *.hf
rm *.mpdiff
Here is how you can fix your current attempt.
sed -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/'
Add the i flag if you want to make the changes on the file itself, add && to send the job to the background, etc. The -E flag is needed, because backreferences (see below) are part of extended regular expressions.
I made the following changes: I changed -* to -? as there should be at most one dash (if I understand correctly and that is in fact a minus sign, not a dash). I added the period to the bracket expression, so that the decimal point would be matched too. (Note that in a bracket expression, the dot is a regular character). I wrapped the whole thing except the newline in parentheses - making it into a subexpression, which you can refer to with a backreference - which is what I did in the replacement part.
A few notes though - this will join the lines even if the entire number is at the end of one line, but not followed by the closing \. If in fact the entire number being on one line, but the closing \ is on the next line, you can change the sed command slightly, to leave those alone. On the other hand, this does not handle situations where, for example, one line ends in \H and the next line begins with F=304.222\ You only mentioned "split number" in your problem statement; shouldn't you, though, also handle such cases, where the newline splits the \HF=...\ token, just not in the "number" portion of the token?
It looks like your input lines start with a space. I have ignored them in this solution.
sed -rz 's/(AG\\HF=-[0-9]*)\n/\1/g' "$i"

Remove spaces between patterns

I have a log file where data is separated by spaces. Unfortunately one of the datafields contains spaces as well. I would like to replace those spaces with "%20". It looks like this:
2012-11-02 23:48:36 INFO 10.2.3.23 something strange name.doc 3.0.0 view1 orientation_right
the expected result is
2012-11-02 23:48:36 INFO 10.2.3.23 something%20strange%20name.doc 3.0.0 view1 orientation_right
unpredictable that how many spaces we have between the IP address and ".doc". So I would like to change them between these two patterns using pure bash if possible.
thanks for the help
$ cat file
2012-11-02 23:48:36 INFO 10.2.3.23 something strange name.doc 3.0.0 view1 orientation_right
Using Perl:
$ perl -lne 'if (/(.*([0-9]{1,3}\.){3}[0-9]{1,3} )(.*)(.doc.*)/){($a,$b,$c)=($1,$3,$4);$b=~s/ /%20/g;print $a.$b.$c;}' file
2012-11-02 23:48:36 INFO 10.2.3.23 something%20strange%20name.doc 3.0.0 view1 orientation_right
This might work for you (GNU sed):
sed 's/\S*\s/&\n/4;s/\(\s\S*\)\{3\}$/\n&/;h;s/ /%20/g;H;g;s/\(\n.*\n\)\(.*\)\n.*\n\(.*\)\n.*/\3\2/' file
This splits the line into three, copies the line, replaces space's with %20's in one of the copies and reassembles the line discarding the unwanted pieces.
EDIT:
With reference to the comment below, the above solution can be ameliorated to:
sed -r 's/\S*\s/&\n/4;s/.*\.doc/&\n/;h;s/ /%20/g;H;g;s/(\n.*\n)(.*)\n.*\n(.*)\n.*/\3\2/' file
Untested as of yet, but in Bash 4 it's possible to do this
if [[ $line =~ (.*([0-9]+\.){3}[0-9]+ +)([^ ].*\.doc)(.*) ]]; then
nospace=${BASH_REMATCH[3]// /%20}
printf "%s%s%s\n" ${BASH_REMATCH[1]} ${nospace} ${BASH_REMATCH[4]}
fi
Here's one way with GNU sed:
echo "2012-11-02 23:48:36 INFO 10.2.3.23 something strange name.doc 3.0.0 view1 orientation_right" |
sed -r 's/(([0-9]+\.){3}[0-9]+\s+)(.*\.doc)/\1\n\3\n/; h; s/[^\n]+\n([^\n]+)\n.*$/\1/; s/\s/%20/g; G; s/([^\n]+)\n([^\n]+)\n([^\n]+)\n(.*)$/\2\1\4/'
Output:
2012-11-02 23:48:36 INFO 10.2.3.23 something%20strange%20name.doc 3.0.0 view1 orientation_right
Explanation
s/(([0-9]+\.){3}[0-9]+\s+)(.*\.doc)/\1\n\3\n/ # Separate the interesting bit on its own line
h # Store the rest in HS for later
s/[^\n]+\n([^\n]+)\n.*$/\1/ # Isolate the interesting bit
s/\s/%20/g # Do the replacement
G # Fetched stored bits back
s/([^\n]+)\n([^\n]+)\n([^\n]+)\n(.*)$/\2\1\4/ # Reorganize into the correct order
Just bash. Assuming 4 fields appear before the space separated string and 3 fields appear after:
reformat_line() {
local sep i new=""
for ((i=1; i<=$#; i++)); do
if (( i==1 )); then
sep=""
elif (( (1<i && i<=5) || ($#-3<i && i<=$#) )); then
sep=" "
else
sep="%20"
fi
new+="$sep${!i}"
done
echo "$new"
}
while IFS= read -r line; do
reformat_line $line # unquoted variable here
done < filename
outputs
2012-11-02 23:48:36 INFO 10.2.3.23 something%20strange%20name.doc 3.0.0 view1 orientation_right
A variation on Thor's answers, but using 3 processes (4 with the cat bellow but you can get rid of it by putting your_file as the last argument of the 1st sed):
cat your_file |
sed -r -e 's/ (([0-9]+\.){3}[0-9]+) +(.*\.doc) / \1\n\3\n/' |
sed -e '2~3s/ /%20/g' |
paste -s -d " \n"
As Thor explained:
The 1st sed (s/ (([0-9]+\.){3}[0-9]+) +(.*\.doc) / \1\n\3\n/) separates the interesting bit on its own line.
And then:
The 2nd sed replaces all spaces by %20 on the 2nd line and every 3 lines.
Finally, paste it back together.
It must be noted that the 2~3 part is a GNU sed extension. If you do not have GNU sed, you can do:
cat your_file |
sed -r -e 's/ (([0-9]+\.){3}[0-9]+) +(.*\.doc) / \1\n\3\n/' |
sed -e 'N;P;s/.*\n//;s/ /%20/g;N' |
paste -s -d " \n"

Resources