Using sed to find a string with wildcards and then replacing with same wildcards - bash

So I am trying to remove newlines using sed, because it is the only way I can think of to do it. I'm completely self-taught, so there may be a more efficient way that I just don't know.
The string I am searching for is \HF=-[0-9](newline character). The problem is the data it is searching through can look like (Note: there are actual new line characters in this data, which I think is causing a bit of the problem)
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1.
3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974
,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H
,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1
.3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\
C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14
97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,-
1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-4
61.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)]
\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.1_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-
1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.69
74,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.
\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.
,1.3948,3.1\C,0,0.,-1.3948,3.1\C,0,1.2079,0.6974,3.1\C,0,-1.2079,0.697
4,3.1\C,0,-1.2079,-0.6974,3.1\C,0,1.2079,-0.6974,3.1\H,0,0.,2.4822,3.1
\H,0,2.1497,1.2411,3.1\H,0,-2.1497,1.2411,3.1\H,0,-2.1497,-1.2411,3.1\
H,0,2.1497,-1.2411,3.1\H,0,0.,-2.4822,3.1\\Version=ES64L-G09RevD.01\St
ate=1-AG\HF=-461.4104442\MP2=-463.0062587\RMSD=3.651e-09\PG=D02H [SG"(
C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.3_Slide1.7\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,-0.3052,3.3\C,0,0.,-3.0948,3.3\C,0,1.2079,-1.0026,3.3\C,0,-1.2079,-
1.0026,3.3\C,0,-1.2079,-2.3974,3.3\C,0,1.2079,-2.3974,3.3\H,0,0.,0.782
2,3.3\H,0,2.1497,-0.4589,3.3\H,0,-2.1497,-0.4589,3.3\H,0,-2.1497,-2.94
11,3.3\H,0,2.1497,-2.9411,3.3\H,0,0.,-4.1822,3.3\\Version=ES64L-G09Rev
D.01\State=1-AG\HF=-461.436061\MP2=-463.0177441\RMSD=7.859e-09\PG=C02H
[SGH(C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.6_Slide0.9\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,0.4948,3.6\C,0,0.,-2.2948,3.6\C,0,1.2079,-0.2026,3.6\C,0,-1.2079,-0
.2026,3.6\C,0,-1.2079,-1.5974,3.6\C,0,1.2079,-1.5974,3.6\H,0,0.,1.5822
,3.6\H,0,2.1497,0.3411,3.6\H,0,-2.1497,0.3411,3.6\H,0,-2.1497,-2.1411,
3.6\H,0,2.1497,-2.1411,3.6\H,0,0.,-3.3822,3.6\\Version=ES64L-G09RevD.0
1\State=1-AG\HF=-461.4376969\MP2=-463.0163868\RMSD=7.263e-09\PG=C02H [
SGH(C4H4),X(C8H8)]\\#
Basically the number I am looking for can be broken up into two lines at any point based on character count. I need to get rid of the newline breaking up the number so that I can extract the entire value into a separate file. (I have no problems with the extraction to a new file, hence why it isn't included in the code)
Currently I am using this code
sed -i ':a;N;$!ba;s/HF=-*[0-9]*\n/HF=-*[0-9]*/g' $i &&
Which ALMOST works, except it doesn't replace the wildcard values with the same values. It replaces them with the literal text [0-9] instead and doesn't always remove the newline character.
Important to note is that THERE ARE ACTUAL NEW LINE CHARACTERS in the output file, and there is no way to change that without messing up the other 30 lines I am extracting from this output file.
What I want is to just get rid of the newline characters that occur when that string is found, regardless of how many digits there are in between the - sign and the newline character.
So the expected output would be something like
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1.
3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974
,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H
,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1
.3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\
C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14
97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,-
1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-461.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)]
\\#
These files are rather large and have over 1500 executions of this line of code, so the more efficient the better.
Everything else in the script this is in is using a combination of grep, awk, sed, and basic UNIX commands.
EDIT
After trying
sed -i -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/' $i &&
I still had no luck getting rid of those pesky new line characters.
If it has any effect on the answers at all here is the rest of the code to go with the one line that is causing problems
echo name HF MP2 mpdiff | cat > allE
for i in *.out
do echo name HF MP2 mpdiff | cat > $i.allE
grep "Slide" $i | cut -d "\\" -f2 | cat | tr -d '\n' > $i.name &&
grep "EUMP2" $i | cut -d "=" -f3 | cut -c 1-25 | tr '\n' ' ' | tr -s ' ' >> $i.mp &&
grep "EUMP2" $i | cut -d "=" -f2 | cut -c 1-25 | tr '\n' ' ' | tr -s ' ' >> $i.mpdiff &&
sed -i -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/' $i &&
grep '\\HF' $i | awk -F 'HF' '{print substr($2,2,14)}' | tr '\n' ' ' >> $i.hf &&
paste $i.name >> $i.energies &&
sed -i 's/ /0 /g' $i.hf &&
sed -i 's/\\/0/g' $i.hf &&
sed -i 's/[A-Z]/0/g' $i.hf &&
paste $i.hf >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mp &&
paste $i.mp >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mpdiff &&
paste $i.mpdiff >> $i.energies &&
transpose $i.energies >> $i.allE #temp.txt &&
#cat temp.txt > $i.energies
#echo $i is finished
done
echo see allE for energies
#rm *.energies #temp.txt
rm *.name
rm *.mp
rm *.hf
rm *.mpdiff

Here is how you can fix your current attempt.
sed -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/'
Add the -i option if you want to make the changes in the file itself. (The trailing && in your script does not send the job to the background; it runs the next command only if sed succeeds.) The -E option is needed because the expression uses unescaped parentheses for grouping and ? for an optional character, which is extended regular expression syntax. Backreferences such as \1 (see below) work with or without -E.
I made the following changes. I changed -* to -? as there should be at most one minus sign. I added the period to the bracket expression so that the decimal point is matched too (in a bracket expression, the dot is a regular character). I wrapped everything except the newline in parentheses, making it a subexpression that the replacement refers to with the backreference \1.
A few notes, though. Without the g flag, only the first match in the slurped file is replaced; add g if a file holds more than one \HF= entry. This will also join the lines when the entire number is at the end of one line and only the closing \ is on the next line; if those cases should be left alone, the sed command can be changed slightly. On the other hand, this does not handle situations where the break falls elsewhere in the token, for example where one line ends in \H and the next line begins with F=304.222\. You only mentioned a "split number" in your problem statement, but shouldn't you also handle cases where the newline splits the \HF=...\ token outside the "number" portion?
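As a rough sketch of the join described above, the following wraps the command in a helper and runs it on a two-line fragment shaped like the question's data. The helper name join_hf and the sample fragment are made up for illustration. It assumes GNU sed (for \n in the pattern and ;-separated labels) and that continuation lines may begin with a space, which the added " *" removes together with the newline. The g flag is included so that every split value in the input is joined.

```shell
# join_hf: hypothetical helper; reads text on stdin, writes the joined text to stdout.
join_hf() {
    # :a;N;$!ba slurps the whole input into the pattern space.
    # The capture group keeps the partial number; the newline and any
    # leading spaces on the continuation line are dropped.
    sed -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n */\1/g'
}

# Made-up fragment: the HF value is split after its first digit.
joined=$(printf '%s\n' ' 1-AG\HF=-4' ' 61.3998608\MP2=-463.0005321' | join_hf)
printf '%s\n' "$joined"
```

Because the whole file is held in memory as one pattern space, this is probably fine for output files of a few megabytes but may be slow for very large ones.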

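It looks like your input lines start with a space; I have ignored that in this solution. The -z option (GNU sed) reads the whole file as a single record, so \n can be matched directly. The dot is included in the bracket expression so that a break after the decimal point is also matched.
sed -rz 's/(AG\\HF=-[0-9.]*)\n/\1/g' "$i"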
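If the goal is only to extract the value rather than to repair the file, one alternative (not what the answers above do) is to leave the file untouched and strip the line breaks inside a pipeline. This is a sketch under assumptions: the helper name extract_hf is made up, grep must support -o (GNU and BSD grep do), and every space and newline in the stream is deleted, which is acceptable only because the stream is thrown away after the match. Unlike the sed join, it also copes with a break inside \HF= itself.

```shell
# extract_hf: hypothetical helper; prints each HF value found on stdin.
extract_hf() {
    tr -d ' \n' |                          # glue all lines together, dropping spaces
        grep -o '\\HF=-\{0,1\}[0-9.]*' |   # keep only the \HF=<number> tokens
        cut -d= -f2                        # print the number after the equals sign
}

# Made-up fragment: the token is split both inside "HF" and inside the number.
hf=$(printf '%s\n' ' \State=1-AG\H' ' F=-461.41' ' 04442\MP2=-463.0062587' | extract_hf)
printf '%s\n' "$hf"
```

In the question's loop this would be called as extract_hf < "$i", with the file itself left unmodified for the other extractions.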

Related

Cut everything from specific char (or after) + Bash

I have files which all look like this:
filename.bla_1
Of course, I cannot know whether the filename itself contains "_"; it could be file_name.bla_1.
I want to write a function that takes a filename and deletes the _# at the end.
filename.bla_1 will be --> filename.bla
echo $filename | rev | cut -d "_" -f2 | rev
will do the trick if the file doesn't have "_" in the name, but I want to make sure this also works for filenames with "_".
You can use parameter expansion. The % removes the shortest possible pattern on the right side of the value, ## removes the longest possible match on the left:
#! /bin/bash
for f in filename.bla_1 \
file_name_with_underscores.foo_2 \
file_name_with_underscores.foo \
filename.with_dots.foo_2 ; do
ext=${f##*.}
basename=${f%.*}
echo "$basename.${ext%_*}"
done
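The same three expansions can be wrapped in a function, which is what the question asked for. This is a minimal sketch: the name strip_suffix is made up, and it assumes every filename contains at least one dot.

```shell
# strip_suffix: hypothetical helper; prints NAME without a trailing "_<something>"
# on its extension. Assumes NAME contains at least one dot.
strip_suffix() {
    name=$1
    ext=${name##*.}     # remove the longest "*." from the left: text after the last dot
    base=${name%.*}     # remove the shortest ".*" from the right: text before the last dot
    printf '%s.%s\n' "$base" "${ext%_*}"   # trim "_<something>" from the extension only
}

strip_suffix filename.bla_1
strip_suffix file_name.bla_1
```

Underscores in the part before the last dot are never touched, because only the extension is trimmed.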
If you care to tweak the globbing parser a little,
shopt -s extglob
for f in abc.bla a_b_c_.bla abc.bla_1 a_b_c_.bla_2 123.456.789 123.456.789_x abc_
do echo ${f%_+([^._])}
done
abc.bla
a_b_c_.bla
abc.bla
a_b_c_.bla
123.456.789
123.456.789
abc_
${f%_+([^._])} means the value of $f with a _ followed immediately by one or more non-dot-or-underscore characters trimmed OFF the end.
Use @choroba's answer.
But to fix your code, after you reverse the filename, you need to take the 2nd and all following fields, not just the 2nd:
$ filename=foo_bar_baz.bla_1
$ rev <<<"$filename" | cut -d_ -f2- | rev
foo_bar_baz.bla
The -f2- with the trailing hyphen is the magic here. Read the cut man page.

Bash: replace specific text with its translation

There is a huge file, in it I want to replace all the text between '=' and '\n' with its translation, here is an example:
input:
screen.LIGHT_COLOR=Lighting Color
screen.LIGHT_M=Light (Morning)
screen.AMBIENT_M=Ambient (Morning)
output:
screen.LIGHT_COLOR=Цвет Освещения
screen.LIGHT_M=Свет (Утро)
screen.AMBIENT_M=Эмбиент (Утро)
All I have managed to do until now is to extract and translate the targeted text.
while IFS= read -r line
do
echo $line | cut -d= -f2- | trans -b en:ru
done < file.txt
output:
Цвет Освещения
Свет (Утро)
Эмбиент (Утро)
*trans is short for translate-shell. It is slow, but does the job. -b for brief translation; en:ru means English to Russian.
If you have any suggestions or solutions i'll be glad to know, thanks!
edit, in case someone needs it:
After discovering translate-shell's limitations I ended up going with @TaylorG.'s suggestion. It seems that translate-shell allows around 110 requests per some period of time. Processing each line separately results in 1300 requests, which breaks the script.
Long story short, it is faster to pack all the data into a single request. It's possible to reduce the processing time from a couple of minutes to mere seconds. Sorry for the messy code, it's my third day with:
cut -s -d = -f 1 en_US.lang > option_en.txt
cut -s -d = -f 2 en_US.lang > value_en.txt
# merge lines
sed ':a; N; $!ba; s/\n/ :: /g' value_en.txt > value_en_block.txt
trans -b en:ru -i value_en_block.txt -o value_ru_block.txt
sed 's/ :: /\n/g' value_ru_block.txt > value_ru.txt
paste -d = option_en.txt value_ru.txt > ru_RU.lang
# remove temporary files
rm option_en.txt value_en.txt value_en_block.txt value_ru.txt value_ru_block.txt
Thanks Taylor G., Armali and every commentator
Running a pipeline inside a large loop is expensive, because every iteration starts new processes. You can try the following instead.
cut -s -d = -f 1 file.txt > name.txt
cut -s -d = -f 2- file.txt | trans -b en:ru > translate.txt
paste -d = name.txt translate.txt
It should be much faster than your current script. I'm not sure how your trans method is written. If it does not already handle batch input, it needs to be updated to process its input line by line, e.g. using a while loop.
trans() {
while IFS= read -r line; do
# do the translation and print the result; as a placeholder, print the line unchanged
printf '%s\n' "$line"
done
}
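The split, translate, paste idea can be tried without network access by swapping in a stand-in for the translator. In this sketch translate_stream is a hypothetical placeholder that only uppercases its input (the real call would be trans -b en:ru), and translate_values and its temporary files are likewise made up for illustration. It relies on the translator emitting exactly one output line per input line.

```shell
# Hypothetical stand-in for "trans -b en:ru": one output line per input line.
translate_stream() {
    tr '[:lower:]' '[:upper:]'
}

# translate_values FILE: print key=translated-value for every key=value line in FILE.
translate_values() {
    tv_keys=$(mktemp) || return 1
    tv_values=$(mktemp) || return 1
    cut -s -d = -f 1 "$1" > "$tv_keys"                          # text before the first "="
    cut -s -d = -f 2- "$1" | translate_stream > "$tv_values"    # everything after it, one pass
    paste -d = "$tv_keys" "$tv_values"                          # stitch the columns back together
    rm -f "$tv_keys" "$tv_values"
}
```

Called as translate_values file.txt, this starts the translator once for the whole file instead of once per line.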
You already did most of the work, though it can be optimized a bit. What's missing is just to output the first part of the line up to the equal sign together with the translation:
while IFS='=' read -r left right
do printf '%s=%s\n' "$left" "$(trans -b en:ru <<<"$right")"
done <file.txt

bash-replacing string in file, that contains special chars

As I said in the title, I'm trying to replace a string in a file that contains special characters. The idea is to loop over every line of an "infofile" that contains many lines of the form: whatiwantotreplace,replacer.
For each of those lines I want to run sed on a certain file to replace all occurrences of the string "whatiwantotreplace" with "replacer".
my code:
infofile="inforfilepath"
replacefile="replacefilepath"
while IFS= read -r line
do
what2replace="a" #$(echo "$line" | cut -d"," -f1);
replacer="b\\" #$(echo "$line" | cut -d"," -f2 );
sed -i -e "s/$what2replace/$replacer/g" "$replacefile"
#sed -i -e "s/'$what2replace'/'$replacer'/g" "$replacefile"
#sed -i -e "s#$what2replace#$replacer#g" "$replacefile"
#sed -i -e s/$what2replace/$replacer/g' "$replacefile"
#sed -i -e "s/${what2replace}/${replacer}/g" "$replacefile"
#${replacefile//what2replace/replacer}
done < "$infofile"
As you can see, the string that I want to replace and the string that I want to replace it with may contain special characters. All the commented lines are things I tried (things I saw online), but I'm still clueless.
For some I got this error:
"sed: -e expression #1, char 8: unterminated `s' command"
and for some just nothing happened.
I really need your help.
Edit: inputs and outputs:
It's hard to give inputs and outputs, because all of the variations I tried did the same thing: they didn't change anything. The only one that gave the above error is the variation with #.
Thanks for your effort.
You're barking up the wrong tree - you're trying to do literal string replacements using a tool, sed, that doesn't have functionality to handle literal strings. See Is it possible to escape regex metacharacters reliably with sed for the convoluted mess required to try to force sed to do what you want and also https://unix.stackexchange.com/q/169716/133219 for why to avoid shell loops for manipulating text.
Just use awk instead since it has literal string functions and loops implicitly itself:
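awk '
# First file (infofile): build the old -> new map, splitting on the first comma only.
NR==FNR {
c = index($0, ",")
if (c > 1) map[substr($0, 1, c-1)] = substr($0, c+1)
next
}
# Second file (replacefile): apply every literal replacement, then print the line.
{
for (old in map) {
new = map[old]
head = ""
tail = $0
while ( s = index(tail,old) ) {
head = head substr(tail,1,s-1) new
tail = substr(tail,s+length(old))
}
$0 = head tail
}
print
}
' "$infofile" "$replacefile"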
The above is untested of course since you didn't provide any sample input/output.
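To see the literal index/substr technique on a single pair of strings, here is a small sketch. The helper name replace_literal and the RL_OLD/RL_NEW environment variables are made up; the strings are passed through ENVIRON rather than -v so that awk does not interpret backslashes in them.

```shell
# replace_literal OLD NEW: hypothetical filter; replaces every literal OLD with NEW on stdin.
replace_literal() {
    RL_OLD=$1 RL_NEW=$2 awk '
        BEGIN { old = ENVIRON["RL_OLD"]; new = ENVIRON["RL_NEW"] }
        {
            head = ""
            tail = $0
            # index() finds a plain substring, so no character is special.
            while (old != "" && (s = index(tail, old)) > 0) {
                head = head substr(tail, 1, s - 1) new
                tail = substr(tail, s + length(old))
            }
            print head tail
        }'
}

printf '%s\n' 'sdev adfc xdae' | replace_literal 'a' 'b\\'
printf '%s\n' 'x.y/z & w' | replace_literal '.y/z &' '[\1]'
```

Neither the slash, the dot, the ampersand nor the backslashes need escaping here, which is the point of avoiding a regex tool for this job.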
You can try this way
what='a';to='b\\\\';echo 'sdev adfc xdae' | sed "s/${what}/${to}/g"
output
sdev b\\dfc xdb\\e

BASH: i can echo string + grep + sed, but how to add more strings on the same line?

Asking a question here is always my last resort. I tried everything, even the most embarrassing code, so I'm not sure how to explain what I tried without success. I have:
echo $output | grep -i -m 1 "Time:" | sed 's/.*\s\([0-9]*:[0-9]*:[0-9]*\).time.*/\1/'
it outputs:
23:25:31
Easy.
But i'd like to add one more string to the end, like " , $year" - so that i have:
23:25:31 , 2013
The problem is that whatever I tried (printf, -n, -e, -ne, brackets, quotes, |, ;, &, /r, etc.) gives an error or goes to a new line anyway.
Any suggestion will be really appreciated.
Thanks
time=$(echo $output | grep -i -m 1 "Time:" | sed 's/.*\s\([0-9]*:[0-9]*:[0-9]*\).time.*/\1/')
echo "The time is ${time}, 2013"
Alternatives:
Add tr -d '\n' at the end of the echo+grep+sed pipeline.
{ entire-echo-grep-sed-pipeline ; echo , 2013 ; } | xargs echo (this, however, will add a space before the comma).
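The first approach works because command substitution strips the trailing newline from whatever the pipeline prints, so the captured text can be placed anywhere on a line. A self-contained sketch follows; the content of output and the value of year are invented for the demonstration, since the question does not show them, and [[:space:]] is used in place of \s for portability.

```shell
# Invented sample data; the real $output is not shown in the question.
output='Start Time: 23:25:31 time elapsed'
year=2013

# $(...) captures the pipeline output without its trailing newline.
clock=$(printf '%s\n' "$output" |
    grep -i -m 1 'Time:' |
    sed 's/.*[[:space:]]\([0-9]*:[0-9]*:[0-9]*\).time.*/\1/')

line="$clock , $year"
printf '%s\n' "$line"
```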

modify the contents of a file without a temp file

I have the following log file which contains lines like this
1345447800561|FINE|blah#13|txReq
1345447800561|FINE|blah#13|Req
1345447800561|FINE|blah#13|rxReq
1345447800561|FINE|blah#14|txReq
1345447800561|FINE|blah#15|Req
I am trying to extract the first field from each line and, depending on whether it belongs to blah#13, blah#14 or blah#15, create the corresponding files using the following script, which seems quite inefficient in terms of the number of temp files it creates. Any suggestions on how I can optimize it?
cat newLog | grep -i "org.arl.unet.maca.blah#13" >> maca13
cat newLog | grep -i "org.arl.unet.maca.blah#14" >> maca14
cat newLog | grep -i "org.arl.unet.maca.blah#15" >> maca15
cat maca10 | grep -i "txReq" >> maca10TxFrameNtf_temp
exec<blah10TxFrameNtf_temp
while read line
do
echo $line | cut -d '|' -f 1 >>maca10TxFrameNtf
done
cat maca10 | grep -i "Req" >> maca10RxFrameNtf_temp
while read line
do
echo $line | cut -d '|' -f 1 >>maca10TxFrameNtf
done
rm -rf *_temp
Something like this ?
for m in org.arl.unet.maca.blah#13 org.arl.unet.maca.blah#14 org.arl.unet.maca.blah#15
do
grep -i "$m" newLog | grep "txReq" | cut -d'|' -f1 > "log.$m"
done
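A single awk pass can do the same split without any intermediate files and without reading the log once per source. This is only a sketch: the helper name split_tx_timestamps and the log.<source> output names are made up, and it assumes the third |-separated field identifies the source and the fourth is exactly txReq, as in the sample lines.

```shell
# split_tx_timestamps LOG DIR: hypothetical helper; for every txReq line in LOG,
# append its first field (the timestamp) to DIR/log.<third field>.
split_tx_timestamps() {
    awk -F'|' -v dir="$2" '
        $4 == "txReq" {
            out = dir "/log." $3   # one output file per source, e.g. log.blah#13
            print $1 > out         # awk keeps the file open across lines
        }
    ' "$1"
}
```

For example, split_tx_timestamps newLog . writes log.blah#13 and log.blah#14 into the current directory for the sample lines shown in the question.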
I've found it useful at times to use ex instead of grep/sed to modify text files in place without using temps ... saves the trouble of worrying about uniqueness and writability to the temp file and its directory etc. Plus it just seemed cleaner.
In ksh I would use a code block with the edit commands and just pipe that into ex ...
{
# Any edit command that would work at the colon prompt of a vi editor will work
# This one was just a text substitution that would replace all contents of the line
# at line number ${NUMBER} with the word DATABASE ... which strangely enough was
# necessary at one time lol
# The wq is the "write/quit" command as you would enter it at the vi colon prompt
# which are essentially ex commands.
print "${NUMBER}s/.*/DATABASE/"
print "wq"
} | ex filename > /dev/null 2>&1
