Bash: replace specific text with its translation - bash

There is a huge file, in it I want to replace all the text between '=' and '\n' with its translation, here is an example:
input:
screen.LIGHT_COLOR=Lighting Color
screen.LIGHT_M=Light (Morning)
screen.AMBIENT_M=Ambient (Morning)
output:
screen.LIGHT_COLOR=Цвет Освещения
screen.LIGHT_M=Свет (Утро)
screen.AMBIENT_M=Эмбиент (Утро)
All I have managed to do until now is to extract and translate the targeted text.
while IFS= read -r line
do
echo $line | cut -d= -f2- | trans -b en:ru
done < file.txt
output:
Цвет Освещения
Свет (Утро)
Эмбиент (Утро)
*trans is short for translate-shell. It is slow, but does the job. -b for brief translation; en:ru means English to Russian.
If you have any suggestions or solutions i'll be glad to know, thanks!
edit, in case someone needs it:
After discovering trans-shell limitations I ended up going with the #TaylorG. suggestion. It is seams that translation-shell allows around 110 request per some time. Processing each line seperatly results in 1300 requests, which breaks the script.
long story short, it is faster to pack all the data into a single request. Its possible to reduce processing time from couple of minutes to mere seconds. sorry for the messy code, it's my third day with:
cut -s -d = -f 1 en_US.lang > option_en.txt
cut -s -d = -f 2 en_US.lang > value_en.txt
# merge lines
sed ':a; N; $!ba; s/\n/ :: /g' value_en.txt > value_en_block.txt
trans -b en:ru -i value_en_block.txt -o value_ru_block.txt
sed 's/ :: /\n/g' value_ru_block.txt > value_ru.txt
paste -d = option_en.txt value_ru.txt > ru_RU.lang
# remove trmporary files
rm option_en.txt value_en.txt value_en_block.txt value_ru.txt value_ru_block.txt
Thanks Taylor G., Armali and every commentator

Using pipe in a large loop is expensive. You can try the following instead.
cut -s -d = -f 1 file.txt > name.txt
cut -s -d = -f 2- file.txt | trans -b en:ru > translate.txt
paste -d = name.txt translate.txt
It shall be much faster than your current script. I'm not sure how your trans method is written. It needs to be updated to process batch input if it's not, e.g. using a while loop.
trans() {
while read -r line; do
# do translate and print result
done
}

You already did most of the work, though it can be optimized a bit. What's missing is just to output the first part of the line up to the equal sign together with the translation:
while IFS== read left right
do echo $left=`trans -b en:ru <<<$right`
done <file.txt

Related

Using sed to find a string with wildcards and then replacing with same wildcards

So I am trying to remove new lines using sed, because it the only way I can think of to do it. I'm completely self taught so there may be a more efficient way that I just don't know.
The string I am searching for is \HF=-[0-9](newline character). The problem is the data it is searching through can look like (Note: there are actual new line characters in this data, which I think is causing a bit of the problem)
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1.
3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974
,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H
,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1
.3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\
C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14
97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,-
1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-4
61.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)]
\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.1_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-
1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.69
74,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.
\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.
,1.3948,3.1\C,0,0.,-1.3948,3.1\C,0,1.2079,0.6974,3.1\C,0,-1.2079,0.697
4,3.1\C,0,-1.2079,-0.6974,3.1\C,0,1.2079,-0.6974,3.1\H,0,0.,2.4822,3.1
\H,0,2.1497,1.2411,3.1\H,0,-2.1497,1.2411,3.1\H,0,-2.1497,-1.2411,3.1\
H,0,2.1497,-1.2411,3.1\H,0,0.,-2.4822,3.1\\Version=ES64L-G09RevD.01\St
ate=1-AG\HF=-461.4104442\MP2=-463.0062587\RMSD=3.651e-09\PG=D02H [SG"(
C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.3_Slide1.7\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,-0.3052,3.3\C,0,0.,-3.0948,3.3\C,0,1.2079,-1.0026,3.3\C,0,-1.2079,-
1.0026,3.3\C,0,-1.2079,-2.3974,3.3\C,0,1.2079,-2.3974,3.3\H,0,0.,0.782
2,3.3\H,0,2.1497,-0.4589,3.3\H,0,-2.1497,-0.4589,3.3\H,0,-2.1497,-2.94
11,3.3\H,0,2.1497,-2.9411,3.3\H,0,0.,-4.1822,3.3\\Version=ES64L-G09Rev
D.01\State=1-AG\HF=-461.436061\MP2=-463.0177441\RMSD=7.859e-09\PG=C02H
[SGH(C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.6_Slide0.9\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,0.4948,3.6\C,0,0.,-2.2948,3.6\C,0,1.2079,-0.2026,3.6\C,0,-1.2079,-0
.2026,3.6\C,0,-1.2079,-1.5974,3.6\C,0,1.2079,-1.5974,3.6\H,0,0.,1.5822
,3.6\H,0,2.1497,0.3411,3.6\H,0,-2.1497,0.3411,3.6\H,0,-2.1497,-2.1411,
3.6\H,0,2.1497,-2.1411,3.6\H,0,0.,-3.3822,3.6\\Version=ES64L-G09RevD.0
1\State=1-AG\HF=-461.4376969\MP2=-463.0163868\RMSD=7.263e-09\PG=C02H [
SGH(C4H4),X(C8H8)]\\#
Basically the number I am looking for can be broken up into two lines at any point based on character count. I need to get rid of the newline breaking up the number so that I can extract the entire value into a separate file. (I have no problems with the extraction to a new file, hence why it isn't included in the code)
Currently I am using this code
sed -i ':a;N;$!ba;s/HF=-*[0-9]*\n/HF=-*[0-9]*/g' $i &&
Which ALMOST works, expect it doesn't replace the wildcard values with the same values. It replaces it with the actual text [0-9] instead and doesn't always remove the new line character.
Important to the is that THERE ARE ACTUAL NEW LINE CHARACTERS in the output file and there is no way to change that without messing up the other 30 lines I am extracting from this output file.
What I want is to just get rid of the newline characters that occur when that string is found, regardless of how many digits there are in between the - sign and the newline character.
So the expected output would be something like
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1.
3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974
,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H
,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1
.3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\
C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14
97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,-
1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-461.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)]
\\#
These files are rather large and have over 1500 executions of this line of code, so the more efficient the better.
Everything else in the script this is in is using a combination of grep, awk, sed, and basic UNIX commands.
EDIT
After trying
sed -i -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/' $i &&
I still had no luck getting rid of those pesky new line characters.
If it has any effect on the answers at all here is the rest of the code to go with the one line that is causing problems
echo name HF MP2 mpdiff | cat > allE
for i in *.out
do echo name HF MP2 mpdiff | cat > $i.allE
grep "Slide" $i | cut -d "\\" -f2 | cat | tr -d '\n' > $i.name &&
grep "EUMP2" $i | cut -d "=" -f3 | cut -c 1-25 | tr '\n' ' ' | tr -s ' ' >> $i.mp &&
grep "EUMP2" $i | cut -d "=" -f2 | cut -c 1-25 | tr '\n' ' ' | tr -s ' ' >> $i.mpdiff &&
sed -i -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/' $i &&
grep '\\HF' $i | awk -F 'HF' '{print substr($2,2,14)}' | tr '\n' ' ' >> $i.hf &&
paste $i.name >> $i.energies &&
sed -i 's/ /0 /g' $i.hf &&
sed -i 's/\\/0/g' $i.hf &&
sed -i 's/[A-Z]/0/g' $i.hf &&
paste $i.hf >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mp &&
paste $i.mp >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mpdiff &&
paste $i.mpdiff >> $i.energies &&
transpose $i.energies >> $i.allE #temp.txt &&
#cat temp.txt > $i.energies
#echo $i is finished
done
echo see allE for energies
#rm *.energies #temp.txt
rm *.name
rm *.mp
rm *.hf
rm *.mpdiff
Here is how you can fix your current attempt.
sed -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/'
Add the i flag if you want to make the changes on the file itself, add && to send the job to the background, etc. The -E flag is needed, because backreferences (see below) are part of extended regular expressions.
I made the following changes: I changed -* to -? as there should be at most one dash (if I understand correctly and that is in fact a minus sign, not a dash). I added the period to the bracket expression, so that the decimal point would be matched too. (Note that in a bracket expression, the dot is a regular character). I wrapped the whole thing except the newline in parentheses - making it into a subexpression, which you can refer to with a backreference - which is what I did in the replacement part.
A few notes though - this will join the lines even if the entire number is at the end of one line, but not followed by the closing \. If in fact the entire number being on one line, but the closing \ is on the next line, you can change the sed command slightly, to leave those alone. On the other hand, this does not handle situations where, for example, one line ends in \H and the next line begins with F=304.222\ You only mentioned "split number" in your problem statement; shouldn't you, though, also handle such cases, where the newline splits the \HF=...\ token, just not in the "number" portion of the token?
It looks like your input lines start with a space. I have ignored them in this solution.
sed -rz 's/(AG\\HF=-[0-9]*)\n/\1/g' "$i"

Alternating output in bash for loop from two grep

I'm trying to search through files and extract two pieces of relevant information every time they appear in the file. The code I currently have:
#!/bin/bash
echo "Utilized reads from ustacks output" > reads.txt
str1="utilized reads:"
str2="Parsing"
for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
reads=$(grep $str1 $file | cut -d ':' -f 3
samples=$(grep $str2 $file | cut -d '/' -f 8
echo $samples $reads >> reads.txt
done
It is doing each line for the file (the files have varying numbers of instances of these phrases) and gives me the output per row for each file:
PopA_15.fq 1081264
PopA_16.fq PopA_17.fq 1008416 554791
PopA_18.fq PopA_20.fq PopA_21.fq 604610 531227 595129
...
I want it to match each instance (i.e. 1st instance of both greps next two each other):
PopA_15.fq 1081264
PopA_16.fq 1008416
PopA_17.fq 554791
PopA_18.fq 604610
PopA_20.fq 531227
PopA_21.fq 595129
...
How do I do this? Thank you
Considering that your Input_file is same as sample shown and number of columns are even on each line with 1 PopA value and other will be with digit values. Following awk may help you in same.
awk '{for(i=1;i<=(NF/2);i++){print $i,$((NF/2)+i)}}' Input_file
Output will be as follows.
PopA_15.fq 1081264
PopA_16.fq 1008416
PopA_17.fq 554791
PopA_18.fq 604610
PopA_20.fq 531227
PopA_21.fq 595129
In case you want to pass output of a command to awk command then you could do like your command | awk command... no need to add Input_file to above awk command.
This is what ended up working for me...any tips for more efficient code are definitely welcome
#!/bin/bash
echo "Utilized reads from ustacks output" > reads.txt
str1="utilized reads:"
str2="Parsing"
for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
reads=$(grep $str1 $file | cut -d ':' -f 3)
samples=$(grep $str2 $file | cut -d '/' -f 8)
paste <(echo "$samples" | column -t) <(echo "$reads" | column -t) >> reads.txt
done
This provides the desired output described above.

KSH: Loop performance

I need to process a file with approximately 120k lines that has the following format using ksh:
"[UserId=USER1]";"Client=001";"Locked_Status=0";"TYPE=A";"Last_Logon=00000000";"Valid_To=99991231";"Password_Change=20120131";"Last_Password_Change=29990"
"[UserId=USER2]";"Client=000";"Locked_Status=0";"TYPE=A";"Last_Logon=20141020";"Valid_To=00000000";"Password_Change=20140620";"Last_Password_Change=9501"
"[UserId=USER3]";"Client=002";"Locked_Status=0";"TYPE=A";"Last_Logon=00000000";"Valid_To=99991231";"Password_Change=20140304";"Last_Password_Change=9817"
The output should be something like:
[UserId=USER1] Client=001
Locked_Status=0
TYPE=A
Last_Logon=00000000
Valid_To=99991231
Password_Change=20120131
Last_Password_Change=29985
[UserId=USER2]
Client=000
Locked_Status=0
TYPE=A
Last_Logon=20141020
Valid_To=00000000
Password_Change=20140620
Last_Password_Change=9496
[UserId=User3]
Client=002
Locked_Status=0
TYPE=A
Last_Logon=00000000
Valid_To=99991231
Password_Change=20140304
Last_Password_Change=9812
I initially used the following code do process the file:
for a in $(<$1)
do
a=$(echo $a|sed -e 's/;/ /g' -e 's/"//g')
for b in $a
do
print $b
done
done
It was taking around 3hrs to process 120k lines.
Then I tried to improved the code changing it to the following:
for a in $(<$1)
do
printf "\n$(echo $a|sed -e 's/"//g' -e 's/;/\\n/g')"
done
That gave me 2hrs processing time however it still takes too long to process 120k lines
At last I tried this code which processed the 120k lines in 3secs!
perl -ne '
chomp;
s/\"//g;
s/;/\n/g;
print;
' <$1
Is there anyway I can improve the code in KSH to achieve similar performance? I believe that I must be missing something in my KSH code... Help me to find out please.
Thanks in advance
How about:
tr ';' '\n' < file | tr -d '"'
Your code is assigning every whitespace-delimited word to variable "a" in turn, and thus invoking sed once for each word in the file. Clearly a lot of accumulated overhead spawning all those processes. The idiom to iterate over the lines of a file is:
while IFS= read -r line; do ...; done < file
You suggestion worked perfectly
Host> wc -l /tmp/MyTest
114449 /tmp/MyTest
Host> time tr ';' '\n' < /tmp/MyTest | tr -d '"' > /tmp/zuza.out
real 0m1.04s
user 0m1.06s
sys 0m0.08s
Host> time perl -ne '
chomp;
s/\"//g;
s/;/\n/g;
print "\n$_";
' </tmp/MyTest > /tmp/zuza
real 0m1.30s
user 0m0.60s
sys 0m0.08s

sh random error when saving function output

Im trying to make a script that takes a .txt file containing lines
like:
davda103:David:Davidsson:800104-1234:TNCCC_1:TDDB46 TDDB80:
and then sort them etc. Thats just the background my problem lies here:
#!/bin/sh -x
cat $1 |
while read a
do
testsak = `echo $a | cut -f 1 -d :`; <---**
echo $testsak;
done
Where the arrow is, when I try to run this code I get some kind of weird error.
+ read a
+ cut -f+ echo 1 -d :davda103:David:Davidsson:800104-1234:TNCCC_1:TDDB46
TDDB80:
+ testsak = davda103
scriptTest.sh: testsak: Det går inte att hitta
+ echo
(I have my linux in swedish because school -.-) Anyways that error just says that it cant find... something. Any ideas what could be causing my problem?
You have extra spaces around the assignment operator, remove them:
testsak=`echo $a | cut -f 1 -d :`; <---**
The spaces around the equal sign
testsak = `echo $a | cut -f 1 -d :`; <---**
causes bash to interpret this as a command testak with arguments = and the result of the command substitution. Removing the spaces will fix the immediate error.
A much more efficient way to extract the value from a is to let read do it (and use input redirection instead of cat):
while IFS=: read testak the_rest; do
echo $testak
done < "$1"

modify the contents of a file without a temp file

I have the following log file which contains lines like this
1345447800561|FINE|blah#13|txReq
1345447800561|FINE|blah#13|Req
1345447800561|FINE|blah#13|rxReq
1345447800561|FINE|blah#14|txReq
1345447800561|FINE|blah#15|Req
I am trying extract the first field from each line and depending on whether it belongs to blah#13 or blah#14, blah#15 i am creating the corresponding files using the following script, which seems quite in-efficient in terms of the number of temp files creates. Any suggestions on how I can optimize it ?
cat newLog | grep -i "org.arl.unet.maca.blah#13" >> maca13
cat newLog | grep -i "org.arl.unet.maca.blah#14" >> maca14
cat newLog | grep -i "org.arl.unet.maca.blah#15" >> maca15
cat maca10 | grep -i "txReq" >> maca10TxFrameNtf_temp
exec<blah10TxFrameNtf_temp
while read line
do
echo $line | cut -d '|' -f 1 >>maca10TxFrameNtf
done
cat maca10 | grep -i "Req" >> maca10RxFrameNtf_temp
while read line
do
echo $line | cut -d '|' -f 1 >>maca10TxFrameNtf
done
rm -rf *_temp
Something like this ?
for m in org.arl.unet.maca.blah#13 org.arl.unet.maca.blah#14 org.arl.unet.maca.blah#15
do
grep -i "$m" newLog | grep "txReq" | cut -d' ' -f1 > log.$m
done
I've found it useful at times to use ex instead of grep/sed to modify text files in place without using temps ... saves the trouble of worrying about uniqueness and writability to the temp file and its directory etc. Plus it just seemed cleaner.
In ksh I would use a code block with the edit commands and just pipe that into ex ...
{
# Any edit command that would work at the colon prompt of a vi editor will work
# This one was just a text substitution that would replace all contents of the line
# at line number ${NUMBER} with the word DATABASE ... which strangely enough was
# necessary at one time lol
# The wq is the "write/quit" command as you would enter it at the vi colon prompt
# which are essentially ex commands.
print "${NUMBER}s/.*/DATABASE/"
print "wq"
} | ex filename > /dev/null 2>&1

Resources