Bash: replace specific text with its translation
I have a huge file in which I want to replace all the text between '=' and '\n' with its translation. Here is an example:
input:
screen.LIGHT_COLOR=Lighting Color
screen.LIGHT_M=Light (Morning)
screen.AMBIENT_M=Ambient (Morning)
output:
screen.LIGHT_COLOR=Цвет Освещения
screen.LIGHT_M=Свет (Утро)
screen.AMBIENT_M=Эмбиент (Утро)
All I have managed to do until now is to extract and translate the targeted text.
while IFS= read -r line
do
echo $line | cut -d= -f2- | trans -b en:ru
done < file.txt
output:
Цвет Освещения
Свет (Утро)
Эмбиент (Утро)
*trans is short for translate-shell. It is slow, but does the job. -b for brief translation; en:ru means English to Russian.
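For anyone who hasn't used the tool, a single one-off call looks roughly like this (assuming translate-shell is installed as trans):

trans -b en:ru "Lighting Color"
# prints the Russian translation, e.g. Цвет Освещения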
If you have any suggestions or solutions, I'll be glad to hear them. Thanks!
Edit, in case someone needs it:
After discovering translate-shell's limitations I ended up going with Taylor G.'s suggestion. It seems that translate-shell allows around 110 requests per some time window, and processing each line separately results in 1300 requests, which breaks the script.
Long story short, it is faster to pack all the data into a single request; that reduces processing time from a couple of minutes to mere seconds. Sorry for the messy code, it's my third day with bash:
cut -s -d = -f 1 en_US.lang > option_en.txt
cut -s -d = -f 2 en_US.lang > value_en.txt
# merge lines
sed ':a; N; $!ba; s/\n/ :: /g' value_en.txt > value_en_block.txt
trans -b en:ru -i value_en_block.txt -o value_ru_block.txt
sed 's/ :: /\n/g' value_ru_block.txt > value_ru.txt
paste -d = option_en.txt value_ru.txt > ru_RU.lang
# remove temporary files
rm option_en.txt value_en.txt value_en_block.txt value_ru.txt value_ru_block.txt
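For reference, the same batching idea can be written without most of the temporary files by using process substitution. A rough sketch under the same assumption, namely that the translator leaves the ' :: ' separators intact:

paste -d= <(cut -s -d= -f1 en_US.lang) \
          <(cut -s -d= -f2- en_US.lang | sed ':a; N; $!ba; s/\n/ :: /g' | trans -b en:ru | sed 's/ :: /\n/g') > ru_RU.lang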
Thanks to Taylor G., Armali and every commenter.
Using a pipe inside a large loop is expensive. You can try the following instead.
cut -s -d = -f 1 file.txt > name.txt
cut -s -d = -f 2- file.txt | trans -b en:ru > translate.txt
paste -d = name.txt translate.txt
It should be much faster than your current script. I'm not sure how your trans method is written; if it doesn't already process batch input, it needs to be updated to do so, e.g. using a while loop:
trans() {
while read -r line; do
# do translate and print result
done
}
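Note that if trans here is translate-shell itself, it already accepts multi-line input on standard input (or from a file via -i), so a wrapper like the above may not even be needed. A quick way to try it with the sample strings:

printf '%s\n' 'Lighting Color' 'Light (Morning)' | trans -b en:ru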
You already did most of the work, though it can be optimized a bit. What's missing is just to output the first part of the line up to the equal sign together with the translation:
while IFS== read -r left right
do echo "$left=$(trans -b en:ru <<<"$right")"
done <file.txt
Related
Using sed to find a string with wildcards and then replacing with same wildcards
So I am trying to remove newlines using sed, because it is the only way I can think of to do it. I'm completely self-taught, so there may be a more efficient way that I just don't know. The string I am searching for is \HF=-[0-9](newline character). The problem is that the data it is searching through can look like this (note: there are actual newline characters in this data, which I think is causing part of the problem):

1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc- pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1. 3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974 ,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H ,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1 .3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\ C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14 97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,- 1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-4 61.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)] \\#

OR

1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc- pVDZ\\Squish3.1_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,- 1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.69 74,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0. \H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0. ,1.3948,3.1\C,0,0.,-1.3948,3.1\C,0,1.2079,0.6974,3.1\C,0,-1.2079,0.697 4,3.1\C,0,-1.2079,-0.6974,3.1\C,0,1.2079,-0.6974,3.1\H,0,0.,2.4822,3.1 \H,0,2.1497,1.2411,3.1\H,0,-2.1497,1.2411,3.1\H,0,-2.1497,-1.2411,3.1\ H,0,2.1497,-1.2411,3.1\H,0,0.,-2.4822,3.1\\Version=ES64L-G09RevD.01\St ate=1-AG\HF=-461.4104442\MP2=-463.0062587\RMSD=3.651e-09\PG=D02H [SG"( C4H4),X(C8H8)]\\#

OR

1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc- pVDZ\\Squish3.3_Slide1.7\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0. ,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0. 6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411, 0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0, 0.,-0.3052,3.3\C,0,0.,-3.0948,3.3\C,0,1.2079,-1.0026,3.3\C,0,-1.2079,- 1.0026,3.3\C,0,-1.2079,-2.3974,3.3\C,0,1.2079,-2.3974,3.3\H,0,0.,0.782 2,3.3\H,0,2.1497,-0.4589,3.3\H,0,-2.1497,-0.4589,3.3\H,0,-2.1497,-2.94 11,3.3\H,0,2.1497,-2.9411,3.3\H,0,0.,-4.1822,3.3\\Version=ES64L-G09Rev D.01\State=1-AG\HF=-461.436061\MP2=-463.0177441\RMSD=7.859e-09\PG=C02H [SGH(C4H4),X(C8H8)]\\#

OR

1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc- pVDZ\\Squish3.6_Slide0.9\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0. ,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0. 6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411, 0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0, 0.,0.4948,3.6\C,0,0.,-2.2948,3.6\C,0,1.2079,-0.2026,3.6\C,0,-1.2079,-0 .2026,3.6\C,0,-1.2079,-1.5974,3.6\C,0,1.2079,-1.5974,3.6\H,0,0.,1.5822 ,3.6\H,0,2.1497,0.3411,3.6\H,0,-2.1497,0.3411,3.6\H,0,-2.1497,-2.1411, 3.6\H,0,2.1497,-2.1411,3.6\H,0,0.,-3.3822,3.6\\Version=ES64L-G09RevD.0 1\State=1-AG\HF=-461.4376969\MP2=-463.0163868\RMSD=7.263e-09\PG=C02H [ SGH(C4H4),X(C8H8)]\\#

Basically, the number I am looking for can be broken up into two lines at any point, based on character count.
I need to get rid of the newline breaking up the number so that I can extract the entire value into a separate file. (I have no problems with the extraction to a new file, hence why it isn't included in the code.) Currently I am using this code:

sed -i ':a;N;$!ba;s/HF=-*[0-9]*\n/HF=-*[0-9]*/g' $i &&

which ALMOST works, except it doesn't replace the wildcard values with the same values. It replaces them with the actual text [0-9] instead, and doesn't always remove the newline character. Important to this is that THERE ARE ACTUAL NEWLINE CHARACTERS in the output file, and there is no way to change that without messing up the other 30 lines I am extracting from this output file. What I want is to just get rid of the newline characters that occur when that string is found, regardless of how many digits there are between the - sign and the newline character. So the expected output would be something like:

1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc- pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1. 3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974 ,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H ,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1 .3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\ C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14 97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,- 1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-461.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)] \\#

These files are rather large and have over 1500 executions of this line of code, so the more efficient the better. Everything else in the script this is part of uses a combination of grep, awk, sed, and basic UNIX commands.

EDIT: After trying

sed -i -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/' $i &&

I still had no luck getting rid of those pesky newline characters. If it has any effect on the answers at all, here is the rest of the code to go with the one line that is causing problems:

echo name HF MP2 mpdiff | cat > allE
for i in *.out
do
echo name HF MP2 mpdiff | cat > $i.allE
grep "Slide" $i | cut -d "\\" -f2 | cat | tr -d '\n' > $i.name &&
grep "EUMP2" $i | cut -d "=" -f3 | cut -c 1-25 | tr '\n' ' ' | tr -s ' ' >> $i.mp &&
grep "EUMP2" $i | cut -d "=" -f2 | cut -c 1-25 | tr '\n' ' ' | tr -s ' ' >> $i.mpdiff &&
sed -i -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/' $i &&
grep '\\HF' $i | awk -F 'HF' '{print substr($2,2,14)}' | tr '\n' ' ' >> $i.hf &&
paste $i.name >> $i.energies &&
sed -i 's/ /0 /g' $i.hf &&
sed -i 's/\\/0/g' $i.hf &&
sed -i 's/[A-Z]/0/g' $i.hf &&
paste $i.hf >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mp &&
paste $i.mp >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mpdiff &&
paste $i.mpdiff >> $i.energies &&
transpose $i.energies >> $i.allE #temp.txt &&
#cat temp.txt > $i.energies
#echo $i is finished
done
echo see allE for energies
#rm *.energies #temp.txt
rm *.name
rm *.mp
rm *.hf
rm *.mpdiff
Here is how you can fix your current attempt:

sed -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/'

Add the -i flag if you want to make the changes to the file itself, append && to chain it with the rest of your script, and so on. The -E flag is needed because backreferences (see below) are part of extended regular expressions.

I made the following changes:

- I changed -* to -?, as there should be at most one dash (if I understand correctly that it is in fact a minus sign, not a dash).
- I added the period to the bracket expression, so that the decimal point is matched too. (Note that in a bracket expression the dot is a regular character.)
- I wrapped everything except the newline in parentheses, making it a subexpression that can be referred to with a backreference, which is what I did in the replacement part.

A few notes, though: this will join the lines even if the entire number is at the end of one line but not followed by the closing \. If the entire number is in fact on one line and only the closing \ is on the next line, you can change the sed command slightly to leave those cases alone. On the other hand, this does not handle situations where, for example, one line ends in \H and the next line begins with F=304.222\. You only mentioned a "split number" in your problem statement; shouldn't you also handle such cases, where the newline splits the \HF=...\ token just not in the "number" portion of it?
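On that last point, one way to sidestep the question of where the wrap falls entirely is not to repair the file in place at all, but to join every physical line on the fly and pull the value out of the joined stream. A rough sketch, assuming GNU grep for -o and that the wrapped lines carry nothing extra (strip a leading space first, e.g. with sed 's/^ //', if they do):

# undo all line wrapping, then extract each HF value from the single long record
tr -d '\n' < "$i" | grep -o 'HF=-[.0-9]*'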
It looks like your input lines start with a space. I have ignored them in this solution.

sed -rz 's/(AG\\HF=-[0-9]*)\n/\1/g' "$i"
Alternating output in bash for loop from two grep
I'm trying to search through files and extract two pieces of relevant information every time they appear in the file. The code I currently have:

#!/bin/bash
echo "Utilized reads from ustacks output" > reads.txt
str1="utilized reads:"
str2="Parsing"
for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
reads=$(grep $str1 $file | cut -d ':' -f 3)
samples=$(grep $str2 $file | cut -d '/' -f 8)
echo $samples $reads >> reads.txt
done

It is doing this per file (the files have varying numbers of instances of these phrases) and gives me one row of output for each file:

PopA_15.fq 1081264
PopA_16.fq PopA_17.fq 1008416 554791
PopA_18.fq PopA_20.fq PopA_21.fq 604610 531227 595129
...

I want it to match each instance (i.e. the 1st instance of both greps next to each other):

PopA_15.fq 1081264
PopA_16.fq 1008416
PopA_17.fq 554791
PopA_18.fq 604610
PopA_20.fq 531227
PopA_21.fq 595129
...

How do I do this? Thank you.
Considering that your Input_file is the same as the sample shown, with an even number of columns on each line (one PopA value for each digit value), the following awk may help:

awk '{for(i=1;i<=(NF/2);i++){print $i,$((NF/2)+i)}}' Input_file

Output will be as follows:

PopA_15.fq 1081264
PopA_16.fq 1008416
PopA_17.fq 554791
PopA_18.fq 604610
PopA_20.fq 531227
PopA_21.fq 595129

In case you want to pass the output of a command to the awk command, you could do your_command | awk '...' instead; no need to add Input_file to the above awk command.
This is what ended up working for me... any tips for more efficient code are definitely welcome.

#!/bin/bash
echo "Utilized reads from ustacks output" > reads.txt
str1="utilized reads:"
str2="Parsing"
for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
reads=$(grep $str1 $file | cut -d ':' -f 3)
samples=$(grep $str2 $file | cut -d '/' -f 8)
paste <(echo "$samples" | column -t) <(echo "$reads" | column -t) >> reads.txt
done

This provides the desired output described above.
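Since the files can be large, it may also be worth reading each one only once instead of grepping it twice. A sketch with awk, assuming each Parsing line appears before its matching utilized reads line and the same field positions as in your cut commands:

for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
    awk -F'/' '/Parsing/         { name = $8 }
               /utilized reads:/ { split($0, a, ":"); print name, a[3] }' "$file"
done >> reads.txt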
KSH: Loop performance
I need to process a file with approximately 120k lines that has the following format, using ksh:

"[UserId=USER1]";"Client=001";"Locked_Status=0";"TYPE=A";"Last_Logon=00000000";"Valid_To=99991231";"Password_Change=20120131";"Last_Password_Change=29990"
"[UserId=USER2]";"Client=000";"Locked_Status=0";"TYPE=A";"Last_Logon=20141020";"Valid_To=00000000";"Password_Change=20140620";"Last_Password_Change=9501"
"[UserId=USER3]";"Client=002";"Locked_Status=0";"TYPE=A";"Last_Logon=00000000";"Valid_To=99991231";"Password_Change=20140304";"Last_Password_Change=9817"

The output should be something like:

[UserId=USER1]
Client=001
Locked_Status=0
TYPE=A
Last_Logon=00000000
Valid_To=99991231
Password_Change=20120131
Last_Password_Change=29985
[UserId=USER2]
Client=000
Locked_Status=0
TYPE=A
Last_Logon=20141020
Valid_To=00000000
Password_Change=20140620
Last_Password_Change=9496
[UserId=User3]
Client=002
Locked_Status=0
TYPE=A
Last_Logon=00000000
Valid_To=99991231
Password_Change=20140304
Last_Password_Change=9812

I initially used the following code to process the file:

for a in $(<$1)
do
a=$(echo $a|sed -e 's/;/ /g' -e 's/"//g')
for b in $a
do
print $b
done
done

It was taking around 3 hrs to process 120k lines. Then I tried to improve the code by changing it to the following:

for a in $(<$1)
do
printf "\n$(echo $a|sed -e 's/"//g' -e 's/;/\\n/g')"
done

That gave me 2 hrs of processing time; however, it still takes too long to process 120k lines. At last I tried this code, which processed the 120k lines in 3 secs!

perl -ne ' chomp; s/\"//g; s/;/\n/g; print; ' <$1

Is there any way I can improve the code in KSH to achieve similar performance? I believe I must be missing something in my KSH code... please help me find it. Thanks in advance.
How about:

tr ';' '\n' < file | tr -d '"'

Your code is assigning every whitespace-delimited word to variable "a" in turn, and thus invoking sed once for each word in the file. Clearly a lot of accumulated overhead spawning all those processes. The idiom to iterate over the lines of a file is:

while IFS= read -r line; do ...; done < file
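If you do want to stay with a pure-ksh loop (say, to do more per record later), the key is to avoid spawning sed for every line; ksh93 parameter expansion can do both substitutions in-process. A sketch; it will still be slower than tr, but nowhere near hours:

nl=$'\n'
while IFS= read -r line; do
    line=${line//\"/}              # drop the double quotes
    print -r -- "${line//;/$nl}"   # turn each ';' into a newline
done < "$1"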
Your suggestion worked perfectly:

Host> wc -l /tmp/MyTest
114449 /tmp/MyTest
Host> time tr ';' '\n' < /tmp/MyTest | tr -d '"' > /tmp/zuza.out

real 0m1.04s
user 0m1.06s
sys 0m0.08s

Host> time perl -ne ' chomp; s/\"//g; s/;/\n/g; print "\n$_"; ' </tmp/MyTest > /tmp/zuza

real 0m1.30s
user 0m0.60s
sys 0m0.08s
sh random error when saving function output
I'm trying to make a script that takes a .txt file containing lines like:

davda103:David:Davidsson:800104-1234:TNCCC_1:TDDB46 TDDB80:

and then sorts them, etc. That's just the background; my problem lies here:

#!/bin/sh -x
cat $1 | while read a
do
testsak = `echo $a | cut -f 1 -d :`; <---**
echo $testsak;
done

Where the arrow is, when I try to run this code I get some kind of weird error:

+ read a
+ cut -f+ echo 1 -d :davda103:David:Davidsson:800104-1234:TNCCC_1:TDDB46 TDDB80:
+ testsak = davda103
scriptTest.sh: testsak: Det går inte att hitta
+ echo

(I have my Linux in Swedish because of school -.-) Anyway, that error just says that it can't find... something. Any ideas what could be causing my problem?
You have extra spaces around the assignment operator; remove them:

testsak=`echo $a | cut -f 1 -d :`; <---**
The spaces around the equal sign in

testsak = `echo $a | cut -f 1 -d :`; <---**

cause the shell to interpret this as a command testsak with arguments = and the result of the command substitution. Removing the spaces will fix the immediate error. A much more efficient way to extract the value from a is to let read do it (and use input redirection instead of cat):

while IFS=: read testsak the_rest; do
    echo $testsak
done < "$1"
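And if the first field is all you actually need, you can skip the loop entirely and let cut sweep the whole file in one pass:

cut -d : -f 1 "$1"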
modify the contents of a file without a temp file
I have the following log file, which contains lines like this:

1345447800561|FINE|blah#13|txReq
1345447800561|FINE|blah#13|Req
1345447800561|FINE|blah#13|rxReq
1345447800561|FINE|blah#14|txReq
1345447800561|FINE|blah#15|Req

I am trying to extract the first field from each line and, depending on whether it belongs to blah#13, blah#14 or blah#15, create the corresponding files using the following script, which seems quite inefficient in terms of the number of temp files it creates. Any suggestions on how I can optimize it?

cat newLog | grep -i "org.arl.unet.maca.blah#13" >> maca13
cat newLog | grep -i "org.arl.unet.maca.blah#14" >> maca14
cat newLog | grep -i "org.arl.unet.maca.blah#15" >> maca15
cat maca10 | grep -i "txReq" >> maca10TxFrameNtf_temp
exec<blah10TxFrameNtf_temp
while read line
do
echo $line | cut -d '|' -f 1 >>maca10TxFrameNtf
done
cat maca10 | grep -i "Req" >> maca10RxFrameNtf_temp
while read line
do
echo $line | cut -d '|' -f 1 >>maca10TxFrameNtf
done
rm -rf *_temp
Something like this?

for m in org.arl.unet.maca.blah#13 org.arl.unet.maca.blah#14 org.arl.unet.maca.blah#15
do
   grep -i "$m" newLog | grep "txReq" | cut -d'|' -f1 > log.$m
done
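If you want to avoid the intermediate maca* files altogether, a single awk pass can do the splitting and the field extraction at once. A rough sketch, assuming the four |-separated fields shown in the sample and one output file per ID:

awk -F'|' '$4 == "txReq" {
    id = $3
    sub(/.*#/, "", id)                   # keep only the number after "#"
    print $1 > ("maca" id "TxFrameNtf")  # e.g. maca13TxFrameNtf
}' newLog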
I've found it useful at times to use ex instead of grep/sed to modify text files in place without using temps... it saves the trouble of worrying about uniqueness and writability of the temp file and its directory, etc. Plus it just seemed cleaner. In ksh I would use a code block with the edit commands and just pipe that into ex...

{
# Any edit command that would work at the colon prompt of a vi editor will work here.
# This one was just a text substitution that would replace all contents of the line
# at line number ${NUMBER} with the word DATABASE ... which strangely enough was
# necessary at one time lol
# The wq is the "write/quit" command as you would enter it at the vi colon prompt;
# these are essentially ex commands.
print "${NUMBER}s/.*/DATABASE/"
print "wq"
} | ex filename > /dev/null 2>&1
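The same idea works outside ksh as well; a small sketch of an equivalent plain-sh invocation, with NUMBER and filename standing in for your own values:

# replace everything on line $NUMBER of filename with the word DATABASE, then write and quit
printf '%s\n' "${NUMBER}s/.*/DATABASE/" wq | ex -s filename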