Replace string with result of command - bash

I have data in zdt format (like this), and I want to run this python script on the third column only (the pinyin one). I have tried to do this with sed and awk, but without success because of my limited knowledge of these tools. Ideally, I want to feed the column’s contents to the python script and then have the source replaced with the script's output.
This is roughly what I envision but the call is not executed, not even when in quotes.
s/([a-z]+[1,2,3,4]?)(?=.*\t)/decode_pinyin(\1)/g
I am not strict about which tools (sed, awk, python, …) are used; I just want a shell script for batch processing a number of files. Ideally, the original spaces are preserved.

Try something like this:
awk -F'\t' '{printf "decode_pinyin(\"%s\")\n", $3}' file
This outputs:
decode_pinyin("ru4xiang1 sui2su2")
decode_pinyin("ru4")
decode_pinyin("xiang1")
decode_pinyin("sui2")
decode_pinyin("su2")
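The awk command above only prints the calls; it does not run anything or write the result back into the column. A minimal sketch of the replacement step, assuming the converter can be called once per column value: decode_pinyin below is a stand-in (it only upper-cases its argument so the sketch runs on its own), and replace_third_column is a hypothetical helper name. Swap the stand-in's body for a call to your real python script. It also assumes every line has at least three tab-separated columns.

```shell
# Stand-in converter (assumption): upper-cases its argument.
# Replace the body with the real call to your python script.
decode_pinyin() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

# Read tab-separated lines on stdin and rewrite only the third column.
# The body runs in a subshell, so the helper variables do not leak.
replace_third_column() (
    tab=$(printf '\t')
    while IFS= read -r line || [ -n "$line" ]; do
        c1=${line%%"$tab"*}; rest=${line#*"$tab"}   # first column
        c2=${rest%%"$tab"*}; rest=${rest#*"$tab"}   # second column
        c3=${rest%%"$tab"*}                         # third column
        tail=${rest#"$c3"}                          # remaining columns, if any
        printf '%s\t%s\t%s%s\n' "$c1" "$c2" "$(decode_pinyin "$c3")" "$tail"
    done
)
```

Usage would be something like replace_third_column < in.zdt > out.zdt. The whole column is passed as one argument, so the spaces inside it reach the converter unchanged.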

Related

bash script to rewrite numbers sequentially

I'd like to 're-sequence' some variable assignment values that are within a large BASH script I'm writing. At present, I have to do this manually, and it's quite time-consuming. ;)
e.g.:
(some code here)
ab=0
(and some here too)
ab=3
(more code here)
cd=2; ab=1
(more code here)
ab=2
What I'd like to do is run a command that can re-order the assignment values of 'ab' so we get:
(some code here)
ab=0
(and some here too)
ab=1
(more code here)
cd=2; ab=2
(more code here)
ab=3
The indentations exist as these usually form part of a code block, like an 'if' or 'for' block.
The variable name will always be the same. The first occurrence in the script should be made a zero. I thought if something (like sed) could search for 'ab=' followed by an integer, then change that integer according to an incrementing value, this would be perfect.
Hoping someone out there may know of something that can do this already. I use 'Kate' for my BASH editing.
Any thoughts? Thank you.
$ # can also use: perl -pe 's/\bab=\K\d+/$i++/ge' file
$ perl -pe 's/(\bab=)\d+/$1.$i++/ge' file
(some code here)
ab=0
(and some here too)
ab=1
(more code here)
cd=2; ab=2
(more code here)
ab=3
(\bab=)\d+ matches ab= followed by one or more digits. \b is a word boundary marker, so words like dab=4 don't match
The e modifier allows Perl code in the replacement section
$1.$i++ is the string concatenation of ab= and the value of $i (which is 0 by default); $i is then incremented
Use perl -i -pe for in-place editing
@teracoy: try:
awk '/ab=/{sub(/ab=[0-9]+/,"ab="i++);print;next} 1' Input_file
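Note that sub() replaces only the first ab= on each line, and /ab=[0-9]+/ also matches inside names such as dab=4. A sketch in plain POSIX awk that renumbers every occurrence and approximates the word-boundary check by requiring a non-identifier character (or the start of the line) before ab=. The function name renumber_ab is made up for illustration.

```shell
# Renumber every "ab=<digits>" in the input, starting from 0.
# Reads the files given as arguments, or stdin if there are none.
renumber_ab() {
    awk '{
        out = ""; rest = $0
        # Find the next ab=<digits> not preceded by an identifier character.
        while (match(rest, /(^|[^A-Za-z0-9_])ab=[0-9]+/)) {
            m = substr(rest, RSTART, RLENGTH)
            sub(/[0-9]+$/, "", m)            # drop the old number, keep "...ab="
            out = out substr(rest, 1, RSTART - 1) m (n++)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }' "$@"
}
```

Run it as renumber_ab script.sh > script.new and compare the two files before replacing the original.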
With GNU awk for multi-char RS, RT, and gensub():
$ awk -v RS='\\<ab=[0-9]+' '{ORS=gensub(/[0-9]+/,i++,1,RT)}1' file
(some code here)
ab=0
(and some here too)
ab=1
(more code here)
cd=2; ab=2
(more code here)
ab=3
Use awk -i inplace ... for in-place editing if desired.

Issue with bash script using SED/AWK for substitution

I have been working on this little script at work to free up my own time and am currently stuck on part of it. The script is supposed to pull some content from a JSON, modify the content, and then re-upload it. The modification part is the portion that doesn't work.
An example of what the content looks like after being extracted from the JSON is:
<p>App1_v1.0_20160911_release.apk</p<p>App2_v2.0_20160915_beta.apk</p><p>App3_v3.0_20150909_VendorRelease.apk</p>
The modification function is supposed to update the list with the newer app filenames in the same location. I've tried using both SED and AWK to get this to work but I haven't gotten anywhere fast.
Here are examples of both commands and the parameters for the substitution I am trying to run on the example file:
old_name=App1_.*_release.apk
new_name=App1_v1.0_20160920_1152_release.apk
sed "s/$old_name/$new_name/" body > upload
awk -v oldname="$old_name" -v newname="$new_name" '{sub(oldname, newname)}1' body > upload
What ends up happening is the substitution will change the correct part of the list, but then nuke everything between that point and the end of the list.
Thank you for any and all help.
PS: If I didn't explain something correctly or you feel some information is missing, please comment and let me know so I can better explain the problem.
There are SO many possible values of oldname, newname, and your input data that could cause either of the commands you wrote to fail. Don't use that "replace a regexp with a backreference-enabled string" approach in any command; use string operations instead (which means you can't use sed, since sed doesn't support literal strings).
This modifies your sample input as you say you want:
$ awk -v new='App1_v1.0_20160920_1152_release.apk' 'BEGIN{RS="</p>\n?"; FS=OFS="<p>"} NR==1{$2=new} {printf "%s%s", $0, RT}' file
<p>App1_v1.0_20160920_1152_release.apk<p>App2_v2.0_20160915_beta.apk</p><p>App3_v3.0_20150909_VendorRelease.apk</p>
If that's not adequate then edit your question to better explain your requirements and provide more truly representative sample input/output.
The above uses GNU awk for multi-char RS and RT.
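The truncation in the question is most likely the greedy .* in App1_.*_release.apk: if a later entry on the same line also ends in _release.apk, the match runs through to that last one. A sketch of the string-operations approach in POSIX awk, using only index(), substr(), and length(), so nothing in the old or new name is interpreted as a regexp or a backreference. The function name replace_entry and the PRE/SUF/NEW variables are hypothetical; values are passed through the environment so awk does not process backslashes in them.

```shell
# replace_entry PREFIX SUFFIX NEWNAME < body > upload
# Replaces each <p> entry whose name starts with PREFIX and ends with SUFFIX.
replace_entry() (
    PRE=$1 SUF=$2 NEW=$3
    export PRE SUF NEW
    awk 'BEGIN { pre = ENVIRON["PRE"]; suf = ENVIRON["SUF"]; new = ENVIRON["NEW"] }
    {
        n = split($0, part, "<p>")           # part[1] is the text before the first <p>
        line = part[1]
        for (i = 2; i <= n; i++) {
            cut  = index(part[i], "<")       # the name ends at the next tag
            name = cut ? substr(part[i], 1, cut - 1) : part[i]
            tail = cut ? substr(part[i], cut) : ""
            if (index(name, pre) == 1 &&
                length(name) >= length(pre) + length(suf) &&
                substr(name, length(name) - length(suf) + 1) == suf)
                name = new                   # literal comparison, literal replacement
            line = line "<p>" name tail
        }
        print line
    }'
)
```

For the example in the question this would be replace_entry 'App1_' '_release.apk' 'App1_v1.0_20160920_1152_release.apk' < body > upload.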

Create CSV from specific columns in another CSV using shell scripting

I have a CSV file with several thousand lines, and I need to take some of the columns in that file to create another CSV file to use for import to a database.
I'm not in shape with shell scripting anymore, is there anyone who can help with pointing me in the correct direction?
I have a bash script to read the source file but when I try to print the columns I want to a new file it just doesn't work.
while IFS=, read symbol tr_ven tr_date sec_type sec_name name
do
echo "$name,$name,$symbol" >> output.csv
done < test.csv
Above is the code I have. Out of the 6 columns in the original file, I want to build a CSV with "column6, column6, column1"
The test CSV file is like this:
Symbol,Trading Venue,Trading Date,Security Type,Security Name,Company Name
AAAIF,Grey Market,22/01/2015,Fund,,Alternative Investment Trust
AAALF,Grey Market,22/01/2015,Ordinary Shares,,Aareal Bank AG
AAARF,Grey Market,22/01/2015,Ordinary Shares,,Aluar Aluminio Argentino S.A.I.C.
What am I doing wrong with my script? Or, is there an easier - and faster - way of doing this?
Edit
These are the real headers:
Symbol,US Trading Venue,Trading Date,OTC Tier,Caveat Emptor,Security Type,Security Class,Security Name,REG_SHO,Rule_3210,Country of Domicile,Company Name
I'm trying to get the last column, which is number 12, but it always comes up empty.
The snippet looks and works fine to me, maybe you have some weird characters in the file or it is coming from a DOS environment (use dos2unix to "clean" it!). Also, you can make use of read -r to prevent strange behaviours with backslashes.
But let's see how awk can solve this even faster:
awk 'BEGIN{FS=OFS=","} {print $6,$6,$1}' test.csv >> output.csv
Explanation
BEGIN{FS=OFS=","} sets the input and output field separators to the comma. Alternatively, you can say -F",", -F, or pass it as a variable with -v FS=",". OFS can likewise be set with -v OFS=",".
{print $6,$6,$1} prints the 6th field twice and then the 1st one. Note that using print, every comma-separated parameter that you give will be printed with the OFS that was previously set. Here, with a comma.
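If the file does come from a DOS environment, each line ends in a carriage return that sticks to the last field, which is one possible reason the last column looks empty or garbled. A sketch that strips it inside awk and uses $NF, so it works for both the 6-column sample and the 12-column real file. The function name last_last_first is made up, and the sketch assumes no field contains a quoted comma.

```shell
# Print "last column, last column, first column" from a comma-separated file.
# Reads the files given as arguments, or stdin if there are none.
last_last_first() {
    awk 'BEGIN { FS = OFS = "," }
         {
             sub(/\r$/, "")        # drop a trailing CR from DOS line endings
             print $NF, $NF, $1    # $NF is the last field, whatever its number
         }' "$@"
}
```

Usage: last_last_first test.csv > output.csv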

Need a quick way of removing partial duplicates from a log

I'm using a bash script to grep out some lines from a log file. The basic format of this log file is:
field1: value1, field2=value2, field3=value3,
field4=value4,value5,value6, field5=value7
Sometimes there will be lines in which field1: value1 is identical, but some of the other information is either the same or different. I'd like to filter those lines out, so that I only grep out the first instance of anything that has the same "field1: value1" tuple.
I'd prefer a nice command-line one-liner if you can find something especially simple. I definitely want to keep it in the bash script. This is on linux, so we've got all the command-line tools available.
Thanks!
Using awk:
awk -F, '!arr[$1]++ { print }' LOGFILE
The awk program uses an array to keep a count of the number of times a particular 'field1: value1' string is seen, but only prints the incoming line the first time.
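The same idea written out long-hand, in case the !arr[$1]++ idiom is hard to read. This is only a sketch; first_per_key is a hypothetical name, and it assumes the key is everything before the first comma on the line.

```shell
# Keep only the first line seen for each distinct first comma-separated field.
first_per_key() {
    awk -F, '{
        key = $1                    # e.g. "field1: value1"
        if (!(key in seen)) {       # first time this key appears
            seen[key] = 1
            print
        }
    }' "$@"
}
```

It can be dropped into the existing pipeline after the grep, for example grep PATTERN LOGFILE | first_per_key.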

using sed to change <CR><LF> to a symbol

Am working on Windows Vista with GnuWin32 (sed 4.2.1 and core utilities 5.3.0). Also have ActivePerl 5.14.2 package.
I have a large multi record file. The end of each record in the file is denoted with four dollar signs ($$$$). Within each logical record are many "CRLF."
I would like to replace all instances of CRLF with a symbol such as |+|. Then I will replace $$$$ with CRLF. The result: one record per row for import into Excel for further manipulation.
I've tried several methods for transforming CRLF to |+| but without success.
For example, one method was: sed -e "s/[\r\n]/|+|/g" source_file_in target_file_out
Another method used tr -d to delete \r and then a second statement: sed -e "s/\n/|+|/g" source_file_in target_file_out
The tr statement worked; the sed statement did not.
I've read the following articles but don't see how to adapt them to replace \r\n with a symbol like |+|.
sed: how to replace CR and/or LF with "\r" "\n", so any file will be in one line
Replace string that contains CRLF?
How can I replace a newline (\n) using sed?
If this problem cannot be solved easily using sed (and tr), then I'll use Perl if someone shows me how.
Thank you Ed for your recommendation.
The awk script is not yet working completely, so I'll add some missing detail with the hope that you can fine tune your recommendation.
First, I'm running gawk v3.1.6.2962. I believe there may be differences in awk implementations, so this may be a useful bit of information.
Next, some more information about the type of data and origin of the data.
The data is about chemicals (text data that is input to a stereo-chemical drawing program).
The chemical files are in an .sdf format.
When I open "133711.sdf" in NotePad++ (using View/Show symbol/Show all characters), I see data that is shown in the screen shot:
https://dl.dropbox.com/u/3094317/_master_1_screen_shot_.png
As you see, LF only - no CR.
I believe this means that the origin of the .sdf files is a UNIX system.
Next, I run the Windows command COPY *.sdf _master_2_.txt. That creates the very large file-of-files that I want to parse into records.
_master_2_.txt has the same structure as 133711.sdf - LF only; no CR.
Then, I run your awk recommendation in a .BAT file. I need to replace your single quotes with double quotes because Microsoft made me.
awk -v FS="\r\n" -v OFS="|+|" -v RS="\$\$\$\$" -v ORS="\r\n" "{$1=$1}1" C:_master_2_.txt >C:\output.txt
I've attached a screenshot of output.txt:
https://dl.dropbox.com/u/3094317/output.txt.png
As you can see, the awk command did not successfully replace "\r\n" with "|+|".
Further, Windows created the output.txt with CRLF.
It did successfully replace the four $ with CRLF.
Is this information adequate to update your awk recommendation to handle the Windows-related issues?
Try this with GNU awk:
awk -v FS='\r\n' -v OFS='|+|' -v RS='\\$\\$\\$\\$' -v ORS='\r\n' '{$1=$1}1' file
I see from your updated question that you're on Windows. To avoid ridiculous quoting rules and issues, put this in a file named "whatever.awk":
BEGIN{FS="\r\n"; OFS="|+|"; RS="\\$\\$\\$\\$"; ORS="\r\n"} {$1=$1}1
and run it as
awk -f whatever.awk file
and see if that does what you want.
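Since the update says the files contain LF only, FS="\r\n" has nothing to match, which would explain why no |+| shows up in the output. A sketch that avoids multi-character RS and FS altogether, so it does not depend on the awk version: it reads line by line, tolerates both LF and CRLF input, and emits one record each time it sees a line consisting of $$$$. The function name join_records is hypothetical, and it assumes $$$$ always sits on a line of its own.

```shell
# Join the lines of each record with |+| and print one record per output line.
# A record ends at a line that contains only $$$$.
join_records() {
    awk '{
        sub(/\r$/, "")                       # accept CRLF as well as LF input
        if ($0 == "$$$$") {                  # end-of-record marker
            print rec
            rec = ""; n = 0
            next
        }
        rec = (n++ ? rec "|+|" : "") $0      # append with separator after first line
    }
    END { if (n) print rec }                 # trailing record without a marker
    ' "$@"
}
```

On Windows the program between the single quotes can be saved in a file and run with awk -f, as suggested above, to sidestep the quoting rules. If the output must end in CRLF, adding -v ORS="\r\n" to the awk call should do it.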
