Parsing content using grep, awk - shell

I have parsed content similar to this as output from JSON.sh:
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog"
["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6"
["/home/ukrishnan/projects/test.yml"] {"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"}
["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt"
["/home/ukrishnan/projects/mysql/app.xml"] {"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}
[] {"/home/ukrishnan/projects/test.yml":{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"},"/home/ukrishnan/projects/mysql/app.xml":{"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}}
I just want to take the values from lines like 1, 2, and 4 and reformat them so that, for example, the first line becomes "/home/ukrishnan/projects/test.yml","LOG_DRIVER","syslog", and likewise for all lines in that format. Please help, as I'm a complete newbie to grep and awk.
Edit:
Sorry if this is too broad. Here is what I tried. Using grep -v "{\|}" returns:
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog"
["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6"
["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt"
If someone could also help me grab the values inside the double quotes in a single grep, that would be great.

This one-liner works for your example:
awk '$NF~/^[^{]/&&sub(/^\[/,"")+sub(/\]\s*/,",")' file
It gives:
"/home/ukrishnan/projects/test.yml","LOG_DRIVER","syslog"
"/home/ukrishnan/projects/test.yml","IMAGE","mysql:5.6"
"/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT","/u01/src/test/sample.txt"

Related

awk command works with small files but does nothing with big ones

I have the following awk command to join lines that are shorter than a limit (it is basically used to repair broken lines in a multiline fixed-width file):
awk 'last{$0=last $0;} length($0)<21{last=$0" ";next} {print;last=""}' input_file.txt > output_file.txt
input_file.txt:
1,11,"dummy
111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333
output_file.txt (expected):
1,11,"dummy 111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333
The script works well with small files (~MB) but does nothing with big files (~GB). What might be the problem?
Thanks in advance.
Best guess: all the lines in your big file are longer than 21 chars. There are more robust ways to do what you're trying to do with that script, though, so it may not be worth debugging this; ask for help with an improved script instead.
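If you want to test that guess before rewriting anything, counting how many lines actually fall under the limit is cheap (a quick, hypothetical sanity check, not part of the fix):
$ awk 'length($0) < 21 {short++} END {print short+0, "lines shorter than 21 chars"}' input_file.txt
If that prints 0 for the big file, the join condition simply never fires and the script passes every line through unchanged.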
Here's a more robust way, using any awk, to combine quoted fields that contain newlines. It splits lines on double quotes and keeps buffering until the field count is odd, which means the quotes on the accumulated line are balanced:
$ awk -F'"' '{$0=prev $0; if (NF%2){print; prev=""} else prev=$0 OFS}' input_file.txt
1,11,"dummy 111",1111
2,22,"dummy 222",2222
3,33,"dummy 333",3333
That may be a better starting point for you than your existing script (note that it assumes quotes come in plain balanced pairs, with no escaped quotes inside fields). To do more than that, see What's the most robust way to efficiently parse CSV using awk?.

Issue with bash script using SED/AWK for substitution

I have been working on this little script at work to free up my own time and am currently stuck on part of it. The script is supposed to pull some content from a JSON, modify the content, and then re-upload it. The modification part is the portion that doesn't work.
An example of what the content looks like after being extracted from the JSON is:
<p>App1_v1.0_20160911_release.apk</p<p>App2_v2.0_20160915_beta.apk</p><p>App3_v3.0_20150909_VendorRelease.apk</p>
The modification function is supposed to update the list with the newer app filenames in the same location. I've tried using both SED and AWK to get this to work but I haven't gotten anywhere fast.
Here are examples of both commands and the parameters for the substitution I am trying to run on the example file:
old_name=App1_.*_release.apk
new_name=App1_v1.0_20160920_1152_release.apk
sed "s/$old_name/$new_name/" body > upload
awk -v oldname="$old_name" -v newname="$new_name" '{sub(oldname, newname)}1' body > upload
What ends up happening is that the substitution changes the correct part of the list, but then nukes everything between that point and the end of the list.
Thank you for any and all help.
PS: If I didn't explain something correctly or you feel some information is missing, please comment and let me know so I can better explain the problem.
There are SO many possible values of oldname, newname, and your input data that could cause either of the commands you wrote to fail (for one, the greedy .* in App1_.*_release.apk matches as much as it can, so it may swallow everything up to the last place _release.apk matches on the line). Don't use that "replace a regexp with a backreference-enabled string" approach in any command; use string operations instead (which means you can't use sed, since sed doesn't support string operations).
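For illustration, a minimal sketch of the string-operation idea, assuming you know the literal old filename rather than a regexp for it. index() does a plain substring search, so nothing in either name can be misread as a regexp or backreference metacharacter (it replaces the first occurrence on each line):
$ awk -v old='App1_v1.0_20160911_release.apk' -v new='App1_v1.0_20160920_1152_release.apk' '
    # index() finds the literal old name; substr() rebuilds the line around it
    { if (i = index($0, old)) $0 = substr($0, 1, i - 1) new substr($0, i + length(old)); print }
  ' body > upload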
The following, though, side-steps the old name entirely and replaces the first list entry by position. It modifies your sample input as you say you want:
$ awk -v new='App1_v1.0_20160920_1152_release.apk' 'BEGIN{RS="</p>\n?"; FS=OFS="<p>"} NR==1{$2=new} {printf "%s%s", $0, RT}' file
<p>App1_v1.0_20160920_1152_release.apk<p>App2_v2.0_20160915_beta.apk</p><p>App3_v3.0_20150909_VendorRelease.apk</p>
If that's not adequate then edit your question to better explain your requirements and provide more truly representative sample input/output.
The above uses GNU awk for multi-char RS and RT.

Command grouping in sed

I do not understand the command grouping in sed scripts. We use curly braces to group commands. I found some information in the first answer to the following question: Using multiple sed commands. But I still do not understand this properly. Could someone please explain this to me?
If you use
/Number/ s/N/n/;s/r//
then the address /Number/ applies only to the first command: s/N/n/ runs just on lines containing Number, while s/r// runs on every line, so rs will be removed from all lines, not only those containing Number. But if you use
/Number/{s/N/n/;s/r//}
then both commands are limited by the address, and rs will be removed only from lines containing Number.
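A quick demo with a made-up two-line input makes the difference visible:
$ printf 'Number 1\nred 2\n' | sed '/Number/ s/N/n/;s/r//'
numbe 1
ed 2
$ printf 'Number 1\nred 2\n' | sed '/Number/{s/N/n/;s/r//}'
numbe 1
red 2
In the first command the r in red 2 is deleted even though that line does not contain Number; with the braces it is left alone.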

Need a quick way of removing partial duplicates from a log

I'm using a bash script to grep out some lines from a log file. The basic format of this log file is:
field1: value1, field2=value2, field3=value3,
field4=value4,value5,value6, field5=value7
Sometimes there will be lines in which field1: value1 is identical, but some of the other information is either the same or different. I'd like to filter those lines out, so that I only grep out the first instance of anything that has the same "field1: value1" tuple.
I'd prefer a nice command-line one-liner if you can find something especially simple. I definitely want to keep it in the bash script. This is on linux, so we've got all the command-line tools available.
Thanks!
Using awk:
awk -F, '!arr[$1]++ { print }' LOGFILE
The awk program splits each line on commas (-F,), so $1 is the "field1: value1" part. The array keeps a count of the number of times each such string has been seen, and the incoming line is printed only the first time.
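For example, with some made-up log lines:
$ printf 'req: 100, status=ok,\nreq: 100, status=retry,\nreq: 200, status=ok,\n' | awk -F, '!arr[$1]++ { print }'
req: 100, status=ok,
req: 200, status=ok,
Because of -F,, $1 is everything up to the first comma, i.e. the "field1: value1" part, so the second req: 100 line is suppressed no matter what follows it.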

Remove punctuation between pattern matches in bash?

I am struggling with a conversion of a data file to csv when there is punctuation in the title field.
I have a bash script that obtains the file and processes it, and it almost works. What gets me is commas in a free-text title field, which create extra fields.
I have tried some sed examples to replace between patterns but I have not gotten any of them to work. What I want to do is work between two patterns and replace commas with either nothing or perhaps a semicolon.
Taking this string:
name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,
Replacing with this:
name:A100040,title:Oatmeal is better with raisins dates and sugar,current_balance:50000,
I should probably use "title:" and ",current_" to denote the start and end of the block where I want to make the change to avoid situations like this:
name:A100040,title:Re-title current periodicals, recent books,current_balance:50000,
So far I have not gotten the substitution to match only the block I want. Here I am using !! to make the changes obvious:
teststring="name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,"
echo $teststring |sed '/title:/,/current_/s/,/!!/g'
name:A100040!!title:Oatmeal is better with raisins!! dates!! and sugar!!current_balance:50000!!
Any help appreciated.
Note that sed's /start/,/end/ form is a line-range address: it selects whole lines, from one matching /start/ through the next matching /end/, not a span of text within a line, which is why your s/,/!!/g hit every comma on the line. This is one way in perl, which could undoubtedly be refined:
perl -ple 'm/(.*?)(title:.*?)(current_balance:.*)/; $save = $part = $2; $part =~ s/,/!!/g; s/$save/$part/'
First, using sed or awk to parse CSV is almost always the wrong thing to do, because neither has any notion of quoted fields: a comma inside quotes looks exactly like a field delimiter to them. That said, it seems like a better approach would be to quote the fields so that your output would be:
name:"A100040",title:"Oatmeal ... , dates, and sugar",current_balance:50000
Using sed you can try the following (it is fragile):
sed 's/:\([^:]*\),\([^,:]*\)/:"\1",\2/g'
If you insist on trying to parse the csv with "standard" tools and you consider perl to be standard, you could try:
perl -pe '1 while s/,([^,:]*),/ $1,/g'
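If perl is not an option, the same idea can be done with awk's string functions: treat the text between the markers as one block and strip commas only there. A sketch, assuming the first title: and the first ,current_ on the line delimit that block, run against your teststring:
$ echo "$teststring" | awk '{
    s = index($0, "title:"); e = index($0, ",current_")
    if (s && e > s) {
      mid = substr($0, s, e - s)                   # the title:... block
      gsub(/,/, "", mid)                           # strip commas inside it only
      $0 = substr($0, 1, s - 1) mid substr($0, e)  # reassemble the line
    }
    print
  }'
name:A100040,title:Oatmeal is better with raisins dates and sugar,current_balance:50000,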
