How to extract text between two patterns with sed/awk - shell

I know this has been asked 1000 times here, but I read a lot of similar questions and still did not manage to find the right way to do this. I need to extract a number from a line that looks like this:
{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}
Expected output:
2034.2
This version number will not always be the same, but the rest of the line should stay the same.
I have tried working with sed but I am new to this and failed:
sed -e 's/version":[\(.*\),"description/\1/'
output:
sed: -e expression #1, char 35: unterminated `s' command
I think the issue is that there are too many special characters involved in the line and I did not write the command very well.

Since it's JSON, you should use JSON-aware tools to process it. If you prefer awk, for example, the way to go is GNU awk's JSON extension. This is a small how-to.
First download and compile the appropriate versions of GNU awk, Gawkextlib and gawk-json. That's pretty straightforward, actually, just ./configure and make. Then, write some code:
awk '
@load "json"                              # enable the json extension
{
    lines=lines $0                        # read json file records and buffer to var lines
    if(json_fromJSON(lines,data)==1) {    # once the json is complete
        for(i in data["info"]["version"]) # that seems to be an array, so all elements
            print data["info"]["version"][i] # are output
        lines=""                          # once done with the first json object
    }                                     # reset the var for more lines
}' file
Output this time:
2034.2
Explained a bit more:
The JSON file structure can vary from one line to multiple lines, for example:
{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}
or:
{
  "version": "4.9.123M",
  "info": {
    "version": [
      2034.2
    ],
    "description": ""
  },
  "status": "OK"
}
so we need to buffer the JSON lines with lines=lines $0 until there is a whole valid object in the variable lines. We use the extension function json_fromJSON() to test that validity in if(json_fromJSON(lines,data)==1). Once validated, the object gets disentangled and stored into the array data. For this particular object the structure of the array is:
data["version"]="4.9.123M"
data["info"]["version"][1]="2034.2"
data["info"]["description"]=""
data["status"]="OK"
We could examine the object and produce some output of it with this recursive array scanning function:
awk '
@load "json"
function scan(a,p, q) {              # a is the array, p the path to it, q is qnd *
    if(isarray(a))
        for(i in a) {
            q=p (p==""?"":"->") i
            scan(a[i],q)
        }
    else
        print p ":" a
}
{
    lines=lines $0
    if(json_fromJSON(lines,data)==1)
        scan(data)
}' file.json
Output:
status:OK
version:4.9.123M
info->version->1:2034.2
info->description:
*) quick'n dirty
Here is a brief example of how to output JSON from an array: https://stackoverflow.com/a/58109715/4162356
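If building the gawk JSON extension is not an option, the same value can be pulled out with jq (used again further down this page); a minimal sketch, assuming the JSON is in file:
jq '.info.version[]' file
2034.2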

If the version is always enclosed in [] and no other [ or ] is present in the line, you can try this logic:
STR='{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}'
echo "$STR" | awk -F'[' '{print $2}' | awk -F']' '{print $1}'

Simplest Way
Try grep when you want to extract simple text:
echo '{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}' | grep -o "\[.*\]" | sed -e 's/\[\|\]//g'

This should do:
STR='{"version":"4.9.123M","info":{"version":[2034.2],"description":""},"status":"OK"}'
echo "$STR" | awk -F'[][]' '{print $2}'
2034.2
Here -F'[][]' sets the field separator to a bracket expression matching either ] or [, so the value between the brackets becomes field 2.
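For completeness, the capture-group idea from the original question can also be made to work in sed once the brackets and the group are escaped as BRE syntax; a minimal sketch, assuming the line is in file:
sed -n 's/.*"version":\[\([^]]*\)\].*/\1/p' file
2034.2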

Extract json value on regex on bash script

How can I get the values inside depends in a bash script?
manifest.py
# Commented lines
{
    'category': 'Sales/Subscription',
    'depends': [
        'sale_subscription',
        'sale_timesheet',
    ],
    'auto_install': True,
}
Expected response:
sale_subscription sale_timesheet
The major problem is the line break; I have already tried | grep depends but I cannot get the sale_timesheet value.
I'm trying to collect these values coming from the files into a variable, like:
DOWNLOADED_DEPS=($(ls -A $DOWNLOADED_APPS | while read -r file; do cat $DOWNLOADED_APPS/$file/__manifest__.py | [get depends value])
Example updated.
If this is your JSON file:
{
  "category": "Sales/Subscription",
  "depends": [
    "sale_subscription",
    "sale_timesheet"
  ],
  "auto_install": true
}
You can get the desired result using jq like this:
jq -r '.depends | join(" ")' YOURFILE.json
This uses .depends to extract the value from the depends field, pipes it to join(" ") to join the array with a single space in between, and uses -r for raw (unquoted) output.
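For example, if the cleaned-up JSON above were saved as deps.json (the filename is assumed here):
$ jq -r '.depends | join(" ")' deps.json
sale_subscription sale_timesheet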
If it is not a JSON file but just a string, then you can use the regex below to find the values. If it is a JSON file, you can use other methods like Thomas suggested.
^'depends':\s*(?:\[\s*)(.*?)(?:\])$
If your grep supports PCRE (GNU grep -P), you can use it for this. Since depends spans several lines, treat the file as a single record with -z, let . match newlines with (?s), and drop the ^/$ anchors, for example:
% grep -Pzo "(?s)'depends':\s*\[\s*(.*?)\]" manifest.py
You can read more in the grep manual.
As #Thomas has pointed out in a comment, the OP's input data is not in JSON format:
$ cat manifest.py
# Commented lines                        // comments not allowed in JSON
{
    'category': 'Sales/Subscription',    // single quotes should be replaced by double quotes
    'depends': [
        'sale_subscription',
        'sale_timesheet',                // trailing comma at end of section not allowed
    ],
    'auto_install': True,                // trailing comma issue; should be lower case "true"
}
And while the title of the question mentions regex, there is no sign of a regex in the question. I'll leave a regex based solution for someone else to come up with and instead ...
One (quite verbose) awk solution based on the input looking exactly like what's in the question:
$ awk -F"'" ' # use single quote as field separator
/depends/ { printme=1 ; next } # if we see the string "depends" then set printme=1
printme && /]/ { printme=0 ; next} # if printme=1 and line contains a right bracket then set printme=0
printme { printf pfx $2; pfx=" " } # if printme=1 then print a prefix + field #2;
# first time around pfx is undefined;
# subsequent passes will find pfx set to a space;
# since using "printf" with no "\n" in sight, all output will stay on a single line
END { print "" } # add a linefeed on the end of our output
' json.dat
This generates:
sale_subscription sale_timesheet
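To collect the values into a shell variable, as the question attempts with DOWNLOADED_DEPS, one possible sketch (assuming each app directory under $DOWNLOADED_APPS holds a __manifest__.py, as in the question):
DOWNLOADED_DEPS=()
for manifest in "$DOWNLOADED_APPS"/*/__manifest__.py; do
    # unquoted command substitution on purpose: each dependency becomes one array element
    DOWNLOADED_DEPS+=( $(awk -F"'" '/depends/{p=1;next} p&&/]/{p=0;next} p{print $2}' "$manifest") )
done
echo "${DOWNLOADED_DEPS[@]}"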

How to get paragraphs of text by index number

I am wondering if there is a way to get paragraphs of text (source file would be a pyx file) by number as sed does with lines
sed -n ${i}p
At this moment I'd be interested to use awk with:
awk '/custom-pyx-tag\(/,/\)custom-pyx-tag/'
but I can't find documentation or examples about that.
I'm also trying to trim "\r\n" with gsub(/\r\n/,"; ") in the same awk command but it doesn't work, and I can't really figure out why.
Any hint would be very appreciated, thanks
EDIT:
This is just one example and not my exact need but I would need to know how to do it for a multipurpose project
Let's take the case that I have exported the ID3Tags of a huge collection of audio files and these have been stored in a pyx-like format, so in the end I will have a nice big file with this pattern repeating for each file in the collection:
audio-genre(
blablabla
)audio-genre
audio-artist(
bla.blabla
)audio-artist
audio-album(
bla-bla-bla
)audio-album
audio-track-num(
0x
)audio-track-num
audio-track-title(
bla.bla-bla
)audio-track-title
audio-lyrics(
blablablablabla
bla.bla.bla.bla
blah-blah-blah
blabla-blabla
)audio-lyrics
...
Now if I want to extract the artist of the 1234th audio file I can use:
awk '/audio-artist\(/, /)audio-artist/' | sed '/audio-artist/d' | sed -n 1234p
so being one line it can be obtained with sed, but I don't know how to get an entire paragraph given its index; for example, if I want to get the lyrics of the 6543rd file, how could I do it?
In the end it is just a question of whether there is a command equivalent to
sed -n ${num}p
but to be used for paragraphs
awk -v indx=1234 '
BEGIN {
    RS=""
}
{
    split($0,arr,"audio-artist")
    for (i=2;i<=length(arr);i=i+2) {
        gsub("[()]","",arr[i])
        arts[cnt+=1]=arr[i]
    }
}
END {
    print arts[indx]
}' audioartist
One liner:
awk -v indx=1234 'BEGIN {RS=""} NR==1 { split($0,arr,"audio-artist");for (i=2;i<=length(arr);i=i+2) { gsub("[()]","",arr[i]);arts[cnt+=1]=arr[i] } } END { print arts[indx] }' audioartist
Using awk, and the file called audioartist, we consume the file as one line by setting the records separator (RS) to "". We then split the whole file into an array arr, based on the separator audio-artist. We look through the array arr starting from 2 in steps of 2 till the end of the array and strip out the opening and closing brackets, creating another array called arts with an incrementing count as the index and the stripped artist as the value. At the end we print the arts index specified by the passed indx variable (in this case 1234).
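The same idea extends to any of the other tags; for example, to get the lyrics block of the 6543rd file, an untested sketch along the same lines (the filename is arbitrary) would be:
awk -v indx=6543 'BEGIN {RS=""} NR==1 { split($0,arr,"audio-lyrics"); for (i=2;i<=length(arr);i=i+2) { gsub("[()]","",arr[i]); lyr[cnt+=1]=arr[i] } } END { print lyr[indx] }' file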

How can I retrieve the matching records from mentioned file format in bash

XYZNA0000778800Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
I have the above file format from which I want to find a matching record. For example, match a number (7789) on the line starting with XYZ and, once matched, look for a matching number (7345) in the lines below starting with 1 until it reaches a line starting with 9, then retrieve the entire line record. How can I accomplish this using a shell script, awk, sed or any combination?
Expected Output:
XYZNA0000778900Z
17345000012300324000000004000000000000000
With sed one can do:
$ sed -n '/^XYZ.*7789/,/^9$/{/^1.*7345/p}' file
17345000012300324000000004000000000000000
Breakdown:
sed -n ' ' # -n disables automatic printing
/^XYZ.*7789/, # Match line starting with XYZ, and
# containing 7789
/^1.*7345/p # Print line starting with 1 and
# containing 7345, which is coming
# after the previous match
/^9$/ { } # Match line that is 9
range { stuff } will execute stuff when it's inside range, in this case the range is starting at /^XYZ.*7789/ and ending with /^9$/.
.* will match anything but newlines zero or more times.
If you want to print the whole block matching the conditions, one can use:
$ sed -n '/^XYZ.*7789/{:s;N;/\n9$/!bs;/\n1.*7345/p}' file
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
This works by reading lines between ^XYZ.*7789 and ^9$ into the pattern
space, and then printing the whole thing if ^1.*7345 can be matched:
sed -n ' ' # -n disables printing
/^XYZ.*7789/{ } # Match line starting
# with XYZ that also contains 7789
:s; # Define label s
N; # Append next line to pattern space
/\n9$/!bs; # Goto s unless \n9$ matches
/\n1.*7345/p # Print whole pattern space
# if \n1.*7345 matches
I'd use awk:
awk -v rid=7789 -v fid=7345 -v RS='\n9\n' -F '\n' 'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' filename
This works as follows:
-v RS='\n9\n' is the meat of the whole thing. Awk separates its input into records (by default lines). This sets the record separator to \n9\n, which means that records are separated by lines with a single 9 on them. These records are further separated into fields, and
-F '\n' tells awk that fields in a record are separated by newlines, so that each line in a record becomes a field.
-v rid=7789 -v fid=7345 sets two awk variables rid and fid (meant by me as record identifier and field identifier, respectively; the names are arbitrary) to your search strings. You could encode these in the awk script directly, but this way makes it easier and safer to replace the values with those of shell variables (which I expect you'll want to do).
Then the code:
index($1, rid) {               # In records whose first field contains rid
    for(i = 2; i <= NF; ++i) { # Walk through the fields from the second
        if(index($i, fid)) {   # When you find one that contains fid
            print $i           # Print it,
            next               # and continue with the next record.
        }                      # Remove the "next" line if you want all matching
    }                          # fields.
}
Note that multi-character record separators are not required by POSIX awk, and I'm not certain whether BSD awk accepts them. Both GNU awk and mawk do, though.
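To drive the two search strings from shell variables instead of hard-coding them, one might wrap it like this (a sketch; the variable names are arbitrary):
rid=7789 fid=7345
awk -v rid="$rid" -v fid="$fid" -v RS='\n9\n' -F '\n' 'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' filename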
EDIT: Misread question the first time around.
an extendable awk script can be
$ awk '/^9$/{s=0} s&&/7345/; /^XYZ/&&/7789/{s=1} ' file
set flag s when line starts with XYZ and contains 7789; reset when line is just 9, and print when flag is set and contains pattern 7345.
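If you also want the matching header line in the output, as in the expected output above, the same flag idea can be extended; a sketch:
$ awk '/^XYZ/{h=$0; s=/7789/} /^9$/{s=0} s && /^1/ && /7345/{print h; print}' file
XYZNA0000778900Z
17345000012300324000000004000000000000000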
This might work for you (GNU sed):
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789/!b;/7345/p' file
Use the option -n for the grep-like nature of sed. Gather up records beginning with XYZ and ending in 9. Reject any records which do not have 7789 in the header. Print any remaining records that contain 7345.
If the 7345 will always follow the header, this could be shortened to:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789.*7345/p' file
If all records are well-formed (begin XYZ and end in 9) then use:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^[^\n]*7789.*7345/p' file

appending text to specific line in file bash

So I have a file that contains some lines of text separated by ','. I want to create a script that counts how many parts a line has, and if the line contains 16 parts I want to add a new one. So far it's working great. The only thing that is not working is appending the ',xx' at the end. See my example below:
Original file:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
Expected result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
This is my code:
while read p; do
    if [[ $p == "HEA"* ]]
    then
        IFS=',' read -ra ADDR <<< "$p"
        echo ${#ADDR[@]}
        arrayCount=${#ADDR[@]}
        if [ "${arrayCount}" -eq 16 ];
        then
            sed -i "/$p/ s/\$/,xx/g" $f
        fi
    fi
done <$f
Result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
What am I doing wrong? I'm sure it's something small but I can't find it.
It can be done using awk:
awk -F, 'NF==16{$0 = $0 FS "xx"} 1' file
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
-F, sets input field separator as comma
NF==16 is the condition that says execute block inside { and } if # of fields is 16
$0 = $0 FS "xx" appends the field separator (a comma) and xx at the end of the line
1 is an always-true pattern with no action, so awk performs its default action, which is to print the line
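To write the result back to the file, as the sed -i in the question does, one might redirect to a temporary file and move it into place (GNU awk also offers -i inplace):
awk -F, 'NF==16{$0 = $0 FS "xx"} 1' file > file.tmp && mv file.tmp file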
If you want to use sed, the answer should be along the following lines:
Use the ${line_number}s/..../..../ format; to target a specific line, you need to find out the line number first.
Use the special char & to denote the matched string
The sed statement should look like the following:
sed -i "${line_number}s/.*/&xx/"
I would prefer to leave it to you to play around with, but if you prefer I can give you a full working sample.
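A minimal sketch of how that could look, built on the loop from the question (keeping a running line counter; the ,xx text is taken from the expected result):
line_number=0
while read -r p; do
    line_number=$((line_number + 1))
    IFS=',' read -ra ADDR <<< "$p"
    if [ "${#ADDR[@]}" -eq 16 ]; then
        # target this exact line and append to whatever it matched (&)
        sed -i "${line_number}s/.*/&,xx/" "$f"
    fi
done < "$f"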

How to iterate based on words in text? (Shell Scripting)

I have a file currently in the form
location1 attr attr ... attr
location2 attr attr ... attr
...
locationn attr atrr ... attr
What I want to do is go through each line, grab the location (first field), then iterate through the attributes. So far I know how to grab the first field, but not how to iterate through the attributes. The number of attributes also differs from line to line.
TEMP_LIST=$DIR/temp.list
while read LINE
do
x=`echo $LINE | awk '{print $1}'`
echo $x
done<$TEMP_LIST
Can someone tell me how to iterate through the attributes?
I want to get the effect like
while read LINE
do
location=`echo $LINES |awk '{print $1}'`
for attribute in attributes
do something involving the $location for the line and each individual $attribute
done<$TEMP_LIST
I am currently working in the ksh shell, but any other Unix shell is fine; I will figure out how to translate. I would be really grateful if someone could help, as it would save me a lot of time.
Thank you.
Similar to DreadPirateShawn's solution, but a bit simpler:
while read -r location all_attrs; do
    read -ra attrs <<< "$all_attrs"
    for attr in "${attrs[@]}"; do
        : # do something with $location and $attr
    done
done < inputfile
The second read line makes use of bash's herestring feature.
This might work in other shells too, but here's an approach that works in Bash:
#!/bin/bash
TEMP_LIST=temp.list
while read LINE
do
    # Split line into array using space as delimiter.
    IFS=' ' read -a array <<< $LINE
    # Use first element of array as location.
    location=${array[0]}
    echo "First param: $location"
    # Remove first element from array.
    unset array[0]
    # Loop through remaining array elements.
    for i in "${array[@]}"
    do
        echo " Value: $i"
    done
done < $TEMP_LIST
As you're already using awk in your posted code, why not learn how to use awk? It is designed for this sort of problem.
while read LINE
do
location=`echo $LINES |awk '{print $1}'`
for attribute in attributes
do something involving the $location for the line and each individual $attribute
done<$TEMP_LIST
is written in awk as
#!/bin/bash
tempList="MyTempList.txt"
awk '{                               # implied while loop for input records by default
    location=$1
    print "location=" location       # location as a "header"
    for (i=2;i<=NF;i++) {
        printf("attr%d=%s\t", i, $i) # print each attr with its number
    }
    printf("\n")                     # add new-line char to end of each line of attributes
}' ${tempList}
If you want to save your output, use awk '{.....}' ${tempList}> ${tempList}.new
Awk has numerous variables that it sets as it reads your files. NF means NumberOfFields for the current line. So the for loop starts at field 2 and prints all remaining fields on that line in the format provided (change to suit your needs). The i<=NF test drives the ability to print all elements on a line.
Sometimes you'll want the 3rd-to-last element on a line, so you can perform math on the value stored in NF, like thirdFromLast=$(NF-3). For any variable that holds a number, you can "dereference" it as a field number and ask awk to print the value stored in the $N(th) field, i.e. try
print "thirdFromLast="(NF-3)
print "thirdFromLast="$(NF-3)
... to see the difference that the $ makes on a variable that holds a number.
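A quick way to see that difference (a sketch):
$ echo "a b c d e f" | awk '{ print "NF-3 = " (NF-3); print "$(NF-3) = " $(NF-3) }'
NF-3 = 3
$(NF-3) = c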
(For large amounts of data, one awk process will be considerably more efficient than using subprocesses to gather parts of files.)
Also work your way through Grymoire's awk tutorial.
IHTH
