Extract a string in linux shell script - bash

Guys, I have a string like this:
variable='<partyRoleId>12345</partyRoleId>'
What I want is to extract the value, so the output is 12345.
Note that the tag can take either form:
<partyRoleId> or <ns1:partyRoleId>
Any idea how to get the tag value using only grep or sed?

Use an XML parser to extract the value:
echo "$variable" | xmllint --xpath '*/text()' -
You probably should use it for the whole XML document instead of extracting a single line from it into a variable, anyway.

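To use only grep, you need a regexp that finds the first closing bracket and then captures the digits that follow it:
echo '<partyRoleId>12345</partyRoleId>' | grep -Po '>\K\d+'
-P enables PCRE (Perl-compatible regular expressions; this requires GNU grep).
-o tells grep to print only the matched part of the line.
The special \K tells grep to drop everything matched before it from the reported match.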
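Since the question allows sed and mentions an optional namespace prefix, here is a minimal sketch that pins the match to the tag name and tolerates a prefix such as ns1:. The function name party_role_id is made up for illustration, and the sketch assumes the opening and closing tags sit on the same line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text of <partyRoleId> or <prefix:partyRoleId>.
# \([^<>:/]*:\)\{0,1\} is an optional namespace prefix (basic regex syntax).
# \([^<]*\) captures the value; -n plus the p flag prints only matching lines.
party_role_id() {
  sed -n 's#.*<\([^<>:/]*:\)\{0,1\}partyRoleId>\([^<]*\)</.*#\2#p'
}

printf '%s\n' '<partyRoleId>12345</partyRoleId>' | party_role_id
printf '%s\n' '<ns1:partyRoleId>12345</ns1:partyRoleId>' | party_role_id
```

Both calls print 12345. For anything beyond a single well-formed line, an XML parser remains the safer choice.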

Related

Regex: match only string C that is in between string A and string B

How can I write a regex in a shell script that would target only the substring between two given values? Given the example
https://www.stackoverflow.com
How can I match only the ":" in between "https" and "//".
If possible please also explain the approach.
The context is that I need to prepare a file that would fetch a config from the server and append it to the .env file. The response comes as JSON
{
"GRAPHQL_URL": "https://someurl/xyz",
"PUBLIC_TOKEN": "skml2JdJyOcrVdfEJ3Bj1bs472wY8aSyprO2DsZbHIiBRqEIPBNg9S7yXBbYkndX2Lk8UuHoZ9JPdJEWaiqlIyGdwU6O5",
"SUPER_SECRET": "MY_SUPER_SECRET"
}
so I need to adjust it to the .env syntax. What I have managed to do so far is
#!/bin/bash
CURL_RESPONSE="$(curl -s url)"
cat <<< ${CURL_RESPONSE} | jq -r '.property.source' | sed -r 's/:/=/g;s/[^a-zA-Z0-9=:_/-]//g' > .env.test
So basically I fetch the data, extract the key I am after with jq, and then use sed to first replace every ":" with "=" and then remove all the quotation marks, commas, and whitespace that come from the JSON, keeping only the characters that are necessary.
I am almost there, but the problem is that now my GraphQL URL (and any other URL) looks like this:
https=//someurl/xyz
so I need to replace this = that is in between https and // back with the colon.
Thank you very much @Nic3500 for the response. I am not sure why, but I get an error saying:
sed: 1: "s/:/=/g;s#https\(.*\)// ...": \1 not defined in the RE
I searched SO and it seems that it should work since the brackets are escaped and I use -r flag (tried -E but no difference) and I don't know how to apply it. To be honest I assume that the replacement block is this part
#\1#
so how can I tell it which character to use as the replacement?
This is how I tried to use it
#!/bin/bash
CURL_RESPONSE="$(curl -s url)"
cat <<< ${CURL_RESPONSE} | jq -r '.property.source' | sed -r 's/:/=/g;s#https\(.*\)//.*#\1#;s/[^a-zA-Z0-9=:_/-]//g' > .env.test
Hope with this context you would be able to help me.
echo "https://www.stackoverflow.com" | sed 's#https\(.*\)//.*#\1#'
:
sed operator: s/regexp/replacement/
regexp: https\(.*\)//.*. So "https" followed by something \(.*\), followed by "//", followed by anything else .*
The parentheses are backslashed because they are not part of the text to match. In basic regular expressions (sed without -r or -E) they group a part of the regex for the replacement part of the s### operator. With -r or -E, groups are written with unescaped parentheses instead, and \( matches a literal parenthesis, which is why sed reports that \1 is not defined.
replacement: \1, meaning the first group found in the regex, \(.*\)
I used s###, but the usual form is s///. Any character can take the place of the / with the s operator. I used # because / would have been confusing, since the URL itself contains /.
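To make the grouping syntax concrete, the sketch below runs the same substitution in basic and extended syntax, then applies the idea to the actual goal of turning https=// back into https://. The function name fix_scheme is hypothetical, and the sketch assumes each line has the KEY=value shape produced by the earlier ":" to "=" pass.

```shell
#!/usr/bin/env bash
# Basic regex (plain sed): groups are written \( \)
printf '%s\n' 'https://www.stackoverflow.com' | sed 's#https\(.*\)//.*#\1#'

# Extended regex (sed -E): groups are written ( ) with no backslashes
printf '%s\n' 'https://www.stackoverflow.com' | sed -E 's#https(.*)//.*#\1#'

# Hypothetical helper: restore the colon of the URL scheme on KEY=http(s)=//... lines.
# The group keeps "KEY=https" and the replacement re-emits it followed by "://".
fix_scheme() {
  sed -E 's#^([A-Za-z0-9_]+=https?)=//#\1://#'
}

printf '%s\n' 'GRAPHQL_URL=https=//someurl/xyz' | fix_scheme
```

The first two commands each print a single colon; the last prints GRAPHQL_URL=https://someurl/xyz. Lines without a URL pass through fix_scheme unchanged.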
The problem is that your sed substitutions are terribly imprecise. Anyway, you want to do it in jq instead, where you have more control over which parts you are substituting, and avoid spawning a separate process for something jq quite easily does natively in the first place.
curl -s url |
jq -r '.property.source | to_entries[] |
"\(.key)=\"\(.value)\""' > .env.test
Tangentially, capturing the output of curl into a variable just so you can immediately cat it once to standard output is just a waste of memory.

Find string then from there pull numbers

I'm starting to code in bash and I'm not the best at it, but I have a situation. I have an output like:
Configuration file 'hello2.conf' is in use by process 735.
Ending
I want to extract the process ID 735.
I have seen answers that extract ONLY the numbers from an output, but then I am left with 2735.
How can I go about extracting 735 from the output? I was thinking of searching for "process" and then grabbing the number after it, perhaps?
Thanks!
Use GNU grep with its Perl Compatible Regular Expression capabilities enabled with the -P flag and print only the matching entry using -o flag.
grep -Po 'process \K[0-9]+' <<<"Configuration file 'hello2.conf' is in use by process 735."
735
Use it in a command line as
.. | grep -Po 'process \K[0-9]+'
where the \K escape sequence stands for
\K: This sequence resets the starting point of the reported match. Any previously matched characters are not included in the final matched sequence.
You might want to use a regular expression:
[[ "$line" =~ ([0-9]+)\.$ ]] && echo "${BASH_REMATCH[1]}"
This should match any number at the end of the line, select the number part, and print it!
Good Luck!
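If your line always has the same layout, use cut -d" " -f 9 (the ninth field is "735." with the trailing period, so pipe it through tr -d . to strip it).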
sed can extract only the numbers at the specific location of the message (using \(...\) match grouping and \1 replacement).
... | sed "s#^Configuration file '.*' is in use by process \([0-9]*\)\.#\1#"
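The sample output also contains a second line, "Ending", which a plain s command would pass through unchanged. A minimal sketch that prints only the process ID, using sed -n with the p flag, might look like this; the function name pid_in_use is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number that follows "process " and nothing else.
# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded, so "Ending" is dropped.
pid_in_use() {
  sed -n 's/.*process \([0-9][0-9]*\).*/\1/p'
}

pid=$(printf '%s\n' "Configuration file 'hello2.conf' is in use by process 735." "Ending" | pid_in_use)
printf '%s\n' "$pid"
```

This prints 735. The 2 from hello2.conf is ignored because the digits must directly follow the word "process".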

Shell find a string between two patterns

I have a response from a curl command to a text file that looks like this:
<att id="owner"><val>objs/13///user</val></att><att id="por"><val>objs/8/</val></att><att id="subType"><val>null</val></att><att id="uid"><val>14</val></att><att id="webDavPartialUrl"><val>/Users/user%
I need to find the string between the string >objs/8/</val> and <att id="uid">
I have tried awk, sed and grep, but all have issues with special characters like those above. Is there an option to treat the text as simple characters?
Using grep with -- (which marks the end of the options):
$ grep -o -- '>objs/8/</val>.*<att id="uid">' pattern
>objs/8/</val></att><att id="subType"><val>null</val></att><att id="uid">
For more specific matching with grep, you can refer to this question.
Otherwise, because your input seems to be XML, you should consider using an XPath expression on it. More specifically, it seems that you want to retrieve <att id="subType">, which should be easy to express.
Adding <test> and </test> around your sample, I was able to use xmllint to retrieve the value.
$ xmllint --xpath '/test/att[@id="subType"]' pattern
<att id="subType"><val>null</val></att>
Using Perl:
perl -ne 'print "$1\n" if m#>objs/8/</val>(.*)<att id="uid">#' file
output:
</att><att id="subType"><val>null</val></att>
Explanation:
$1 is the captured string (.*)
m## is used here as the matching operator instead of the standard Perl //, so that the / characters in the pattern do not need to be escaped
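On the question of treating the markers as plain characters: shell parameter expansion matches a quoted pattern literally, so no regex escaping is needed. The sketch below is one way to do that; the function name between is hypothetical, and it assumes the whole text fits in a variable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text between two literal markers.
# $1 = text, $2 = start marker, $3 = end marker. Returns 1 if either is missing.
between() {
  local rest result
  rest=${1#*"$2"}            # strip through the first occurrence of the start marker
  if [ "$rest" = "$1" ]; then
    return 1                 # start marker not found
  fi
  result=${rest%%"$3"*}      # strip from the first occurrence of the end marker
  if [ "$result" = "$rest" ]; then
    return 1                 # end marker not found
  fi
  printf '%s\n' "$result"
}

text='<att id="por"><val>objs/8/</val></att><att id="subType"><val>null</val></att><att id="uid"><val>14</val></att>'
between "$text" '>objs/8/</val>' '<att id="uid">'
```

This prints </att><att id="subType"><val>null</val></att>, the same result as the Perl one-liner. The quotes around "$2" and "$3" inside the expansions are what make the markers literal.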

grep exact pattern from a file in bash

I have the following IP addresses in a file
3.3.3.1
3.3.3.11
3.3.3.111
I am using this file as input file to another program. In that program it will grep each IP address. But when I grep the contents I am getting some wrong outputs.
like
cat testfile | grep -o 3.3.3.1
but I am getting output like
3.3.3.1
3.3.3.1
3.3.3.1
I just want to get the exact output. How can I do that with grep?
Use the following command:
grep -owF "3.3.3.1" testfile
-o returns the match only and not the whole line.
-w greps for whole words, meaning the match must be enclosed in non-word characters such as <space>, <tab>, a comma, a semicolon, or the start or end of the line. It prevents grep from matching 3.3.3.1 inside 3.3.3.111.
-F greps for fixed strings instead of patterns. This prevents the . in the IP address from being interpreted as "any character", meaning grep will not match 3a3b3c1 (or something like this).
To match whole words only, use grep -ow 3.3.3.1 testfile
UPDATE: Use the solution provided by hek2mgl as it is more robust.
You may use anchors.
grep '^3\.3\.3\.1$' file
Since grep uses regex by default, you need to escape the dots in order to make grep match a literal dot character.
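When each address sits alone on its own line, as in the sample file, the anchors and the escaping can be combined into one step with -x (match the whole line) and -F (fixed string). A minimal sketch, with a made-up function name exact_lines:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the input lines that equal $1 exactly.
# -x requires the whole line to match; -F disables regex, so dots are literal.
# -- protects against search strings that start with a dash.
exact_lines() {
  grep -xF -- "$1"
}

printf '%s\n' 3.3.3.1 3.3.3.11 3.3.3.111 3a3b3c1 | exact_lines 3.3.3.1
```

Only 3.3.3.1 is printed. Unlike -w, this does not help when the address is embedded in a longer line.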

extract specific tag from html output of a python script

I have a program whose output should be piped to the grep command. The output of my program is something like this:
<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc"><div class="sb_tlst"><h3><a href=
and so on...
I run a python script:
./python.py | grep -Po '(?<=<cite>)([^</cite>])'
in order to grep everything between the cite tags...
Can you help me?
You need to make a proper use of lookaround feature, your lookbehind is fine but lookahead is not. Try this:
grep -Po "(?<=<cite>).*?(?=</cite>)"
Ex:
echo '<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc"><div class="sb_tlst"><h3><a href=' | grep -Po "(?<=<cite>).*?(?=</cite>)"
www.site.com/sdds/ass
Disclaimer: It's a bad practice to parse XML/HTML with regex. You should probably use a parser like xmllint instead.
You could also use sed. But it's a bad practice to parse XML/HTML with regex.
sed -r 's/^<cite>([^<]*)<\/cite>.*/\1/g' file
Output:
www.site.com/sdds/ass
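If grep was built without -P support, a similar result can be had by combining grep -o with sed. This is only a sketch under the assumption that the cite content contains no nested tags; the function name cite_contents is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the content of every <cite>...</cite> pair, one per line.
# grep -o emits each whole <cite>...</cite> match; sed then strips the two tags.
cite_contents() {
  grep -o '<cite>[^<]*</cite>' | sed -e 's#^<cite>##' -e 's#</cite>$##'
}

printf '%s\n' '<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc"><div class="sb_tlst"><h3><a href=' | cite_contents
```

This prints www.site.com/sdds/ass. Unlike the sed command above, it does not require <cite> to be at the start of the line and it reports several cite tags on one line separately.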
