Shell find a string between two patterns - bash

I have a response from a curl command to a text file that looks like this:
<att id="owner"><val>objs/13///user</val></att><att id="por"><val>objs/8/</val></att><att id="subType"><val>null</val></att><att id="uid"><val>14</val></att><att id="webDavPartialUrl"><val>/Users/user%
I need to find the string between the string >objs/8/</val> and <att id="uid">
I have tried awk, sed and grep, but all have issues with special characters like those above. Is there an option to treat the text as simple characters?

Using grep with -- (explained here)
$ grep -o -- '>objs/8/</val>.*<att id="uid">' pattern
>objs/8/</val></att><att id="subType"><val>null</val></att><att id="uid">
For more specific matching with grep, you can refer to this question.
Otherwise, because your input seems to be XML, you should consider using an XPATH expression on it. More specifically, it seems that you want to
retrieve <att id="subType">, which should be easy to express.
Adding <test> and </test> around your sample, I was able to use xmllint to retrieve the value.
$ xmllint --xpath '/test/att[@id="subType"]' pattern
<att id="subType"><val>null</val></att>

Using Perl:
perl -ne 'print "$1\n" if m#>objs/8/</val>(.*)<att id="uid">#' file
output:
</att><att id="subType"><val>null</val></att>
Explanation:
$1 is the captured string (.*)
m## is used here as the matching operator instead of the standard Perl //, in order to ignore the special / characters
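If the goal is to treat both delimiters as plain characters with no regex involved, shell parameter expansion is one option. The following is only a sketch, assuming the whole response fits in a shell variable; the function name between is made up for illustration. Quoting "$start" and "$end" inside the expansions is what makes the shell take them literally.

```shell
# between TEXT START END
# Prints the text between the first START and the next END,
# treating both delimiters as literal strings (hypothetical helper).
between() (
  text=$1 start=$2 end=$3
  # Fail if START followed by END does not occur in TEXT.
  case $text in
    *"$start"*"$end"*) ;;
    *) exit 1 ;;
  esac
  rest=${text#*"$start"}   # drop everything up to and including the first START
  rest=${rest%%"$end"*}    # drop the first END and everything after it
  printf '%s\n' "$rest"
)

response='<att id="por"><val>objs/8/</val></att><att id="subType"><val>null</val></att><att id="uid"><val>14</val></att>'
between "$response" '>objs/8/</val>' '<att id="uid">'
```

For the sample above this prints </att><att id="subType"><val>null</val></att>, the same text the Perl one-liner captures.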

Related

Extract a string in linux shell script

Guys, I have a string like this:
variable='<partyRoleId>12345</partyRoleId>'
What I want is to extract the value so the output is 12345.
Note the tag can be in any form:
<partyRoleId> or <ns1:partyRoleId>
Any idea how to get the tag value using grep or sed only?
Use an XML parser to extract the value:
echo "$variable" | xmllint -xpath '*/text()' -
You probably should use it for the whole XML document instead of extracting a single line from it into a variable, anyway.
To use only grep, you need a regexp that finds the first closing bracket and then matches the digits that follow it:
echo '<partyRoleId>12345</partyRoleId>'|grep -Po ">\K\d*"
-P enables PCRE (Perl-compatible regular expressions).
-o tells grep to show only the matched part of the line.
The special \K tells grep to discard everything matched before it from the result.
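Since the question also allows sed, here is a rough sed-only sketch that accepts an optional namespace prefix. It assumes the opening tag, value and closing tag are all on one line, and the function name tag_value is hypothetical. The | delimiter is used so that / can appear in the pattern without escaping.

```shell
# tag_value STRING
# Prints the text inside <partyRoleId> or <prefix:partyRoleId> (hypothetical helper).
tag_value() {
  # \([^<>/]*:\)\{0,1\}  optional namespace prefix such as "ns1:"
  #                      (excluding "/" keeps it from matching the closing tag)
  # \([^<]*\)            the value, up to the next "<"
  printf '%s\n' "$1" |
    sed -n 's|.*<\([^<>/]*:\)\{0,1\}partyRoleId>\([^<]*\)<.*|\2|p'
}

tag_value '<partyRoleId>12345</partyRoleId>'
tag_value '<ns1:partyRoleId>12345</ns1:partyRoleId>'
```

Both calls print 12345. This is still pattern matching rather than XML parsing, so the xmllint approach remains the safer choice for real documents.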

Cutting a string using multiple delimiters using the awk or sed commands

I am using a SIPP server simulator to verify incoming calls.
What I need to verify is the caller ID and the dialed digits. I've logged this information to a file, which now contains, for example, the following:
From: <sip:972526134661#server>;tag=60=.To: <sip:972526134662#server>}
in each line.
What I want is to modify it to a csv file containing simply the two phone numbers, such as follows:
972526134661,972526134662
and so on.
I've tried using the awk -F command, but then I can only use the sip: as a delimiter or the # or / as delimiters.
While, basically what I want to do is to take all the strings which begin with a < and end with >, and then take all the strings that follow the sip: delimiter.
Using the cut command is also not an option, as I understand that it cannot use strings as delimiters.
I guess it should be really simple, but I haven't found quite the right thing to use. Would appreciate the help, thanks!
OK, for fun, picking some random data (from your original post) and using awk -F as you originally wanted.
To note, because your file is "generated", we can assume a regular format for the data and not expect the "short" patterns to cause mis-hits.
[g]awk -F'sip:|#' -v OFS="," '{print $2,$4}' yourlogfile
It uses both sip: and # as the Field Separator, by means of the alternation operator |. It can easily be extended to allow further characters or strings to also be used to separate fields in the input if required. The built-in variable FS can contain a regular expression/regexp like this.
For that first sample in your question, it yields this:
972526134661,972526134662
For the latest (revision 8) version, and guessing what you want:
[g]awk -F'sip:|#|to_number:' -v OFS="," '{print $2,$5}' yourlogfile
Yields this:
from_number,972526134662
The [g]awk is because I used gawk on my machine, and got same behaviour with awk.
Slight amendment in style, suggested by @fedorqui, to use the command-line option -v to set the value for the Output Field Separator (an AWK built-in variable which can be amended using -v like any other variable) and separating the print fields with a comma, so that they are treated in the output as fields, rather than building a string with a hard-coded "," and treating it as one field.
I would suggest using sed to extract the two numbers:
$ sed -n 's/^From: <sip:\([0-9]*\).*To: <sip:\([0-9]*\).*/\1,\2/p' file
972526134661,972526134662
The regular expression matches a line beginning with From and captures the two numbers after <sip:. If the spaces are variable, you may want to add * to those places.
You can use a regex replace, as long as the format stays the same (order is always From/To):
sed -E "s/^.*sip:([0-9]+)#.*sip:([0-9]+)#.*$/\1,\2/"
It's not a very specific or perfect solution, but in most cases an approach like this is enough.
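If the order or number of sip: entries on a line is not fixed, one alternative is to collect every number that follows <sip: and join them with commas. This is only a sketch using POSIX awk's match() with RSTART and RLENGTH; the function name sip_numbers is an assumption, and it reads the log on standard input.

```shell
# sip_numbers: for each input line, print all digit runs that follow "<sip:",
# joined by commas (hypothetical helper, reads stdin).
sip_numbers() {
  awk '{
    out = ""
    rest = $0
    # Repeatedly find the next "<sip:" followed by digits.
    while (match(rest, /<sip:[0-9]+/)) {
      num = substr(rest, RSTART + 5, RLENGTH - 5)   # skip the 5 chars of "<sip:"
      out = (out == "" ? num : out "," num)
      rest = substr(rest, RSTART + RLENGTH)         # continue after this match
    }
    if (out != "") print out
  }'
}

printf '%s\n' 'From: <sip:972526134661#server>;tag=60=.To: <sip:972526134662#server>}' | sip_numbers
```

For the sample line this prints 972526134661,972526134662. Lines without any <sip: entry produce no output.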

copy text between two strings in a file using bash

I have a XML file. From that, I want to copy the text between two strings.
Sample line from XML file:
some stuff.........<br/><br/><br/>http://example.com/copythislink.php<br/><br/>After you.........some more stuff
I want to copy all the text between
<br/><br/><br/>
and
<br/><br/>After you
These two strings occur only once in the XML file. I tried using sed, but it returns an error because of <.
You can use this sed,
sed 's#.*<br/><br/><br/>\(.*\)<br/><br/>After you.*#\1#' yourfile.xml
Or, if you want to print only the extracted URL and suppress all lines that do not match, add -n and the p flag:
sed -n 's#.*<br/><br/><br/>\(.*\)<br/><br/>After you.*#\1#p' yourfile.xml
Using GNU grep:
grep -Po '(?<=<br/><br/><br/>)((?!<br/><br/>After you).)*' file
Explanation
(?<=<br/><br/><br/>) is a positive look-behind assertion
(?!<br/><br/>After you) is a negative look-ahead assertion
If you only need to extract the URI, a simple grep is enough. For example, something like:
grep -o 'http://[A-Za-z0-9./]*' test.xml
However, if you really want to catch the text between these two strings (whatever the content, even if it doesn't contain a URI), sat's solution works well.
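Because the trouble in the question came from characters such as < and / being interpreted, another option is to avoid regular expressions entirely and search for the markers as literal strings with awk's index() and substr(). A minimal sketch, assuming both markers sit on the same line; the function name between_markers is hypothetical and it reads from standard input. Note that awk -v interprets backslash escapes in the value, which does not matter for these markers.

```shell
# between_markers LEFT RIGHT
# Prints the text between the literal strings LEFT and RIGHT on each line
# that contains both (hypothetical helper, reads stdin).
between_markers() {
  awk -v left="$1" -v right="$2" '{
    s = index($0, left)                    # position of LEFT, 0 if absent
    if (s == 0) next
    rest = substr($0, s + length(left))    # text after LEFT
    e = index(rest, right)                 # position of RIGHT in the remainder
    if (e == 0) next
    print substr(rest, 1, e - 1)
  }'
}

printf '%s\n' 'some stuff<br/><br/><br/>http://example.com/copythislink.php<br/><br/>After you more stuff' |
  between_markers '<br/><br/><br/>' '<br/><br/>After you'
```

For the sample line this prints http://example.com/copythislink.php. To run it on a file, redirect the file into the function, for example between_markers '<br/><br/><br/>' '<br/><br/>After you' < yourfile.xml.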

Using sed/awk to process a pattern in bash

I have a command whose output is of the form:
[{"foo1":<some value>,"foo2":<some value>,"foo3":<some value>}]
I want to take the output of this command and just get the value corresponding to foo2
How do I use sed/awk or any other shell utility readily available in a bash script to do this?
Assuming that the values do not contain commas, this sed rune will do it:
sed -n 's/.*"foo2":\([^,]*\),.*/\1/'p
sed -n tells sed not to print lines by default.
The s ("substitute") command uses a regexp group delimited by \( and \) to pick out just the bit you want.
"foo2": provides the context needed to find the right value.
[^,]* means "a character that is not a comma, any number of times". This is your <some value>. If values are not delimited by commas, change this (and the comma after the grouping parens) to match correctly.
.* means "any character, any number of times", and it is used to match all the characters before and after the bit you want. Now the regexp will match the entire line.
\1 means the contents of the grouping parentheses. sed will substitute the string that matches the pattern (which is the whole line, because we used .* at the beginning and end) with the contents of the parens, <some value>.
Finally, the p on the end means "print the resulting line".
With this awk for example:
$ awk -F'[:,]' '{print $4}' file
<some value2>
-F'[:,]' sets the possible field separators to : or , (the quotes stop the shell from treating the brackets as a glob pattern). Then it is a matter of counting the position at which the <some value> of foo2 appears. It happens to be the 4th field.
With sed:
$ sed 's/.*"foo2":\([^,]*\).*/\1/g' file
<some value2>
.*"foo2":\([^,]*\).* gets the string coming after foo2: and until the comma appears. Then it prints it back with \1.
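One limitation of the patterns above is that they expect a comma after the value, so they misbehave when foo2 is the last key in the object. A small sketch that also stops at a closing brace is shown here; the function name foo2_value is hypothetical, and it still assumes the value itself contains neither a comma nor a closing brace.

```shell
# foo2_value: print the raw text after "foo2": up to the next "," or "}"
# (hypothetical helper, reads stdin; pattern matching only, not JSON parsing).
foo2_value() {
  sed -n 's/.*"foo2":\([^,}]*\).*/\1/p'
}

printf '%s\n' '[{"foo1":11,"foo2":222,"foo3":3333}]' | foo2_value
printf '%s\n' '[{"foo1":11,"foo2":222}]' | foo2_value
```

Both commands print 222. A quoted string value that contains a comma would still be cut short.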
Your block of data looks like JSON. There is no native JSON parsing in bash, sed or awk, so ALL the answers here will either suggest that you use a different, more appropriate tool, or they will be hackish and might easily fail if your real data looks different from the example you've provided here.
That said, if you are confident that your variable:value blocks and line structure are always in the same format as this example, you may be able to get away with writing your own (very) basic parser that will work for just your use case.
Note that you can't really parse things in sed, it's just not designed for that. If your data always looks the same, a sed solution may be sufficient ... but remember that you are simply pattern matching, not parsing the input data. There are other answers already which cover this.
For very simple matching of the string that appears after the colon after "foo2", as Peter suggested, you could use the following:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | sed -ne 's/.*"foo2":\([^,]*\),.*/\1/p'
As I say, this should in no way be confused with parsing of your JSON. It would work equally well (or badly) with an input string of abcde"foo2":bar,abcde.
In awk, you can make things that are a bit more advanced, but you still have serious limitations when it comes to JSON. For example, if you choose to separate fields with commas, but then you put a comma inside the <some value> in your data, awk doesn't know how to distinguish it from a field separator.
That said, if your JSON is only one level deep (i.e. matches your sample data), the following might work for you:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | awk -F: -vRS=, '{gsub(/[^[:alnum:]]/,"",$1)} $1=="foo2" {print $2}'
This awk script considers commas as record separators and colons as field separators. It does not support any level of depth in your JSON, and depends on alphanumeric variable names. But it should handle JSON split on to multiple lines.
Alternately, if you want to avoid ugly hacks, and perl or python solutions don't work for you, you might want to try out jsawk. With it, you might use something like this:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | jsawk -a 'return this.foo2'
[222]
SEE ALSO: Parsing json with awk/sed in bash to get key value pair
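This worked for me. You can try this one:
echo '[{"foo1":<some value>,"foo2":<some value>,"foo3":<some value>}]' | awk -F'[:,]+' '$3 == "\"foo2\"" { print $4 }'
The awk command above uses multiple field separators; I have used colon and comma here. The input is single-quoted so that the shell keeps the inner double quotes, which is why the comparison is against "foo2" including its quotes.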
Since this looks like JSON, let's parse it like JSON:
perl -MJSON -ne '$json = decode_json($_); print $json->[0]{foo2}, "\n"' <<END
[{"foo1":"some value","foo2":"some, value","foo3":"some value"}]
END
some, value

parse word from html file

I am having a lot of trouble trying to extract a word from an html file. The line in the html file appears like this:
<span id="result">WORD</span>
I am trying to get the WORD out but I can't figure it out. So far I've got:
grep 'span id="result"' FILE
Which just gets me the line. I've also tried:
sed -n '/<span id="result">/,/<\/span>/p' FILE
which didn't work either.
I know this is probably a very simple question, but I'm just beginning so I could really use some help.
Do not use regex to parse html.
Use a html parser.
My Xidel has the shortest syntax for this:
xidel FILE -e "#result"
This is a task for awk
I guess you have other lines in the same file, so searching for span id is a must.
echo '<span id="result">WORD</span>' | awk -F'[<>]' '/span id/ {print $3}'
WORD
You can try
awk -f ext.awk input.html
where input.html is your input HTML file and ext.awk is the following (the three-argument form of match() is a GNU awk extension, so this needs gawk):
{
line=line $0 RS
}
END {
match (line,/<span id="result">([^<]*)<\/span>/,a)
print a[1]
}
This will extract the contents even when they span line breaks.
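For an awk without the gawk-only three-argument match(), roughly the same extraction can be sketched with the POSIX two-argument match() plus RSTART and RLENGTH. The function name span_result is hypothetical; it reads the HTML on standard input and prints only the first match.

```shell
# span_result: print the text inside the first <span id="result">...</span>,
# even if it spans several lines (hypothetical helper, reads stdin).
span_result() {
  awk '
  { doc = doc $0 "\n" }            # accumulate the whole input
  END {
    otag = "<span id=\"result\">"
    ctag = "</span>"
    if (match(doc, /<span id="result">[^<]*<\/span>/)) {
      # Strip the opening and closing tags from the matched region.
      print substr(doc, RSTART + length(otag), RLENGTH - length(otag) - length(ctag))
    }
  }'
}

printf '%s\n' '<p>intro</p>' '<span id="result">WORD</span>' | span_result
```

For the sample input this prints WORD. As with the other answers, this is pattern matching and not HTML parsing.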
Use grep with a lookbehind assertion:
grep -Po '(?<=<span id="result">)\w+'
The expression between parentheses is a lookbehind assertion; it is not captured but must match immediately before the following part of the regex, so the only text matched here is \w+. The -o option outputs only the matched word; the -P option enables PCRE, which provides lookahead and lookbehind assertions.
If you want to modify this regex, please note that with grep, a lookbehind assertion must have a fixed length.

Resources