How can I extract a word between special characters and other words - bash

I am trying to find a way to extract a word between special characters and other words.
Example of the text:
description "CST 500M TEST/VPNGW/11040 X {} // test"
description "test2-VPNGW-110642 -VPNGW"
I am trying to achieve a result like this, only the word containing VPNGW:
TEST/VPNGW/11040
test2-VPNGW-110642
I tried with grep and awk, but it looks like my knowledge does not go far enough.
Printing fields with awk '{$1=""; $2=""; ... does not work because the word is not always in the same position.
Thanks for the help!

With grep you can output only the part of the string that matches the regex:
grep -o '[^ "]\+VPNGW[^ "]\+' file.name
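Assuming the two sample description lines are saved in file.name, that prints exactly the wanted words:
$ grep -o '[^ "]\+VPNGW[^ "]\+' file.name
TEST/VPNGW/11040
test2-VPNGW-110642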

You could try something like:
grep -Eoi 'test.*[0-9]'
Of course this would be greedy and if there is another number after the ones in the required string it will grab up to there. Normally I would suggest an inverted test to stop at the thing you don't want:
grep -Eoi 'test[^ ]+'
The problem with this is that, as in your first example, there is more than one occurrence of the string 'test', so the output for the first example is:
TEST/VPNGW/11040
test"
Of course, knowing what your real data looks like, you can make your own decision on what might suit best.
You could go with the Perl regex engine in grep and use a look-ahead:
grep -Poi 'test[^ ]+(?= )'
Again though, if you have the string 'test' somewhere else on the line followed by a single space, this will still not work as desired.
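That said, on the two sample lines (again assuming they are in file.name) it does happen to produce exactly the wanted output:
$ grep -Poi 'test[^ ]+(?= )' file.name
TEST/VPNGW/11040
test2-VPNGW-110642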
Lastly, awk can do the job but you would need to cycle through each item or set RS to white space:
Option 1:
awk '{for(i=1;i<=NF;i++)if(tolower($i) ~ /test.*[0-9]/)print $i}'
Option 2:
awk 'tolower($0) ~ /test.*[0-9]/' RS="[[:space:]]+"
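Note that with option 2 the stray double quote from the second sample line stays glued to the token (the second match comes out as "test2-VPNGW-110642, quote included). If that matters, a variant that treats quotes as part of the separator (just a sketch, GNU awk for the regex RS) would be:
$ awk 'tolower($0) ~ /test.*[0-9]/' RS='[[:space:]"]+' file
TEST/VPNGW/11040
test2-VPNGW-110642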

awk '/test2/{sub(/"/,"")}$4{print $4}/test2/{print $2}' file
TEST/VPNGW/11040
test2-VPNGW-110642

Related

Replace text between pattern range on same line

This may be a better task for awk than sed, but the goal is to parse a single, long string (it happens to be an XML doc) and replace text within a pattern range with another character.
I want to preserve the number of characters being replaced and simply mask them as asterisks. I've put something together in a python script to parse the XML tree but have a feeling a native program is going to be much faster.
Assuming the string: "<mask>123</mask><keep>123</keep>"
...I'd like the output: "<mask>***</mask><keep>123</keep>"
My first attempt with sed without using ranges got me this:
$ echo "<mask>123</mask><keep>123</keep>" | sed "s/[0-9]/*/g"
<mask>***</mask><keep>***</keep>
I learned that sed can operate within ranges, but my understanding is that the behavior can only be toggled from line-to-line, not over the course of processing a single line.
Experimenting with pattern ranges got me the following (consistent with my understanding) and thus didn't work either:
$ echo "<mask>123</mask><keep>123</keep>" | sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g"
<mask>***</mask><keep>***</keep>
EDIT: In fact, even if there were line breaks in the input, I must not be understanding the pattern range behavior correctly (or my example is poorly constructed)
$ echo "<mask>123</mask>\n<keep>123</keep>" | sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g"
<mask>***</mask>
<keep>***</keep>
Any tips would be greatly appreciated.
Never use range expressions: they make simple tasks very slightly briefer, but then need a complete rewrite or duplicate conditions as soon as your requirements become marginally more interesting. Always use a flag variable instead if range-like behaviour is necessary. What that means, of course, is that you can't use sed for problems like this, since it doesn't support variables.
Anyway, here's a trivial GNU awk (for multi-char RS and RT) solution that doesn't directly use ranges at all:
$ cat file
Assuming the string: "<mask>123</mask><keep>123</keep>" ...I'd like the
$ awk -v RS='</mask>' -v ORS= '{print gensub(/(.*<mask>).*/,"\\1***",1) RT}' file
Assuming the string: "<mask>***</mask><keep>123</keep>" ...I'd like the
or if you need the number of *s to match the number of characters they're replacing:
$ cat file
Assuming first string: "<mask>123</mask><keep>123</keep>" ...I'd like the
Assuming second string: "<mask>1234567</mask><keep>123</keep>" ...I'd like the
$ awk -v RS='</mask>' 'match($0,/(.*<mask>)(.*)/,a){ $0=a[1] gensub(/./,"*","g",a[2]) } {ORS=RT} 1' file
Assuming first string: "<mask>***</mask><keep>123</keep>" ...I'd like the
Assuming second string: "<mask>*******</mask><keep>123</keep>" ...I'd like the
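For comparison, the flag-variable idea mentioned at the top of this answer can also be written in portable awk without GNU extensions. This is only a sketch: it walks each line character by character and assumes, as in the question's own s/[0-9]/*/g, that digits are the only thing to mask:
awk '{
  out = ""
  inmask = 0                                  # flag: are we inside <mask>...</mask>?
  for (i = 1; i <= length($0); i++) {
    if (substr($0, i, 6) == "<mask>")  inmask = 1
    if (substr($0, i, 7) == "</mask>") inmask = 0
    c = substr($0, i, 1)
    if (inmask && c ~ /[0-9]/) c = "*"        # star out digits only while the flag is set
    out = out c
  }
  print out
}' file
On the sample input the digits inside <mask>...</mask> become *** while the <keep> section is left alone, and because the flag is reset at every </mask> it also copes with several <mask> sections on the same line.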
The output you got is actually correct behaviour; it is a quirk of sed's range addressing with two regexes.
What you gave sed is /regex1/,/regex2/. sed first looks for a line matching address1, which is /regex1/; the first line matches, fine. But your address2 is a regex too, so:
and if addr2 is a regexp, it will not be tested against the line
that addr1 matched.
This sentence is from sed's man page.
That is, sed starts checking your /regex2/ only from line 2 onward. Of course no later line matches /<\/mask>/, so the range never closes and sed just does the substitution on the whole file.
Check this example:
kent$ cat f
<mask>234</mask>
123
123
123
<mask>234</mask>
123
123
<keep>234</keep>
kent$ sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g" f
<mask>***</mask>
***
***
***
<mask>***</mask>
123
123
<keep>234</keep>
Finally, just a suggestion: don't process XML with regexes (sed/awk/grep...). Of course, you may just be using the XML as an example.
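That said, if the input really is shaped like the example and only digits need masking (as in the question's s/[0-9]/*/g), a GNU sed loop with a test/branch can star them out one at a time without using a range at all (just a sketch):
$ echo '<mask>123</mask><keep>123</keep>' | sed ':a; s/\(<mask>[*]*\)[0-9]/\1*/; ta'
<mask>***</mask><keep>123</keep>
Each pass converts the first digit that follows <mask> (and any stars already written) into *, and the ta branch repeats until no substitution is made.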

bash how to extract a field based on its content from a delimited string

Problem - I have a set of strings that essentially look like this:
|AAAAAA|BBBBBB|CCCCCCC|...|XXXXXXXXX|...|ZZZZZZZZZ|
The '...' denotes omitted fields.
Please note that the fields between the pipes ('|') can appear in ANY ORDER and not all fields are necessarily present. My task is to find the "XXXXXXX" field and extract it from the string; I can specify that field with a regex and find it with grep/awk/etc., but once I have that one line extracted from the file, I am at a loss as to how to extract just that text between the pipes.
My searches have turned up splitting the line into individual fields and then extracting the Nth field; however, I do not know what N is, and that is the trick.
I've thought of splitting the string by the delimiter, substituting the delimiter with a newline, piping those lines into a grep for the field, but that involves running another program and this will be run on a production server through near-TB of data, so I wanted to minimize program invocations. And I cannot copy the files to another machine nor do I have the benefit of languages like Python, Perl, etc., I'm stuck with the "standard" UNIX commands on SunOS. I think I'm being punished.
Thanks
As an example, let's extract the field that matches MyField:
Using sed
$ s='|AAAAAA|BBBBBB|CCCCCCC|...|XXXXXXXXX|12MyField34|ZZZZZZZZZ|'
$ sed -E 's/.*[|]([^|]*MyField[^|]*)[|].*/\1/' <<<"$s"
12MyField34
Using awk
$ awk -F\| -v re="MyField" '{for (i=1;i<=NF;i++) if ($i~re) print $i}' <<<"$s"
12MyField34
Using grep -P
$ grep -Po '(?<=\|)[^|]*MyField[^|]*' <<<"$s"
12MyField34
The -P option requires GNU grep.
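If the goal really is to minimise external program invocations and bash itself is available (the title suggests it is), the shell can do the splitting and matching on its own. A sketch, assuming a reasonably recent bash, that the line is already in a variable, and that the wanted field can be recognised by a glob such as *MyField*:
$ s='|AAAAAA|BBBBBB|CCCCCCC|...|XXXXXXXXX|12MyField34|ZZZZZZZZZ|'
$ IFS='|' read -ra fields <<<"$s"     # split the line on the pipe delimiter
$ for f in "${fields[@]}"; do [[ $f == *MyField* ]] && printf '%s\n' "$f"; done
12MyField34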
$ sed -e 's/^.*|\(XXXXXXXXX\)|.*$/\1/'
Naturally, this only makes sense if XXXXXXXXX is a regular expression.
This should be really fast if used with something like:
$ grep '|XXXXXXXXX|' somefile | sed -e ...
One hackish way -
sed 's/^.*|\(<whatever your regex is>\)|.*$/\1/'
but that might be too slow for your production server since it may involve a fair amount of regex backtracking.

Delete all lines before last case of a string

How would I go about deleting all the lines before the last occurrence of a string? For example, if I had a file that looked like:
Icecream is good
And
Chocolate is good
And
They have lots of sugar
If I want all lines after and including the last occurrence of "And", what's the cleanest way to do this? Specifically, I want:
And
They have lots of sugar
I was doing sed -n -E -e '/And/,$p' file but this gives me everything from the first occurrence.
This might work for you (GNU sed):
sed -n '/And/h;//!H;$!d;x;//p' file
Replace anything in the hold space by a line containing And. Append all other lines to the hold space. At the end of the file, swap the pattern space for the hold space and print the result as long as it matches the required string And.
I know that you asked for sed and that Potong provided a good sed solution. But, for comparison, here is an awk solution:
$ awk 's{s=s"\n"$0;} /And/{s=$0;} END{print s;}' file
And
They have lots of sugar
How it works:
s{s=s"\n"$0;}
If the variable s is not empty, then add to it the current line, $0.
/And/{s=$0;}
If the current line contains And, then set s to the current line, $0.
END{print s;}
After we have reached the end of the file, print s.
$ tac file | awk '!f; /And/{f=1}' | tac
And
They have lots of sugar
$ awk 'NR==FNR{if(/And/)nr=NR;next} FNR>=nr' file file
And
They have lots of sugar
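If reading the file twice is acceptable, you can also find the line number of the last occurrence with grep -n and feed it back into the sed range from the question (a sketch, assuming the pattern occurs at least once):
$ sed -n "$(grep -n 'And' file | tail -1 | cut -d: -f1),\$p" file
And
They have lots of sugar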

How to pull a value from between 2 strings which occur several times in a file

I am trying to pull the value from between two strings and line-break each result. I am then hoping to combine this with another value from the same document, pulled the same way. The problem is there are NO line breaks in this file and it is quite large. Here is an example of the file:
<ID>47</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>myhost.domain.local</DNS_NAME> <IP_ADDRESS>10.0.0.1</IP_ADDRESS><ID>60</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>yourhost.domain.local</DNS_NAME><IP_ADDRESS>10.0.0.2</IP_ADDRESS>
My end result would ideally look something like this.
ID-----DNS_NAME
47-----myhost.domain.local
60-----yourhost.domain.local
My closest attempts so far have been creating variables with grep, but I can't seem to format them into a table. I'm also very new to scripting, so forgive my ignorance.
If your grep supports -P (--perl-regexp), then you're free to use the regex below.
$ grep -oP '<ID>\K[^<>]*(?=</ID>)|<DNS_NAME>\K[^<>]*(?=</DNS_NAME>)' file | sed 'N;s/\n/-----/g'
47-----myhost.domain.local
60-----yourhost.domain.local
\K discards the previously matched characters from the output.
(?=...) is a positive lookahead assertion, which asserts what must follow the match without consuming any characters.
Here is a GNU awk solution (needed for the multi-character RS) to get your data:
awk -v RS="<ID>" -F"<|>" 'NR>1 {print $1"-----"$9}' file
47-----myhost.domain.local
60-----yourhost.domain.local
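If you would rather not count fields, another option (a sketch that should work with any POSIX awk and assumes the tags never nest) is to split records on < and fields on >, remembering the last ID seen:
$ awk -v RS='<' -F'>' '$1=="ID"{id=$2} $1=="DNS_NAME"{print id"-----"$2}' file
47-----myhost.domain.local
60-----yourhost.domain.local
Add a BEGIN{print "ID-----DNS_NAME"} block if you also want the header row from your desired output.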

Using sed/awk to process a pattern in bash

I have a command whose output is of the form:
[{"foo1":<some value>,"foo2":<some value>,"foo3":<some value>}]
I want to take the output of this command and just get the value corresponding to foo2
How do I use sed/awk or any other shell utility readily available in a bash script to do this?
Assuming that the values do not contain commas, this sed rune will do it:
sed -n 's/.*"foo2":\([^,]*\),.*/\1/'p
sed -n tells sed not to print lines by default.
The s ("substitute") command uses a regexp group delimited by \( and \) to pick out just the bit you want.
"foo2": provides the context needed to find the right value.
[^,]* means "a character that is not a comma, any number of times". This is your <some value>. If values are not delimited by commas, change this (and the comma after the grouping parens) to match correctly.
.* means "any character, any number of times", and it is used to match all the characters before and after the bit you want. Now the regexp will match the entire line.
\1 means the contents of the grouping parentheses. sed will substitute the string that matches the pattern (which is the whole line, because we used .* at the beginning and end) with the contents of the parens, i.e. your <some value>.
Finally, the p on the end means "print the resulting line".
With this awk for example:
$ awk -F[:,] '{print $4}' file
<some value2>
-F[:,] sets the possible field separators to : or ,. Then it is a matter of counting the position at which the <some value> of foo2 sits. It happens to be the 4th field.
With sed:
$ sed 's/.*"foo2":\([^,]*\).*/\1/g' file
<some value2>
.*"foo2":\([^,]*\).* gets the string coming after foo2: and until the comma appears. Then it prints it back with \1.
Your block of data looks like JSON. There is no native JSON parsing in bash, sed or awk, so ALL the answers here will either suggest that you use a different, more appropriate tool, or they will be hackish and might easily fail if your real data looks different from the example you've provided here.
That said, if you are confident that your variable:value blocks and line structure are always in the same format as this example, you may be able to get away with writing your own (very) basic parser that will work for just your use case.
Note that you can't really parse things in sed, it's just not designed for that. If your data always looks the same, a sed solution may be sufficient ... but remember that you are simply pattern matching, not parsing the input data. There are other answers already which cover this.
For very simple matching of the string that appears after the colon after "foo2", as Peter suggested, you could use the following:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | sed -ne 's/.*"foo2":\([^,]*\),.*/\1/p'
As I say, this should in no way be confused with parsing of your JSON. It would work equally well (or badly) with an input string of abcde"foo2":bar,abcde.
In awk, you can make things that are a bit more advanced, but you still have serious limitations when it comes to JSON. For example, if you choose to separate fields with commas, but then you put a comma inside the <some value> in your data, awk doesn't know how to distinguish it from a field separator.
That said, if your JSON is only one level deep (i.e. matches your sample data), the following might work for you:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | awk -F: -vRS=, '{gsub(/[^[:alnum:]]/,"",$1)} $1=="foo2" {print $2}'
This awk script considers commas as record separators and colons as field separators. It does not support any level of depth in your JSON and depends on alphanumeric variable names. But it should handle JSON split onto multiple lines.
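For instance, feeding it the same sample data split across several lines (just an illustrative check, not a general JSON guarantee):
$ printf '[{"foo1":11,\n"foo2":222,\n"foo3":3333}]\n' | awk -F: -vRS=, '{gsub(/[^[:alnum:]]/,"",$1)} $1=="foo2" {print $2}'
222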
Alternately, if you want to avoid ugly hacks, and perl or python solutions don't work for you, you might want to try out jsawk. With it, you might use something like this:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | jsawk -a 'return this.foo2'
[222]
SEE ALSO: Parsing json with awk/sed in bash to get key value pair
This worked for me. You can try this one:
echo '[{"foo1":<some value>,"foo2":<some value>,"foo3":<some value>}]' | awk -F'[:,]+' '{ if ($3 == "\"foo2\"") print $4 }'
The awk above uses multiple field separators (a colon or a comma, one or more in a row). Note that the field holding the key still carries its surrounding double quotes, which is why the comparison is against "\"foo2\"" rather than plain "foo2".
Since this looks like JSON, let's parse it like JSON:
perl -MJSON -ne '$json = decode_json($_); print $json->[0]{foo2}, "\n"' <<END
[{"foo1":"some value","foo2":"some, value","foo3":"some value"}]
END
some, value
