Replace text between pattern range on same line - bash

This may be a better task for awk than sed, but the goal is to parse a single, long string (it happens to be an XML doc) and replace text within a pattern range with another character.
I want to preserve the number of characters being replaced and simply mask them as asterisks. I've put something together in a python script to parse the XML tree but have a feeling a native program is going to be much faster.
Assuming the string: "<mask>123</mask><keep>123</keep>"
...I'd like the output: "<mask>***</mask><keep>123</keep>"
My first attempt with sed without using ranges got me this:
$ echo "<mask>123</mask><keep>123</keep>" | sed "s/[0-9]/*/g"
<mask>***</mask><keep>***</keep>
I learned that sed can operate within ranges, but my understanding is that the behavior can only be toggled from line-to-line, not over the course of processing a single line.
Experimenting with pattern ranges got me the following (consistent with my understanding) and thus didn't work either:
$ echo "<mask>123</mask><keep>123</keep>" | sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g"
<mask>***</mask><keep>***</keep>
EDIT: In fact, even if there were line breaks in the input, I must not be understanding the pattern range behavior correctly (or my example is poorly constructed)
$ echo "<mask>123</mask>\n<keep>123</keep>" | sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g"
<mask>***</mask>
<keep>***</keep>
Any tips would be greatly appreciated.

Never use range expressions: they make simple tasks very slightly briefer, but then need a complete rewrite or duplicated conditions as soon as your requirements become marginally more interesting. If a range really is needed, always use a flag variable instead. What that means, of course, is that you can't use sed for problems like this, since it doesn't support variables.
Anyway, here's a trivial GNU awk (for multi-char RS and RT) solution that doesn't directly use ranges at all:
$ cat file
Assuming the string: "<mask>123</mask><keep>123</keep>" ...I'd like the
$ awk -v RS='</mask>' -v ORS= '{print gensub(/(.*<mask>).*/,"\\1***",1) RT}' file
Assuming the string: "<mask>***</mask><keep>123</keep>" ...I'd like the
or if you need the number of *s to match the number of characters they're replacing:
$ cat file
Assuming first string: "<mask>123</mask><keep>123</keep>" ...I'd like the
Assuming second string: "<mask>1234567</mask><keep>123</keep>" ...I'd like the
$ awk -v RS='</mask>' 'match($0,/(.*<mask>)(.*)/,a){ $0=a[1] gensub(/./,"*","g",a[2]) } {ORS=RT} 1' file
Assuming first string: "<mask>***</mask><keep>123</keep>" ...I'd like the
Assuming second string: "<mask>*******</mask><keep>123</keep>" ...I'd like the
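If you prefer the flag-variable style described above, here is what it could look like in any POSIX awk, walking the record one character at a time (just a sketch: it assumes the tags are literally <mask> and </mask> and that only digits need hiding):
$ awk '{
    out = ""; rest = $0; inmask = 0                      # inmask: between <mask> and </mask>?
    while (rest != "") {
      if (!inmask && index(rest, "<mask>") == 1)       inmask = 1
      else if (inmask && index(rest, "</mask>") == 1)  inmask = 0
      c = substr(rest, 1, 1)
      out = out ((inmask && c ~ /[0-9]/) ? "*" : c)      # mask digits only inside the range
      rest = substr(rest, 2)
    }
    print out
  }' file
This is slower than the RS-based version above, but it does not depend on any GNU extensions.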

The output you got is actually correct; it comes down to how sed handles a range address built from two regexes.
What you gave sed is /regex1/,/regex2/. sed first looks for a line matching address1, which is /regex1/; the first line matches, fine. But your address2 is a regex too, so:
and if addr2 is a regexp, it will not be tested against the line
that addr1 matched.
This sentence is from sed's man page.
That is, sed only starts checking your /regex2/ from the next line (line 2) onwards. Of course, no later line matches /<\/mask>/, so sed applied the substitution to the whole file.
Check this example:
kent$ cat f
<mask>234</mask>
123
123
123
<mask>234</mask>
123
123
<keep>234</keep>
kent$ sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g" f
<mask>***</mask>
***
***
***
<mask>***</mask>
123
123
<keep>234</keep>
Finally, just a suggestion: don't process XML with regexes (sed/awk/grep...). Of course, you may just be using the XML as an example.
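That said, if the real input is as simple as the example and you want to stay in sed, a substitution inside a loop avoids line-oriented ranges entirely (a sketch: it assumes well-formed <mask>...</mask> pairs and loops once per masked character, so it is only sensible for short fields):
$ echo "<mask>123</mask><keep>123</keep>" | sed -e ':a' -e 's|\(<mask>[^<]*\)[0-9]\([^<]*</mask>\)|\1*\2|' -e 'ta'
<mask>***</mask><keep>123</keep>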

Related

Adding a new line to a text file after 5 occurrences of a comma in Bash

I have a text file that is basically one giant excel file on one line in a text file. An example would be like this:
Name,Age,Year,Michael,27,2018,Carl,19,2018
I need to change every third occurrence of a comma into a new line so that I get
Name,Age,Year
Michael,27,2018
Carl,19,2018
Please let me know if that is too ambiguous and as always thank you in advance for all the help!
With Gnu sed:
sed -E 's/(([^,]*,){2}[^,]*),/\1\n/g'
To change the number of fields per line, change {2} to one less than the number of fields. For example, to change every fifth comma (as in the title of your question), you would use:
sed -E 's/(([^,]*,){4}[^,]*),/\1\n/g'
In the regular expression, [^,]*, is "zero or more characters other than , followed by a ,"; in other words, it is a single comma-delimited field. This won't work if the fields are quoted strings with internal commas or newlines.
Regardless of what Linux's man sed says, the -E flag is an extension to POSIX sed which causes sed to use extended regular expressions (EREs) rather than basic regular expressions (see man 7 regex). -E also works on BSD sed, used by default on Mac OS X. (Thanks to @EdMorton for the note.)
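For reference, applied to the sample line from the question it produces:
$ echo 'Name,Age,Year,Michael,27,2018,Carl,19,2018' | sed -E 's/(([^,]*,){2}[^,]*),/\1\n/g'
Name,Age,Year
Michael,27,2018
Carl,19,2018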
With GNU awk for multi-char RS:
$ awk -v RS='[,\n]' '{ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
With any awk:
$ awk -v RS=',' '{sub(/\n$/,""); ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
Try this:
$ cat /tmp/22.txt
Name,Age,Year,Michael,27,2018,Carl,19,2018,Nooka,35,1945,Name1,11,19811
$ echo "Name,Age,Year"; grep -o "[a-zA-Z][a-zA-Z0-9]*,[1-9][0-9]*,[1-9][0-9]\{3\}" /tmp/22.txt
Name,Age,Year
Michael,27,2018
Carl,19,2018
Nooka,35,1945
Name1,11,1981
Or use ,[1-9][0-9]\{3\} so that you don't have to write [0-9] three more times for the YYYY part.
PS: This solution will only ever give you four digits for the year; even if the data contains 19811 (a typo, say), you'll still get 1981.
You are looking for 3 fragments, each without a comma and separated by a comma.
The last fields can give problems (not ending with a comma, and maybe only two fields).
The next command looks fine:
grep -Eo "([^,]*[,]{0,1}){0,3}" inputfile
This might work for you (GNU sed):
sed 's/,/\n/3;P;D' file
Replace the third , in the pattern space with a newline, print up to that newline (P), delete up to and including it (D), and repeat.
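Applied to the sample line from the question this gives:
$ echo 'Name,Age,Year,Michael,27,2018,Carl,19,2018' | sed 's/,/\n/3;P;D'
Name,Age,Year
Michael,27,2018
Carl,19,2018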

bash how to extract a field based on its content from a delimited string

Problem - I have a set of strings that essentially look like this:
|AAAAAA|BBBBBB|CCCCCCC|...|XXXXXXXXX|...|ZZZZZZZZZ|
The '...' denotes omitted fields.
Please note that the fields between the pipes ('|') can appear in ANY ORDER and not all fields are necessarily present. My task is to find the "XXXXXXX" field and extract it from the string; I can specify that field with a regex and find it with grep/awk/etc., but once I have that one line extracted from the file, I am at a loss as to how to extract just that text between the pipes.
My searches have turned up splitting the line into individual fields and then extracting the Nth field, however, I do not know what N is, that is the trick.
I've thought of splitting the string by the delimiter, substituting the delimiter with a newline, piping those lines into a grep for the field, but that involves running another program and this will be run on a production server through near-TB of data, so I wanted to minimize program invocations. And I cannot copy the files to another machine nor do I have the benefit of languages like Python, Perl, etc., I'm stuck with the "standard" UNIX commands on SunOS. I think I'm being punished.
Thanks
As an example, let's extract the field that matches MyField:
Using sed
$ s='|AAAAAA|BBBBBB|CCCCCCC|...|XXXXXXXXX|12MyField34|ZZZZZZZZZ|'
$ sed -E 's/.*[|]([^|]*MyField[^|]*)[|].*/\1/' <<<"$s"
12MyField34
Using awk
$ awk -F\| -v re="MyField" '{for (i=1;i<=NF;i++) if ($i~re) print $i}' <<<"$s"
12MyField34
Using grep -P
$ grep -Po '(?<=\|)[^|]*MyField[^|]*' <<<"$s"
12MyField34
The -P option requires GNU grep.
$ sed -e 's/^.*|\(XXXXXXXXX\)|.*$/\1/'
Naturally, this only makes sense if XXXXXXXXX is a regular expression.
This should be really fast if used with something like:
$ grep '|XXXXXXXXX|' somefile | sed -e ...
One hackish way -
sed 's/^.*|\(<whatever your regex is>\)|.*$/\1/'
but that might be too slow for your production server since it may involve a fair amount of regex backtracking.

Delete all lines before last case of a string

How would I go about deleting all the lines before the last occurrence of a string? Like if I had a file that looked like
Icecream is good
And
Chocolate is good
And
They have lots of sugar
If I want all lines after and including the last occurrence of "And" what's the cleanest way to do this? Specifically, I want
And
They have lots of sugar
I was doing sed -n -E -e '/And/,$p' file but I see this gives me the first occurrence.
This might work for you (GNU sed):
sed -n '/And/h;//!H;$!d;x;//p' file
Replace anything in the hold space by the line containing And. Append all other lines to the hold space. At the end of the file, swap the pattern space for the hold space and print out the result, as long as it matches the required string And.
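For readability, the same one-liner can be written out with comments (GNU sed accepts # comments inside the script):
sed -n '
  /And/h   # overwrite the hold space with the latest line containing And
  //!H     # append every other line to the hold space (// reuses the last regex)
  $!d      # not at the last line yet: delete and start the next cycle
  x        # at the last line: swap the saved block into the pattern space
  //p      # print it, provided a line containing And was ever seen
' file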
I know that you asked for sed and that Potong provided a good sed solution. But, for comparison, here is an awk solution:
$ awk 's{s=s"\n"$0;} /And/{s=$0;} END{print s;}' file
And
They have lots of sugar
How it works:
s{s=s"\n"$0;}
If the variable s is not empty, then add to it the current line, $0.
/And/{s=$0;}
If the current line contains And, then set s to the current line, $0.
END{print s;}
After we have reached the end of the file, print s.
$ tac file | awk '!f; /And/{f=1}' | tac
And
They have lots of sugar
$ awk 'NR==FNR{if(/And/)nr=NR;next} FNR>=nr' file file
And
They have lots of sugar

Using sed/awk to process a pattern in bash

I have a command whose output is of the form:
[{"foo1":<some value>,"foo2":<some value>,"foo3":<some value>}]
I want to take the output of this command and just get the value corresponding to foo2
How do I use sed/awk or any other shell utility readily available in a bash script to do this?
Assuming that the values do not contain commas, this sed rune will do it:
sed -n 's/.*"foo2":\([^,]*\),.*/\1/'p
sed -n tells sed not to print lines by default.
The s ("substitute") command uses a regexp group delimited by \( and \) to pick out just the bit you want.
"foo2": provides the context needed to find the right value.
[^,]* means "a character that is not a comma, any number of times". This is your <some value>. If values are not delimited by commas, change this (and the comma after the grouping parens) to match correctly.
.* means "any character, any number of times", and it is used to match all the characters before and after the bit you want. Now the regexp will match the entire line.
\1 means the contents of the grouping parentheses. sed will substitute the string that matches the pattern (which is the whole line, because we used .* at the beginning and end) with the contents of the parens, i.e. the <some value> you are after.
Finally, the p on the end means "print the resulting line".
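Putting it together, with made-up sample values in place of <some value>:
$ echo '[{"foo1":11,"foo2":222,"foo3":3333}]' | sed -n 's/.*"foo2":\([^,]*\),.*/\1/p'
222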
With this awk for example:
$ awk -F[:,] '{print $4}' file
<some value2>
-F[:,] sets the possible field separators to : or ,. Then it is a matter of counting the position in which the <some value> of foo2 sits. It happens to be the 4th.
With sed:
$ sed 's/.*"foo2":\([^,]*\).*/\1/g' file
<some value2>
.*"foo2":\([^,]*\).* gets the string coming after foo2: and until the comma appears. Then it prints it back with \1.
Your block of data looks like JSON. There is no native JSON parsing in bash, sed or awk, so ALL the answers here will either suggest that you use a different, more appropriate tool, or they will be hackish and might easily fail if your real data looks different from the example you've provided here.
That said, if you are confident that your variable:value blocks and line structure are always in the same format as this example, you may be able to get away with writing your own (very) basic parser that will work for just your use case.
Note that you can't really parse things in sed; it's just not designed for that. If your data always looks the same, a sed solution may be sufficient ... but remember that you are simply pattern matching, not parsing the input data. There are other answers already which cover this.
For very simple matching of the string that appears after the colon after "foo2", as Peter suggested, you could use the following:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | sed -ne 's/.*"foo2":\([^,]*\),.*/\1/p'
As I say, this should in no way be confused with parsing of your JSON. It would work equally well (or badly) with an input string of abcde"foo2":bar,abcde.
In awk, you can make things that are a bit more advanced, but you still have serious limitations when it comes to JSON. For example, if you choose to separate fields with commas, but then you put a comma inside the <some value> in your data, awk doesn't know how to distinguish it from a field separator.
That said, if your JSON is only one level deep (i.e. matches your sample data), the following might work for you:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | awk -F: -vRS=, '{gsub(/[^[:alnum:]]/,"",$1)} $1=="foo2" {print $2}'
This awk script considers commas as record separators and colons as field separators. It does not support any level of depth in your JSON, and depends on alphanumeric key names. But it should handle JSON split onto multiple lines.
Alternately, if you want to avoid ugly hacks, and perl or python solutions don't work for you, you might want to try out jsawk. With it, you might use something like this:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | jsawk -a 'return this.foo2'
[222]
SEE ALSO: Parsing json with awk/sed in bash to get key value pair
This worked for me; you can try this one (using the same made-up sample values as above):
$ echo '[{"foo1":11,"foo2":222,"foo3":3333}]' | awk -F'[:,]+' '{ if ($3 == "\"foo2\"") print $4 }'
222
The awk above uses multiple field separators; I have used colon and comma here. Note that the key still carries its double quotes after splitting, hence the \"foo2\" in the comparison.
Since this looks like JSON, let's parse it like JSON:
perl -MJSON -ne '$json = decode_json($_); print $json->[0]{foo2}, "\n"' <<END
[{"foo1":"some value","foo2":"some, value","foo3":"some value"}]
END
some, value

Sed is not replacing all occurrences of pattern

I've got the following variable LINES with the format date;album;song;duration;singer;author;genre.
August 2013;MDNA;Falling Free;00:31:40;Madonna;Madonna;Pop
August 2013;MDNA;I don't give a;00:45:40;Madonna;Madonna;Pop
August 2013;MDNA;I'm a sinner;01:00:29;Madonna;Madonna;Pop
August 2013;MDNA;Give Me All Your Luvin';01:15:02;Madonna;Madonna;Pop
I want to output author-song, so I made this script:
echo $LINES | sed s_"^[^;]*;[^;]*;\([^;]*\);[^;]*;[^;]*;\([^;]*\)"_"\2-\1"_g
The desired output is:
Madonna-Falling Free
Madonna-I don't give a
Madonna-I'm a sinner
Madonna-Give Me All Your Luvin'
However, I am getting this:
Madonna-Falling Free;Madonna;Pop August 2013;MDNA;I don't give a;00:45:40;Madonna;Madonna;Pop August 2013;MDNA;I'm a sinner;01:00:29;Madonna;Madonna;Pop August 2013;MDNA;Give Me All Your Luvin';01:15:02;Madonna;Madonna;Pop
Why?
EDIT: I need to use sed.
When I run your sed script on your input, I get this output:
Madonna-Falling Free;Pop
Madonna-I don't give a;Pop
Madonna-I'm a sinner;Pop
Madonna-Give Me All Your Luvin';Pop
which is fine except for the extra ;Pop - you just need to add .*$ to the end of your regex so that the entire line is replaced.
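With that fix (and switching to the more usual / delimiters), the command becomes something like this, assuming $LINES is quoted so that each record stays on its own line (see the quoting note in the answer below):
$ echo "$LINES" | sed 's/^[^;]*;[^;]*;\([^;]*\);[^;]*;[^;]*;\([^;]*\).*$/\2-\1/'
Madonna-Falling Free
Madonna-I don't give a
Madonna-I'm a sinner
Madonna-Give Me All Your Luvin'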
Based on your reported output, I'm guessing your input file is using a different newline convention from what sed expects.
In any case, this is a pretty silly thing to use sed for. Much better with awk, for instance:
awk 'BEGIN {FS=";";OFS="-"} {print $5,$3}'
or, slightly more tersely,
awk -F\; -vOFS=- '{print $5,$3}'
If you want sed to see more than one line of input, you must quote the variable to echo:
echo "$LINES" | sed ...
Note that I'm not even going to try to evaluate the correctness of your sed script; using sed here is a travesty, given that awk is so much better suited to the task.
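For completeness, the awk version run against the properly quoted variable:
$ echo "$LINES" | awk -F\; -vOFS=- '{print $5,$3}'
Madonna-Falling Free
Madonna-I don't give a
Madonna-I'm a sinner
Madonna-Give Me All Your Luvin'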
It looks like sed is viewing your entire sample text as a single line. So it is performing the operation requested and then leaving the rest unchanged.
I would look into the newline issue first. How are you populating $LINES?
You should also add the seventh field of your input (genre) to the pattern, so that the expression actually consumes all of the text that you want it to. And perhaps anchor the end of the pattern with $, \b (word boundary), \s (a whitespace character) or \n (newline).
If your format is absolutely fixed, just try the one below:
echo $line | sed 's#.*;.*;\(.*\);.*;.*;\(.*\);.*#\2-\1#'
