Preprocessor to include file

What's a simple way to implement file inclusion in Ruby, e.g. if a text file includes {{{stuff.txt}}}, the contents of stuff.txt is included in-line. I thought maybe something like this:
cat prog | ruby -pe 'gsub /{{{.+}}}/, File.open("$0").read'
... with eval() involved, but can't get it to work.

s.gsub(/{{{(.+?)}}}/) { |m| File.read($1) }
A couple points:
There is an important difference between .+ and .+?. Using .+, it is easy to make regexps which eat too many characters.
You need to use a block to calculate the replacement (since it is dynamic)

If you'd like to do it straight from the command line, try
cat prog | ruby -pe '$_.gsub!(/{{{(.+?)}}}/) { File.read $1 }'
As pointed out by Alex D, .+ is greedy and matches as many characters as it can. On the other hand, .+? tries to match as few characters as possible.
Ruby's command line -p expects you to update the value of the $_ variable. Hence the use of the mutating gsub! instead of gsub, which returns a copy. The same result can be achieved with -n and an explicit puts:
cat prog | ruby -ne 'puts $_.gsub(/{{{(.+?)}}}/) { File.read $1 }'
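To see the difference the lazy quantifier makes, here is a minimal sketch that wraps the one-liner above in a shell function and runs it on a line containing two markers. The function name expand_includes and the sample file names are made up for illustration, and it assumes ruby is on the PATH and that included paths are relative to the current directory.

```shell
# expand_includes FILE...: print FILE with every {{{path}}} marker replaced
# by the contents of path (hypothetical helper wrapping the one-liner above).
expand_includes() {
  ruby -pe '$_.gsub!(/\{\{\{(.+?)\}\}\}/) { File.read($1) }' "$@"
}

# Throwaway sample files (names are illustrative only).
demo_dir=$(mktemp -d)
printf 'AAA' > "$demo_dir/a.txt"
printf 'BBB' > "$demo_dir/b.txt"
printf 'x {{{a.txt}}} y {{{b.txt}}} z\n' > "$demo_dir/prog"

# Two markers on one line: the lazy .+? stops at the first }}}, so each
# marker is expanded on its own.
(cd "$demo_dir" && expand_includes prog)

rm -rf "$demo_dir"
```

With the greedy .+ the single match would span from the first {{{ to the last }}}, and File.read would be handed "a.txt}}} y {{{b.txt" as a file name.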

Related

Adding file paths to Latex figures?

In the below text I would like to add figs/01/ to each of the 3 files. As you can see the files can either be pdf,png or not have an extension and sometimes the \includegraphics breaks over several lines.
My current thinking is
cat figs.tex | ruby -ne 'puts $_.gsub(/\\includegraphics\[.*?\]\{.*?\}/) { |x| x.do_something_here }'
but it is a chicken-and-egg problem, because inside the block I would need to search again for the part to replace.
Question
Can anyone see how to solve such a situation?
\begin{figure}[ht]
\centerline{ \includegraphics[height=55mm]{plotLn} \includegraphics[height=55mm]{plotLnZoom.pdf}}
\caption{Funktionen $f(x) = \ln(x)$ \ref{examg0} (bl)}
\end{figure}
\begin{example}[Parameterfremstilling for ret linje]\label{tn6.linje}
\begin{think}
Givet linjen $\,m\,$,
\includegraphics[trim=1cm 11.5cm 1cm
11.5cm,width=0.60\textwidth,clip]{vektor8.png}
\end{think}

You can read the whole file in one shot (instead of the default behaviour that reads the file line by line). To do that you need the switch -0777 (special value for the record separator). This solves the problem of a pattern that spreads over multiple lines.
You can also replace the -n option and puts with -p to automatically print the result.
ruby -0777 -pe 'gsub(/\\includegraphics\[[^\]]*\]{\K/,"figs/01/")' figs.tex
You can omit $_; by default gsub is applied to it. (You can even impress your friends by removing the space between -pe and the quote '.)
About the pattern: \K drops everything to its left from the match result, so the match here is only an empty string at the position where the replacement string is to be inserted.
Note that the ruby command line options come from Perl:
perl -0777 -pe 's!\\includegraphics\[[^\]]*\]{\K!figs/01/!g' figs.tex
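If neither ruby nor perl is at hand, the same "read everything at once" idea can be sketched with plain sed. This is a different technique from the \K approach above: it slurps all lines into the pattern space with a small loop and re-inserts the matched prefix with a capture group. The function name add_fig_prefix and its parameter are assumptions made for this example.

```shell
# add_fig_prefix PREFIX: read LaTeX on stdin and write it back with PREFIX
# inserted right after the "{" of every \includegraphics[...]{...}.
# PREFIX is a hypothetical parameter; it must not contain "|", "&" or "\".
add_fig_prefix() {
  # ':a', '$!N', '$!ba' append every input line to the pattern space, so
  # [^]]* can also cross a line break inside the [...] options.
  sed -e ':a' -e '$!N' -e '$!ba' \
      -e 's|\(\\includegraphics\[[^]]*]{\)|\1'"$1"'|g'
}

printf '%s\n' \
  '\includegraphics[height=55mm]{plotLn}' \
  '\includegraphics[trim=1cm 11.5cm 1cm' \
  '11.5cm,width=0.60\textwidth,clip]{vektor8.png}' |
  add_fig_prefix 'figs/01/'
```

Because the whole file is held in memory, this is only a reasonable choice for files of ordinary size.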

Replace text between pattern range on same line

This may be a better task for awk than sed, but the goal is to parse a single, long string (it happens to be an XML doc) and replace text within a pattern range with another character.
I want to preserve the number of characters being replaced and simply mask them as asterisks. I've put something together in a python script to parse the XML tree but have a feeling a native program is going to be much faster.
Assuming the string: "<mask>123</mask><keep>123</keep>"
...I'd like the output: "<mask>***</mask><keep>123</keep>"
My first attempt with sed without using ranges got me this:
$ echo "<mask>123</mask><keep>123</keep>" | sed "s/[0-9]/*/g"
<mask>***</mask><keep>***</keep>
I learned that sed can operate within ranges, but my understanding is that the behavior can only be toggled from line-to-line, not over the course of processing a single line.
Experimenting with pattern ranges got me the following (consistent with my understanding) and thus didn't work either:
$ echo "<mask>123</mask><keep>123</keep>" | sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g"
<mask>***</mask><keep>***</keep>
EDIT: In fact, even if there were line breaks in the input, I must not be understanding the pattern range behavior correctly (or my example is poorly constructed)
$ echo "<mask>123</mask>\n<keep>123</keep>" | sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g"
<mask>***</mask>
<keep>***</keep>
Any tips would be greatly appreciated.

Never use range expressions: they make simple tasks very slightly briefer, but then need a complete rewrite or duplicate conditions as soon as your requirements become marginally more interesting. Always use a flag variable instead if a range is necessary. What that means, of course, is that you can't use sed for problems like this, since it doesn't support variables.
Anyway, here's a trivial GNU awk (for multi-char RS and RT) solution that doesn't directly use ranges at all:
$ cat file
Assuming the string: "<mask>123</mask><keep>123</keep>" ...I'd like the
$ awk -v RS='</mask>' -v ORS= '{print gensub(/(.*<mask>).*/,"\\1***",1) RT}' file
Assuming the string: "<mask>***</mask><keep>123</keep>" ...I'd like the
or if you need the number of *s to match the number of characters they're replacing:
$ cat file
Assuming first string: "<mask>123</mask><keep>123</keep>" ...I'd like the
Assuming second string: "<mask>1234567</mask><keep>123</keep>" ...I'd like the
$ awk -v RS='</mask>' 'match($0,/(.*<mask>)(.*)/,a){ $0=a[1] gensub(/./,"*","g",a[2]) } {ORS=RT} 1' file
Assuming first string: "<mask>***</mask><keep>123</keep>" ...I'd like the
Assuming second string: "<mask>*******</mask><keep>123</keep>" ...I'd like the
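The commands above rely on GNU awk extensions (RT, gensub, the three-argument match). Where only a POSIX awk is available, the same masking can be sketched with index() and substr(). This is an alternative to the approach above, not a restatement of it; the function name mask_tags is made up for the example, and it assumes <mask> elements are not nested.

```shell
# mask_tags: read lines on stdin and replace every character between
# <mask> and </mask> with "*", keeping the length (hypothetical helper).
mask_tags() {
  awk '{
    out = ""; rest = $0
    # Repeatedly locate the next <mask>...</mask> pair on the line.
    while ((s = index(rest, "<mask>")) > 0) {
      body_start = s + 6                       # first char after <mask>
      e = index(substr(rest, body_start), "</mask>")
      if (e == 0) break                        # unclosed tag: leave the rest alone
      body = substr(rest, body_start, e - 1)
      gsub(/./, "*", body)                     # same length, all asterisks
      out = out substr(rest, 1, body_start - 1) body "</mask>"
      rest = substr(rest, body_start + e - 1 + 7)
    }
    print out rest
  }'
}

printf '%s\n' '<mask>123</mask><keep>123</keep>' | mask_tags
```

Text outside a <mask> element, including digits inside <keep>, is copied through unchanged.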

The output you got is completely correct. It is a quirk of sed's range address with two regexes.
What you gave sed is /regex1/,/regex2/. sed first looks for a line matching address 1, which is /regex1/; the first line matches, fine. Your address 2 is a regex too, so:
and if addr2 is a regexp, it will not be tested against the line
that addr1 matched.
This sentence is from sed's man page.
That is, sed starts checking your /regex2/ from line 2. Of course, no later line matches /<\/mask>/, so sed applied the substitution to the whole file.
Check this example:
kent$ cat f
<mask>234</mask>
123
123
123
<mask>234</mask>
123
123
<keep>234</keep>
kent$ sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g" f
<mask>***</mask>
***
***
***
<mask>***</mask>
123
123
<keep>234</keep>
Finally just a suggestion, don't process xml with regex (sed/awk/grep...). Of course, you may just use the "xml" as an example.

Bash script output text between first match and 2nd match only [duplicate]

I'm trying to use sed to clean up lines of URLs to extract just the domain.
So from:
http://www.suepearson.co.uk/product/174/71/3816/
I want:
http://www.suepearson.co.uk/
(either with or without the trailing slash, it doesn't matter)
I have tried:
sed 's|\(http:\/\/.*?\/\).*|\1|'
and (escaping the non-greedy quantifier)
sed 's|\(http:\/\/.*\?\/\).*|\1|'
but I can not seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.

Neither basic nor extended Posix/GNU regex recognizes the non-greedy quantifier; you need a later regex. Fortunately, Perl regex for this context is pretty easy to get:
perl -pe 's|(http://.*?/).*|\1|'

In this specific case, you can get the job done without using a non-greedy regex.
Use the negated character class [^/]* instead of .*?:
sed 's|\(http://[^/]*/\).*|\1|g'

With sed, I usually implement non-greedy search by searching for anything except the separator until the separator :
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'
Output:
http://www.suon.co.uk
this is:
-n: don't print lines by default
search, match pattern, replace and print s/<pattern>/<replace>/p
use ; search command separator instead of / to make it easier to type so s;<pattern>;<replace>;p
remember match between brackets \( ... \), later accessible with \1,\2...
match http://
followed by anything in brackets [], [ab/] would mean either a or b or /
first ^ in [] means not, so followed by anything but the thing in the []
so [^/] means anything except / character
* repeats the previous item, so [^/]* means any number of characters except /.
so far sed -n 's;\(http://[^/]*\) means search and remember http://followed by any characters except / and remember what you've found
we want to search until the end of the domain, so stop on the next /: add another / at the end: sed -n 's;\(http://[^/]*\)/' but we want to match the rest of the line after the domain so add .*
now the match remembered in group 1 (\1) is the domain so replace matched line with stuff saved in group \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'
If you want to include the slash after the domain as well, then add one more slash inside the group to remember:
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'
output:
http://www.suon.co.uk/
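The steps above can be collected into a small reusable function. This is only a sketch: the name url_origin is made up, and accepting https as well as http (via the interval s\{0,1\}) is an addition not present in the answer.

```shell
# url_origin URL: print scheme and host of URL, without the trailing slash.
# Prints nothing when URL has no "/" after the host, because of -n ... p.
url_origin() {
  printf '%s\n' "$1" | sed -n 's;^\(https\{0,1\}://[^/]*\)/.*;\1;p'
}

url_origin 'http://www.suepearson.co.uk/product/174/71/3816/'
```

The negated class [^/]* can never run past the first slash after the host, which is what makes a lazy quantifier unnecessary here.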

Simulating lazy (un-greedy) quantifier in sed
And all other regex flavors!
Finding first occurrence of an expression:
POSIX ERE (using -r option)
Regex:
(EXPRESSION).*|.
Sed:
sed -r 's/(EXPRESSION).*|./\1/g' # Global `g` modifier should be on
Example (finding first sequence of digits):
$ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
12
How does it work?
This regex relies on the alternation |. At each position the engine picks the longest match (a POSIX rule that a couple of other engines follow as well), which means it takes the single-character . branch until ([0-9]+).* can match. But order is important too.
Since the global flag is set, the engine keeps matching character by character up to the end of the input string or our target. As soon as (EXPRESSION), the first and only capturing group on the left side of the alternation, matches, the rest of the line is consumed immediately by .*. We now hold our value in the first capturing group.
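As a sketch of how this might be packaged, the function below applies the ERE form to its argument. The name first_number is hypothetical, and -E is used in place of -r on the assumption that the local sed accepts it (recent GNU sed and BSD/macOS sed both do).

```shell
# first_number STRING: print the first run of digits found in STRING.
first_number() {
  # Every character before the first digit run matches the "." branch and is
  # replaced by the (empty) \1; the digit run plus the rest of the line then
  # matches the left branch and is replaced by the captured digits.
  printf '%s\n' "$1" | sed -E 's/([0-9]+).*|./\1/g'
}

first_number 'foo 12 bar 34'
```

A line without any digits is reduced to an empty line, since every character then matches the . branch.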
POSIX BRE
Regex:
\(\(\(EXPRESSION\).*\)*.\)*
Sed:
sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'
Example (finding first sequence of digits):
$ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
12
This one is like the ERE version but with no alternation involved. At each position the engine tries to match a digit. If one is found, the following digits are consumed and captured and the rest of the line is matched immediately; otherwise, since * means zero or more, it skips over the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at the dot . to match a single character, and the process continues.
Finding first occurrence of a delimited expression:
This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.
sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/; \
s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'
Input string:
foobar start block #1 end barfoo start block #2 end
-EDE: end
-SDE: start
$ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g'
Output:
start block #1 end
The first regex \(end\).* matches and captures the first end delimiter end and substitutes the whole match with the captured characters, which are the end delimiter. At this stage our output is: foobar start block #1 end.
Then the result is passed to the second regex \(\(start.*\)*.\)*, which is the same as the POSIX BRE version above. It matches a single character if the start delimiter start is not matched; otherwise it matches and captures the start delimiter and matches the rest of the characters.
Directly answering your question
Using approach #2 (delimited expression) you should select two appropriate expressions:
EDE: [^:/]\/
SDE: http:
Usage:
$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
Output:
http://www.suepearson.co.uk/
Note: this will not work with identical delimiters.

sed does not support a "non-greedy" operator.
You have to use the "[]" operator to exclude "/" from the match.
sed 's,\(http://[^/]*\)/.*,\1,'
P.S. there is no need to backslash "/".

sed - non greedy matching by Christoph Sieghart
The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it and shell scripts should be, after all, quick and easy. So in case somebody else might need it:
Greedy matching
% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar
Non greedy matching
% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar

Non-greedy solution for more than a single character
This thread is really old but I assume people still need it.
Let's say you want to kill everything up to the very first occurrence of HELLO. You cannot say [^HELLO]...
So a nice solution involves two steps, assuming that you can spare a unique word that you are not expecting in the input, say top_sekrit.
In this case we can:
s/HELLO/top_sekrit/ #will only replace the very first occurrence
s/.*top_sekrit// #kill everything till end of the first HELLO
Of course, with a simpler input you could use a smaller word, or maybe even a single character.
HTH!
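A quick way to convince yourself of the difference is to run both variants side by side. The sketch below uses two made-up function names and the same top_sekrit placeholder, and assumes the placeholder never occurs in the input.

```shell
# Greedy: .* runs to the LAST "HELLO" on the line.
drop_through_last_hello() {
  printf '%s\n' "$1" | sed 's/.*HELLO//'
}

# Two-step trick: mark only the first "HELLO", then delete up to the mark.
drop_through_first_hello() {
  printf '%s\n' "$1" | sed 's/HELLO/top_sekrit/; s/.*top_sekrit//'
}

drop_through_last_hello  'a HELLO b HELLO c'
drop_through_first_hello 'a HELLO b HELLO c'
```

The first call leaves only what follows the second HELLO, while the second call keeps everything after the first one, including the later HELLO.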

This can be done using cut:
echo "http://www.suepearson.co.uk/product/174/71/3816/" | cut -d'/' -f1-3

another way, not using regex, is to use fields/delimiter method eg
string="http://www.suepearson.co.uk/product/174/71/3816/"
echo $string | awk -F"/" '{print $1,$2,$3}' OFS="/"

sed certainly has its place but this is not one of them!
As Dee has pointed out: Just use cut. It is far simpler and much safer in this case. Here's an example where we extract various components from the URL using Bash syntax:
url="http://www.suepearson.co.uk/product/174/71/3816/"
protocol=$(echo "$url" | cut -d':' -f1)
host=$(echo "$url" | cut -d'/' -f3)
urlhost=$(echo "$url" | cut -d'/' -f1-3)
urlpath=$(echo "$url" | cut -d'/' -f4-)
gives you:
protocol = "http"
host = "www.suepearson.co.uk"
urlhost = "http://www.suepearson.co.uk"
urlpath = "product/174/71/3816/"
As you can see this is a lot more flexible approach.
(all credit to Dee)

sed -E 's|(http://[^/]+/).*|\1|'

There is still hope of solving this using pure (GNU) sed. Although this is not a generic solution, in some cases you can use "loops" to strip all the unnecessary parts of the string, like this:
sed -r -e ":loop" -e 's|(http://.+)/.*|\1|' -e "t loop"
-r: Use extended regex (for + and unescaped parenthesis)
":loop": Define a new label named "loop"
-e: add commands to sed
"t loop": Jump back to label "loop" if there was a successful substitution
The only problem here is it will also cut the last separator character ('/'), but if you really need it you can still simply put it back after the "loop" finished, just append this additional command at the end of the previous command line:
-e "s,$,/,"

sed -E interprets regular expressions as extended (modern) regular expressions
Update: -E on MacOS X, -r in GNU sed.

Because you specifically stated you're trying to use sed (instead of perl, cut, etc.), try grouping. This circumvents the non-greedy quantifier potentially not being recognized. The first group is the protocol (e.g. 'http://', 'https://', 'tcp://', etc). The second group is the domain:
echo "http://www.suon.co.uk/product/1/7/3/" | sed "s|^\(.*//\)\([^/]*\).*$|\1\2|"
If you're not familiar with grouping, start here.

I realize this is an old entry, but someone may find it useful.
As the full domain name may not exceed a total length of 253 characters, replace .* with .\{1,255\}

This is how to robustly do non-greedy matching of multi-character strings using sed. Let's say you want to change every foo...bar to <foo...bar> so for example this input:
$ cat file
ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV
should become this output:
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
To do that you convert foo and bar to individual characters and then use the negation of those characters between them:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
In the above:
s/#/#A/g; s/{/#B/g; s/}/#C/g is converting { and } to placeholder strings that cannot exist in the input so those chars then are available to convert foo and bar to.
s/foo/{/g; s/bar/}/g is converting foo and bar to { and } respectively
s/{[^{}]*}/<&>/g is performing the op we want - converting foo...bar to <foo...bar>
s/}/bar/g; s/{/foo/g is converting { and } back to foo and bar.
s/#C/}/g; s/#B/{/g; s/#A/#/g is converting the placeholder strings back to their original characters.
Note that the above does not rely on any particular string not being present in the input as it manufactures such strings in the first step, nor does it care which occurrence of any particular regexp you want to match since you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want and/or with sed's numeric match operator, e.g. to only replace the 2nd occurrence:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/2; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV

Have not yet seen this answer, so here's how you can do this with vi or vim:
vi -c '%s/\(http:\/\/.\{-}\/\).*/\1/ge | wq' file &>/dev/null
This runs the vi :%s substitution globally (the trailing g), refrains from raising an error if the pattern is not found (e), then saves the resulting changes to disk and quits. The &>/dev/null prevents the GUI from briefly flashing on screen, which can be annoying.
I like using vi sometimes for super complicated regexes, because (1) perl is dying, (2) vim has a very advanced regex engine, and (3) I'm already intimately familiar with vi regexes in my day-to-day usage editing documents.

Since PCRE is also tagged here, we could use GNU grep with the lazy (non-greedy) quantifier .*?, which stops at the first, nearest match, unlike .* (which is greedy and runs to the last occurrence of the match).
grep -oP '^http[s]?:\/\/.*?/' Input_file
Explanation: -P enables PCRE regex in grep, and -o prints only the matched part of each line. The regex matches a leading http or https followed by ://, then everything up to the next /; because we used .*?, it stops at the first / after http:// or https://.

echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'
Don't bother, I got it on another forum :)

sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1|' works too

Here is something you can do with a two step approach and awk:
A=http://www.suepearson.co.uk/product/174/71/3816/
echo $A|awk '
{
var=gensub(/\//,"||",3,$0) ;
sub(/\|\|.*/,"",var);
print var
}'
Output:
http://www.suepearson.co.uk
Hope that helps!

Another sed version:
sed 's|/[[:alnum:]].*||' file.txt
It matches / followed by an alphanumeric character (so not another forward slash) as well as the rest of characters till the end of the line. Afterwards it replaces it with nothing (ie. deletes it.)

@Daniel H (concerning your comment on andcoz' answer, although a long time ago): deleting trailing zeros works with
s,([[:digit:]]\.[[:digit:]]*[1-9])[0]*$,\1,g
it's about clearly defining the matching conditions ...

You should also think about the case where there are no matching delimiters: do you want to output the line or not? My examples here do not output anything if there is no match.
You need the prefix up to the 3rd /, so select, twice, a string of any length not containing / followed by a /, then a string of any length not containing /, then match a / followed by any string, and print the selection. This idea works with any single-character delimiter.
echo http://www.suepearson.co.uk/product/174/71/3816/ | \
sed -nr 's,(([^/]*/){2}[^/]*)/.*,\1,p'
Using sed commands you can do fast prefix dropping or delim selection, like:
echo 'aaa #cee: { "foo":" #cee: " }' | \
sed -r 't x;s/ #cee: /\n/;D;:x'
This is lot faster than eating char at a time.
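How it works: t x jumps to the label x if a substitution succeeded previously. The s command replaces the first delimiter with \n, and D removes everything up to the first \n and restarts the cycle. If \n was added, t x then jumps to the end and the line is printed.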
If there is start and end delims, it is just easy to remove end delims until you reach the nth-2 element you want and then do D trick, remove after end delim, jump to delete if no match, remove before start delim and and print. This only works if start/end delims occur in pairs.
echo 'foobar start block #1 end barfoo start block #2 end bazfoo start block #3 end goo start block #4 end faa' | \
sed -r 't x;s/end//;s/end/\n/;D;:x;s/(end).*/\1/;T y;s/.*(start)/\1/;p;:y;d'

If you have access to GNU grep, you can use Perl regex:
grep -Po '^https?://([^/]+)(?=)' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
http://www.suepearson.co.uk
Alternatively, to get everything after the domain use
grep -Po '^https?://([^/]+)\K.*' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
/product/174/71/3816/

The following solution works for matching / working with multiply present (chained; tandem; compound) HTML or other tags. For example, I wanted to edit HTML code to remove <span> tags, that appeared in tandem.
Issue: regular sed regex expressions greedily matched over all the tags from the first to the last.
Solution: non-greedy pattern matching (per discussions elsewhere in this thread; e.g. https://stackoverflow.com/a/46719361/1904943).
Example:
echo '<span>Will</span>This <span>remove</span>will <span>this.</span>remain.' | \
sed 's/<span>[^>]*>//g' ; echo
This will remain.
Explanation:
s/<span> : find <span>
[^>] : followed by anything that is not >
*> : until you find >
//g : replace any such strings present with nothing.
Addendum
I was trying to clean up URLs, but I was running into difficulty matching / excluding a word - href - using the approach above. I briefly looked at negative lookarounds (Regular expression to match a line that doesn't contain a word) but that approach seemed overly complex and did not provide a satisfactory solution.
I decided to replace href with ` (backtick), do the regex substitutions, then replace ` with href.
Example (formatted here for readability):
printf '\n
<a aaa h href="apple">apple</a>
<a bbb "c=ccc" href="banana">banana</a>
<a class="gtm-content-click"
data-vars-link-text="nope"
data-vars-click-url="https://blablabla"
data-vars-event-category="story"
data-vars-sub-category="story"
data-vars-item="in_content_link"
data-vars-link-text
href="https:example.com">Example.com</a>\n\n' |
sed 's/href/`/g ;
s/<a[^`]*`/\n<a href/g'
apple
banana
Example.com
Explanation: basically as above. Here,
s/href/` : replace href with ` (backtick)
s/<a : find start of URL
[^`] : followed by anything that is not ` (backtick)
*` : until you find a `
/<a href/g : replace each of those found with <a href

Unfortunately, as mentioned, this is not supported in sed.
To overcome this, I suggest using the next best thing (actually even better): vim's sed-like capabilities.
Define in .bash_profile:
vimdo() { vim $2 --not-a-term -c "$1" -es +"w >> /dev/stdout" -cq! ; }
That will create headless vim to execute a command.
Now you can do for example:
echo $PATH | vimdo "%s_\c:[a-zA-Z0-9\\/]\{-}python[a-zA-Z0-9\\/]\{-}:__g" -
to filter out python in $PATH.
Use - to have input from pipe in vimdo.
While most of the syntax is the same, Vim offers more advanced features, and \{-} is its standard non-greedy quantifier. See :help regexp.

Using sed/awk to process a pattern in bash

I have a command whose output is of the form:
[{"foo1":<some value>,"foo2":<some value>,"foo3":<some value>}]
I want to take the output of this command and just get the value corresponding to foo2
How do I use sed/awk or any other shell utility readily available in a bash script to do this?

Assuming that the values do not contain commas, this sed rune will do it:
sed -n 's/.*"foo2":\([^,]*\),.*/\1/'p
sed -n tells sed not to print lines by default.
The s ("substitute") command uses a regexp group delimited by \( and \) to pick out just the bit you want.
"foo2": provides the context needed to find the right value.
[^,]* means "a character that is not a comma, any number of times". This is your <some value>. If values are not delimited by commas, change this (and the comma after the grouping parens) to match correctly.
.* means "any character, any number of times", and it is used to match all the characters before and after the bit you want. Now the regexp will match the entire line.
\1 means the contents of the grouping parentheses. sed will substitute the string that matches the pattern (which is the whole line, because we used .* at the beginning and end) with the contents of the parens, <some value>.
Finally, the p on the end means "print the resulting line".
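To make the limits of this pattern concrete, here is a sketch that turns it into a function taking the key name as an argument and runs it on a few inputs. The name json_value_naive and the sample data are made up for the example; this is plain pattern matching, not JSON parsing.

```shell
# json_value_naive KEY: print whatever sits between "KEY": and the next
# comma on each stdin line. KEY is assumed to contain no regex metacharacters.
json_value_naive() {
  sed -n 's/.*"'"$1"'":\([^,]*\),.*/\1/p'
}

# Works when the value has no comma and another key follows.
printf '%s\n' '[{"foo1":11,"foo2":222,"foo3":3333}]' | json_value_naive foo2

# A comma inside the value cuts it short.
printf '%s\n' '[{"foo1":"a","foo2":"some, value","foo3":"b"}]' | json_value_naive foo2

# The last key has no trailing comma, so nothing matches and nothing prints.
printf '%s\n' '[{"foo1":11,"foo2":222,"foo3":3333}]' | json_value_naive foo3
```

The second and third calls show why the "values do not contain commas" assumption, and the position of the key, matter.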

With this awk for example:
$ awk -F'[:,]' '{print $4}' file
<some value2>
-F'[:,]' sets the possible field separators to : or , (quoted so the shell does not treat it as a glob). Then it is a matter of counting the position at which the <some value> of foo2 sits. It happens to be the 4th.
With sed:
$ sed 's/.*"foo2":\([^,]*\).*/\1/g' file
<some value2>
.*"foo2":\([^,]*\).* gets the string coming after foo2: and until the comma appears. Then it prints it back with \1.

Your block of data looks like JSON. There is no native JSON parsing in bash, sed or awk, so ALL the answers here will either suggest that you use a different, more appropriate tool, or they will be hackish and might easily fail if your real data looks different from the example you've provided here.
That said, if you are confident that your variable:value blocks and line structure are always in the same format as this example, you may be able to get away with writing your own (very) basic parser that will work for just your use case.
Note that you can't really parse things in sed, it's just not designed for that. If your data always looks the same, a sed solution may be sufficient ... but remember that you are simply pattern matching, not parsing the input data. There are other answers already which cover this.
For very simple matching of the string that appears after the colon after "foo2", as Peter suggested, you could use the following:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | sed -ne 's/.*"foo2":\([^,]*\),.*/\1/p'
As I say, this should in no way be confused with parsing of your JSON. It would work equally well (or badly) with an input string of abcde"foo2":bar,abcde.
In awk, you can make things that are a bit more advanced, but you still have serious limitations when it comes to JSON. For example, if you choose to separate fields with commas, but then you put a comma inside the <some value> in your data, awk doesn't know how to distinguish it from a field separator.
That said, if your JSON is only one level deep (i.e. matches your sample data), the following might work for you:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | awk -F: -vRS=, '{gsub(/[^[:alnum:]]/,"",$1)} $1=="foo2" {print $2}'
This awk script considers commas as record separators and colons as field separators. It does not support any level of depth in your JSON, and depends on alphanumeric variable names. But it should handle JSON split on to multiple lines.
Alternately, if you want to avoid ugly hacks, and perl or python solutions don't work for you, you might want to try out jsawk. With it, you might use something like this:
$ data='[{"foo1":11,"foo2":222,"foo3":3333}]'
$ echo "$data" | jsawk -a 'return this.foo2'
[222]
SEE ALSO: Parsing json with awk/sed in bash to get key value pair

This worked for me. You can Try this one
echo "[{"foo1":<some value>,"foo2":<some value>,"foo3":<some value>}]" | awk -F"[:,]+" '{ if($3=="foo2") { print $4 }}'
The awk command above uses multiple field separators; I have used colon and comma here.

Since this looks like JSON, let's parse it like JSON:
perl -MJSON -ne '$json = decode_json($_); print $json->[0]{foo2}, "\n"' <<END
[{"foo1":"some value","foo2":"some, value","foo3":"some value"}]
END
some, value

How to use sed to test and then edit one line of input?

I want to test whether a phone number is valid, and then translate it to a different format using a script. This far I can test the number like this:
sed -n -e '/(0..)-...\s..../p' -e '/(0..)-...-..../p'
However, I don't just want to test the number and output it, I would like to remove the brackets, dashes and spaces and output that.
Is there any way to do that using sed? Or should I be using something else, like AWK?

I'm not sure why you're using a 0 in that position. You're saying "a zero followed by any two characters" in the area code position. Is that really what you mean?
Anyway, you want to use the sed substitution operator with the p command in conjunction with the -n switch. Here's one way to do it:
sed -n 's/(\([0-9][0-9][0-9]\))\s\?\([0-9][0-9][0-9]\)[- ]\([0-9][0-9][0-9][0-9]\)/\1\2\3/p'

You can also use something as simple as egrep to validate lines and tr to remove the characters you don't want to see:
egrep "\([0-9]+\)[0-9.-]+" <file> |tr -d '()\-'
Note that it will only work if you don't want to keep any of those characters.

This is a more succinct version of Jonathan Feinberg's answer. It uses extended regular expressions to avoid having to do all the escaping that the curly braces would require (in addition to moving the escaping of parentheses from the special ones to the literal ones).
sed -r 's/\(([[:digit:]]{3})\)\s?([[:digit:]]{3})[ -]([[:digit:]]{4})/\1\2\3/'
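Since the question asks for testing and editing in one step, here is a sketch that combines -n with the p flag so that only valid numbers are printed, already stripped. The function name normalize_phone is made up, the accepted format (three digits in parentheses, an optional space or dash, three digits, a space or dash, four digits) is an assumption about the input, and -E replaces -r and \s for portability.

```shell
# normalize_phone NUMBER: print the ten digits if NUMBER looks like
# (ddd) ddd-dddd or a close variant; print nothing otherwise.
normalize_phone() {
  printf '%s\n' "$1" |
    sed -nE 's/^\(([0-9]{3})\)[ -]?([0-9]{3})[ -]([0-9]{4})$/\1\2\3/p'
}

normalize_phone '(703) 234-5678'
normalize_phone '(012)-345 6789'
normalize_phone '703-234-5678'     # no parentheses: rejected, prints nothing
```

Anchoring with ^ and $ is what turns the substitution into a validity test: a line with extra characters before or after the number does not match and is therefore not printed.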

This suggestion depends on what your number format looks like. For example, I assume a phone number like this:
echo "(703) 234 5678" | awk '
{
for(i=1;i<=NF;i++){
gsub(/\(|\)/,"",$i) # remove ( and )
if ($i+0>=0 ){ # check if it more than 0 and a number
print $i
}
# add some other checks here, e.g. on length($i)
}
}
'
do it systematically, and you don't have to waste time crafting out complex regex
