Using sed or awk to select - bash

I'm trying to select the lines between two markers in an HTML file. I've tried using sed and awk, but I think there's an issue with the way I'm escaping some of the characters. I have seen some similar questions and answers, but the examples given are simple, with no special characters. I need the lines between
<div class="bread crumb">
and
</div>
There is no other div within the block and there are multiple lines within the block.
Do I need to escape the characters <, > and ? as below?
sed -n -e '/^\<div class=\"bread crumb\"\>$/,/^\<\/div\>$/{ /^\<div class=\"bread crumb\">$/d; /^\<\/div>$/d; p; }'
My awk attempt :
awk '/\<div class=\"bread crumb\"\>/{flag=1;next}/\<\/div\>/{flag=0}flag'

Actually, you only need to escape the / in </div>, because / is the delimiter of the sed address; the rest can stay as it is:
sed -n '/<div class="bread crumb">/,/<\/div>/{//!p}'

You should use an HTML parser for that job.
If you still want to do it with sed, don't escape < and >: in GNU sed, \< and \> are word-boundary anchors, not literal angle brackets.
Try this:
sed -ne '/<div class="bread crumb">/,/<\/div>/{//!p;}' file
The empty regex // reuses the last regular expression applied, so the //!p part prints every line of the block except the two lines matching the address patterns.
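To see the idiom in action, here is a minimal sketch run against a made-up sample. The helper name between_markers and the sample HTML are illustrative assumptions, not part of the question.

```shell
# Hypothetical helper: print only the lines strictly between the two markers.
between_markers() {
  sed -n '/<div class="bread crumb">/,/<\/div>/{//!p;}' "$@"
}

# Made-up sample input, for illustration only.
sample='<p>before</p>
<div class="bread crumb">
<a href="/">Home</a>
<a href="/docs">Docs</a>
</div>
<p>after</p>'

printf '%s\n' "$sample" | between_markers
# <a href="/">Home</a>
# <a href="/docs">Docs</a>
```

Only the two <a> lines are printed; the marker lines themselves are suppressed by //!p.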

Just use string matches in awk:
awk '$0=="</div>"{f=0} f{print} $0=="<div class=\"bread crumb\">"{f=1} ' file

Related

How to add double quote in csv file where field contains space?

One feature of our legacy code doesn't work, and I have to work around it by redeveloping a quick and dirty replacement.
We generate a CSV file, and the legacy code produced something like this:
foo; bar;"foo bar";foobar
"bla ble"; bli;blo;"blu bly"
Each field in my CSV that contains a space must be surrounded by double quotes (").
Currently, with my quick and dirty script, the CSV file contains only
foo; bar;foo bar;foobar
bla ble; bli;blo;blu bly
This is not good, because my quick and dirty script would be a breaking change for clients :D
I am developing the script in bash (/bin/bash). I've searched around sed and awk but wasn't able to find anything that helps.
Can you? :)
Thanks!
Here is a simple awk:
$ awk 'BEGIN{FS=OFS=";"}{for(i=1;i<=NF;++i) if ($i ~ / /) $i = "\042" $i "\042"}1' file.csv
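The expected output in the question leaves the fields " bar" and " bli" unquoted even though they start with a space, which suggests that only a space between two other characters should trigger quoting. Assuming that reading is right, a variant of the same loop could look like this (quote_inner_spaces is a hypothetical name):

```shell
# Hypothetical helper: quote a field only when a space sits between two
# non-space characters, so a merely leading space does not count.
quote_inner_spaces() {
  awk 'BEGIN { FS = OFS = ";" }
       { for (i = 1; i <= NF; ++i) if ($i ~ /[^ ] +[^ ]/) $i = "\042" $i "\042" }
       1' "$@"
}

printf '%s\n' 'foo; bar;foo bar;foobar' 'bla ble; bli;blo;blu bly' | quote_inner_spaces
# foo; bar;"foo bar";foobar
# "bla ble"; bli;blo;"blu bly"
```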
To quote fields that contain spaces (for example foo;foo bar -> foo;"foo bar") you can use sed:
sed 's/ *\(\w\+ \+\)\+\w\+/"&"/g' input.csv > output.csv
The pattern  *\(\w\+ \+\)\+\w\+ matches zero or more spaces, then one or more repetitions (the outer \+) of the group \(\w\+ \+\), which is a word followed by one or more spaces, and finally a closing word \w\+. The replacement "&" wraps the whole match in double quotes. Note that \w and \+ are GNU sed extensions.
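As a rough check, the pattern described above can be run against the two sample lines from the question. This is a sketch that assumes GNU sed; the wrapper name quote_spaced_fields is invented for the example.

```shell
# Hypothetical wrapper around the substitution (needs GNU sed for \w and \+).
quote_spaced_fields() {
  sed 's/ *\(\w\+ \+\)\+\w\+/"&"/g' "$@"
}

printf '%s\n' 'foo; bar;foo bar;foobar' 'bla ble; bli;blo;blu bly' | quote_spaced_fields
# foo; bar;"foo bar";foobar
# "bla ble"; bli;blo;"blu bly"
```

Fields such as " bar" stay unquoted because a run of words needs at least one space between two words before the pattern matches.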
Using Miller (https://github.com/johnkerl/miller) and running
mlr --icsvlite --ocsv --quote-all --fs ";" cat input
you will have
"foo";"bar";"foo bar";"foobar"
"bla ble";"bli";"blo";"blu bly"
I assume it is not a problem for you to have double quotes around every field.
echo "foo; bar;foo bar;foobar" | sed s'#;#+#'g | tr '+' '\n' | \
sed s'#^#\"#'g | sed s'#$#\";#'g | tr -d '\n'
The first thing this code does is replace the semicolon delimiters with a placeholder (+), which can then be replaced with newlines.
From there, it's simple. I first put a double quote at the start of every line, and then a closing double quote and a semicolon at the end.
After that, I use tr to remove the newlines again, which puts all of the semicolon-delimited fields back on the same line.
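If quoting every field is acceptable, the same idea can be sketched as a single awk call, which does not rely on + never appearing in the data and does not leave a trailing semicolon. The function name quote_all_fields is made up for this example.

```shell
# Hypothetical helper: wrap every semicolon-separated field in double quotes.
quote_all_fields() {
  awk 'BEGIN { FS = OFS = ";" }
       { for (i = 1; i <= NF; i++) $i = "\"" $i "\"" }
       1' "$@"
}

printf '%s\n' 'foo; bar;foo bar;foobar' | quote_all_fields
# "foo";" bar";"foo bar";"foobar"
```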

Removing all the characters from a string after pattern+2

I am trying to remove all the characters from a string after a specific pattern +2 in bash.
In this case I have for example:
3434.586909
3434.58690932454
3434.5869093232r3353
I'd like to keep just 3434.58
I tried with awk and a wildcard, but my tests haven't worked yet.
You can use sed:
sed 's/\(\...\).*/\1/'
It means "remember a dot and the two characters that follow it, then replace them and everything after them with the remembered part".
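A quick way to convince yourself is to feed it the three sample values from the question. The wrapper name keep_two_decimals is hypothetical.

```shell
# Hypothetical wrapper: keep the dot plus the two characters after it,
# and drop everything that follows.
keep_two_decimals() {
  sed 's/\(\...\).*/\1/' "$@"
}

printf '%s\n' 3434.586909 3434.58690932454 3434.5869093232r3353 | keep_two_decimals
# 3434.58
# 3434.58
# 3434.58
```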
How about using floating-point formatting? Note that printf rounds rather than truncates, so 3434.586909 becomes 3434.59, not 3434.58:
awk '{printf("%.02f\n",$0)}' Input_file
awk '{print substr($0,1,7)}' file
3434.58
3434.58
3434.58
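The substr($0,1,7) call above assumes the integer part always has four digits. A sketch that locates the dot instead, under the assumption that each line holds a single value, might look like this (truncate_after_dot is a hypothetical name):

```shell
# Hypothetical helper: cut each line two characters after the first dot;
# lines without a dot are passed through unchanged.
truncate_after_dot() {
  awk '{ p = index($0, "."); print (p ? substr($0, 1, p + 2) : $0) }' "$@"
}

printf '%s\n' 3434.586909 12.3456 7 | truncate_after_dot
# 3434.58
# 12.34
# 7
```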

Match the same pattern n times on the same line using sed

I have an input file div.txt that looks like this:
<div>a</div>b<div>c</div>
<div>d</div>
Now I want to pick all the div tags and the text between them using sed:
sed -n 's:.*\(<div>.*</div>\).*:\1:p' < div.txt
The result I get:
<div>c</div>
<div>d</div>
What I really want:
<div>a</div>
<div>c</div>
<div>d</div>
So the question is, how to match the same pattern n times on the same line?
(do not suggest me to use perl or python, please)
This might work for you (GNU sed):
sed 's/\(<\/div>\)[^<]*/\1\n/;/^</P;D' file
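Replace the first </div>, together with any following characters that are not a <, by the </div> itself and a newline. Then print the first line of the pattern space only if it begins with a <, delete that line, and repeat on what remains.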
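Here is that command applied to the div.txt content from the question, as a sketch that assumes GNU sed (for \n in the replacement). The function name one_div_per_line is made up.

```shell
# Hypothetical wrapper: emit each <div>...</div> on its own line (GNU sed).
one_div_per_line() {
  sed 's/\(<\/div>\)[^<]*/\1\n/;/^</P;D' "$@"
}

printf '%s\n' '<div>a</div>b<div>c</div>' '<div>d</div>' | one_div_per_line
# <div>a</div>
# <div>c</div>
# <div>d</div>
```

The stray b between the two divs is dropped because it is swallowed by [^<]* in the substitution.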
Sed is not the right tool for handling HTML.
But if you really insist, and you know your input will always have properly closed pairs of div tags, you can replace everything between a closing </div> and the next <div> with a newline (GNU sed; because .* is greedy, this assumes at most two divs per line):
sed 's=</div>.*<div>=</div>\n<div>='

Adding a new line to a text file after 5 occurrences of a comma in Bash

I have a text file that is basically one giant excel file on one line in a text file. An example would be like this:
Name,Age,Year,Michael,27,2018,Carl,19,2018
I need to change every third occurrence of a comma into a newline so that I get
Name,Age,Year
Michael,27,2018
Carl,19,2018
Please let me know if that is too ambiguous and as always thank you in advance for all the help!
With GNU sed:
sed -E 's/(([^,]*,){2}[^,]*),/\1\n/g'
To change the number of fields per line, change {2} to one less than the number of fields. For example, to change every fifth comma (as in the title of your question), you would use:
sed -E 's/(([^,]*,){4}[^,]*),/\1\n/g'
In the regular expression, [^,]*, is "zero or more characters other than , followed by a ,"; in other words, it is a single comma-delimited field. This won't work if the fields are quoted strings with internal commas or newlines.
Regardless of what Linux's man sed says, the -E flag is an extension to POSIX sed that makes sed use extended regular expressions (EREs) rather than basic regular expressions (see man 7 regex). -E also works on BSD sed, used by default on Mac OS X. (Thanks to @EdMorton for the note.)
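The "one less than the number of fields" rule can be wrapped in a small helper so the field count becomes a parameter. This is only a sketch assuming GNU sed; split_every_n_fields is a hypothetical name.

```shell
# Hypothetical helper: break one long comma-separated line after every N fields.
# Usage: split_every_n_fields N [file...]
split_every_n_fields() {
  n=$1
  shift
  # The repeat count is N-1 because the Nth field is matched by the final [^,]*.
  sed -E "s/(([^,]*,){$((n - 1))}[^,]*),/\1\n/g" "$@"
}

printf '%s\n' 'Name,Age,Year,Michael,27,2018,Carl,19,2018' | split_every_n_fields 3
# Name,Age,Year
# Michael,27,2018
# Carl,19,2018
```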
With GNU awk for multi-char RS:
$ awk -v RS='[,\n]' '{ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
With any awk:
$ awk -v RS=',' '{sub(/\n$/,""); ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
Try this:
$ cat /tmp/22.txt
Name,Age,Year,Michael,27,2018,Carl,19,2018,Nooka,35,1945,Name1,11,19811
$ echo "Name,Age,Year"; grep -o "[a-zA-Z][a-zA-Z0-9]*,[1-9][0-9]*,[1-9][0-9]\{3\}" /tmp/22.txt
Michael,27,2018
Carl,19,2018
Nooka,35,1945
Name1,11,1981
The ,[1-9][0-9]\{3\} part is used so that you don't have to write [0-9] three more times for the YYYY part.
PS: This solution gives you only four digits (YYYY) for the year: even if the data contains a typo such as 19811, you'll still get 1981.
You are looking for groups of three fragments, each without a comma, separated by commas.
The last fields can cause problems (no trailing comma, and maybe only two fields).
The following command looks fine:
grep -Eo "([^,]*[,]{0,1}){0,3}" inputfile
This might work for you (GNU sed):
sed 's/,/\n/3;P;D' file
Replace the third , with a newline, print the first line, delete it, and repeat on the remainder.
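The loop also copes with a short final record, which the sketch below tries to show. It assumes GNU sed, and both the function name split_by_three and the truncated sample line are made up for illustration.

```shell
# Hypothetical wrapper: put three comma-separated fields on each output line.
split_by_three() {
  sed 's/,/\n/3;P;D' "$@"
}

# The last record deliberately has only two fields.
printf '%s\n' 'Name,Age,Year,Michael,27,2018,Carl,19' | split_by_three
# Name,Age,Year
# Michael,27,2018
# Carl,19
```

When fewer than three commas remain, the substitution does nothing, P prints what is left, and D ends the cycle.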

parse word from html file

I am having a lot of trouble trying to extract a word from an html file. The line in the html file appears like this:
<span id="result">WORD</span>
I am trying to get the WORD out but I can't figure it out. So far I've got:
grep 'span id="result"' FILE
Which just gets me the line. I've also tried:
sed -n '/<span id="result">/,/<\/span>/p' FILE
which didn't work either.
I know this is probably a very simple question, but I'm just beginning so I could really use some help.
Do not use regex to parse HTML.
Use an HTML parser.
My Xidel has the shortest syntax for this:
xidel FILE -e "#result"
This is a task for awk.
I guess you have other lines in the same file, so a search for span id is a must.
echo '<span id="result">WORD</span>' | awk -F"[<>]" '/span id/ {print $3}'
WORD
You can try
awk -f ext.awk input.html
where input.html is your input html file, and ext.awk is
{
line=line $0 RS
}
END {
match (line,/<span id="result">([^<]*)<\/span>/,a)
print a[1]
}
This will extract the contents even across line breaks. Note that the three-argument form of match() is a GNU awk extension.
Use grep with a lookbehind assertion:
grep -Po '(?<=<span id="result">)\w+'
The expression in parentheses is a lookbehind assertion; it is not captured but serves as a test for the following part of the regex: \w+ is matched only where the expression appears immediately before it. Option -o outputs only the matched word; option -P enables Perl-compatible regular expressions, which provide lookahead and lookbehind.
If you want to modify this regex, please note that with grep, a lookbehind must have a fixed length.
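When the text before the word is not of fixed length, one possible workaround is the PCRE \K escape, which discards everything matched so far from the reported match. The sketch below assumes a GNU grep built with -P support; the function name extract_result and the sample lines are invented.

```shell
# Hypothetical helper: print the text of the span with id="result".
# \s+ would not be accepted inside a lookbehind, but it is fine before \K.
extract_result() {
  grep -Po '<span\s+id="result">\K[^<]+' "$@"
}

printf '%s\n' '<p>other</p>' '<span   id="result">WORD</span>' | extract_result
# WORD
```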
