chunking a file with awk or a shell script

This feels like it should be a simple task, but somehow I can't wrap my brain around it. I have HTML files with headers from H1 to H4. I would like to get the content between H3 tags: not the text between <H3> and </H3>, but rather the text between two H3 headings.
<H3>some text</H3>
<p> more text that I would like to grab</p>
<H3> some other text </H3>
<p> some more text that I'd like to get </p>
...
Thank you in advance
I've been asked to describe a sample output, which I thought I did in a comment below. I will restate it here; if something is not clear, please let me know.
input: long file with many H3 headings
output: many small files each containing a fragment that starts with the line containing an H3 heading, and ends on the line before the next H3 heading.

Without you posting your expected output we're just guessing but if you literally want the text between </H3> and <H3>, here's one way with GNU awk:
$ cat file
<H3>some text</H3>
<p> more text that I would like to grab</p>
<H3> some other text </H3>
<p> some more text that I'd like to get </p>
$ gawk -F'</H3>' -v RS="<H3>" -v ORS= 'NR>1{print $NF}' file
<p> more text that I would like to grab</p>
<p> some more text that I'd like to get </p>
$
$ cat file
<H3>some text</H3><p>more text that I would like to grab</p><H3>some other text</H3><p> some more text that I'd like to get </p>
$ gawk -F'</H3>' -v RS="<H3>" -v ORS= 'NR>1{print $NF}' file
<p>more text that I would like to grab</p><p> some more text that I'd like to get </p>
$ gawk -F'</H3>' -v RS="<H3>" 'NR>1{print $NF}' file
<p>more text that I would like to grab</p>
<p> some more text that I'd like to get </p>
You need GNU awk for that so you can have a multi-character RS.
Note that when there are newlines included in the text between your blocks those are reproduced in the output just like any other characters.
If the above is not what you want, again, tell us more....

The problem is that HTML syntax is quite flexible. For example:
<H3>some text</H3>
<p> more text that I would like to grab</p>
<H3> some other text </H3>
<p> some more text that I'd like to get </p>
And
<H3>
some text
</H3>
<p>
more
text
that
I
would
like
to
grab</p>
<H3>
some other text
</H3>
<p>some more text that I'd like to get
</p>
Will produce the same output. Extra whitespace is stripped, and tags can be scattered all about. You can't simply look for a particular tag to know what you're after.
The only real way to do this is to use a full-fledged scripting language like Perl or Python that has modules that can parse and organize HTML files for you. You can't reliably parse HTML or XML with Unix regular expressions.
Unfortunately, you've tagged this as bash, shell, or awk, and none of those can really handle HTML input in a clean manner.

As a start, this shell line will extract the first H3 to H3 section...
$ sed -e '1,/<H3/d' -e '/<H3/,$d'
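Building on that idea, here is a minimal sketch for the output the question describes (one small file per H3 section), using plain awk. It assumes every opening H3 tag starts on a line of its own, which real HTML does not guarantee. The function name split_h3 and the chunk file naming are made up for illustration.

```shell
# split_h3 FILE [PREFIX]
# Writes each H3-led fragment to PREFIX001.html, PREFIX002.html, ...
# (split_h3 and the naming scheme are hypothetical, not from the thread.)
split_h3() {
    awk -v prefix="${2:-chunk}" '
        /<[Hh]3[ >]/ {                      # a line with an opening H3 tag starts a new chunk
            if (out != "") close(out)       # finish the previous chunk file
            out = sprintf("%s%03d.html", prefix, ++n)
        }
        out != "" { print > out }           # lines before the first H3 are skipped
    ' "$1"
}
```

Called as split_h3 page.html part, it should leave part001.html, part002.html and so on in the current directory, each ending on the line before the next H3 heading.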

Related

How to extract an HTML tag by ID?

How can I extract HTML content on a page by ID?
I tried exploring sed/grep solutions for an hour. None worked.
I then gave in and explored HTML/XML parsers. html-xml-utils can only get an element by class, not ID, making it totally useless. I consulted the manual and it seems there's no way to get by id.
xmlstarlet seemed more promising, yet it whines when I try passing it HTML files rather than XML files. The following spits out at least 100 errors:
cat /home/com/interlinked/blog.html | tail -n +2 | xmlstarlet sel -T -t -m '/div/article[@id="post33"]' -v '.' -n
I used cat here because I don't want to modify the actual file. I used tail to cut out the DOCTYPE declaration which seemed to be causing issues earlier: Extra content at the end of the document
The content on the page is well formatted and consistent. Content looks like this:
<article id="post44">
... more HTML tags and content here...
</article>
I'd like to be able to extract everything between the specific article tags here by ID (e.g. if I pass it "44" it will return the contents of post44, if I pass it 34, it will return the contents of post34).
What sets this apart from other questions is I do not want just the content, I want the actual HTML between the article tags. I don't need the article tags themselves, though removing them is probably trivial.
Is there a way to do this using the built in Unix tools or xmlstarlet or html-xml-utils? I also tried the following sed which also failed to work:
article=`patt=$(printf 'article id="post%d"' $1); sed -n '/<$patt>/,/<\/article>/{ /article>/d; p }' $file`
Here I am passing in the file path as $file, and $1 is the blog post ID (44 or 34 or whatever). The reason for the two statements in one is that $1 doesn't get evaluated within the sed statement otherwise, because of the single quotes. That helps the variable resolve in a related grep command but not in this sed command.
Complete HTML structure:
<!doctype html>
<html lang="en">
<head>
<title>Page</title>
</head>
<body>
<header>
<nav>
<div id="sitelogo">
<img src="/img/logo/logo.png" alt="InterLinked"></img>
</div>
<ul>
<p>Menu</p>
</ul>
</nav>
<hr>
</header>
<div id="main">
<h1>Blog</h1>
<div id="bloglisting">
<article id="post44">
<p>Content</p>
</article>
<article id="post43">
<p>Content</p>
</article>
</div>
</div>
</body>
</html>
Also, to clarify, I need this to work on 2 different pages. Some posts are inline on this main page, but longer ones have their own page. The structure is similar, but not exactly the same. I'd like a solution that just finds the ID and doesn't need to worry about parent tags, if possible. The article tags themselves are formatted the same way on both kinds of pages. For instance, on a longer blog post with its own page, the difference is here:
<div id="main">
<h1>Why Ridesharing Is Evil</h1>
<div id="blogpost">
<article id="post43">
<div>
In this case, the div bloglisting becomes blogpost. That's really the only big difference.

You can use the libxml2 tools to parse HTML/XML with proper syntax awareness. In your case, you can use xmllint with the --html flag to parse the HTML file, and provide an XPath query from the top level down to the node of your choice.
For example, to get the content of the post with ID post43, use a filter like
xmllint --html --xpath \
"//html/body/div[@id='main']/div[@id='bloglisting']/article[@id='post43']" html
If the xmllint compiled on your machine does not understand a few recent (HTML5) tags like <article> or <nav>, suppress the warnings by adding 2>/dev/null at the end of the command.
If you want to get only the contents within <article> and not have the tags themselves, remove the first and last line by piping the result to sed as below.
xmllint --html --xpath \
"//html/body/div[@id='main']/div[@id='bloglisting']/article[@id='post43']" html 2>/dev/null |
sed '1d; $d'
To use a variable for the post-id, define a shell variable and use it within the xpath query
postID="post43"
xmllint --html --xpath \
"//html/body/div[@id='main']/div[@id='bloglisting']/article[@id='"$postID"']" html 2>/dev/null |
sed '1d; $d'
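If xmllint is not available, the sed attempt from the question can be made to work by putting the script in double quotes so that the shell expands the post ID. This is only a sketch: it assumes the opening and closing article tags each sit on their own line and that the id attribute is written exactly as in the sample page. get_article is a hypothetical helper name.

```shell
# get_article ID FILE
# Prints the lines between <article id="postID"> and the next </article>,
# without the article tags themselves. (Hypothetical helper, line-based only.)
get_article() {
    sed -n "/<article id=\"post$1\">/,/<\/article>/{
        /<\/*article/d
        p
    }" "$2"
}
```

For example, get_article 43 blog.html would print the inner lines of post43. Unlike the XPath query, this does not care about parent elements, but it breaks as soon as an article tag shares a line with other content.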

Substitute unmatched left angle brackets in HTML

My problem: How to find lines with unmatched left angle brackets and replace these brackets with their HTML equivalents.
Example input:
<dd>
Pro 10g Flüssigkeit: 2g Wasserstoffperoxid <10% Tenside. ENTHÄLT: Sulfamidsäure,</dd>
Expected output by substituting the unmatched '<10%' string:
<dd>
Pro 10g Flüssigkeit: 2g Wasserstoffperoxid &lt;10% Tenside. ENTHÄLT: Sulfamidsäure,</dd>
There are German 'Umlaute' included in my example text just in case they could 'mess something up'...
I would like to use sed or awk if possible.
I have read:
Use sed with regex and (, How to decrement (substract) number in file with sed and
sed - regex square brackets detection in Linux and other Q&A but I can't seem to get my head around regexes. Sorry!
Thanks a lot for your help!

This is a dangerous proposal, because sed works on a line-by-line basis, and for each line, there are several cases to consider:
There could be only the less-than character without any html tags:
<p>
x < 10
</p>
There could be, as in your example, a html tag after the less-than character
<p> x < 10 </p>
The less-than character could be inside a html tag.
<img src="..." alt="Graph for x < 10">
It could be a really long html tag which is closed in a later line.
<img
src="..."
alt="..."
>
What I'd do is first assume that only the first two cases are present, then use something like this:
sed -i.orig -r 's/<([^>]*($|<))/\&lt;\1/g' file
This will keep a backup of the original file with the new extension .orig, so that you can then run a diff program over both to see what has changed.
As for how this works:
s/AAA/BBB/g replaces any occurrence of AAA with BBB
s/A(CC)/B\1/g replaces ACC with BCC, that is the part in the parenthesis is inserted for the \1
[^>]* means zero or more of any characters other than >
($|<) is either the end of line or <, whichever comes first.
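So it searches for a < that is not followed by a > before either the next < or the end of the line, and replaces that part with &lt; followed by everything it found after the initial <. The & has to be written as \& in the replacement, because a bare & in sed stands for the whole matched text.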
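To see the substitution on the first two cases, here is a small sketch that wraps it in a filter reading standard input. The name escape_lt is made up, and -E is used in place of -r on the assumption that your sed accepts it (GNU and BSD sed both do).

```shell
# escape_lt: read HTML lines on stdin and turn an unmatched "<" into "&lt;".
# Hypothetical wrapper; only the first two cases are handled correctly.
escape_lt() {
    # \& is a literal ampersand; a bare & would re-insert the matched text
    sed -E 's/<([^>]*($|<))/\&lt;\1/g'
}

printf '%s\n' 'x < 10' | escape_lt            # prints: x &lt; 10
printf '%s\n' '<p> x < 10 </p>' | escape_lt   # prints: <p> x &lt; 10 </p>
```

For the third case (a < inside an attribute value) the same expression escapes the opening bracket of the tag itself, so check the diff against the .orig backup before trusting the result.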

This might be good enough:
$ sed -E 's/<([^>]+<)/\&lt;\1/g' file
<dd>
Pro 10g Flüssigkeit: 2g Wasserstoffperoxid &lt;10% Tenside. ENTHÄLT: Sulfamidsäure,</dd>
If not, then edit your question to provide a more complete (but still concise and testable) example that truly represents your real input.
There's nothing special about an umlaut or any other input character, by the way.

How can I parse out a line below a specific string? [duplicate]

I need to get the HTML contents between a pair of given tags using a bash script.
As an example, having the HTML code below:
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
Using the bash command/script, given the body tag, we would get:
text
<div>
text2
<div>
text3
</div>
</div>
Thanks in advance.

Plain text processing is not a good fit for HTML/XML parsing. I hope this gives you some idea:
kent$ xmllint --xpath "//body" f.html
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>

Using sed in shell/bash, so you needn't install something else.
tag=body
sed -n "/<$tag>/,/<\/$tag>/p" file
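The range above also prints the two lines holding the tags themselves, while the question asks only for what lies between them. A minimal awk sketch that skips those lines follows. It assumes the opening and closing tags each sit on their own line, carry no attributes, and are not nested inside a tag of the same name; inner_lines is a hypothetical name.

```shell
# inner_lines TAG FILE
# Prints the lines strictly between <TAG> and </TAG>. (Hypothetical helper.)
inner_lines() {
    awk -v tag="$1" '
        $0 ~ "</" tag ">" { inside = 0 }   # closing tag line: stop before printing it
        inside                             # print every line while inside the element
        $0 ~ "<" tag ">"  { inside = 1 }   # opening tag line: start with the next line
    ' "$2"
}
```

With the sample file, inner_lines body file would print the seven lines from text down to the second closing div. Asking for div instead would go wrong because of the nesting, which is exactly the kind of case a real parser handles.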

Personally, I find the hxselect command (often with the help of hxclean) from the html-xml-utils package very useful. hxclean turns a (sometimes broken) HTML file into well-formed XML, and hxselect lets you use CSS selectors to pick the node(s) you need. With the -c option, it strips the surrounding tags. All these commands read stdin and write stdout. So in your case you would run:
$ hxselect -c body <<HTML
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
HTML
to get what you need. Plain and simple.

Setting Bash aside because of its limitations, you can use nokogiri as a command-line utility, as explained here.
Example:
curl -s http://example.com/ | nokogiri -e 'puts $_.search('\''a'\'')'

Another option is to use the multi-platform xidel utility (home page on SourceForge, GitHub repository), which can handle both XML and HTML:
xidel -s in.html -e '/html/body/node()' --printed-node-format=html
The above prints the resulting HTML with syntax highlighting (colored), and seemingly with an empty line after the text node.
If you want the text only, Reino points out that you can simplify to:
xidel -s in.html -e '/html/body/inner-html()'

Consider using beautifulspoon.
Select the body tag from the above .html:
$ beautifulspoon example.html --select body
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
And to unwrap the tag:
$ beautifulspoon example.html --select body |beautifulspoon --select body --unwrap
text
<div>
text2
<div>
text3
</div>
</div>

BASH is probably the wrong tool for this. Try a Python script using the powerful Beautiful Soup library instead.
It will be more work upfront but in the long run (here: after one hour), the time savings will make up for the additional effort.

Using bash in order to extract data from a HTML forum list

I'm looking to create a quick script, but I've run into some issues.
<li type="square"> Y </li>
I'm basically using wget to download an HTML file, and then trying to search the file for the above snippet. Y is dynamic and changes each time, so in one it might be "Dave", and in another "Chris". So I'm trying to get the bash script to find
<li type="square"> </li>
and tell me what is in between the two. The general formatting of the file is very messy:
<html stuff tags><li type="square">Dave</li><more html stuff>
<br/><html stuff>
<br/><br/><li type="square">Chris</li><more html stuff><br/>
I've been unable to come up with anything that works for parsing the file, and would really appreciate someone to give me a push in the right direction.
EDIT -
<div class="post">
<hr class="hrcolor" width="100%" size="1" />
<div class="inner" id="msg_4287022"><ul class="bbc_list"><li type="square">-dave</li><li type="square">-chris</li><li type="square">-sarah</li><li type="square">-amber</li></ul><br /></div>
</div>
is the block of code that I'm looking to extract the names from. The "-" symbol is something I added to the list to narrow the scope, so that I get just that list. The problem I'm having is that:
awk '{print $2}' FS='(<[^>]*>)+-' 4287022.html > output.txt
only outputs the first list item, and not the rest.

You generally should not use regex to parse html files.
Instead you can use my Xidel to perform pattern matching on it:
xidel 4287022.html -e '<li type="square">{.}</li>*'
Or with traditional XPath:
xidel 4287022.html -e '//li[@type="square"]'

You could use grep -Eo "<li type=\"square\">-?(\w+)</li>" ./* for this.

Using sed:
sed -n 's/.*<li type="square"> *\([^<]*\).*/\1/p' input.html

awk '{print $2,$3,$4,$5}' FS='(<[^>]*>)+' 4287022.html
This treats the HTML page as a table. However, instead of runs of whitespace as the field separator, runs of HTML tags are the field separator. The first field in this case is the empty string at the beginning of the line. The second field is the first name, followed by the other names in fields 3 to 5, so we print those.
Result
-dave -chris -sarah -amber
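The field numbers above are fixed at four names. As a sketch under the same assumption (each name is prefixed with "-" and sits between tags), a loop over all fields copes with a list of any length and prints one name per line. list_names is a made-up function name.

```shell
# list_names FILE
# Splits each line on runs of HTML tags and prints every field that starts
# with "-", minus the dash. (Hypothetical helper; relies on the "-" marker.)
list_names() {
    awk -F'(<[^>]*>)+' '
        { for (i = 1; i <= NF; i++) if ($i ~ /^-/) print substr($i, 2) }
    ' "$1"
}
```

Run as list_names 4287022.html > output.txt, this should write dave, chris, sarah and amber on separate lines for the block shown in the question.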

How to extract text between particular HTML tag in script

Given that I have some HTML in the form:
<html>
<body>
<div id="1" class="c">some other html stuff</div>
</body>
</html>
How can I extract this with Unix script?
some other html stuff

You may checkout the html-xml-utils and the hxselect command which allows you to extract elements that match a CSS selector:
hxselect '.c' < test.htm
This assumes that your input is a well-formed XML document. If it is not you might need to resort to regular expressions and the possible consequences of that.
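For completeness, the regular-expression fallback mentioned above could look roughly like this. It is a fragile sketch: it assumes the whole div, including its closing tag, is on one line, that the class attribute holds exactly one class in double quotes, and that the div contains no nested tags. div_text is a hypothetical name.

```shell
# div_text CLASS FILE
# Prints the text of a single-line <div ... class="CLASS" ...>text</div>.
# (Hypothetical helper; a regex stand-in, not a real HTML parser.)
div_text() {
    sed -n "s/.*<div[^>]*class=\"$1\"[^>]*>\([^<]*\)<\/div>.*/\1/p" "$2"
}
```

On the sample from the question, div_text c test.htm should print the text some other html stuff.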

For simple uses, you can use Ex editor, for example:
$ ex +'/<div/norm vity' +'%d|pu 0|%p' -scq! file.html
some other html stuff
Here it finds the div tag, selects the inner content of that tag (vit), yanks it (y), replaces the buffer with the yanked text (%d, pu 0), prints it (%p), and quits (-cq!).
Other example with demo URL:
$ ex +'/<div/norm vity' +'%d|pu 0|%p' -Nscq! http://example.com/
The advantage is that ex is a standard Unix editor available in most Linux/Unix distributions.
See also:
How to jump between matching HTML/XML tags? at Vim SE
How to remove inner content of html tag conditionally? at Vim SE
