I was trying to parse out text between specific tags on a Mac in various HTML files. I was looking for the first <H1> heading in the body. Example:
<BODY>
<H1>Dublin</H1>
Using regular expressions for this is, I believe, an anti-pattern, so I used xmllint and XPath instead.
xmllint --nowarning --xpath '/HTML/BODY/H1[0]'
Problem is, some of the HTML files contain badly formed tags, so I get errors along the lines of
parser error : Opening and ending tag mismatch: UL line 261 and LI
</LI>
Problem is, I can't just do 2>/dev/null, as then I lose those files altogether. Is there any way I can use an XPath expression here and just say: relax if the XML isn't perfect, just give me the value between the first pair of H1 tags?
Try the --html option. Otherwise, xmllint parses your document as XML, which is a lot stricter than HTML. Also note that XPath indices are 1-based and that HTML tag names are lowercased during parsing. The command
xmllint --html --xpath '/html/body/h1[1]' - <<EOF
<BODY>
<H1>Dublin</H1>
EOF
prints
<h1>Dublin</h1>
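If xmllint isn't available, the same lenient first-heading extraction can be sketched with Python's standard-library html.parser, which also tolerates malformed markup (a stdlib illustration of the idea, not part of the original answer):

```python
from html.parser import HTMLParser

class FirstH1(HTMLParser):
    """Collect the text of the first <h1> element, ignoring bad markup."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False   # currently inside the first <h1>
        self.done = False    # first <h1> already captured
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.text.append(data)

def first_h1(html):
    parser = FirstH1()
    parser.feed(html)
    return "".join(parser.text)

print(first_h1("<BODY>\n<H1>Dublin</H1>"))  # prints: Dublin
```

Like libxml2's HTML parser, html.parser lowercases tag names, so the uppercase <H1> in the input still matches.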
How can I extract HTML content on a page by ID?
I tried exploring sed/grep solutions for an hour. None worked.
I then gave in and explored HTML/XML parsers. html-xml-utils can only get an element by class, not by ID, making it useless here. I consulted the manual and it seems there's no way to select by ID.
xmlstarlet seemed more promising, yet it whines when I try passing it HTML files rather than XML files. The following spits out at least 100 errors:
cat /home/com/interlinked/blog.html | tail -n +2 | xmlstarlet sel -T -t -m '/div/article[@id="post33"]' -v '.' -n
I used cat here because I don't want to modify the actual file. I used tail to cut out the DOCTYPE declaration, which seemed to be causing issues earlier ("Extra content at the end of the document").
The content on the page is well formed and consistent. Content looks like this:
<article id="post44">
... more HTML tags and content here...
</article>
I'd like to be able to extract everything between the specific article tags here by ID (e.g. if I pass it "44" it will return the contents of post44, if I pass it 34, it will return the contents of post34).
What sets this apart from other questions is I do not want just the content, I want the actual HTML between the article tags. I don't need the article tags themselves, though removing them is probably trivial.
Is there a way to do this using the built-in Unix tools, or xmlstarlet, or html-xml-utils? I also tried the following sed, which also failed to work:
article=`patt=$(printf 'article id="post%d"' $1); sed -n '/<$patt>/,/<\/article>/{ /article>/d; p }' $file`
Here I am passing in the file path as $file, and $1 is the blog post ID (44, 34, or whatever). The reason for the two statements in one is that $1 doesn't otherwise get evaluated within the sed statement, because of the single quotes. That trick helps the variable resolve in a related grep command, but not in this sed command.
Complete HTML structure:
<!doctype html>
<html lang="en">
<head>
<title>Page</title>
</head>
<body>
<header>
<nav>
<div id="sitelogo">
<img src="/img/logo/logo.png" alt="InterLinked"></img>
</div>
<ul>
<p>Menu</p>
</ul>
</nav>
<hr>
</header>
<div id="main">
<h1>Blog</h1>
<div id="bloglisting">
<article id="post44">
<p>Content</p>
</article>
<article id="post43">
<p>Content</p>
</article>
</div>
</div>
</body>
</html>
Also, to clarify, I need this to work on two different pages. Some posts are inline on this main page, but longer ones have their own page. The structure is similar, but not exactly the same. I'd like a solution that just finds the ID and doesn't need to worry about parent tags, if possible. The article tags themselves are formatted the same way on both kinds of pages. For instance, on a longer blog post with its own page, the difference is here:
<div id="main">
<h1>Why Ridesharing Is Evil</h1>
<div id="blogpost">
<article id="post43">
<div>
In this case, the div bloglisting becomes blogpost. That's really the only big difference.
You can use the libxml2 tools to parse HTML/XML with proper syntax awareness. In your case, you can use xmllint, ask it to parse the file as HTML with the --html flag, and provide an XPath query from the top level to get the node of your choice.
For example, to get the content for post id post43, use a filter like
xmllint --html --xpath \
"//html/body/div[#id='main']/div[#id='bloglisting']/article[#id='post43']" html
If the xmllint compiled on your machine does not understand a few recent (HTML5) tags like <article> or <nav>, suppress the warnings by adding 2>/dev/null at the end of the command.
If you want to get only the contents within <article> and not have the tags themselves, remove the first and last line by piping the result to sed as below.
xmllint --html --xpath \
"//html/body/div[#id='main']/div[#id='bloglisting']/article[#id='post43']" html 2>/dev/null |
sed '1d; $d'
To use a variable for the post ID, define a shell variable and interpolate it into the XPath query:
postID="post43"
xmllint --html --xpath \
"//html/body/div[#id='main']/div[#id='bloglisting']/article[#id='"$postID"']" html 2>/dev/null |
sed '1d; $d'
I'm writing a filter for pandoc in python. I'm using pandocfilters.
I want to replace a Para[Image] with a Figure[InlineEl1, InlineEl2].
Figure is not supported by pandoc, so I'm using a RawBlock to write raw HTML. The problem is that I don't know the HTML for InlineEl1 and InlineEl2. I need to let pandoc process them.
Possible workaround: use a Div and then modify the resulting html file by hand.
Is there a better method?
edit: Or maybe I can put inline elements in a RawBlock? I'm just using a simple string for now. I don't know if it's possible, as I don't have any documentation available; I'm just proceeding by trial and error.
As of pandoc 2.0, the figure representation in the AST is still somewhat ad hoc. A figure is simply a paragraph that contains nothing but an image, with the image's title attribute starting with fig:.
$ echo '![caption](/url/of/image.png)' | pandoc -t native
[Para [Image ("",[],[]) [Str "caption"] ("/url/of/image.png","fig:")]]
$ echo '![caption](/url/of/image.png)' | pandoc -t html
<figure>
<img src="/url/of/image.png" alt="caption" />
<figcaption>caption</figcaption>
</figure>
See http://pandoc.org/MANUAL.html#extension-implicit_figures
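To see how a filter could act on that representation, here is a stdlib-only sketch (no pandocfilters dependency) that rewrites such a Para node into a RawBlock. It assumes the JSON AST shapes shown in the native output above; a real filter would let pandoc render the caption inlines instead of flattening them to plain text as done here:

```python
import json

def para_to_figure(block):
    """If block is a Para holding a single implicit-figure Image
    (title starting with 'fig:'), return an html RawBlock; else None.
    Caption inlines are flattened to their Str contents for brevity."""
    if block.get("t") != "Para" or len(block["c"]) != 1:
        return None
    inline = block["c"][0]
    if inline.get("t") != "Image":
        return None
    _attr, caption_inlines, (url, title) = inline["c"]
    if not title.startswith("fig:"):
        return None
    caption = " ".join(i["c"] for i in caption_inlines if i.get("t") == "Str")
    html = ('<figure><img src="%s" alt="%s" />'
            '<figcaption>%s</figcaption></figure>' % (url, caption, caption))
    return {"t": "RawBlock", "c": ["html", html]}

# Example AST node, matching the native output shown above for:
#   echo '![caption](/url/of/image.png)' | pandoc -t json
para = {"t": "Para",
        "c": [{"t": "Image",
               "c": [["", [], []],
                     [{"t": "Str", "c": "caption"}],
                     ["/url/of/image.png", "fig:"]]}]}
print(json.dumps(para_to_figure(para)))
```

In an actual pandocfilters-based filter, the same test-and-replace would live in the action function passed to toJSONFilter.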
I am converting from Markdown to HTML like so:
pandoc --columns=70 --mathjax -f markdown input.pdc -t html -Ss > out.html
Everything works fine, except that the text doesn't get wrapped. I tried different column lengths, with no effect. Removed options, no go. Whatever I tried, the HTML just doesn't get wrapped. I searched the bug tracker, but there don't seem to be any open bugs relating to this issue. I also checked the documentation, but as far as I could glean, the text ought to be line-wrapped... So, have I stumbled into a bug?
I'm using pandoc version 1.12.4.2.
Thanks in advance for your help!
Pandoc puts newlines in the HTML so the source code is easier to read. By default, it doesn't insert <br> tags.
If you want to preserve line breaks from markdown input:
pandoc -f markdown+hard_line_breaks input.md -o output.html
However, usually a better approach to limit the text width when the HTML file is opened in a browser is to adapt the HTML template (pandoc -D html5) and add some CSS, for example:
<!DOCTYPE html>
<html$if(lang)$ lang="$lang$"$endif$>
<head>
<style>
body {
width: 46em;
}
</style>
...
It is not clear which text should get wrapped but does not, as you did not provide a sample.
Pandoc supports several line breaking scenarios in markdown documents.
What you may be looking for is the hard_line_breaks extension.
If so, your command should look like
pandoc --columns=70 --mathjax -f markdown+hard_line_breaks input.pdc -t html -Ss > out.html
I'd recommend reading about all the markdown-relevant options and configuring pandoc to match your input markdown flavor.
I'm looking to create a quick script, but I've run into some issues.
<li type="square"> Y </li>
I'm basically using wget to download an HTML file, and then trying to search the file for the above snippet. Y is dynamic and changes each time, so in one it might be "Dave", and in another "Chris". So I'm trying to get the bash script to find
<li type="square"> </li>
and tell me what is in between the two tags. The general formatting of the file is very messy:
<html stuff tags><li type="square">Dave</li><more html stuff>
<br/><html stuff>
<br/><br/><li type="square">Chris</li><more html stuff><br/>
I've been unable to come up with anything that works for parsing the file, and would really appreciate someone to give me a push in the right direction.
EDIT -
<div class="post">
<hr class="hrcolor" width="100%" size="1" />
<div class="inner" id="msg_4287022"><ul class="bbc_list"><li type="square">-dave</li><li type="square">-chris</li><li type="square">-sarah</li><li type="square">-amber</li></ul><br /></div>
</div>
is the block of code that I'm looking to extract the names from. The "-" symbol is something added onto the list to narrow its scope, so I just get that list. The problem I'm having is that:
awk '{print $2}' FS='(<[^>]*>)+-' 4287022.html > output.txt
only outputs the first list item, not the rest.
You generally should not use regex to parse html files.
Instead you can use my Xidel to perform pattern matching on it:
xidel 4287022.html -e '<li type="square">{.}</li>*'
Or with traditional XPath:
xidel 4287022.html -e '//li[@type="square"]'
You could use grep -Eo "<li type=\"square\">-?(\w+)</li>" ./* for this (note that -o prints the whole match, tags included).
Using sed:
sed -n 's/.*<li type="square"> *\([^<]*\).*/\1/p' input.html
awk '{print $2,$3,$4,$5}' FS='(<[^>]*>)+' 4287022.html
This presents the HTML page as a table. However, instead of runs of whitespace, runs of HTML tags serve as the field separator. The first field in this case is the empty space at the beginning of the line. The second field is the name, so we print that.
Result
-dave -chris -sarah -amber
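The same per-item extraction can also be sketched with Python's standard-library html.parser, which copes with the messy single-line markup (an illustration only; the attribute match assumes type="square" exactly as in the sample):

```python
from html.parser import HTMLParser

class SquareItems(HTMLParser):
    """Collect the text of every <li type="square"> element."""
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("type") == "square":
            self.in_item = True
            self.items.append("")   # start a new item

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data  # accumulate text of the current item

p = SquareItems()
p.feed('<ul class="bbc_list"><li type="square">-dave</li>'
       '<li type="square">-chris</li><li type="square">-sarah</li></ul>')
print(p.items)  # prints: ['-dave', '-chris', '-sarah']
```

Unlike the awk field-splitting approach, this collects every matching item regardless of how many tags surround it on the line.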
Given that I have some HTML in the form:
<html>
<body>
<div id="1" class="c">some other html stuff</div>
</body>
</html>
How can I extract this with Unix script?
some other html stuff
You may check out html-xml-utils and the hxselect command, which allows you to extract elements matching a CSS selector:
hxselect '.c' < test.htm
This assumes that your input is a well-formed XML document. If it is not, you might need to resort to regular expressions, with all the possible consequences of that.
For simple uses, you can use Ex editor, for example:
$ ex +'/<div/norm vity' +'%d|pu 0|%p' -scq! file.html
some other html stuff
where it finds the div tag, selects the inner HTML of the found tag (vit), yanks it (y) so it can replace the buffer contents (%delete, put 0), prints the buffer (%print), and quits (-cq!).
Other example with demo URL:
$ ex +'/<div/norm vity' +'%d|pu 0|%p' -Nscq! http://example.com/
The advantage is that ex is a standard Unix editor available in most Linux/Unix distributions.
See also:
How to jump between matching HTML/XML tags? at Vim SE
How to remove inner content of html tag conditionally? at Vim SE