wildcard usage with xpath

I'm having problems selecting both of the following tags with XPath:
<div class="table-row">
<div class="table-row ">
I tried:
"//div[@class='table-row*']"
but this doesn't match either of the tags above.
What am I doing wrong? Thanks for your help.

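XPath has no wildcard syntax inside string literals, so 'table-row*' is compared literally, asterisk included. Use starts-with() in place of the '*' wildcard:
//div[starts-with(@class, 'table-row')]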
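One thing worth knowing about the prefix test is that it also matches longer class names such as "table-row-header". The Python sketch below is only an illustration of that difference, not an XPath engine: it uses the standard library's xml.etree.ElementTree and plain string methods to mimic the prefix test and a whole-token test. The sample markup, including the "table-row-header" and "other" divs, is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; the third div shows a prefix-only match.
SAMPLE = """<root>
  <div class="table-row">a</div>
  <div class="table-row ">b</div>
  <div class="table-row-header">c</div>
  <div class="other">d</div>
</root>"""

root = ET.fromstring(SAMPLE)
divs = list(root.iter("div"))

# Same idea as //div[starts-with(@class, 'table-row')]: a plain prefix test.
prefix_matches = [d.text for d in divs
                  if d.get("class", "").startswith("table-row")]

# Stricter variant: "table-row" must be one of the whitespace-separated classes.
token_matches = [d.text for d in divs
                 if "table-row" in d.get("class", "").split()]

print(prefix_matches)
print(token_matches)
```

If only the exact class should match, the whole-token test is the safer of the two; in XPath 1.0 the usual equivalent is contains(concat(' ', normalize-space(@class), ' '), ' table-row ').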

Related

Escape double quotes inside single quoted text

With Spring Boot 3.0.1 and Thymeleaf 3.1, in the expression below:
<a th:replace="~{fragments/paging :: paging(${totalPages}, ' Last >>', 'Last Page')}"></a>
I am trying to replace ' Last >>' with ' <i class="fa-solid fa-forward-step"></i>' and hence need to escape double-quotes.
I tried the solutions suggested here, here and in other SO threads, but none of them seem to work.
When I try using &quot;, the text appears as is, while for the other options I get a parsing error. What am I missing? After trying so many options I am feeling clueless. Any suggestion will be helpful.

How to extract an HTML tag by ID?

How can I extract HTML content on a page by ID?
I tried exploring sed/grep solutions for an hour. None worked.
I then gave in and explored HTML/XML parsers. html-xml-utils can only get an element by class, not ID, making it totally useless. I consulted the manual and it seems there's no way to get by id.
xmlstarlet seemed more promising, yet it whines when I try passing it HTML files rather than XML files. The following spits out at least 100 errors:
cat /home/com/interlinked/blog.html | tail -n +2 | xmlstarlet sel -T -t -m '/div/article[@id="post33"]' -v '.' -n
I used cat here because I don't want to modify the actual file. I used tail to cut out the DOCTYPE declaration which seemed to be causing issues earlier: Extra content at the end of the document
The content on the page is well formatted and consistent. Content looks like this:
<article id="post44">
... more HTML tags and content here...
</article>
I'd like to be able to extract everything between the specific article tags here by ID (e.g. if I pass it "44" it will return the contents of post44, if I pass it 34, it will return the contents of post34).
What sets this apart from other questions is I do not want just the content, I want the actual HTML between the article tags. I don't need the article tags themselves, though removing them is probably trivial.
Is there a way to do this using the built in Unix tools or xmlstarlet or html-xml-utils? I also tried the following sed which also failed to work:
article=`patt=$(printf 'article id="post%d"' $1); sed -n '/<$patt>/,/<\/article>/{ /article>/d; p }' $file`
Here I am passing in the file path as $file, and $1 is the blog post ID (44 or 34 or whatever). The two statements are combined into one because otherwise $1 doesn't get evaluated within the sed statement, due to the single quotes. That trick makes the variable resolve in a related grep command, but not in this sed command.
Complete HTML structure:
<!doctype html>
<html lang="en">
<head>
<title>Page</title>
</head>
<body>
<header>
<nav>
<div id="sitelogo">
<img src="/img/logo/logo.png" alt="InterLinked"></img>
</div>
<ul>
<p>Menu</p>
</ul>
</nav>
<hr>
</header>
<div id="main">
<h1>Blog</h1>
<div id="bloglisting">
<article id="post44">
<p>Content</p>
</article>
<article id="post43">
<p>Content</p>
</article>
</div>
</div>
</body>
</html>
Also, to clarify, I need this to work on 2 different pages. Some posts are inline on this main page, but longer ones have their own page. The structure is similar, but not exactly the same. I'd like a solution that just finds the ID and doesn't need to worry about parent tags, if possible. The article tags themselves are formatted the same way on both kinds of pages. For instance, on a longer blog post with its own page, the difference is here:
<div id="main">
<h1>Why Ridesharing Is Evil</h1>
<div id="blogpost">
<article id="post43">
<div>
In this case, the div bloglisting becomes blogpost. That's really the only big difference.

You can use the libxml2 tools to parse HTML/XML with proper syntax awareness. In your case, use xmllint with the --html flag to parse the HTML file, and provide an XPath query from the top level to select the node you want.
For example, to get the content of the post with id post43, use a query like
xmllint --html --xpath \
"//html/body/div[@id='main']/div[@id='bloglisting']/article[@id='post43']" html
Because // searches at any depth, the shorter query //article[@id='post43'] selects the same node without naming the parent tags, so it works on both page layouts.
If the xmllint compiled on your machine does not understand a few recent (HTML5) tags like <article> or <nav>, suppress the warnings by adding 2>/dev/null at the end of the command.
If you want to get only the contents within <article> and not have the tags themselves, remove the first and last line by piping the result to sed as below.
xmllint --html --xpath \
"//html/body/div[@id='main']/div[@id='bloglisting']/article[@id='post43']" html 2>/dev/null |
sed '1d; $d'
To use a variable for the post-id, define a shell variable and use it within the xpath query
postID="post43"
xmllint --html --xpath \
"//html/body/div[@id='main']/div[@id='bloglisting']/article[@id='$postID']" html 2>/dev/null |
sed '1d; $d'
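If xmllint is not available, the same lookup can be sketched with Python's standard library. The following is a minimal illustration, not a replacement for a real HTML parser: the class name InnerHTMLById, the helper inner_html_by_id and the sample PAGE are hypothetical, and the depth counting assumes that every non-void tag inside the article is properly closed. It finds the element by id alone, so the parent tags do not matter. Comments inside the element are dropped.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class InnerHTMLById(HTMLParser):
    """Collect the markup inside the first element with a given id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.target_id = target_id
        self.depth = 0      # 0 = outside the target, >= 1 = nesting depth inside
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1  # entered the target; its own tag is not stored

    def handle_startendtag(self, tag, attrs):
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or not self.depth or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # closing tag of the target itself
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.handle_data(f"&{name};")

    def handle_charref(self, name):
        self.handle_data(f"&#{name};")


def inner_html_by_id(html, element_id):
    parser = InnerHTMLById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical sample modelled on the page structure in the question.
PAGE = """<div id="bloglisting">
<article id="post44">
<p>Content 44</p>
</article>
<article id="post43">
<p>Content <b>43</b><br></p>
</article>
</div>"""

print(inner_html_by_id(PAGE, "post43"))
```

Passing "post" plus the numeric ID as the second argument gives the by-number lookup the question asks for. An id that does not occur yields an empty string.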

"Or" and "And" operators in Octopus variable substitution syntax

Is there some way to use logical "or" and "and" operators in variable substitution syntax like the following?
#{if Octopus.Action[Smoke Test] && Octopus.Action[Smoke Test].Output.FailedSmokeTestMessage}
<h3 style="color: red">Failed Smoke Tests</h3>
#{Octopus.Action[Smoke Test].Output.FailedSmokeTestMessage}
#{/if}

No. See the documentation on variable substitution syntax. Instead you could try nesting your if blocks.
#{if Octopus.Action[Smoke Test]}
#{if Octopus.Action[Smoke Test].Output.FailedSmokeTestMessage}
<h3 style="color: red">Failed Smoke Tests</h3>
#{Octopus.Action[Smoke Test].Output.FailedSmokeTestMessage}
#{/if}
#{/if}
There is no test case for this in Octostache, the open-source templating engine, so there's no guarantee it works. It is pretty easy to set up a test project for Octostache using its NuGet package to try it out and play around with the syntax.

How can I parse out a line below a specific string? [duplicate]

I need to get the HTML contents between a pair of given tags using a bash script.
As an example, having the HTML code below:
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
Using the bash command/script, given the body tag, we would get:
text
<div>
text2
<div>
text3
</div>
</div>
Thanks in advance.

Plain-text processing is not a good fit for HTML/XML parsing. I hope this gives you some idea:
kent$ xmllint --xpath "//body" f.html
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>

You can use sed in the shell, so you don't need to install anything else. This assumes the opening and closing tags have no attributes and sit on their own lines.
tag=body
sed -n "/<$tag>/,/<\/$tag>/p" file

Personally, I find the hxselect command (often together with hxclean) from the html-xml-utils package very useful. hxclean turns a (sometimes broken) HTML file into well-formed XML, and hxselect lets you use CSS selectors to get the node(s) you need. With the -c option, it strips the surrounding tags. All these commands work on stdin and stdout. So in your case you should execute:
$ hxselect -c body <<HTML
<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>
HTML
to get what you need. Plain and simple.

Setting Bash aside because of its limitations, you can use nokogiri as a command-line utility, as explained here.
Example:
curl -s http://example.com/ | nokogiri -e 'puts $_.search('\''a'\'')'

Another option is to use the multi-platform xidel utility (home page on SourceForge, GitHub repository), which can handle both XML and HTML:
xidel -s in.html -e '/html/body/node()' --printed-node-format=html
The above prints the resulting HTML with syntax highlighting (colored), and seemingly with an empty line after the text node.
If you want the text only, Reino points out that you can simplify to:
xidel -s in.html -e '/html/body/inner-html()'

Consider using beautifulspoon.
Select the body tag from the above .html:
$ beautifulspoon example.html --select body
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
And to unwrap the tag:
$ beautifulspoon example.html --select body |beautifulspoon --select body --unwrap
text
<div>
text2
<div>
text3
</div>
</div>

BASH is probably the wrong tool for this. Try a Python script using the powerful Beautiful Soup library instead.
It will be more work upfront but in the long run (here: after one hour), the time savings will make up for the additional effort.
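As a rough idea of what such a script could look like, here is a minimal sketch. It deliberately uses only the standard library's html.parser, not Beautiful Soup, so it is a stand-in and does not show that library's API. The class InnerHTMLByTag and the helper inner_html_by_tag are hypothetical names. The parser counts only tags with the requested name, so nested elements of the same name (the div inside a div from the question) are handled.

```python
from html.parser import HTMLParser


class InnerHTMLByTag(HTMLParser):
    """Collect the markup inside the first element with the given tag name."""

    def __init__(self, tag):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.tag = tag.lower()  # HTMLParser reports tag names in lower case
        self.depth = 0          # nesting depth of same-named tags; 0 = outside
        self.done = False
        self.parts = []

    def _keep(self, text):
        # Store text only while inside the target element.
        if self.depth and not self.done:
            self.parts.append(text)

    def handle_starttag(self, tag, attrs):
        self._keep(self.get_starttag_text())  # no-op while outside the target
        if tag == self.tag and not self.done:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        self._keep(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth and not self.done:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # closing tag of the target itself
                return
        self._keep(f"</{tag}>")

    def handle_data(self, data):
        self._keep(data)

    def handle_entityref(self, name):
        self._keep(f"&{name};")

    def handle_charref(self, name):
        self._keep(f"&#{name};")


def inner_html_by_tag(html, tag):
    parser = InnerHTMLByTag(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# The sample document from the question.
SAMPLE = """<html>
<head>
</head>
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
</html>"""

print(inner_html_by_tag(SAMPLE, "body"))
```

Only the first matching element is returned, and comments inside it are dropped. A full parser library handles badly broken markup far better than this sketch does.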

Using bash in order to extract data from a HTML forum list

I'm looking to create a quick script, but I've run into some issues.
<li type="square"> Y </li>
I'm basically using wget to download a HTML file, and then trying to search the file for the above snippet. Y is dynamic and changes each time, so in one it might be "Dave", and in the other "Chris". So I'm trying to get the bash script to find
<li type="square"> </li>
and tell me what is in between the two. The general formatting of the file is very messy:
<html stuff tags><li type="square">Dave</li><more html stuff>
<br/><html stuff>
<br/><br/><li type="square">Chris</li><more html stuff><br/>
I've been unable to come up with anything that works for parsing the file, and would really appreciate someone to give me a push in the right direction.
EDIT -
<div class="post">
<hr class="hrcolor" width="100%" size="1" />
<div class="inner" id="msg_4287022"><ul class="bbc_list"><li type="square">-dave</li><li type="square">-chris</li><li type="square">-sarah</li><li type="square">-amber</li></ul><br /></div>
</div>
is the block of code that I'm looking to extract the names from. The "-" symbol is something I added to the list items to narrow the match, so I get just that list. The problem I'm having is that:
awk '{print $2}' FS='(<[^>]*>)+-' 4287022.html > output.txt
only outputs the first list item, and not the rest.

You generally should not use regex to parse html files.
Instead you can use my Xidel to perform pattern matching on it:
xidel 4287022.html -e '<li type="square">{.}</li>*'
Or with traditional XPath:
xidel 4287022.html -e '//li[@type="square"]'

You could use grep -Eo "<li type=\"square\">-?(\w+)</li>" ./* for this.

Using sed:
sed -n 's/.*<li type="square"> *\([^<]*\).*/\1/p' input.html

awk '{print $2,$3,$4,$5}' FS='(<[^>]*>)+' 4287022.html
This treats the HTML page as a table. However, instead of runs of whitespace, runs of HTML tags are the field separator. The first field in this case is the empty string at the beginning of the line. The second through fifth fields are the names, so we print those.
Result
-dave -chris -sarah -amber
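The awk command above hard-codes four fields. As a sketch of the same field-separator idea for any number of list items, the Python snippet below splits on runs of tags with re.split and keeps the non-empty fields. It assumes, as in the question's EDIT, that the whole list sits on one line; the variable names are hypothetical, and like the awk version it is a regex shortcut, not real HTML parsing.

```python
import re

# The list line from the question's EDIT block.
LINE = ('<div class="inner" id="msg_4287022"><ul class="bbc_list">'
        '<li type="square">-dave</li><li type="square">-chris</li>'
        '<li type="square">-sarah</li><li type="square">-amber</li>'
        '</ul><br /></div>')

# Split on runs of tags, like awk's FS='(<[^>]*>)+'.
fields = re.split(r"(?:<[^>]*>)+", LINE)

# The first and last fields are empty strings, so drop blank fields.
names = [field for field in fields if field.strip()]

print(" ".join(names))
```

Because nothing depends on the field count, the same code works when the list grows or shrinks.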
