Get fields from a form in htmlagilitypack - html-agility-pack

I want to get the data for a form, so I wrote the line below. It didn't work:
doc.DocumentNode.SelectNodes("//form[@name='F1']//input[@name]");
Breaking it up into two steps did:
var node = doc.DocumentNode.SelectSingleNode("//form[@name='F1']");
var nodes = node.SelectNodes("//input[@name]");
However, I get the data from the entire HTML file rather than from the node/form, which is unexpected. How do I get the results from that form only? I tried /input[@name] and .//input[@name], which both gave me null.

It seems this is the default behavior for <form> tag parsing in Html Agility Pack. As they said here:
FORM is treated like this because many HTML pages used to have overlapping forms, as this was actually a (powerful) feature of original HTML. Now that XML and XHTML exist, everybody assumes that overlapping is an error, but it's not (in HTML 3.2).
You could change it by using:
HtmlNode.ElementsFlags.Remove("form");
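and your "//form[@name='F1']//input[@name]" expression should work. Or change the second expression to ".//input[@name]" and it should also work; a path that starts with // is evaluated from the document root even when it is called on a node, while the leading dot makes it relative to that node:
var node = doc.DocumentNode.SelectSingleNode("//form[@name='F1']");
var nodes = node.SelectNodes(".//input[@name]");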
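The scoping effect of the leading dot can be sketched outside Html Agility Pack. The snippet below is only an illustration in Python with the standard library's xml.etree.ElementTree, which supports a subset of XPath and does not reproduce Html Agility Pack's special handling of <form>. The sample markup and the second form, F2, are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup with two forms.
SAMPLE = """
<html><body>
  <form name="F1"><input name="user"/><div><input name="email"/></div><input/></form>
  <form name="F2"><input name="other"/></form>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Step 1: locate the form by its name attribute.
form = root.find(".//form[@name='F1']")

# Step 2: the leading "." keeps the search inside that form.
scoped = [node.get("name") for node in form.findall(".//input[@name]")]

# For comparison, the same search from the document root sees every form.
everything = [node.get("name") for node in root.findall(".//input[@name]")]

print(scoped)
print(everything)
```

Here scoped contains only the named inputs of F1, while everything also picks up the input that belongs to F2.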

Related

XPATH - how to pick up text in each html element regardless of classes

I am trying to grab some content from web pages that are not structured in a uniform fashion. What I want to do is tell the XPath to grab any content within HTML tags in the order it sees them and return the results, without having to specify div names etc., as they differ from page to page.
So I need to know how to say 'return any HTML content in the order that it's found within tags, regardless of whether they are classes, ems, strong tags, etc.' The only experience I have had with XPath is specifying actual div names, for example:
//div[@id='tab_info']
This XPath,
string(/)
will return the string value of the entire XML or HTML document. That is, it'll return a single string of all of the text in document order, as requested.
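As a rough illustration of what string(/) computes, here is a sketch in Python using the standard library's xml.etree.ElementTree. The toolkit and the sample markup are assumptions made for the example, since the question does not name a library. itertext() yields the text of an element and all of its descendants in document order, and joining the pieces gives the concatenated string.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup mixing several inline tags.
SAMPLE = (
    "<div><p>Some <em>emphasised</em> and <strong>bold</strong> text.</p>"
    "<span>More.</span></div>"
)

root = ET.fromstring(SAMPLE)

# Concatenate every piece of text in document order, ignoring the tags.
text = "".join(root.itertext())
print(text)
```

Note that the string value covers every descendant text node, so on a real page it would also include the contents of script and style elements.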

WebDriver select element that has ::before

I have 2 elements that have the same attributes but are shown one at a time on the page (when one is shown, the other disappears). The only difference between the two is that the element which is displayed will have the '::before' selector. Is it possible to use an XPath or CSS selector to retrieve the element based on its id and whether or not it has ::before?
I would also suggest trying a JavaScript-based solution.
::after and ::before are pseudo-elements that allow you to insert content onto a page from CSS (without it needing to be in the HTML). While the end result is not actually in the DOM, it appears on the page as if it were: you see it but can't locate it with XPath, for example (https://css-tricks.com/almanac/selectors/a/after-and-before/).
I can also suggest, if possible, giving the two elements different IDs or, if they are in different places in the DOM, building a more complex XPath that uses the elements above/below them and then checking whether the element is displayed.
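// Read the CSS 'content' value of the element's :after pseudo-element through JavaScript
String script = "return window.getComputedStyle(document.querySelector('.analyzer_search_inner.tooltipstered'),':after').getPropertyValue('content')";
// Fixed pause so the page can finish rendering before the script runs
Thread.sleep(3000);
JavascriptExecutor js = (JavascriptExecutor) driver;
String content = (String) js.executeScript(script);
System.out.println(content);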

get current element using Simple HTML DOM

I'm trying to use Simple HTML DOM to find objects via XPath.
It's working pretty well but I can't seem to get the current element:
$object->find('.');
$object->find('..');
$object->find('//');
all return an empty array
$object->innertext
returns a normal table with HTML, so the object IS valid.
Simple HTML DOM doesn't recognize '.' for getting the current element; in fact, it uses regular expressions to find elements rather than a full XPath engine.
To solve this problem I switched from Simple HTML DOM to DOMXPath, which has a lot more options and functionality.

Dynamically adding Html elements inside an Html doc added into XUL

I'm trying to create a Firefox extension using the same HTML files I used for my Chrome extension.
With some Google searching, I found a way to use
xmlns:xul="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul"
xmlns="http://www.w3.org/1999/xhtml"
as namespaces and to use the HTML file from my Chrome extension as it is, and it's working fine. Now I want to add elements dynamically to that HTML file using JavaScript. For example:
var testdiv=document.getElementById('test');
var a = document.createElement('a');
a.setAttribute("innerText", "test");
testdiv.appendChild(a);
But this is not giving the expected output. Any suggestions on this, or any other way to do it?
If you want to create elements in a namespace, you have to use the document.createElementNS method. So in your case, creating the A element would look like this:
var a = document.createElementNS('http://www.w3.org/1999/xhtml', 'a');
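The same DOM method exists outside the browser, so the idea can be tried in isolation. The sketch below uses Python's standard xml.dom.minidom purely as a stand-in for the extension's document; the div root element is made up. It also adds the link text as a text node, because setAttribute("innerText", ...) in the question creates an attribute called innerText rather than setting the element's text (in a browser, assigning a.textContent would be the usual way).

```python
from xml.dom.minidom import getDOMImplementation

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in document with a namespaced <div> root (hypothetical container).
impl = getDOMImplementation()
doc = impl.createDocument(XHTML_NS, "div", None)
testdiv = doc.documentElement

# Create the element in the XHTML namespace instead of using createElement.
a = doc.createElementNS(XHTML_NS, "a")

# Set the visible text through a text node, not through an attribute.
a.appendChild(doc.createTextNode("test"))
testdiv.appendChild(a)

print(a.namespaceURI)
print(a.firstChild.data)
```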

XPath "Not". Ignore branches with a specific tag

I have loaded a web page into the HTML Agility Pack and have a DOM. I want to use XPATH to pull out all of the text on the page (but not the javascript found within <script> tags).
I figure I need a //text() and then a 'not' to ignore any tag within the branch that has a <script> in it.
I have tried
doc.DocumentNode.SelectNodes("//text()[not(self::script)]")
and
doc.DocumentNode.SelectNodes("//text()[not(script)]")
but neither works. An example of the XPath property of a node that they return is (notice the script):
/html[1]/body[1]/div[2]/div[4]/div[1]/div[1]/div[1]/div[3]/script[1]/#text[1]
I have consulted both of these posts:
Is it possible to do 'not' matching in XPath?
Grab all text from html with Html Agility Pack (This is a good post but it brings out the JS)
Any suggestions?
Your first attempt rejects all text nodes that are script elements, and your second rejects all text nodes that have script node children. Of course, in both cases, the condition is never true.
You haven't explained your requirements clearly, but I guess you want to reject all text nodes that have script elements as their parent, which would be
//text()[not(parent::script)]
or
//*[not(self::script)]/text()
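To make the parent test concrete, here is a minimal sketch in Python with the standard library's xml.etree.ElementTree. That module has no text() node test, so the walk is written by hand; the helper name text_outside_script and the sample markup are made up for illustration. It yields, in document order, every piece of text whose parent element is not a script.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page containing one script block.
SAMPLE = (
    "<html><body><p>Hello <b>world</b>!</p>"
    "<script>var x = 1;</script><p>Bye</p></body></html>"
)

def text_outside_script(elem):
    """Yield text whose parent element is not <script>, in document order."""
    is_script = elem.tag == "script"
    # elem.text is the text directly inside elem, before its first child.
    if not is_script and elem.text:
        yield elem.text
    for child in elem:
        yield from text_outside_script(child)
        # child.tail is text that follows the child but still belongs to elem.
        if not is_script and child.tail:
            yield child.tail

root = ET.fromstring(SAMPLE)
pieces = list(text_outside_script(root))
print(pieces)
```

The script's own text ("var x = 1;") is skipped, while text in its sibling elements is kept.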
