get current element using Simple HTML DOM - xpath

I'm trying to use Simple HTML DOM to find objects via XPath.
It's working pretty well but I can't seem to get the current element:
$object->find('.');
$object->find('..');
$object->find('//');
all return an empty array
$object->innertext
returns a normal table with HTML, so the object IS valid.

Simple HTML DOM doesn't recognize '.' as the current element;
in fact, it matches selectors with regular expressions rather than a real XPath engine.
To solve this problem I used DOMXPath instead of Simple HTML DOM;
it offers a lot more options and functionality.
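For comparison, here is a minimal sketch of how '.' and '..' behave in an engine that does support them. It uses Python's standard-library xml.etree.ElementTree as a stand-in (an assumption for illustration: the answer above used PHP's DOMXPath, and ElementTree implements only a subset of XPath). The sample markup and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the table from the question.
root = ET.fromstring(
    "<div><table id='prices'><tr><td>cell</td></tr></table></div>"
)

table = root.find("table")

# '.' selects the context node itself, so the search returns the same object.
current = table.find(".")
print(current is table)

# '..' selects the parent; here it is evaluated for the td found below root.
row = root.find(".//td/..")
print(row.tag)
```

Note that ElementTree cannot climb above the element the search started from, so '..' is evaluated from the root here rather than from the td itself.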

Related

Unable to identify element in Blue Prism using XPath

I have spied an input text box using the Application Modeller of Blue Prism and was able to successfully highlight the text box using the below XPath:
/HTML/BODY(1)/DIV(4)/main(1)/DIV(1)/DIV(1)/DIV(1)/DIV(2)/DIV(1)/DIV(1)/DIV(2)/IFRAME(1)/HTML/BODY(1)/DIV(2)/FORM(1)/DIV(3)/TABLE(2)/TBODY(1)/TR(1)/TD(1)/DIV(1)/DIV(1)/DIV(1)/DIV(2)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/SPAN(1)/DIV(1)/DIV(2)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/DIV(1)/TABLE(1)/TBODY(1)/TR(1)/TD(1)/INPUT(1)
I wanted to use a more robust XPath and to achieve that I was trying to use the below XPath:
//*[@id="CT"]/div/div/div/div[1]/div[1]/table/tbody[1]/tr/td/input[1]
The above XPath identified the element correctly in Chrome, but I got the below error message when trying the same in Blue Prism:
Error - Highlighting results - Object reference not set to an instance of an object.
Let me know if I am doing anything incorrectly.
Sorry for replying to a pretty old one! The workaround we've devised for this scenario (where making the path dynamic requires too long of a loop / search) is to use jQuery snippets. If the page uses jQuery, it is trivial to execute these queries very quickly using Blue Prism's capability of executing JavaScript functions.
We also put in an enhancement request, because it'd be supremely useful functionality.
Update: As a user points out below, the vanilla JS querySelector method is probably safer and more future-proof than jQuery, if it can be used.
Blue Prism does not fully support the XPath spec; alas the construct you're attempting to use here won't work.
Alternatively, you can set the Path attribute of an application modeler entry to be Dynamic, which allows you to insert dynamic parameters from the process/object level to pinpoint elements you'd like to interact with.
Unfortunately Blue Prism doesn't actually use "real" XPaths, but only an extremely limited subset: Absolute paths without wildcards. (Note: It is technically possible to match the XPath to a string with wildcards, but this seemingly causes BP to check every single element in the document, and is so slow it is almost never the right solution.)
For cases where an element can't be robustly identified via the Blue Prism application modeler (maybe because it requires complex or dynamic selectors), my workaround is to inject a JS snippet. JS can select elements much more reliably, and it can then generate the Blue Prism path for that element.
Returning data from JS to Blue Prism is not trivial, but one of the nicer solutions is to have JS create a <script id="_output"> element, put JSON inside it, then have Blue Prism read the contents of this element.

WebDriver select element that has ::before

I have two elements with the same attributes, but only one is shown on the page at a time (when one is shown, the other disappears). The only difference between the two is that the displayed element has the '::before' pseudo-element. Is it possible to use an XPath or CSS selector to retrieve the element based on its id and whether or not it has ::before?
I would also suggest trying the JavaScript approach.
::after and ::before are pseudo-elements, which let you insert content onto a page from CSS (without it needing to be in the HTML). While the end result is not actually in the DOM, it appears on the page as if it is: you see it but can't locate it with XPath, for example (https://css-tricks.com/almanac/selectors/a/after-and-before/).
If possible, I'd also suggest giving the two elements different IDs or, if they sit in different places in the DOM, writing a more specific XPath based on the elements above/below them and then checking whether the element is displayed.
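// Read the CSS 'content' value of the ::after pseudo-element via JavaScript (pass ':before' to inspect ::before instead).
String script = "return window.getComputedStyle(document.querySelector('.analyzer_search_inner.tooltipstered'),':after').getPropertyValue('content')";
// Crude fixed wait for the page to render; an explicit wait would be more robust.
Thread.sleep(3000);
JavascriptExecutor js = (JavascriptExecutor) driver;
String content = (String) js.executeScript(script);
// Typically prints "none" when the element has no such pseudo-element.
System.out.println(content);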

How to select all links on a page using XPath

I want to write a function that identifies all the links on a particular HTML page. My idea was to use XPath, by using a path such as //body//a[x] and incrementing x to go through the first, second, third link on the page.
Whilst trying this out in Chrome, I load up the page http://exoplanet.eu/ and in the Chrome Developer Tools JS console, I call $x("//body//a[1]"). I expect the very first link on the page, but this returns a list of multiple anchor elements. Calling $x("//body//a[2]") returns two anchor elements. Calling $x("//body//a[3]") returns nothing.
I was hoping that incrementing the [x] each time would give me each unique link on the page one by one, but they seem to be grouped. How can I rewrite this path so that it picks each anchor tag, one by one?
Your //body//a[1] should be (//body//a)[1] if you want to select the first link on the page. The former expression selects every a element that is the first a child of its parent element.
But it seems a very odd thing to do anyway. Why do you need the links one by one? Just select all of them, as a node-list or node-set, using //body//a, and then iterate over the set.
If you use the paths //body/descendant::a[1], //body/descendant::a[2] and so on, you can select the descendant a elements of the body element one by one. Or, with your attempt, you need parentheses, e.g. (//body//a)[1], (//body//a)[2] and so on.
Note, however, that inside the browser with JavaScript there is a document.links collection in the object model, so no XPath is needed to access the links.
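The difference between the two forms can be sketched with Python's standard-library xml.etree.ElementTree (an assumption for illustration; the question used Chrome's $x). ElementTree has no parenthesised (…)[n] syntax, so the "select everything, then index" step is done on the returned list instead. The sample page and the hrefs helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: two links share one parent, a third sits in another.
body = ET.fromstring(
    "<body>"
    "<div><a href='/one'>1</a><a href='/two'>2</a></div>"
    "<p><a href='/three'>3</a></p>"
    "</body>"
)


def hrefs(elements):
    """Return the href attribute of each element, in the order given."""
    return [a.get("href") for a in elements]


# Predicate applied per parent: every a that is the first a child of its parent.
first_per_parent = hrefs(body.findall(".//a[1]"))

# Only the div has a second a child, and no parent has a third.
second_per_parent = hrefs(body.findall(".//a[2]"))
third_per_parent = hrefs(body.findall(".//a[3]"))

# Select the whole set once, then index or iterate over it in document order.
all_links = hrefs(body.findall(".//a"))
first_on_page = all_links[0]  # Python lists are 0-based, XPath positions 1-based
```

This reproduces the grouping seen in the question: the [1] form returns one link per parent, while iterating over the full list visits each link exactly once.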

XPath "Not". Ignore branches with a specific tag

I have loaded a web page into the HTML Agility Pack and have a DOM. I want to use XPATH to pull out all of the text on the page (but not the javascript found within <script> tags).
I figure I need a //text() and then a 'not' to ignore any tag within the branch that has a <script> in it.
I have tried
doc.DocumentNode.SelectNodes("//text()[not(self::script)]")
and
doc.DocumentNode.SelectNodes("//text()[not(script)]")
but neither works. An example of the XPath property of a node that they return is (notice the script)
/html[1]/body[1]/div[2]/div[4]/div[1]/div[1]/div[1]/div[3]/script[1]/#text[1]
I have consulted with both of these posts.
Is it possible to do 'not' matching in XPath?
Grab all text from html with Html Agility Pack (This is a good post but it brings out the JS)
Any suggestions?
Your first attempt rejects all text nodes that are script elements, and your second rejects all text nodes that have script node children. Of course, in both cases, the condition is never true.
You haven't explained your requirements clearly, but I guess you want to reject all text nodes that have script elements as their parent, which would be
//text()[not(parent::script)]
or
//*[not(self::script)]/text()
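To make the parent relationship concrete, here is a rough sketch in Python (an assumption for illustration; the question uses HTML Agility Pack in C#). The standard-library ElementTree does not support text() or not(), so this is not the XPath itself: it walks the tree by hand and skips script elements entirely, which for ordinary HTML collects the same text nodes. The function name and sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_outside_scripts(root):
    """Collect non-empty text in document order, skipping script elements."""
    found = []

    def visit(el):
        if el.tag == "script":
            return  # text inside a script has the script as its parent
        if el.text and el.text.strip():
            found.append(el.text.strip())
        for child in el:
            visit(child)
            # Text after a child belongs to el, even if the child is a script.
            if child.tail and child.tail.strip():
                found.append(child.tail.strip())

    visit(root)
    return found


# Hypothetical page with one inline script.
page = ET.fromstring(
    "<html><body>"
    "<p>Hello <b>world</b>!</p>"
    "<script>var x = 1;</script>"
    "<div>Bye</div>"
    "</body></html>"
)
texts = text_outside_scripts(page)
```

The check is made on the element that owns the text, not on the text itself, which is the same shift that parent::script makes in the XPath.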

how to use Xpath with LibXml 2

At this address I am trying to scrape a tag (the large price, which is the bold red one).
I use LIBXML 2.2.
When I try to extract the tag through this XPath
//*[@class='priceLarge']
it works!
But to make queries easier I would like to use Firebug on Firefox.
Firebug gives me this XPath
/html/body/div[2]/form/table[3]/tbody/tr/td/div/table/tbody/tr[2]/td[2]/span/b
This XPath does not work; it seems it does not give a complete query. How can I modify this XPath to scrape the item?
Firefox and other browsers insert tbody elements into tables when they build the DOM, even if the source HTML has none.
In fact, the tbody is probably not there in the source, so you can remove it from your XPath: /html/body/div[2]/form/table[3]/tr/td/div/table/tr[2]/td[2]/span/b. You can test this by just saving the HTML from your application and viewing it in a text editor.
Since the intent seems to be to pull information from a web page, however, your application will probably be more resistant to changes in the page if you use an XPath less dependent on the tree structure (e.g. //b[@class='priceLarge']).
EDIT: It seems that in addition to the tbody problem, Firefox is rendering the div (ID: divsinglecolumnminwidth) element as containing the form element (ID: handleBuy).
Looking at the html with an XML editor shows that the form element is a sibling of that div element, so the expression should start with /html/body/form/table[3].
One tool, among many others, to test your XPath expressions is HAP Testbed.
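The tbody point can be checked with a small sketch using Python's standard-library xml.etree.ElementTree (an assumption for illustration; the question uses libxml2 directly). The markup below is a hypothetical, much-reduced stand-in for the page's source, and the price value is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical source markup: note there is no tbody, as in the raw HTML.
html = ET.fromstring(
    "<html><body><table><tr><td>"
    "<span><b class='priceLarge'>$9.99</b></span>"
    "</td></tr></table></body></html>"
)

# Path as a browser inspector would show it, with a tbody the source lacks.
with_tbody = html.find("body/table/tbody/tr/td/span/b")

# The same path with tbody removed matches the source markup.
without_tbody = html.find("body/table/tr/td/span/b")

# Attribute-based lookup does not depend on the table structure at all.
by_class = html.find(".//b[@class='priceLarge']")

print(with_tbody, without_tbody.text, by_class.text)
```

The attribute-based form keeps working if the surrounding table layout changes, which is the robustness argument made above.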
