How to grab a piece of data which has a different XPath on different webpages?

So I am trying to grab a piece of data that sits at a different XPath on different pages.
If you look at the XPath of the IPA pronunciation on Wiktionary, for example https://en.wiktionary.org/wiki/foo, you will see that it is
//*[@id="mw-content-text"]/ul[1]/li[1]/span[4]
but if I go to another word, like https://en.wiktionary.org/wiki/bar, then the XPath is
//*[@id="mw-content-text"]/ul[1]/li[2]/span[5]
I cannot think of any way to reconcile these. Is there something that I am missing?

The answer is simple. Never let a tool write any XPath for you. All tools get it wrong.
Look at the document's HTML source and write the appropriate XPath yourself.
var result = document.evaluate("//*[@class = 'IPA']", document),
    elem;
while (elem = result.iterateNext()) {
    console.log(elem);
}
The above shows the simplest variant. It selects two occurrences of <span class="IPA"> on https://en.wiktionary.org/wiki/foo and quite a few more on https://en.wiktionary.org/wiki/bar.
Use a more specific expression to narrow down the results.
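As a rough sketch of what "more specific" can mean here: the expression below keeps the content container from the question and the class from the answer, and takes only the first match. The two page fragments are hypothetical, simplified stand-ins (not real Wiktionary markup), the pronunciation strings are placeholders, and the class name IpaLookup is made up for illustration. It uses the JDK's XPath engine, which needs well-formed XML; against a live page you would pass the same expression string to document.evaluate instead.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class IpaLookup {
    // Hypothetical, simplified stand-ins for two pages with different layouts.
    // The pronunciation strings are placeholders, not real Wiktionary data.
    static final String PAGE_A =
        "<div id='mw-content-text'><ul>"
        + "<li><span>IPA</span><span class='IPA'>/fu/</span></li>"
        + "</ul></div>";
    static final String PAGE_B =
        "<div id='mw-content-text'><ul>"
        + "<li>Rhymes</li>"
        + "<li><span>a</span><span>b</span><span class='IPA'>/bar/</span></li>"
        + "</ul></div>";

    // Select by what the element is (class "IPA" inside the content div),
    // not by where it sits; the outer (...)[1] keeps only the first match.
    static final String FIRST_IPA =
        "(//*[@id='mw-content-text']//span[@class='IPA'])[1]";

    static String firstIpa(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // Returns the string value of the first selected node.
            return xpath.evaluate(FIRST_IPA, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstIpa(PAGE_A));
        System.out.println(firstIpa(PAGE_B));
    }
}
```

The same expression works on both fragments even though the span sits at li[1]/span[2] in one and li[2]/span[3] in the other, which is the situation the question describes.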

Related

Find HTML Tags in Properties

My current issue is to find HTML tags inside property values. I thought it would be easy to search with a query like /jcr:root/content/xgermany//*[jcr:contains(., '<strong>')] order by @jcr:score
It looks like there is a problem with the characters < and >, because this query finds everything which has strong in its property. It finds <strong>Some Text</strong> but also This is a strong man.
The Query Builder API didn't help me either.
Is there a way to solve this with an XPath or SQL query, or do I have to iterate through the whole content?
I don't fully understand why it finds This is a strong man as a result for '<strong>', but it sounds like the unexpected behavior comes from the "simple search-engine syntax" for the second argument to jcr:contains(). Apparently the < > are just being ignored as "meaningless" punctuation.
You could try quoting the search term:
/jcr:root/content/xgermany//*[jcr:contains(., '"<strong>"')]
though you may have to tweak that if your whole XPath expression is enclosed in double quotes.
Of course this will not be very robust even if it works, since you're trying to find HTML elements by searching for fixed strings, instead of actually parsing the HTML.
If you have a specific jcr:primaryType and know the targeted properties, you can do something like this:
select * from nt:unstructured where text like '%<strong>%'
I tested it, but you need to know the properties you are interested in.
This is JCR-SQL syntax.
Start using predicates like a champ; this way all of this will make sense to you!
HTML Encode &lt;strong&gt;
HTML Decimal &#60;strong&#62;
Query builder is your friend:
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%<strong>%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%3Cstrong%3E%25
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%&lt;strong&gt;%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%26lt%3Bstrong%26gt%3B%25
XPath:
/jcr:root/content/geometrixx//element(*, nt:unstructured)
[
jcr:like(@text, '%<strong>%')
]
SQL2 (already covered... NASTY YUK..)
SELECT * FROM [nt:unstructured] AS s WHERE ISDESCENDANTNODE([/content/geometrixx]) and text like '%<strong>%'
Although I'm sure it's entirely possible with a string of predicates, it's possibly heading down the wrong route. Ideally it would be better to parse the HTML when it is stored or published.
The required information would be stored on simple properties on the node in question. The query will then be a lot simpler with just a property = value query, than lots of overly complex query syntax.
It will probably be faster too.
So you could read in your HTML with something like HTMLClient and then parse it with an OSGi service that accurately saves these properties for you. Every time the HTML is changed, the process would update these properties as necessary. Just some thoughts if your SQL is getting too much.

Handling Dynamic Xpath

I am automating things using Selenium and need help handling a dynamic XPath such as the one below:
Driver.findElement(By.xpath("//*[@id='INQ_2985']/div[2]/tr/td/div/div[3]/div")).click();
The 2985 in INQ_2985 changes to 2986, 2987, 2988, etc. on each run.
HTML code:
<div class="context-menu-item-inner" style="background-image:url(../images/productSmall.png);">Tender Assignment</div>
I tried different combinations, shown below, but with no success:
// Driver.findElement(By.name("//input[@name='Tender Assignment']")).click();
// Driver.findElement(By.className("context-menu-item-inner")).click();
Can you help me with this?
You can try using contains() or starts-with() in the XPath.
The XPath above can be rewritten as follows:
Driver.findElement(By.xpath("//*[starts-with(@id,'INQ')]/div[2]/tr/td/div/div[3]/div")).click();
If you can post more of your HTML, we can help improve your XPath.
Moreover, using such long XPaths is not recommended, as it may cause your test to fail more often.
For example, if a new table cell or div is added to the UI, the XPath above will no longer be valid.
You should try to use id, class or other attributes to get closer to the element you're trying to find.
I personally recommend using CSS selectors over XPath.
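To make the starts-with() idea concrete, here is a small sketch that combines the stable id prefix with the class from the question, so that no positional steps are needed. The page() helper and its nesting are hypothetical, since the real markup was not posted, and the class name DynamicIdLookup is invented for illustration. It runs on the JDK's XPath engine without a browser; in Selenium the same expression string would be passed to By.xpath.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class DynamicIdLookup {
    // Match on the stable id prefix and on the class, not on element positions.
    static final String MENU_ITEM =
        "//*[starts-with(@id,'INQ_')]//div[@class='context-menu-item-inner']";

    // Hypothetical page fragment: only the id suffix changes between runs.
    static String page(int runNumber) {
        return "<body><div id='INQ_" + runNumber + "'><div><div>"
            + "<div class='context-menu-item-inner'>Tender Assignment</div>"
            + "</div></div></div></body>";
    }

    static String menuItemText(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // Returns the text of the first element the expression selects.
            return xpath.evaluate(MENU_ITEM, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(menuItemText(page(2985)));
        System.out.println(menuItemText(page(2986)));
    }
}
```

Because the expression no longer counts divs, adding a wrapper element between the INQ_ container and the menu item would not break it.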
You can use several approaches.
Use an implicit wait, then match on part of the id or of the text:
driver.findElement(By.xpath("//*[contains(@id,'select2-result-label-535')]")).click();
driver.findElement(By.xpath("//*[contains(text(), 'select2-result-label-535')]")).click();
A partial match on the stable prefix also works (XPath 1.0 has no regular expressions, but contains() and starts-with() cover this case):
driver.findElement(By.xpath("//*[contains(@id,'INQ_')]"));
Note: If only a single id starts with INQ_, you can act on that element directly. If there are several, collect them as a List<WebElement> and compare each element's text, e.g. element.getText().trim().equals("Linked Text"), and act on the one that matches. You can follow other logic to traverse and match.
You can use a CSS selector:
div.context-menu-item-inner
Use it like this:
driver.findElement(By.cssSelector("div.context-menu-item-inner")).click();
The best choice is to use the full XPath instead of the id; you can get it easily via Firebug.
e.g.
/html/body/div[3]/div[3]/div[2]/div/div[2]/div[1]/div/div[1]
If your XPath varies,
e.g. "//*[@id='msg500']", "//*[@id='msg501']", "//*[@id='msg502']" and so on,
then use this code in your script:
for (int i = 0; i <= 9; i++) {
    String mpath = "//*[@id='msg50" + i + "']";
    driver.findElement(By.xpath(mpath)).click();
}

HtmlUnit getByXpath returns null

I am coding in Groovy; however, I don't believe these are language-specific questions.
I actually have two questions
First Question
I've run into an issue while using HtmlUnit. It is telling me that what I am trying to grab is null.
The page I'm testing it on is:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4
My code:
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
page = client.getPage(url)
//coming up as null
title = page.getByXPath("//html/body/div[4]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a")
println title
This simply prints out: []
Is this because the page uses onclick()? If so, how would I get around that? Enabling javascript creates a mess in my cmd prompt.
Second Question
I also want to get the image, but I am having trouble because when I get the XPath (via Firebug) it shows up as: //*[@id="gmi-ResViewSizer_img"]
How do I handle that?
First Answer:
/html/body/div[3]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a
Your XPath was off by one in the positional predicate on the body's div: it used the 4th div where it should be the 3rd. It appears the HTML for the site can and does change from when you originally grabbed the XPath using Firebug. You may need to adjust your XPath to accommodate potential changes and be less sensitive to differences in document structure.
Maybe something like this:
/html/body//div/h1/a
Second Answer: The XPath that you listed will work. It may look odd or short (and may not be the most efficient), but // starts at the root node and looks through every node in the tree, * matches any element (including the img), and the [] predicate filter restricts the result to those that have an id attribute whose value equals "gmi-ResViewSizer_img".
There are many other options for XPATHs that could work as well. It will also depend on how often the HTML structure changes. This is one that also works for the page referenced to select that img:
/html/body/div/div/div/div/img[1]
I had the same problem and solved it when I noticed the iframe tags on the page. Try calling
((HtmlPage)current_page.getFrames()[n].getEnclosedPage()).getElementByXPath(...
where n is the position of the frame in the page's frame collection. It works for me!
Thanks a lot.

Query html tag with XPath

I am writing a Selenium test.
I have a label, "Assign Designer", with a select box right after it.
Unfortunately, the select box has a dynamic id and I cannot query it by id or by any of its other attributes.
Can I build an XPath query that returns "the first select tag after the text 'Assign Designer'"?
PS. Selenium supports only XPath 1.0.
This would be something like:
//label[text() = 'Assign Designer']/following-sibling::select[1]
Note that:
The // shorthand is quite inefficient, because it causes a document-wide scan. If you can be more specific about the label's position, I recommend doing so. If the document is small, however, this won't be a problem.
Since I don't know much about Selenium, I used "label". If it is not a <label>, you should use the actual element name, of course. ;-)
Be sure to include a position predicate ([1], in this case) whenever you use an axis like "following-sibling". It's easily forgotten, and without it the expression selects every matching sibling, so it may produce unexpected results.
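The effect of the position predicate can be checked with a small sketch. The form markup, the ids and the class name FollowingSelect are hypothetical; the expression is the one from the answer, with /@id appended so the result is easy to print. It uses the JDK's XPath 1.0 engine, not Selenium.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class FollowingSelect {
    // Hypothetical form: two labels, each followed by a select with a dynamic id.
    static final String FORM =
        "<form>"
        + "<label>Assign Designer</label>"
        + "<select id='sel_123'><option>Ann</option></select>"
        + "<label>Assign Reviewer</label>"
        + "<select id='sel_456'><option>Bob</option></select>"
        + "</form>";

    // Evaluates an XPath 1.0 expression and returns its string value.
    static String eval(String expr, String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String base = "//label[text()='Assign Designer']/following-sibling::select";
        // With [1]: only the nearest following select.
        System.out.println(eval(base + "[1]/@id", FORM));
        // Without [1]: every later sibling select is in the node-set.
        System.out.println(eval("count(" + base + ")", FORM));
    }
}
```

Without [1] the node-set here contains both selects, so code that takes "the" result would depend on how the caller picks from the set.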

Accessing Comments in XML using XPath

How to access the comments inside the XML document using XPath?
For example:
<table>
<length> 12 </length>
<!--Some comment here-->
</table>
I want to access the "Some comment here".
Thanks...
EDIT: I am using MSXML DOM ActiveX and the command comment() seems to be failing... Any idea why?
With the path
/foo/bar/comment()
you can select all comments in the /foo/bar element. May depend on your language of choice, of course. But generally this is how you do it.
Use the comment() node test, for example:
/table/length/following::comment()[1]
selects the first comment that follows the length element.
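Both expressions can be tried against the XML from the question with the sketch below. It uses the JDK's DOM parser and javax.xml.xpath rather than MSXML, so it only illustrates the expressions themselves, not MSXML behaviour; the class name CommentLookup is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CommentLookup {
    // The sample document from the question.
    static final String XML =
        "<table><length> 12 </length><!--Some comment here--></table>";

    // Parses the XML (the default DOM parser keeps comment nodes) and returns
    // the string value of the first node the expression selects.
    static String commentText(String expr, String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing or XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Comment children of <table>.
        System.out.println(commentText("/table/comment()", XML));
        // First comment after <length> in document order.
        System.out.println(commentText("/table/length/following::comment()[1]", XML));
    }
}
```

The string value of a comment node is the text between <!-- and -->, so both calls return the comment text itself.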
Edit
Manoj asks in a comment to this answer why this isn't working in MSXML. The reason will be you are using MSXML3. By default MSXML3 does not use XPath as its selection language, it defaults to an earlier much weaker language (XSL pattern). You need to set XPath as the selection language via the DOMDocument's setProperty method. E.g (in JScript):-
var dom = new ActiveXObject("MSXML2.DOMDocument.3.0");
dom.setProperty("SelectionLanguage", "XPath");
Now the full XPath language will work in your queries (note one breaking change is indexer predicates are 1 based in XPath whereas they were 0 based in XSL Pattern).
Based on the OP's comments to posted answers (and my curiosity as to why this simple thing would not work), here is my suggestion:
Using the XPath expression suggested by @Anthony, I was able to successfully load the comment node with the following JS function:
function SelectComment(s)
{
    var xDoc = new ActiveXObject("MSXML2.DOMDocument.6.0");
    if (xDoc)
    {
        xDoc.loadXML(s);
        var selNode = xDoc.selectSingleNode("/table/length/following::comment()[1]");
        if (selNode != null)
            return selNode.text;
        else
            return "";
    }
}
Sample invocation:
SelectComment("<table><length> 12</length><!--Some comment here--></table>");
Output:
"Some comment here"
Notes:
a. Your MSXML version may vary. Please use appropriately.
b. This kind of code is definitely not recommended because it works only on IE. However, since this is your explicitly stated requirement, I have used the ActiveXObject.
c. You have not mentioned in your comments what fails in the suggested XPath expressions. My guess is that you are not querying the text property of the retrieved node. Keep in mind that selectSingleNode returns an IXMLDOMNode, and you need to query its data or text properties.
Maybe this could help.
This sample removes comments:
XmlNodeList list = xmlDoc.SelectNodes("//comment()");
foreach(XmlNode node in list)
node.ParentNode.RemoveChild(node);
<adjustment>
<!-- krishna k -->
<name>FX Update USD</name>
<!-- Since this plan updates existing adj's no ajd's will be created using this id -->
<id>7206</id>
I am facing a similar issue: my application reads the comments, which causes a stack crash. How can I avoid reading comments through the DOM?

Resources