Trying to find two different text nodes from a descendant - xpath

Someone decided to make a site as unfriendly as possible on purpose, so I'm doing what I can to make sure our scraper still gets to where it should.
<div class="issueDetails">
<div class="issueTitle ng-binding" style="">FANCY UNIQUE TEXT dd.MM.yyyy</div>
<a>COMPLETELY DIFFERENT TEXT</a>
</div>
I've left out the unnecessary details here, but I'm trying to find a match within the site through XPath (can't use anything else for this) that fulfils both conditions: FANCY UNIQUE TEXT dd.MM.yyyy as well as COMPLETELY DIFFERENT TEXT.
I've tried my luck with //div[@class='issueDetails']/descendant::*[contains(text(), 'FANCY UNIQUE TEXT dd.MM.yyyy') and contains(text(), 'COMPLETELY DIFFERENT TEXT')]
but that wrongly assumes both strings are in the same element.
The first, FANCY UNIQUE TEXT, is the unique identifier for where I want to go. The second, COMPLETELY DIFFERENT TEXT, is what I need the scraper to click on to actually head to that specific one. So an XPath that finds both despite them being different descendants is necessary.

Is this what you're looking for?
//div[@class="issueDetails"]/*[contains(.,"COMPLETELY DIFFERENT TEXT") or translate(substring(.,string-length(.)-9,10),"123456789","000000000")="00.00.0000" and contains(.,'FANCY UNIQUE TEXT')]
It will return the two elements that satisfy your conditions: the div and the a.
The translate, string-length and substring functions are used to check whether a date pattern is present at the end of the div element's text.
EDIT: Check that the parent and its children together contain the text you're looking for, then get the children with:
//div[@class='issueDetails'][contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") and contains(.,"COMPLETELY DIFFERENT TEXT")]/*[contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") or contains(.,"COMPLETELY DIFFERENT TEXT")]
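For readers who want to see the "match the container, then pick the child" idea outside of XPath, here is a minimal Python sketch using only the standard library. The markup, the href values and the helper name find_link are made up for illustration, and real pages would need an HTML parser rather than an XML one. It does roughly what //div[@class='issueDetails'][contains(., 'FANCY UNIQUE TEXT ...')]/a[contains(., 'COMPLETELY DIFFERENT TEXT')] would do.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on the question's markup.
# The href attributes are invented so the two links can be told apart.
HTML = """
<body>
  <div class="issueDetails">
    <div class="issueTitle">SOME OTHER TEXT 01.01.2020</div>
    <a href="/issue/1">COMPLETELY DIFFERENT TEXT</a>
  </div>
  <div class="issueDetails">
    <div class="issueTitle">FANCY UNIQUE TEXT 02.03.2021</div>
    <a href="/issue/2">COMPLETELY DIFFERENT TEXT</a>
  </div>
</body>
"""


def find_link(root, title_text, link_text):
    """Return the <a> inside the issueDetails block that mentions title_text."""
    for block in root.findall(".//div[@class='issueDetails']"):
        # Same idea as contains(., title_text) on the container element.
        if title_text not in "".join(block.itertext()):
            continue
        for link in block.findall("a"):
            if link_text in "".join(link.itertext()):
                return link
    return None


root = ET.fromstring(HTML)
link = find_link(root, "FANCY UNIQUE TEXT 02.03.2021", "COMPLETELY DIFFERENT TEXT")
print(link.get("href"))
```

The point is that the title only identifies the container; the element to click is then looked up inside that container.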

Related

XPath fails because Namespace colon in Title

I'm generating an XML report, using the JDF standard for PDFs going into a printing workflow.
There are 3 "DPart" sections, and I can use an XPath query to recognize them, but I want to grab the "Separation" attribute of each "cip4:Part". I can also write a query that finds those, but it does not distinguish between the multiple "DPart"s.
<DPart End="0" ID="0003" ParentRef="0002" Start="0">
<DPM>
<cip4:Root>
<cip4:Intent cip4:ProductType="ProductPart"/>
<cip4:Production>
<cip4:Resource>
<cip4:Part Separation="K1"/>
<cip4:Color cip4:ActualColorName="Black" cip4:ColorType="Normal"/>
</cip4:Resource>
<cip4:Resource>
<cip4:Part Separation="S1"/>
<cip4:Color cip4:ActualColorName="Dieline" cip4:ColorType="Normal"/>
</cip4:Resource>
<cip4:Resource>
<cip4:ColorantControl ColorantOrder="K1 S1" ColorantParams="K1 S1"/>
</cip4:Resource>
<cip4:Resource>
<eg:InkCoverage>
<eg:InkCov eg:Mm2="0.000000" eg:Pct="0.000000" eg:Separation="K1"/>
<eg:InkCov eg:Mm2="182.337538" eg:Pct="0.721209" eg:Separation="S1"/>
</eg:InkCoverage>
</cip4:Resource>
</cip4:Production>
</cip4:Root>
</DPM>
</DPart>
I want to do something like:
/DPM[2]/*[name ()='cip4:Part'], but it's not working.
I'm in a low-code pre-press environment (Esko Automation Engine), but the system gives me tools to evaluate an XPath and throw some JavaScript at it.
There are at least three reasons your XPath selects nothing:
DPM is not an immediate child of the root node
There is only one DPM, so DPM[2] won't select anything
There is no child of a DPM whose name is cip4:Part.
You also say in the narrative that there are three DPart elements, which implies that DPart is not actually the outermost element, as it appears to be in your sample. This makes it difficult to provide the correct XPath. However, you might be able to make a start with
(//DPM)[2]//*[name()='cip4:Part']
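As a rough illustration of keeping the Separation values grouped by DPart, here is a small Python sketch using the standard library's ElementTree. The wrapper element DPartRoot, the namespace URI and the sample values are all placeholders, since the question does not show the real outer element or namespace declarations.

```python
import xml.etree.ElementTree as ET

# Placeholder URI: the real document declares its own namespace for "cip4".
NS = {"cip4": "http://example.com/cip4"}

# Hypothetical document with two DPart elements under an invented root.
XML = """
<DPartRoot xmlns:cip4="http://example.com/cip4">
  <DPart ID="0001">
    <DPM><cip4:Root><cip4:Production>
      <cip4:Resource><cip4:Part Separation="C1"/></cip4:Resource>
    </cip4:Production></cip4:Root></DPM>
  </DPart>
  <DPart ID="0003">
    <DPM><cip4:Root><cip4:Production>
      <cip4:Resource><cip4:Part Separation="K1"/></cip4:Resource>
      <cip4:Resource><cip4:Part Separation="S1"/></cip4:Resource>
    </cip4:Production></cip4:Root></DPM>
  </DPart>
</DPartRoot>
"""

root = ET.fromstring(XML)

# Search for cip4:Part relative to each DPart, so results stay grouped.
separations = {}
for dpart in root.iter("DPart"):
    parts = dpart.findall(".//cip4:Part", NS)
    separations[dpart.get("ID")] = [part.get("Separation") for part in parts]

print(separations)
```

Searching relative to each DPart, rather than from the document root, is what keeps one DPart's separations apart from another's.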

Trying to exclude a portion of an xPath

I have looked through several posts about this, but have failed to apply the principles used to get the result I desire, so I'm going to just post my specific problem.
I am building a Google Sheet that enables the user to pull up Bible verses.
I have it all working, however I am running into an issue with a hidden element being pulled into my text().
FUNCTION:
=IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()")
RESULT: You shall put out both male and female, putting them outside the camp, that they may not defile their camp, 1in the midst of which I dwell."
You can see the "1" that is showing up before the word "in"
I have found the XPath that pulls only that "1"
//*[@class='scripture']//span[2]//sup//text()
I am trying to remove that "1" from the text.
HELP PLEASE!!! :)
You can add a predicate to the end to exclude text nodes that are inside sup elements:
=IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]")
This will retrieve only the text nodes that are not inside a sup element, but it will still result in having the verse spread out across two cells, because there are two text nodes. You can rectify this by wrapping this expression in a JOIN():
=JOIN("", IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]"))

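The combined effect of text()[not(ancestor::sup)] and JOIN can be sketched outside of Sheets as well. The following Python example is only an illustration: the markup is an invented fragment resembling the verse, and text_without is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a verse with a footnote marker inside <sup>.
VERSE = (
    "<span>that they may not defile their camp, "
    "<sup>1</sup>in the midst of which I dwell.</span>"
)


def text_without(elem, skip_tag):
    """Concatenate the text under elem, skipping text inside <skip_tag> elements."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip_tag:
            parts.append(text_without(child, skip_tag))
        # The tail comes after the child's closing tag, so it is kept.
        parts.append(child.tail or "")
    return "".join(parts)


verse = text_without(ET.fromstring(VERSE), "sup")
print(verse)
```

Joining the remaining text nodes into one string is the same step that JOIN("", ...) performs in the spreadsheet formula.
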
Extracting Data with JSoup

Even though I've looked through so many answers to this question, I still do not know how to make this happen in my program.
Basically, I want to get the data from this website -> http://leagueoflegends.wikia.com/wiki/ahri
and then, from this specific data table
<table id="champion_info-lower" style="background-color:#041424;box-shadow:0 2px 5px black, inset 0 7px 5px -5px black;text-align:center;padding:0 1em;border-spacing:0;width:90%;margin:0 auto;">
I want to extract the numbers for the statistics such as health, health regen, attack damage, attack speed into the instance variable in my class.
So how do I do this?
Can you guys show the specific code, not just by words, because I still do not understand how this is working and this program being made right now is my first program.
Welcome to StackOverflow!
Usually, people, including myself, are reluctant to provide code if the person who asks hasn't shown any of their own attempts first. But since this one can be quite tricky and mostly comes down to how to use Jsoup rather than programming in general, I'll provide an answer with example code that should give you the results you want. Bear in mind, though, that you should practice your general programming and show what code you have written so far instead of just asking others to provide the code for you!
Select certain elements with ID
You can use the CSS-selector to select the table element with the id="champion_info-lower"
using the syntax #id, as done below on the <span id="Abilities"> element
Element e = doc.select("span#Abilities").first();
System.out.println(e.text());
which prints out Abilities. This can be used to get the values in the table.
Splitting up the values into variables
I don't want to hand you a complete solution, but this might be hard to explain without showing some working code. If you look at the HTML that contains the table that you are interested in, you see that you can select the right part only by using the following selector syntax for the td elements that contain the data that we want to parse.
Elements table = doc.select("table#champion_info-lower td:eq(1) table td");
Further inspection of the HTML reveals that the value for the health is presented in the sibling element of the element that contains the text "Health". If we check each element in the table for the text "Health", we know that the next element will be the one we are looking for. Since we have selected only the td elements, this should now be easy.
String health = "Health: ";
for (Element e : table) {
    if (e.text().equals("Health")) {
        health += table.get(table.indexOf(e) + 1).text();
    }
}
System.out.println(health);
Check all the elements in the table.
If it has the text "Health", assign the value of the next element to string health.
This will output
Health: 380 (+80)
Figuring out how to get the rest of the values should be a piece of cake!
Try some code of your own before you continue, and I strongly recommend that you use the Jsoup API to find out how to use it, especially the Element class and the FAQ on how to use the selector.

Retrieve an xpath text contains using text()

I've been hacking away at this one for hours and I just can't figure it out. Using XPath to find text values is tricky and this problem has too many moving parts.
I have a webpage with a large table, and a section in this table contains a list of users (assignees) that are assigned to a particular unit. There are nearly always multiple users assigned to a unit, and I need to make sure a particular user is assigned to any of the units in the table. I've used XPath for nearly all of my selectors and I'm halfway there on this one. I just can't seem to figure out how to use contains with text() in this context.
Here's what I have so far:
//td[@id='unit']/span [text()='asdfasdfasdfasdfasdf (Primary); asdfasdfasdfasdfasdf, asdfasdfasdfasdf; 456, 3456; testuser']
The XPath Query above captures all text in the particular section I am looking at, which is great. However, I only need to know if testuser is in that section.
text() gets you a set of text nodes. I tend to use it more in a context of //span//text() or something.
If you are trying to check if the text inside an element contains something you should use contains on the element rather than the result of text() like this:
span[contains(., 'testuser')]
XPath is pretty good with context. If you know exactly what text a node should have you can do:
span[.='full text in this span']
But if you want to do something like regular expressions (using exslt for example) you'll need to use the string() function:
span[regexp:test(string(.), 'testuser')]
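To make the difference between a single text node and the element's full string value concrete, here is a small Python sketch with the standard library. The table row, the user names and the helper span_mentions are invented for illustration; joining itertext() plays the role of contains(., 'testuser').

```python
import xml.etree.ElementTree as ET

# Hypothetical row: the assignee list is split across several text nodes
# because one name is wrapped in <b>.
ROW = """
<tr>
  <td id="unit"><span>alice (Primary); <b>bob</b>, carol; testuser</span></td>
</tr>
"""


def span_mentions(root, user):
    """True if any span under td[@id='unit'] contains user in its full text."""
    for span in root.findall(".//td[@id='unit']/span"):
        # itertext() walks every descendant text node, like the string value ".".
        if user in "".join(span.itertext()):
            return True
    return False


root = ET.fromstring(ROW)
first_text_node = root.find(".//td[@id='unit']/span").text

print(span_mentions(root, "testuser"))
print("testuser" in first_text_node)
```

In this sample the first text node alone stops before the nested element, which is why testing only that node misses the user while the full string value finds it.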

Dealing with duplicate ids in selenium webdriver

I am trying to automate some tests using selenium webdriver. I am dealing with a third-party login provider (OAuth) who is using duplicate id's in their html. As a result I cannot "find" the input fields correctly. When I just select on an id, I get the wrong one.
This question has already been answered for JQuery. But I would like an answer (I am presuming using Xpath) that will work in Selenium webdriver.
On other questions about this issue, answers typically say "you should not have duplicate id's in html". Preaching to the choir there. I am not in control of the webpage in question. If it was, I would use class and id properly and just fix the problem that way.
Since I cannot do that, what options do I have with XPath etc.?
You can do it with driver.find_element_by_id. For example, if your duplicate "duplicate_id" is inside "div_ID", which is unique:
driver.find_element_by_id("div_ID").find_element_by_id("duplicate_id")
For the other duplicate id under another div:
driver.find_element_by_id("div_ID2").find_element_by_id("duplicate_id")
This XPath expression:
//div[@id='something']
selects all div elements in the XML document, the string value of whose id attribute is the string "something".
This XPath expression:
count(//div[@id='something'])
produces the number of the div elements selected by the first XPath expression.
And this XPath expression:
(//div[@id='something'])[3]
selects the third (in document order) div element that is selected by the first XPath expression above.
Generally:
(//div[@id='something'])[$k]
selects the $k-th such div element ($k must be substituted with a positive integer).
Equipped with this knowledge, one can get any specific div whose id attribute has string value "something".
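The same counting and indexing can be tried out in a few lines of Python with the standard library's ElementTree, which supports the [@id='...'] predicate. The page fragment below is invented for illustration; note that Python lists are 0-based while XPath positions are 1-based.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with three divs sharing the same id.
PAGE = """
<body>
  <div id="something">first</div>
  <section><div id="something">second</div></section>
  <div id="other">ignored</div>
  <div id="something">third</div>
</body>
"""

root = ET.fromstring(PAGE)

# Like //div[@id='something']: every match, in document order.
matches = root.findall(".//div[@id='something']")

count = len(matches)        # like count(//div[@id='something'])
third = matches[2].text     # like (//div[@id='something'])[3]

print(count, third)
```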
Which language are you working in? Duplicate ids shouldn't be a problem, as you can grab virtually any attribute, not just the id, using XPath. The syntax will differ slightly in other languages (let me know if you want something other than Ruby), but this is how you do it:
driver.find_element(:xpath, "//input[@id='loginid']")
The way you go about constructing the xpath locator is the following:
From the html code you can pick any attribute:
<input id="gbqfq" class="gbqfif" type="text" value="" autocomplete="off" name="q">
Let's say, for example, that you want to construct your XPath from the HTML above (Google's search box) using the name attribute. Your XPath will be:
driver.find_element(:xpath, "//input[@name='q']")
In other words when the id's are the same just grab another attribute available!
Improvement:
To avoid fragile XPath locators, such as those relying on order in the document (which can change easily), you can use something more robust: two locators instead of one. This can also be useful when dealing with HTML tags that are really similar. You can locate an element by two of its attributes like this:
driver.find_element(:id, 'amount') and driver.find_element(xpath: "//input[@maxlength='50']")
or as a pure XPath one-liner if you prefer:
//input[@id="amount" and @maxlength='50']
Alternatively (and provided your XPath will only return one unique element) you can move one step higher in abstraction, omitting the attribute values completely:
//input[@id and @maxlength]
It's not listed at http://selenium-python.readthedocs.io/locating-elements.html, but I'm able to access a method find_elements_by_id.
This returns a list of all elements with the duplicate ID.
links = browser.find_elements_by_id("link")
for link in links:
    print(link.get_attribute("href"))
You should use driver.findElement(By.xpath(...)). When locating the element with Firebug, select the absolute path for that particular element instead of the relative path; that way you will get the element even with duplicate IDs.
