Trying to exclude a portion of an xPath - xpath

I have looked through several posts about this, but have failed to apply the principles used to get the result I desire, so I'm going to just post my specific problem.
I am building a Google Sheet that enables the user to pull up Bible verses.
I have it all working, however I am running into an issue with a hidden element being pulled into my text().
FUNCTION:
=IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()")
RESULT: You shall put out both male and female, putting them outside the camp, that they may not defile their camp, 1in the midst of which I dwell."
You can see the "1" that is showing up before the word "in"
I have found the xPath that pulls only that "1"
//*[@class='scripture']//span[2]//sup//text()
I am trying to remove that "1" from the text.
HELP PLEASE!!! :)

You can add a predicate to the end to exclude text nodes that are inside sup elements:
=IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]")
This will retrieve only the text nodes that are not inside a sup element, but it will still result in having the verse spread out across two cells, because there are two text nodes. You can rectify this by wrapping this expression in a JOIN():
=JOIN("", IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]"))

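For anyone who wants to check the predicate outside Google Sheets, here is a minimal sketch using the JDK's built-in XPath support. The SAMPLE markup is hypothetical and only mimics the shape described in the question (a second span containing a sup footnote marker); the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class VerseText {
    // Hypothetical markup, not the real page: the second span holds the verse
    // text with a footnote marker inside a sup element.
    static final String SAMPLE =
        "<div class='scripture'><span>3</span>"
        + "<span>that they may not defile their camp, "
        + "<sup>1</sup>in the midst of which I dwell.</span></div>";

    // Evaluates the expression and concatenates the matched text nodes,
    // which is roughly what JOIN("", IMPORTXML(...)) does in the sheet.
    static String extract(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            StringBuilder joined = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                joined.append(nodes.item(i).getNodeValue());
            }
            return joined.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Without the predicate the footnote marker is included.
        System.out.println(extract(SAMPLE, "//*[@class='scripture']//span[2]//text()"));
        // With the predicate, text nodes inside sup are skipped.
        System.out.println(extract(SAMPLE,
            "//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]"));
    }
}
```

The first call keeps the "1" before "in"; the second drops it, because the predicate filters out any text node that has a sup ancestor.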
Related

XPath fails because Namespace colon in Title

I'm generating an XML report, using the JDF standard for PDFs going into a printing workflow.
There are 3 "DPart" sections, and I can use an xPath query to recognize them, but I want to grab the "Separation" attribute of each "cip4:Part". I can also get a query to find that, but it does not distinguish between the multiple "DPart"s.
<DPart End="0" ID="0003" ParentRef="0002" Start="0">
<DPM>
<cip4:Root>
<cip4:Intent cip4:ProductType="ProductPart"/>
<cip4:Production>
<cip4:Resource>
<cip4:Part Separation="K1"/>
<cip4:Color cip4:ActualColorName="Black" cip4:ColorType="Normal"/>
</cip4:Resource>
<cip4:Resource>
<cip4:Part Separation="S1"/>
<cip4:Color cip4:ActualColorName="Dieline" cip4:ColorType="Normal"/>
</cip4:Resource>
<cip4:Resource>
<cip4:ColorantControl ColorantOrder="K1 S1" ColorantParams="K1 S1"/>
</cip4:Resource>
<cip4:Resource>
<eg:InkCoverage>
<eg:InkCov eg:Mm2="0.000000" eg:Pct="0.000000" eg:Separation="K1"/>
<eg:InkCov eg:Mm2="182.337538" eg:Pct="0.721209" eg:Separation="S1"/>
</eg:InkCoverage>
</cip4:Resource>
</cip4:Production>
</cip4:Root>
</DPM>
</DPart>
I want to do something like:
/DPM[2]/*[name ()='cip4:Part'], but it's not working.
I'm in a low-code pre-press environment (Esko Automation Engine), but the system gives me tools to parse an xPath, and throw some JavaScript at it.
There are at least three reasons your XPath selects nothing:
DPM is not an immediate child of the root node
There is only one DPM, so DPM[2] won't select anything
There is no child of a DPM whose name is cip4:Part.
You also say in the narrative that there are three DParts, which implies that DPart is not actually the outermost element, as it appears to be in your sample. That makes it difficult to give the correct XPath. However, you might be able to make a start with
(//DPM)[2]//*[name()='cip4:Part']
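To see the difference between the two expressions, here is a rough sketch with the JDK's XPath API. The Report wrapper element, the namespace URI, and the Separation values in the first and third DPart are assumptions, since the question only shows one DPart; the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DPartSeparations {
    // Hypothetical report with three DPart sections under an assumed wrapper.
    static final String SAMPLE =
        "<Report xmlns:cip4='urn:example:cip4'>"
        + "<DPart ID='0001'><DPM><cip4:Root><cip4:Resource>"
        + "<cip4:Part Separation='C1'/></cip4:Resource></cip4:Root></DPM></DPart>"
        + "<DPart ID='0002'><DPM><cip4:Root><cip4:Resource>"
        + "<cip4:Part Separation='K1'/></cip4:Resource><cip4:Resource>"
        + "<cip4:Part Separation='S1'/></cip4:Resource></cip4:Root></DPM></DPart>"
        + "<DPart ID='0003'><DPM><cip4:Root><cip4:Resource>"
        + "<cip4:Part Separation='M1'/></cip4:Resource></cip4:Root></DPM></DPart>"
        + "</Report>";

    // Returns the string value of every node the expression selects.
    static List<String> select(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The original attempt: DPM is not a child of the root node, so nothing matches.
        System.out.println(select(SAMPLE, "/DPM[2]/*[name()='cip4:Part']"));
        // Second DPM in document order, then any descendant named cip4:Part.
        System.out.println(select(SAMPLE, "(//DPM)[2]//*[name()='cip4:Part']/@Separation"));
    }
}
```

Appending /@Separation selects the attribute the question asks for, and changing the index in (//DPM)[n] picks which DPart the values come from.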

How to properly scrape filtered content into Google Sheets using an XPath query?

So, this is about content from a website that I want to pull into my Google Sheet, but I'm having difficulty understanding the class of the element.
target link: https://www.cnbc.com/quotes/?symbol=XAU=
This number is what I want to get. Picture 1: The part which I want to scrape
And this is what the code looks like in inspector. Picture 2: The code shown in inspector
The target is inside a span element, but the span's attributes look very complicated to me, so I tried to simplify things with this formula: =IMPORTXML("https://www.cnbc.com/quotes/?symbol=XAU=","//table[@class='quote-horizontal regular']//tr/td/span")
Picture 3: List is shown when putting the code
After some tries I was able to get the right target, but the result confuses me. I'm using this formula: =IMPORTXML("https://www.cnbc.com/quotes/?symbol=XAU=","//table[@class='quote-horizontal regular']//tr/td/span[@class='last original'][1]")
Picture 4: The right target is shown when the xpath query is more specified
As you can see in Picture 2, 'last original' is not the full name of the class. When I use 'last original ng-binding' instead, I get an error saying the imported content is empty.
So please correct me if my formula is wrong, or if it only worked by accident and there is a more correct way.
How about this answer?
Modified formula 1:
When the class name can be either last original or last original ng-binding, how about the following XPath and formula?
=IMPORTXML(A1,"//span[contains(@class,'last original')][1]")
In this case, the URL https://www.cnbc.com/quotes/?symbol=XAU= is put in cell "A1".
Here, //span[contains(@class,'last original')][1] is used as the XPath. It retrieves the value of the span whose class attribute contains last original, so it matches both last original and last original ng-binding.
Modified formula 2:
As another XPath, how about the following formula?
=IMPORTXML(A1,"//meta[@itemprop='price']/@content")
It seems that the value is also included in the page metadata, so this sample retrieves the value from there.
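As a quick way to compare the exact-match and contains() forms, here is a small sketch using the JDK's XPath API. The SAMPLE markup and its values are invented and only imitate the attributes named in the question; it is not the real CNBC page, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class QuotePrice {
    // Hypothetical page fragment: a span whose class has an extra token,
    // plus a meta element that carries the same value.
    static final String SAMPLE =
        "<html><head><meta itemprop='price' content='1234.56'/></head>"
        + "<body><table><tr><td>"
        + "<span class='last original ng-binding'>1,234.56</span>"
        + "<span class='change'>+1.20</span>"
        + "</td></tr></table></body></html>";

    // Returns the string value of the first node the expression selects,
    // or an empty string when nothing matches.
    static String evaluate(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Exact comparison fails here because the attribute has a third token.
        System.out.println(evaluate(SAMPLE, "//span[@class='last original']"));
        // contains() matches regardless of any extra tokens after the prefix.
        System.out.println(evaluate(SAMPLE, "//span[contains(@class,'last original')][1]"));
        // The metadata route reads an attribute instead of element text.
        System.out.println(evaluate(SAMPLE, "//meta[@itemprop='price']/@content"));
    }
}
```

An @class='...' test compares the whole attribute string, so it only matches when the attribute in the fetched markup is exactly that string; contains() is more forgiving when the class list varies.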
Reference:
IMPORTXML
To complete @Tanaike's answer, two alternatives:
=IMPORTXML(B2;"//span[@class='year high']")
"Year high" seems always equal to the current stock index value.
Or, with value retrieved from the script element :
=IMPORTXML(B2;"substring-before(substring-after(//script[contains(.,'modApi')],'""last\"":\""'),'\')")
Note: since I'm based in Europe, my formulas use ; as the argument separator. Replace ; with , if your locale requires it.

Trying to find two different text nodes from a descendant

Someone decided to make a site as unfriendly as possible on purpose, so I'm trying whatever I can to get our scraper to still reach what it should.
<div class="issueDetails">
<div class="issueTitle ng-binding" style="">FANCY UNIQUE TEXT dd.MM.yyyy</div>
<a>COMPLETELY DIFFERENT TEXT</a>
I've left out the unnecessary details here, but I'm trying to find a match within the site through XPath (can't use anything else for this) that fulfils both conditions: FANCY UNIQUE TEXT dd.MM.yyyy as well as COMPLETELY DIFFERENT TEXT.
I've tried my luck with //div[@class='issueDetails']/descendant::*[contains(text(), 'FANCY UNIQUE TEXT dd.MM.yyyy') and contains(text(), 'COMPLETELY DIFFERENT TEXT')]
but it wrongly assumes that both unique strings are in the same element.
The first, FANCY UNIQUE TEXT, is the unique identifier for where I want to go. The second, COMPLETELY DIFFERENT TEXT, is what I need the scraper to click on to actually head to that specific one. So an XPath that finds both despite them being different descendants is necessary.
Is this what you're looking for:
//div[@class="issueDetails"]/*[contains(.,"COMPLETELY DIFFERENT TEXT") or translate(substring(.,string-length(.)-9,10),"123456789","000000000")="00.00.0000" and contains(.,'FANCY UNIQUE TEXT')]
It will return the 2 elements that satisfy your conditions: div and a.
The translate, string-length and substring functions are used to check whether a date pattern is present in the div element.
EDIT: Check whether the parent and its children contain the text you're looking for, then get the children with:
//div[@class='issueDetails'][contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") and contains(.,"COMPLETELY DIFFERENT TEXT")]/*[contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") or contains(.,"COMPLETELY DIFFERENT TEXT")]
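Here is a small sketch that runs the EDIT expression with the JDK's XPath API against made-up markup with two issueDetails blocks, only one of which has the wanted title. LINK_ONLY is my own variation, not part of the answer above: it puts the title test in a predicate on the parent and returns just the a element to click. The class, constant, and method names are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IssueLink {
    // Hypothetical page: both blocks share the link text, only the second has the title.
    static final String SAMPLE =
        "<body>"
        + "<div class='issueDetails'>"
        + "<div class='issueTitle ng-binding'>SOME OTHER TEXT dd.MM.yyyy</div>"
        + "<a>COMPLETELY DIFFERENT TEXT</a></div>"
        + "<div class='issueDetails'>"
        + "<div class='issueTitle ng-binding'>FANCY UNIQUE TEXT dd.MM.yyyy</div>"
        + "<a>COMPLETELY DIFFERENT TEXT</a></div>"
        + "</body>";

    // The EDIT expression from the answer, written with single quotes.
    static final String BOTH_CHILDREN =
        "//div[@class='issueDetails'][contains(.,'FANCY UNIQUE TEXT dd.MM.yyyy')"
        + " and contains(.,'COMPLETELY DIFFERENT TEXT')]"
        + "/*[contains(.,'FANCY UNIQUE TEXT dd.MM.yyyy')"
        + " or contains(.,'COMPLETELY DIFFERENT TEXT')]";

    // Variation (an assumption, not from the answer): select only the link.
    static final String LINK_ONLY =
        "//div[@class='issueDetails'][div[contains(.,'FANCY UNIQUE TEXT dd.MM.yyyy')]]"
        + "/a[contains(.,'COMPLETELY DIFFERENT TEXT')]";

    // Returns the element names of the matched nodes in document order.
    static List<String> names(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeName());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(names(SAMPLE, BOTH_CHILDREN));
        System.out.println(names(SAMPLE, LINK_ONLY));
    }
}
```

In both expressions the first block is skipped because its string value never contains the unique title, even though its link text matches.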

Extracting data with Jsoup

Even though I've looked through so many answers to this question, I still do not know how to make this happen in my program.
Basically, I want to get the data from this website -> http://leagueoflegends.wikia.com/wiki/ahri
and then, from this specific data table
<table id="champion_info-lower" style="background-color:#041424;box-shadow:0 2px 5px black, inset 0 7px 5px -5px black;text-align:center;padding:0 1em;border-spacing:0;width:90%;margin:0 auto;">
I want to extract the numbers for the statistics such as health, health regen, attack damage, attack speed into the instance variable in my class.
So how do I do this?
Can you guys show the specific code, not just by words, because I still do not understand how this is working and this program being made right now is my first program.
Welcome to StackOverflow!
Usually people, including myself, are reluctant to provide code if the person who asks hasn't shown any of their own attempts first. But since this one can be quite tricky and mostly comes down to how to use Jsoup rather than programming in general, I'll provide an answer with example code that should give you the results you want. Bear in mind, though, that you should practice your general programming and show what code you have written so far instead of just asking others to provide the code for you!
Select certain elements with ID
You can use the CSS-selector to select the table element with the id="champion_info-lower"
using the syntax #id, as done below on the <span id="Abilities"> element
Element e = doc.select("span#Abilities").first();
System.out.println(e.text());
which prints out Abilities. This can be used to get the values in the table.
Splitting up the values into variables
I don't want to hand you a complete solution, but this might be hard to explain without showing some working code. If you look at the HTML that contains the table that you are interested in, you see that you can select the right part only by using the following selector syntax for the td elements that contain the data that we want to parse.
Elements table = doc.select("table#champion_info-lower td:eq(1) table td");
Further inspection of the HTML reveals that the value for the health is in the sibling element of the element that contains the text "Health". If we check each element in the table for the text "Health", we know that the next one will be the one we are looking for. Since we have selected only the td elements, this should now be easy.
String health = "Health: ";
for (Element e : table) {
    if (e.text().equals("Health")) {
        health += table.get(table.indexOf(e) + 1).text();
    }
}
System.out.println(health);
Check all the elements in the table.
If it has the text "Health", assign the value of the next element to string health.
This will output
Health: 380 (+80)
Figuring out how to get the rest of the values should be a piece of cake!
Try some code of your own before you continue, and I strongly recommend that you use the Jsoup API to find out how to use it, especially the Element class and the FAQ on how to use the selector.

Retrieve an xpath text contains using text()

I've been hacking away at this one for hours and I just can't figure it out. Using XPath to find text values is tricky and this problem has too many moving parts.
I have a webpage with a large table, and a section in this table contains a list of users (assignees) that are assigned to a particular unit. There are nearly always multiple users assigned to a unit, and I need to make sure a particular user is assigned to any of the units in the table. I've used XPath for nearly all of my selectors and I'm halfway there on this one. I just can't seem to figure out how to use contains with text() in this context.
Here's what I have so far:
//td[@id='unit']/span [text()='asdfasdfasdfasdfasdf (Primary); asdfasdfasdfasdfasdf, asdfasdfasdfasdf; 456, 3456'; testuser]
The XPath Query above captures all text in the particular section I am looking at, which is great. However, I only need to know if testuser is in that section.
text() gets you a set of text nodes. I tend to use it more in a context of //span//text() or something.
If you are trying to check if the text inside an element contains something you should use contains on the element rather than the result of text() like this:
span[contains(., 'testuser')]
XPath is pretty good with context. If you know exactly what text a node should have you can do:
span[.='full text in this span']
But if you want to do something like regular expressions (using exslt for example) you'll need to use the string() function:
span[regexp:test(string(.), 'testuser')]
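A minimal sketch of the contains() check with the JDK's XPath API, assuming the td id from the question and an invented assignee list. The class and method names are hypothetical, and the user name is spliced straight into the expression, so this assumes it contains no single quote.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AssigneeCheck {
    // Hypothetical table cell holding a semicolon-separated assignee list.
    static final String SAMPLE =
        "<table><tr><td id='unit'>"
        + "<span>alice (Primary); bob, carol; testuser</span>"
        + "</td></tr></table>";

    // True when some span under td[@id='unit'] contains the user name.
    // Assumes the name has no single quote, since it is embedded in the expression.
    static boolean isAssigned(String xml, String user) {
        String expression = "//td[@id='unit']/span[contains(., '" + user + "')]";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Requesting BOOLEAN converts the node set: non-empty means true.
            return (Boolean) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isAssigned(SAMPLE, "testuser"));
        System.out.println(isAssigned(SAMPLE, "dave"));
    }
}
```

Keep in mind that contains() is a substring test, so a search for testuser would also match a longer name such as testuser2.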
