Extracting Data with JSoup - user-interface

Even though I've looked through many answers to this question, I still do not know how to make this work in my program.
Basically, I want to get the data from this website -> http://leagueoflegends.wikia.com/wiki/ahri
and then, from this specific data table
<table id="champion_info-lower" style="background-color:#041424;box-shadow:0 2px 5px black, inset 0 7px 5px -5px black;text-align:center;padding:0 1em;border-spacing:0;width:90%;margin:0 auto;">
I want to extract the numbers for the statistics such as health, health regen, attack damage and attack speed into the instance variables of my class.
So how do I do this?
Can you show specific code rather than just a description? I still do not understand how this works, and this is my first program.

Welcome to StackOverflow!
Usually, people, myself included, are reluctant to provide code if the person asking hasn't shown any of their own attempts first. But since this one can be quite tricky and mostly comes down to how to use Jsoup rather than programming in general, I'll provide an answer with example code that should give you the results you want. Bear in mind, though, that you should practice your general programming and show the code you have written so far instead of just asking others to provide it for you!
Select certain elements with ID
You can use the CSS-selector to select the table element with the id="champion_info-lower"
using the syntax #id, as done below on the <span id="Abilities"> element
Element e = doc.select("span#Abilities").first();
System.out.println(e.text());
which prints out Abilities. This can be used to get the values in the table.
Splitting up the values into variables
I don't want to hand you a complete solution, but this might be hard to explain without showing some working code. If you look at the HTML that contains the table you are interested in, you can see that the following selector picks out only the td elements that contain the data we want to parse.
Elements table = doc.select("table#champion_info-lower td:eq(1) table td");
Looking further at the HTML reveals that the value for the health is in the sibling element that follows the element containing the text "Health". If we check each element in the table for the text "Health", we know that the next element is the one we are looking for. Since we have selected only the td elements, this should now be easy.
String health = "Health: ";
for (Element e : table) {
    if (e.text().equals("Health")) {
        health += table.get(table.indexOf(e) + 1).text();
    }
}
System.out.println(health);
Check all the elements in the table.
If it has the text "Health", assign the value of the next element to string health.
This will output
Health: 380 (+80)
Figuring out how to get the rest of the values should be a piece of cake!
Try some code of your own before you continue, and I strongly recommend that you use the Jsoup API to find out how to use it, especially the Element class and the FAQ on how to use the selector.
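Since the question asks for numbers in instance variables rather than text, the extracted string still has to be parsed. The following is only a minimal sketch using plain Java regular expressions. The class name StatValue, its fields, and the reading of "(+80)" as a growth value are assumptions made for illustration, and it only handles the "380 (+80)" format shown in the output above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for one statistic; not part of Jsoup or the original answer.
class StatValue {
    // Matches a base number, optionally followed by "(+growth)", e.g. "380 (+80)".
    private static final Pattern STAT = Pattern.compile(
            "\\s*(\\d+(?:\\.\\d+)?)\\s*(?:\\(\\+\\s*(\\d+(?:\\.\\d+)?)\\s*\\))?\\s*");

    final double base;
    final double growth; // assumed meaning of the "(+80)" part

    StatValue(double base, double growth) {
        this.base = base;
        this.growth = growth;
    }

    // Turns text such as "380 (+80)" into two numbers; growth is 0 if absent.
    static StatValue parse(String text) {
        Matcher m = STAT.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized stat text: " + text);
        }
        double base = Double.parseDouble(m.group(1));
        double growth = m.group(2) == null ? 0.0 : Double.parseDouble(m.group(2));
        return new StatValue(base, growth);
    }

    public static void main(String[] args) {
        StatValue health = StatValue.parse("380 (+80)");
        System.out.println("base=" + health.base + " growth=" + health.growth);
    }
}
```

The string passed to parse would be the text of the td that follows the "Health" cell. Other statistics on the page may use a different format and would need their own pattern.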

Related

Trying to find two different text nodes from a descendant

Someone decided to intentionally make a site as unfriendly as possible, so I'm trying what I can to have our scraper still get to where it should.
<div class="issueDetails">
<div class="issueTitle ng-binding" style="">FANCY UNIQUE TEXT dd.MM.yyyy</div>
<a>COMPLETELY DIFFERENT TEXT</a>
I've left out the unnecessary details here, but I'm trying to find a match within the site through XPath (I can't use anything else for this) that fulfils both conditions: FANCY UNIQUE TEXT dd.MM.yyyy as well as COMPLETELY DIFFERENT TEXT.
I've tried my luck with //div[@class='issueDetails']/descendant::*[contains(text(), 'FANCY UNIQUE TEXT dd.MM.yyyy') and contains (text(), 'COMPLETELY DIFFERENT TEXT')]
but it wrongly assumes that both texts I need are in the same element.
The first, FANCY UNIQUE TEXT, is the unique identifier for where I want to go. The second, COMPLETELY DIFFERENT TEXT, is what I need the scraper to click on to actually head to that specific one. So an XPath that finds both despite them being different descendants is necessary.
Is this what you're looking for?
//div[@class="issueDetails"]/*[contains(.,"COMPLETELY DIFFERENT TEXT") or translate(substring(.,string-length(.)-9,10),"123456789","000000000")="00.00.0000" and contains(.,'FANCY UNIQUE TEXT')]
It returns the two elements that meet your conditions: the div and the a.
The translate, string-length and substring functions are used to check whether a date pattern is present at the end of the div element's text.
EDIT: Check whether the parent (including its children) contains both texts you're looking for, then get the matching children with:
//div[@class='issueDetails'][contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") and contains(.,"COMPLETELY DIFFERENT TEXT")]/*[contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") or contains(.,"COMPLETELY DIFFERENT TEXT")]
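The expression from the EDIT can be tried outside the scraper. This is a sketch using Java's standard javax.xml.xpath API against a well-formed copy of the fragment from the question. The closing tags, the html/body wrapper and the class name IssueDetailsXPath are assumptions added for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IssueDetailsXPath {
    // Well-formed stand-in for the markup in the question (wrapper and closing tags assumed).
    static final String SAMPLE =
            "<html><body><div class='issueDetails'>"
            + "<div class='issueTitle ng-binding' style=''>FANCY UNIQUE TEXT dd.MM.yyyy</div>"
            + "<a>COMPLETELY DIFFERENT TEXT</a>"
            + "</div></body></html>";

    // First predicate: the parent must contain both texts.
    // Second predicate: keep only the children that carry one of them.
    static final String EXPR =
            "//div[@class='issueDetails']"
            + "[contains(., 'FANCY UNIQUE TEXT dd.MM.yyyy') and contains(., 'COMPLETELY DIFFERENT TEXT')]"
            + "/*[contains(., 'FANCY UNIQUE TEXT dd.MM.yyyy') or contains(., 'COMPLETELY DIFFERENT TEXT')]";

    // Returns the element names of all matches.
    static List<String> matchedElementNames(String xml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(EXPR, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matchedElementNames(SAMPLE)); // the title div and the link
    }
}
```

If only the link is needed for the click, replacing the final /*[...] step with /a[contains(., 'COMPLETELY DIFFERENT TEXT')] is one possible narrowing.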

Trying to exclude a portion of an xPath

I have looked through several posts about this, but have failed to apply the principles used to get the result I desire, so I'm going to just post my specific problem.
I am building a Google Sheet that enables the user to pull up Bible verses.
I have it all working, however I am running into an issue with a hidden element being pulled into my text().
FUNCTION:
=IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()")
RESULT: You shall put out both male and female, putting them outside the camp, that they may not defile their camp, 1in the midst of which I dwell."
You can see the "1" that is showing up before the word "in"
I have found the xPath that pulls only that "1"
//*[@class='scripture']//span[2]//sup//text()
I am trying to remove that "1" from the text.
HELP PLEASE!!! :)
You can add a predicate to the end to exclude text nodes that are inside sup elements:
=IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]")
This will retrieve only the text nodes that are not inside a sup element, but it will still result in having the verse spread out across two cells, because there are two text nodes. You can rectify this by wrapping this expression in a JOIN():
=JOIN("", IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]"))

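The effect of the not(ancestor::sup) predicate can also be checked outside Google Sheets. The sketch below uses Java's standard javax.xml.xpath API. The markup is an invented, shortened stand-in for the page (the real structure may differ), and the class name VerseText is hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class VerseText {
    // Invented markup: the second span holds the verse with a footnote marker in <sup>.
    static final String SAMPLE =
            "<div class='scripture'><span>3</span>"
            + "<span>that they may not defile their camp, <sup>1</sup>in the midst of which I dwell.</span>"
            + "</div>";

    // Selects the text nodes matched by expr and concatenates them, like JOIN("", ...).
    static String joinedText(String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getNodeValue());
            }
            return sb.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Without the predicate the footnote "1" is included; with it, it is skipped.
        System.out.println(joinedText("//*[@class='scripture']//span[2]//text()"));
        System.out.println(joinedText("//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]"));
    }
}
```
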
Retrieve an xpath text contains using text()

I've been hacking away at this one for hours and I just can't figure it out. Using XPath to find text values is tricky and this problem has too many moving parts.
I have a webpage with a large table, and a section in this table contains a list of users (assignees) that are assigned to a particular unit. There are nearly always multiple users assigned to a unit, and I need to make sure a particular user is assigned to any of the units in the table. I've used XPath for nearly all of my selectors and I'm halfway there on this one. I just can't seem to figure out how to use contains with text() in this context.
Here's what I have so far:
//td[@id='unit']/span [text()='asdfasdfasdfasdfasdf (Primary); asdfasdfasdfasdfasdf, asdfasdfasdfasdf; 456, 3456'; testuser]
The XPath Query above captures all text in the particular section I am looking at, which is great. However, I only need to know if testuser is in that section.
text() gets you a set of text nodes. I tend to use it more in a context of //span//text() or something.
If you are trying to check if the text inside an element contains something you should use contains on the element rather than the result of text() like this:
span[contains(., 'testuser')]
XPath is pretty good with context. If you know exactly what text a node should have you can do:
span[.='full text in this span']
But if you want to do something like regular expressions (using exslt for example) you'll need to use the string() function:
span[regexp:test(string(.), 'testuser')]
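As a sketch of the contains(., ...) approach applied to the td from the question, the following uses Java's standard javax.xml.xpath API. The sample markup, the user names in it and the class name AssigneeCheck are invented for illustration. The user name is passed in as an XPath variable ($user) so it does not have to be spliced into the expression string.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class AssigneeCheck {
    // Invented stand-in for the assignee cell described in the question.
    static final String SAMPLE =
            "<table><tr><td id='unit'>"
            + "<span>alice (Primary); bob, carol; testuser</span>"
            + "</td></tr></table>";

    // True if some span under td#unit contains the given user name as a substring.
    static boolean isAssigned(String xml, String user) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Resolve the $user variable used in the expression below.
            xpath.setXPathVariableResolver(name -> "user".equals(name.getLocalPart()) ? user : null);
            return (Boolean) xpath.evaluate(
                    "boolean(//td[@id='unit']/span[contains(., $user)])",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isAssigned(SAMPLE, "testuser"));
        System.out.println(isAssigned(SAMPLE, "mallory"));
    }
}
```

Note that contains is a plain substring test, so a name such as "testuser2" in the cell would also satisfy a check for "testuser".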

Matching text with xpath?

I'm screen-scraping an HTML page which contains:
<table border=1 class="searchresult" cellpadding=2>
<tr><th colspan=2>Last search</th></tr>
<tr><th align=left>Search term</th><td>xxxxxx</td></tr>
<tr><th align=left>Result</th><td>yyyyyyyy</td></tr>
</table>
I want to write an XPATH expression which gets me the data cell containing "yyyyyyyy". I've gotten as far as
.//table[@class='searchresult']//tr/th
which gets me a list of all the table-header nodes in the table. I can iterate over them in user code, find the one whose .text is "Result" and then call .getnext() on that to get the table-data. But is there a cleaner way to do this by writing a more specific XPath pattern? It seems like there should be, but I haven't gotten my head far enough around XPath yet to figure out how.
If it matters, I'm doing this in Python with lxml.
.//table[@class='searchresult']//tr/td[preceding-sibling::th] might give you what you need.
Two comprehensive papers on semi-automatically creating XPath statements like this one, specifically for screen scraping purposes can be found here:
http://tobiasanton.com/Tobias_Anton/Academia.html
Use:
//table/tr[last()]/td
This selects any td element that is a child of any tr that is the last tr child of any table in this XHTML document.
This may select more than one td element, depending on whether or not there is only one table in the XHTML document. You need to make this expression more precise, if more than one table element is present.
For example, if the table in question is the first in the document, use:
(//table)[1]/tr[last()]/td
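To compare what the suggested expressions actually select, here is a sketch using Java's standard javax.xml.xpath API rather than lxml. The table is the one from the question with its attributes quoted so that it parses as XML. The third expression, which matches the header text directly, is an additional variant and not taken from the answers. The class name SearchResultXPath is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SearchResultXPath {
    // The question's table, made well-formed (quoted attributes) for an XML parser.
    static final String SAMPLE =
            "<table border='1' class='searchresult' cellpadding='2'>"
            + "<tr><th colspan='2'>Last search</th></tr>"
            + "<tr><th align='left'>Search term</th><td>xxxxxx</td></tr>"
            + "<tr><th align='left'>Result</th><td>yyyyyyyy</td></tr>"
            + "</table>";

    // Returns the text content of every node selected by expr.
    static List<String> texts(String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Every td that has a th before it in the same row: both data cells.
        System.out.println(texts("//table[@class='searchresult']//tr/td[preceding-sibling::th]"));
        // The td of the last row only.
        System.out.println(texts("//table/tr[last()]/td"));
        // Variant (not from the answers): the td in the row whose th reads "Result".
        System.out.println(texts("//table[@class='searchresult']//tr[th='Result']/td"));
    }
}
```

The first expression still needs filtering in user code, the second depends on row order, and the third depends on the exact header text.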

HtmlUnit getByXpath returns null

I am coding in Groovy; however, I don't believe these are language-specific questions.
I actually have two questions.
First Question
I've run into an issue while using HtmlUnit. It is telling me that what I am trying to grab is null.
The page I'm testing it on is:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4
My code:
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
page = client.getPage(url)
//coming up as null
title = page.getByXPath("//html/body/div[4]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a")
println title
This simply prints out: []
Is this because the page uses onclick()? If so, how would I get around that? Enabling javascript creates a mess in my cmd prompt.
Second Question
I also want to get the image, but I am having trouble because when I attempt to get the XPath (via Firebug) it shows up as: //*[@id="gmi-ResViewSizer_img"]
How do I handle that?
First Answer:
/html/body/div[3]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a
Your XPath was off by one in the predicate filter for the 4th div of the body; it should be the 3rd div. It appears the HTML for the site can change, and did change after you originally grabbed the XPath using Firebug. You may need to adjust your XPath to accommodate potential changes and be less sensitive to differences in document structure.
Maybe something like this:
/html/body//div/h1/a
Second Answer: The XPath that you listed will work. It may look odd or short (and may not be the most efficient), but // starts at the root node and looks through every node in the tree, * matches any element (including the img), and the [] predicate filter restricts it to those that have an id attribute whose value equals "gmi-ResViewSizer_img".
There are many other options for XPATHs that could work as well. It will also depend on how often the HTML structure changes. This is one that also works for the page referenced to select that img:
/html/body/div/div/div/div/img[1]
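The difference in robustness between a positional path and an id-based one can be illustrated with a small sketch using Java's standard javax.xml.xpath API instead of HtmlUnit. Both documents are invented, much simpler than the real page, and the class name PathRobustness is hypothetical. The second document only adds one extra div at the top of the body.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class PathRobustness {
    // Invented page as it looked when the positional path was recorded.
    static final String BEFORE =
            "<html><body><div><div><img id='gmi-ResViewSizer_img'/></div></div></body></html>";
    // Same page after an extra div (e.g. a banner) was added before the content.
    static final String AFTER =
            "<html><body><div>banner</div><div><div><img id='gmi-ResViewSizer_img'/></div></div></body></html>";

    // Counts how many nodes expr selects in the given document.
    static int count(String xml, String expr) {
        try {
            Double n = (Double) XPathFactory.newInstance().newXPath()
                    .evaluate("count(" + expr + ")", new InputSource(new StringReader(xml)),
                            XPathConstants.NUMBER);
            return n.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String positional = "/html/body/div[1]/div/img";
        String byId = "//*[@id='gmi-ResViewSizer_img']";
        // The positional path stops matching once the structure shifts; the id lookup does not.
        System.out.println(count(BEFORE, positional) + " " + count(AFTER, positional));
        System.out.println(count(BEFORE, byId) + " " + count(AFTER, byId));
    }
}
```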
I had the same problem. I solved it when I realized there were iframe tags on the page. Try calling
((HtmlPage)current_page.getFrames()[n].getEnclosedPage()).getElementByXPath(...
where n is the position of the frame in the iframe collection. It worked for me!
Thanks a lot.
