XPath Scrape of Website Returns Incomplete Results

I am trying to use the Google Spreadsheet function "importXML" to pull in all links and titles from a Khan Academy Website:
https://www.khanacademy.org/commoncore/grade-HSA-A-SSE
So far I have tried:
=IMPORTXML("https://www.khanacademy.org/commoncore/grade-HSA-A-SSE", "//a[@class='standard-preview']")
It brings in 29 results, but not all of the "a" elements with class "standard-preview". On the webpage, there are many more elements with that class than just the 29 results.
How do I grab all the elements with the class "standard-preview"? Why would my XPath not return some of the values?
My spreadsheet is below:
https://docs.google.com/spreadsheets/d/1pP-WMnoCYzG38VyT_0tYpdblSKjNGvDpa8dRMnraQ7w/edit?usp=sharing
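One way to narrow this down outside Sheets is to fetch the same raw HTML that IMPORTXML sees and count the matches yourself. A minimal sketch in Python, assuming the requests and lxml packages are available; note that @class='standard-preview' is an exact string match, so anchors that carry additional classes would be skipped by it:

import requests
from lxml import html

# Fetch the raw (non-JavaScript-rendered) HTML, the same view IMPORTXML gets.
url = "https://www.khanacademy.org/commoncore/grade-HSA-A-SSE"
tree = html.fromstring(requests.get(url).content)

# Exact match: only anchors whose class attribute is exactly 'standard-preview'.
exact = tree.xpath("//a[@class='standard-preview']")
# Token match: anchors that have 'standard-preview' among several classes.
partial = tree.xpath("//a[contains(concat(' ', normalize-space(@class), ' '), ' standard-preview ')]")

print(len(exact), len(partial))
# If both counts stop at 29, the remaining links are most likely injected by
# JavaScript after page load, which IMPORTXML cannot see.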

Related

Get data from XML by Google Sheets ImportXML function

I'm trying to scrape data from XML with the IMPORTXML function of Google Sheets, but the result is empty.
I tried these formulae:
=IMPORTXML("https://www.futbin.com/20/player/42955/", "//span[@id='ps-lowest-1']/text()")
=INDEX(IMPORTXML("https://www.futbin.com/20/player/42955/", "//div[@class='xbox-lowest-1']"),1,1)
=IMPORTXML("https://www.futbin.com/20/player/42955/", "//*[@id='xbox-lowest-1']")
=IMPORTXML("https://www.futbin.com/20/player/42955/", "//*[@id='xbox-lowest-1']/text()")
Maybe the data is generated afterwards by a script or something else.
You could do:
=QUERY(ARRAY_CONSTRAIN(IMPORTDATA("https://www.futbin.com/20/player/42955/"), 5000, 1),
"where lower(Col1) contains 'lowest'")
but as you can see, there are no numeric values between those tags, because Google Sheets does not support scraping of JavaScript-rendered elements.
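If you want to confirm that, a small Python sketch (assuming the requests and lxml packages; the User-Agent value is an arbitrary choice, since the site may reject default clients) can show that the span exists in the static HTML but carries no text:

import requests
from lxml import html

# Inspect the static HTML for the price span; an empty result supports the
# claim that the value is filled in later by JavaScript.
resp = requests.get(
    "https://www.futbin.com/20/player/42955/",
    headers={"User-Agent": "Mozilla/5.0"},  # arbitrary UA; an assumption
)
tree = html.fromstring(resp.content)
print(tree.xpath("//span[@id='ps-lowest-1']/text()"))  # likely [] or whitespace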

Accuracy of wbsearchentities in Wikidata API [duplicate]

I'm using wbsearchentities (Wikidata API) in a Python request and I'm wondering why the returned results are not the same as those seen on Wikidata. For example, the following command in Python:
import requests

url = "https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%s&format=json&limit=50&formatversion=2&language=en" % 'New York Landmarks Preservation Commission'
r = requests.post(url, headers={"User-Agent": "Magic Browser"})
returns nothing, but the same search in the search box of Wikidata returns 2 results (one is the good one: New York City Landmarks Preservation Commission).
Ideally, I would like to have all these results returned from my python request.
The search box in the top right of Wikidata uses the wbsearchentities API module to provide the auto suggestion dropdown search.
If you press enter after entering your search instead of clicking on one of the suggestions you will end up on the Special:Search page.
As you can see, the API returns no results but the special page does.
That is due to these searches working in entirely different ways.
The Special:Search page is a MediaWiki concept that Wikibase provides data to.
The wbsearchentities API module is provided by Wikibase itself.
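To see the difference in code, here is a sketch in Python that calls both mechanisms side by side: wbsearchentities (what the dropdown uses) and the standard MediaWiki list=search module (what Special:Search is built on). It assumes only the requests package:

import requests

API = "https://www.wikidata.org/w/api.php"
term = "New York Landmarks Preservation Commission"
headers = {"User-Agent": "Magic Browser"}

# 1) wbsearchentities: the prefix/autosuggest search behind the dropdown.
r = requests.get(API, params={
    "action": "wbsearchentities", "search": term,
    "language": "en", "limit": 50, "format": "json",
}, headers=headers)
print([hit.get("label") for hit in r.json().get("search", [])])

# 2) list=search: the full-text search that backs Special:Search.
r = requests.get(API, params={
    "action": "query", "list": "search", "srsearch": term, "format": "json",
}, headers=headers)
print([hit["title"] for hit in r.json()["query"]["search"]])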

IMPORTXML XPath: how to get the info from a div when there is neither a class nor an id?

Using IMPORTXML in Google Sheets.
How can I get "data-film-id" and "data-film-release-year" from this, when the info isn't a div class or a div id?:
<div class="react-component film-poster film-poster-193260 poster linked-film-poster -attributed"
data-component-class="globals.comps.FilmPosterComponent"
data-film-id="193260"
data-film-name="The Choice"
data-poster-url="/film/the-choice-1987/image-150/"
data-film-release-year="1987"
data-film-link="/film/the-choice-1987/">
I was able to get some info from the site (where A1 is https://letterboxd.com/tag/30-countries-2018/diary/by/added/page/58/) into Google Sheets using this:
=ImportXML(A1, "//div[contains(@class,'react-component') and contains(@class,'film-poster')]/a/@href")
So I know everything works, but that's only because the href is below that div in its own paragraph. My issue is trying to dig into the info that is being displayed above.
After searching on this site I tried this (among many other things) but it resulted in an error.
=ImportXML(A1, "//li[@class='poster-container']//div[not(@id) or not(@class)]")
But it gives me info I already have, not the info I need.
Maybe I can't get the date because it isn't a class or an id?
You need to use the attribute selector.
=ImportXML(A1, "//div[contains(@class,'react-component') and contains(@class,'film-poster')]/attribute::data-film-id")
So in Column B you can have the above formula to display the film ID, in Column C another formula for the release year, and so on.
If you want it all from one formula, which I don't recommend, it would be
=ImportXML(A1, "//div[contains(@class,'react-component') and contains(@class,'film-poster')]/attribute::data-film-id | //div[contains(@class,'react-component') and contains(@class,'film-poster')]/attribute::data-film-release-year")
I don't recommend combining this because it outputs everything in one column "year, id, year, id, ...". Very messy.
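If the goal is one row per film with the id and year in separate columns, stepping outside Sheets avoids the interleaving entirely. A sketch with Python's requests and lxml, assuming the page structure matches the snippet above:

import requests
from lxml import html

url = "https://letterboxd.com/tag/30-countries-2018/diary/by/added/page/58/"
tree = html.fromstring(requests.get(url).content)

# Read both data-* attributes from each poster div so the id and year stay
# paired per film instead of interleaved in one column.
for poster in tree.xpath(
    "//div[contains(@class,'react-component') and contains(@class,'film-poster')]"
):
    print(poster.get("data-film-id"), poster.get("data-film-release-year"))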

Scrapy XPath: unsupported indirect child syntax

I want to select all the 'a' elements inside 'li's of class 'foo', so the XPath used is li[@class="foo"]//a, which works in an XPath tester and in JavaScript.
However, I'm trying to get this to work under a CrawlSpider built under Scrapy, specifically as one of its link extractor rules in a fashion such as
Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="foo"]//a | //a[contains(.,"Next")]')), callback='parse_foo', follow=True)
It returns a much larger set than expected.
For example on this data.gc.ca page, there are 10 divs of dataset-item class. By selecting //div[@class="dataset-item"] I get 10 items. However, when I select with //div[@class="dataset-item"]//a I get 68 items. Per the specs, the //a should be all the a within these divs.
How can I implement the desired function in Scrapy?
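One way to separate XPath behaviour from link-extractor behaviour is to run both selections through a bare Scrapy Selector first; note that //a after the div predicate selects every descendant anchor, so getting more anchors than divs is expected. A sketch assuming the scrapy and requests packages (PAGE_URL is a placeholder for the data.gc.ca page mentioned above):

import requests
from scrapy.selector import Selector

PAGE_URL = "https://..."  # placeholder: the data.gc.ca listing from the question
sel = Selector(text=requests.get(PAGE_URL).text)

print(len(sel.xpath('//div[@class="dataset-item"]')))     # the 10 item containers
print(len(sel.xpath('//div[@class="dataset-item"]//a')))  # every anchor inside them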

I am not able to capture all search result links and descriptions even though I am using the correct className

I want to capture all search result links (search engine: http://search.yahoo.com) and summaries from the result page. The className of the link is 'yschttl spt' and the className for the summary is 'abstr'.
The source code looks like the below for the link and summary.
Link:
<a id="yui_3_3_0_1_1301085039901361" dirtyhref="http://search.yahoo.com/r/_ylt=A0oG7m9u.4xNvWYA7N5XNyoA;_ylu=X3oDMTE2ZXNhNjRzBHNlYwNzcgRwb3MDMgRjb2xvA2FjMgR2dGlkA01TWUMwMDFfMTc5/SIG=11stois8r/EXP=1301106638/**http%3a//en.wikipedia.org/wiki/Pune,_India" class="yschttl spt" href="http://search.yahoo.com/r/_ylt=A0oG7m9u.4xNvWYA7N5XNyoA;_ylu=X3oDMTE2ZXNhNjRzBHNlYwNzcgRwb3MDMgRjb2xvA2FjMgR2dGlkA01TWUMwMDFfMTc5/SIG=11stois8r/EXP=1301106638/**http%3a//en.wikipedia.org/wiki/Pune,_India" data-bns="API" data-bk="5096.1"><b>Pune</b> - Wikipedia, the free encyclopedia</a>`
Summary Div:
<div id="yui_3_3_0_1_1301085039901338" class="abstr"><b id="yui_3_3_0_1_1301085039901337">Pune</b> is undoubtedly a great place to eat. Fergusson <b id="yui_3_3_0_1_1301085039901352">College</b> <b>Road</b> is full of budget eateries serving delicous hot food at nominal charges. For a range of multi-cuisine ...</div>
I am using below line of code to capture both (link and summary).
final List<WebElement> links = driver.findElements(By.className("yschttl spt"));
final List<WebElement> linksSummary = driver.findElements(By.className("abstr"));
But it's not working at all.
I also tried using below XPaths for Link:
//a[starts-with(#id, 'yui_')]
//a[#data-bns='API']
I cannot use ID as whole because that ID number is not same for all search result links.
Nothing is working. Please help.
Thanks in advance.
The main problem is that you are not correctly asking for a classname. The instance of class="yschttl spt" says that the element may be identified by two class names, yschttl or spt. It does not say that the classname is yschttl spt so asking for By.className("yschttl spt") will always fail.
Note that the reason the XPath suggested by @Tarun works is that XPath has no notion of what an HTML class name is or should be. In XPath, @class simply specifies the name of an attribute, with no underlying semantics.
Further, note that a class name may not contain spaces. For more details about specifying class names, see the HTML class attribute specification.
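For completeness, here is the fix the answer points at, sketched with the Python Selenium bindings (the question's code is Java, but the API is parallel; treat the URL and the flow as illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://search.yahoo.com")  # then submit a search as in the question

# By.CLASS_NAME accepts exactly one class name:
links = driver.find_elements(By.CLASS_NAME, "yschttl")

# or require both classes at once with a CSS selector:
links = driver.find_elements(By.CSS_SELECTOR, "a.yschttl.spt")
summaries = driver.find_elements(By.CSS_SELECTOR, "div.abstr")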
I don't use Selenium 2.0, but when I tried //a[@class='yschttl spt'] in Firefox XPath Checker, I got to see all 10 results in the page. I am soon to begin with Selenium 2.0; maybe I could try it then...
