scrapy HtmlXPathSelector determine xpath by searching for keyword - xpath

I have a portion of html like below
<li><label>The Keyword:</label><span>The text</span></li>
I want to get the string "The keyword: The text".
I know that I can get xpath of above html using Chrome inspect or FF firebug, then hxs.select(xpath).extract(), then strip html tags to get the string. However, the approach is not generic enough since the xpath is not consistent across different pages.
Hence, I'm thinking of below approach:
Firstly, search for "The Keyword:" using
hxs = HtmlXPathSelector(response)
hxs.select('//*[contains(text(), "The Keyword:")]')
When I pprint the result, I get:
>>> pprint( hxs.select('//*[contains(text(), "The Keyword:")]') )
<HtmlXPathSelector xpath='//*[contains(text(), "The Keyword:")]' data=u'<label>The Keyword:</label>'>
My question is how to get the wanted string: "The keyword: The text". I am thinking of how to determine xpath, if xpath is known, then of course I can get the wanted string.
I am open to any solution other than scrapy HtmlXPathSelector. ( e.g lxml.html might have more features but I am very new to it).
Thanks.

If I understand your question correctly, the "following-sibling" axis is what you are looking for. The sample markup has no <a> inside the <span>, so the text node is selected directly:
//*[contains(text(), "The Keyword:")]/following-sibling::span/text()
Xpath Axes
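For readers who want to try the sibling lookup outside Scrapy, here is a minimal sketch in Python. It is only a stand-in under stated assumptions: it uses the standard library's xml.etree.ElementTree, which needs well-formed markup and supports neither contains() nor the following-sibling axis, so the sibling step is done by hand. The function name keyword_with_value and the wrapping <ul> are made up for illustration.

```python
import xml.etree.ElementTree as ET


def keyword_with_value(markup, keyword):
    """Return 'label text + sibling span text' for the first matching label.

    Hypothetical helper: walks every element, looks for a child whose own
    text contains the keyword, then takes the first following <span> sibling.
    """
    root = ET.fromstring(markup)  # assumes well-formed markup
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.text and keyword in child.text:
                # Manual equivalent of following-sibling::span
                for sibling in children[index + 1:]:
                    if sibling.tag == "span":
                        value = "".join(sibling.itertext()).strip()
                        return child.text.strip() + " " + value
    return None


# The <li> from the question, wrapped in a <ul> so there is a single root.
html = "<ul><li><label>The Keyword:</label><span>The text</span></li></ul>"
result = keyword_with_value(html, "The Keyword:")
print(result)
```

Real-world HTML is often not well-formed, in which case an HTML-aware parser such as lxml.html with the XPath expression above is probably the better fit.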

Related

Scraping all data from Reddit searches

I am using PRAW to scrape data off of reddit. I am using the .search method to search very specific people. I can easily print the title of the submission if the keyword is in the title, but if the keyword is in the text of the submission nothing pops up. Here is the code I have so far.
import praw
reddit = praw.Reddit(----------)
alls = reddit.subreddit("all")
for submission in alls.search("Yoa ming",sort = comment, limit = 5):
    print(submission.title)
When I run this code I get:
Yoa Ming next to Elephant!
Obama's Yoa Ming impression
i used to yoa ming... until i took an arrow to the knee
Could someone make a rage face out of our dearest Yoa Ming? I think it would compliment his first one so well!!!
If you search Yoa Ming on reddit, there are posts that don't contain "Yoa Ming" in the title but do contain it in the text, and those are the posts I want.
Thanks.
You might need to update the version of PRAW you are using. Using v6.3.1 yields the expected outcome and includes submissions that have the keyword in the body and not the title.
Also, the sort=comment parameter should be sort='comments'. Using an invalid value for sort will not throw an error but it will fall back to the default value, which may be why you are seeing different search results between your script and the website.

How to grab a piece of data which has a different xpath on different webpages?

So I am trying to grab a piece of data that is displayed in a different xpath on different pages.
If you look at the XPath of the IPA pronunciation on Wiktionary (https://en.wiktionary.org/wiki/foo), you will see that it is
//*[@id="mw-content-text"]/ul[1]/li[1]/span[4]
but if I go to another word, like https://en.wiktionary.org/wiki/bar, then the XPath is
//*[@id="mw-content-text"]/ul[1]/li[2]/span[5]
I cannot think of any way to reconcile these, is there something that I am missing?
The answer is simple. Never let a tool write any XPath for you. All tools get it wrong.
Look at the document's HTML source and write the appropriate XPath yourself.
var result = document.evaluate("//*[@class = 'IPA']", document),
    elem;
while (elem = result.iterateNext()) {
    console.log(elem);
}
The above shows the simplest variant. It selects two occurrences of <span class="IPA"> on https://en.wiktionary.org/wiki/foo and quite a few more on https://en.wiktionary.org/wiki/bar.
Use a more specific expression to narrow down the results.
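As a rough illustration of why an attribute-based expression survives layout changes, here is a small Python sketch using the standard library's xml.etree.ElementTree. The two page fragments and the helper name ipa_spans are invented for the example, and the span contents are placeholders rather than real Wiktionary data.

```python
import xml.etree.ElementTree as ET

# Two made-up fragments in which the target <span> sits at different positions.
PAGE_A = (
    '<div id="mw-content-text"><ul>'
    '<li><span>a</span><span class="IPA">sample-foo</span></li>'
    '</ul></div>'
)
PAGE_B = (
    '<div id="mw-content-text"><ul>'
    '<li>first</li>'
    '<li><span>a</span><span>b</span><span class="IPA">sample-bar</span></li>'
    '</ul></div>'
)


def ipa_spans(markup):
    """Return the text of every <span class="IPA">, wherever it appears."""
    root = ET.fromstring(markup)  # assumes well-formed markup
    return [span.text for span in root.findall(".//span[@class='IPA']")]


print(ipa_spans(PAGE_A))
print(ipa_spans(PAGE_B))
```

The same expression works on both fragments because it depends on the class attribute and not on li[1]/span[4] versus li[2]/span[5].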

How to use substring() with Import.io?

I'm having some issues with XPath and import.io and I hope you'll be able to help me. :)
The html code:
<a href="page.php?var=12345">
For the moment, I manage to extract the content of the href ( page.php?var=12345 ) with this:
./td[3]/a[1]/@href
Though, I would like to just collect: 12345
substring might be the solution but it does not seem to work on import.io as I use it...
substring(./td[3]/a[1]/@href,13)
Any ideas of what the problem is?
Thanks a lot in advance!
Try using this for the xpath: (Have the field selected as Text)
.//*[@class='oeil']/a/@href
Then use this for your regex:
([^=]*)$
This will get you the ISBN number you are looking for.
import.io only supports functions in XPath when they return a node list.
Your path expression is fine, but perhaps it should be
substring(./td[3]/a[1]/@href,14)
"Does not seem to work" is not a very clear description of what is wrong. Do you get error messages? Is the output wrong? Do you have any code surrounding the path expression you could show?
You can use substring, but using substring-after() would be even better.
substring-after(/a/@href,'=')
assuming as input the tiny snippet you have shown:
<a href="page.php?var=12345"/>
will select
12345
and taking into account the structure of your input
substring-after(./td[3]/a[1]/@href,'=')
The leading . makes the path relative to the current context node, so ./td[3] selects only td elements that are immediate children of that node. I trust you know what you are doing.
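To check what each suggestion in this thread would return for the sample value, the string operations can be mimicked in plain Python. This is only a sketch of the string logic, not of how import.io evaluates XPath; the variable names are made up.

```python
import re

href = "page.php?var=12345"  # the attribute value from the question

# substring-after(@href, '='): everything after the first '='.
after_equals = href.partition("=")[2]

# substring(@href, 14): XPath counts characters from 1, Python from 0,
# so position 14 in XPath is index 13 here.
from_position_14 = href[13:]

# substring(@href, 13) from the question starts one character too early.
from_position_13 = href[12:]

# The regex ([^=]*)$ suggested above: the run of non-'=' characters at the end.
regex_tail = re.search(r"([^=]*)$", href).group(1)

print(after_equals, from_position_14, from_position_13, regex_tail)
```

The substring-after() form has the advantage of not depending on the length of the part before the equals sign.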

Jekyll search with jekyll-lunr-js-search not loading full body in search.json

I'm trying to implement search functionality in a Jekyll-generated static website via jekyll-lunr-js-search (see here).
The search functionality works if the searched string appears outside of a html or highlight tag, or in the title, plain body text... But if I look for a word or part of a word that appears inside a tag in the post (.md file), it's not found.
Inspecting the search.json entry for that particular post, I see indeed that the body does not contain this content...
Is this a known issue? Or is this a configuration problem?
The post would e.g. contain
<ul><li>Labyrinth</li></ul> Bicycle races are fun yes yes!
and the body content in search.json would then be : body: "Bicycle races are fun yes yes!"
Searching for 'Lab' would return no results then.
Thanks in advance.
I think this is a problem with the plugin. Looking at the source, we can see that it builds the search index by scanning the text from the rendered page.
def render(item)
  item.render({}, @site.site_payload)
  doc = Nokogiri::HTML(item.output)
  paragraphs = doc.search('//text()').map {|t| t.content }
  paragraphs = paragraphs.join(" ").gsub("\r", " ").gsub("\n", " ").gsub("\t", " ").gsub(/\s+/, " ")
end
You could modify the plugin and change the //text() XPath to be a CSS selector you need, since Nokogiri's search allows for both XPath and CSS searching.
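To make the indexing step easier to reason about, here is a rough Python analogue of what the render method above does: collect every text node and collapse the whitespace. It is not the plugin's code; the class TextCollector and the function body_text are hypothetical names, and unlike the Ruby snippet it also strips leading and trailing whitespace.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text found between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def body_text(markup):
    """Join all text nodes and collapse runs of whitespace into one space."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flushes any text still buffered at the end
    return re.sub(r"\s+", " ", " ".join(collector.chunks)).strip()


# The sample post content from the question.
post = "<ul><li>Labyrinth</li></ul> Bicycle races are fun yes yes!"
indexed_body = body_text(post)
print(indexed_body)
```

If text inside list items is collected this way, a search for 'Lab' has something to match, which is a useful baseline when comparing against what ends up in search.json.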

How can I use Nokogiri to find specific text/words on a webpage?

I am new to nokogiri, but it looks like this would be the tool that I would use to scrape a webpage. I am looking for specific words on a webpage. The words are "Valid", "Requirements Met", and "Requirements Not". I am using watir to drive through the website. I currently have:
page = Nokogiri::HTML.parse(browser.html)
to get the html, but I am not sure where to go from here.
Thanks for the help!
If you are using Watir to drive the website, I would suggest using Watir to check for the text. You can get all the text on the page using:
ie.text #Where ie is a Watir::IE
You could then check whether those words are included (by comparing to a regex):
if ie.text =~ /Valid|Requirements Met|Requirements Not/
  #Do something if the words are on the page
end
That said, if you are looking for specific bits of text, you can use Watir to look specifically for those elements (and avoid parsing text or html). If you can provide an HTML sample of what you are working on, we can help find a more robust solution.
I am not sure why you are using both. You could get the page using 'net/http' or mechanize if you just want to check for text. Anyways, you can check for text in watir with browser.text.match 'Valid', same for nokogiri with page.text.match 'Valid'.
You should also be able to use the .text method from Justin's answer along with a standard Ruby string method that returns true or false. Note that String#include? only accepts a string and raises a TypeError when given a regex, so use match? (Ruby 2.4 and later) for a pattern.
if browser.text.match?(/Valid|Requirements Met|Requirements Not/)
  #code to execute if text found
else
  #code to execute if text not found
end
This also makes it easy to have a single line validation step if that is what you are after
if using rspec/cucumber
browser.text.should match(/Valid|Requirements Met|Requirements Not/)
if using Test::Unit
assert browser.text.match?(/Valid|Requirements Met|Requirements Not/)

Resources