I can't seem to figure this out because of the square-bracket syntax. I have an HTML page full of h3 tags containing links whose hrefs I need to grab, but one h3 has a class on it that I don't want. Example:
I want the hrefs from all h3s except this one:
<h3 class="leave_this">Leave me alone!
To grab all the hrefs I am using this:
//h3/a/@href
Tried a few variations but no luck.
REMOVED CONFUSING EXAMPLE, APOLOGIES
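One way to express "every h3 link except the one with that class" is a not() predicate on the h3 step. The sketch below uses Python with lxml on a made-up fragment; the markup, the URLs and the variable names are assumptions for illustration, not the asker's actual page.

```python
from lxml import html

# Hypothetical markup: three h3 headings, one carrying the unwanted class.
SAMPLE = """
<html><body>
  <h3><a href="/first">First</a></h3>
  <h3 class="leave_this"><a href="/skip">Leave me alone!</a></h3>
  <h3><a href="/second">Second</a></h3>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# The predicate drops any h3 whose class is exactly "leave_this";
# headings with no class attribute at all still match.
hrefs = tree.xpath("//h3[not(@class='leave_this')]/a/@href")
print(hrefs)
```

If the unwanted h3 could carry several classes, not(contains(@class, 'leave_this')) would be the looser variant of the same predicate.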
I have a few hundred URLs where I'm trying to scrape the image path for an image on a page. Each page is the same format, but the div class is unique to each page.
I want to use IMPORTXML in Google Sheets to scrape just the value of the data-path attribute.
I've tried and failed to use XPath to pull out the URLs.
<div class="uniqueid active" data-path="/~/media/Images/image.jpg" data-alt="Anything"></div>
E.g. //div[@class='*']/@data-path
Example of site: https://www.cannondale.com/en/Australia/Bike/ProductDetail?Id=77d3b8fe-41f7-42b6-bf69-b5cf0ae55548&parentid=undefined
If the div's class always follows the pattern "uniqueid active", you can try the following XPath:
//div[contains(@class, "active")]/@data-path
Otherwise, if the div's class can be anything, use this query:
//div[@class]/@data-path
UPDATE:
I tried to get the values of the data-path attributes with IMPORTXML but didn't succeed. I then tried it in Python (requests and lxml) and it works, so the problem is probably on the Google Sheets side: some limitation or bug, I don't know which.
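The answer does not show its Python, so the following is only a sketch of how such a check might look. It runs both queries against the div from the question instead of the live page (the requests fetch is left out so it works offline); the second div and the variable names are assumptions added for contrast.

```python
from lxml import html

# The div from the question, plus a hypothetical sibling without "active".
SAMPLE = """
<div>
  <div class="uniqueid active" data-path="/~/media/Images/image.jpg" data-alt="Anything"></div>
  <div class="otherid" data-path="/~/media/Images/other.jpg"></div>
</div>
"""

tree = html.fromstring(SAMPLE)

# Only divs whose class attribute contains the substring "active".
# Note that a class such as "inactive" would match as well.
active_paths = tree.xpath('//div[contains(@class, "active")]/@data-path')

# Any div that has a class attribute at all.
all_paths = tree.xpath('//div[@class]/@data-path')

print(active_paths)
print(all_paths)
```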
I'm having difficulty getting the correct XPath to scrape the real URL of any image in my Scoop.it topic. Here is the code excerpt centered on one image. Other images are treated the same way.
<div class="thisistherealimage" >
<img id="Here a specific image ID" width="467" height="412"
class="postDisplayedImage lazy"
src="/resources/img/white.gif"
data-original="https://img.scoop.it/jKj7v6ojzPtACT6EaeztHTl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9"
alt="Here an alternative text" style="width:467; height: 412;" />
So, in this code sample, I don't want to scrape "/resources/img/white.gif" but the URL held in the "data-original" attribute!
I'd like to capture the value of the data-original attribute whatever it contains, not only when it contains a URL.
As an XPath beginner, I've tried this: //div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage')][contains(@class,'lazy')]
But it isn't specific to the data-original attribute, is it?
Any advice?
If you want the data-original value, you can access it like this:
//div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage') and contains(@class,'lazy')]/@data-original
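To see the difference the trailing attribute step makes, here is a small sketch in Python with lxml, run against the excerpt from the question. The closing </div> is added so the fragment parses cleanly, and the variable names are assumptions.

```python
from lxml import html

# The excerpt from the question, with the div closed.
SAMPLE = """
<div class="thisistherealimage">
  <img id="Here a specific image ID" width="467" height="412"
       class="postDisplayedImage lazy"
       src="/resources/img/white.gif"
       data-original="https://img.scoop.it/jKj7v6ojzPtACT6EaeztHTl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9"
       alt="Here an alternative text" style="width:467; height: 412;" />
</div>
"""

tree = html.fromstring(SAMPLE)

IMG = ("//div[contains(@class,'thisistherealimage')]"
       "/img[contains(@class,'postDisplayedImage') and contains(@class,'lazy')]")

# Without a trailing attribute step the result is the <img> element itself.
elements = tree.xpath(IMG)

# Appending /@data-original returns the attribute value as a string.
urls = tree.xpath(IMG + "/@data-original")

print(elements[0].tag)
print(urls)
```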
I'm using scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and subelements of a certain div, EXCEPT a few at the beginning, and a few at the end.
Here's the gist.
<div id="easy-id">
<stuff I don't want>
text I don't want
<div id="another-easy-id" more stuff I don't want>
text I want
<stuff I want>
...
<more stuff I want>
text I want
...
<div id="one-more-easy-id" more stuff I *don't* want>
<more stuff I *don't* want>
NB: The indenting implies closing tags, so everything here is a child of the first div -- the one with id="easy-id"
Because text and nodes are mixed, I haven't been able to figure out a simple xpath selector to grab the stuff I want. At this point, I'm wondering if it's possible to retrieve the result from xpath as an lxml.etree.elementTree, and then hack at it using the .remove() method.
Any suggestions?
I am guessing you want everything from the div with ID another-easy-id up to but not including the one-more-easy-id div.
Stack Overflow has not preserved the indenting, so I do not know where the first div element ends, but I'm going to guess it ends before the text.
In that case you might want
//div[@id = 'another-easy-id']/following::node()
[not(preceding::div[@id = 'one-more-easy-id']) and not(@id = 'one-more-easy-id')]
If this is XHTML you'll need to bind some prefix, h, say, to the XHTML namespace and use h:div in both places.
EDIT: Here's the syntax I went with in the end. (See comments for the reasons.)
//div[@id='easy-id']/div[@id='one-more-easy-id']/preceding-sibling::node()[preceding-sibling::div[@id='another-easy-id']]
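A quick way to check what that final expression selects is to run it over a small stand-in for the page. The sketch below uses Python with lxml; the span and p elements and all the text are invented to match the shape described in the question, with every node a direct child of the easy-id div.

```python
from lxml import html

# Hypothetical stand-in for the page: text and elements mixed as siblings.
SAMPLE = """
<div id="easy-id">
  <span>stuff I don't want</span>
  text I don't want
  <div id="another-easy-id">more stuff I don't want</div>
  text I want
  <p>stuff I want</p>
  <p>more stuff I want</p>
  text I want too
  <div id="one-more-easy-id">more stuff I don't want</div>
  <span>even more stuff I don't want</span>
</div>
"""

tree = html.fromstring(SAMPLE)

WANTED = ("//div[@id='easy-id']/div[@id='one-more-easy-id']"
          "/preceding-sibling::node()"
          "[preceding-sibling::div[@id='another-easy-id']]")

kept = []
for node in tree.xpath(WANTED):
    if isinstance(node, str):
        # Text nodes come back as strings; skip whitespace-only ones.
        if node.strip():
            kept.append(("text", node.strip()))
    else:
        # Element nodes: record the tag and its leading text.
        kept.append((node.tag, node.text))

print(kept)
```

Because node() matches text nodes as well as elements, the whitespace between tags shows up in the raw result, which is why the loop filters it out.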
I'd like to convert HTML to plain text. I don't want to just strip the tags, though; I'd like to intelligently retain as much formatting as possible: inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc.
The input is pretty simple, usually well-formatted HTML (not entire documents, just a bunch of content, usually with no anchors or images).
I could put together a couple of regexes that get me 80% of the way there, but figured there might be some existing solutions with more intelligence.
First, don't try to use regex for this. The odds are really good you'll come up with a fragile/brittle solution that will break with changes in the HTML or will be very hard to manage and maintain.
You can get part of the way there very quickly using Nokogiri to parse the HTML and extract the text:
require 'nokogiri'
html = '
<html>
<body>
<p>This is
some text.</p>
<p>This is some more text.</p>
<pre>
This is
preformatted
text.
</pre>
</body>
</html>
'
doc = Nokogiri::HTML(html)
puts doc.text
>> This is
>> some text.
>> This is some more text.
>>
>> This is
>> preformatted
>> text.
The reason this works is Nokogiri is returning the text nodes, which are basically the whitespace surrounding the tags, along with the text contained in the tags. If you do a pre-flight cleanup of the HTML using tidy you can sometimes get a lot nicer output.
The problem comes when you compare the output of a parser, or any other means of looking at the HTML, with what a browser displays. The browser is concerned with presenting the HTML in as pleasing a way as possible, ignoring the fact that the HTML can be horribly malformed and broken. The parser is not designed to do that.
You can massage the HTML before extracting the content: first remove extraneous line breaks such as "\n" and "\r", then replace <br> tags with line breaks. There are many questions here on SO explaining how to replace tags with something else. I think the Nokogiri site also has that as one of its tutorials.
If you really want to do it right, you'll need to figure out what you want to do for <li> tags inside <ul> and <ol> tags, along with tables.
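The massaging described above can be sketched roughly as follows. This uses Python's standard html.parser module rather than Nokogiri, and the class name, the function name and the "* " bullet for <li> are all assumptions; tables and <pre> blocks are deliberately left out.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Hypothetical helper: collapses source whitespace, then re-adds
    line breaks only where <br>, <p> or <li> call for them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n* ")

    def handle_endtag(self, tag):
        if tag == "p":
            # A closed paragraph is followed by one blank line.
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Newlines and runs of spaces in the source are not meaningful.
        self.parts.append(re.sub(r"\s+", " ", data))

    def text(self):
        joined = "".join(self.parts)
        # Trim spaces around line breaks and cap blank lines at one.
        joined = re.sub(r" *\n *", "\n", joined)
        joined = re.sub(r"\n{3,}", "\n\n", joined)
        return joined.strip()


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


print(html_to_text("<p>This is\nsome text.</p>\n<p>Line one<br>line two</p>"))
```

The same idea carries over to Nokogiri: decide per tag what line break it stands for, and treat all whitespace in the source as a single space.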
An alternate attack method would be to capture the output of one of the text browsers like lynx. Several years ago I needed to do text processing for keywords on websites that didn't use Meta-Keyword tags, and found one of the text-browsers that let me grab the rendered output that way. I don't have the source available so I can't check to see which one it was.
I'm currently using HtmlUnit to attempt to grab an href out of a page and am having some trouble.
The XPath is:
/html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a
On the webpage it looks like:
<a class="t" title="This Brush" href="http://domain.com/this/that">Brush Set</a>
In my code I am doing:
hrefs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']")
However, this is returning everything in there instead of just the url that I want.
Can someone explain what I must add to get the href? (also it doesn't end with .html)
You are selecting the a. You want to select the a/@href.
hrefs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']/@href")