How can I get the similar attribute with xpath? - xpath

<div><a src="What I need" data-src="What I don't need">Demo</a></div>
I tried xpath("./div/a/@src"), but it returns everything, and I don't want the @data-src value. How should I do this?
the raw page is here:
the raw page

First of all, I would recommend changing @src-data to data-src.
It will spare you many problems when parsing XML files.
Then you can use the following XPath directly to get your src attribute:
/div/a/@src
If you start it with a dot (./div/a/@src), it is evaluated relative to the current node only.
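As a rough illustration of the point above that @src matches only the attribute named exactly src, here is a minimal sketch using Python's standard library. The wrapping root element is an assumption added so the relative path has something to start from. ElementTree supports only a subset of XPath and cannot return attribute nodes, so the value is read with .get() instead of a trailing /@src step.

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a hypothetical <root> element
# so that the relative path "./div/a" has a context node to start from.
doc = ET.fromstring(
    '<root><div><a src="What I need" data-src="What I don\'t need">Demo</a></div></root>'
)

# Select the <a> elements that carry a src attribute, then read its value.
# The data-src attribute is a different attribute and is never touched.
src_values = [a.get("src") for a in doc.findall("./div/a[@src]")]
print(src_values)
```

With a full XPath 1.0 engine the expression ./div/a/@src should select the same single attribute.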

Related

Octoparse and relative Xpath iframe extraction issues

I am trying to use Octoparse to extract the podcast details from Marie Brown's "Beyond the kitchen table" website. https://beyondthekitchentable.co.uk/podcast/
I'm using Octoparse's free version, which allows scraping locally. Octoparse auto-detects the Title, Title_URL, and Content data and correctly sets up the Pagination, Scroll Page, and Loop Item workflow to extract those fields. However, it does not auto-detect the 'Date' and 'Podcast time duration' fields of each individual podcast, as these appear to be embedded from an iframe.
I am able to add Date and Podcast time duration as custom fields using an absolute XPath, i.e. //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1], but this results in the same value being copied into every record. When I try to fix this by using Octoparse's Relative XPath setting to loop over each item with //span[@class="cp-episode-date"], so that each record gets its own value, it returns no values, even though the same XPath finds all items when I search with Chrome DevTools. I saw what might be a helpful post on Stack Exchange about this, but I was not able to make sense of it.
The expression //span[@class="cp-episode-date"] finds multiple Date items in Chrome DevTools, but it is not complete, and I am not sure how to implement the iframe traversal that Octoparse's Relative XPath setting needs for the custom Date and Podcast time duration fields. I even installed the SelectorsHub Chrome extension, but it didn't bring up the nested query view the way the SelectorsHub YouTube video demonstrates; it only showed me the relative XPath I already give below.
Please have a look at this site using Octoparse and see if it is possible. If so, how can I do it?
When Absolute Path is used - //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1]
vs.
When Relative Path is used - //span[@class="cp-episode-date"]
There are plenty of iframes inside the webpage, and I don't know whether Octoparse can handle them. Choose another starting point.
For example, use Apple Podcasts:
https://podcasts.apple.com/gb/podcast/the-website-coach/id1587503231
Dates can be retrieved with the following XPath:
//div[@class="l-row"]//time[@class]/@aria-label
Another possibility is to scrape the following page:
https://feeds.captivate.fm/the-website-coach/
Dates can be retrieved with the following XPath:
//h4/text()
Even easier, get the data directly from this URL (a .json file):
https://itunes.apple.com/lookup?id=1587503231&media=podcast&entity=podcastEpisode&limit=100
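The JSON route could be sketched in Python roughly as follows. The field names results, wrapperType, trackName, and releaseDate are assumptions about the shape of the lookup response and should be checked against a real reply before relying on them; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

LOOKUP_URL = (
    "https://itunes.apple.com/lookup"
    "?id=1587503231&media=podcast&entity=podcastEpisode&limit=100"
)


def fetch_payload(url: str = LOOKUP_URL) -> dict:
    """Download the lookup response and decode it as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def episode_dates(payload: dict) -> list[tuple[str, str]]:
    """Return (title, release date) pairs for the episode entries."""
    pairs = []
    for item in payload.get("results", []):
        # Assumed: episodes are marked with wrapperType "podcastEpisode",
        # while the entry describing the show itself is not.
        if item.get("wrapperType") != "podcastEpisode":
            continue
        # Skip entries that lack either assumed field.
        if "trackName" in item and "releaseDate" in item:
            pairs.append((item["trackName"], item["releaseDate"]))
    return pairs
```

Calling episode_dates(fetch_payload()) would then give one pair per episode, with no iframe handling needed.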

How can I find an image source based on xpath elements including a slash?

I'm trying to get an image on a remote page based on its source value containing "/images/".
As an example: https://images-na.ssl-images-amazon.com/images/W/WEBP_402378-T1/images/G/01/kindle/ku/KU-retail-lp_KUPrePaid._CB661222046_.jpg
Because it has the /images/ in the source, this xpath should theoretically work:
@$xpath->query('//*[contains(@src,"/images/")]/@src')[0];
Unfortunately it does not. I thought it might be an escaping issue, but escaping doesn't seem to work either. What's the trick here?

Letting Nokogiri decide whether to use #fragment or #parse

I have a piece of HTML that I would like to parse with Nokogiri, but I do not know whether it is a full HTML document (with DOCTYPE, etc) or a fragment (e.g. just a div with some elements in it).
This makes a difference for Nokogiri, because it should use #fragment for parsing fragments but #parse for parsing full documents.
Is there a way to determine whether a given piece of text is a fragment or a full HTML document?
Denis
Depends on how trashed your page is, but
/^(?:\s*<!DOCTYPE)|(?:\s*<html)/
should work in most cases.
The simplest way would be to look for the mandatory <html> tag, using for instance the regular expression /<html[\s>]/ (which allows attributes).
Is this sufficient to solve your problem?
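The two regex suggestions above can be combined into one check. The sketch below is in Python rather than Ruby, the function name is hypothetical, and it is a heuristic only: markup that starts with a comment or a byte-order mark before the DOCTYPE would be classified as a fragment.

```python
import re

# Matches a DOCTYPE declaration or an opening <html> tag at the start of the
# text, ignoring leading whitespace and letter case.
FULL_DOCUMENT = re.compile(r"^\s*(?:<!DOCTYPE\b|<html[\s>])", re.IGNORECASE)


def looks_like_full_document(markup: str) -> bool:
    """Heuristic: True if the markup starts like a complete HTML document."""
    return FULL_DOCUMENT.match(markup) is not None
```

The caller would then choose #parse when the function returns True and #fragment otherwise.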

How to get Builder to create <tag></tag> instead of <tag/>

I'm using Builder::XmlMarkup to create XML. I want to create a tag without content because the API forces me to.
If I use a block
xml.tag do
end
I get what I need:
<tag></tag>
But I want it shorter:
xml.mytag
This gives me
<mytag/>
but I want
<mytag></mytag>
What do I have to pass as an option?
regards Kai
Just pass an empty string as a parameter: xml.mytag('')
Why do you want <mytag></mytag> instead of <mytag/>? Since the output is XML, downstream applications should not know or care about the difference.
According to the Infoset spec (Appendix D point 7), "The difference between the two forms of an empty element: <foo/> and <foo></foo>" is not represented in the XML Information Set.
This doesn't answer your "how" question, but if you discover that you actually don't need to do what you're trying to do, it may save you from a difficult and unnecessary wild goose chase.
OK, the empty string is nice. Another one-line way I found is an empty block:
xml.mytag{}
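The Infoset point made in this thread can be checked with any XML parser. The sketch below uses Python's standard library rather than Builder: both spellings of the empty element parse to the same content, and only the serializer decides which form is written back out.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<mytag/>")
long_form = ET.fromstring("<mytag></mytag>")

# Both forms parse to an element with the same tag, no text, and no children.
same_content = (
    short_form.tag == long_form.tag
    and short_form.text == long_form.text
    and len(short_form) == len(long_form) == 0
)

# By default ElementTree collapses empty elements when serializing...
collapsed = ET.tostring(long_form, encoding="unicode")
# ...and short_empty_elements=False asks for the explicit open/close pair.
expanded = ET.tostring(short_form, encoding="unicode", short_empty_elements=False)

print(same_content, collapsed, expanded)
```

A consumer that rejects one of the two forms is therefore not treating the input as XML, which is why a workaround on the producing side is needed at all.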

XPATH remove attribute

Hi, does anyone know how to remove an attribute using XPath? In particular, the rel attribute and its text from a link, i.e. given <a href='http://google.com' rel='some text'>Link</a>, I want to remove rel='some text'.
There will be multiple links in the html i am parsing.
You can select items using XPath, but that's all it can do: it is a query language.
You need to use XSLT or an XML parser in order to remove attributes or elements.
As pointed out by Oded, XPath merely identifies XML nodes. To remove or edit XML, you need some additional tooling.
One solution is the Ant-based plugin XMLTask (disclaimer - I wrote this). It provides a simple mechanism to read an XML file, identify parts of that using XPath, and change it (including removing nodes).
e.g.
<remove path="web/servlet/context[@id='redundant']"/>
Have you already tried using JavaScript for this, if that is applicable in your scenario?
var allLinks = document.getElementsByTagName("a");
// Declare the loop counter so it does not leak into the global scope.
for (var i = 0; i < allLinks.length; i++)
{
allLinks[i].removeAttribute("rel");
}
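The split described in the answers, where a path expression selects the nodes and a parser API changes them, might look like this in Python's standard library. This is a sketch that assumes well-formed markup; real-world HTML would need an HTML parser instead, and the second link is a made-up example.

```python
import xml.etree.ElementTree as ET

markup = (
    "<div>"
    "<a href='http://google.com' rel='some text'>Link</a>"
    "<a href='http://example.com'>No rel</a>"
    "</div>"
)
root = ET.fromstring(markup)

# The path expression only selects the links that have a rel attribute...
for link in root.findall(".//a[@rel]"):
    # ...and the parser's own API performs the removal.
    del link.attrib["rel"]

cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```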
