I am trying to scrape data from an XML file using Scrapy.
The file is structured as follows:
<feed xml:base="https://example.com/sap/...">
<entry><id>http://example.com/.../idset</id>
<m:properties>
<d:SubID>xyz</d:SubID>
<d:Posting>123456</d:Posting>
<d:Title>BoringTitle</d:Title>
</m:properties>
</entry>
</feed>
In Scrapy I import the atom namespace:
xxs = XmlXPathSelector(response)
xxs.register_namespace("atom", "http://www.w3.org/2005/Atom")
And it is possible to extract some of the data with
xxs.xpath("//atom:entry").extract()
However, I found it impossible to select the data with a colon:
<d:Title>BoringTitle</d:Title>
What would be the right xpath to print the title?
Maybe there is a simple answer, I am a mechanical engineer doing this for a hobby project.
Any help would be appreciated!
Kind regards
John
As mentioned in the question comments, you need to add a namespace for d as well.
However, in your case, it may be better to simply remove all namespaces and work without them.
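To make the first suggestion concrete, here is a minimal sketch using Python's standard-library ElementTree rather than Scrapy itself. The URIs for the d and m prefixes are an assumption (they are the ones commonly seen in OData feeds); the question does not show the xmlns declarations, so check the top of your actual feed and substitute whatever it declares. The sample document is a cut-down, hypothetical version of the one in the question.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: these URIs are typical for OData feeds; the real ones are
# whatever the xmlns:d / xmlns:m attributes in your feed declare.
NAMESPACES = {
    "atom": "http://www.w3.org/2005/Atom",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

# Hypothetical sample modelled on the question's feed.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry>
    <id>http://example.com/idset</id>
    <m:properties>
      <d:Posting>123456</d:Posting>
      <d:Title>BoringTitle</d:Title>
    </m:properties>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# The prefix in the path is resolved through the mapping passed as the
# second argument, not through the prefixes used inside the document.
titles = [el.text for el in root.findall(".//d:Title", NAMESPACES)]
print(titles)
```

In Scrapy the equivalent would be a second register_namespace call for "d" with the same URI, followed by an XPath such as //d:Title/text(). For the second suggestion, Scrapy selectors provide a remove_namespaces() method, after which a plain //Title should match.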
I am new to ruby and XML. I have been given an XML file and asked to do some data manipulation in that.
For ex. consider the below XML file.
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to> Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
They are asking me to extract the strings inside the tags (e.g. "Tove", "Jani"), do some manipulation on them (e.g. replacing "Tove" with "John"), and write the data back to the same XML document.
I know ruby has a lot of gems and utilities and there must be a good utility to do it. If someone has any idea about any utility to do this work easily then just let me know.
And if there is no utility then if someone could give me some idea on how to proceed with it then it would be good.
One way is to use REXML that comes as part of the standard library.
Another way is to use Nokogiri (I would recommend using this).
Here are some good tutorials that will definitely help you:
http://ruby.bastardsbook.com/chapters/html-parsing/
https://blog.engineyard.com/2010/getting-started-with-nokogiri/
This is my XML
<my_xml>
<record>
<p>hello <b>world</b> this is some html</p>
</record>
</my_xml>
Can I use XPath to return the following?
<p>hello <b>world</b> this is some html</p>
my_xml/record/child::*
child::* selects all element children of the context node
see details
The quick answer is, no. You can't accomplish this with XPath, but, once you select the parent node (i.e. "record" in your example), you should be able to manipulate it in whichever language you are using to parse the XML. Unfortunately, it may not be "easy".
It sounds like you would want something like the innerHTML property, but for XML DOM instead of the HTML DOM. Unfortunately, nothing like this exists for the XML DOM. If you don't care about the nodes themselves, you could use the textContent property; in the case of your example, you would get "hello world this is some html", which doesn't seem to be what you want.
Check out this similar question, which includes a parsing algorithm in Java. It seems that you will need to write a similar algorithm in whichever language you're using to parse the XML.
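As an illustration of "select the parent, then serialize in the host language", here is a small sketch in Python with the standard-library ElementTree. It is one possible approach and not the Java algorithm from the linked question; the whitespace of the original document is dropped to keep the output simple.

```python
import xml.etree.ElementTree as ET

doc = "<my_xml><record><p>hello <b>world</b> this is some html</p></record></my_xml>"
root = ET.fromstring(doc)

# Select the parent node, then serialize each of its element children
# back to markup and join the pieces: a rough "innerXML" for <record>.
record = root.find("record")
inner = "".join(ET.tostring(child, encoding="unicode") for child in record)
print(inner)
```

Note that text sitting directly inside record (before its first child element) would not be included by this sketch; it would need record.text prepended.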
For anyone looking for this in the future, this IS very much possible to do using a DOT, that will return the entire node content as text (at least in MSSQL xpath it does).
'(/my_xml/record/.)[1]'
Can anyone please let me know how to extract the XPaths from a webpage using Selenium WebDriver? I wouldn't want to use Firebug or any other tools; a piece of code should extract all the XPaths in a given webpage.
This isn't possible, for a number of reasons. First and foremost, for any given XML document there are infinitely many XPaths for any given node.
Consider this XML document:
<root>
<a>
<b/>
</a>
<c/>
</root>
Simple enough, but let's look at element <b>. Below are some of the XPaths for it:
/root/a/b
/root/a[1]/b[1]
/root/*/b
/root//b
/root/a/*[1]
//b[count(ancestor::a) = 1]
Get the point? Which of these is the right one? An XPath is a way of describing one or more sets of elements in an XML document, based on a given condition. Without a known starting point and a specific desired output, there are unlimited ways to describe an element; that's why it is so powerful.
The WebDriver doesn't do this for you, there's no included functionality for this.
If you really want it (and it definitely is possible and it's not that hard), you can write it yourself or adapt some of the existing solutions:
Firebug's source of its Copy XPath functionality
some random untested JS solution I found online
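If you do write it yourself, the usual choice is to emit one canonical, index-based path per element. The sketch below shows that idea in Python with the standard-library ElementTree on an already-parsed document. It is not WebDriver functionality, and positional_xpaths is a hypothetical helper name; for a live page you would first need the page source as well-formed XML, or port the same walk to JavaScript.

```python
import xml.etree.ElementTree as ET


def positional_xpaths(root):
    """Yield one absolute, index-based XPath per element, in document order."""

    def walk(element, path):
        yield path
        seen = {}  # how many siblings with each tag name we have passed so far
        for child in element:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    yield from walk(root, f"/{root.tag}")


doc = "<root><a><b/></a><c/></root>"
paths = list(positional_xpaths(ET.fromstring(doc)))
for p in paths:
    print(p)
```

Every path produced this way selects exactly one element, which is typically what "the XPath of a node" is taken to mean in copy-XPath tools, but it is only one of the many valid answers listed above.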
I'm unsure about this. Would having PHP (or I guess any template language like Django's or Mako or whatever) inside an HTML file prevent me from making changes to it with XPath?
I'm very new to XPath. I would think that you could not, but as I said, I'm unsure.
XPath is a query language. You use it to query XML content, not change it.
You can use XPath in conjunction with other technologies (XSLT is the first one that comes to mind) to query your XML and then use the results of these queries to transform your XML.
XPath doesn't change the XML document.
Use XSLT or any other XPath-hosting language that can produce a new XML document.
Hi, does anyone know how to remove an attribute using XPath? In particular, the rel attribute and its text from a link, i.e. <a href='http://google.com' rel='some text'>Link</a>, where I want to remove rel='some text'.
There will be multiple links in the html i am parsing.
You can select items using xpath, but that's all it can do - it is a query language.
You need to use XSLT or an XML parser in order to remove attributes/elements.
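For example, here is a minimal sketch in Python using the standard-library ElementTree: the XPath-style expression only selects the links, and the parser's API does the actual removal. It assumes the markup is well-formed XML (real-world HTML often is not, in which case an HTML-aware parser would be needed), and the sample markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample with several links.
html = (
    "<div>"
    "<a href='http://google.com' rel='some text'>Link</a>"
    "<a href='http://example.com'>Other</a>"
    "</div>"
)
root = ET.fromstring(html)

# XPath does the selecting: every <a> that has a rel attribute.
for link in root.findall(".//a[@rel]"):
    # The tree API does the changing.
    del link.attrib["rel"]

cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)
```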
As pointed out by Oded, Xpath merely identifies XML nodes. To remove/edit XML, you need some additional tooling.
One solution is the Ant-based plugin XMLTask (disclaimer - I wrote this). It provides a simple mechanism to read an XML file, identify parts of that using XPath, and change it (including removing nodes).
e.g.
<remove path="web/servlet/context[@id='redundant']"/>
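Have you tried using JavaScript for this, if that is applicable in your scenario?
// Collect every <a> element in the document.
var allLinks = document.getElementsByTagName("a");
// Strip rel from each link; removeAttribute does nothing if the attribute is absent.
for (var i = 0; i < allLinks.length; i++) {
  allLinks[i].removeAttribute("rel");
}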