What's the xpath syntax to get tag names? - ruby

I'm using Nokogiri to parse a large XML file. Say I've got the following structure:
<menagerie>
<penguin>Pablo</penguin>
<penguin>Mortimer</penguin>
<bull>Ferdinand</bull>
<aardvark>James Cornelius Madison Humphrey Zophar Handlebrush III</aardvark>
</menagerie>
I can count the non-penguins like this:
xml.xpath('//menagerie//*[not(self::penguin)]').length # => 2
But how do I get a list of the tags, like this? (The exact format isn't important; I just want to visually scan the non-penguins.)
bull
aardvark
Update
This gave me the list I wanted - thanks Oded and TMN and delnan!
xml.xpath('//menagerie/*[not(self::penguin)]').each do |node|
  puts node.name
end

You can use the name() or local-name() XPath function.
See the examples on zvon.
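For readers outside Ruby, here is a minimal sketch of the same idea using only Python's standard library. It is an illustration, not part of the original answers: ElementTree supports only a subset of XPath and has no name() function, so the tag name is read from each node in host code, much like node.name in the update above. The shortened sample XML and the function name non_penguin_tags are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the menagerie from the question (assumed sample data).
XML = """<menagerie>
<penguin>Pablo</penguin>
<penguin>Mortimer</penguin>
<bull>Ferdinand</bull>
<aardvark>James</aardvark>
</menagerie>"""


def non_penguin_tags(xml_text):
    """Return the tag names of all direct children that are not <penguin>."""
    root = ET.fromstring(xml_text)
    # Iterating over an element yields its direct children in document order.
    return [child.tag for child in root if child.tag != "penguin"]


print(non_penguin_tags(XML))  # ['bull', 'aardvark']
```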

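I know it's a bit outdated, but you should use xml.xpath('//menagerie/*[not(self::penguin)]/name()') as the expression. Note the slash, not the dot: this is how you apply a function to each selected node in XPath. Be aware that a function call as a path step is XPath 2.0 syntax; Nokogiri evaluates XPath 1.0, so this form may be rejected there.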

Related

How to extract items inside a table using scrapy

I want to extract all the functions listed inside the table at the link below: python functions list
I have tried using the Chrome developer console to get the exact XPath to use in spider.py, as below:
$x('//*[@id="built-in-functions"]/table[1]/tbody//a/@href')
but this returns a list of all the hrefs (which I think is what the XPath expression refers to).
I believe I need to extract the text instead, but appending /text() to the above XPath returns nothing. Can someone please help me extract the function names from the table?
I think this should do the trick
response.css('.docutils .reference .pre::text').extract()
a non-exact xpath equivalent of it (but that also works in this case) would be:
response.xpath('//table[contains(@class, "docutils")]//*[contains(@class, "reference")]//*[contains(@class, "pre")]/text()').extract()
Try this:
for td in response.css("#built-in-functions > table:nth-child(4) td"):
    print(td.css("span.pre::text").extract_first())
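To see why the selectors above target the span rather than the href, here is a small sketch that runs without Scrapy. An attribute such as @href has no text children, which is why appending /text() to it returns nothing; the function name sits in the span inside the link. The FRAGMENT below is a hypothetical, simplified stand-in for the documentation table (not the real page), and ElementTree only handles well-formed markup, so a real page would still need an HTML parser such as the one Scrapy provides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the documentation table.
FRAGMENT = """<div id="built-in-functions">
<table class="docutils">
<tr>
<td><a class="reference internal" href="#abs"><span class="pre">abs()</span></a></td>
<td><a class="reference internal" href="#all"><span class="pre">all()</span></a></td>
</tr>
</table>
</div>"""


def function_names(markup):
    """Return the text of every span with class 'pre' inside a table."""
    root = ET.fromstring(markup)
    # ElementTree's XPath subset only offers exact attribute matches,
    # so this is stricter than contains(@class, "pre").
    spans = root.findall(".//table//span[@class='pre']")
    return [span.text for span in spans]


print(function_names(FRAGMENT))  # ['abs()', 'all()']
```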

xpath query url with one folder depth only

I am using this XPath query successfully:
//div[(@class="result")]//a[contains(@href,"pinterest.com")]/@href
The URL I am running the XPath query against (with simple_html_dom.php) is this one here.
Now, I would like to find results for pinterest.com/one-folder-deep-only and exclude all URLs deeper than one directory, like pinterest.com/one-folder-deep-only/this or pinterest.com/one-folder-deep-only/this/this. I have no idea if there is a way to achieve that. Have googled a lot, but not found anything. Maybe my search terms weren't the best.
Do you have any ideas? Thanks for helping me here.
I am testing the query using the Chrome XPath Helper.
"//" matches at every depth. Use a single "/" before the "a" step instead, so that only immediate children are evaluated:
//div[(@id="first-result")]/a[contains(@href,"url.com")]/@href
Note the use of / instead of // before the "a" tag.
Try the XPath below to select @href from the required anchors only:
//a[contains(@href, "url.com") and not(contains(substring-after(./@href, 'url.com/'), "/"))]/@href
Solution for XPath 2.0:
//a[contains(@href, "url.com") and count(tokenize(@href, "/"))=2]/@href
Note that if the href in the real HTML source starts with "http://url.com", you should specify =4 instead of =2.
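If the hrefs end up in host code anyway, the depth check can also be done outside XPath. The following is a minimal sketch, assuming Python and its standard urllib.parse module; the function name is_one_level_deep and the default host "url.com" are made up for the example. It counts non-empty path segments, so it does not depend on whether the href carries a scheme.

```python
from urllib.parse import urlparse


def is_one_level_deep(href, host="url.com"):
    """Return True if href points at host with exactly one path segment."""
    # Without a scheme, urlparse treats everything as a path, so prefix "//"
    # to make it recognise the host part.
    parsed = urlparse(href if "://" in href else "//" + href)
    netloc = parsed.netloc.lower()
    if netloc != host and not netloc.endswith("." + host):
        return False
    # Drop empty segments so a trailing slash does not count as extra depth.
    segments = [part for part in parsed.path.split("/") if part]
    return len(segments) == 1


print(is_one_level_deep("http://url.com/one-folder-deep-only"))       # True
print(is_one_level_deep("http://url.com/one-folder-deep-only/this"))  # False
```

Unlike the tokenize() count, this treats "url.com/one/" and "url.com/one" the same way.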

How to use substring() with Import.io?

I'm having some issues with XPath and import.io and I hope you'll be able to help me. :)
The html code:
<a href="page.php?var=12345">
For the moment, I manage to extract the content of the href ( page.php?var=12345 ) with this:
./td[3]/a[1]/@href
Though, I would like to just collect: 12345
substring might be the solution but it does not seem to work on import.io as I use it...
substring(./td[3]/a[1]/@href,13)
Any ideas of what the problem is?
Thanks a lot in advance!
Try using this for the xpath: (Have the field selected as Text)
.//*[@class='oeil']/a/@href
Then use this for your regex:
([^=]*)$
This will get you the ISBN number you are looking for.
import.io only supports XPath functions when they return a node list.
Your path expression is fine, but perhaps it should be
substring(./td[3]/a[1]/@href,14)
"Does not seem to work" is not a very clear description of what is wrong. Do you get error messages? Is the output wrong? Do you have any code surrounding the path expression you could show?
You can use substring, but using substring-after() would be even better.
substring-after(/a/@href,'=')
assuming as input the tiny snippet you have shown:
<a href="page.php?var=12345"/>
will select
12345
and taking into account the structure of your input
substring-after(./td[3]/a[1]/@href,'=')
The leading . makes the path relative to the current context node, so ./td[3] looks only at immediate child td nodes of that node. I trust you know what you are doing.
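The off-by-one between substring(..., 13) and substring(..., 14) is easy to check outside import.io. Here is a small sketch, assuming Python; the helper name substring_after is made up and only mimics the XPath function of the same name. XPath string positions are 1-based, while Python slices are 0-based.

```python
def substring_after(text, separator):
    """Mimic XPath substring-after(): the text following the first separator."""
    _before, found, after = text.partition(separator)
    # XPath returns an empty string when the separator does not occur.
    return after if found else ""


href = "page.php?var=12345"

# XPath substring(href, 13) starts at the 13th character, which is "=".
print(href[13 - 1:])               # =12345
# XPath substring(href, 14) starts one character later.
print(href[14 - 1:])               # 12345
print(substring_after(href, "="))  # 12345
```

substring-after() keeps working if the file name or the parameter name changes length, which is why it is the more robust choice here.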

What would be the best way to take a string of html, chop it up, and put each piece into an array?

I have a general idea of how I can do this, but can't pinpoint how exactly to get it done. I am sure it can be done with a regex of some sort. Wondering if anyone here can point me in the right direction.
If I have a string of html such as this
some_html = '<div><b>This is some BOLD text</b></div>'
I want to divide it into logical pieces, and then put those pieces into an array so I end up with a result like this
html_array = ["<div>", "<b>", "This is some BOLD text", "</b>","</div>" ]
Rather than use regex I'd use the nokogiri gem (a gem for parsing html written by Aaron Patterson - contributor to Rails and Ruby). Here's a sample of how to use it:
html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")
You can then call html_doc.children to get a nodeset and work your way from there
html_doc.children # returns a nodeset
Use an HTML parser, for instance, Nokogiri. Using SAX you can add tags/elements to the array as events are triggered.
It's not a good idea to try to regex HTML, unless you're planning to treat only a small determined subset of it.
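As a rough illustration of the event-driven (SAX-style) approach, here is a sketch using Python's standard html.parser module rather than Nokogiri; the class name PieceCollector and the function html_pieces are made up for the example. Each callback appends one piece to the list as the parser encounters it.

```python
from html.parser import HTMLParser


class PieceCollector(HTMLParser):
    """Collect start tags, text and end tags in document order."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser reports the tag name in lower case.
        self.pieces.append("</" + tag + ">")

    def handle_data(self, data):
        self.pieces.append(data)


def html_pieces(html):
    collector = PieceCollector()
    collector.feed(html)
    collector.close()
    return collector.pieces


print(html_pieces("<div><b>This is some BOLD text</b></div>"))
# ['<div>', '<b>', 'This is some BOLD text', '</b>', '</div>']
```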
some_html.split(/(<[^>]*>)/).reject{|x| '' == x}
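The one-liner above translates almost directly to Python, shown here only as a sketch for comparison (the function name split_html is made up). As in Ruby, the capturing group makes the split keep the tags, and the empty strings between adjacent tags are filtered out afterwards. The same caveat applies: this is only safe for a small, well-behaved subset of HTML.

```python
import re


def split_html(html):
    """Split a string into tags and the text between them."""
    # The parentheses form a capturing group, so the matched tags are kept
    # in the result instead of being discarded as separators.
    parts = re.split(r"(<[^>]*>)", html)
    return [part for part in parts if part != ""]


some_html = "<div><b>This is some BOLD text</b></div>"
print(split_html(some_html))
# ['<div>', '<b>', 'This is some BOLD text', '</b>', '</div>']
```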

XPath concat multiple nodes

I'm not very familiar with xpath. But I was working with xpath expressions and setting them in a database. Actually it's just the BAM tool for biztalk.
Anyway, I have an XML file which could look like:
<File>
<Element1>element1</Element1>
<Element2>element2</Element2>
<Element3>
<SubElement>sub1</SubElement>
<SubElement>sub2</SubElement>
<SubElement>sub3</SubElement>
</Element3>
</File>
I was wondering if there is a way to use an xpath expression of getting all the SubElements concatted? At the moment, I am using:
/*[local-name()='File']/*[local-name()='Element3']/*[local-name()='SubElement']
This works if it only has one index. But apparently my xml sometimes has more nodes, so it gives NULL. I could just use
/*[local-name()='File']/*[local-name()='Element3']/*[local-name()='SubElement'][1]
but I need all the nodes. Is there a way to do this?
Thanks a lot!
Edit: I changed the XML, I was wrong, it's different, it should look like this:
<item>
<element1>el1</element1>
<element2>el2</element2>
<element3>el3</element3>
<element4>
<subEl1>subel1a</subEl1>
<subEl2>subel2a</subEl2>
</element4>
<element4>
<subEl1>subel1b</subEl1>
<subEl2>subel2b</subEl2>
</element4>
</item>
And I need to have a one line code to get a result like: "subel2a subel2b";
I need the one line because I set this xpath expression as an xml attribute (not my choice, it's specified). I tried string-join but it's not really working.
string-join(/file/Element3/SubElement, ',')
/File/Element3/SubElement will match all of the SubElement elements in your sample XML. What are you using to evaluate it?
If your evaluation method is subject to the "first node rule", then it will only match the first one. If you are using a method that returns a nodeset, then it will return all of them.
You can get all SubElements by using:
//SubElement
But this won't keep them grouped together how you want. You will want to do a query for all elements that contain a SubElement (basically do a search for the parent of any SubElements).
//SubElement/parent::*
Once you have that, you could (depending on your programming language) loop through the parents and concatenate the SubElements.
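As a sketch of that last step, here is what "select the nodes, then join them in host code" could look like with Python's standard library. This is an assumption about the host language (the question uses BizTalk BAM, where this may not be available), and the function name joined_text is made up. It uses the edited XML from the question and produces the requested "subel2a subel2b".

```python
import xml.etree.ElementTree as ET

# The edited sample from the question.
XML = """<item>
<element1>el1</element1>
<element2>el2</element2>
<element3>el3</element3>
<element4>
<subEl1>subel1a</subEl1>
<subEl2>subel2a</subEl2>
</element4>
<element4>
<subEl1>subel1b</subEl1>
<subEl2>subel2b</subEl2>
</element4>
</item>"""


def joined_text(xml_text, path, separator=" "):
    """Join the text of every node matched by path, in document order."""
    root = ET.fromstring(xml_text)
    # findall() returns all matches, so there is no "first node" truncation.
    return separator.join(node.text or "" for node in root.findall(path))


print(joined_text(XML, "element4/subEl2"))  # subel2a subel2b
```

Where an XPath 2.0 engine is available, string-join(//element4/subEl2, ' ') expresses the same thing in one line; XPath 1.0 has no equivalent for an arbitrary number of nodes.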
