I want to edit a node in each item of an RSS feed in Ruby, using Nokogiri and XPath.
I can read the value of this node, but I cannot edit it:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(URI.open("http://www.pcinpact.com/rss/news.xml"))
doc.xpath('//item').each do |i|
pp i.xpath('title').first.text
end
I get the value of the title node in each item node.
I want to edit the "content", but I can't find out how to do that with XPath.
Obviously, I want to get back my original XML with the modifications applied.
Any ideas?
To set the content, use the content= method.
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(URI.open("http://www.pcinpact.com/rss/news.xml"))
doc.xpath('//item').each do |i|
  # Replace the text of the first <title> child of this item
  i.xpath('title').first.content = "My new title"
end
For more on how to manipulate a document in Nokogiri, check out "Modifying an HTML / XML Document".
I'm trying to save only the links to the sample pages on this website: MusicRadar
require 'open-uri'
require 'nokogiri'
link = 'https://www.musicradar.com/news/tech/free-music-samples-royalty-free-loops-hits-and-multis-to-download'
html = OpenURI.open_uri(link)
doc = Nokogiri::HTML(html)
# Used grep because every sample link on that page ends with '-samples'
doc.xpath('//div/a/@href').grep(/-samples/)
The problem is that it only finds 3 of those links.
What am I doing wrong?
And what if I wanted to open each of those links?
CSS selectors are often more convenient than XPath (if the document structure is good enough for that).
Your XPath is roughly equivalent to the CSS selector div > a, but that is too restrictive here, because some of the links are inside p elements, for example.
If you need all links containing -samples, you can use the *= attribute selector:
doc.css('a[href*="-samples"]') # return Nokogiri::XML::NodeSet with matched elements
doc.css('a[href*="-samples"]').map { |a| a[:href] } # return array of URLS
Consider an XML document:
<string id = "id1" ><p> Text1 </p></string>
<string id = "id2" > Text2 </string>
I want to parse this document in Ruby and build a hash like {"id1" => "Text1", "id2" => "Text2"}.
I tried the Nokogiri and REXML tutorials, but they were not much help. Can someone suggest a way to do it?
It isn't possible to achieve the desired result in a single xpath query. You can select and iterate over all the string nodes and extract information like this:
require 'nokogiri'
doc = Nokogiri::XML(File.open("example.xml"))
result = {}
doc.xpath("//string").each do |node|
  id = node.get_attribute("id")
  # strip (not strip!) always returns a string; strip! returns nil when nothing is removed
  text = node.inner_text.strip
  result[id] = text
end
puts result
Output:
{"id1"=>"Text1", "id2"=>"Text2"}
So I am using this:
Net::HTTP.get(URI.parse(url))
It works perfectly.
The issue I am having is that the page it returns is a full HTML document, with html, head, body, etc. tags.
There is a label element in the body with an id of "Result", and I only want to get back the text of "Result", not all the HTML formatting.
Can this be done?
Well, to get only part of the content of an HTML page you have to use an HTML parser, which will be Nokogiri in this case.
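require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open(url))
doc.css('#Result').each do |re|
  puts re.to_s      # the whole element, markup included
  # puts re.content # only the text inside the element
end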
I have some HTML that looks like:
<dt>
<a>Hello</a>
(2009)
</dt>
I already have all my HTML loaded into a variable called record. I need to parse out the year, i.e. 2009, if it exists.
How can I get the text inside the dt tag but not the text inside the a tag? I've used record.search("dt").inner_text and this gives me everything.
It's a trivial question but I haven't managed to figure this out.
To get all the direct children with text, but not any further sub-children, you can use XPath like so:
doc.xpath('//dt/text()')
Or if you wish to use search:
doc.search('dt').xpath('text()')
Using XPath to select exactly what you want (as suggested by @Casper) is the right answer.
def own_text(node)
# Find the content of all child text nodes and join them together
node.xpath('text()').text
end
Here's an alternative, fun answer :)
def own_text(node)
node.clone(1).tap{ |copy| copy.element_children.remove }.text
end
Seen in action:
require 'nokogiri'
root = Nokogiri.XML('<r>hi <a>BOO</a> there</r>').root
puts root.text #=> hi BOO there
puts own_text(root) #=> hi there
The dt element has two children, so you can access the last one with:
doc.search("dt").children.last.text
The XML file I am trying to parse has all the data contained in attributes. I found how to build the string to insert into the text file.
I have this XML file:
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
And I want to parse it into a text file like this with the class ref duplicated for each property:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
This is the code I have so far:
require 'nokogiri'
doc = Nokogiri::XML(File.open("file.xml"), nil, 'UTF-8') do |config|
config.strict
end
content = doc.xpath("//ig:prescribed_item/@class_ref").map {|i|
i.search("//ig:prescribed_item/ig:prescribed_property/@property_ref").map { |d| d.text }
}
puts content.inspect
content.each do |c|
puts c.join('|')
end
I'd simplify it a bit using CSS accessors:
xml = <<EOT
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
EOT
require 'nokogiri'
doc = Nokogiri::XML(xml)
data = [ %w[ class_ref property_ref is_required UOM_ref] ]
doc.css('|prescribed_item').each do |pi|
pi.css('|prescribed_property').each do |pp|
data << [
pi['class_ref'],
pp['property_ref'],
pp['is_required'],
pp.at_css('|prescribed_unit_of_measure')['UOM_ref']
]
end
end
puts data.map{ |row| row.join('|') }
Which outputs:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
Could you explain this line in greater detail: "pp.at_css('|prescribed_unit_of_measure')['UOM_ref']"?
In Nokogiri, there are two types of "find a node" methods: the "search" methods return all nodes that match a particular accessor as a NodeSet, and the "at" methods return the first Node of that NodeSet, that is, the first Node encountered that matched the accessor.
The "search" methods are things like search, css, xpath and /. The "at" methods are things like at, at_css, at_xpath and %. Both search and at accept either XPath or CSS accessors.
Back to pp.at_css('|prescribed_unit_of_measure')['UOM_ref']: At that point in the code pp is a local variable containing a "prescribed_property" Node. So, I'm telling the code to find the first node under pp that matches the CSS |prescribed_unit_of_measure accessor, in other words the first <dt:prescribed_unit_of_measure> tag contained by the pp node. When Nokogiri finds that node, it returns the value of the UOM_ref attribute of the node.
As an FYI, the / and % operators are aliased to search and at respectively in Nokogiri. They're part of its "Hpricot" compatibility; we used to use them a lot when Hpricot was the XML/HTML parser of choice, but they're not idiomatic for most Nokogiri developers. I suspect that is to avoid confusion with the regular use of the operators; at least it is in my case.
Also, Nokogiri's CSS accessors have some extra-special juiciness: they support namespaces, like the XPath accessors do, only they use | as the separator. Nokogiri will let us ignore the namespaces, which is what I did. You'll want to nose around in the Nokogiri docs for CSS and namespaces for more information.
There are definitely ways of parsing based on attributes.
The Engine Yard article "Getting started with Nokogiri" has a full description.
But quickly, the examples they give are:
To match "h3" tags that have a class attribute, we write:
h3[@class]
To match "h3" tags whose class attribute is equal to the string "r", we write:
h3[@class = "r"]
Using the attribute matching construct, we can modify our previous query to:
//h3[@class = "r"]/a[@class = "l"]