I need to read in a file which will be in xml format but all crammed into a single line, and I need to parse that line to find a specific property and replace its value with something I have specified.
The file might contain:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><VerificationPoint type="Screenshot" version="2"><Description/><Verification object=":qP1B11_QLabel" type="PNG">
I need to search through this line, find the property "Verification object=" and replace the :qP1B11 with my own string. Please note that I don't want to replace the _QLabel" type="PNG"> part of the string if possible.
I can't use sub as I don't know the value of the property, which could be anything. I believe I should be able to do this with regular expressions, but I have never had to use them before and all the examples I've seen just leave me more confused than before.
If anyone can present me with an elegant answer (and an explanation if using regexp) it would be a huge help!
Thanks
You have XML so use an XML parser. Nokogiri will make short work of that:
doc = Nokogiri::XML(that_string)
doc.search('Verification').each do |node|
node['object'] = node['object'].sub(/:qP1B11/, 'PANCAKES')
end
new_string = doc.to_xml
# "<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<VerificationPoint type="Screenshot" version="2">\n <Description/>\n <Verification object="PANCAKES_QLabel" type="PNG">\n</Verification>\n</VerificationPoint>\n"
You can adjust the output format using the options for to_xml.
If you only have one <Verification> then you could do it like this:
node = doc.at('Verification')
node['object'] = node['object'].sub(/:qP1B11/, 'PANCAKES')
new_string = doc.to_xml
In either case you'd adjust your regex and replacement to suit your needs.
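Since the value in front of _QLabel is not known ahead of time, the pattern passed to sub can match "anything before the suffix" instead of a literal :qP1B11. Here is a minimal sketch, assuming the part to keep is always a fixed suffix at the end of the attribute value; replace_prefix is a hypothetical helper name, not part of Nokogiri:

```ruby
# Hypothetical helper: swap whatever precedes a known suffix in an attribute value.
def replace_prefix(value, replacement, suffix = "_QLabel")
  # \A anchors at the start of the string and .* consumes the unknown prefix.
  # The lookahead (?=...) requires the suffix to follow, right up to the end
  # of the string (\z), without making the suffix part of the match.
  pattern = /\A.*(?=#{Regexp.escape(suffix)}\z)/
  # Block form, so backslashes in the replacement are not treated specially.
  value.sub(pattern) { replacement }
end

puts replace_prefix(":qP1B11_QLabel", "PANCAKES")   # PANCAKES_QLabel
puts replace_prefix(":anything_else_QLabel", "X")   # X_QLabel
puts replace_prefix("no_suffix_here", "X")          # no_suffix_here (unchanged)
```

With Nokogiri this would presumably slot in where the sub call is used: node['object'] = replace_prefix(node['object'], 'PANCAKES').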
I would like to add things like bullet points "&bull;" to HTML using the XML Builder in Nokogiri, but everything is being escaped. How do I prevent it from being escaped?
I would like the result to be:
<span>&bull;</span>
rather than:
<span>&amp;bull;</span>
I'm just doing this:
xml.span {
xml.text "&bull;\ "
}
What am I missing?
If you define
class Nokogiri::XML::Builder
def entity(code)
doc = Nokogiri::XML("<?xml version='1.0'?><root>&##{code};</root>")
insert(doc.root.children.first)
end
end
then this
builder = Nokogiri::XML::Builder.new do |xml|
xml.span {
xml.text "I can has "
xml.entity 8226 # decimal code point of the bullet, U+2022
xml.text " entity?"
}
end
puts builder.to_xml
yields
<?xml version="1.0"?>
<span>I can has • entity?</span>
PS: this is a workaround only. For a clean solution, please refer to the libxml2 documentation (Nokogiri is built on libxml2) for more help. However, even these folks admit that handling entities can be quite ..err, cumbersome sometimes.
When you're setting the text of an element, you really are setting text, not HTML source. < and & don't have any special meaning in plain text.
So just type a bullet: '•'. Of course your source code and your XML file will have to be using the same encoding for that to come out right. If your XML file is UTF-8 but your source code isn't, you'd probably have to say "\xe2\x80\xa2" (in double quotes, since single quotes leave the escapes uninterpreted), which is the UTF-8 byte sequence for the bullet character as a string literal.
(In general non-ASCII characters in Ruby 1.8 are tricky. The byte-based interfaces don't mesh too well with XML's world of all-text-is-Unicode.)
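To make the encoding point concrete, here is a small sketch of the different ways to spell the bullet in Ruby source. It assumes Ruby 1.9 or later (for the \u escape and force_encoding) and a source file saved as UTF-8; the variable names are only for illustration:

```ruby
# The bullet character U+2022, written three equivalent ways.
typed   = "•"        # literal character; relies on the source file being UTF-8
escaped = "\u2022"   # Unicode escape; always yields a UTF-8 string
# Raw UTF-8 bytes, tagged as UTF-8 so Ruby treats them as one character.
from_bytes = [0xE2, 0x80, 0xA2].pack("C*").force_encoding("UTF-8")

puts typed == escaped       # true
puts escaped == from_bytes  # true
puts escaped.bytes.map { |b| format("%02x", b) }.join(" ")  # e2 80 a2

# Escapes are only interpreted inside double quotes. In single quotes the
# backslashes stay literal, so this is 12 ASCII characters, not a bullet.
puts '\xe2\x80\xa2'.length  # 12
```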
I am using Ruby 1.9.3 with the latest Nokogiri gem. I have worked out how to extract values from an XML file using xpath and specifying the path(?) to the element. Here is the XML file I have:
<?xml version="1.0" encoding="utf-8"?>
<File xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Houses>
<Ranch>
<Roof>Black</Roof>
<Street>Markham</Street>
<Number>34</Number>
</Ranch>
</Houses>
</File>
I use this code to print a value:
doc = Nokogiri::XML(File.open ("C:\\myfile.xml"))
puts doc.xpath("//Ranch//Street")
Which outputs:
<Street>Markham</Street>
This is all working fine but what I need is to write/replace the value. I want to use the same kind of path-style lookup to pass in a value to replace the one that is there. So I want to pass a street name to this path and overwrite the street name that is there. I've been all over the internet but can only find ways to create a new XML or insert a completely new node in the file. Is there a way to replace values by line like this? Thanks.
You want the content= method:
Set the Node’s content to a Text node containing string. The string gets XML escaped, not interpreted as markup.
Note that xpath returns a NodeSet not a single Node, so you need to use at_xpath or get the single node some other way:
doc = Nokogiri::XML(File.open ("C:\\myfile.xml"))
node = doc.xpath("//Ranch//Street")[0] # use [0] to select the first result
node.content = "New value for this node"
puts doc # produces XML document with new value for the node
Using the Gem libxml-ruby, when we parse XML like so:
document = LibXML::XML::Parser.string( xmlData ).parse
for n in document.root.children
# Do something
end
What we actually get is something like this:
root
-node empty
-node with data
-node empty
Same thing with attributes, there's a blank one padding between those we actually care about. What we end up needing to use is :options => LibXML::XML::Parser::Options::NOBLANKS
Why? :(
(Not necessarily an answer, but need formatting.)
What does the XML look like?
This XML:
<baz>
<plugh>ohai</plugh>
</baz>
may contain whitespace text nodes for the CR/LF and indentation between the <baz> and <plugh> opening tags, and the same for between the closing tags. This may or may not be significant whitespace depending on the nature of the XML. Structurally, it's different than:
<baz><plugh>ohai</plugh></baz>
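The structural difference can be seen by listing the direct children of the root element for both forms. This is a minimal sketch using REXML, chosen only because it ships with Ruby; it is not the libxml-ruby gem from the question, and child_summary is a hypothetical helper. It assumes the parser is left at its defaults, which keep whitespace-only text nodes:

```ruby
require "rexml/document"

# Hypothetical helper: describe each direct child of the root element.
def child_summary(xml)
  REXML::Document.new(xml).root.children.map do |child|
    case child
    when REXML::Text    then "text #{child.value.inspect}"
    when REXML::Element then "element <#{child.name}>"
    else child.class.name
    end
  end
end

indented = "<baz>\n  <plugh>ohai</plugh>\n</baz>"
compact  = "<baz><plugh>ohai</plugh></baz>"

puts child_summary(indented)  # a whitespace text node, the element, another text node
puts child_summary(compact)   # only the element
```

The "empty" nodes are the line breaks and indentation between tags, which is why an option such as NOBLANKS makes them disappear.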
I have a very large XML file which I load as a string,
so my XML looks like
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
I want to count the number of occurrences of the string
article ID="5705641" contentstatus="Changed"
How can I convert the ID to a regex?
Here is what I have tried doing
searchstr = 'article ID=\"/[1-9]{7}/\" contentstatus=\"Changed\"'
count = ((xml.scan(searchstr).length)).to_s
puts count
Please let me know how I can achieve this.
Thanks
I'm going to go out on a limb and guess that you're new to Ruby. First, it's not necessary to convert count into a string to puts it. Puts automatically calls to_s on anything you send to it.
Second, it's rarely a good idea to handle XML with string manipulation. I would strongly advise that you use a full-fledged XML parser such as Nokogiri.
That said, you can't embed a regex in a string like that. The entire query string would need to be a regex.
Something like
/article ID="[1-9]{7}" contentstatus="Changed"/
Quotation marks aren't special characters in a regex, so you don't need to escape them.
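As a rough illustration of the difference between a String pattern and a Regexp pattern in scan, here is a sketch on a made-up sample shaped like the question's XML (the sample data and the count_matches helper are hypothetical). One caveat it also shows: [1-9] excludes the digit 0, so an ID such as 5705641 would be missed, whereas \d accepts every digit:

```ruby
# Made-up sample in the shape of the question's XML (for illustration only).
SAMPLE = <<~XML
  <volume contentstatus="Unchanged" idID="0b0000648151c35d">
  <article ID="5756261" contentstatus="Changed"/>
  <article ID="5705641" contentstatus="Changed"/>
  <article ID="5756262" contentstatus="Unchanged"/>
  </volume>
XML

# Hypothetical helper: count how often a pattern occurs in the text.
def count_matches(text, pattern)
  text.scan(pattern).length
end

# A String pattern is matched literally, so the character class is never interpreted.
puts count_matches(SAMPLE, 'article ID="[1-9]{7}" contentstatus="Changed"')  # 0
# A Regexp works, but [1-9] excludes 0, so ID 5705641 is missed.
puts count_matches(SAMPLE, /article ID="[1-9]{7}" contentstatus="Changed"/)  # 1
# \d accepts every digit from 0 to 9.
puts count_matches(SAMPLE, /article ID="\d{7}" contentstatus="Changed"/)     # 2
```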
When in doubt about regex in Ruby, I recommend checking out Rubular.com.
And once again, I can't emphasize enough that I really don't condone trying to manipulate XML via regex. Nokogiri will make dealing with XML a billion times easier and more reliable.
If XPath is an option, it is a preferred way of selecting XML elements. You can use the selector:
//article[@contentstatus="Changed"]
Or, if possible:
count(//article[@contentstatus="Changed"])
Nokogiri is my recommended Ruby XML parser. It's very robust, and is probably the standard for the language now.
I added two more "articles" to show how easily you can find and manipulate the contents, without having to rely on a regex.
require 'nokogiri'
xml =<<EOT
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756263" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
EOT
doc = Nokogiri::XML(xml)
puts doc.search('//article[@contentstatus="Changed"]').size.to_s + ' found'
puts doc.search('//article[@contentstatus="Changed"]').map{ |n| "#{ n['ID'] } #{ n['doi'] } #{ n['idID'] }" }
>> 2 found
>> 5756261 10.1109/TNB.2011.2145270 0b0000648151d8ca
>> 5756263 10.1109/TNB.2011.2145270 0b0000648151d8ca
The problem with using regex with HTML or XML, is they'll break really easily if the XML changes, or if your XML comes from different sources or is malformed. Regex was never designed to handle that sort of problem, but a parser was. You could have XML with line ends after every tag, or none at all, and the parser won't really care as long as the XML is well-formed. A good parser, like Nokogiri can even do fixups if the XML is broken, in order to try to make sense of it, but
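Your current string is almost right. Remove the errant / from around the numbers, then build a Regexp from it, because scan treats a String argument as literal text rather than as a pattern:
searchstr = Regexp.new('article ID=\"[1-9]{7}\" contentstatus=\"Changed\"')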
I am currently doing some XML parsing and I've chosen to use Hpricot because of its ease of use and syntax; however, I am running into some problems. I need to write a piece of XML data that I have found out to another file. However, when I do this the format is not preserved. For example, if the content should look like this:
<dict>
<key>item1</key><value>12345</value>
<key>item2</key><value>67890</value>
<key>item3</key><value>23456</value>
</dict>
Assume that there are many entries like this in the document. I am iterating through the 'dict' items by using
hpricot_element = Hpricot(xml_document_body)
f = File.new('some_new_file.xml', 'w') # open for writing; the default mode is read-only
(hpricot_element/:dict).each { |dict| f.write( dict.to_original_html ) }
After using the above code, I would expect the output to look exactly like the XML shown above. However, to my surprise, the output in the file looks more like this:
<dict>\n", " <key>item1</key><value>12345</value>\n", " <key>item2</key><value>67890</value>\n", " <key>item3</key><value>23456</value\n", " </dict>
I've tried splitting at the "\n" characters and writing to the file one line at a time, but that didn't seem to work either as it did not recognize the "\n" characters. Any help is greatly appreciated. It might be a very simple solution, but I am having trouble finding it. Thanks!
hpricot_element = Hpricot::XML(xml_document_body)
File.open('some_new_file.xml', 'w') {|f| f.write xml_document_body }
Don't use an XML parser if you want the original XML to be written. It is unnecessary. You should still use one if you want to further process the data, though.
Also, for XML, you should be using Hpricot::XML instead of just Hpricot.
My solution was to just replace the literal '\n' characters with line breaks and remove the extra punctuation by simply adding two gsubs that looked like the following:
f.write( dict.to_original_html.gsub('\n', "\n").gsub('" ,"', '') )
I don't know why I didn't see this before. Like I said, it might be an easy answer that I wasn't seeing and that's exactly how it turned out. Thanks for all the answers!