Parsing an XML file using nokogiri to create \index fields for LaTeX - ruby

I'm a professional indexer new to Ruby and nokogiri and I am in need of some assistance.
I'm working on a set of macros that will allow me to take an XML file, output from my indexing software, and parse it into valid \index{} commands for inclusion in a LaTeX source file. Each XML <record> contains at least two <field> tags, so I will have to iterate over the multiple <field> tags to build my \index{} entry.
The following is an example of an index record from the xml file.
<record time="2022-08-27T17:25:12" id="30">
<field><text style="i"/><hide>SS </hide>Titanic<text/></field>
<field>passengers</field>
<field class="locator"><text style="b"/>5<text/></field>
</record>
I will produce intermediate output of this record in the form of:
\index{Titanic#\textit{SS Titanic}!passengers|textbf} 5
(The numeric locator is used to place the \index{} entry at the correct spot in the LaTeX file and won't be included in the LaTeX source file.)
I am using nokogiri to manipulate the xml file and have been able to reach the point where I return a nodelist that contains just the <field> tags for each <record>, but I need to be able to retrieve all the text in the <field>, including the formatting information (if I use the text method on a <field>, it returns "SS Titanic" for example, with all formatting information stripped away).
I'm stuck on how to access the entire text string in the <field> tag. Once I can get that, I have a good idea of how to structure my parser.
Any help will be greatly appreciated.

Does this help?
require 'nokogiri'

# A heredoc avoids having to escape the double quotes inside the XML.
xml = <<~XML
<record time="2022-08-27T17:25:12" id="30">
<field><text style="i"/><hide>SS </hide>Titanic<text/></field>
<field>passengers</field>
<field class="locator"><text style="b"/>5<text/></field>
</record>
XML
fields = Nokogiri::XML(xml).xpath(".//field")
puts fields.first.text #=> SS Titanic
p fields.map(&:text) #=> ["SS Titanic", "passengers", "5"]

Related

Parsing a non-XML document with Nokogiri when the node names are/contain integers

When I run:
#!/usr/bin/env ruby
require 'nokogiri'
xml = <<-EOXML
<pajamas>
<bananas>
<foo>bar</foo>
<bar>bar</bar>
<1>bar</1>
</bananas>
</pajamas>
EOXML
doc = Nokogiri::XML(xml)
puts doc.at('/pajamas/bananas/foo')
puts doc.at('/pajamas/bananas/bar')
puts doc.at('/pajamas/bananas/1')
I get an ERROR: Invalid expression: /pajamas/bananas/1 (Nokogiri::XML::XPath::SyntaxError)
Is this a case of Nokogiri not liking ints as node names and/or is there a work around?
Looking at the documentation, I did not see a workaround to this. Removing the last line eliminates the error and prints the first two nodes as expected.
An XML element with a name that starts with a number is invalid XML.
XML elements must follow these naming rules:
Names can contain letters, numbers, and other characters
Names cannot start with a number or punctuation character
Names cannot start with the letters xml (or XML, or Xml, etc)
Names cannot contain spaces
Any name can be used, no words are reserved.
You're trying to parse invalid XML with an XML parser; it's just not going to work. If you're really getting <1> as a tag and can't control that somehow, I'd suggest renaming the tags with a regex before handing the text to Nokogiri.

Can't YAML::load an YAML:dumped XML value

When trying to YAML::load a value produced by YAML::dump I get an error "did not find expected key while parsing a block mapping at line 1 column 1"
The YAML::dump value was written to an XML file as:
<format_store>---:text_formatting: '':url_pattern: ''</format_store>
If I look into the database, it is a text field with line breaks in it.
---
:text_formatting: ''
:url_pattern: ''
So it looks like the conversion from YAML::dump into the XML format dropped the line breaks.
I explicitly use the YAML::dump format for text fields. XML does not allow line breaks in element values. It would have to be escaped in some way and I assumed YAML would take care of that.
Is there a better way to dump/load text fields, or is there something I'm missing here?
Option 1: Wrap the YAML content in a CDATA section (<![CDATA[ ... ]]>), as suggested in "Adding a new line/break tag in XML".
Option 2: Configure your YAML library to dump mappings using flow style (e.g. {':text_formatting' : '', ':url_pattern' : ''}). The exact method for accomplishing this will depend on the YAML library you are using and may require a bit of custom coding.

How to parse USPTO XML files with Ruby and Nokogiri?

I've spent the whole day trying to figure out how to parse USPTO bulk XML files. I downloaded one of those files, unzipped it, and then ran:
Nokogiri::XML(File.open('ipg140513.xml'))
But it seems to load only the first element, not all the patents (the file contains a few thousand).
What am I doing wrong?
The file you linked to, and presumably the others, are not valid XML because they do not have a root element. From Wikipedia:
Each XML document has exactly one single root element.
Nokogiri hints at this if you look at the errors (suggested by Arup Rakshit), as detailed in the documentation:
Nokogiri::XML(File.open("/Users/b/Downloads/ipg140513.xml")).errors # =>
# [
# #<Nokogiri::XML::SyntaxError: XML declaration allowed only at the start of the document>,
# #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>
# ]
The file appears to be a concatenation of a series of valid XML files, each having a <us-patent-grant/> as its root element.
Fortunately, Nokogiri can handle this invalid XML if you process it as a document fragment. Try this:
Nokogiri::XML::DocumentFragment.parse(File.read('ipg140513.xml')).children.select{|element| element.name == 'us-patent-grant'}
The select chooses the root node of each concatenated document, ignoring the processing instructions and DTD declarations.
Alternately, you could pre-process the file and split it into its constituent, correctly-formatted documents. Parsing a 650MB document all at once is quite slow and memory intensive.

How can I use Nokogiri with Ruby to replace values in existing xml?

I am using Ruby 1.9.3 with the latest Nokogiri gem. I have worked out how to extract values from an XML file using XPath by specifying the path to the element. Here is the XML file I have:
<?xml version="1.0" encoding="utf-8"?>
<File xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Houses>
<Ranch>
<Roof>Black</Roof>
<Street>Markham</Street>
<Number>34</Number>
</Ranch>
</Houses>
</File>
I use this code to print a value:
doc = Nokogiri::XML(File.open ("C:\\myfile.xml"))
puts doc.xpath("//Ranch//Street")
Which outputs:
<Street>Markham</Street>
This is all working fine but what I need is to write/replace the value. I want to use the same kind of path-style lookup to pass in a value to replace the one that is there. So I want to pass a street name to this path and overwrite the street name that is there. I've been all over the internet but can only find ways to create a new XML or insert a completely new node in the file. Is there a way to replace values by line like this? Thanks.
You want the content= method:
Set the Node’s content to a Text node containing string. The string gets XML escaped, not interpreted as markup.
Note that xpath returns a NodeSet not a single Node, so you need to use at_xpath or get the single node some other way:
doc = Nokogiri::XML(File.open("C:\\myfile.xml"))
node = doc.xpath("//Ranch//Street")[0] # use [0] to select the first result
node.content = "New value for this node"
puts doc # produces XML document with new value for the node

How do I use the XPath tokenizer function in Nokogiri?

I am attempting to extract information from the following HTML using Nokogiri and XPath.
<p>Friday, February 1<br><strong>Apple <br> Orange</strong></p>
e.xpath('./text()[following-sibling::br]')
Gives me the date just fine. I want to then grab the text inside the strong node and split on br. There may be many fruits separated by br or there may just be one with no br. I would ideally like to accomplish this in xpath instead of code since I'm essentially defining a bunch of parsers via JSON.
Right now I'm thinking that I should use the tokenizer function and pass the text in the strong tag. I thought that should look like this:
e.xpath('./strong[fn::tokenize(.,"<br>")]')
and have also tried
e.xpath('fn::tokenize(./strong,"<br>")')
but I am getting:
.../gems/nokogiri-1.5.6/lib/nokogiri/xml/node.rb:159:in `evaluate': Invalid expression: ./strong/text()[fn::tokenize(.,"br")] (Nokogiri::XML::XPath::SyntaxError)
I'm modeling my usage after the documentation for the method that the error occurs in (line 139):
node.xpath('.//title[regex(., "\w+")]',...

Resources