How to parse USPTO XML files with Ruby and Nokogiri? - ruby

I've spent the whole day trying to figure out how to parse USPTO bulk XML files. I downloaded one of those files, unzipped it, and then ran:
Nokogiri::XML(File.open('ipg140513.xml'))
But it seems to load only the first element, not all of the patents (that file contains a few thousand).
What am I doing wrong?

The file you linked to, and presumably the others, is not valid XML because it does not have a single root element. From Wikipedia:
Each XML document has exactly one single root element.
Nokogiri hints at this if you look at the errors (suggested by Arup Rakshit), as detailed in the documentation:
Nokogiri::XML(File.open("/Users/b/Downloads/ipg140513.xml")).errors # =>
# [
# #<Nokogiri::XML::SyntaxError: XML declaration allowed only at the start of the document>,
# #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>
# ]
The file appears to be a concatenation of a series of valid XML files, each having a <us-patent-grant/> as its root element.
Fortunately, Nokogiri can handle this invalid XML if you process it as a document fragment. Try this:
Nokogiri::XML::DocumentFragment.parse(File.read('ipg140513.xml')).children.select { |element| element.name == 'us-patent-grant' }
The select chooses the root node of each concatenated document, ignoring the processing instructions and DTD declarations.
Alternatively, you could pre-process the file and split it into its constituent, well-formed documents. Parsing a 650 MB document all at once is quite slow and memory-intensive.
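The pre-processing idea can be sketched roughly as follows. This is only an illustration, not a tested recipe for the USPTO files: it assumes every embedded document begins with an <?xml declaration at the start of a line, and the method name each_xml_document is made up for this example.

```ruby
require 'stringio'

# Yields each embedded document from a stream of concatenated XML documents.
# Assumption: every document starts with an `<?xml` declaration at the
# beginning of a line. Only one document is held in memory at a time.
def each_xml_document(io)
  buffer = +''
  io.each_line do |line|
    # A new declaration marks the end of the previous document.
    if line.start_with?('<?xml') && !buffer.strip.empty?
      yield buffer
      buffer = +''
    end
    buffer << line
  end
  # Emit the last document, which has no declaration following it.
  yield buffer unless buffer.strip.empty?
end

sample = <<~XML
  <?xml version="1.0"?>
  <us-patent-grant id="1"/>
  <?xml version="1.0"?>
  <us-patent-grant id="2"/>
XML

documents = []
each_xml_document(StringIO.new(sample)) { |doc| documents << doc }
puts documents.size
```

With a real file you would pass File.open('ipg140513.xml') instead of the StringIO and hand each yielded string to Nokogiri::XML, so that only one patent is parsed at a time.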

Related

Parsing a non-XML document with Nokogiri when the node names are/contain integers

When I run:
#!/usr/bin/env ruby
require 'nokogiri'
xml = <<-EOXML
<pajamas>
<bananas>
<foo>bar</foo>
<bar>bar</bar>
<1>bar</1>
</bananas>
</pajamas>
EOXML
doc = Nokogiri::XML(xml)
puts doc.at('/pajamas/bananas/foo')
puts doc.at('/pajamas/bananas/bar')
puts doc.at('/pajamas/bananas/1')
I get an ERROR: Invalid expression: /pajamas/bananas/1 (Nokogiri::XML::XPath::SyntaxError)
Is this a case of Nokogiri not liking ints as node names and/or is there a work around?
Looking at the documentation, I did not see a workaround to this. Removing the last line eliminates the error and prints the first two nodes as expected.
An XML element with a name that starts with a number is invalid XML.
XML elements must follow these naming rules:
Names can contain letters, numbers, and other characters
Names cannot start with a number or punctuation character
Names cannot start with the letters xml (or XML, or Xml, etc)
Names cannot contain spaces
Any name can be used; no words are reserved.
You're trying to parse invalid XML with an XML parser, and that is just not going to work. If you're really getting <1> as a tag and can't control that somehow, I'd suggest replacing the tags using a regex before handing the text to Nokogiri.
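A minimal sketch of that regex pre-processing step, assuming it is acceptable to rename the offending tags by adding a letter prefix. The method name rename_numeric_tags and the default prefix "n" are invented for this example, and the pattern would also touch look-alike text inside CDATA sections or comments.

```ruby
# Prefixes element names that start with a digit so the result is well-formed.
# Assumption: renaming <1> to <n1> is acceptable for the consumer of the data.
def rename_numeric_tags(xml, prefix = 'n')
  # Matches "<1", "</1", "<2nd" and so on, capturing the optional slash
  # and the digit-led name separately.
  xml.gsub(%r{<(/?)(\d[\w.-]*)}) { "<#{$1}#{prefix}#{$2}" }
end

broken = '<pajamas><bananas><foo>bar</foo><1>bar</1></bananas></pajamas>'
fixed = rename_numeric_tags(broken)
puts fixed
```

After this step the string can be passed to Nokogiri::XML, and the renamed node is reachable with an XPath such as /pajamas/bananas/n1.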

Parsing an XML file using nokogiri to create \index fields for LaTeX

I'm a professional indexer new to Ruby and nokogiri and I am in need of some assistance.
I'm working on a set of macros that will allow me to take an XML file, output from my indexing software, and parse it into valid \index{} commands for inclusion in a LaTeX source file. Each XML <record> contains at least two <field> tags, so I will have to iterate over the multiple <field> tags to build my \index{} entry.
The following is an example of an index record from the xml file.
<record time="2022-08-27T17:25:12" id="30">
<field><text style="i"/><hide>SS </hide>Titanic<text/></field>
<field>passengers</field>
<field class="locator"><text style="b"/>5<text/></field>
</record>
I will produce intermediate output of this record in the form of:
\index{Titanic#\textit{SS Titanic}!passengers|textbf} 5
(The numeric locator is used to place the \index{} entry at the correct spot in the LaTeX file and won't be included in the LaTeX source file.)
I am using nokogiri to manipulate the xml file and have been able to reach the point where I return a nodelist that contains just the <field> tags for each <record>, but I need to be able to retrieve all the text in the <field>, including the formatting information (if I use the text method on a <field>, it returns "SS Titanic" for example, with all formatting information stripped away).
I'm stuck on how to access the entire text string in the <field> tag. Once I can get that, I have a good idea of how to structure my parser.
Any help will be greatly appreciated.
Does this help?
require 'nokogiri'
# A heredoc avoids having to escape the double quotes inside the XML
xml = <<-EOXML
<record time="2022-08-27T17:25:12" id="30">
<field><text style="i"/><hide>SS </hide>Titanic<text/></field>
<field>passengers</field>
<field class="locator"><text style="b"/>5<text/></field>
</record>
EOXML
fields = Nokogiri::XML(xml).xpath(".//field")
puts fields.first.text #=> "SS Titanic"
puts fields.map(&:text) #=> ["SS Titanic", "passengers", "5"]

Use of --- in yaml

I came across this yaml document:
--- !ruby/object:MyClass
myint: 100
mystring: hello world
What does the line:
--- !ruby/object:MyClass
mean?
In YAML, --- is the end of directives marker.
A YAML document may begin with a number of YAML directives (currently, two directives are defined, %YAML and %TAG). Since a text node (for example) can also start with a % character, there needs to be a way to distinguish between directives and text. This is achieved using the end of directives marker --- which signals the end of the directives and the beginning of the document.
Since directives are allowed to be empty, --- can also serve as a document separator.
YAML also has an end of document marker (...). However, this is not often used, because an end of directives marker / document separator also implies the end of the document. You need it if you want to have multiple documents with directives within the same stream, or when you want to indicate that a document is finished without necessarily starting a new one (e.g. in cases where significant time may pass between the end of one document and the start of another).
Many YAML emitters, and Psych is no exception, always emit an end of directives marker at the beginning of each document. This allows you to easily concatenate multiple documents into a single stream without doing any additional processing of the documents.
The other half of that line, !ruby/object:MyClass, is a tag. A tag is used to give a type to the following node. In YAML, every node has a type, even if it is implicit. You can also write the tag explicitly, for example text nodes have the type (tag) !!str. This can be useful in certain circumstances, for example here:
!!str 2018-10-31
This tells YAML that 2018-10-31 is text, not a date.
!ruby/object:MyClass is a tag used by Psych to indicate that the node is a serialized Ruby Object which is an instance of class MyClass. This way, when deserializing the document, Psych knows what class to instantiate and how to treat the node.
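The points above can be tried out with a short script. This is a sketch using Ruby's bundled yaml library (Psych); the constructor of MyClass is an assumption made up to reproduce the document from the question.

```ruby
require 'yaml'

# Hypothetical class shaped like the document in the question.
class MyClass
  def initialize(myint, mystring)
    @myint = myint
    @mystring = mystring
  end
end

# Psych begins every dumped document with the end of directives marker,
# followed here by the tag that names the Ruby class.
tagged = YAML.dump(MyClass.new(100, 'hello world'))
puts tagged

# Because each dump starts with ---, dumps can simply be concatenated
# and read back as a multi-document stream.
stream = YAML.dump({ 'a' => 1 }) + YAML.dump({ 'b' => 2 })
documents = YAML.load_stream(stream)

# An explicit !!str tag keeps a date-like scalar as plain text.
date_text = YAML.safe_load('!!str 2018-10-31')
```

Loading the tagged document back into a MyClass instance is deliberately left out, because how Psych treats arbitrary classes on load differs between versions.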
According to yaml.org, '---' indicates the start of a document. See https://yaml.org/spec/1.2/spec.html for the official specification.

replace the first or nth line of file with ruby

How would I replace the first line of a text file or xml file using ruby? I'm having problems replicating a strange xml API and need to edit the document instruction after I create the XML file. It is strange that I have to do this, but in this case it is necessary.
If you are editing XML, use a tool specially designed for the task. sub, gsub and regex are not good choices if the XML being manipulated is not under your control.
Use Nokogiri to parse the XML, locate nodes and change them, then emit the updated XML.
There are many examples on SO showing how to do this, plus the tutorials on the Nokogiri site.
There are a couple different ways you can do this:
Use ARGF (assuming that your ruby program takes a file name as a command line parameter)
ruby -e "puts ARGF.to_a[n]" yourfile.xml
Open the file regularly then read n lines
File.open("yourfile") { |f|
line = nil
n.times { line = f.gets }
puts line
}
This approach uses less memory, because only a single line is held at a time. It is also the simplest method.
Use IO.readlines() (will only work if the entire file will fit in memory!)
IO.readlines("yourfile")[n]
IO.readlines(...) will read every line from your file into an array.
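In all of the examples above, n identifies the line you want. Note that ARGF.to_a[n] and IO.readlines(...)[n] use zero-based array indices (the first line is index 0), whereas the n.times loop reads n lines and therefore returns the nth line counting from 1.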
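The snippets above only read a line. To actually replace one, a rough sketch along the same lines could look like this. The method name replace_line is made up for illustration, the index is zero-based, and the whole file is read into memory, so it only suits files that fit there.

```ruby
# Replaces the line at a zero-based index and writes the file back.
# Assumption: the file is small enough to be held in memory as an array.
def replace_line(path, index, new_line)
  lines = File.readlines(path)
  unless index.between?(0, lines.size - 1)
    raise IndexError, "no line at index #{index}"
  end
  # Keep the file line-oriented by making sure the new line ends in a newline.
  lines[index] = new_line.end_with?("\n") ? new_line : "#{new_line}\n"
  File.write(path, lines.join)
end
```

For example, replace_line('yourfile.xml', 0, '<?xml version="1.0" encoding="UTF-8"?>') would swap out the first line and leave the rest of the file untouched.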

How to search binary file and replace string with Ruby?

Ruby newbie here. I'm using Ruby version 1.9.2. I work at a military facility, and whenever we need to send support data to our vendors, it needs to be scrubbed of identifying IP and hostname info. This is a new role for me, and the task of scrubbing files (both text and binary) now falls on me when handling support issues.
I created the following script to "scrub" plain text files of IP address info:
File.open("subnet.htm", 'r+') do |f|
text = f.read
text.gsub!(/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/, "000.000.000.000")
f.rewind
f.write(text)
end
I need to modify my script to search and replace hostname AND IP address information in text files AND .dat binary files. I'm looking for something really simple like my little script above, and I'd like to keep the processing of txt and dat files as separate scripts. Creating one script to do both is a task I'd like to take up later as a learning exercise. Right now I'm under time constraints to scrub the support files and send them out.
The priority for me is to scrub my binary .dat trace files which are of data type XML. These are binary performance trace files from our storage arrays and they need to have the identifying IP address information scrubbed out before sending off to support for analysis.
I've searched stackoverflow.com somewhat extensively and haven't found a question with an answer that addresses my specific need, and I'm simply having a hard time trying to figure out String#unpack.
Thanks.
In general Ruby processes binary files the same as other files, with two caveats:
On Windows reading files normally translates CRLF pairs into just LF. You need to read in binary mode to ensure no conversion:
File.open('foo.bin','rb'){ ... }
In order to ensure that your binary data is not interpreted as text in some other encoding under Ruby 1.9+ you need to specify the ASCII-8BIT encoding:
File.open('foo.bin','r:ASCII-8BIT'){ ... }
However, as noted in this post, setting the 'b' flag as shown above also sets the encoding for you. Thus, just use the first code snippet above.
However, as noted in the comment by @ennuikiller, I suspect that you don't actually have true binary data. If you're really reading text files with a non-ASCII encoding (e.g. UTF-8), there is a small chance that treating them as binary will match only half of a multi-byte character and corrupt the resulting file.
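Putting the binary-mode advice together with the regex from the question, a sketch of an IP scrubber for .dat files might look like the following. The names IPV4_PATTERN and scrub_ips are invented here. Unlike the original script, it overwrites each digit with 0 so the data keeps its length, on the assumption that a binary format may contain offsets or length fields that a longer replacement would break. Hostnames are not handled.

```ruby
# Matches dotted-quad IPv4 addresses; the n flag makes the pattern binary-safe.
IPV4_PATTERN = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/n

# Rewrites the file in place, zeroing every digit of each address found.
# Assumption: keeping the byte count unchanged matters for the file format.
def scrub_ips(path)
  data = File.binread(path) # binary read: no newline or encoding conversion
  scrubbed = data.gsub(IPV4_PATTERN) { |match| match.gsub(/\d/, '0') }
  File.binwrite(path, scrubbed)
end
```

Calling scrub_ips('trace.dat') would turn 10.1.2.30 into 00.0.0.00 while leaving every other byte in place.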
Edit: To use Nokogiri on XML files, you might do something like the following:
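require 'nokogiri'
File.open("foo.xml", 'r+') do |f|
doc = Nokogiri.XML(f.read)
doc.xpath('//text()').each do |text_node|
# You cannot use gsub! here; assign the substituted string back to the node
text_node.content = text_node.content.gsub(/.../, '...')
end
f.rewind
f.write doc.to_xml
# Drop leftover bytes in case the new content is shorter than the old
f.truncate(f.pos)
end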
I've done some binary file parsing, and this is how I read it in and cleaned it up:
data = File.open("file", 'rb' ) {|io| io.read}.unpack("C*").map do |val|
val if val == 9 || val == 10 || val == 13 || (val > 31 && val < 127)
end
In my case the binary file didn't have sequential character strings, so I had to do some shifting and filtering before I could read it (hence the .map do |val| ... end). Unpacking with the "C" directive (see http://www.ruby-doc.org/core-1.9.2/String.html#method-i-unpack) gives you ASCII character codes rather than the letters, so call val.chr if you'd like to use the interpreted character instead.
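As a small illustration of the unpack approach, the following sketch keeps only tab, newline, carriage return, and printable ASCII, then packs the surviving codes back into a string. The method name printable_text is made up for this example, and unlike the snippet above it drops the rejected bytes instead of leaving nil placeholders.

```ruby
# Returns only the printable ASCII and common whitespace from a byte string.
def printable_text(data)
  codes = data.unpack('C*') # one unsigned 8-bit integer per byte
  kept = codes.select do |val|
    val == 9 || val == 10 || val == 13 || (val > 31 && val < 127)
  end
  kept.pack('C*') # turn the remaining codes back into a string
end

cleaned = printable_text("\x00Hi\xFF\n\x01there".b)
puts cleaned
```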
I'd suggest that you open your files in a binary editor and look through them to determine how to best handle the data parsing. If they are XML, you might consider parsing them with Nokogiri or a similar XML tool.
