Ruby Nokogiri converting KML to CSV - ruby

I'm trying to extract two different elements from a KML file and turn them into a CSV. I'm starting with the great site here: http://ckdake.com/content/2012/highgroove-hack-night-kml-heatmaps.html that generates a csv of coordinates. All I want to do now is add the name tag to the start of each line. I'm a ruby/nokogiri n00b so I can stick this bit of code in which gets me a) a list of all names followed by b) a list of all coordinates. But again - I'd like them on the same line.
require 'rubygems'
require 'nokogiri' # gem install nokogiri
@doc = Nokogiri::XML(File.open("WashingtonDC2013-01-04 12h09m01s.kml"))
@doc.css('name').each do |name|
  puts name.content
end
@doc.css('coordinates').each do |coordinates|
  coordinates.text.split(' ').each do |coordinate|
    (lat,lon,elevation) = coordinate.split(',')
    puts "#{lat},#{lon}\n"
  end
end

How about this:
@doc.css('Placemark').each do |placemark|
  # at_css returns the first match or nil, so the check below is meaningful
  name = placemark.at_css('name')
  coordinates = placemark.at_css('coordinates')
  if name && coordinates
    print name.text + ","
    coordinates.text.split(' ').each do |coordinate|
      # KML stores each tuple as lon,lat,elevation
      (lon,lat,elevation) = coordinate.split(',')
      print "#{lat},#{lon}"
    end
    puts "\n"
  end
end
I'm assuming here that there is one coordinates pair in the <coordinates> tags for each <Placemark>. If there are more, they'll all get appended onto the same line.
If that doesn't work, you'll need to post some of the KML file itself so I can test on it. I'm just guessing based on this sample KML file.

Related

Ruby - iterate tasks with files

I am struggling to iterate tasks with files in Ruby.
(Purpose of the program = every week, I have to save 40 pdf files off the school system containing student scores, then manually compare them to last week's pdfs and update one spreadsheet with every student who has passed their target this week. This is a task for a computer!)
I have converted a pdf file to text, and my program then extracts the correct data from the text files and turns each student into an array [name, score, house group]. It then checks each new array against the data in the csv file, and adds any new results.
My program works on a single pdf file, because I've manually typed in:
f = File.open('output\agb summer report.txt')
agb = []
f.each_line do |line|
agb.push line
end
But I have a whole folder of pdf files that I want to run the program on iteratively. I've also had problems when I try to write each result to a new-named file.
I've tried things with variables and code blocks, but I now don't think you can use a variable in that way?
Dir.foreach('output') do |ea|
f = File.open(ea)
agb = []
f.each_line do |line|
agb.push line
end
end
^ This doesn't work. I've also tried exporting the directory names to an array, and doing something like:
a.each do |ea|
var = '\'output\\' + ea + '\''
f = File.open(var)
agb = []
f.each_line do |line|
agb.push line
end
end
I think I'm fundamentally confused about the sorts of object File and Dir are? I've searched a lot and haven't found a solution yet. I am fairly new to Ruby.
Anyway, I'm sure this can be done - my current backup plan is to copy my program 40 times with different details, but that sounds absurd. Please offer thoughts?
You're very close. Dir.foreach() yields only the name of each entry, whereas File.open() needs a path it can resolve, so the directory has to be joined back on. A crude example to illustrate this:
directory = 'example_directory'
Dir.foreach(directory) do |file|
  # Assuming Unix style filesystem, skip . and ..
  next if file.start_with? '.'
  # Simply puts the contents
  path = File.join(directory, file)
  puts File.read(path)
end
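Applied to the loop in the question, the same idea might look like the sketch below. It is only an illustration: the `read_reports` helper name and the temporary directory standing in for 'output' are assumptions, not part of the asker's program. Each name from Dir.foreach is joined onto the directory before it is read, and anything that is not a regular .txt file is skipped.

```ruby
require 'tmpdir'

# Hypothetical helper: returns { file name => array of lines } for every
# .txt file directly inside +directory+.
def read_reports(directory)
  reports = {}
  Dir.foreach(directory) do |name|
    path = File.join(directory, name) # Dir.foreach yields bare names
    next unless File.file?(path) && File.extname(name) == '.txt'
    reports[name] = File.readlines(path, chomp: true)
  end
  reports
end

# Demonstration with a throwaway directory in place of 'output'.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'agb summer report.txt'), "Ann,12\nBob,15\n")
  File.write(File.join(dir, 'notes.pdf'), 'ignored')
  p read_reports(dir)
end
```

File.readlines opens, reads, and closes the file in one call, so no file handles are left open while the loop runs over all 40 files.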
Use Globbing for File Lists
You need to use Dir.glob (a class method) to get your list of files. For example, given three PDF files in /tmp/pdf, you collect them with a glob like so:
Dir.glob('/tmp/pdf/*pdf')
# => ["/tmp/pdf/1.pdf", "/tmp/pdf/2.pdf", "/tmp/pdf/3.pdf"]
Dir.glob('/tmp/pdf/*pdf').class
# => Array
Once you have a list of filenames, you can iterate over them with something like:
Dir.glob('/tmp/pdf/*pdf').each do |pdf|
  # The trailing "-" makes pdftotext write to standard output instead of a file
  text = %x(pdftotext "#{pdf}" -)
  # do something with your textual data
end
If you're on a Windows system, then you might need a gem like pdf-reader or something else from Ruby Toolbox that suits you better to actually parse the PDF. Regardless, you should use globbing to create a file list; what you do after that depends on what kind of data the file actually holds. IO#read and descendants like File#read are good places to start.
Handling Text Files
If you're dealing with text files rather than PDF files, then something like this will get you started:
Dir.glob('/tmp/pdf/*txt').each do |text|
# Do something with your textual data. In this case, just
# dump the files to standard output.
p File.read(text)
end
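The question also mentions trouble writing each result to a newly named file. One way to derive the output name from each globbed path is sketched below. The `write_results` helper, the "_results" suffix, and the placeholder processing step are all assumptions made for illustration.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: for each .txt file in +input_dir+, writes a file named
# "<original name>_results.txt" into +output_dir+ and returns the new paths.
def write_results(input_dir, output_dir)
  FileUtils.mkdir_p(output_dir)
  Dir.glob(File.join(input_dir, '*.txt')).sort.map do |path|
    lines = File.readlines(path, chomp: true)
    base = File.basename(path, '.txt') # "agb" for ".../agb.txt"
    target = File.join(output_dir, "#{base}_results.txt")
    # Placeholder processing: keep only the non-empty lines.
    File.write(target, lines.reject(&:empty?).join("\n") + "\n")
    target
  end
end

# Demonstration with a throwaway directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'agb.txt'), "Ann,12\n\nBob,15\n")
  puts write_results(dir, File.join(dir, 'results'))
end
```

Because the glob already returns full paths, nothing has to be joined back on before reading; only the output name needs to be built.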
You can use Dir.new("./") to get all the files in the current directory, so something like this should work.
file_names = Dir.new "./"
file_names.each do |file_name|
if file_name.end_with? ".txt"
f = File.open(file_name)
agb = []
f.each_line do |line|
agb.push line
end
end
end
By the way, you can just use agb = f.to_a to convert the file contents into an array where each element is a line from the file.
file_names = Dir.new "./"
file_names.each do |file_name|
if file_name.end_with? ".txt"
f = File.open file_name
agb = f.to_a
# do whatever processing you need to do
end
end
If you set your target folder to a glob pattern like /path/to/your/folder/*.txt, Dir[] will iterate only over the text files.
2.2.0 :009 > target_folder = "/home/ziya/Desktop/etc3/example_folder/*.txt"
=> "/home/ziya/Desktop/etc3/example_folder/*.txt"
2.2.0 :010 > Dir[target_folder].each do |texts|
2.2.0 :011 > puts texts
2.2.0 :012?> end
/home/ziya/Desktop/etc3/example_folder/ex4.txt
/home/ziya/Desktop/etc3/example_folder/ex3.txt
/home/ziya/Desktop/etc3/example_folder/ex2.txt
/home/ziya/Desktop/etc3/example_folder/ex1.txt
iteration over text files is ok
2.2.0 :002 > Dir[target_folder].each do |texts|
2.2.0 :003 > File.open(texts, 'w') {|file| file.write("your content\n")}
2.2.0 :004?> end
results
2.2.0 :008 > system ("pwd")
/home/ziya/Desktop/etc3/example_folder
=> true
2.2.0 :009 > system("for f in *.txt; do cat $f; done")
your content
your content
your content
your content

How to parse XML nodes to CSV with Ruby and Nokogiri

I have an XML file:
<?xml version="1.0" encoding="iso-8859-1"?>
<Offers xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://ssc.channeladvisor.com/files/cageneric.xsd">
<Offer>
<Model><![CDATA[11016001]]></Model>
<Manufacturer><![CDATA[Crocs, Inc.]]></Manufacturer>
<ManufacturerModel><![CDATA[11016-001]]></ManufacturerModel>
...lots more nodes
<Custom6><![CDATA[<li>Bold midsole stripe for a sporty look.</li>
<li>Odor-resistant, easy to clean, and quick to dry.</li>
<li>Ventilation ports for enhanced breathability.</li>
<li>Lightweight, non-marking soles.</li>
<li>Water-friendly and buoyant; weighs only ounces.</li>
<li>Fully molded Croslite™ material for lightweight cushioning and comfort.</li>
<li>Heel strap swings back for snug fit, forward for wear as a clog.</li>]]></Custom6>
</Offer>
....lots lots more <Offer> entries
</Offers>
I want to parse each instance of 'Offer' into its own row in a CSV file:
require 'csv'
require 'nokogiri'
file = File.read('input.xml')
doc = Nokogiri::XML(file)
a = []
csv = CSV.open('output.csv', 'wb')
doc.css('Offer').each do |node|
a.push << node.content.split
end
a.each { |a| csv << a }
This runs nicely except I'm splitting on whitespace rather than each element of the Offer node so every word is going into its own column in the CSV file.
Is there a way to pick up the content of each node and how do I use the node names as headers in the CSV file?
This assumes that each Offer element always has the same child nodes (though they can be empty):
CSV.open('output.csv', 'wb') do |csv|
doc.search('Offer').each do |x|
csv << x.search('*').map(&:text)
end
end
And to get headers (from the first Offer element):
CSV.open('output.csv', 'wb') do |csv|
csv << doc.at('Offer').search('*').map(&:name)
doc.search('Offer').each do |x|
csv << x.search('*').map(&:text)
end
end
search and at are Nokogiri methods that accept either XPath or CSS selector strings. at returns the first matching element (or nil if there is none); search returns a NodeSet of all matching elements, which is empty if nothing matches. The * here selects every element beneath the current node, not only its direct children; because the children of each Offer contain only text or CDATA, the result is the same as selecting the direct children.
Both name and text are also Nokogiri functions (for an element). name provides the element's name; text provides the text or CDATA content of a node.
Try this, and modify it to push into your CSV:
doc.css('Offer').first.elements.each do |n|
puts "#{n.name}: #{n.content}"
end

How to properly automate xml to xls

I have been getting a lot of XML files recently that I want to analyse in Excel. Instead of using the XML conversion built into (newer versions of) Excel, I want to use Ruby code that converts a number of files automatically.
I am not very familiar with REXML, however. After half a day's work I got the code to convert just one(!) XML node. This is how it looks:
require 'rexml/document'
Dir.glob("FILES/archive/*.xml") do |eksemel|
puts "converting #{eksemel}"
filename = (/\d+/.match(eksemel)).to_s
xml_file = File.open("#{eksemel}", "r")
csv_file = File.new("#{filename}.csv", "w")
xml = REXML::Document.new( xml_file )
counter = 0
xml.elements.each("RESULTS") do |e|
e.elements.each("component") do |f|
f.elements.each("paragraph") do |g|
counter = counter + 1
csv_file.puts g.text
end
end
end
end
Is there a way to a) have Ruby discover the element names and their number automatically, instead of my defining them, and b) save all of these as separate columns in a CSV file?
It isn't clear what you are using counter for. It would also help if you clarified what kind of structure the XML file has (for instance, are there many <paragraph> elements within each <component> element?). But here is a cleaner way to write what I think you are shooting for:
require 'rexml/document'
require 'csv'
Dir.glob('FILES/archive/*.xml') do |eksemel|
  puts "converting #{eksemel}"
  # I assume you are creating a .csv file with the same name as your .xml file
  xml_file = File.new(eksemel)
  csv_file = CSV.open(eksemel.sub(/\.xml$/, '.csv'), 'w')
  xml = REXML::Document.new(xml_file)
  # Total number of paragraphs, in case you still need the count
  counter = xml.elements.to_a('RESULTS//component//paragraph').length
  xml.elements.each('RESULTS//component') do |component|
    # One CSV row per component, one column per paragraph's text
    csv_file << component.elements.to_a('paragraph').map(&:text)
  end
  [xml_file, csv_file].each {|f| f.close}
end
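To see the row-per-component idea without touching the filesystem, here is a small sketch on an in-memory document. The `components_to_csv` helper name and the sample XML are assumptions; the real files may be structured differently, as the answer notes.

```ruby
require 'rexml/document'
require 'csv'

# Hypothetical helper: one CSV row per <component>, one column per <paragraph>.
def components_to_csv(xml_text)
  xml = REXML::Document.new(xml_text)
  CSV.generate do |csv|
    xml.elements.each('RESULTS//component') do |component|
      # Take the text of each paragraph rather than the element itself.
      csv << component.elements.to_a('paragraph').map(&:text)
    end
  end
end

# Made-up sample with the element names used in the question.
sample = <<~XML
  <RESULTS>
    <component><paragraph>a</paragraph><paragraph>b</paragraph></component>
    <component><paragraph>c</paragraph></component>
  </RESULTS>
XML

puts components_to_csv(sample)
```

Rows may end up with different lengths when components hold different numbers of paragraphs; spreadsheet programs generally accept that, but it is worth knowing before analysing the output.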

trying to get the delta between columns using FasterCSV

A bit of a noob here so apologies in advance.
I am trying to read a CSV file that has a number of columns. I would like to see whether the string "foo" exists anywhere in the file and, if so, grab the string one cell over (same row, one column to the right) and write it to a file.
my file c.csv:
foo,bar,yip
12,apple,yap
23,orange,yop
foo,tom,yum
so in this case, I would want "bar" and "tom" in a new csv file.
Here's what I have so far:
#!/usr/local/bin/ruby -w
require 'rubygems'
require 'fastercsv'
rows = FasterCSV.read("c.csv")
acolumn = rows.collect{|row| row[0]}
if acolumn.select{|v| v =~ /foo/} == 1
i = 0
for z in i..(acolumn).count
puts rows[1][i]
end
I've looked here https://github.com/circle/fastercsv/blob/master/examples/csv_table.rb but I am obviously not understanding it, my best guess is that I'd have to use Table to do what I want to do but after banging my head up against the wall for a bit, I decided to ask for advice from the experienced folks. help please?
Given your input file c.csv
foo,bar,yip
12,apple,yap
23,orange,yop
foo,tom,yum
then this script:
#!/usr/bin/ruby1.8
require 'fastercsv'
FasterCSV.open('output.csv', 'w') do |output|
  FasterCSV.foreach('c.csv') do |row|
    foo_index = row.index('foo')
    if foo_index
      value_to_the_right_of_foo = row[foo_index + 1]
      # << expects a row, so wrap the single value in an array
      output << [value_to_the_right_of_foo]
    end
  end
end
will create the file output.csv
bar
tom
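FasterCSV became the standard csv library in Ruby 1.9, so on a current Ruby the same approach can be written without the gem. The sketch below is an adaptation under that assumption, not the original answer's code; the `values_right_of` helper name is made up, and it works on a string so it can be tried without creating files.

```ruby
require 'csv'

# Hypothetical helper: for every row containing +needle+, returns the cell
# immediately to its right (rows where needle is the last cell are skipped).
def values_right_of(csv_text, needle)
  found = CSV.parse(csv_text).map do |row|
    index = row.index(needle)
    row[index + 1] if index
  end
  found.compact
end

input = <<~DATA
  foo,bar,yip
  12,apple,yap
  23,orange,yop
  foo,tom,yum
DATA

values = values_right_of(input, 'foo')
# Each value becomes its own single-column row in the output.
puts CSV.generate { |csv| values.each { |value| csv << [value] } }
```

Like the script above, this looks only at the first "foo" in each row.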

How do I tell the line number for a node using the Nokogiri reader interface?

I'm trying to write a Nokogiri script that will grep XML for text nodes containing ASCII double-quotes («"»). Since I want a grep-like output I need the line number, and the contents of each line. However, I am unable to see how to tell the line number where the element starts at. Here is my code:
require 'rubygems'
require 'nokogiri'
ARGV.each do |filename|
xml_stream = File.open(filename)
reader = Nokogiri::XML::Reader(xml_stream)
titles = []
text = ''
grab_text = false
reader.each do |elem|
if elem.node_type == Nokogiri::XML::Node::TEXT_NODE
data = elem.value
lines = data.split(/\n/, -1);
lines.each_with_index do |line, idx|
if (line =~ /"/) then
STDOUT.printf "%s:%d:%s\n", filename, elem.line()+idx, line
end
end
end
end
end
elem.line() does not work.
XML and parsers don't really have a concept of line numbers. You're talking about the physical layout of the file.
You can play a game with the parser using accessors looking for text nodes containing linefeeds and/or carriage returns but that can be thrown off because XML allows nested nodes.
require 'nokogiri'
xml =<<EOT_XML
<atag>
<btag>
<ctag
id="another_node">
other text
</ctag>
</btag>
<btag>
<ctag id="another_node2">yet
another
text</ctag>
</btag>
<btag>
<ctag id="this_node">this text</ctag>
</btag>
</atag>
EOT_XML
doc = Nokogiri::XML(xml)
# find a particular node via CSS accessor
doc.at('ctag#this_node').text # => "this text"
# count how many "lines" there are in the document
doc.search('*/text()').select{ |t| t.text[/[\r\n]/] }.size # => 12
# walk the nodes looking for a particular string, counting lines as you go
content_at = []
doc.search('*/text()').each do |n|
content_at << [n.line, n.text] if (n.text['this text'])
end
content_at # => [[14, "this text"]]
This works because of the parser's ability to figure out what is a text node and cleanly return it, without relying on regex or text matches.
EDIT: I went through some old code, snooped around in Nokogiri's docs some, and came up with the above edited changes. It's working correctly, including working with some pathological cases. Nokogiri FTW!
As of 1.2.0 (released 2009-02-22), Nokogiri supports Node#line, which returns the line number in the source where that node is defined.
It appears to use the libxml2 function xmlGetLineNo().
require 'nokogiri'
doc = Nokogiri::XML(open 'tmpfile.xml')
doc.xpath('//xmlns:package[@arch="x86_64"]').each do |node|
  puts '%4d %s' % [node.line, node['name']]
end
NOTE: If you are working with large XML files (> 65535 lines), be sure to use Nokogiri 1.13.0 or newer (released 2022-01-06), or your Node#line results will not be accurate for large line numbers. See PR 2309 for an explanation.

Resources