How to properly automate xml to xls - ruby

I have been receiving a lot of XML files recently that I want to analyse in Excel. Instead of using the XML import built into (newer versions of) Excel, I want to use Ruby code that converts a number of files automatically.
I am not very familiar with REXML, however. After half a day's work I got the code to convert just one(!) XML node. This is how it looks:
require 'rexml/document'
Dir.glob("FILES/archive/*.xml") do |eksemel|
  puts "converting #{eksemel}"
  filename = (/\d+/.match(eksemel)).to_s
  xml_file = File.open("#{eksemel}", "r")
  csv_file = File.new("#{filename}.csv", "w")
  xml = REXML::Document.new( xml_file )
  counter = 0
  xml.elements.each("RESULTS") do |e|
    e.elements.each("component") do |f|
      f.elements.each("paragraph") do |g|
        counter = counter + 1
        csv_file.puts g.text
      end
    end
  end
end
Is there a way to (a) have Ruby discover the element names and their number automatically, instead of defining them in the code, and (b) save all of them as separate columns in a CSV file?

It isn't clear what you are using counter for. It would also help if you clarified what kind of structure the XML file has (for instance, are there many <paragraph> elements within each <component> element?). But here is a cleaner way to write what I think you are shooting for:
require 'rexml/document'
require 'csv'
Dir.glob('FILES/archive/*.xml') do |eksemel|
  puts "converting #{eksemel}"
  # I assume you are creating a .csv file with the same name as your .xml file
  xml_file = File.new(eksemel)
  csv_file = CSV.open(eksemel.sub(/\.xml$/, '.csv'), 'w')
  xml = REXML::Document.new(xml_file)
  # Total number of <paragraph> elements, in case you need the count
  counter = xml.elements.to_a('RESULTS//component//paragraph').length
  xml.elements.each('RESULTS//component') do |component|
    # One row per <component>, one column per <paragraph> text
    csv_file << component.elements.to_a('paragraph').map(&:text)
  end
  [xml_file, csv_file].each { |f| f.close }
end
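For what it's worth, here is a minimal sketch of both parts of the question: it walks whatever child elements it finds instead of naming them, and puts each value in its own column. The sample XML and the xml_to_rows helper are made up for illustration; I am assuming each child of the root is one record and each grandchild is one field, which may not match your files.

```ruby
require 'rexml/document'
require 'csv'

# Hypothetical input; the real files may be structured differently.
SAMPLE_XML = <<~XML
  <RESULTS>
    <component>
      <paragraph>alpha</paragraph>
      <paragraph>beta</paragraph>
    </component>
    <component>
      <paragraph>gamma</paragraph>
    </component>
  </RESULTS>
XML

# Turn every child element of the root into one row, with one column per
# grandchild element, without naming any element in the code.
def xml_to_rows(xml_string)
  root = REXML::Document.new(xml_string).root
  root.elements.map do |record|
    record.elements.map { |field| field.text.to_s.strip }
  end
end

rows = xml_to_rows(SAMPLE_XML)
csv_text = CSV.generate { |csv| rows.each { |row| csv << row } }
puts csv_text
```

To write a file per input instead of a string, the same rows could be passed to CSV.open(path, 'w') inside the Dir.glob loop.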

Related

Ruby - iterate tasks with files

I am struggling to iterate tasks with files in Ruby.
(Purpose of the program = every week, I have to save 40 pdf files off the school system containing student scores, then manually compare them to last week's pdfs and update one spreadsheet with every student who has passed their target this week. This is a task for a computer!)
I have converted a pdf file to text, and my program then extracts the correct data from the text files and turns each student into an array [name, score, house group]. It then checks each new array against the data in the csv file, and adds any new results.
My program works on a single pdf file, because I've manually typed in:
f = File.open('output\agb summer report.txt')
agb = []
f.each_line do |line|
  agb.push line
end
But I have a whole folder of pdf files that I want to run the program on iteratively. I've also had problems when I try to write each result to a new-named file.
I've tried things with variables and code blocks, but I now don't think you can use a variable in that way?
Dir.foreach('output') do |ea|
  f = File.open(ea)
  agb = []
  f.each_line do |line|
    agb.push line
  end
end
^ This doesn't work. I've also tried exporting the directory names to an array, and doing something like:
a.each do |ea|
  var = '\'output\\' + ea + '\''
  f = File.open(var)
  agb = []
  f.each_line do |line|
    agb.push line
  end
end
I think I'm fundamentally confused about the sorts of object File and Dir are? I've searched a lot and haven't found a solution yet. I am fairly new to Ruby.
Anyway, I'm sure this can be done - my current backup plan is to copy my program 40 times with different details, but that sounds absurd. Please offer thoughts?
You're very close. Dir.foreach() will return the name of the files whereas File.open() is going to want the path. A crude example to illustrate this:
directory = 'example_directory'
Dir.foreach(directory) do |file|
  # Assuming Unix style filesystem, skip . and ..
  next if file.start_with? '.'
  # Simply puts the contents
  path = File.join(directory, file)
  puts File.read(path)
end
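To make the name-versus-path point concrete, here is a small self-contained sketch. It builds a throwaway directory so it can be run anywhere; the file names, the sample lines and the read_all_lines helper are all made up, and I am assuming one record per line as in the question.

```ruby
require 'tmpdir'

# Read every regular file in +directory+ into an array of lines,
# keyed by the bare file name.
def read_all_lines(directory)
  result = {}
  Dir.foreach(directory) do |name|
    path = File.join(directory, name)  # foreach yields bare names, not paths
    next unless File.file?(path)       # skips ".", ".." and subdirectories
    result[name] = File.readlines(path, chomp: true)
  end
  result
end

# Demo on a temporary directory that is removed afterwards.
lines_by_file = Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'agb.txt'), "Ann,12,Red\nBob,9,Blue\n")
  File.write(File.join(dir, 'xyz.txt'), "Cy,15,Green\n")
  read_all_lines(dir)
end

lines_by_file.each { |name, lines| puts "#{name}: #{lines.length} line(s)" }
```

Checking File.file? on the joined path is a slightly stricter filter than skipping names that start with a dot, since it also ignores subdirectories.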
Use Globbing for File Lists
You need to use Dir.glob to get your list of files. For example, given three PDF files in /tmp/pdf, you collect them with a glob like so:
Dir.glob('/tmp/pdf/*pdf')
# => ["/tmp/pdf/1.pdf", "/tmp/pdf/2.pdf", "/tmp/pdf/3.pdf"]
Dir.glob('/tmp/pdf/*pdf').class
# => Array
Once you have a list of filenames, you can iterate over them with something like:
Dir.glob('/tmp/pdf/*pdf').each do |pdf|
  # The trailing "-" makes pdftotext write the text to standard output
  text = %x(pdftotext "#{pdf}" -)
  # do something with your textual data
end
If you're on a Windows system, then you might need a gem like pdf-reader or something else from Ruby Toolbox that suits you better to actually parse the PDF. Regardless, you should use globbing to create a file list; what you do after that depends on what kind of data the file actually holds. IO#read and descendants like File#read are good places to start.
Handling Text Files
If you're dealing with text files rather than PDF files, then something like this will get you started:
Dir.glob('/tmp/pdf/*txt').each do |text|
  # Do something with your textual data. In this case, just
  # dump the files to standard output.
  p File.read(text)
end
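Since the question also mentions trouble writing each result to a newly named file, here is a rough sketch of deriving the output name from each globbed input. The process_folder helper, the ".out" suffix and the upcase step are placeholders I made up; the real program would put its own extraction logic there.

```ruby
require 'tmpdir'

# For each .txt file in +folder+, write a processed copy next to it and
# return the list of output paths. upcase stands in for the real processing.
def process_folder(folder)
  Dir.glob(File.join(folder, '*.txt')).sort.map do |path|
    base = File.basename(path, '.txt')       # "agb" from ".../agb.txt"
    out  = File.join(folder, "#{base}.out")  # hypothetical naming scheme
    File.write(out, File.read(path).upcase)
    out
  end
end

# Demo on a temporary directory with made-up contents.
written = Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'agb.txt'), "ann 12\n")
  File.write(File.join(dir, 'xyz.txt'), "bob 9\n")
  process_folder(dir).map { |out| [File.basename(out), File.read(out)] }
end

written.each { |name, content| puts "#{name}: #{content}" }
```

Using a different extension for the outputs keeps them from being picked up by the same glob on the next run.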
You can use Dir.new("./") to get all the entries in the current directory, so something like this should work:
file_names = Dir.new "./"
file_names.each do |file_name|
  if file_name.end_with? ".txt"
    f = File.open(file_name)
    agb = []
    f.each_line do |line|
      agb.push line
    end
  end
end
By the way, you can just use agb = f.to_a to convert the file contents into an array where each element is a line from the file.
file_names = Dir.new "./"
file_names.each do |file_name|
  if file_name.end_with? ".txt"
    f = File.open file_name
    agb = f.to_a
    # do whatever processing you need to do
  end
end
If you specify your target folder as a glob pattern such as /path/to/your/folder/*.txt, Dir[] will only iterate over the text files.
2.2.0 :009 > target_folder = "/home/ziya/Desktop/etc3/example_folder/*.txt"
=> "/home/ziya/Desktop/etc3/example_folder/*.txt"
2.2.0 :010 > Dir[target_folder].each do |texts|
2.2.0 :011 > puts texts
2.2.0 :012?> end
/home/ziya/Desktop/etc3/example_folder/ex4.txt
/home/ziya/Desktop/etc3/example_folder/ex3.txt
/home/ziya/Desktop/etc3/example_folder/ex2.txt
/home/ziya/Desktop/etc3/example_folder/ex1.txt
Iterating over the text files works. Writing to each of them:
2.2.0 :002 > Dir[target_folder].each do |texts|
2.2.0 :003 > File.open(texts, 'w') {|file| file.write("your content\n")}
2.2.0 :004?> end
Results:
2.2.0 :008 > system ("pwd")
/home/ziya/Desktop/etc3/example_folder
=> true
2.2.0 :009 > system("for f in *.txt; do cat $f; done")
your content
your content
your content
your content

Missing parts after parsing and processing a very large XML file in Ruby

I have to parse and modify a 22.2MB XML file (a WordPress export).
The problem is that after parsing, the last part of the file is always missing, and I can't figure out why.
I've tried using the saxerator gem, but it does not seem to solve my problem.
Here I'm just trying to get all the <item> elements from the input file and write them to an output file:
require 'saxerator'
class SaxImport
  def initialize input_file, output_file
    f = File.read(input_file, File.size(input_file))
    xml_data = Saxerator.parser(f) do |config|
      config.output_type = :xml
    end
    category_fr_list = {}
    items = []
    output = File.open output_file, "w"
    xml_data.for_tag(:item).reverse_each do |item|
      output << item.to_xml
    end
    output.close
  end
end
import_en = SaxImport.new 'weekly.xml', 'weekly.processed.xml'

Need help exporting results parsed with Nokogiri to CSV. Only the last parsed result is shown, why?

This is killing me and searching here and the big G is confusing me even more.
I followed the tutorial at Railscasts #190 on Nokogiri and was able to write myself a nice little parser:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.target.com/c/movies-entertainment/-/N-5xsx0/Ntk-All/Ntt-wwe/Ntx-matchallpartial+rel+E#navigation=true&facetedValue=/-/N-5xsx0&viewType=medium&sortBy=PriceLow&minPrice=0&maxPrice=10&isleaf=false&navigationPath=5xsx0&parentCategoryId=9975218&RatingFacet=0&customPrice=true"
doc = Nokogiri::HTML(open(url))
puts doc.at_css("title").text
doc.css(".standard").each do |item|
  title = item.at_css("span.productTitle a")[:title]
  format = item.at_css("span.description").text
  price = item.at_css(".price-label").text[/\$[0-9\.]+/]
  link = item.at_css("span.productTitle a")[:href]
  puts "#{title}, #{format}, #{price}, #{link}"
end
I'm happy with the results and able to see it in the Windows console. However, I want to export the results to a CSV file and have tried numerous ways (with no luck) and I know I'm missing something. My latest updated code (after downloading the html files) is below:
require 'rubygems'
require 'nokogiri'
require 'csv'
@title = Array.new
@format = Array.new
@price = Array.new
@link = Array.new
doc = Nokogiri::HTML(open("index1.html"))
doc.css(".standard").each do |item|
  @title << item.at_css("span.productTitle a")[:title]
  @format << item.at_css("span.description").text
  @price << item.at_css(".price-label").text[/\$[0-9\.]+/]
  @link << item.at_css("span.productTitle a")[:href]
end
CSV.open("file.csv", "wb") do |csv|
  csv << ["title", "format", "price", "link"]
  csv << [@title, @format, @price, @link]
end
It works and spits a file out for me, but just the last result. I followed the tutorial at Andrew!: WEb Scraping... and trying to mix what I'm trying to achieve with someone else's process is confusing.
I assume it's looping through all of the results and only printing the last. Can someone give me pointers on how I should loop this (if that's the problem) so that all the results are in their respective columns?
Thanks in advance.
You're storing values in four arrays, but you're not enumerating the arrays when you generate your output.
Here is a possible fix:
CSV.open("file.csv", "wb") do |csv|
  csv << ["title", "format", "price", "link"]
  until @title.empty?
    csv << [@title.shift, @format.shift, @price.shift, @link.shift]
  end
end
Note that this is a destructive operation that shifts the values off of the arrays one at a time, so in the end they will all be empty.
There are more efficient ways to read and convert the data, but this will hopefully do what you want for now.
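As one non-destructive alternative to shift, Array#zip pairs up the four arrays row by row and leaves them intact. This is only a sketch with made-up sample values standing in for the scraped data, and it builds the CSV as a string so it runs without any HTML file.

```ruby
require 'csv'

# Made-up sample data in place of the scraped values.
titles  = ['Movie A', 'Movie B']
formats = ['DVD', 'Blu-ray']
prices  = ['$5.00', '$9.99']
links   = ['/p/a', '/p/b']

csv_text = CSV.generate do |csv|
  csv << %w[title format price link]
  # zip yields [title, format, price, link] for each index
  titles.zip(formats, prices, links).each { |row| csv << row }
end

puts csv_text
```

The same zip loop works unchanged inside a CSV.open("file.csv", "wb") block.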
There are several things you could do to write this more in the "Ruby way":
require 'rubygems'
require 'nokogiri'
require 'csv'
doc = Nokogiri::HTML(open("index1.html"))
CSV.open('file.csv', 'wb') do |csv|
  csv << %w[title format price link]
  doc.css('.standard').each do |item|
    csv << [
      item.at_css('span.productTitle a')[:title],
      item.at_css('span.description').text,
      item.at_css('.price-label').text[/\$[0-9\.]+/],
      item.at_css('span.productTitle a')[:href]
    ]
  end
end
Without sample HTML it's not possible to test this, but, based on your code, it looks like it'd work.
Notice that in your code you're using instance variables. They're not necessary because you aren't defining a class to have an instance of. You can use local variables instead.

How to crawl the right way?

I have been working and tinkering with Nokogiri, REXML & Ruby for a month. I have this giant database that I am trying to crawl. The things that I am scraping are HTML links and XML files.
There are exactly 43612 XML files that I want to crawl and store in a CSV file.
My script works if I crawl maybe 500 XML files, but anything larger takes too much time and it freezes or something.
I have divided the code into pieces here so it is easy to read; the whole script is here: https://gist.github.com/1981074
I am using two libraries because I couldn't find a way to do this all in Nokogiri. I personally find REXML easier to use.
My question: how can I fix it so it won't take a week for me to crawl all this? How do I make it run faster?
HERE IS MY SCRIPT:
Require the necessary libraries:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML
Create a bunch of arrays to store the grabbed data:
@urls = Array.new
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new
Grab all the XML links from a specific site and store them in an array called @urls:
htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))
htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end
Loop through the @urls array and grab every element node that I want with XPath:
@urls.each do |url|
  # Loop through the XML files and grab element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Fetch the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") { |e|
    m = e.text
    m = m.to_s
    next if m.empty?
    @titleSv << m
  }
  # ... the remaining fields are collected the same way (see the gist)
end
Then store them in a CSV file.
CSV.open("eduction_normal.csv", "wb") do |row|
  (0..@ID.length - 1).each do |index|
    row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
  end
end
It's hard to pinpoint the exact problem because of the way the code is structured. Here are a few suggestions to increase the speed and structure the program so that it will be easier to find what's blocking you.
Libraries
You're using a lot of libraries here that probably aren't necessary.
You use both REXML and Nokogiri. They both do the same job, except that Nokogiri is much better at it (benchmark).
Use Hashes
Instead of storing data by index in 15 parallel arrays, keep one set of hashes.
For instance,
require 'set'
items = Set.new
doc.xpath('//a/@href').each do |url|
  item = {}
  item[:url] = url.content
  items << item
end
items.each do |item|
  xml = Nokogiri::XML(open(item[:url]))
  item[:id] = xml.root['id']
  ...
end
Collect the data, then write to file
Now that you have your items set, you can iterate over it and write to the file. This is much faster than doing it line by line.
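A rough sketch of that last step, assuming each item is a hash keyed by column name. The COLUMNS list and the two sample items are invented for illustration, and the CSV is built as a string here so the example runs on its own.

```ruby
require 'csv'

# Hypothetical column order; the real script has more fields.
COLUMNS = %i[id title_sv title_en]

# Made-up items standing in for the crawled data.
items = [
  { id: '1', title_sv: 'Matematik', title_en: 'Mathematics' },
  { id: '2', title_sv: 'Fysik' }   # a missing key becomes an empty cell
]

csv_text = CSV.generate do |csv|
  csv << COLUMNS.map(&:to_s)
  items.each { |item| csv << item.values_at(*COLUMNS) }
end

puts csv_text
```

Because values_at follows the COLUMNS order, every row lines up with the header even when an item lacks a field.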
Be DRY
In your original code, you have the same thing repeated a dozen times. Instead of copying and pasting, try instead to abstract out the common code.
xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") { |e|
  m = e.text
  m = m.to_s
  next if m.empty?
  @titleSv << m
}
Move what's common to a method:
def get_value(xml, path)
  # Return the text of the first matching element that is not empty
  xml.elements.each(path) do |e|
    str = e.text.to_s
    return str unless str.empty?
  end
  ''
end
And move anything constant to another hash
xml_paths = {
  :title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
  :title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
  ...
}
Now you can combine these techniques to make for much cleaner code:
item[:title_sv] = get_value(xml, xml_paths[:title_sv])
item[:title_en] = get_value(xml, xml_paths[:title_en])
I hope this helps!
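Taking the idea one step further, the two assignment lines could be replaced by a loop over the path hash. The sketch below is only an illustration: the sample document and its id are made up, the paths omit the namespaced alternatives for brevity, and get_value here is a simplified variant built on REXML::XPath.first rather than the method above.

```ruby
require 'rexml/document'

# Simplified, hypothetical paths (no namespace alternatives).
XML_PATHS = {
  title_sv: '/educationInfo/titles/title[1]',
  title_en: '/educationInfo/titles/title[2]'
}

# Text of the first node matching +path+, or '' when nothing matches.
def get_value(doc, path)
  node = REXML::XPath.first(doc, path)
  node ? node.text.to_s : ''
end

# Made-up sample document.
doc = REXML::Document.new(<<~XML)
  <educationInfo id="i.demo.1">
    <titles>
      <title>Matematik</title>
      <title>Mathematics</title>
    </titles>
  </educationInfo>
XML

item = { id: doc.root.attributes['id'] }
XML_PATHS.each { |key, path| item[key] = get_value(doc, path) }

p item
```

Adding a field then means adding one entry to the hash instead of another copy of the extraction block.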
This won't work without some fixes on your side, and I believe you should refactor your parsing code as @Ian Bishop said:
require 'rubygems'
require 'pioneer'
require 'nokogiri'
require 'rexml/document'
require 'csv'
class Links < Pioneer::Base
  include REXML
  def locations
    ["http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI"]
  end
  def processing(req)
    doc = Nokogiri::HTML(req.response.response)
    doc.xpath('//a/@href').map do |links|
      links.content
    end
  end
end
class Crawler < Pioneer::Base
  include REXML
  def locations
    Links.new.start.flatten
  end
  def processing(req)
    xmldoc = REXML::Document.new(req.response.response)
    root = xmldoc.root
    id = root.attributes["id"]
    xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
      title = e.text.to_s
      CSV.open("eduction_normal.csv", "a") do |f|
        f << [id, title ...]
      end
    end
  end
end
Crawler.start
# or you can run 100 concurrent processes
Crawler.start(concurrency: 100)
If you really want to speed it up, you're going to have to go concurrent.
One of the simplest ways is to install JRuby and then run your application with one small modification: install either the 'peach' or 'pmap' gems and then change your items.each to items.peach(n) (parallel each), where n is the number of threads. You'll need at least one thread per CPU core, but if you put I/O in your loop then you'll want more.
Also, use Nokogiri, it's much faster. Ask a separate Nokogiri question if you need to solve something specific with Nokogiri. I'm sure it can do what you need.
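If installing JRuby or extra gems is not an option, a small worker pool can be sketched with plain Thread and Queue from the standard library. On MRI this helps mostly while threads are waiting on network I/O, not with CPU-bound parsing. The process_concurrently helper is hypothetical, and the block passed to it stands in for the real download-and-parse step (here it just upcases a string so the example runs offline).

```ruby
# Run the given block for every url using +thread_count+ worker threads and
# return a hash of url => result.
def process_concurrently(urls, thread_count, &fetch)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close                      # pop returns nil once the queue is drained

  results = Queue.new              # thread-safe collection of [url, value]
  workers = Array.new(thread_count) do
    Thread.new do
      while (url = queue.pop)
        results << [url, fetch.call(url)]
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }.to_h
end

# Stand-in for open(url).read plus parsing.
fetched = process_concurrently(%w[a b c d], 2) { |url| url.upcase }
p fetched
```

The order in which workers finish is not deterministic, which is why the results are keyed by url rather than collected in a plain list.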

Script that saves a series of pages then tries to combine them but only combines one?

Here's my code..
require "open-uri"
base_url = "http://en.wikipedia.org/wiki"
(1..5).each do |x|
  # sets up the url
  full_url = base_url + "/" + x.to_s
  # reads the url
  read_page = open(full_url).read
  # saves the contents to a file and closes it
  local_file = "my_copy_of-" + x.to_s + ".html"
  file = open(local_file,"w")
  file.write(read_page)
  file.close
  # open a file to store all entries in
  combined_numbers = open("numbers.html", "w")
  entrys = open(local_file, "r")
  combined_numbers.write(entrys.read)
  entrys.close
  combined_numbers.close
end
As you can see, it basically scrapes the contents of the Wikipedia articles 1 through 5 and then attempts to combine them into a single file called numbers.html.
It does the first bit right. But when it gets to the second, it only seems to write in the contents of the fifth article in the loop.
I can't see where I'm going wrong though. Any help?
You chose the wrong mode when opening your summary file. "w" overwrites existing files while "a" appends to existing files.
So use this to get your code working:
combined_numbers = open("numbers.html", "a")
Otherwise with each pass of the loop the file contents of numbers.html are overwritten with the current article.
Besides, I think you should use the contents of read_page to write to numbers.html instead of reading them back in from your freshly written file:
require "open-uri"
(1..5).each do |x|
  # set up and read url
  url = "http://en.wikipedia.org/wiki/#{x}"
  article = open(url).read
  # save the current article to a file
  # (IO.write is only available on 1.9.x; use open here too if on 1.8.x)
  IO.write("my_copy_of-#{x}.html", article)
  # add current article to summary file
  open("numbers.html", "a") do |f|
    f.write(article)
  end
end
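The difference between the two modes is easy to see offline with a couple of temporary files. This is just a small demonstration with made-up file names and strings in place of the downloaded pages.

```ruby
require 'tmpdir'

contents = Dir.mktmpdir do |dir|
  overwritten = File.join(dir, 'mode_w.html')
  appended    = File.join(dir, 'mode_a.html')

  %w[one two three].each do |page|
    File.open(overwritten, 'w') { |f| f.write(page) }  # truncates on every pass
    File.open(appended, 'a')    { |f| f.write(page) }  # adds to the end
  end

  [File.read(overwritten), File.read(appended)]
end

p contents  # => ["three", "onetwothree"]
```

Note that with "a" the summary file keeps growing across runs of the script, so it may need to be deleted or truncated once before the loop starts.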
