Ruby webscraping csv encoding output - ruby

I know this is probably very basic for most of you, but I couldn't find an answer, so I have to ask. :)
The output in my web-scraping CSV file contains strange characters, like \u00F3, in place of Spanish accented letters. I probably need to do something at the end of my code, where the CSV is written, but I don't know what.
The other problem is that I'm getting only one array where there should be one for every line of the website.
Thanks
CODE:
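require 'nokogiri'
require 'open-uri'
require 'csv'
url = "(the url of the website)"
# URI.open comes from open-uri; a bare open(url) no longer accepts URLs in Ruby 3
page = Nokogiri::HTML(URI.open(url))
body = []
page.css('div.post-body.entry-content').each do |line|
  body << line.text.strip
end
puts body
# CSV
CSV.open("hello.csv", "w") do |file|
  file << [body]
end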

I figured it out! I just had to change the end of the script to the following:
CSV.open("hello.csv", "w+:UTF-16LE:UTF-8") do |file|
file << body
end
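The second part of the question (one row per scraped line) can be handled by appending each element as its own row. The sketch below is only an illustration: the `body` array is a hypothetical stand-in for the scraped text, and a temporary file replaces hello.csv. The \u00F3 escapes most likely appear because `file << [body]` writes the whole array as a single field, which stringifies it in its inspected form; writing plain strings in UTF-8 avoids that.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical stand-in for the text of each scraped node
body = ["canción", "más información", "año 2013"]

# Temporary file used here instead of hello.csv
file = Tempfile.new(["hello", ".csv"])
file.close

CSV.open(file.path, "w:UTF-8") do |csv|
  # One row per scraped element, each row holding a single field
  body.each { |text| csv << [text] }
end

rows = CSV.read(file.path, encoding: "UTF-8")
puts rows.inspect
file.unlink
```

If the file has to open cleanly in a spreadsheet program that expects UTF-16, the encoding in the mode string can be changed as in the answer above.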

Related

Stub a CSV file in Ruby Test::Unit

So I have a Ruby script which parses a CSV file as a list of rules and does some processing and returns an array of hashes as processed data.
CSV looks like this:
id_number,rule,notes
1,"dummy_rule","necessary"
2,"sample_rule","optional"
The parsing of CSV looks like this:
def parse_csv(file_name)
filtered_data = []
CSV.foreach(file_name, headers: true) do |row|
filtered_data << row # some processing
end
filtered_data
end
Now I am wondering if it is possible to stub/mock an actual CSV file for unit testing in such a way that I could pass a "filename" into this function. I could make a CSV object and use the generate function but then the actual filename would not be possible.
def test_parse_csv
expected_result = ["id_number"=>1, "rule"=>"dummy_rule", "notes"=>"optional"}]
# stub/mock csv with filename: say "rules.csv"
assert_equal(expected_result, parse_csv(file_name))
end
I use Test::Unit.
I also found a Ruby gem called mocha, but I don't know how it would work for CSVs:
https://github.com/freerange/mocha
Any thoughts are welcome!
I would create a support directory inside your test directory, then create a file in it:
# test/support/rules.csv
id_number,rule,notes
1,"dummy_rule","necessary"
2,"sample_rule","optional"
Your test should look like this. I also added the opening curly bracket that is missing on line 2 of your test.
def test_parse_csv
  expected_result = [{"id_number"=>1, "rule"=>"dummy_rule", "notes"=>"optional"}]
  # Adjust the expected values to whatever parse_csv's processing step returns
  assert_equal(expected_result.first, parse_csv(csv_file).first)
end
def csv_file
  # parse_csv expects a file name, so pass the path rather than the file's contents
  Rails.root.join('test', 'support', 'rules.csv')
end
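EDIT: Sorry, I should have noticed this wasn't Ruby on Rails. Here's a plain Ruby solution:
def test_parse_csv
  expected_result = [{"id_number"=>1, "rule"=>"dummy_rule", "notes"=>"optional"}]
  # Adjust the expected values to whatever parse_csv's processing step returns
  assert_equal(expected_result.first, parse_csv(csv_file).first)
end
def csv_file
  # parse_csv expects a file name, so return the path rather than the file's contents
  'test/support/rules.csv'
end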
In my test I used let(:raw_csv) { "row_number \n 1 \n 2" }, on the assumption that a CSV file is read as a plain string with line breaks and commas, and it works fine. Be careful with extra whitespace and commas; it is easy to make a mistake.
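Another option, if a fixture file in the repository is not wanted, is to write the CSV text to a temporary file and hand its path to the function. This is a minimal sketch, not something from the answers above: `with_csv_file` is a hypothetical helper, and `parse_csv` is assumed to convert each row to a plain hash as its "processing" step.

```ruby
require 'csv'
require 'tempfile'

# Same shape as the function in the question; converting each row
# to a Hash is an assumed stand-in for "some processing"
def parse_csv(file_name)
  filtered_data = []
  CSV.foreach(file_name, headers: true) do |row|
    filtered_data << row.to_h
  end
  filtered_data
end

# Hypothetical helper: writes the content to a temp file, yields its
# path, and removes the file once the block has finished
def with_csv_file(content)
  Tempfile.create(["rules", ".csv"]) do |f|
    f.write(content)
    f.flush
    yield f.path
  end
end

csv_text = <<~ROWS
  id_number,rule,notes
  1,"dummy_rule","necessary"
  2,"sample_rule","optional"
ROWS

result = with_csv_file(csv_text) { |path| parse_csv(path) }
puts result.inspect
```

Inside a Test::Unit test, the last two lines would become an assert_equal against the expected array of hashes. Note that every value comes back as a string unless converters are passed to CSV.foreach.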

Adding Headers to a created CSV file in Ruby - keep getting errors

I've been trying to use Ruby to create a CSV file from json data. I was able to create the file, but I need to add a few headers. I tried following suggestions and answers from similar questions posted here on Stack Overflow, but I keep getting errors. Can anyone give me some pointers?
Here's my code.
require 'csv'
require 'json'
CSV.open("your_csv.csv", "w") do |csv|
JSON.parse(File.open("tojson.txt").read).each do |hash|
csv << hash.values
#csv.each { |line| line['New_header'] = line[0].to_i + line[1].to_i }
end
end
And here is the error I'm getting:
Anyone have any suggestions?
That is not how you add headers to a CSV file. When you generate CSV content, a header row is just a regular row and should be written as such, before the data rows. Example:
CSV.open("your_csv.csv", "w") do |csv|
  csv << ['new_header', 'value1', 'value2'] # the headers
  JSON.parse(File.read("tojson.txt")).each do |hash|
    row = hash.values # generate the values for the headers above, in the same order
    csv << row
  end
end
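When the header names are simply the JSON keys, the header row can be taken from the first record. The following sketch assumes the JSON is an array of flat objects that all share the same keys; the sample data is made up for illustration, and CSV.generate is used so the result can be inspected as a string instead of a file.

```ruby
require 'csv'
require 'json'

# Made-up sample input; the real data would come from tojson.txt
json_text = '[{"name":"Ann","age":30},{"name":"Bob","age":25}]'
records = JSON.parse(json_text)

csv_text = CSV.generate do |csv|
  headers = records.first.keys
  csv << headers                      # the header row is written like any other row
  records.each do |hash|
    csv << hash.values_at(*headers)   # keep values in header order
  end
end

puts csv_text
```

Using values_at with the header list keeps the columns aligned even if a later record lists its keys in a different order.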
There is no #csv variable: in Ruby, # starts a comment, so that line never runs. The variable you have is csv.

How do I create a CSV file without the CSV class?

I'm using Mac OS X 10.8.
I'm trying to create a CSV file the old-fashioned way, but there is a bug in my code. It should create a spreadsheet with three rows, i.e., a header and two rows of data beneath it:
File.open('table.csv', 'w') do |f|
f.puts.each {|line| puts line}
'Date','Open','High','Low','Close','Volume','Adj Close'
'10/8/2013','1676.22','1676.79','1655.03','1655.45','3569230000','1655.45'
'10/7/2013','1687.15','1687.15','1674.7','1676.12','2678490000','1676.12'
end
Can someone fix this so that it works and explain what I am doing wrong?
Thanks
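File.open('table.csv', 'w') do |csv|
  csv.puts "Date,Open,High,Low,Close,Volume,Adj Close"
  csv.puts "10/8/2013,1676.22,1676.79,1655.03,1655.45,3569230000,1655.45"
  csv.puts "10/7/2013,1687.15,1687.15,1674.7,1676.12,2678490000,1676.12"
end
This should work. You don't need an array: each line is written as a plain comma-separated string. In your version, f.puts is called with no arguments, so it only writes a blank line and returns nil; calling .each on that fails, and the quoted lines below it are never passed to f.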
Try putting your data in an array and then loop over it like this:
data = [
  ['Date','Open','High','Low','Close','Volume','Adj Close'],
  ['10/8/2013','1676.22','1676.79','1655.03','1655.45','3569230000','1655.45'],
  ['10/7/2013','1687.15','1687.15','1674.7','1676.12','2678490000','1676.12']
]
File.open('table.csv', 'w') do |f|
  data.each { |line| f.puts line.join(',') }
end
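One caveat with joining fields by hand: a field that itself contains a comma, a double quote, or a line break has to be quoted, or the columns will shift. The sketch below shows the usual quoting convention (wrap the field in double quotes and double any embedded quotes). `csv_field` and `csv_line` are hypothetical helper names, and the Note column is made-up data added only to show the quoting.

```ruby
# Quote a single field only when it contains a comma, quote, or line break
def csv_field(value)
  text = value.to_s
  if text.match?(/[",\r\n]/)
    '"' + text.gsub('"', '""') + '"'
  else
    text
  end
end

# Build one CSV line from an array of fields
def csv_line(fields)
  fields.map { |f| csv_field(f) }.join(',')
end

rows = [
  ['Date', 'Open', 'Note'],
  ['10/8/2013', '1676.22', 'down, then flat'],
  ['10/7/2013', '1687.15', 'said "hold"']
]

lines = rows.map { |r| csv_line(r) }
puts lines
```

For the purely numeric data in the question this makes no difference, so a plain join(',') is enough there.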

How not to save to csv when array is empty

I'm parsing a website and expect potentially many millions of rows of content. However, CSV/Excel/ODS doesn't allow for more than a million rows.
That is why I'm trying to use a conditional to skip saving empty content. However, it's not working: my code keeps creating empty rows in the CSV.
This is the code I have:
# create csv
CSV.open("neverending.csv", "w") do |csv|
csv << ["kuk","date","name"]
# loop through all urls
File.foreach("neverendingurls.txt") do |line|
begin
doorzoekbarefile = Nokogiri::HTML(open(line))
for k in 1..999 do
# PROVISIONARY / CONDITIONAL
unless doorzoekbarefile.at_xpath("(//td[contains(@style, '60px')])[#{k}]").nil?
# xpaths
kuk = doorzoekbarefile.at_xpath("(//td[contains(@style, '60px')])[#{k}]")
date = doorzoekbarefile.at_xpath("(//td[contains(@style, '60px')])[#{k}]/following-sibling::*[1]")
name = doorzoekbarefile.at_xpath("(//td[contains(@style, '60px')])[#{k}]/following-sibling::*[2]")
# save to csv
csv << [kuk,date,name]
end
end
rescue
puts "error bij url #{line}"
end
end
end
Anybody have a clue what's going wrong or how to solve the problem? Basically I simply need to change the code so that it doesn't create a new row of csv data when the xpaths are empty.
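This really doesn't have anything to do with XPath. It's a simple Array#empty? check: drop the nil values with compact and write the row only if something is left.
row = [kuk,date,name]
csv << row unless row.compact.empty?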
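A self-contained sketch of the check: the row is appended only when compact.empty? is false, that is, when at least one value is not nil. The `scraped` array is made-up data standing in for the at_xpath results, and CSV.generate is used so the output can be inspected as a string.

```ruby
require 'csv'

# Made-up rows; nil stands for an at_xpath lookup that found nothing
scraped = [
  ["a1", "2013-10-08", "Ann"],
  [nil, nil, nil],
  ["b2", nil, "Bob"]
]

output = CSV.generate do |csv|
  csv << ["kuk", "date", "name"]
  scraped.each do |row|
    # compact drops nils; an all-nil row becomes [] and is skipped
    csv << row unless row.compact.empty?
  end
end

puts output
```

A row with only some values missing is still written, with empty fields where the nils were.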
BTW, your code is a mess. Learn how to indent, at least, before posting again.

Ruby: How to replace text in a file?

The following code is a line in an xml file:
<appId>455360226</appId>
How can I replace the number between the 2 tags with another number using ruby?
As far as I know, you cannot modify a file's content in place in a single step, at least not when the file size would change.
You have to read the file and store the modified text in another file.
replace = "100"
infile = "xmlfile_in"
outfile = "xmlfile_out"
File.open(outfile, 'w') do |out|
  out << File.read(infile).gsub(/<appId>\d+<\/appId>/, "<appId>#{replace}</appId>")
end
Or you read the file content into memory and afterwards overwrite the file with the modified content:
replace="100"
filename = "xmlfile_in"
outdata = File.read(filename).gsub(/<appId>\d+<\/appId>/, "<appId>#{replace}</appId>")
File.open(filename, 'w') do |out|
out << outdata
end
(Hope it works, the code is not tested)
You can do it in one line like this:
IO.write(filepath, File.open(filepath) {|f| f.read.gsub(/<appId>\d+<\/appId>/, "<appId>42</appId>")})
IO.write truncates the given file by default, so if you read the text first, perform the regex String#gsub and return the resulting string using File.open in block mode, it will replace the file's content in one fell swoop.
I like the way this reads, but it can be written in multiple lines too of course:
IO.write(filepath, File.open(filepath) do |f|
  f.read.gsub(/<appId>\d+<\/appId>/, "<appId>42</appId>")
end
)
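The reason this is safe is evaluation order: the argument (the fully read and substituted string) is computed before the write call opens and truncates the file. Here is a small sketch of the same read-then-overwrite idea using File.read and File.write; the temporary file and its contents are made up for the demonstration.

```ruby
require 'tempfile'

# Made-up XML file for the demonstration
file = Tempfile.new(['app', '.xml'])
file.write("<app>\n<appId>455360226</appId>\n</app>\n")
file.close
filepath = file.path

# The whole file is read and transformed first...
updated = File.read(filepath).gsub(/<appId>\d+<\/appId>/, '<appId>42</appId>')
# ...and only then is the file truncated and rewritten
File.write(filepath, updated)

result = File.read(filepath)
puts result
file.unlink
```

Because the entire content is held in memory, this approach suits small files such as configuration or manifest XML.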
replace = "100"
# Prints the modified content to standard output; the file itself is left unchanged
File.foreach("xmlfile") do |line|
  if line[/<appId>/]
    line.sub!(/<appId>\d+<\/appId>/, "<appId>#{replace}</appId>")
  end
  puts line
end
The right way is to use an XML parsing tool, an example of which is XmlSimple.
You did tag your question with regex. If you really must do it with a regex then
s = "Blah blah <appId>455360226</appId> blah"
s.sub(/<appId>\d+<\/appId>/, "<appId>42</appId>")
is an illustration of the kind of thing you can do but shouldn't.
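For comparison, here is a minimal sketch of the parser-based approach. It uses REXML, which ships with standard Ruby installations, rather than the XmlSimple library mentioned above; `replace_app_id` is a hypothetical helper name and the surrounding config element is made up.

```ruby
require 'rexml/document'

# Hypothetical helper: sets the text of the first appId element
def replace_app_id(xml, new_id)
  doc = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, '//appId')
  node.text = new_id.to_s if node
  doc.to_s
end

xml = '<config><appId>455360226</appId></config>'
updated = replace_app_id(xml, 42)
puts updated
```

Unlike the regex, this keeps working if the element gains attributes or the whitespace around it changes.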

Resources