I am converting URLs into images and downloading them to files. The file can be a .jpg or a .pdf. I can successfully download the PDF, and the file clearly contains data (it has a non-zero size), but when I try to open it, Adobe Reader does not recognize it and deems it broken.
Here is a link to one of the URLs - http://www.finfo.se/www.artdb.finfo.se/cgi-bin/lankkod.dll/lev?knr=7770566&art=001317514&typ=PI
And here is the code:
require 'open-uri'
require 'tempfile'
require 'uri'
require 'csv'
DOWNLOAD_DIR = "#{Dir.pwd}/PI/"
CSV_FILE = "#{Dir.pwd}/konvertera4.csv"
def downloadFile(id, url, format)
  begin
    open("#{DOWNLOAD_DIR}#{id}.#{format}", "w") do |file|
      file << open(url).read
      puts "Successfully downloaded #{url} to #{DOWNLOAD_DIR}#{id}.#{format}"
    end
  rescue Exception => e
    puts "#{e} #{url}"
  end
end
CSV.foreach(CSV_FILE, headers: true, col_sep: ";") do |row|
  puts row
  next unless row[0] && row[1]
  id = row[0]
  format = row[1].match(/PI\.(.+)$/)&.captures.first
  puts format
  #format = "pdf"
  #format = row[1].match(/BD\.(.+)$/)&.captures.first
  url = row[1].gsub ".pdf", ""
  downloadFile(id, url, format)
end
Try using wb instead of w:
open("#{DOWNLOAD_DIR}#{id}.#{format}", "wb")
I am a Ruby newbie and I tried writing my first scraper today. It's a scraper designed to store recipes in a CSV file. However, I can't figure out why it doesn't work. Here is my code:
recipe.rb :
require 'csv'
require 'nokogiri'
require 'open-uri'
def write_csv(ingredient)
  doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
  doc.search(".m_contenu_resultat").first(10).each do |item|
    name = item.search('.m_titre_resultat a').text
    description = item.search('.m_texte_resultat').text
    cooking_time = item.search('.m_detail_time').text
    diff = item.search('.m_detail_recette').text.split('-')
    difficulty = diff[2]
    recipes = [name, description, cooking_time, difficulty]
    CSV.open('recueil.csv', 'wb') do |csv|
      csv << recipes
    end
  end
end
write_csv('chocolat')
Thank you so much for your answers, it helps me a lot!
IT WORKED! I changed my code as below, using a hash:
require 'csv'
require 'nokogiri'
require 'open-uri'
def write_csv(ingredient)
  recipes = []
  doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
  doc.search(".m_contenu_resultat").first(10).each do |item|
    name = item.search('.m_titre_resultat a').text
    description = item.search('.m_texte_resultat').text
    cooking_time = item.search('.m_detail_time').text
    diff = item.search('.m_detail_recette').text.split('-')
    difficulty = diff[2]
    recipes << {
      name: name,
      description: description,
      cooking_time: cooking_time, # without this key, the cooking_time column below stays empty
      difficulty: difficulty
    }
  end
  CSV.open('recueil.csv', 'a') do |csv|
    csv << ["name", "description", "cooking_time", "difficulty"]
    recipes.each do |recipe|
      csv << [
        recipe[:name],
        recipe[:description],
        recipe[:cooking_time],
        recipe[:difficulty]
      ]
    end
  end
end
write_csv('chocolat')
When you open your CSV file, you are overwriting the previous contents every time through the loop. You should either append to the file, like this:
CSV.open('recueil.csv', 'a') do |csv|
or you could open it before you start looping like this:
def write_csv(ingredient)
  doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
  csv = CSV.open('recueil.csv', 'wb')
  doc.search(".m_contenu_resultat").first(10).each do |item|
    name = item.search('.m_titre_resultat a').text
    description = item.search('.m_texte_resultat').text
    cooking_time = item.search('.m_detail_time').text
    diff = item.search('.m_detail_recette').text.split('-')
    difficulty = diff[2]
    recipes = [name, description, cooking_time, difficulty]
    csv << recipes
  end
  csv.close
end
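For what it's worth, the block form of CSV.open closes the file automatically, even if something inside the loop raises, so the explicit csv.close can be dropped. A sketch of the same loop:

require 'csv'
require 'nokogiri'
require 'open-uri'

def write_csv(ingredient)
  doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
  # The block form of CSV.open closes the file when the block exits.
  CSV.open('recueil.csv', 'wb') do |csv|
    doc.search(".m_contenu_resultat").first(10).each do |item|
      name = item.search('.m_titre_resultat a').text
      description = item.search('.m_texte_resultat').text
      cooking_time = item.search('.m_detail_time').text
      difficulty = item.search('.m_detail_recette').text.split('-')[2]
      csv << [name, description, cooking_time, difficulty]
    end
  end
end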
You don't specify what doesn't work or what the errors are, so I have to speculate.
I tried your script and had difficulties with the encoding: since the site is in French, there are lots of special characters.
Try again with this at the head of your script; it should solve at least that problem.
# encoding: utf-8
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
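If the output file still comes out mangled, you can also state the encoding explicitly when opening the CSV file (a small sketch on top of the answer above; the 'w:utf-8' mode string is standard Ruby IO):

require 'csv'

# 'w:utf-8' sets the external encoding of the output file to UTF-8,
# so accented French characters are written out correctly.
CSV.open('recueil.csv', 'w:utf-8') do |csv|
  csv << ["name", "description", "cooking_time", "difficulty"]
end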
I have some Ruby code that I'm using to download a CSV file from an FTP server.
However, right now it's not working, and it doesn't show any error message.
require 'net/ftp'
require 'sinatra' # the get '/romil' route below is a Sinatra handler
require 'fileutils'
get '/romil' do
  localfile = 'C:\\Users\\dell\\Desktop\\test1.csv'
  ftp = Net::FTP.new(CONTENT_SERVER_DOMAIN_NAME)
  ftp.login CONTENT_SERVER_FTP_LOGIN, CONTENT_SERVER_FTP_PASSWORD
  ftp.passive = true
  files = ftp.chdir('abhi/')
  files = ftp.list
  puts "list out of directory:"
  puts files
  ftp.gettextfile('test.csv', localfile, 1024)
  ftp.close
end
OK folks, I got the answer.
It's a little bit tricky; here is the working code:
get '/romil' do
  Net::FTP.open(CONTENT_SERVER_DOMAIN_NAME) do |ftp|
    ftp.login CONTENT_SERVER_FTP_LOGIN, CONTENT_SERVER_FTP_PASSWORD
    ftp.passive = true
    ftp.chdir('abhi/')
    files = ftp.list
    puts "list out of directory:"
    puts files
    ftp.gettextfile('test7.csv')
    filename = 'test7.csv'
    str = ''
    CSV.foreach(filename, headers: true) do |row|
      status 200
      headers "Content-Type" => "text/plain"
      str = str + row[0] + ' ' + row[1] + "\n"
    end
    body str
  end
end
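A side note on gettextfile: it performs line-ending conversion during the transfer. If the file must arrive byte-for-byte, getbinaryfile is the safer standard-library call; a minimal sketch using the same connection setup as above:

require 'net/ftp'

# The block form of Net::FTP.open closes the connection automatically.
Net::FTP.open(CONTENT_SERVER_DOMAIN_NAME) do |ftp|
  ftp.login CONTENT_SERVER_FTP_LOGIN, CONTENT_SERVER_FTP_PASSWORD
  ftp.passive = true
  ftp.chdir('abhi/')
  # getbinaryfile transfers the file unchanged, with no newline conversion.
  ftp.getbinaryfile('test7.csv', 'test7.csv')
end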
I'm trying to copy the content of a CSV file to another one.
First attempt:
FileUtils.cp(source_file, target_file)
Second attempt:
CSV.open(target_file, "w") {|file| file.truncate(0) }
CSV.open(target_file, "a+") do |csv|
CSV.foreach(source_file, :headers => true) do |row|
csv << row
end
end
Third attempt:
CSV.open(target_file, "w") {|file| file.truncate(0) }
File.open(source_file, "rb") do |input|
File.open(target_file, "wb") do |output|
while buff = input.read(4096)
output.write(buff)
end
end
end
Nothing is happening. What am I doing wrong?
When I run cp source_file target_file from the terminal, the target file is correctly created and copied. But in my app it just creates the target_file, without any content.
I'm trying to convert links into images in different formats (.jpg, .pdf, and so on). I tried it earlier today and it worked fine until the last 500 links, when my internet had a hiccup. So I removed all the already-converted links and was going to go at it again, but this time nothing works. The program runs but can't seem to download the images, and I get the error "404 not found".
require 'open-uri'
require 'tempfile'
require 'uri'
require 'csv'
DOWNLOAD_DIR = "#{Dir.pwd}/BD/"
CSV_FILE = "#{Dir.pwd}/konvertera.csv"
def downloadFile(id, url, format)
  open("#{DOWNLOAD_DIR}#{id}.#{format}", "wb+") do |file|
    file << open(url).read
    puts "Successfully downloaded #{url} to #{DOWNLOAD_DIR}#{id}.#{format}"
  end
rescue
  puts "404 not found #{url}"
end

CSV.foreach(CSV_FILE, headers: true, col_sep: ";") do |row|
  puts row[0], row[1]
  next unless row[0] && row[1]
  id = row[0]
  format = row[1].match(/BD\.(.+)$/)&.captures.first
  puts format
  url = row[1].gsub ".pdf", ""
  downloadFile(id, url, format)
end
I'm doing a scraper to download all the issues of The Exile available at http://exile.ru/archive/list.php?IBLOCK_ID=35&PARAMS=ISSUE.
So far, my code is like this:
require 'rubygems'
require 'open-uri'
DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)
BASE_exile_URL = "http://exile.ru/docs/pdf/issues/exile"
for number in 120..290
  numero = BASE_exile_URL + number.to_s + ".pdf"
  puts "Downloading issue #{number}"
  open(numero) { |f|
    File.open("#{DATA_DIR}/#{number}.pdf", 'w') do |file|
      file.puts f.read
    end
  }
end
puts "done"
The thing is, a lot of the issue links are down, yet the code creates a PDF for every issue, leaving an empty PDF behind when the link was dead. How can I change the code so that it only creates and copies a file if the link exists?
require 'open-uri'
DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)
url_template = "http://exile.ru/docs/pdf/issues/exile%d.pdf"
filename_template = "#{DATA_DIR}/%d.pdf"
(120..290).each do |number|
  pdf_url = url_template % number
  print "Downloading issue #{number}"
  # Opening the URL downloads the remote file.
  open(pdf_url) do |pdf_in|
    if pdf_in.read(4) == '%PDF'
      pdf_in.rewind
      File.open(filename_template % number, 'w') do |pdf_out|
        pdf_out.write(pdf_in.read)
      end
      print " OK\n"
    else
      print " #{pdf_url} is not a PDF\n"
    end
  end
end
puts "done"
open(url) downloads the file and provides a handle to a local temp file. A PDF starts with '%PDF'. After reading the first 4 characters, if the file is a PDF, the file pointer has to be put back to the beginning to capture the whole file when writing a local copy.
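One caveat not covered above: if a missing issue returns a real 404, open raises OpenURI::HTTPError before the %PDF check is ever reached. A sketch of wrapping the download, assuming the server does return 404 for some links:

begin
  open(pdf_url) do |pdf_in|
    if pdf_in.read(4) == '%PDF'
      pdf_in.rewind
      File.open(filename_template % number, 'w') { |out| out.write(pdf_in.read) }
      print " OK\n"
    else
      print " #{pdf_url} is not a PDF\n"
    end
  end
rescue OpenURI::HTTPError => e
  # open-uri raises on 4xx/5xx responses instead of yielding a handle
  print " #{pdf_url}: #{e.message}\n"
end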
You can use this code to check whether the file exists:
require 'net/http'

def exist_the_pdf?(url_pdf)
  url = URI.parse(url_pdf)
  Net::HTTP.start(url.host, url.port) do |http|
    # request_head fetches only the headers; Net::HTTP.start returns the
    # block's value, so the method returns true only for a PDF content type
    # (with puts here, the method would always return nil).
    http.request_head(url.path)['content-type'] == 'application/pdf'
  end
end
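Wired into the download loop from the question, it could be used roughly like this (hypothetical glue code, not part of the original answer):

require 'open-uri'

(120..290).each do |number|
  pdf_url = "http://exile.ru/docs/pdf/issues/exile#{number}.pdf"
  # Skip issues whose HEAD response does not report a PDF content type.
  next unless exist_the_pdf?(pdf_url)
  open(pdf_url) do |f|
    File.open("exile/#{number}.pdf", 'wb') { |out| out.write(f.read) }
  end
end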
Try this:
require 'rubygems'
require 'open-uri'
DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)
BASE_exile_URL = "http://exile.ru/docs/pdf/issues/exile"
for number in 120..290
  numero = BASE_exile_URL + number.to_s + ".pdf"
  open(numero) { |f|
    content = f.read
    if content.include? "Link is missing"
      puts "Issue #{number} doesn't exist"
    else
      puts "Issue #{number} exists"
      File.open("./#{number}.pdf", 'w') do |file|
        file.write(content)
      end
    end
  }
end
puts "done"
The main thing I added is a check for the string "Link is missing". I wanted to do it using HTTP status codes, but the server always gives back a 200, which is not best practice.
The thing to note is that with my code you always download the whole file just to look for that string, but I don't have any other idea to fix that at the moment.