404 not found. CSV and Ruby

I'm trying to convert links into images in different formats (jpg, pdf, and so on). I tried it earlier today and it worked fine until the last 500 links, because my internet connection had a hiccup. So I removed all the links that had already been converted and was going to go at it again, but this time nothing works. The program runs, but it can't seem to download the images and keeps reporting the error "404 not found".
require 'open-uri'
require 'tempfile'
require 'uri'
require 'csv'

DOWNLOAD_DIR = "#{Dir.pwd}/BD/"
CSV_FILE = "#{Dir.pwd}/konvertera.csv"

def downloadFile(id, url, format)
  open("#{DOWNLOAD_DIR}#{id}.#{format}", "wb+") do |file|
    file << open(url).read
    puts "Successfully downloaded #{url} to #{DOWNLOAD_DIR}#{id}.#{format}"
  end
rescue
  puts "404 not found #{url}"
end
CSV.foreach(CSV_FILE, headers: true, col_sep: ";") do |row|
  puts row[0], row[1]
  next unless row[0] && row[1]
  id = row[0]
  format = row[1].match(/BD\.(.+)$/)&.captures.first
  puts format
  url = row[1].gsub ".pdf", ""
  downloadFile(id, url, format)
end
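No accepted answer is shown here, but note that the bare rescue in downloadFile reports every failure as "404 not found", whatever actually went wrong. A minimal sketch (my addition, reusing the DOWNLOAD_DIR constant from above) of how open-uri's own exception could be used to see the real HTTP status:

require 'open-uri'

# Sketch only: rescue OpenURI::HTTPError so the real status is shown
# instead of assuming every failure is a 404.
def download_file(id, url, format)
  File.open("#{DOWNLOAD_DIR}#{id}.#{format}", "wb") do |file|
    file << open(url).read          # URI.open on Ruby 2.5+
  end
  puts "Successfully downloaded #{url}"
rescue OpenURI::HTTPError => e
  puts "HTTP error for #{url}: #{e.message}"   # e.g. "404 Not Found", "500 Internal Server Error"
rescue SocketError, Errno::ECONNRESET => e
  puts "Network error for #{url}: #{e.class}"
end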

Related

Ruby script. Connection reset by peer (Errno::ECONNRESET)

I have this code:
require 'openssl'
require 'open-uri'
require 'rubygems'
require 'nokogiri'
require 'image_downloader'

uri = 'https://aliexpress.com/item/2016-Hot-Sale-Brand-Clothing-Spring-Suit-Blazer-Men-Fashion-Slim-Fit-Masculine-Blazer-Casual-Solid/32759268261.html'
doc = Nokogiri::HTML(open(uri))

strings = doc.text.strip!.split('""')                                    # get an array of quoted links
strings.delete_if { |bite| (bite.include? "alicdn.com/kf/") == false }   # drop links that do not include alicdn.com/kf/
strings.map! { |link| URI.extract(URI.encode("#{link}")) }               # encode all links
strings.each_with_index { |f, i| strings[i] = f[0] }                     # flatten the two-dimensional array into a one-dimensional one

strings.each_with_index do |file, i|
  open(file) { |f|
    File.open("blazer#{i}.jpg", "wb") do |out|   # save image to the current directory
      out.puts f.read
    end
  }
end
On Ubuntu it works fine, but on Linux Mint 18.1 Cinnamon it gives an error message:
/usr/lib/ruby/2.3.0/openssl/buffering.rb:322:in `syswrite': Connection
reset by peer (Errno::ECONNRESET)
I googled it but didn't find a useful answer. Can anybody help me?
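No answer is shown for this question. As a general workaround (an assumption on my part, not a posted solution), Errno::ECONNRESET is usually transient, so wrapping the download in a small retry helper often gets past it:

require 'open-uri'
require 'openssl'

# Hypothetical retry wrapper (my addition): re-attempt the download a few
# times before giving up, since connection resets are often transient.
def open_with_retries(url, attempts = 3)
  open(url).read
rescue Errno::ECONNRESET, OpenSSL::SSL::SSLError
  attempts -= 1
  retry if attempts > 0
  raise
end

The last loop in the question could then call open_with_retries(file) instead of open(file) and write the returned string to disk.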

Ruby, CSV and PDFs

So I am converting URLs into images and downloading them into a directory. The file can be a .jpg or a .pdf. I can successfully download the PDF, and it is not empty (it does take up space on disk), but when I try to open it, Adobe Reader does not recognize it and deems it broken.
Here is a link to one of the URLs - http://www.finfo.se/www.artdb.finfo.se/cgi-bin/lankkod.dll/lev?knr=7770566&art=001317514&typ=PI
And here is the code:
require 'open-uri'
require 'tempfile'
require 'uri'
require 'csv'

DOWNLOAD_DIR = "#{Dir.pwd}/PI/"
CSV_FILE = "#{Dir.pwd}/konvertera4.csv"

def downloadFile(id, url, format)
  begin
    open("#{DOWNLOAD_DIR}#{id}.#{format}", "w") do |file|
      file << open(url).read
      puts "Successfully downloaded #{url} to #{DOWNLOAD_DIR}#{id}.#{format}"
    end
  rescue Exception => e
    puts "#{e} #{url}"
  end
end
CSV.foreach(CSV_FILE, headers: true, col_sep: ";") do |row|
  puts row
  next unless row[0] && row[1]
  id = row[0]
  format = row[1].match(/PI\.(.+)$/)&.captures.first
  puts format
  #format = "pdf"
  #format = row[1].match(/BD\.(.+)$/)&.captures.first
  url = row[1].gsub ".pdf", ""
  downloadFile(id, url, format)
end
Try using wb instead of w: in text mode ("w"), Windows translates "\n" into "\r\n" when writing, which corrupts binary data such as PDFs, while binary mode ("wb") writes the bytes through untouched:
open("#{DOWNLOAD_DIR}#{id}.#{format}", "wb")

Trying to run a file but receiving an error. Works fine on another computer

This is the error when I enter the URI:
/Users/wiggum/.rvm/rubies/ruby-2.2.0/lib/ruby/2.2.0/uri/rfc3986_parser.rb:66:in `split': bad URI(is not URI?): http://www.treasuredata.com (URI::InvalidURIError)
from /Users/wiggum/.rvm/rubies/ruby-2.2.0/lib/ruby/2.2.0/uri/rfc3986_parser.rb:72:in `parse'
from /Users/wiggum/.rvm/rubies/ruby-2.2.0/lib/ruby/2.2.0/uri/common.rb:226:in `parse'
from sitecrawl.rb:11:in `<main>'
Here is my code, which runs fine on my other computer. Any suggestions?
require 'Spidr'
require 'csv'
require 'Nokogiri'
require 'open-uri'

puts "What is the website you are looking to crawl?"
site = gets

# make a filename
f2 = ".csv"
f1 = URI.parse(site).host
filename = "#{f1}#{f2}"

CSV.open(filename, "wb") do |csv|
  csv << ["Url", "Title Tag", "H1 Tags", "Meta Desc"]
  Spidr.site(site) do |spider|
    spider.every_url do |url|
      page  = Nokogiri::HTML(open(url)) rescue nil
      title = page.xpath('//title') rescue nil
      desc  = page.xpath("//meta[@name='description']/@content") rescue nil
      h1    = page.xpath('//h1') rescue nil
      puts "#{url} #{title}"
      puts "#{h1} #{desc}"
      csv << ["#{url}", "#{title}", "#{h1}", "#{desc}"]
    end
  end
end
No idea why it works on your other computer; it shouldn't work anywhere. gets grabs the entire string that you enter, including the trailing newline, so the string you're trying to parse is actually http://www.treasuredata.com\n, which is not a valid URI.
Change your gets to a gets.chomp
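For example, applied to the question's code:

puts "What is the website you are looking to crawl?"
site = gets.chomp   # chomp removes the trailing "\n" so URI.parse gets a clean URL
f1 = URI.parse(site).host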

How do I parse XML nodes from an API request?

How do I save the information from an XML page that I got from an API?
The URL is "http://api.url.com?number=8-6785503" and it returns:
<OperatorDataContract xmlns="http://psgi.pts.se/PTS_Number_Service" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
  <Name>Tele2 Sverige AB</Name>
  <Number>8-6785503</Number>
</OperatorDataContract>
How do I parse the Name and Number nodes to a file?
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://api.url.com?number=8-6785503"
doc = Nokogiri::XML(open(url))

File.open("exporterad.txt", "w") do |file|
  doc.xpath("//*").each do |item|
    title = item.xpath('//result[group_name="Name"]')
    phone = item.xpath("/Number").text.strip
    puts "#{title} ; \n"
    puts "#{phone} ; \n"
    company = " #{title}; #{phone}; \n\n"
    file.write(company.gsub(/^\s+/, ''))
  end
end
Besides the fact that your code isn't valid Ruby, you're making it a lot harder than necessary, at least for a simple scrape and save:
require 'nokogiri'
require 'open-uri'

url = "http://api.pts.se/PTSNumberService/Pts_Number_Service.svc/pox/SearchByNumber?number=8-6785503"
doc = Nokogiri::XML(open(url))

File.open("exported.txt", "w") do |file|
  name = doc.at('Name').text
  number = doc.at('Number').text
  file.puts name
  file.puts number
end
Running that results in a file called "exported.txt" that contains:
Tele2 Sverige AB
8-6785503
You can build upon that as necessary.
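If you want the semicolon-separated "Name; Number;" line the question's code was aiming for, one possible variation (my sketch, building on the answer above) is:

require 'nokogiri'
require 'open-uri'

url = "http://api.pts.se/PTSNumberService/Pts_Number_Service.svc/pox/SearchByNumber?number=8-6785503"
doc = Nokogiri::XML(open(url))

File.open("exporterad.txt", "w") do |file|
  # Same lookups as above, joined into the "Name; Number;" format used in the question.
  file.puts "#{doc.at('Name').text}; #{doc.at('Number').text};"
end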

Download a file only if it exists with Ruby

I'm writing a scraper to download all the issues of The Exile available at http://exile.ru/archive/list.php?IBLOCK_ID=35&PARAMS=ISSUE.
So far, my code is like this:
require 'rubygems'
require 'open-uri'

DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)

BASE_exile_URL = "http://exile.ru/docs/pdf/issues/exile"

for number in 120..290
  numero = BASE_exile_URL + number.to_s + ".pdf"
  puts "Downloading issue #{number}"
  open(numero) { |f|
    File.open("#{DATA_DIR}/#{number}.pdf", 'w') do |file|
      file.puts f.read
    end
  }
end
puts "done"
puts "done"
The thing is, a lot of the issue links are down, and the code creates a PDF for every issue, so a missing issue leaves an empty PDF behind. How can I change the code so that it only creates and writes a file if the link exists?
require 'open-uri'

DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)

url_template      = "http://exile.ru/docs/pdf/issues/exile%d.pdf"
filename_template = "#{DATA_DIR}/%d.pdf"

(120..290).each do |number|
  pdf_url = url_template % number
  print "Downloading issue #{number}"

  # Opening the URL downloads the remote file.
  open(pdf_url) do |pdf_in|
    if pdf_in.read(4) == '%PDF'
      pdf_in.rewind
      File.open(filename_template % number, 'w') do |pdf_out|
        pdf_out.write(pdf_in.read)
      end
      print " OK\n"
    else
      print " #{pdf_url} is not a PDF\n"
    end
  end
end
puts "done"
open(url) downloads the file and provides a handle to a local temp file. A PDF starts with '%PDF'. After reading the first 4 characters, if the file is a PDF, the file pointer has to be put back to the beginning to capture the whole file when writing a local copy.
You can use this code to check whether the file exists:
require 'net/http'

def exist_the_pdf?(url_pdf)
  url = URI.parse(url_pdf)
  Net::HTTP.start(url.host, url.port) do |http|
    # Return true when the HEAD response says the resource is a PDF.
    http.request_head(url.path)['content-type'] == 'application/pdf'
  end
end
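One way this check could be wired into the download loop from the question (my sketch; the URL pattern and the "exile" directory come from the code above):

require 'net/http'
require 'open-uri'

(120..290).each do |number|
  pdf_url = "http://exile.ru/docs/pdf/issues/exile#{number}.pdf"
  next unless exist_the_pdf?(pdf_url)   # skip issues whose link does not serve a PDF

  File.open("exile/#{number}.pdf", 'wb') do |file|
    file.write(open(pdf_url).read)
  end
  puts "Downloaded issue #{number}"
end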
Try this:
require 'rubygems'
require 'open-uri'

DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)

BASE_exile_URL = "http://exile.ru/docs/pdf/issues/exile"

for number in 120..290
  numero = BASE_exile_URL + number.to_s + ".pdf"
  open(numero) { |f|
    content = f.read
    if content.include? "Link is missing"
      puts "Issue #{number} doesn't exist"
    else
      puts "Issue #{number} exists"
      File.open("./#{number}.pdf", 'w') do |file|
        file.write(content)
      end
    end
  }
end
puts "done"
The main thing I added is a check for whether the response contains the string "Link is missing". I wanted to do it using HTTP status codes, but the server always gives a 200 back, which is not best practice.
Note that with my code you still download the whole file just to look for that string; I don't have a better idea for fixing that at the moment.
