Get filename on server while using Ruby gem Curb - ruby

Is there a way to get the filename of the file being downloaded (without having to parse the url provided)? I am hoping to find something like:
c = Curl::Easy.new("http://google.com/robots.txt")
c.perform
File.open( c.file_name, "w") { |file| file.write c.body_str }

Unfortunately, there's nothing in the Curb documentation about retrieving the filename. I don't know whether you have a particular aversion to parsing, but it's a simple process with the URI module:
require 'uri'
url = 'http://google.com/robots.txt'
uri = URI.parse(url)
puts File.basename(uri.path)
#=> "robots.txt"
UPDATE:
In the comments to this question, the OP suggests using split() to split the URL by slashes (/). While this may work in the majority of situations, it isn't a catch-all solution. For instance, versioned files won't be parsed correctly:
url = 'http://google.com/robots.txt?1234567890'
puts url.split('/').last
#=> "robots.txt?1234567890"
In comparison, URI.parse() separates the path from the query string, so File.basename(uri.path) returns the filename, and only the filename:
require 'uri'
url = 'http://google.com/robots.txt?1234567890'
uri = URI.parse(url)
puts File.basename(uri.path)
#=> "robots.txt"
In sum, for optimal coherence and integrity, it's wise to use the URI library to parse universal resources – it's what it was created for, after all.
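The URI-based approach above can be wrapped in a small helper. This is only a sketch: the name filename_from_url is hypothetical (it is not part of Curb or the standard library), and returning nil for URLs without a file component is an assumption about what a caller would want.

```ruby
require 'uri'

# Hypothetical helper: returns the last path segment of a URL,
# or nil when the URL has no file component (e.g. "http://host/").
def filename_from_url(url)
  path = URI.parse(url).path
  return nil if path.nil? || path.empty? || path.end_with?('/')

  File.basename(path)
end

puts filename_from_url('http://google.com/robots.txt?1234567890')
```

Because only uri.path is inspected, query strings and fragments never leak into the result.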

Related

Save and parse CSV file from URL

I am looking for an implementation that would allow me to download a CSV file from a browser (via a URL), to a point where I can open that file manually and view its contents in CSV form.
I have been doing some research and can see that I should use the IO, CSV or File classes.
I have a URL that looks something like:
"https://mydomain/manage/reporting/index?who=user&users=0&teams=0&datasetName=0&startDate=2015-10-18&endDate=2015-11-17&format=csv"
From what I have read I have:
href = page.find('#csv-download > a')['href']
csv_path = "https://mydomain/manage/reporting/index?who=user&users=0&teams=0&datasetName=0&startDate=2015-10-18&endDate=2015-11-17&format=csv"
require 'open-uri'
download = open(csv_path, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
IO.copy_stream(download, 'test.csv')
This actually outputs:
2684
Which tells me that I have successfully got the data?
When downloading the file, the contents are just
#<StringIO:0x00000003e07d30>
Would there be any reason for this?
I'm not sure where to go from here. Could anyone point me in the right direction, please?
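This should read from the remote URL, write the file, and then parse it:
require 'open-uri'
require 'csv'
url = "https://mydomain/manage/reporting/index?who=user&users=0&teams=0&datasetName=0&startDate=2015-10-18&endDate=2015-11-17&format=csv"
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
download = URI.open(url)
IO.copy_stream(download, 'test.csv')
# copy_stream leaves the stream at end-of-file, so rewind before parsing
download.rewind
CSV.new(download).each do |l|
  puts l
end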
If all you want to do is read a file and save it, it's simple. This untested code should be all that's required:
require 'open-uri'
CSV_PATH = "https://mydomain/manage/reporting/index?who=user&users=0&teams=0&datasetName=0&startDate=2015-10-18&endDate=2015-11-17&format=csv"
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
IO.copy_stream(
  URI.open(
    CSV_PATH,
    ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE
  ),
  'test.csv'
)
OpenURI's open returns an IO stream, which is all you need to make copy_stream happy.
More typically you'll see the open, read, write pattern: open creates the IO stream for the remote document, read retrieves its contents, and write outputs them to a file on your local disk. See their documentation for more information.
require 'open-uri'
CSV_PATH = "https://mydomain/manage/reporting/index?who=user&users=0&teams=0&datasetName=0&startDate=2015-10-18&endDate=2015-11-17&format=csv"
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
File.write(
  'test.csv',
  URI.open(
    CSV_PATH,
    ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE
  ).read
)
There might be a scalability advantage to using copy_stream for huge files that potentially wouldn't fit into memory. That'd be a test for the user.
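To see what copy_stream does without touching the network, here is a minimal offline sketch. It assumes a StringIO as a stand-in for the stream open-uri returns; the sample data and the temporary path are made up for illustration. It shows that the number printed in the question (2684) is simply the byte count returned by copy_stream, and that the stream has to be rewound before it can be read again.

```ruby
require 'stringio'
require 'tmpdir'

# Stand-in for the stream returned by open-uri (hypothetical sample data).
download = StringIO.new("id,name\n1,alice\n")
path = File.join(Dir.mktmpdir, 'test.csv')

# copy_stream returns the number of bytes copied, not the data itself.
bytes_copied = IO.copy_stream(download, path)

# The stream is now at end-of-file, so another read yields an empty string...
leftover = download.read

# ...until the stream is rewound.
download.rewind
again = download.read

saved = File.read(path)
puts "copied #{bytes_copied} bytes"
```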
Here is a one-liner I use. Of course if the file is huge - I might want to stream or download it first, but this works in 99% of cases, just fine.
require 'open-uri'
require 'csv'
csv_data = CSV.readlines(open(download_url), headers: true)
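As a rough illustration of what headers: true provides, the sketch below parses CSV text that is already in memory. The sample data and column names are hypothetical stand-ins for the downloaded body, and CSV.parse is used here in place of CSV.readlines because the input is a string rather than a file.

```ruby
require 'csv'

# Hypothetical stand-in for the text of the downloaded report.
csv_text = "name,score\nalice,10\nbob,7\n"

# With headers: true, each row is a CSV::Row that can be indexed by column name.
table = CSV.parse(csv_text, headers: true)
scores = table.map { |row| [row['name'], row['score'].to_i] }.to_h

p scores
```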

ruby rss module not reading full path

I am downloading an rss file posted as xml, and saving it with the rss extension.
I then use the rss module to read it as an rss file. The issue I have is the following:
If I create the file (page.rss) with an implicit path and use just that filename to process it with the rss parsing function, everything is fine (downloaded_file = 'page.rss').
If I explicitly enter the full path manually in the script (downloaded_file = "E:/Libraries/Documents/Android dev/page.rss"), everything works fine too.
But if I "calculate" the value of the absolute path with: downloaded_file = File.join(Dir.pwd, 'page.rss') the rss function fails. The value of the variable is apparently the same ("E:/Libraries/Documents/Android dev/page.rss") but there must be an invisible difference. I would like to be able to use the 'calculated' absolute path. I am sure there is a subtle difference in the way this string is interpreted by the rss function. How can I elucidate it?
Thanks for any suggestion.
Here is my script:
require 'rss'
require 'open-uri'
url = 'http://tutorialspoint.com/android/sampleXML.xml'
downloaded_file = File.join(Dir.pwd, 'page.rss') # FAILS
puts "Path = #{downloaded_file}"#=> "E:/Libraries/Documents/Android dev/page.rss"
downloaded_file = 'page.rss' # WORKS
#downloaded_file = "E:/Libraries/Documents/Android dev/page.rss" # WORKS
puts "Used path/filename: #{downloaded_file}"
File.open(downloaded_file, 'wb') do |file| # Download url content into rss file
file << open(url).read
end
rss = RSS::Parser.parse(downloaded_file, false) # Read rss from downloaded_file
puts "Title: #{rss.channel.title}"
NEW ANSWER
Okay, so your downloaded_file string has been marked as tainted, and RSS::Parser won't open a tainted file string for some reason (see rss/parser.rb around line 105 for more details). Note that this applies to older Rubies only: the taint mechanism was deprecated in Ruby 2.7 and removed in Ruby 3.2. The solution is either to untaint the downloaded_file string before you call parse, e.g.:
RSS::Parser.parse(downloaded_file.untaint, false)
or to just open the file for the parser, e.g.:
RSS::Parser.parse(File.open(downloaded_file), false)
I'd never run into this issue before, so thanks! I'd heard of object tainting before, but I never really had any use to look into it. There is a bit more information about it here: What are tainted objects, and when should we untaint them?.
PREVIOUS ANSWER
Dir.pwd is going to change depending on where you call the script from. Unless you are calling the script from E:/Libraries/Documents/Android dev, the filepath will be off.
It's better to build your filepath from the location of your script itself. To do so you can add:
ROOT = File.expand_path('..', __FILE__)
downloaded_file = File.join(ROOT, 'page.rss')
# or just downloaded_file = File.expand_path('../page.rss', __FILE__)

Match regex works for one search, but scan does not

The following gets me one match:
query = 'http://0.0.0.0:9393/review?first_name=aoeu&last_name=rar'
find = /(?<=(\?|\&)).*?(?=(\&|\z))/.match(query)
When I examine 'find' I get:
first_name=aoeu
I want to match everything between a '?' and a '&', so I tried
find = query.scan(/(?<=(\?|\&)).*?(?=(\&|\z))/)
But yet when I examine 'find' I now get:
[["?", "&"], ["&", ""]]
What do I need to do to get:
[first_name=aoeu][last_name=rar]
or
["first_name=aoeu","last_name=rar"]
?
Use String#split.
query.split(/[&?]/).drop(1)
or
query[/(?<=\?).*/].split("&")
But if your real purpose is to extract the parameters from a URL, see the related question on that topic and its answers.
Using a module provided by Ruby or Rails will make your code more maintainable and readable.
require 'uri'
uri = 'http://0.0.0.0:9393/review?first_name=aoeu&last_name=rar'
require 'rack'
require 'rack/utils'
Rack::Utils.parse_query(URI.parse(uri).query)
# => {"first_name"=>"aoeu", "last_name"=>"rar"}
# or CGI
require 'cgi'
CGI::parse(URI.parse(uri).query)
# => {"first_name"=>["aoeu"], "last_name"=>["rar"]}
If you need to extract query params from a URI, please check the thread "How to extract URL parameters from a URL with Ruby or Rails?". It contains many solutions that don't use regexps.
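As for why scan behaved differently from match in the question: when a pattern contains capture groups, String#scan returns the captured groups for each match instead of the matched text, which is where the pairs of delimiters came from. A sketch of the same pattern without capture groups follows (a character class replaces the first group, and the alternation in the lookahead is left ungrouped).

```ruby
query = 'http://0.0.0.0:9393/review?first_name=aoeu&last_name=rar'

# No capture groups, so scan returns the matched text itself.
pairs = query.scan(/(?<=[?&]).*?(?=&|\z)/)

# Optionally turn the "key=value" strings into a hash.
params = pairs.map { |pair| pair.split('=', 2) }.to_h

p pairs
p params
```

For anything beyond a quick script, the URI, Rack, and CGI approaches shown above also handle URL decoding, which this regexp does not.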

Getting webpage content with Ruby -- I'm having troubles

I want to get the content off this* page. Everything I've looked up gives the solution of parsing CSS elements; but, that page has none.
Here's the only code that I found that looked like it should work:
file = File.open('http://hiscore.runescape.com/index_lite.ws?player=zezima', "r")
contents = file.read
puts contents
Error:
tracker.rb:1:in 'initialize': Invalid argument - http://hiscore.runescape.com/index_lite.ws?player=zezima (Errno::EINVAL)
from tracker.rb:1:in 'open'
from tracker.rb:1
*http://hiscore.runescape.com/index_lite.ws?player=zezima
If you try to format this as a link in the post it doesn't recognize the underscore (_) in the URL for some reason.
What you really want is open() as extended by the OpenURI library, which can read from URIs. You just need to require the library first (on Ruby 3.0 and later, call URI.open instead, because Kernel#open no longer accepts URLs):
require 'open-uri'
Used like so:
require 'open-uri'
file = URI.open('http://hiscore.runescape.com/index_lite.ws?player=zezima')
contents = file.read
puts contents
This related SO thread covers the same question:
Open an IO stream from a local file or url
You can also fetch the content of a website through the Net::HTTP class in Ruby's standard library:
require 'uri'
require 'net/http'
url = "http://hiscore.runescape.com/index_lite.ws?player=zezima"
# Pass the whole URI so that the query string (?player=zezima) is sent too
r = Net::HTTP.get_response(URI.parse(url))
puts r.body
File.open() does not support URIs.
Please use open-uri; it supports both URIs and local files:
require 'open-uri'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
contents = URI.open('http://www.google.com') { |f| f.read }

Open an IO stream from a local file or url

I know there are libs in other languages that can take a string that contains either a path to a local file or a url and open it as a readable IO stream.
Is there an easy way to do this in ruby?
open-uri is part of the Ruby standard library. It extends open so that you can open a URL as well as a local file (on Ruby 3.0 and later, use URI.open, since Kernel#open no longer accepts URLs). For a URL it returns a StringIO or a Tempfile, and for a local path a File, so you can call methods like read and readlines either way.
require 'open-uri'
file_contents = URI.open('local-file.txt') { |f| f.read }
web_contents = URI.open('http://www.stackoverflow.com') { |f| f.read }
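If you would rather make the dispatch explicit, a small wrapper can decide between the two cases. This is a sketch under assumptions: the name open_source is hypothetical, and the regexp only recognises http, https, and ftp URLs.

```ruby
require 'open-uri'

# Hypothetical helper: opens a URL through open-uri and anything else as a local file.
def open_source(path_or_url, &block)
  if path_or_url.match?(%r{\A(?:https?|ftp)://}i)
    URI.open(path_or_url, &block)
  else
    File.open(path_or_url, 'r', &block)
  end
end
```

Usage mirrors the example above, e.g. open_source('local-file.txt') { |f| f.read }. Routing local paths through File.open rather than Kernel#open also avoids Kernel#open's special handling of strings that begin with a pipe character.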

Resources