How to parse an XML file remotely from FTP with the Nokogiri gem, without downloading it - ruby

require 'net/ftp'
require 'nokogiri'
server = "xxxxxx"
user = "xxxxx"
password = "xxxxx"
ftp = Net::FTP.new(server, user, password)
files = ftp.nlst('File*.xml')
files.each do |file|
  ftp.getbinaryfile(file)
  doc = Nokogiri::XML(open(file))
  # some operations with doc
end
With the code above I'm able to parse the XML file, because it first downloads the file.
But how can I parse a remote XML file without downloading it?
The code above is part of a rake task that loads the Rails environment when run.
UPDATE:
I'm not going to create any file. I will import the info into MongoDB using Mongoid.

If you simply want to avoid using a temporary local file, it is possible to fetch the file contents directly as a String and process them in memory, by supplying nil as the local file name:
files.each do |file|
  xml_string = ftp.getbinaryfile(file, nil)
  doc = Nokogiri::XML(xml_string)
  # some operations with doc
end
This still does an FTP fetch of the contents, and XML parsing happens at the client.
It is not really possible to avoid fetching the data in some form. If FTP is the only protocol you have available, that means copying the data over the network with an FTP get. It is possible, but far more complicated, to add capabilities to your FTP (or other network) server so that it returns the data in some other form. That could include Nokogiri parsing done remotely on the server, but you would still need to serialise the end result, fetch it and deserialise it.
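As a minimal sketch of the in-memory approach described above, the fetch loop can be wrapped in a small helper. The name each_remote_file is hypothetical, and ftp is assumed to be an already connected Net::FTP instance; the only Net::FTP calls used are nlst and getbinaryfile with nil as the local file name.

```ruby
# Hypothetical helper: yields [remote_name, contents] for every remote file
# matching `pattern`, without writing anything to the local disk.
# `ftp` is assumed to be a connected Net::FTP instance (or any object that
# responds to #nlst and #getbinaryfile in the same way).
def each_remote_file(ftp, pattern)
  return enum_for(:each_remote_file, ftp, pattern) unless block_given?

  ftp.nlst(pattern).each do |name|
    # Passing nil as the local file name makes Net::FTP return the data
    # as a String instead of saving it to a file.
    yield name, ftp.getbinaryfile(name, nil)
  end
end
```

With this in place, the body of the rake task could look like each_remote_file(ftp, 'File*.xml') { |name, xml| doc = Nokogiri::XML(xml) }. The data still travels over the network; only the temporary file is avoided.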

Related

in Ruby open IO object and pass each line to another object

I need to download a large zipped file, unzip it and modify each string before I save them to an array.
I prefer to read the downloaded zipped file one line (entry) at a time and manipulate each line (entry) as it loads, rather than load the whole file into memory.
I experimented with many IO methods of opening a file this way, but I struggle to pass a line (entry) to a Zip::InputStream object. This is what I have:
require 'tempfile'
require 'zip'
require 'open-uri'
f = open(FILE_URL) #FILE_URL contains download path to .zip file
Zip::InputStream.open(f) do |io| #io is a String
  while (io.get_next_entry)
    io.each do |line|
      # manipulate the line and push it to an array
    end
  end
end
If I use open(FILE_URL).each do |zip_entry|, I cannot figure out how to pass zip_entry to Zip::InputStream. Simply calling Zip::InputStream.open(zip_entry) does not work...
Is this scenario possible, or do I have to download the content of the zipped file into a Tempfile completely? Any pointers to solve this will be helpful.

Ruby: Download zip file and extract

I have a Ruby script that downloads a remote ZIP file from a server using Ruby's open command. When I look into the downloaded content, it shows something like this:
PK\x03\x04\x14\x00\b\x00\b\x00\x9B\x84PG\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x10\x00foobar.txtUX\f\x00\x86\v!V\x85\v!V\xF6\x01\x14\x00K\xCB\xCFOJ,RH\x03S\\\x00PK\a\b\xC1\xC0\x1F\xE8\f\x00\x00\x00\x0E\x00\x00\x00PK\x01\x02\x15\x03\x14\x00\b\x00\b\x00\x9B\x84PG\xC1\xC0\x1F\xE8\f\x00\x00\x00\x0E\x00\x00\x00\n\x00\f\x00\x00\x00\x00\x00\x00\x00\x00#\xA4\x81\x00\x00\x00\x00foobar.txtUX\b\x00\x86\v!V\x85\v!VPK\x05\x06\x00\x00\x00\x00\x01\x00\x01\x00D\x00\x00\x00T\x00\x00\x00\x00\x00
I tried using the Rubyzip gem (https://github.com/rubyzip/rubyzip) along with its class Zip::ZipInputStream like this:
stream = open("http://localhost:3000/foobar.zip").read # this outputs the zip content from above
zip = Zip::ZipInputStream.new stream
Unfortunately, this throws an error:
Failure/Error: zip = Zip::ZipInputStream.new stream
ArgumentError:
string contains null byte
My questions are:
Is it possible, in general, to download a ZIP file and extract its content in-memory?
Is Rubyzip the right library for it?
If so, how can I extract the content?
I found the solution myself and then at stackoverflow :D (How to iterate through an in-memory zip file in Ruby)
input = HTTParty.get("http://example.com/somedata.zip").body
Zip::InputStream.open(StringIO.new(input)) do |io|
  while (entry = io.get_next_entry)
    puts entry.name
    parse_zip_content io.read
  end
end
Download your ZIP file. I'm using HTTParty for this, but you could also use URI.open from Ruby's open-uri library (require 'open-uri').
Convert it into a StringIO stream using StringIO.new(input).
Iterate over every entry inside the ZIP archive using io.get_next_entry (it returns an instance of Zip::Entry).
With io.read you get the content, and with entry.name you get the filename.
Like I commented in https://stackoverflow.com/a/43303222/4196440, we can just use Zip::File.open_buffer:
require 'open-uri'
require 'zip'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
content = URI.open('http://localhost:3000/foobar.zip')
Zip::File.open_buffer(content) do |zip|
  zip.each do |entry|
    puts entry.name
    # Do whatever you want with the content files.
  end
end

Open a local file with open-uri

I am doing data scraping with Ruby and Nokogiri. Is it possible to open and parse a local file on my computer?
I have:
require 'open-uri'
url = "file:///home/nav/Desktop/Scraping/scrap1.html"
It gives this error:
No such file or directory @ rb_sysopen - file:\home/nav/Desktop/Scraping/scrap1.html
If you want to parse a local file with Nokogiri you can do it like this.
file = File.read('/home/nav/Desktop/Scraping/scrap1.html')
doc = Nokogiri::HTML(file)
When you open a local file in a browser, the URL in the address bar is displayed as:
file:///Users/7stud/Desktop/accounts.txt
But that doesn't mean you use that format in a Ruby script. Your Ruby script doesn't send the file name to a browser and then ask the browser to retrieve the file. Your Ruby script searches your file system directly.
The same is true for URLs: your Ruby script doesn't ask your browser to go retrieve a page from the internet, Ruby retrieves the page itself by sending a request using your system's network interface. After all, a browser and a Ruby program are both just computer programs. What your browser can do over a network, a Ruby program can do, too.
This works for me:
require 'open-uri'
text = open('./data.txt').read
puts text
You have to get your path right, though. The only reason I can think of to use open() is if you had an array of filenames and URLs mixed together. If that isn't your situation, see new2code's answer.
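For the mixed case mentioned above (local file names and URLs in the same list), a minimal sketch could dispatch on the prefix of each entry. The helper name read_source is hypothetical. It assumes that http(s) locations go through open-uri's URI.open (on Ruby 3.0 and later, Kernel#open no longer opens URLs), that a browser-style file:// prefix is simply stripped, and that everything else is a plain path.

```ruby
require 'open-uri'

# Hypothetical helper: returns the contents of `location` as a String.
# - http:// and https:// locations are fetched with open-uri's URI.open
# - a browser-style file:// prefix is stripped and the rest is read as a path
# - anything else is treated as a local path
def read_source(location)
  case location
  when %r{\Ahttps?://}i
    URI.open(location, &:read)
  when %r{\Afile://}i
    File.read(location.sub(%r{\Afile://}i, ''))
  else
    File.read(location)
  end
end
```

The result can be passed straight to Nokogiri, for example Nokogiri::HTML(read_source('file:///home/nav/Desktop/Scraping/scrap1.html')). The prefix stripping is deliberately naive and does not handle forms such as file://localhost/....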
This is how I do it, according to the documentation.
f = File.open("/home/nav/Desktop/Scraping/scrap1.html")
doc = Nokogiri::HTML(f)
f.close
I would use Mechanize to save the file locally, then parse it with Nokogiri like so:
require 'mechanize'
# Save the file
agent = Mechanize.new
agent.pluggable_parser.default = Mechanize::Download
current_url = 'http://www.example.com'
file = agent.get(current_url)
# save! needs a file path rather than a directory; "page.html" is only an example name
path = "#{Rails.root}/tmp/page.html"
file.save!(path)
# Read the file
page = Nokogiri::HTML(File.read(path))
Hope that helps!

How to read file from s3?

I'm trying to read a CSV file directly from s3.
I'm getting the s3 URL but I am not able to open it as it's not in the local system. I don't want to download the file and read it.
Is there any other way to achieve this?
There are a few ways, depending on the gems that you are using. For example, one of the approaches from the official documentation:
s3 = Aws::S3::Client.new
resp = s3.get_object(bucket:'bucket-name', key:'object-key')
resp.body
#=> #<StringIO ...>
resp.body.read
#=> '...'
Or if you are using CarrierWave/Fog:
obj = YourModel.first
content = obj.attachment.read
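Either way, you end up with an IO-like object or a String held in memory, which Ruby's standard csv library can parse directly. The sketch below is only an illustration: csv_rows is a hypothetical helper, and body is assumed to respond to read, as the StringIO in resp.body above does. It also assumes the first line of the CSV is a header row.

```ruby
require 'csv'
require 'stringio'

# Hypothetical helper: parses CSV data from an IO-like object (anything that
# responds to #read, e.g. the body returned by get_object) entirely in memory.
# Assumes the first line of the data is a header row.
def csv_rows(body)
  CSV.parse(body.read, headers: true).map(&:to_h)
end

# Stand-in for an S3 response body, so the sketch runs without AWS credentials.
sample_body = StringIO.new("id,name\n1,Ann\n2,Bob\n")
rows = csv_rows(sample_body)
rows.each { |row| puts "#{row['id']}: #{row['name']}" }
```

With the AWS SDK this would be called as csv_rows(resp.body). The object is still transferred over the network, but nothing is written to the local file system.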
You can open the file from URL directly:
require 'open-uri'
csv = URI.open('http://server.com/path-to-your-file.csv').read
I don't think S3 provides any way of reading the file without downloading it.
What you can do is save it in a tempfile:
@temp_file = Tempfile.open("your_csv.csv")
@temp_file.close
`s3cmd get s3://#{@your_path} #{@temp_file.path}`
For further information: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/tempfile/rdoc/Tempfile.html

How to FTP in Ruby without first saving the text file

Since Heroku does not allow saving dynamic files to disk, I've run into a dilemma that I am hoping you can help me overcome. I have a text file that I can create in RAM. The problem is that I cannot find a gem or function that would allow me to stream the file to another FTP server. The Net/FTP gem I am using requires that I save the file to disk first. Any suggestions?
ftp = Net::FTP.new(domain)
ftp.passive = true
ftp.login(username, password)
ftp.chdir(path_on_server)
ftp.puttextfile(path_to_web_file)
ftp.close
The ftp.puttextfile function is what requires a physical file to exist.
StringIO.new provides an object that acts like an open file. It's easy to create a method like puttextfile that uses a StringIO object instead of a file.
require 'net/ftp'
require 'stringio'
class Net::FTP
  def puttextcontent(content, remotefile, &block)
    f = StringIO.new(content)
    begin
      storlines("STOR " + remotefile, f, &block)
    ensure
      f.close
    end
  end
end
file_content = <<filecontent
<html>
<head><title>Hello!</title></head>
<body>Hello.</body>
</html>
filecontent
ftp = Net::FTP.new(domain)
ftp.passive = true
ftp.login(username, password)
ftp.chdir(path_on_server)
ftp.puttextcontent(file_content, path_to_web_file)
ftp.close
David at Heroku gave a prompt response to a support ticket I entered there.
You can use APP_ROOT/tmp for temporary file output. The existence of files created in this dir is not guaranteed outside the life of a single request, but it should work for your purposes.
Hope this helps,
David
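If you go the temporary-file route suggested in that reply, a minimal sketch could look like the following. The helper name upload_via_tmp and the tmp_dir parameter are hypothetical; tmp_dir defaults to the system temp directory here and would be pointed at the app's tmp directory instead. ftp is assumed to be a connected Net::FTP instance.

```ruby
require 'tempfile'
require 'tmpdir'

# Hypothetical helper: writes `content` to a short-lived temp file inside
# `tmp_dir` and uploads it with Net::FTP#puttextfile. The temp file is
# removed as soon as the block finishes.
# `ftp` is assumed to be a connected Net::FTP instance.
def upload_via_tmp(ftp, content, remote_name, tmp_dir = Dir.tmpdir)
  Tempfile.create('upload', tmp_dir) do |file|
    file.write(content)
    file.flush # make sure the bytes are on disk before Net::FTP reads them
    ftp.puttextfile(file.path, remote_name)
  end
end
```

It would be called as upload_via_tmp(ftp, file_content, path_to_web_file). Because the file only lives for the duration of the block, it does not rely on anything surviving beyond a single request.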
