Read header data from files on remote server - ftp

I'm working on a project right now where I need to read header data from files on remote servers. I'm talking about many large files, so I can't read whole files, just the header data I need.
The only solution I have is to mount the remote server with FUSE and then read the headers from the files as if they were on my local computer. I've tried it and it works. But it has some drawbacks, especially with FTP:
Really slow (FTP compared to SSH with curlftpfs). From the same server, 90 files were read in 18 seconds over SSH, but only 10 files in 39 seconds over FTP.
Not dependable. Sometimes the mount point will not be unmounted.
If the server is active and a passive mount is done, that mount point and the parent folder get locked for about 3 minutes.
It times out, even while data is being transferred (I guess this is the FTP protocol and not curlftpfs).
FUSE is a solution, but I don't like it very much because I don't feel I can trust it. So my question is basically whether there are any other solutions to the problem. Ruby is preferred, but any other language will work if Ruby does not support the solution.
Thanks!

What type of information are you looking for?
You could try using Ruby's open-uri module.
The following example is from http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/index.html
require 'open-uri'

open("http://www.ruby-lang.org/en") {|f|
  p f.base_uri         # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
  p f.content_type     # "text/html"
  p f.charset          # "iso-8859-1"
  p f.content_encoding # []
  p f.last_modified    # Thu Dec 05 02:45:02 UTC 2002
}
EDIT: It seems that the OP wanted to retrieve ID3 tag information from the remote files. This is more complex, and appears to be a difficult problem. From Wikipedia:
Tag location within file
Only with the ID3v2.4 standard has it been possible to place the tag data at the end of the file, in common with ID3v1. ID3v2.2 and 2.3 require that the tag data precede the file. Whilst for streaming data this is absolutely required, for static data it means that the entire audio file must be updated to insert data at the front of the file. For initial tagging this incurs a large penalty as every file must be re-written. Tag writers are encouraged to introduce padding after the tag data in order to allow for edits to the tag data without requiring the entire audio file to be re-written, but these are not standard and the tag requirements may vary greatly, especially if APIC (associated pictures) are also embedded.
This means that depending on the ID3 tag version of the file, you may have to read different parts of the file.
Here's an article that outlines the basics of reading ID3 tags in Ruby. It covers ID3 tag v1.1, but it should serve as a good starting point: http://rubyquiz.com/quiz136.html
You could also look into using an ID3 parsing library, such as id3.rb or id3lib-ruby; however, I'm not sure whether either supports parsing a remote file (most likely they could with some modifications).
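For illustration only, here is a minimal sketch of decoding the 10-byte ID3v2 header once you have the first bytes of a file in hand; parse_id3v2_header is a hypothetical helper, not part of any of the libraries above. The layout follows the ID3v2 spec: "ID3", two version bytes, one flags byte, and a four-byte syncsafe size.

# Hypothetical helper: decode a 10-byte ID3v2 header from the first bytes
# of an MP3 file. Returns nil if the bytes don't start with an ID3v2 tag.
def parse_id3v2_header(bytes)
  return nil unless bytes.bytesize >= 10 && bytes[0, 3] == "ID3"
  major, revision, flags = bytes.bytes[3, 3]
  # The tag size is stored as four "syncsafe" bytes: only the low 7 bits of each count.
  size = bytes.bytes[6, 4].inject(0) { |total, b| (total << 7) | (b & 0x7F) }
  { version: "2.#{major}.#{revision}", flags: flags, tag_size: size }
end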

A "best-as-nothing" solution would be to start the transfer, and stop it when dowloaded file has more than bytes. Since not many (if any) libraries will allow interruption of the connection, it is more complex and will probably require you to manually code a specific ftp client, with two threads, one doing the FTP connection and transfer, and the other monitoring the size of the downloaded file and killing the first thread.
Or, at least, you could parallelize the file transfers, so that you don't wait for all the files to be fully transferred before analyzing the start of each file. The transfers can then continue in the background while you work.

There has been a proposal for a RANG command, which would allow retrieving only part of a file (here, the first bytes).
However, I didn't find any reference to this proposal being adopted, nor any implementation.
So, for a specific server, it could be worth testing (or checking the FTP server's docs) and using it if available.
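If you want to check a particular server, a speculative probe could look like the sketch below; it only inspects the FEAT response, and few (if any) servers will actually advertise RANG. The host and credentials are placeholders.

require 'net/ftp'

# Ask the server which extensions it advertises and look for RANG.
Net::FTP.open("ftp.example.com") do |ftp|
  ftp.login("user", "password")
  begin
    features = ftp.sendcmd("FEAT")
    puts features.include?("RANG") ? "RANG advertised" : "RANG not advertised"
  rescue Net::FTPPermanentError
    puts "Server does not support FEAT at all"
  end
end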

Related

Unable to send data in chunks to server in Golang

I'm completely new to Golang. I am trying to send a file from the client to the server. The client should split it into smaller chunks and send them to the REST endpoint exposed by the server. The server should combine those chunks and save the file.
This is the client and server code I have written so far. When I run it to copy a file of 39 bytes, the client sends two requests to the server, but the server displays the following errors:
2017/05/30 20:19:28 Was not able to access the uploaded file: unexpected EOF
2017/05/30 20:19:28 Was not able to access the uploaded file: multipart: NextPart: EOF
You are dividing the buffer holding the file into separate chunks and sending each of them as a separate HTTP message. This is not how multipart is intended to be used.
multipart MIME means that a single HTTP message may contain one or more entities. Quoting the HTTP RFC:
MIME provides for a number of "multipart" types -- encapsulations of one or more entities within a single message-body. All multipart types share a common syntax, as defined in section 5.1.1 of RFC 2046
You should send the whole file in a single HTTP message (the file contents should be a single entity). The HTTP protocol will take care of the rest, but you may consider using FTP if the files you are planning to transfer are large (like > 2GB).
If you are using multipart/form-data, then it is expected that you take the entire file and send it up as a single byte stream. Go can handle multi-gigabyte files easily this way, but your code needs to be smart about it.
ioutil.ReadAll(r.Body) is out of the question unless you know for sure that the file will be very small. Please don't do this.
Use a multipart reader: multipartReader, err := r.MultipartReader(). This will iterate over the uploaded files in the order they are included in the encoding. This is important because it lets you keep the file entirely out of memory and do a Copy from one file handle to another. This is how large files are handled easily.
You will have issues with middle-boxes and reverse proxies. We had to change defaults in Nginx so that it would not cut off large files. Nginx (or whatever reverse proxy you might use) will need to cooperate, as they often default to some really tiny maximum file size, like 300MB.
Even if you think you have dealt with this issue on upload with some file-part trick, you will then need to deal with large files on download. Go can serve single large files very efficiently by doing a Copy from file handle to file handle. You will also end up needing to support partial content (HTTP 206) and not modified (304) if you want great performance for downloading the files that you uploaded. Some browsers will ignore your pleas not to ask for partial content when things like large video are involved, so if you don't support this, some content will fail to download.
If you want to use some trick to cut up files and send them in parts, then you will end up needing to use a particular JavaScript library. This is going to be quite harmful to interoperability if you are going for programmatic access from any client to your Go server. But maybe you can't fix the middle-boxes that impose size limits, and you really want to cut files up into chunks. In that case, you will have a lot of work ahead to handle downloading the files that you managed to upload in chunks.
What you are trying to do is typically written against a raw TCP connection in most other languages. In Go you can use TCP too, with net.Listen and then Accept on the listener object. That should work fine.

Read and write file atomically

I'd like to read and write a file atomically in Ruby between multiple independent Ruby processes (not threads).
I found atomic_write from ActiveSupport. This writes to a temp file, then moves it over the original and sets all permissions. However, this does not prevent the file from being read while it is being written.
I have not found any atomic_read. (Are file reads already atomic?)
Do I need to implement my own separate 'lock' file that I check for before reads and writes? Or is there a better mechanism already present in the file system for flagging a file as 'busy' that I could check before any read/write?
The motivation is dumb, but included here because you're going to ask about it.
I have a web application using Sinatra and served by Thin which (for its own reasons) uses a JSON file as a 'database'. Each request to the server reads the latest version of the file, makes any necessary changes, and writes out changes to the file.
This would be fine if I only had a single instance of the server running. However, I was thinking about having multiple copies of Thin running behind an Apache reverse proxy. These are discrete Ruby processes, and thus running truly in parallel.
Upon further reflection I realize that I really want to make the act of read-process-write atomic. At which point I realize that this basically forces me to process only one request at a time, and thus there's no reason to have multiple instances running. But the curiosity about atomic reads, and preventing reads during write, remains. Hence the question.
You want to use File#flock in exclusive mode. Here's a little demo. Run this in two different terminal windows.
filename = 'test.txt'

File.open(filename, File::RDWR) do |file|
  file.flock(File::LOCK_EX)
  puts "content: #{file.read}"
  puts 'doing some heavy-lifting now'
  sleep(10)
end
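To make the whole read-process-write cycle from the question atomic across processes, you can hold the exclusive lock for the duration of the update. A minimal sketch, assuming the JSON "database" from the question; update_db and the path are placeholders, not an existing API:

require 'json'

# Hypothetical helper: read, mutate, and rewrite a JSON file while holding
# an exclusive flock, so concurrent processes serialize on the whole cycle.
def update_db(path)
  File.open(path, File::RDWR | File::CREAT, 0644) do |file|
    file.flock(File::LOCK_EX)            # blocks until no other process holds the lock
    data = file.read
    db = data.empty? ? {} : JSON.parse(data)
    yield db                             # mutate the hash in place
    file.rewind
    file.write(JSON.pretty_generate(db))
    file.flush
    file.truncate(file.pos)              # drop leftover bytes from a longer old version
  end                                    # lock released when the file is closed
end

update_db("db.json") { |db| db["counter"] = db.fetch("counter", 0) + 1 }

Note that for this to protect readers as well, plain readers should take at least a shared lock (file.flock(File::LOCK_SH)) before reading, since the write here happens in place.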
Take a look at the transaction and open_and_lock_file methods in "pstore.rb" (Ruby stdlib).
YAML::Store works fine for me, so when I need to read/write atomically I (ab)use it to store data as a Hash.
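A small sketch of that approach, with an illustrative path and key; PStore/YAML::Store transactions handle the file locking for you, and write via a temp file plus rename if you enable ultra_safe:

require 'yaml/store'

store = YAML::Store.new("db.yml")
store.ultra_safe = true              # commit by writing a temp file and renaming it

store.transaction do                 # exclusive (read-write) transaction
  store["counter"] = (store["counter"] || 0) + 1
end

count = store.transaction(true) { store["counter"] }  # read-only transaction
puts count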

Storing and processing large XML files with Heroku?

I'm working on an application that needs to store a large 2GB+ XML file for processing, and I'm facing two problems:
How do I process the file? Loading the whole file into Nokogiri at once won't work. It quickly eats up memory and, as far as I can tell, the process gets nuked from orbit. Are there Heroku-compatible ways to quickly/easily read a large XML file located on a non-Heroku server in smaller chunks?
How do I store the file? The site is set up to use S3, but the data provider needs FTP access to upload the XML file nightly. S3 via FTP is apparently a no-go, and storing the file on Heroku won't work either, as it'll only be seen by the dyno that owns it and is susceptible to being randomly purged. Has anyone encountered this type of constraint before, and if so, how'd you work around it?
Most of the time we prefer parsing the entire file after pulling it into memory, because it's easier to jump back and forth, extracting this and that as our code needs. Because it's in memory, we can do random access easily if we want.
For your need, you'll want to start at the top of the file, and read each line, looking for the tags of interest, until you get to the end of the file. For that, you want to use Nokogiri::XML::SAX and Nokogiri::XML::SAX::Parser, along with the events in Nokogiri::XML::SAX::Document. Here's a summary of what it does, from Nokogiri's site:
The basic way a SAX style parser works is by creating a parser, telling the parser about the events we're interested in, then giving the parser some XML to process. The parser will notify you when it encounters events you said you would like to know about.
SAX is a different beast than dealing with the DOM, but it can be very fast, and is a lot easier on memory.
If you wanted to load the file in smaller chunks, you could process the XML inside an OpenURI.open or Net::HTTP block, so you'd be getting it in TCP packet-size chunks. The problem then is that your lines could be split, because TCP doesn't guarantee reading by lines, but by blocks, which is what you'll see inside the read loop. Your code would have to peel off partial lines at the end of the buffer, and then prepend them to the read buffer so the next block read finishes the line.
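A minimal SAX sketch along those lines, assuming the big file consists of repeated <record> elements; the element name, URL, and the counting logic are placeholders for whatever your schema actually needs:

require 'nokogiri'
require 'open-uri'

class RecordCounter < Nokogiri::XML::SAX::Document
  attr_reader :count

  def initialize
    @count = 0
  end

  def start_element(name, attrs = [])
    @count += 1 if name == "record"   # react only to the elements you care about
  end
end

handler = RecordCounter.new
OpenURI.open_uri("http://example.com/huge.xml") do |io|
  Nokogiri::XML::SAX::Parser.new(handler).parse(io)  # streams; never builds a DOM
end
puts handler.count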
You'll need a streaming parser. Have a look at https://github.com/craigambrose/sax_stream
You could run your own FTP server on EC2, or use a hosted provider such as https://hostedftp.com/.

determine if file is complete

I am trying to write a Ruby video transcoding script (using ffmpeg) that depends on .mov files being FTPed to a server.
The problem I've run into is that when a large file is uploaded by a user, the watch script (using rb-inotify) attempts to execute (and run the transcoder) before the .mov is completely uploaded.
I'm a complete noob. But I'm trying to discover if there is a way for me to be able to ensure my watch script doesn't run until the file(s) is/are completely uploaded.
My watch script is here:
watch_me = INotify::Notifier.new
watch_me.watch("/directory_to_my/videos", :close_write) do |directories|
  load '/directory_to_my/videos/.transcoder.rb'
end
watch_me.run
Thank you for any help you can provide.
Just relying on inotify(7) to tell you when a file has been updated isn't a great fit for telling when an upload is 'complete' -- an FTP session might time out and be re-started, for example, allowing a user to upload a file in chunks over several days as connectivity is cheap or reliable or available. inotify(7) only ever sees file open, close, rename, and access, but never the higher-level event "I'm done modifying this file", as the user would understand it.
There are two mechanisms I can think of: one is to have uploads go initially into one directory and ask the user to move the file into another directory when the upload is complete. The other is to create some file metadata on the client and use that to "know" when the upload is complete.
Move completed files manually
If your users upload into the directory ftp/incoming/temporary/, they can upload the file over as many connections as required. Once the file is "complete", they rename it (rename ftp/incoming/temporary/hello.mov ftp/incoming/complete/hello.mov), your rb-inotify interface looks for file renames in the ftp/incoming/complete/ directory, and starts the ffmpeg(1) command.
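A sketch of the watcher side of this, assuming the directory layout above; the ffmpeg invocation is only a placeholder for whatever the transcoder script really does:

require 'rb-inotify'

notifier = INotify::Notifier.new
# :moved_to fires when a file is renamed into the directory, which is exactly
# what happens when the user performs the rename described above.
notifier.watch("/ftp/incoming/complete", :moved_to) do |event|
  movie = event.absolute_name
  system("ffmpeg", "-i", movie, "#{movie}.mp4")   # placeholder transcode command
end
notifier.run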
Generate metadata
For a transfer to be "complete", you're really looking for two things:
The file is the same size on both systems.
The file is identical on both systems.
Since "identical" is otherwise difficult to check, most people content themselves with checking if the contents of the file, when run through a cryptographic hash function such as MD5 or SHA-1 (or better, SHA-224, SHA-256, SHA-384, or SHA-512) functions. MD5 is quite fine if you're guarding against incomplete transmission but if you intend on using the output of the function for other means, using a stronger function would be wise.
MD5 is really tempting though, since tools to create and validate MD5 hashes are very widespread: md5sum(1) on most Linux systems, md5(1) on most BSD systems (including OS X).
$ md5sum /etc/passwd
c271aa0e11f560af419557ef49a27ac8 /etc/passwd
$ md5sum /etc/passwd > /tmp/sums
$ md5sum -c /tmp/sums
/etc/passwd: OK
The md5sum -c command asks the md5sum(1) program to check the file of hashes and filenames for correctness. It looks a little silly when used on just a single file, but when you've got dozens or hundreds of files, it's nice to let the software do the checking for you. For example: http://releases.mozilla.org/pub/mozilla.org/firefox/releases/3.0.19-real-real/MD5SUMS -- Mozilla has published such files with 860 entries -- checking them by hand would get tiring.
Because checking hashes can take a long time (five minutes on my system to check a high-definition hour-long video that wasn't recently used), it'd be a good idea to only check the hashes when the filesizes match. Modify your upload tool to send along some metadata about how long the file is and what its cryptographic hash is. When your rb-inotify script sees file close requests, check the file size, and if the sizes match, check the cryptographic hash. If the hashes match, then start your ffmpeg(1) command.
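As a sketch of that check in Ruby, assuming the uploader also drops a small "<file>.meta.json" next to each upload containing the expected size and MD5 (the sidecar file name and its keys are illustrative, not a standard):

require 'digest'
require 'json'

# Hypothetical check: cheap size comparison first, expensive hash second.
def upload_complete?(path)
  meta_path = "#{path}.meta.json"
  return false unless File.exist?(meta_path)
  meta = JSON.parse(File.read(meta_path))
  return false unless File.size(path) == meta["size"]   # sizes must match first
  Digest::MD5.file(path).hexdigest == meta["md5"]        # then verify the digest
end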
It seems easier to upload the file to a temporary directory on the server and move it to the location your script is watching once the transfer is completed.

Verify whether ftp is complete or not?

I have an application that continuously polls a folder. Once any file is FTPed to the folder, the application has to move it to another folder for processing.
Here, we don't have any way to verify whether the FTP transfer is complete or not.
The command lsof was suggested in technical forums. It has a file descriptor column that gives the file's status.
Since this is a FreeBSD command and not present in old versions of Linux, I want to clarify how to use it.
Can you tell us about your experience with file verification, and is there any other alternative solution available?
Also, is there any risk in using this utility?
Appreciate your help in advance.
Thanks,
Mathew Liju
We've done this before in a number of different ways.
Method one:
If you can control the process sending the files, have it send the file itself followed by a sentinel file. For example, send the real file "contracts.doc" followed by a one-byte "contracts.doc.sentinel".
Then have your listener process watch out for the sentinel files. When one of them is created, you should process the equivalent data file, then delete both.
Any data file that's more than a day old and doesn't have a corresponding sentinel file should be discarded - it was a failed transmission.
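A rough polling sketch of method one, assuming paths like the above; process_file is a placeholder for your real processing step:

INCOMING = "/ftp/incoming"

# Process every data file whose sentinel has arrived, then clean up both.
Dir.glob(File.join(INCOMING, "*.sentinel")).each do |sentinel|
  data_file = sentinel.sub(/\.sentinel\z/, "")
  next unless File.exist?(data_file)
  process_file(data_file)            # placeholder: your real work goes here
  File.delete(data_file, sentinel)
end

# Discard stale uploads that never got a sentinel (failed transmissions).
Dir.glob(File.join(INCOMING, "*")).each do |path|
  next if path.end_with?(".sentinel")
  next if File.exist?("#{path}.sentinel")
  File.delete(path) if File.mtime(path) < Time.now - 24 * 60 * 60
end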
Method two:
Keep an eye on the files themselves (specifically the last modification date/time). Only process files whose modification time is more than N minutes in the past. That increases the latency of processing the files but you can usually be certain that, if a file hasn't been written to in five minutes (for example), it's done.
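A minimal sketch of method two; the quiet period, glob pattern, and process_file helper are placeholders:

QUIET_PERIOD = 5 * 60  # seconds without modification before a file counts as "done"

Dir.glob("/ftp/incoming/*.mov").each do |path|
  next if Time.now - File.mtime(path) < QUIET_PERIOD  # probably still being written
  process_file(path)                                  # placeholder processing step
end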
Conclusion:
Both those methods have been used by us successfully in the past. I prefer the first but we had to use the second one once when we were not allowed to change the process sending the files.
The advantage of the first one is that you know the file is ready when the sentinel file appears. With both lsof (I'm assuming you're treating files that aren't open by any process as ready for processing) and the timestamps, it's possible that the FTP transfer crashed in the middle and you may end up processing half a file.
There are normally three approaches to this sort of problem.
providing a signal file so that when your file is transferred, an additional file is sent to mark that transfer is complete
add an entry to a log file within that directory to indicate a transfer is complete (this really only works if you have a single peer updating the directory, to avoid concurrency issues)
parsing the file to determine completeness, e.g. does the file start with a length field, or is it obviously incomplete? For example, parsing an incomplete XML file will result in a parse error due to the missing end element. Depending on your file's size and format, this can be trivial or very time-consuming.
lsof would possibly be an option, although you've identified your Linux portability issue. If you use it, note the -F option, which formats the output so that it is suitable for processing by other programs, rather than being human-readable.
EDIT: Pax identified a fourth (!) method I'd forgotten - using the fact that the timestamp of the file hasn't updated in some time.
There is a fifth method: you can also check whether the FTP session is still active. This will work if every peer has its own FTP user account. As long as the user has not logged off from FTP, assume the files are not complete.
