NodeJS - failed to read newly uploaded file - ajax

I am trying to build a system (NodeJS + Express 4) that reads a user-uploaded text file, processes it, and feeds the results back to the user. I am using an ajax upload, with multer as the parser for the multipart data. The whole workflow is supposed to go like this:
The user chooses a local file and clicks the upload button.
The server receives the file and reads it.
The server does some processing on the data.
The results are sent back to the user.
Every part of the chain works except the server-side read: sometimes the file is not read fully, even though the server signals that the upload is complete (I have tried multiple libraries that fire a file-upload-complete event: multer, busboy, formidable). I have done various experiments, and here's what I found (with a 1000-line file):
fs.readFile sometimes ends prematurely. The result can be anywhere between 100 and 1000 lines.
The missing part is almost always the last small piece; it feels like the pipe has not been fully flushed yet. I have tried file sizes from 1,000 to 200,000 lines, and it is always the last few hundred lines that are missing.
Using streaming (createReadStream, or the byline / line-by-line modules) almost solves the issue, but sometimes the result is 'undefined' or still missing the last few lines, though far less frequently.
Triggering the read twice almost guarantees that the second read returns the full 1000 lines.
Is there any way to force NodeJS to 'flush' the uploaded file? Somehow I feel the upload-complete event is triggered (regardless of library; they all depend on the file system, I guess) before the last piece of the file has been flushed to disk. Or maybe there is some other issue: reading a static file always gives the correct result. I could use a plain HTTP POST form, but I'd like to use ajax to improve the user experience.
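Roughly, the upload handler is shaped like this (a simplified sketch; the '/upload' route, the 'file' field name and the JSON response are placeholders rather than my real code):

const express = require('express');
const multer = require('multer');
const fs = require('fs');

const app = express();
const upload = multer({ dest: 'uploads/' });

app.post('/upload', upload.single('file'), (req, res) => {
  // multer has already written the upload to disk at req.file.path
  fs.readFile(req.file.path, 'utf8', (err, data) => {
    if (err) return res.status(500).send(err.message);
    const lines = data.split('\n');
    // ... process the lines, then send the result back to the ajax caller
    res.json({ lineCount: lines.length });
  });
});

app.listen(3000);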
Any thoughts?

Related

Appropriate way to cancel saving file via file stream?

A tool I'm writing is responsible for downloading thousands of image files over the course of many hours. Originally, using TIdHTTP, I would Get each file into a TMemoryStream and then save that to a file, so long as there were no exceptions. To improve speed, I changed the TMemoryStream to a TFileStream.
However, now if the resource is not found, or any other exception occurs that results in no actual file, it still saves an empty file.
Completely understandable, since I simply create a file stream just prior to the download...
FileStream := TFileStream.Create(FileName, fmCreate);
try
  Web.Get(AURL, FileStream);
finally
  FileStream.Free;
end;
I know I could simply delete the file if there was an exception. But it seems far too sloppy. I'm sure there's a more appropriate method of aborting such a situation.
How should I make this to not save a file if there was an exception, while not altering the performance (if at all possible)?
This isn't possible in general. Errors and failures can happen at any step of the way, including part way through the download. Once this point is understood, you must accept that the file can be partially downloaded and then abandoned. At that point, where do you store the partial data?
The obvious choices are memory and a file. You don't want to store it in memory, which leaves a file.
This takes you back to your current solution.
I know I could simply delete the file if there was an exception.
This is the correct approach. There are a few variants on it. For instance, you might download to a temporary file that is created with flags arranging for its deletion when it is closed. Only if the download completes do you then copy it to the true destination. This is the approach that a browser takes. But the basic idea is to download to a file and deal with any failure by tidying up.
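As one of those variants, here is the download-to-a-temporary-file idea sketched in Node.js rather than Delphi (the URL, the file names and the use of fetch are placeholders; the pattern itself is language-agnostic):

// Download to a ".part" file first; only a completed download is renamed to
// the real name, and a failed one is removed (assumes the global fetch of Node 18+).
const fsp = require('fs/promises');

async function safeDownload(url, finalPath) {
  const tmpPath = finalPath + '.part';
  try {
    const res = await fetch(url);
    if (!res.ok) throw new Error('HTTP ' + res.status);
    await fsp.writeFile(tmpPath, Buffer.from(await res.arrayBuffer()));
    await fsp.rename(tmpPath, finalPath);   // the real name appears only on success
  } catch (err) {
    await fsp.rm(tmpPath, { force: true }); // tidy up any partial file
    throw err;
  }
}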
Instead of downloading the entire image in one go, you could consider using HTTP range requests if the server supports them. Then you could fetch the file in smaller parts, requesting the next part after the previous one finishes (or even requesting multiple parts at the same time to increase performance). If there is an exception you can abort the future requests, so they never start in the first place (see the sketch below).
YouTube and a number of streaming media sites started doing this a while ago. It used to be that if you started playing a video and then paused it, the entire video would eventually be cached. Now only a little ahead of the current position is cached. This saves a ton of bandwidth because of the abandon rate for videos.
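A rough sketch of a range-based download loop, again in Node.js for illustration (the chunk size, URL handling and error handling are assumptions; a Delphi version with TIdHTTP would follow the same shape):

// Chunked download via HTTP range requests (assumes the global fetch of Node 18+;
// URL, chunk size and destination path are placeholders).
const fsp = require('fs/promises');

async function downloadInChunks(url, destPath, chunkSize = 256 * 1024) {
  const chunks = [];
  for (let start = 0; ; start += chunkSize) {
    const res = await fetch(url, {
      headers: { Range: 'bytes=' + start + '-' + (start + chunkSize - 1) },
    });
    if (res.status === 416) break;                      // past the end of the file
    if (res.status !== 206 && res.status !== 200) throw new Error('HTTP ' + res.status);
    const buf = Buffer.from(await res.arrayBuffer());
    chunks.push(buf);
    if (res.status === 200) break;                      // server ignored Range: full body
    if (buf.length < chunkSize) break;                  // short chunk: this was the last one
  }
  await fsp.writeFile(destPath, Buffer.concat(chunks));
}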
You could write the partial file to disk or keep it in memory.

get_dir_file_info() hangs when run on a large directory

I have made a little function that deletes files based on date. Prior to doing the deletions, it lets the user choose how many days/months back to delete files, telling them how many files and how much disk space it would clean up.
It worked great in my test environment, but when I attempted to test it on a larger directory (approximately 100K files), it hangs.
I've stripped everything else from my code to ensure that it is the get_dir_file_info() function that is causing the issue.
$this->load->helper('file');
$folder = "iPad/images/";
set_time_limit(0);
echo "working<br />";
$dirListArray = get_dir_file_info($folder);
echo "still working";
When I run this, the page loads for approximately 60 seconds, then displays only the first message “working” and not the following message “still working”.
It doesn't seem to be a system/PHP memory problem, as the page comes back after 60 seconds, and the server respects my set_time_limit(), which I've had to use for other processes.
Is there some other memory/time limit I might be hitting that I need to adjust?
From the CodeIgniter user guide, get_dir_file_info():
Reads the specified directory and builds an array containing the filenames, filesize, dates, and permissions. Sub-folders contained within the specified path are only read if forced by sending the second parameter, $top_level_only to FALSE, as this can be an intensive operation.
So if you are saying that you have 100k files, the best way to do it is to cut the work into two steps (sketched below):
First: use get_filenames('path/to/directory/') to retrieve all your file names without their information.
Second: use get_file_info('path/to/file', $file_information) to retrieve a specific file's info, since you might not need all of the file information immediately. This can be done when a file name is clicked, or at some other relevant point.
The idea here is not to force your server to deal with a large amount of processing while in production. That would kill two things: responsiveness and performance (for lack of a better word). But the idea is clear.
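The same two-step idea, sketched outside CodeIgniter with plain Node.js filesystem calls (the folder path is a placeholder): list the names cheaply first, and only stat an individual file when its details are actually needed.

const fs = require('fs');
const path = require('path');

const folder = 'iPad/images/';

// Step 1: names only; cheap even for ~100k entries
const names = fs.readdirSync(folder);

// Step 2: full info for a single file, fetched on demand (e.g. when it is clicked)
function fileInfo(name) {
  const stats = fs.statSync(path.join(folder, name));
  return { name, size: stats.size, modified: stats.mtime };
}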

Storing and processing large XML files with Heroku?

I'm working on an application that needs to store a large 2GB+ XML file for processing, and I'm facing two problems:
How do I process the file? Loading the whole file into Nokogiri at once won't work. It quickly eats up memory and, as far as I can tell, the process gets nuked from orbit. Are there Heroku-compatible ways to quickly/easily read a large XML file located on a non-Heroku server in smaller chunks?
How do I store the file? The site is set up to use S3, but the data provider needs FTP access to upload the XML file nightly. S3 via FTP is apparently a no-go, and storing the file on Heroku won't work either, as it'll only be seen by the dyno that owns it and is susceptible to being randomly purged. Has anyone encountered this type of constraint before, and if so, how'd you work around it?
Most of the time we prefer parsing the entire file that has been pulled into memory because it's easier to jump back and forth, extracting this and that as our code needs. Because it's in memory, we can do random access easily, if we want.
For your need, you'll want to start at the top of the file, and read each line, looking for the tags of interest, until you get to the end of the file. For that, you want to use Nokogiri::XML::SAX and Nokogiri::XML::SAX::Parser, along with the events in Nokogiri::XML::SAX::Document. Here's a summary of what it does, from Nokogiri's site:
The basic way a SAX style parser works is by creating a parser, telling the parser about the events we're interested in, then giving the parser some XML to process. The parser will notify you when it encounters events you said you would like to know about.
SAX is a different beast than dealing with the DOM, but it can be very fast, and is a lot easier on memory.
If you wanted to load the file in smaller chunks, you could process the XML inside an OpenURI.open or Net::HTTP block, so you'd be getting it in TCP packet-size chunks. The problem then is that your lines could be split, because TCP doesn't guarantee reading by lines, but by blocks, which is what you'll see inside the read loop. Your code would have to peel off partial lines at the end of the buffer, and then prepend them to the read buffer so the next block read finishes the line.
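A sketch of that peel-off-the-partial-line idea, shown here with a Node.js read stream for brevity (the file name and handleLine() are placeholders; a Ruby read loop would do the same thing with its buffer):

const fs = require('fs');

let leftover = '';
const stream = fs.createReadStream('big.xml', { encoding: 'utf8' });

stream.on('data', (chunk) => {
  const text = leftover + chunk;
  const lines = text.split('\n');
  leftover = lines.pop();          // the last element may be a partial line: keep it
  for (const line of lines) {
    handleLine(line);              // hypothetical per-line handler
  }
});

stream.on('end', () => {
  if (leftover) handleLine(leftover);  // flush the final unterminated line
});

function handleLine(line) {
  // look for the tags of interest here
}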
You'll need a streaming parser. Have a look at https://github.com/craigambrose/sax_stream
You could run your own FTP server on EC2, or use a hosted provider such as https://hostedftp.com/.

Sencha Touch - big XML file issue

I am reading the content from an XML file over the internet.
The file contains about 10,000 XML elements and is loaded into a list (one picture and headline for each element).
This slows down the app extremely.
Is there a way to speed this up?
Maybe with a select-command?
Are there some examples or tutorials out there?
You are out of luck for an easy, straightforward answer.
If you control the server that the XML file is coming from, you should make the changes on it to support pagination of the results instead of sending the complete document.
If you don't control the server, you could set up one to proxy the results and do the pagination for the application on the server side.
The last option is to process the file in chunks. This would mean processing sub-strings of the text: take a sub-string of the first x characters, parse it, and then do something with the results. If you need more, process the next x characters. This could get very messy fast (XML doesn't really parse nicely in this manner), and just downloading a document with 10k elements and loading it into memory is probably going to be taxing/slow/expensive for mobile devices (especially over a 3G connection).

Verify whether ftp is complete or not?

I have an application which polls a folder continuously. Once any file is FTPed to the folder, the application has to move it to another folder for processing.
Here, we don't have any way to verify whether the FTP transfer is complete or not.
The command "lsof" has been suggested in technical forums. It has a file descriptor column which gives the file status.
Since this is a FreeBSD command and is not present in old versions of Linux, I want to clarify the usage of this command.
Can you tell us about your experience with file verification, and is there any other alternative solution available?
Also, is there any risk in using this utility?
Appreciate your help in advance.
Thanks,
Mathew Liju
We've done this before in a number of different ways.
Method one:
If you can control the process sending the files, have it send the file itself followed by a sentinel file. For example, send the real file "contracts.doc" followed by a one-byte "contracts.doc.sentinel".
Then have your listener process watch out for the sentinel files. When one of them is created, you should process the equivalent data file, then delete both.
Any data file that's more than a day old and doesn't have a corresponding sentinel file, get rid of it - it was a failed transmission.
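A minimal sketch of the sentinel-file listener, written in Node.js purely for illustration (the directory, the '.sentinel' suffix and processFile() are assumptions):

const fs = require('fs');
const path = require('path');

const watchDir = '/incoming';

fs.watch(watchDir, (eventType, filename) => {
  if (!filename || !filename.endsWith('.sentinel')) return;
  const sentinel = path.join(watchDir, filename);
  const dataFile = path.join(watchDir, filename.replace(/\.sentinel$/, ''));
  if (!fs.existsSync(dataFile)) return;   // sentinel without a data file: ignore for now
  processFile(dataFile);                  // hypothetical processing step
  fs.unlinkSync(dataFile);
  fs.unlinkSync(sentinel);
});

function processFile(p) {
  console.log('processing', p);
}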
Method two:
Keep an eye on the files themselves (specifically the last modification date/time). Only process files whose modification time is more than N minutes in the past. That increases the latency of processing the files but you can usually be certain that, if a file hasn't been written to in five minutes (for example), it's done.
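And a sketch of that modification-time check, again in Node.js for illustration (the directory and the five-minute threshold are placeholders):

const fs = require('fs');
const path = require('path');

const dir = '/incoming';
const STABLE_MS = 5 * 60 * 1000;

for (const name of fs.readdirSync(dir)) {
  const full = path.join(dir, name);
  const { mtimeMs } = fs.statSync(full);
  if (Date.now() - mtimeMs > STABLE_MS) {
    // the file has not been written to for five minutes: treat it as complete
    console.log('ready:', full);
  }
}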
Conclusion:
Both those methods have been used by us successfully in the past. I prefer the first but we had to use the second one once when we were not allowed to change the process sending the files.
The advantage of the first one is that you know the file is ready when the sentinel file appears. With both lsof (I'm assuming you're treating files that aren't open by any process as ready for processing) and the timestamps, it's possible that the FTP crashed in the middle and you may be processing half a file.
There are normally three approaches to this sort of problem.
providing a signal file so that when your file is transferred, an additional file is sent to mark that transfer is complete
add an entry to a log file within that directory to indicate a transfer is complete (this really only works if you have a single peer updating the directory, to avoid concurrency issues)
parsing the file to determine completeness, e.g. does the file start with a length field, or is it obviously incomplete? For example, parsing an incomplete XML file will result in a parse error due to the missing end element. Depending on your file's size and format, this can be trivial or very time-consuming.
lsof would possibly be an option, although you've identified your Linux portability issue. If you use it, note the -F option, which formats the output so that it is suitable for processing by other programs, rather than being human-readable.
EDIT: Pax identified a fourth (!) method I'd forgotten - using the fact that the timestamp of the file hasn't updated in some time.
There is a fifth method. You can also check whether the FTP session is still active. This works if every peer has its own FTP user account. As long as the user has not logged off from FTP, assume the files are not complete.

Resources