How to optimize the file processing? - performance

I'm working on a Perl/CGI script which reads an 8MB file with over 100k lines and displays it in chunks of 100 lines (using pagination).
Which one of the following will be faster?
Storing the entire input file into an array and extracting 100 lines for each page (using array slicing)
my @extract = @main_content[101..200];
or
For each page, using the sed command to extract any 100 lines that the user wants to view.
sed -n '101,200'p filename

If you really want performance, don't use CGI; use something that keeps a persistent copy of the data in memory between requests. 8 MB is tiny these days, but loading the file on every request would not be sensible, and neither would scanning the whole file each time. mod_perl was the older way of doing this: it is a Perl interpreter embedded in the web server. The newer way is to use Catalyst or Dancer; instructions for those are outside the scope of this reply. You could get away with CGI if this were only to be used occasionally and were password protected to limit use.
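To make the "persistent copy in memory" idea concrete, here is a minimal sketch in Python (not Perl, purely for illustration). It assumes a long-running process of the kind a persistent app server gives you; the names load_lines and get_page are made up for this example, and the cache does not notice if the file changes on disk.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def load_lines(path):
    """Read the file once per process; later calls with the same path reuse the cached tuple."""
    with open(path, encoding="utf-8") as handle:
        return tuple(handle.read().splitlines())


def get_page(path, page, page_size=100):
    """Return the lines for a 1-based page number using a plain slice."""
    lines = load_lines(path)
    start = (page - 1) * page_size
    return lines[start:start + page_size]
```

With this shape, each request only pays for a slice of an already loaded sequence, which is the same operation as the array slice in the question but without re-reading the file every time.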

Related

Bash to multithread multiple files

I have some files, about 10 (variable), each with 2000 lines of code.
I have some function that will read each line of a document and do something with it.
I am trying to process all of the files at the same time with a multithreaded process. I am trying to use xargs for this, but I'm not sure how it would handle an indeterminate number of files (they all have similar names, though: segmentA, segmentB, etc.).
Any suggestions?
Edit:
For further clarification, the function takes a document, and reads each line and sends each line to a solr server.
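One possible shape for this, sketched in Python rather than Bash/xargs: collect the files with a glob pattern and hand each one to a worker thread. This is only a sketch under assumptions. process_file, process_all and the handle_line callback are hypothetical names; handle_line stands in for whatever sends a line to the Solr server and must be safe to call from several threads.

```python
import glob
from concurrent.futures import ThreadPoolExecutor


def process_file(path, handle_line):
    """Feed every line of one file to handle_line and return the line count."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            handle_line(line.rstrip("\n"))
            count += 1
    return count


def process_all(pattern, handle_line, max_workers=10):
    """Process every file matching pattern, using up to max_workers threads."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = list(pool.map(lambda p: process_file(p, handle_line), paths))
    return dict(zip(paths, counts))
```

Because the glob is evaluated at run time, the number of segment files does not need to be known in advance; something like process_all("segment*", send_line) would pick up however many exist.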

Ruby PStore file too large

I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2GB from what I can see) I am no longer able to write the file to disk, and I receive the following error:
Errno::EINVAL: Invalid argument - <filename>
I am aware that this is probably a limitation of IO but I was wondering whether there is a workaround. For example, to read large JSON files, I would first split the file and then read it in parts. Probably the definitive solution should be to switch to a proper database in the backend, but because of some limitations of the specific Ruby (Sketchup) I am using this is not always possible.
I am going to assume that your data has a field that could be used as a crude key.
Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets.
For example, if your data has a name field, you could take the first 1-4 chars of the name, create a file with those chars like rojj-datafile.pstore and add the entry there. Any records with a name starting 'rojj' go in that file.
A more structured version is to take the first char as a directory, then put the file inside that, like r/rojj-datafile.pstore.
Obviously your mechanism for reading/writing will have to take this new file structure into account, and it will undoubtedly end up slower to process the data into the pstores.
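The bucketing scheme above can be sketched as a small path-mapping function. This is written in Python for illustration, not Ruby, and only computes the file name; bucket_path, its parameters, and the lowercasing of the prefix are assumptions of this sketch, and it assumes names contain only characters that are safe in file names.

```python
import os


def bucket_path(name, root="data", prefix_len=4):
    """Map a record name to a per-prefix store file under a one-letter directory."""
    prefix = name[:prefix_len].lower()
    if not prefix:
        raise ValueError("name must not be empty")
    # First character picks the directory, the full prefix picks the file.
    return os.path.join(root, prefix[0], prefix + "-datafile.pstore")
```

Every read and write would go through this function first, so each store file only ever holds the records for one prefix and stays well below the size at which the single file failed.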

NodeJS - failed to read newly uploaded file

I was trying to build a system (NodeJS + Express 4) that reads a user-uploaded text file, processes it, and feeds the result back to the user. I was trying to use ajax upload, with multer as the parser for multi-part data. The whole workflow is supposed to go like this:
User chooses a local file, and clicks the upload button.
Server receives the file and reads it.
Server does some processing with the data.
Server sends the results back.
Every part of the chain works except the server read part: sometimes the file is not read fully even though the server signals that the file upload was completed (I have tried multiple libraries, like multer, busboy and formidable, which trigger the file upload complete event). I have done various experiments, and here's what I found (with a 1000-line file):
fs.readFile sometimes ends prematurely. The resulting file can be anywhere between 100 and 1000 lines.
The missing part is almost always the last small piece; it feels like the pipe was not fully flushed yet. I have tried file sizes from 1000 to 200,000 lines, and it's always missing the last few hundred lines.
Using streaming (createReadStream, or byline, or line-by-line) almost solved the issue, but sometimes the result is still 'undefined' or missing the last few lines, though a lot less frequently.
Triggering the read twice: the second read is almost guaranteed to return the full 1000 lines.
Is there any way to force NodeJS to 'flush' the uploaded file? Somehow I feel the upload complete event was triggered (regardless of library, and every one depends on the file system, I guess) before the last piece of the file was flushed to the stream. Or maybe there is some other issue: reading a static file always gives the correct result. I could use plain HTTP POST forms, but I'd like to use ajax to improve the user experience.
Any thoughts?

speedup postgresql to add data from text file using Django python script

I am working with a server whose configuration is:
RAM - 56GB
Processor - 2.6 GHz x 16 cores
How can I do parallel processing using the shell? How can I utilize all the cores of the processor?
I have to load data from text files that contain millions of entries; for example, one file contains half a million lines of data.
I am using a Django Python script to load the data into a PostgreSQL database.
But it takes a lot of time to add the data to the database even though the server has such a good configuration, and I don't know how to use the server's resources in parallel so that processing takes less time.
Yesterday I loaded only 15000 lines of data from a text file into PostgreSQL, and it took nearly 12 hours.
My Django Python script is below:
import re
import collections
def SystemType():
    filename = raw_input("Enter file Name:")
    in_file = file(filename,"r")
    out_file = file("SystemType.txt","w+")
    for line in in_file:
        line = line.decode("unicode_escape")
        line = line.encode("ascii","ignore")
        values = line.split("\t")
        if values[1]:
            for list in values[1].strip("wordnetyagowikicategory"):
                out_file.write(re.sub("[^\ a-zA-Z()<>\n""]"," ",list))
# Eliminate Duplicate Entries from extracted data using regular expression
def FSystemType():
    lines_seen = set()
    outfile = open("Output.txt","w+")
    infile = open("SystemType.txt","r+")
    for line in infile:
        if line not in lines_seen:
            l = line.lstrip()
            # Below reg exp is used to handle Camel Case.
            outfile.write(re.sub(r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))', r' \1', l).lower())
            lines_seen.add(line)
    infile.close()
    outfile.close()
sylist=[]
def create_system_type(stname):
    syslist=Systemtype.objects.all()
    for i in syslist:
        sylist.append(str(i.title))
    if not stname in sylist:
        slu=slugify(stname)
        st=Systemtype()
        st.title=stname
        st.slug=slu
        # st.sites=Site.objects.all()[0]
        st.save()
        print "one ST added."
If you could express your requirements without the code (not every shell programmer can really read Python), we could possibly help here.
E.g. your report of 12 hours for 15000 lines suggests you have a too-busy "for" loop somewhere, and I'd point at the nested for
for list in values[1]....
What are you trying to strip? Individual characters, whole words? ...
Then I'd suggest "awk".
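The question about characters versus whole words matters because Python's str.strip takes a set of characters, not a word, and a for loop over a string walks it one character at a time. A small sketch of the difference follows; the sample value and the strip_prefix helper are made up for illustration.

```python
def strip_prefix(text, prefix):
    """Remove prefix as a whole word, only when text actually starts with it."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


sample = "wikicategory_Unix_variant"  # made-up value for illustration

# str.strip treats its argument as a set of characters and trims both ends,
# so letters at the end of the value are eaten as well.
char_stripped = sample.strip("wordnetyagowikicategory")

# Looping over a string visits one character at a time, not one word.
pieces = [piece for piece in char_stripped]

# Removing the label as a whole word leaves the rest of the value intact.
whole_word = strip_prefix(sample, "wikicategory")
```

Here char_stripped comes out as "_Unix_v" while whole_word is "_Unix_variant", so if the intent was to drop a leading label, the character-set strip is probably not doing what was meant.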
If you are able to work out the precise data structure required by Django, you can load the database tables directly using PostgreSQL's COPY command (\copy in psql). You could do this by preparing a CSV file to load into the db.
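A rough sketch of the "prepare a CSV file" step, assuming the target table only needs a title and a slug as in the question's script. The function names, the table name in the comment, and simple_slug (a crude stand-in for Django's slugify, not the real thing) are all assumptions of this example.

```python
import csv
import re


def simple_slug(title):
    """Rough stand-in for Django's slugify: lowercase, hyphen-separated."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def write_copy_csv(titles, csv_path):
    """Write unique (title, slug) rows to a CSV file and return the row count."""
    seen = set()  # set membership is cheap, unlike re-scanning a growing list
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for title in titles:
            title = title.strip()
            if not title or title in seen:
                continue
            seen.add(title)
            writer.writerow([title, simple_slug(title)])
            rows += 1
    return rows


# The resulting file could then be loaded from psql with something like:
#   \copy app_systemtype (title, slug) FROM 'systemtypes.csv' WITH (FORMAT csv)
# where app_systemtype is a placeholder for the real table name.
```

The point of going this way is that the database receives one bulk load instead of one INSERT (plus one full-table query) per line.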
There are any number of reasons why loading is slow using your approach. First of all, Django has a lot of transactional overhead. Secondly, it is not clear how you are running the Django code: is this via the internal testing server? If so, you may have to deal with the slowness of that. Finally, what makes a database fast normally has less to do with CPU than with fast IO and lots of memory.

Storing and processing large XML files with Heroku?

I'm working on an application that needs to store a large 2GB+ XML file for processing, and I'm facing two problems:
How do I process the file? Loading the whole file into Nokogiri at once won't work. It quickly eats up memory and, as far as I can tell, the process gets nuked from orbit. Are there Heroku-compatible ways to quickly/easily read a large XML file located on a non-Heroku server in smaller chunks?
How do I store the file? The site is set up to use S3, but the data provider needs FTP access to upload the XML file nightly. S3 via FTP is apparently a no-go, and storing the file on Heroku won't work either, as it'll only be seen by the dyno that owns it and is susceptible to being randomly purged. Has anyone encountered this type of constraint before, and if so, how'd you work around it?
Most of the time we prefer parsing the entire file that has been pulled into memory because it's easier to jump back and forth, extracting this and that as our code needs. Because it's in memory, we can do random access easily, if we want.
For your need, you'll want to start at the top of the file, and read each line, looking for the tags of interest, until you get to the end of the file. For that, you want to use Nokogiri::XML::SAX and Nokogiri::XML::SAX::Parser, along with the events in Nokogiri::XML::SAX::Document. Here's a summary of what it does, from Nokogiri's site:
The basic way a SAX style parser works is by creating a parser, telling the parser about the events we’re interested in, then giving the parser some XML to process. The parser will notify you when it encounters events your said you would like to know about.
SAX is a different beast than dealing with the DOM, but it can be very fast, and is a lot easier on memory.
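To show what the event-driven style looks like, here is a small sketch using Python's standard xml.sax module rather than Nokogiri; the Ruby API differs in detail, so treat this only as an illustration of the pattern. TagCounter and scan are hypothetical names, and the handler assumes the element of interest is not nested inside itself.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TagCounter(ContentHandler):
    """Count occurrences of one element name and collect its text content."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0
        self.texts = []
        self._buffer = None  # holds text pieces while inside the element

    def startElement(self, name, attrs):
        if name == self.tag:
            self.count += 1
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.tag and self._buffer is not None:
            self.texts.append("".join(self._buffer))
            self._buffer = None


def scan(stream, tag):
    """Stream-parse an XML file object without building a full tree."""
    handler = TagCounter(tag)
    xml.sax.parse(stream, handler)
    return handler
```

The parser calls the handler as it reads, so memory use depends on what the handler chooses to keep, not on the size of the document.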
If you wanted to load the file in smaller chunks, you could process the XML inside an OpenURI.open or Net::HTTP block, so you'd be getting it in TCP packet-size chunks. The problem then is that your lines could be split, because TCP doesn't guarantee reading by lines, but by blocks, which is what you'll see inside the read loop. Your code would have to peel off partial lines at the end of the buffer, and then prepend them to the read buffer so the next block read finishes the line.
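The "peel off the partial line and prepend it to the next block" bookkeeping can be sketched as a small generator. This is Python for illustration; iter_lines is a made-up name, and the chunks argument stands in for whatever the network read loop yields.

```python
def iter_lines(chunks):
    """Yield complete lines from an iterable of byte chunks of arbitrary size."""
    remainder = b""
    for chunk in chunks:
        buffer = remainder + chunk
        # Everything before the last newline is complete; the tail may be partial.
        *complete, remainder = buffer.split(b"\n")
        for line in complete:
            yield line
    if remainder:
        # Final line without a trailing newline.
        yield remainder
```

For example, feeding it the chunks b"<a>\n<b" and b">\n" yields b"<a>" and then b"<b>", with the split line reassembled across the chunk boundary.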
You'll need a streaming parser. Have a look at https://github.com/craigambrose/sax_stream
You could run your own FTP server on EC2? Or use a hosted provider such as https://hostedftp.com/

Resources