Speed up loading data from a text file into PostgreSQL using a Django Python script - shell

I am working with a server whose configuration is:
RAM - 56GB
Processor - 2.6 GHz x 16 cores
How can I do parallel processing from the shell? How can I utilize all the cores of the processor?
I have to load data from text files that contain millions of entries; for example, one file contains half a million lines of data.
I am using a Django Python script to load the data into a PostgreSQL database.
But it takes a lot of time to add the data to the database even though the server is well configured. I don't know how to use the server's resources in parallel so that the data is processed in less time.
Yesterday I loaded only 15000 lines of data from a text file into PostgreSQL and it took nearly 12 hours.
My Django Python script is below:
import re
import collections

def SystemType():
    filename = raw_input("Enter file Name:")
    in_file = file(filename,"r")
    out_file = file("SystemType.txt","w+")
    for line in in_file:
        line = line.decode("unicode_escape")
        line = line.encode("ascii","ignore")
        values = line.split("\t")
        if values[1]:
            for list in values[1].strip("wordnetyagowikicategory"):
                out_file.write(re.sub("[^\ a-zA-Z()<>\n""]"," ",list))

# Eliminate Duplicate Entries from extracted data using regular expression
def FSystemType():
    lines_seen = set()
    outfile = open("Output.txt","w+")
    infile = open("SystemType.txt","r+")
    for line in infile:
        if line not in lines_seen:
            l = line.lstrip()
            # Below reg exp is used to handle Camel Case.
            outfile.write(re.sub(r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))', r' \1', l).lower())
            lines_seen.add(line)
    infile.close()
    outfile.close()

sylist=[]
def create_system_type(stname):
    syslist=Systemtype.objects.all()
    for i in syslist:
        sylist.append(str(i.title))
    if not stname in sylist:
        slu=slugify(stname)
        st=Systemtype()
        st.title=stname
        st.slug=slu
        # st.sites=Site.objects.all()[0]
        st.save()
        print "one ST added."

If you could express your requirements without the code (not every shell programmer can really read Python), possibly we could help here.
E.g. your report of 12 hours for 15000 lines suggests you have a too-busy "for" loop somewhere, and I'd point at the nested for:
for list in values[1]....
What are you trying to strip? Individual characters, whole words? ...
Then I'd suggest "awk".

If you are able to work out the precise data structure required by Django, you can load the database tables directly using the psql "copy" command. You could do this by preparing a CSV file to load into the db.
There are any number of reasons why loading is slow using your approach. First of all, Django has a lot of transactional overhead. Secondly, it is not clear how you are running the Django code: is this via the internal testing server? If so, you may have to deal with the slowness of that. Finally, what makes a database fast normally has less to do with CPU than with fast I/O and plenty of memory.
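A minimal sketch of the CSV-preparation idea, assuming the target table needs only a title and a slug per row. The helper `simple_slug`, the file name, and the table name `app_systemtype` in the closing comment are illustrative assumptions, not taken from the asker's project.

```python
import csv
import re


def simple_slug(title):
    """Hypothetical stand-in for Django's slugify: lowercase, hyphen-separated."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def write_copy_csv(titles, csv_path):
    """Write unique (title, slug) rows to csv_path and return the row count.

    A set makes each duplicate check O(1), instead of re-reading every
    existing row from the database for each new title.
    """
    seen = set()
    count = 0
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        for raw in titles:
            title = raw.strip()
            if not title or title in seen:
                continue
            seen.add(title)
            writer.writerow([title, simple_slug(title)])
            count += 1
    return count


# The resulting file could then be loaded in one statement from psql, e.g.
# (table and column names are assumptions):
#   \copy app_systemtype (title, slug) FROM 'systemtype.csv' WITH (FORMAT csv)
```

Deduplicating in memory before the load avoids one query per input line, which is the likeliest cost in the posted `create_system_type`.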

Related

What is good way of using multiprocessing for bifacial_radiance simulations?

For a university project I am using bifacial_radiance v0.4.0 to run simulations of approx. 270 000 rows of data in an EPW file.
I have set up a scene with some panels in a module following a tutorial on the bifacial_radiance GitHub page.
I am running the python script for this on a high power computer with 64 cores. Since python natively only uses 1 processor I want to use multiprocessing, which is currently working. However it does not seem very fast, even when starting 64 processes it uses roughly 10 % of the CPU's capacity (according to the task manager).
The script will first create the scene with panels.
Then it will look at a result file (where I store results as csv), and compare it to the contents of the radObj.metdata object. Both metdata and my result file use dates, so all dates which exist in the metdata file but not in the result file are stored in a queue object from the multiprocessing package. I also initialize a result queue.
I want to send a lot of the work to other processors.
To do this I have written two functions:
A file writer function which every 10 seconds gets all items from the result queue and writes them to the result file. This function is running in a single multiprocessing.Process process like so:
fileWriteProcess = Process(target=fileWriter,args=(resultQueue,resultFileName))
fileWriteProcess.start()
A ray trace function with a unique ID which does the following:
Get an index idx from the index queue (described above)
Use this index in radObj.gendaylit(idx)
Create the octfile. For this I have modified the name which the octfile is saved with to use a prefix which is the name of the process. This is to avoid all the processes using the same octfile on the SSD. octfile = radObj.makeOct(prefix=name)
Run an analysis analysis = bifacial_radiance.AnalysisObj(octfile,radObj.basename)
frontscan, backscan = analysis.moduleAnalysis(scene)
frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
Read the desired results from resultDict and put them in the resultQueue as a single line of comma-separated values.
This all works. The processes are running after being created in a for loop.
This speeds up the whole simulation process quite a bit (10 days down to 1½ day), but as said earlier the CPU is running at around 10 % capacity and the GPU is running around 25 % capacity. The computer has 512 GB ram which is not an issue. The only communication with the processes is through the resultQueue and indexQueue, which should not bottleneck the program. I can see that it is not synchronizing as the results are written slightly unsorted while the input EPW file is sorted.
My question is if there is a better way to do this, which might make it run faster? I can see in the source code that a boolean "hpc" is used to initiate some of the classes, and a comment in the code mentions that it is for multiprocessing, but I can't find any information about it elsewhere.
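One way to express the same worker pattern with less hand-written queue handling is `multiprocessing.Pool.imap_unordered`. The sketch below is only an outline under assumptions: `trace_one` is a hypothetical placeholder for the per-timestamp ray trace (it does not call bifacial_radiance), and the function and parameter names are invented for illustration.

```python
from multiprocessing import Pool


def pending_indices(all_dates, done_dates):
    """Return indices of dates that have no stored result yet."""
    done = set(done_dates)
    return [i for i, d in enumerate(all_dates) if d not in done]


def trace_one(idx):
    """Hypothetical placeholder for the gendaylit / makeOct / analysis steps."""
    return idx, "result-for-%d" % idx


def run_all(indices, workers, chunksize=16):
    """Run trace_one over indices in worker processes, restoring input order."""
    with Pool(processes=workers) as pool:
        results = list(pool.imap_unordered(trace_one, indices, chunksize))
    return sorted(results)
```

Call `run_all` from under an `if __name__ == "__main__":` guard, which `multiprocessing` needs on Windows. Sorting the collected results restores the input order, so the output file no longer depends on which worker finishes first.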

How to optimize the file processing?

I'm working on a Perl/CGI script which reads an 8MB file with over 100k lines and displays it in chunks of 100 lines (using pagination).
Which one of the following will be faster
Storing the entire input file into an array and extracting 100 lines for each page (using array slicing)
my @extract = @main_content[101..200];
or
For each page, using the sed command to extract any 100 lines that the user wants to view.
sed -n '101,200'p filename
If you really want performance then don't use CGI; try something that keeps a persistent copy of the data in memory between requests. 8 MB is tiny these days, but loading it for every request would not be sensible, nor would scanning the whole file. mod_perl was the older way of doing this: it is a Perl interpreter embedded in the web server. The newer way is to use Catalyst or Dancer; instructions for those are outside the scope of this reply. You could get away with using CGI if this was only to be used occasionally and was password protected to limit use.
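A related idea, which is not literally what the answer proposes, is to keep only a small index of line offsets in memory and seek straight to the requested page. The sketch below is in Python rather than Perl, and the function names are assumptions made for illustration.

```python
def build_line_index(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_page(path, offsets, page, page_size=100):
    """Return the lines of a 0-based page by seeking straight to its first line."""
    start = page * page_size
    if start >= len(offsets):
        return []
    with open(path, "rb") as f:
        f.seek(offsets[start])
        count = min(page_size, len(offsets) - start)
        return [f.readline().decode("utf-8").rstrip("\r\n") for _ in range(count)]
```

The index is built once in a single pass, after which each page request reads only the 100 lines it needs instead of scanning from the top of the file.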

Are there any metrics, known issues, or directions for saving a large pickle object to the Windows 10 file system

I have a rather large list: 19 million items in memory that I am trying to save to disk (Windows 10 x64 with plenty of space).
pickle.dump(list, open('list.p'.format(file), 'wb'))
Background:
The original data was read in from a csv (2 columns) with the same number of rows (19mil) and was modified to a list of tuples.
The original CSV file was 740 MB. The file "list.p" is showing up in my directory at 2.5 GB, but the Python process does not budge (I was debugging and stepping through each line) and the memory utilization at last check was at 19 GB and increasing.
I am just interested if anyone can shed some light on this pickle process.
PS - I understand that pickle.HIGHEST_PROTOCOL is now at Protocol version 4 which was added in Python 3.4. (It adds support for very large objects)
I love the concept of pickle but find it makes for a bad, opaque, and fragile backing store. The data are in CSV and I don't see any obvious reason not to leave them in that form.
Testing under Python 3.4 on Linux has yielded timeit results of:
Create dummy two column CSV 19M lines: 17.6s
Read CSV file back in to a persistent list: 8.62s
Pickle dump of list of lists: 21.0s
Pickle load of dump into list of lists: 7.00s
As the mantra goes: until you measure it, your intuitions are useless. Sure, loading the pickle is slightly faster (7.00 < 8.62) but not dramatically. The pickle file is nearly twice as large as the CSV and can only be unpickled. By contrast, every tool can read the CSV including Python. I just don't see the advantage.
For reference, here is my IPython 3.4 test code:
import csv
import pickle

def create_csv(path):
    # newline='' lets the csv module control line endings (Python 3)
    with open(path, 'w', newline='') as outf:
        csvw = csv.writer(outf)
        for i in range(19000000):
            csvw.writerow((i, i*2))

def read_csv(path):
    table = []
    with open(path, newline='') as inf:
        csvr = csv.reader(inf)
        for row in csvr:
            table.append(row)
    return table

%timeit create_csv('data.csv')
%timeit read_csv('data.csv')
table = read_csv('data.csv')  # keep a persistent list for the pickle timings
%timeit pickle.dump(table, open('data.pickle', 'wb'))
%timeit new_table = pickle.load(open('data.pickle', 'rb'))
In case you are unfamiliar, IPython is Python in a nicer shell. I explicitly didn't look at memory utilization because the thrust of this answer (Why use pickle?) renders memory use irrelevant.
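To repeat the size comparison on a smaller scale outside IPython, something like the following could be used. It is a sketch only: the function name `roundtrip_sizes` and the temporary file names are assumptions, and the sizes it reports will depend on the data and on the pickle protocol.

```python
import csv
import os
import pickle
import tempfile


def roundtrip_sizes(n_rows):
    """Write n_rows two-column rows as CSV and as pickle; return both file sizes."""
    table = [(i, i * 2) for i in range(n_rows)]
    tmpdir = tempfile.mkdtemp()
    csv_path = os.path.join(tmpdir, "data.csv")
    pkl_path = os.path.join(tmpdir, "data.pickle")

    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(table)

    # The with-block closes the file as soon as the dump finishes.
    with open(pkl_path, "wb") as f:
        pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Confirm the pickle really round-trips before comparing sizes.
    with open(pkl_path, "rb") as f:
        assert pickle.load(f) == table

    return os.path.getsize(csv_path), os.path.getsize(pkl_path)
```

Calling `roundtrip_sizes(100000)` returns the two sizes in bytes, which can then be compared for the data at hand.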

Why is Ruby CSV file reading very slow?

I have a fairly large CSV file, with 4 Million records with 375 fields, that needs to be processed.
I'm using the Ruby CSV library to read this file and it is very slow. I thought PHP CSV file processing was slow, but comparing the two reads, PHP is more than 100 times faster. I'm not sure if I'm doing something dumb or this is just the reality of Ruby not being optimized for this type of batch processing. I set up simple test programs to get comparative times in both Ruby and PHP. All I do is read, with no writing and no building of big arrays, and break out of the CSV read loops after processing 50,000 records. Has anyone else experienced this performance issue?
I'm running locally on a Mac with 4 GB of memory, running OS X 10.6.8 and Ruby 1.8.7.
The Ruby process takes 497 seconds to simply read 50,000 records; the PHP process runs in 4 seconds, which is not a typo: it's more than 100 times faster. FYI, I had code in the loops to print out data values to make sure that each of the processes was actually reading the files and bringing data back.
This is the Ruby Code:
require('time')
require('csv')
x=0
t1=Time.new
CSV.foreach(pathfile) do |row|
  x += 1
  if x > 50000 then break end
end
t2 = Time.new
puts " Time to read the file was #{t2-t1} seconds"
Here is the PHP code:
$t1=time();
$fpiData = fopen($pathfile,'r') or die("can not open input file ");
$seqno=0;
while($inrec = fgetcsv($fpiData,0,',','"')) {
    if ($seqno > 50000) break;
    $seqno++;
}
fclose($fpiData) or die("can not close input data file");
$t2=time();
$t3=$t2-$t1;
echo "Start time is $t1 - end time is $t2 - Time to Process was " . $t3 . "\n";
You'll likely get a massive speed boost by simply updating to a current version of Ruby. In version 1.9, FasterCSV was integrated as Ruby's standard CSV library.
Check out Chruby to manage your different Ruby versions.
Check out the smarter_csv Gem, which has special options for handling huge files by reading data in chunks.
It also returns the CSV data as hashes, which can make it easier to insert or update the data in a database.
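The chunked, hash-per-row style described above can be illustrated in Python with the standard `csv` module. This is a sketch of the general technique, not smarter_csv's API; the function name and the default chunk size are assumptions.

```python
import csv
from itertools import islice


def read_in_chunks(path, chunk_size=1000):
    """Yield lists of row dicts, at most chunk_size rows per list."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk
```

Each chunk is a list of dicts keyed by the header row, so it can be handed to a bulk insert while only `chunk_size` rows are held in memory at a time.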
I think that using CSV is a little bit overkill for this.
A long time ago I saw this question, and the reason given for Ruby's slowness was that it loads the entire CSV file into memory at once. I have seen some people overcome this issue by using the IO class. For example, take a look at this gist for its self.perform(url) method.

Increasing the Loading Speed of Large Files

There are two large text files (Millions of lines) that my program uses. These files are parsed and loaded into hashes so that the data can be accessed quickly. The problem I face is that, currently, the parsing and loading is the slowest part of the program. Below is the code where this is done.
database = extractDatabase(@type).chomp("fasta") + "yml"
revDatabase = extractDatabase(@type + "-r").chomp("fasta.reverse") + "yml"
@proteins = Hash.new
@decoyProteins = Hash.new
File.open(database, "r").each_line do |line|
  parts = line.split(": ")
  @proteins[parts[0]] = parts[1]
end
File.open(revDatabase, "r").each_line do |line|
  parts = line.split(": ")
  @decoyProteins[parts[0]] = parts[1]
end
And the files look like the example below. It started off as a YAML file, but the format was modified to increase parsing speed.
MTMDK: P31946 Q14624 Q14624-2 B5BU24 B7ZKJ8 B7Z545 Q4VY19 B2RMS9 B7Z544 Q4VY20
MTMDKSELVQK: P31946 B5BU24 Q4VY19 Q4VY20
....
I've messed around with different ways of setting up the file and parsing them, and so far this is the fastest way, but it's still awfully slow.
Is there a way to improve the speed of this, or is there a whole other approach I can take?
List of things that don't work:
YAML.
Standard Ruby threads.
Forking off processes and then retrieving the hash through a pipe.
In my usage, reading all or part of the file into memory before parsing usually goes faster. If the database sizes are small enough this could be as simple as
buffer = File.readlines(database)
buffer.each do |line|
  ...
end
If they're too big to fit into memory, it gets more complicated: you have to set up block reads of data followed by a parse, or use separate read and parse threads.
Why not use the solution devised through decades of experience: a database, say SQLite3?
(To be different, although I'd first recommend looking at (Ruby) BDB and other "NoSQL" backend-engines, if they fit your need.)
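As a rough illustration of the database suggestion, here is a sketch using Python's built-in `sqlite3` module with the `PEPTIDE: ID ID ...` line format from the question. The table name, column names, and function names are assumptions chosen for the example.

```python
import sqlite3


def load_into_sqlite(lines, db_path=":memory:"):
    """Parse 'PEPTIDE: ID ID ...' lines into an indexed SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS proteins (peptide TEXT PRIMARY KEY, ids TEXT)"
    )
    rows = []
    for line in lines:
        peptide, sep, ids = line.rstrip("\n").partition(": ")
        if sep:  # skip lines without the 'key: value' separator
            rows.append((peptide, ids))
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT OR REPLACE INTO proteins VALUES (?, ?)", rows)
    return conn


def lookup(conn, peptide):
    """Return the list of protein IDs for a peptide, or an empty list."""
    row = conn.execute(
        "SELECT ids FROM proteins WHERE peptide = ?", (peptide,)
    ).fetchone()
    return row[0].split() if row else []
```

With an on-disk `db_path`, the parse happens once; later runs open the file and query by key instead of rebuilding the hash on every start.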
If fixed-sized records with a deterministic index are used then you can perform a lazy-load of each item through a proxy object. This would be a suitable candidate for a mmap. However, this will not speed up the total access time, but will merely amortize the loading throughout the life-cycle of the program (at least until first use and if some data is never used then you get the benefit of never loading it). Without fixed-sized records or deterministic index values this problem is more complex and starts to look more like a traditional "index" store (eg. a B-tree in an SQL back-end or whatever BDB uses :-).
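The lazy-load idea can be sketched with Python's `mmap` and `struct` modules. The record layout here (a 16-byte key followed by an 8-byte integer) is a hypothetical one chosen only to make the example concrete; the question's data has variable-length values and would need a different layout or a separate index.

```python
import mmap
import struct

RECORD = struct.Struct("<16s q")  # hypothetical layout: 16-byte key, 8-byte value


def write_records(path, items):
    """Write (key_bytes, int_value) pairs as fixed-size records."""
    with open(path, "wb") as f:
        for key, value in items:
            f.write(RECORD.pack(key, value))


class LazyRecords:
    """Read record i on demand from a memory-mapped file."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self._map) // RECORD.size

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        key, value = RECORD.unpack_from(self._map, i * RECORD.size)
        return key.rstrip(b"\x00"), value

    def close(self):
        self._map.close()
        self._file.close()
```

Opening the file costs almost nothing; a record is decoded only when `records[i]` is first accessed, which matches the amortized-loading behaviour described above.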
The general problems with threading here are:
The IO will likely be your bottleneck around Ruby "green" threads
You still need all the data before use
You may be interested in the Widefinder Project, just in general "trying to get faster IO processing".
I don't know too much about Ruby but I have had to deal with the problem before. I found the best way was to split the file up into chunks or separate files then spawn threads to read each chunk in at a single time. Once the partitioned files are in memory combining the results should be fast. Here is some information on Threads in Ruby:
http://rubylearning.com/satishtalim/ruby_threads.html
Hope that helps.

Resources