Are there any metrics, known issues, or directions for saving a large pickle object to the Windows 10 file system

I have a rather large list: 19 million items in memory that I am trying to save to disk (Windows 10 x64 with plenty of space).
pickle.dump(list, open('list.p', 'wb'))
Background:
The original data was read in from a CSV (2 columns, the same 19 million rows) and converted to a list of tuples.
The original CSV file was 740 MB. The file "list.p" shows up in my directory at 2.5 GB, but the Python process does not budge (I was debugging and stepping through each line), and memory utilization at last check was 19 GB and climbing.
I am just interested if anyone can shed some light on this pickle process.
PS - I understand that pickle.HIGHEST_PROTOCOL is now protocol version 4, which was added in Python 3.4 and adds support for very large objects.
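For reference, an explicit-protocol version of the dump looks like this (a minimal sketch, with pairs standing in for my 19-million-item list of tuples):

import pickle

# A minimal sketch: dump with an explicit protocol and a context manager so the
# file handle is flushed and closed. `pairs` is a placeholder for the real list.
with open('list.p', 'wb') as f:
    pickle.dump(pairs, f, protocol=pickle.HIGHEST_PROTOCOL)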

I love the concept of pickle but find it makes for a bad, opaque, and fragile backing store. The data are already in CSV, and I don't see any obvious reason not to leave them in that form.
Testing under Python 3.4 on Linux has yielded timeit results of:
Create dummy two column CSV 19M lines: 17.6s
Read CSV file back in to a persistent list: 8.62s
Pickle dump of list of lists: 21.0s
Pickle load of dump into list of lists: 7.00s
As the mantra goes: until you measure it, your intuitions are useless. Sure, loading the pickle is slightly faster (7.00 < 8.62) but not dramatically. The pickle file is nearly twice as large as the CSV and can only be unpickled. By contrast, every tool can read the CSV including Python. I just don't see the advantage.
For reference, here is my IPython 3.4 test code:
import csv
import pickle

def create_csv(path):
    with open(path, 'w') as outf:
        csvw = csv.writer(outf)
        for i in range(19000000):
            csvw.writerow((i, i*2))

def read_csv(path):
    table = []
    with open(path) as inf:
        csvr = csv.reader(inf)
        for row in csvr:
            table.append(row)
    return table

%timeit create_csv('data.csv')
%timeit read_csv('data.csv')

table = read_csv('data.csv')  # build the table once so the pickle timings have data to work with
%timeit pickle.dump(table, open('data.pickle', 'wb'))
%timeit new_table = pickle.load(open('data.pickle', 'rb'))
In case you are unfamiliar, IPython is Python in a nicer shell. I explicitly didn't look at memory utilization because the thrust of this answer (Why use pickle?) renders memory use irrelevant.

Related

What is a good way of using multiprocessing for bifacial_radiance simulations?

For a university project I am using bifacial_radiance v0.4.0 to run simulations on approx. 270,000 rows of data in an EPW file.
I have set up a scene with some panels in a module following a tutorial on the bifacial_radiance GitHub page.
I am running the Python script for this on a high-power computer with 64 cores. Since Python natively only uses one processor, I want to use multiprocessing, which is currently working. However, it does not seem very fast: even when starting 64 processes, it uses roughly 10% of the CPU's capacity (according to the Task Manager).
The script will first create the scene with panels.
Then it will look at a result file (where I store results as CSV) and compare it to the contents of the radObj.metdata object. Both metdata and my result file use dates, so all dates that exist in metdata but not in the result file are stored in a queue object from the multiprocessing package. I also initialize a result queue.
I want to send a lot of the work to other processors.
To do this I have written two functions:
A file writer function which, every 10 seconds, gets all items from the result queue and writes them to the result file. This function runs in a single multiprocessing.Process like so:
fileWriteProcess = Process(target=fileWriter, args=(resultQueue, resultFileName))
fileWriteProcess.start()
A ray trace function with a unique ID which does the following:
Get an index idx from the index queue (described above)
Use this index in radObj.gendaylit(idx)
Create the octfile. For this I have modified the name the octfile is saved with to use a prefix, which is the name of the process. This is to avoid all the processes using the same octfile on the SSD: octfile = radObj.makeOct(prefix=name)
Run an analysis analysis = bifacial_radiance.AnalysisObj(octfile,radObj.basename)
frontscan, backscan = analysis.moduleAnalysis(scene)
frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
Read the desired results from frontDict and backDict and put them in the resultQueue as a single line of comma-separated values.
This all works. The processes are running after being created in a for loop.
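In outline, each worker looks roughly like this (simplified; extract_results is a placeholder for the code that reads the desired values out of frontDict and backDict):

import bifacial_radiance
from multiprocessing import Process
from queue import Empty

def rayTraceWorker(name, radObj, scene, indexQueue, resultQueue):
    # Pull indices until the index queue is drained.
    while True:
        try:
            idx = indexQueue.get(timeout=5)
        except Empty:
            break
        radObj.gendaylit(idx)                  # sky definition for this time step
        octfile = radObj.makeOct(prefix=name)  # per-process octfile name (my modified makeOct)
        analysis = bifacial_radiance.AnalysisObj(octfile, radObj.basename)
        frontscan, backscan = analysis.moduleAnalysis(scene)
        frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
        # extract_results stands in for picking the wanted values out of the two
        # dicts and formatting them as one comma-separated line.
        resultQueue.put(extract_results(idx, frontDict, backDict))

workers = [Process(target=rayTraceWorker,
                   args=("proc%d" % i, radObj, scene, indexQueue, resultQueue))
           for i in range(64)]
for w in workers:
    w.start()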
This speeds up the whole simulation process quite a bit (10 days down to about 1½ days), but as said earlier the CPU is running at around 10% capacity and the GPU at around 25% capacity. The computer has 512 GB of RAM, which is not an issue. The only communication with the processes is through the resultQueue and the indexQueue, which should not bottleneck the program. I can see that the processes are not synchronized, as the results are written slightly out of order while the input EPW file is sorted.
My question is whether there is a better way to do this that might make it run faster. I can see in the source code that a boolean "hpc" is used to initialize some of the classes, and a comment in the code mentions that it is for multiprocessing, but I can't find any information about it elsewhere.

h2o sparkling water save frame to disk

I am trying to import a frame by creating an H2O frame from a Spark parquet file.
The file is 2 GB, has about 12M rows, and holds sparse vectors with 12k columns.
It is not that big in parquet format, but the import takes forever.
In H2O it is actually reported as 447 MB compressed size. Quite small, actually.
Am I doing it wrong? And once I actually finish importing (it took 39 min), is there any way in H2O to save the frame to disk for fast loading next time?
I understand H2O does some magic behind the scenes which takes so long, but I only found a download-CSV option, which is slow and huge for 11k x 1M sparse data, and I doubt it is any faster to import.
I feel like there is a part missing. Any info about H2O data import/export is appreciated.
Model save/load works great, but loading train/val/test data seems an unreasonably slow procedure.
I have 10 Spark workers with 10 GB each and gave the driver 8 GB. This should be plenty.
Use h2o.exportFile() (h2o.export_file() in Python), with the parts argument set to -1. The -1 effectively means that each machine in the cluster will export just its own data. In your case you'd end up with 10 files, and it should be 10 times quicker than otherwise.
To read them back in, use h2o.importFile() and specify all 10 parts when loading:
frame <- h2o.importFile(c(
  "s3n://mybucket/my.dat.1",
  "s3n://mybucket/my.dat.2",
  ...
))
By giving an array of files, they will be loaded and parsed in parallel.
For a cluster on a local LAN, it is recommended to use HDFS for this. I've had reasonable results keeping the files on S3 when running a cluster on EC2.
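In the Python API the same flow looks roughly like this (the paths are placeholders, not tested against your setup):

import h2o

# Export one part per node: parts=-1 tells each machine to write out its own data.
h2o.export_file(frame, "hdfs://namenode/exports/my_frame", parts=-1)

# Pointing import_file at the directory of part files loads and parses them in parallel.
frame2 = h2o.import_file("hdfs://namenode/exports/my_frame")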
I suggest exporting the dataframe from Spark into SVMLight file format (see MLUtils.saveAsLibSVMFile(...)). This format can then be natively ingested by H2O.
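Assuming the data is available as an RDD of LabeledPoint, the export is roughly (path and variable names are placeholders):

from pyspark.mllib.util import MLUtils

# Write the sparse rows out in SVMLight/libsvm text format, one part per partition.
MLUtils.saveAsLibSVMFile(labeled_points, "hdfs://namenode/exports/data_svmlight")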
As Darren pointed out, you can export data from H2O in multiple parts, which speeds up the export. However, H2O currently only supports export to CSV files, which is sub-optimal for your use case of very sparse data. This functionality is accessible via the Java API:
water.fvec.Frame.export(yourFrame, "/target/directory", yourFrame.key.toString, true, -1 /* automatically determine number of part files */)

Why is Ruby CSV file reading very slow?

I have a fairly large CSV file, with 4 million records of 375 fields each, that needs to be processed.
I'm using the Ruby CSV library to read this file and it is very slow. I thought PHP CSV file processing was slow, but comparing the two reads, PHP is more than 100 times faster. I'm not sure if I'm doing something dumb or if this is just the reality of Ruby not being optimized for this type of batch processing. I set up simple test programs to get comparative times in both Ruby and PHP. All I do is read: no writing, no building of big arrays, and I break out of the CSV read loops after processing 50,000 records. Has anyone else experienced this performance issue?
I'm running locally on a Mac with 4 GB of memory, running OS X 10.6.8 and Ruby 1.8.7.
The Ruby process takes 497 seconds to simply read 50,000 records; the PHP process runs in 4 seconds, which is not a typo: it's more than 100 times faster. FYI, I had code in the loops to print out data values to make sure that each of the processes was actually reading the file and bringing data back.
This is the Ruby Code:
require('time')
require('csv')

x = 0
t1 = Time.new
CSV.foreach(pathfile) do |row|
  x += 1
  break if x > 50000
end
t2 = Time.new
puts " Time to read the file was #{t2 - t1} seconds"
Here is the PHP code:
$t1 = time();
$fpiData = fopen($pathfile, 'r') or die("can not open input file ");
$seqno = 0;
while ($inrec = fgetcsv($fpiData, 0, ',', '"')) {
    if ($seqno > 50000) break;
    $seqno++;
}
fclose($fpiData) or die("can not close input data file");
$t2 = time();
$t3 = $t2 - $t1;
echo "Start time is $t1 - end time is $t2 - Time to Process was " . $t3 . "\n";
You'll likely get a massive speed boost by simply updating to a current version of Ruby. In version 1.9, FasterCSV was integrated as Ruby's standard CSV library.
Check out Chruby to manage your different Ruby versions.
Check out the smarter_csv Gem, which has special options for handling huge files by reading data in chunks.
It also returns the CSV data as hashes, which can make it easier to insert or update the data in a database.
I think that using the CSV class is a little bit of overkill for this.
A long time ago I saw this question, and the reason for the slowness of Ruby here is that it loads the entire CSV file into memory at once. I have seen some people overcome this issue by using the IO class. For example, take a look at this gist and its self.perform(url) method.

Speed up PostgreSQL loading of data from a text file using a Django Python script

I am working with a server whose configuration is:
RAM - 56 GB
Processor - 2.6 GHz x 16 cores
How can I do parallel processing using the shell? How can I utilize all the cores of the processor?
I have to load data from text files which contain millions of entries; for example, one file contains half a million lines of data.
I am using a Django Python script to load the data into a PostgreSQL database.
But it takes a lot of time to add the data to the database even though I have such a well-configured server, and I don't know how to utilize the server resources in parallel so that it takes less time to process the data.
Yesterday I loaded only 15,000 lines of data from a text file into PostgreSQL, and it took nearly 12 hours to do it.
My django python script is as below:
import re
import collections

def SystemType():
    filename = raw_input("Enter file Name:")
    in_file = file(filename, "r")
    out_file = file("SystemType.txt", "w+")
    for line in in_file:
        line = line.decode("unicode_escape")
        line = line.encode("ascii", "ignore")
        values = line.split("\t")
        if values[1]:
            for list in values[1].strip("wordnetyagowikicategory"):
                out_file.write(re.sub("[^\ a-zA-Z()<>\n""]", " ", list))

# Eliminate duplicate entries from the extracted data using a regular expression
def FSystemType():
    lines_seen = set()
    outfile = open("Output.txt", "w+")
    infile = open("SystemType.txt", "r+")
    for line in infile:
        if line not in lines_seen:
            l = line.lstrip()
            # The regex below is used to handle camel case.
            outfile.write(re.sub(r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))', r' \1', l).lower())
            lines_seen.add(line)
    infile.close()
    outfile.close()

sylist = []

def create_system_type(stname):
    syslist = Systemtype.objects.all()
    for i in syslist:
        sylist.append(str(i.title))
    if not stname in sylist:
        slu = slugify(stname)
        st = Systemtype()
        st.title = stname
        st.slug = slu
        # st.sites = Site.objects.all()[0]
        st.save()
        print "one ST added."
If you could express your requirements without the code (not every shell programmer can really read Python), possibly we could help here.
E.g., your report of 12 hours for 15,000 lines suggests you have a too-busy "for" loop somewhere, and I'd point at the nested for:
for list in values[1]....
What are you trying to strip? Individual characters, whole words? ...
Then I'd suggest "awk".
If you are able to work out the precise data structure required by Django, you can load the database tables directly using the psql "copy" command. You could do this by preparing a CSV file to load into the db.
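For example, from Python you can drive the same COPY through psycopg2 rather than the psql client (the table and column names here are hypothetical; match them to the table Django actually created):

import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect("dbname=mydb user=myuser")
with conn, conn.cursor() as cur, open("systemtypes.csv") as f:
    # COPY streams the whole CSV into the table in one statement,
    # avoiding per-row INSERT and ORM overhead.
    cur.copy_expert("COPY myapp_systemtype (title, slug) FROM STDIN WITH CSV", f)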
There are any number of reasons why loading is slow with your approach. First of all, Django has a lot of transactional overhead. Secondly, it is not clear how you are running the Django code; is this via the internal testing server? If so, you may have to deal with the slowness of that. Finally, what makes a database fast is normally not CPU, but rather fast IO and lots of memory.
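To illustrate the transactional-overhead point, batching the ORM writes already helps a great deal; a rough sketch using the Systemtype model from the question (imports for your own app are assumed):

from django.db import transaction
from django.utils.text import slugify

# Systemtype is the model from the question; import it from your own app.
def create_system_types(names):
    existing = set(Systemtype.objects.values_list("title", flat=True))
    new_objs = [Systemtype(title=n, slug=slugify(n))
                for n in set(names) if n not in existing]
    # One transaction and batched INSERTs instead of one query per row.
    with transaction.atomic():
        Systemtype.objects.bulk_create(new_objs, batch_size=1000)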

What is the fastest way to load data in Matlab

I have a vast quantity of data (>800 MB) that takes an age to load into MATLAB, mainly because it's split up into tiny files, each <20 kB. They are all in a proprietary format which I can read and load into MATLAB; it's just that it takes so long.
I am thinking of reading the data in and writing it out to some sort of binary file, which should make subsequent reads (of which there may be many, hence me needing a speed-up) quicker.
So, my question is, what would be the best format to write them to disk to make reading them back again as quick as possible?
I guess I have the option of writing using fwrite, or just saving the variables from MATLAB. I think I'd prefer the fwrite option so that, if needed, I could read them from another package/language...
Look into the HDF5 data format, used by recent versions of MATLAB as the underlying format for .mat files. You can manually create your own HDF5 files using the hdf5write function, and such a file can be accessed from any language that has HDF bindings (most common languages do, or at least offer a way to integrate C code that can call the HDF5 library).
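For instance, a dataset written with hdf5write can be read back from Python with h5py (the file and dataset names below are placeholders):

import h5py

# Open the HDF5 file MATLAB wrote and read one dataset into a NumPy array.
with h5py.File("data.h5", "r") as f:
    data = f["/my_dataset"][()]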
If your data is numeric (and of the same datatype), you might find it hard to beat the performance of plain binary (fwrite).
Binary mat-files are the fastest. Just use
save myfile.mat <var_a> <var_b> ...
I achieved an amazing speed up in loading when I used the '-v6' option to save the .mat files like so:
save(matlabTrainingFile, 'Xtrain', 'ytrain', '-v6');
Here's the size of the matrices that I used in my test ...
  Attr Name      Size           Bytes      Class
  ==== ====      ====           =====      =====
     g Xtest     1430x4000       45760000  double
     g Xtrain    3411x4000      109152000  double
     g Xval      1370x4000       43840000  double
     g ytest     1430x1             11440  double
     g ytrain    3411x1             27288  double
     g yval      1370x1             10960  double
... and the performance improvements that we achieved:
Before the change:
time to load the training data: 78 SECONDS!!!
time to load validation data: 32
time to load the test data: 35
After the change:
time to load the training data: 0 SECONDS!!!
time to load validation data: 0
time to load the test data: 0
Apparently the reason this works so well is that the old version-6 format used less compression than the newer versions.
So your file sizes will be bigger, but they will load WAY faster.
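As a side note, version-6 .mat files can also be read from other environments; for example, from Python via scipy.io.loadmat (the file name is a placeholder; the variable names match the table above):

from scipy.io import loadmat

# loadmat handles v6/v7 (non-HDF5) .mat files; keys are the saved variable names.
data = loadmat("training_data_v6.mat")
Xtrain, ytrain = data["Xtrain"], data["ytrain"]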
