Does CsvDataset read from the file over and over? - tensorflow-datasets

I am struggling with long training times with CsvDataset, and am beginning to wonder whether reading the CSV file may be a bottleneck. Does CsvDataset read from the file over and over?
I am considering first importing the whole dataset into a numpy array and then creating a new TF Dataset from tensors. But such a change will take time, and I don't want to waste it if SO could have told me beforehand that it makes no difference.

I do not know exactly why I got such long training times with CsvDataset, but modifying my code to first import the data into a numpy array and then build the TF Dataset from that array made the training about 10-100 times faster. One more, perhaps relevant, change that followed from this was that the dataset was no longer nested throughout the handling. In the old version, each batch was a tuple of column tensors, whereas in the new version, each batch is simply a tensor. (Further speedups may be achieved by removing transforms tailored to the nested structure, which are now applied to one tensor only.)
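For illustration, here is a minimal sketch of the load-once step using only numpy. The CSV contents, the column layout and the batches helper are made up for the example; it is not the code from my project.

```python
import io
import numpy as np

# Stand-in for the real CSV file; this column layout is hypothetical.
csv_text = "f1,f2,label\n0.1,0.2,0\n0.3,0.4,1\n0.5,0.6,0\n0.7,0.8,1\n"

# Parse the file exactly once into a single in-memory float32 array.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, dtype=np.float32)

def batches(array, batch_size):
    """Yield consecutive row batches; each batch is one array, not a tuple of columns."""
    for start in range(0, len(array), batch_size):
        yield array[start:start + batch_size]

# Every epoch reuses the same array, so the file is never read again.
for epoch in range(2):
    for batch in batches(data, batch_size=2):
        features, labels = batch[:, :-1], batch[:, -1]

# With TensorFlow installed, the array could instead be handed to
# tf.data.Dataset.from_tensor_slices(data) and batched there.
```

The point is only that parsing happens once, outside the training loop, and that each batch is a single array.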


What is the best way to lag a value in a Dask Dataframe?

I have a Dask Dataframe called data which is extremely large and cannot be fit into main memory, and is importantly not sorted. The dataframe is unique on the following key: [strike, expiration, type, time]. What I need to accomplish in Dask is the equivalent of the following in Pandas:
data1 = data[['strike', 'expiration', 'type', 'time', 'value']].sort_values(['strike', 'expiration', 'type', 'time'])
data1['lag_value'] = data1.groupby(['strike', 'expiration', 'type'])['value'].shift(1)
In other words, I need to lag the variable value within a by group. What is the best way to do this in Dask - I know that sorting is going to be very computationally expensive, but I don't think there is a way around it given what I would like to do?
Thank you in advance!
I'll make a few assumptions, but my guess is that the data is 'somewhat' sorted. So you might have file partitions that are specific to a day or a week or maybe an hour if you are working with high-frequency data. This means that you can do sorting within those partitions, which is often a more manageable task.
If this guess is wrong, then it might be a good idea to incur the fixed cost of sorting (and persisting) the data since it will speed up your downstream analysis.
Since you have only one large file and it's not very big (25GB should be manageable if you have access to a cluster), the best thing might be to load into memory with regular pandas, sort and save the data with partitioning on dates/expirations/tickers (if available) or some other column division that makes sense for your downstream analysis.
It might be possible to reduce memory footprint by using appropriate dtypes, for example strike, type, expiration columns might take less space as categories (vs strings).
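As a rough sketch of the dtype idea, the snippet below builds a synthetic frame (the values and the row count are invented) and compares its memory use before and after converting the repetitive columns to categories:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # synthetic row count, chosen only for the demonstration

df = pd.DataFrame({
    "strike": rng.choice([90.0, 100.0, 110.0], size=n),
    "type": rng.choice(["call", "put"], size=n),
    "expiration": rng.choice(["2021-01-15", "2021-02-19", "2021-03-19"], size=n),
})

# deep=True also counts the Python string objects, not just the pointers.
before = df.memory_usage(deep=True).sum()

# Columns with few distinct values are stored as small integer codes.
compact = df.astype({"strike": "category", "type": "category", "expiration": "category"})
after = compact.memory_usage(deep=True).sum()

print(f"{before:,} bytes -> {after:,} bytes")
```

How much this saves on the real data depends on how many distinct values each column has.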
If there is no way at all of loading it into memory at once, then it's possible to iterate on chunks of rows with pandas and then saving the relevant bits in smaller chunks, here's rough pseudocode:
import os
import pandas as pd
reader = pd.read_csv('big_file', iterator=True, chunksize=10**4)
for rows in reader:
    # split each chunk of rows into smaller sets based on some logic
    for (type_, strike), group_df in rows.groupby(['type', 'strike']):
        path = f"{type_}_{strike}.csv"
        # the mode is append, so write the header only when the file is new
        group_df.to_csv(path, mode='a', header=not os.path.exists(path), index=False)
Now the above might sound weird, since the question is tagged with dask and I'm focusing on pandas, but the idea is to save time downstream by partitioning the data on the relevant variables. It is probably possible to achieve this with dask as well, but in my experience in situations like these I would run into memory problems due to data shuffling among workers. Of course, if in your situation there were many files rather than one, then some parallelisation with dask.delayed would be helpful.
Now, after you partition/index your data, then dask will work great when operating on the many smaller chunks. For example, if you partitioned the data based on date and your downstream analysis is primarily using dates, then operations like groupby and shift will be very fast because the workers will not need to check with each other whether they have overlapping dates, so most processing will occur within partitions.
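To make the within-partition step concrete, here is a small pandas sketch. It assumes each partition holds complete contracts (all rows for a given strike/expiration/type), and the add_lag helper and sample values are made up for the example. A function like this is the kind of thing one would hand to dask's map_partitions once the data is partitioned that way.

```python
import pandas as pd

KEYS = ["strike", "expiration", "type"]

def add_lag(partition: pd.DataFrame) -> pd.DataFrame:
    """Sort one partition by contract and time, then lag `value` within each contract."""
    out = partition.sort_values(KEYS + ["time"]).copy()
    out["lag_value"] = out.groupby(KEYS)["value"].shift(1)
    return out

# A tiny unsorted partition with two contracts.
partition = pd.DataFrame({
    "strike": [100, 100, 100, 110, 110],
    "expiration": ["2021-01-15"] * 5,
    "type": ["call"] * 5,
    "time": [3, 1, 2, 2, 1],
    "value": [1.3, 1.1, 1.2, 2.2, 2.1],
})

result = add_lag(partition)
print(result)
```

The first row of each contract has no predecessor, so its lag_value is NaN.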

Speed Up Gensim's Word2vec for a Massive Dataset

I'm trying to build a Word2vec (or FastText) model using Gensim on a massive dataset which is composed of 1000 files, each contains ~210,000 sentences, and each sentence contains ~1000 words. The training was made on a 185gb RAM, 36-core machine.
I validated that
gensim.models.word2vec.FAST_VERSION == 1
First, I've tried the following:
files = gensim.models.word2vec.PathLineSentences('path/to/files')
model = gensim.models.word2vec.Word2Vec(files, workers=-1)
But after 13 hours I decided it is running for too long and stopped it.
Then I tried building the vocabulary based on a single file, and train based on all 1000 files as follows:
files = os.listdir('path/to/files')
model = gensim.models.word2vec.Word2Vec(min_count=1, workers=-1)
for file in files:
    model.train(corpus_file=file, total_words=model.corpus_total_words, epochs=1)
But I checked a sample of word vectors before and after the training, and there was no change, which means no actual training was done.
I could use some advice on how to run it quickly and successfully. Thanks!
Update #1:
Here is the code to check vector updates:
file = 'path/to/single/gziped/file'
total_words = 197264406 # number of words in 'file'
total_examples = 209718 # number of records in 'file'
model = gensim.models.word2vec.Word2Vec(iter=5, workers=12)
wv_before = model.wv['9995']
model.train(corpus_file=file, total_words=total_words, total_examples=total_examples, epochs=5)
wv_after = model.wv['9995']
so the vectors: wv_before and wv_after are exactly the same
There's no facility in gensim's Word2Vec to accept a negative value for workers. (Where'd you get the idea that would be meaningful?)
So, it's quite possible that's breaking something, perhaps preventing any training from even being attempted.
Was there sensible logging output (at level INFO) suggesting that training was progressing in your trial runs, either against the PathLineSentences or your second attempt? Did utilities like top show busy threads? Did the output suggest a particular rate of progress & let you project-out a likely finishing time?
I'd suggest using a positive workers value and watching INFO-level logging to get a better idea what's happening.
Unfortunately, even with 36 cores, using a corpus iterable sequence (like PathLineSentences) puts gensim Word2Vec in a mode where you'll likely get maximum throughput with a workers value in the 8-16 range, using far less than all your threads. But it will do the right thing, on a corpus of any size, even if it's being assembled by the iterable sequence on-the-fly.
Using the corpus_file mode can saturate far more cores, but you should still specify the actual number of worker threads to use (in your case, workers=36), and it is designed to work from a single file with all data.
Your code which attempts to train() many times with corpus_file has lots of problems, and I can't think of a way to adapt corpus_file mode to work on your many files. Some of the problems include:
you're only building the vocabulary from the 1st file, which means any words only appearing in other files will be unknown and ignored, and any of the word-frequency-driven parts of the Word2Vec algorithm may be working on unrepresentative frequencies
the model builds its estimate of the expected corpus size (eg: model.corpus_total_words) from the build_vocab() step, so every train() will behave as if that size is the total corpus size, in its progress-reporting & management of the internal alpha learning-rate decay. So those logs will be wrong, the alpha will be mismanaged in a fresh decay each train(), resulting in a nonsensical jigsaw up-and-down alpha over all files.
you're only iterating over each file's contents once, which isn't typical. (It might be reasonable in a giant 210-billion word corpus, though, if every file's text is equally and randomly representative of the domain. In that case, the full corpus once might be as good as iterating over a corpus that's 1/5th the size 5 times. But it'd be a problem if some words/patterns-of-usage are all clumped in certain files - the best training interleaves contrasting examples throughout each epoch and all epochs.)
min_count=1 is almost always unwise with this algorithm, and especially so in large corpora of typical natural-language word frequencies. Rare words, and especially those appearing only once or a few times, make the model gigantic but those words won't get good word-vectors, and keeping them in acts like noise interfering with the improvement of other more-common words.
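The learning-rate problem in the list above can be pictured with a few lines of plain Python. This is only an illustration, assuming a linear decay between gensim's default alpha=0.025 and min_alpha=0.0001; the linear_alpha helper and the file and step counts are made up.

```python
def linear_alpha(progress, start_alpha=0.025, end_alpha=0.0001):
    """Learning rate after a given fraction (0..1) of one train() call."""
    return start_alpha - (start_alpha - end_alpha) * progress

n_files = 3          # hypothetical: three files, one train() call each
steps_per_file = 4   # hypothetical number of progress checkpoints per call

# One train() per file: the decay restarts at the top for every file.
per_file = [linear_alpha(s / steps_per_file)
            for _ in range(n_files)
            for s in range(steps_per_file + 1)]

# A single train() over the whole corpus: one smooth decay.
total_steps = n_files * steps_per_file
single_run = [linear_alpha(s / total_steps) for s in range(total_steps + 1)]

print([round(a, 4) for a in per_file])
print([round(a, 4) for a in single_run])
```

The first list climbs back to the starting rate at every file boundary, which is the jigsaw pattern described above; the second falls steadily from start to end.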
I recommend:
Try the corpus iterable sequence mode, with logging and a sensible workers value, to at least get an accurate read of how long it might take. (The longest step will be the initial vocabulary scan, which is essentially single-threaded and must visit all data. But you can .save() the model after that step, to then later re-.load() it, tinker with settings, and try different train() approaches without repeating the slow vocabulary survey.)
Try aggressively-higher values of min_count (discarding more rare words for a smaller model & faster training). Perhaps also try aggressively-smaller values of sample (like 1e-05, 1e-06, etc) to discard a larger fraction of the most-frequent words, for faster training that also often improves overall word-vector quality (by spending relatively more effort on less-frequent words).
If it's still too slow, consider whether a smaller subsample of your corpus might be enough.
Consider the corpus_file method if you can roll much or all of your data into the single file it requires.
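Rolling the data into one file could look roughly like the sketch below. It assumes every input file is plain or gzip-compressed text with one whitespace-tokenized sentence per line; the open_text and merge_corpus helpers and the demo file names are made up. The merged path would then be what you pass as corpus_file.

```python
import gzip
import os
import tempfile

def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")

def merge_corpus(src_dir, dst_path):
    """Concatenate every file in src_dir into dst_path, one sentence per line."""
    n_lines = 0
    with open(dst_path, "w", encoding="utf-8") as dst:
        for name in sorted(os.listdir(src_dir)):
            with open_text(os.path.join(src_dir, name)) as src:
                for line in src:
                    line = line.strip()
                    if line:  # skip empty lines
                        dst.write(line + "\n")
                        n_lines += 1
    return n_lines

# Tiny demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "parts")
    os.mkdir(src_dir)
    with open(os.path.join(src_dir, "a.txt"), "w", encoding="utf-8") as f:
        f.write("first sentence here\nsecond sentence here")  # no trailing newline
    with gzip.open(os.path.join(src_dir, "b.txt.gz"), "wt", encoding="utf-8") as f:
        f.write("third sentence here\n")
    merged = os.path.join(tmp, "merged.txt")
    total = merge_corpus(src_dir, merged)
    with open(merged, encoding="utf-8") as f:
        merged_lines = f.read().splitlines()

print(total, merged_lines)
```

Writing line by line, rather than copying raw bytes, guarantees that the last sentence of one file is never glued to the first sentence of the next.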

Gensim Word2Vec Model trained but not saved

I am using gensim and executed the following code (simplified):
model = gensim.models.Word2Vec(...)
After days my code finished model.train(...). However, during saving, I experienced:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
I noticed that there were some npy files generated:
Are those intermediate results I can re-use or do I have to rerun the entire process?
Those are parts of the saved model, but unless the master file_name file (a Python-pickled object) exists and is complete, they may be hard to re-use.
However, if your primary interest is the final word-vectors, those are in the .wv.vectors.npy file. If it appears to be full-length (same size as the syn1neg file), it may be complete. What you're missing is the dict that tells you which word is in which index.
So, the following might work:
Repeat the original process, with the exact same corpus & model parameters, but only through the build_vocab() step. At that point, the new model.wv.vocab dict should be identical to the one from the failed-save-run.
Save that model, without ever train()ing it, to a new filename.
Confirming that newmodel.wv.vectors.npy (with randomly-initialized untrained vectors) is the same size as oldmodel.wv.vectors.npy, copy the oldmodel file to the newmodel's name.
Re-load the new model, and run some sanity checks that the words make sense.
Perhaps save off just the word-vectors, using something like newmodel.wv.save_word2vec_format().
Potentially, the resurrected newmodel could also be patched to use the old syn1neg file as well, if it appears complete. It might work to further train the patched model (either with or without having reused the older syn1neg).
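The size check before copying files around can be sketched with numpy alone. The file names below mirror the oldmodel/newmodel naming used above but are otherwise hypothetical, and the arrays are tiny stand-ins for the real vector files.

```python
import os
import tempfile
import numpy as np

def same_layout(old_path, new_path):
    """True when two .npy arrays have identical shape and dtype."""
    old = np.load(old_path, mmap_mode="r")  # memory-mapped: no full read into RAM
    new = np.load(new_path, mmap_mode="r")
    return old.shape == new.shape and old.dtype == new.dtype

with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "oldmodel.wv.vectors.npy")
    new_path = os.path.join(tmp, "newmodel.wv.vectors.npy")
    short_path = os.path.join(tmp, "short.wv.vectors.npy")

    np.save(old_path, np.ones((5, 4), dtype=np.float32))    # stands in for trained vectors
    np.save(new_path, np.zeros((5, 4), dtype=np.float32))   # stands in for untrained vectors
    np.save(short_path, np.zeros((3, 4), dtype=np.float32)) # a mismatched vocabulary size

    matches = same_layout(old_path, new_path)
    mismatch = same_layout(old_path, short_path)

print(matches, mismatch)
```

Only copy the old file over the new one if the layouts match. A file that was cut off mid-write may fail to load at all, which is itself a sign that it is incomplete.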
Separately: only the very largest corpuses, or an installation missing the gensim cython optimizations, or a machine without enough RAM (and thus swapping during training), would usually require a training session taking days. You might be able to run much faster. Check:
Is any virtual-memory swapping happening during the entire training? If it is, it will be disastrous for training throughput, and you should use a machine with more RAM or be more aggressive about trimming the vocabulary/model size with a higher min_count. (Smaller min_count values mean a larger model, slower training, poor-quality vectors for words with just a few examples, and also counterintuitively worse-quality vectors for more-frequent words too, because of interference from the noisy rare words. It's usually better to ignore lowest-frequency words.)
Is there any warning displayed about a "slow version" (pure Python with no effective multi-threading) being used? If so your training will be ~100X slower than if that problem is resolved. If the optimized code is available, maximum training throughput will likely be achieved with some workers value between 3 and 12 (but never larger than the number of machine CPU cores).
For a very large corpus, the sample parameter can be made more aggressive – such as 1e-04 or 1e-05 instead of the default 1e-03 – and it may both speed training and improve vector quality, by avoiding lots of redundant overtraining of the most-frequent words.
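To get a feel for what a smaller sample value does, here is a sketch of the frequent-word downsampling rule, assuming the formula from the original word2vec.c (which gensim follows as far as I know). The keep_probability helper is made up, and word_freq is the word's share of all tokens in the corpus.

```python
import math

def keep_probability(word_freq, sample):
    """Chance that one occurrence of a word is kept, given its share of all tokens."""
    ratio = sample / word_freq
    return min(1.0, math.sqrt(ratio) + ratio)

# A very frequent word (5% of all tokens) and a rare one (one in a million).
for sample in (1e-3, 1e-4, 1e-5):
    frequent = keep_probability(0.05, sample)
    rare = keep_probability(1e-6, sample)
    print(f"sample={sample:g}: frequent word kept {frequent:.3f}, rare word kept {rare:.3f}")
```

Lowering sample throws away more occurrences of the most frequent words while leaving rare words untouched, which is where the speedup comes from.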
Good luck!

How can I change my QListWidget implementation to improve performance?

Understandably, the code below shows that the QList.addItems call takes 0.393 seconds. I am using similar code in a larger program - the idea is the user is presented with a (very) large list of unique part numbers. They begin entering a regular expression in a separate QEntry widget, which updates the QList in real-time, cutting the list shorter and shorter as it goes.
The problem I have is that with such a large list, it takes a noticeable amount of time to update the QList view as the user inputs their regular expression. I do not imagine that this is a unique issue - what are common solutions to this problem? How can I truncate my list in a reasonable manner such that the user experience is not significantly negatively impacted? Is a QListView a better alternative? Does it make sense to cut the list short to, say, 100 items or so and load more items in real-time as the user scrolls? Or are there pitfalls to this approach?
import time
from itertools import cycle
from PyQt4 import QtCore, QtGui
pns = cycle(['ABCD', 'EFGH', 'IJKL', 'MNOP'])
pn_col = []
for i in range(505000):
    pn_col.append("{}_{}".format(next(pns), i))
app = QtGui.QApplication([])
myList = QtGui.QListWidget()
# time.clock() was removed in Python 3.8; use time.perf_counter() there
start = time.clock()
myList.addItems(pn_col)  # the call being timed
print("Time = {}".format(time.clock() - start))
update : turns out my list is actually about 4000 unique values repeating for the grand total of 500k. doh! cutting out the duplicates fixed my performance issue. I am still interested in knowing, though, for science and stuff, more efficient ways of listing this data. and it is a true flat list, I cannot easily group or hierarchy them.
A QListWidget is a convenient widget for small lists. To have a real list you should have a QListView associated to a QAbstractListModel that you reimplement.
In your example, the list is 500k items long! How about using a real database to handle that? With Qt you can use a QSqlTableModel backed by SQLite; see the Qt documentation. The sorting and searching will be done in SQL and the results propagated to the model.
Note: if you use python 2.7, replace range() by xrange(), it returns a value at a time instead of constructing the whole list of 500k elements in memory.
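Whichever widget or model ends up displaying the data, the filtering step itself can be kept cheap. The sketch below is plain Python with no Qt involved: it deduplicates the list once and hands back at most 100 regex matches at a time. The filter_parts helper and the limit are assumptions for illustration.

```python
import re
from itertools import cycle, islice

prefixes = cycle(['ABCD', 'EFGH', 'IJKL', 'MNOP'])
# 500k entries but only 4000 distinct values, mimicking the list in the question.
part_numbers = ["{}_{}".format(next(prefixes), i % 4000) for i in range(500000)]

# Deduplicate once, preserving first-seen order (dicts keep insertion order).
unique_parts = list(dict.fromkeys(part_numbers))

def filter_parts(pattern, items, limit=100):
    """Return at most `limit` items matching the regular expression."""
    try:
        regex = re.compile(pattern)
    except re.error:
        return []  # a half-typed pattern may be invalid; show nothing for now
    # islice stops scanning as soon as `limit` matches have been found.
    return list(islice((item for item in items if regex.search(item)), limit))

matches = filter_parts(r"^EFGH_1\d\d$", unique_parts)
print(len(unique_parts), len(matches))
```

The returned list is what would be pushed into the view on each keystroke; further matches could be fetched with a larger limit when the user scrolls.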

Divide a dataset into chunks

I have a function in R that chokes if I apply it to a dataset with more than 1000 rows. Therefore, I want to split my dataset into a list of n chunks, each of not more than 1000 rows.
Here's the function I'm currently using to do the chunking:
chunkData <- function(Data, chunkSize) {
  Chunks <- floor(0:(nrow(Data) - 1) / chunkSize)
  lapply(unique(Chunks), function(x) Data[Chunks == x, ])
}
I would like to make this function more efficient, so that it runs faster on large datasets.
You can do this easily using split from base R. For example, split(iris, 1:3) will split the iris dataset into a list of three data frames by row. To get chunks of a fixed size instead, pass a grouping vector such as ceiling(seq_len(nrow(iris)) / 1000) as the second argument.
Since the output is still a list of data frames, you can easily use lapply on the output to process the data, and combine them as required.
Since speed is the primary issue for using this approach, I would recommend that you take a look at the data.table package, which works great with large data sets. If you specify more information on what you are trying to achieve in your function, people at SO might be able to help.
Replace the lapply() call with a call to split():
split(Data, Chunks)
You should also take a look at ddply from the plyr package; this package is built around the split-apply-combine principle. The paper about the package explains how this works and what is available in plyr.
The general strategy I would take here is to add a new column to the dataset called chunkid. This cuts up the data in chunks of 1000 rows; look at the rep function to create this column. You can then do:
result = ddply(dat, .(chunkid), functionToPerform)
I like plyr for its clear syntax and structure, and its support of parallel processing. As already said, please also take a look at data.table, which could be quite a bit faster in some situations.
An additional tip could be to use matrices instead of data.frames...
