When I train an NER model with spaCy and monitor my CPU, I can see that it is only using one core.
I have found only one example of multiprocessing in the spaCy documentation, but it is not applied to training: https://github.com/explosion/spaCy/blob/master/examples/pipeline/multi_processing.py
I am just using the training code provided in the examples, but with a list of 500,000 tuples in TRAIN_DATA following the same structure: ("rawtext", {"entities": [(entity_start_offset, entity_end_offset, "ENTITY")]})
with nlp.disable_pipes(*other_pipes):  # only train NER
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        batches = spacy.util.minibatch(TRAIN_DATA,
                                       size=spacy.util.compounding(4., 32., 1.001))
        for i, batch in enumerate(batches):
            print(i)
            texts, annotations = zip(*batch)
            # Updating the weights
            nlp.update(texts, annotations, sgd=optimizer,
                       drop=0.35, losses=losses)
        print('Losses', losses)
I need to speed up the training using multiple cores. Right now, it takes 40 minutes per epoch on a single core.
It seems that since version 2.1, training in spaCy only runs on a single core (https://github.com/explosion/spaCy/issues/3507). I find that a little odd, since parallelizing the matrix multiplications has proven worthwhile in other frameworks.
According to the docs (spaCy 3): https://spacy.io/usage/processing-pipelines#multiprocessing
spaCy includes built-in support for multiprocessing with nlp.pipe using the n_process option:
# Multiprocessing with 4 processes
docs = nlp.pipe(texts, n_process=4)
# With as many processes as CPUs (use with caution!)
docs = nlp.pipe(texts, n_process=-1)
I have 2 use cases:
Extract, Transform and Load from Oracle / PostgreSQL / Redshift / S3 / CSV to my own Redshift cluster
Schedule the job so it runs daily/weekly (INSERT + TABLE or INSERT + NONE options preferable).
I am currently using:
SQLAlchemy for extracts (works well generally).
PETL for transforms and loads (works well on smaller data sets, but for ~50m+ rows it is slow and the connection to the database(s) times out).
An internal tool for the scheduling component (which stores the transforms in XML and then loads from the XML, and seems rather long and complicated).
I have been looking through this link but would welcome additional suggestions. Exporting to Spark or similar is also welcome if there is an "easier" process where I can just do everything through Python (I'm only using Redshift because it seems like the best option).
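For concreteness, here is the rough shape of the plain SQLAlchemy/pandas path I am effectively on today; the connection strings, table names, and chunk size below are placeholders, and this is the part that struggles at ~50m+ rows:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection strings and table names
source = create_engine("postgresql://user:password@source-host:5432/source_db")
target = create_engine("postgresql://user:password@redshift-host:5439/warehouse")

# Stream the source table in chunks so large extracts don't exhaust memory
for chunk in pd.read_sql_query("SELECT * FROM source_table", source, chunksize=100000):
    # transform step would go here
    chunk.to_sql("target_table", target, if_exists="append", index=False)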
You can try pyetl, an ETL framework written in Python 3:
from pyetl import Task, DatabaseReader, DatabaseWriter
reader = DatabaseReader("sqlite:///db.sqlite3", table_name="source")
writer = DatabaseWriter("sqlite:///db.sqlite3", table_name="target")
columns = {"id": "uuid", "name": "full_name"}
functions = {"id": str, "name": lambda x: x.strip()}
Task(reader, writer, columns=columns, functions=functions).start()
How about Python and Pandas? This is what we use for our ETL processing.
I'm using Pandas to access my ETL files; try doing something like this:
Create a class with all your queries in it.
Create another class that processes the actual data warehouse, using Pandas for the transforms and Matplotlib for the graphs.
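As a rough illustration only (not a full implementation), the two-class split could look something like this; the query, table names, and connection URLs are made up:

import pandas as pd
from sqlalchemy import create_engine


class Queries:
    """Holds all the SQL used by the ETL (made-up example query)."""
    DAILY_ORDERS = "SELECT order_id, amount, created_at FROM orders"


class WarehouseETL:
    """Extracts with a query, transforms with Pandas, loads into the warehouse."""

    def __init__(self, source_url, target_url):
        self.source = create_engine(source_url)
        self.target = create_engine(target_url)

    def run(self):
        df = pd.read_sql_query(Queries.DAILY_ORDERS, self.source)  # extract
        df["amount"] = df["amount"].fillna(0)                      # transform
        df.to_sql("orders_clean", self.target,                     # load
                  if_exists="replace", index=False)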
Consider having a look at the convtools library: it provides lots of data-processing primitives, is pure Python, and has zero dependencies.
Since it generates ad hoc Python code under the hood, it sometimes outperforms pandas/polars, so it can fill some gaps in your workflows, especially if those have a dynamic nature.
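A small sketch of the kind of converter it builds (the input rows and field names are made up; check the convtools docs for the full API):

from convtools import conversion as c

# Made-up input rows
rows = [
    {"name": "alice", "amount": 10},
    {"name": "bob", "amount": 5},
    {"name": "alice", "amount": 3},
]

# Build an ad hoc converter (code is generated once, then reused): sum amounts per name
converter = (
    c.group_by(c.item("name"))
    .aggregate({
        "name": c.item("name"),
        "total": c.ReduceFuncs.Sum(c.item("amount")),
    })
    .gen_converter()
)

print(converter(rows))  # [{'name': 'alice', 'total': 13}, {'name': 'bob', 'total': 5}]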
I am trying to run a CNN on the cloud (Google Cloud ML) because my laptop does not have a GPU card.
So I uploaded my data to Google Cloud Storage: a .csv file with 1,500 entries, like so:
| label   | img_path   |
| label_1 | /img_1.jpg |
| label_2 | /img_2.jpg |
and the corresponding 1500 jpgs.
My input_fn looks like so:
def input_fn(filename,
batch_size,
num_epochs=None,
skip_header_lines=1,
shuffle=False):
filename_queue = tf.train.string_input_producer(filename, num_epochs=num_epochs)
reader = tf.TextLineReader(skip_header_lines=skip_header_lines)
_, row = reader.read(filename_queue)
row = parse_csv(row)
pt = row.pop(-1)
pth = filename.rpartition('/')[0] + pt
img = tf.image.decode_jpeg(tf.read_file(tf.squeeze(pth)), 1)
img = tf.to_float(img) / 255.
img = tf.reshape(img, [IMG_SIZE, IMG_SIZE, 1])
row = tf.concat(row, 0)
if shuffle:
return tf.train.shuffle_batch(
[img, row],
batch_size,
capacity=2000,
min_after_dequeue=2 * batch_size + 1,
num_threads=multiprocessing.cpu_count(),
)
else:
return tf.train.batch([img, row],
batch_size,
allow_smaller_final_batch=True,
num_threads=multiprocessing.cpu_count())
Here is what the full graph looks like (very simple CNN indeed):
Running the training with a batch size of 200 on my laptop (where the data is stored locally), most of the compute time is spent on the gradients node, which is what I would expect. The batch node has a compute time of ~12 ms.
When I run it on the cloud (scale tier BASIC), the batch node takes more than 20 s, and according to TensorBoard the bottleneck seems to be the QueueDequeueUpToV2 subnode:
Does anyone have a clue why this happens? I am pretty sure I am getting something wrong here, so I'd be happy to learn.
A few remarks:
- Switching between batch/shuffle_batch with different min_after_dequeue values has no effect.
- When using BASIC_GPU, the batch node is also on the CPU, which is normal according to what I read, and it takes roughly 13 s.
- Adding a time.sleep after the queues are started, to make sure they are not starved, also has no effect.
- Compute time is indeed linear in batch_size, so with a batch_size of 50 the compute time is about 4 times smaller than with a batch_size of 200.
Thanks for reading; I'd be happy to give more details if anyone needs them.
Best,
Al
Update:
- The Cloud ML instance and the buckets were not in the same region; putting them in the same region improved results 4x.
- Creating a .tfrecords file brought batching down to ~70 ms, which seems acceptable. I used this blog post as a starting point to learn about it, and I recommend it.
I hope this will help others to create a fast data input pipeline!
Try converting your images to TFRecord format and reading them directly from the graph. The way you are doing it, there is no possibility of caching, and if your images are small, you are not taking advantage of the high sustained reads from Cloud Storage. Saving all your jpg images into one TFRecord file, or a small number of files, will help.
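For example, a minimal sketch of packing the (label, jpg) pairs into a single TFRecord file with the TF 1.x API; the integer label encoding and local image paths are assumptions:

import tensorflow as tf

def write_tfrecords(rows, output_path):
    # rows: iterable of (label_id, img_path) pairs, with label_id already an int
    with tf.python_io.TFRecordWriter(output_path) as writer:
        for label_id, img_path in rows:
            with open(img_path, 'rb') as f:
                img_bytes = f.read()
            example = tf.train.Example(features=tf.train.Features(feature={
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label_id])),
                'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_bytes])),
            }))
            writer.write(example.SerializeToString())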
Also, make sure your bucket is a single-region bucket in a region that has GPUs, and that you are submitting to Cloud ML in that region.
I had a similar problem before. I solved it by changing tf.train.batch() to tf.train.batch_join(). In my experiment, with a batch size of 64 and 4 GPUs, it took 22 minutes using tf.train.batch(), whereas it took only 2 minutes using tf.train.batch_join().
From the TensorFlow docs:
If you need more parallelism or shuffling of examples between files, use multiple reader instances using the tf.train.shuffle_batch_join
https://www.tensorflow.org/programmers_guide/reading_data
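A hedged sketch of what that looks like with multiple readers (TF 1.x); read_example() is a hypothetical helper that returns one (img, row) pair from the shared filename queue, as in the question's input_fn:

def input_fn_join(filename_queue, batch_size, num_readers=4):
    # Each element of example_list comes from its own reader op
    example_list = [read_example(filename_queue) for _ in range(num_readers)]
    return tf.train.shuffle_batch_join(
        example_list,
        batch_size=batch_size,
        capacity=2000,
        min_after_dequeue=2 * batch_size + 1)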
I am using H2O on a large dataset: 8 million rows and 10 columns. I trained my random forest using h2o.randomForest. The model trained fine and prediction also worked correctly. Now I would like to convert my predictions to a data.frame. I did this:
A2 = h2o.predict(m1, Tr15_h2o)
pred2 = as.data.frame(A2)
but it is too slow; it takes forever. Is there a faster way to convert from H2O to a data.frame or data.table?
Here is some code which demonstrates how to use the data.table package on the back end, along with some benchmarks on my MacBook:
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "16G")
hf <- h2o.createFrame(rows = 10000000)
options("h2o.use.data.table"=FALSE) #no data.table
system.time(df <- as.data.frame(hf))
# user system elapsed
# 224.387 13.274 272.252
options("datatable.verbose"=TRUE)
options("h2o.use.data.table"=TRUE) # use data.table
system.time(df2 <- as.data.frame(hf))
# user system elapsed
# 50.686 4.020 82.946
You can get more detailed info when using data.table if you turn on this option: options("datatable.verbose"=TRUE).
We have seen this issue with large prediction datasets: exporting the prediction data frame or converting it to other types takes a long time. I have opened the following JIRA ticket to track it:
https://0xdata.atlassian.net/browse/PUBDEV-4166
Yes, there are some new options you can turn on that use data.table::fread to speed this up. Type h2o:::as.data.frame.H2OFrame to see the small amount of R source code containing the options, or see the H2O release notes. Please also try the latest fread from the data.table development branch, which is now parallel as of yesterday.
Once users have reported success, we can turn the option on by default.
I am working on a data-wrangling problem using Python, which processes a dirty Excel file into a clean Excel file.
I would like to process multiple input files by introducing concurrency/parallelism.
I have the following options: 1) multithreading, 2) the multiprocessing module, 3) the ParallelPython module.
I have a basic idea of the three methods; I would like to know which method is best and why.
In brief, processing a SINGLE dirty Excel file currently takes 3 minutes.
Objective: introduce parallelism/concurrency to process multiple files at once.
I am looking for the best method of parallelism to achieve this objective.
Since your process is mostly CPU-bound, multithreading won't speed things up because of the GIL...
I would recommend multiprocessing or concurrent.futures, since they are a bit simpler than ParallelPython (only a bit :) )
example:
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    for file_path, clean_file in zip(files, executor.map(data_wrangler, files)):
        print('%s is now clean!' % file_path)
        # do something with clean_file if you want
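For comparison, a hedged equivalent with multiprocessing.Pool, assuming the same hypothetical data_wrangler(file_path) function and files list as above:

import multiprocessing

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        # pool.map distributes the files across worker processes
        for file_path, clean_file in zip(files, pool.map(data_wrangler, files)):
            print('%s is now clean!' % file_path)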
Only if you need to distribute the load between servers would I recommend ParallelPython.
I have a rather large list: 19 million items in memory that I am trying to save to disk (Windows 10 x64 with plenty of space).
pickle.dump(list, open('list.p'.format(file), 'wb'))
Background:
The original data was read in from a csv (2 columns) with the same number of rows (19 million) and was modified into a list of tuples.
The original csv file was 740 MB. The file "list.p" is showing up in my directory at 2.5 GB, but the Python process does not budge (I was debugging and stepping through each line), and the memory utilization at last check was at 19 GB and increasing.
I am just interested if anyone can shed some light on this pickle process.
PS: I understand that pickle.HIGHEST_PROTOCOL is now protocol version 4, which was added in Python 3.4 (it adds support for very large objects).
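For reference, a minimal sketch of dumping with the highest protocol set explicitly; my_list stands in for the 19-million-item list of tuples:

import pickle

# protocol 4 on Python 3.4+, which supports very large objects
with open('list.p', 'wb') as fh:
    pickle.dump(my_list, fh, protocol=pickle.HIGHEST_PROTOCOL)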
I love the concept of pickle but find it makes for a bad, opaque, and fragile backing store. The data are in CSV, and I don't see any obvious reason not to leave them in that form.
Testing under Python 3.4 on Linux yielded these timeit results:
Create a dummy two-column CSV of 19M lines: 17.6 s
Read the CSV file back into a persistent list: 8.62 s
Pickle dump of the list of lists: 21.0 s
Pickle load of the dump into a list of lists: 7.00 s
As the mantra goes: until you measure it, your intuitions are useless. Sure, loading the pickle is slightly faster (7.00 s < 8.62 s), but not dramatically so. The pickle file is nearly twice as large as the CSV and can only be unpickled; by contrast, every tool can read the CSV, including Python. I just don't see the advantage.
For reference, here is my IPython 3.4 test code:
import csv
import pickle

def create_csv(path):
    with open(path, 'w') as outf:
        csvw = csv.writer(outf)
        for i in range(19000000):
            csvw.writerow((i, i * 2))

def read_csv(path):
    table = []
    with open(path) as inf:
        csvr = csv.reader(inf)
        for row in csvr:
            table.append(row)
    return table

%timeit create_csv('data.csv')
%timeit read_csv('data.csv')

# keep one copy of the table around for the pickle tests
table = read_csv('data.csv')
%timeit pickle.dump(table, open('data.pickle', 'wb'))
%timeit new_table = pickle.load(open('data.pickle', 'rb'))
In case you are unfamiliar, IPython is Python in a nicer shell. I explicitly didn't look at memory utilization, because the thrust of this answer (why use pickle at all?) renders memory use irrelevant.