Dataloader on top of protobuf file using pytorch's torch.utils.data.IterableDataset - protocol-buffers

I am building a RNN network using pytorch.
The data is stored in various protobuf file.
Each record in protobuf represents one training example with multiple timestamp.
As this is very large dataset, reading the whole data in memory or random read by extending torch.utils.data.Dataset class isn't feasible.
As per the docs using the torch.utils.data.IterableDataset is recommended.
DataLoader on top of IterableDataset would be able to achieve parallelism
However I am not able to find an implementation of this on custom data, docs only talk about a simple range iterator.
import math
import stream
from src import record_pb2
import torch
class MyIterableDataset(torch.utils.data.IterableDataset):
def __init__(self, pb_file):
self.pb_file = pb_file
self.start = 0
self.end = 0
# One time read of the data to get the total count of records in the dataset
with stream.open(self.pb_file, 'rb') as data_stream:
for _ in data_stream:
self.end += 1
def __iter__(self):
worker_info = torch.utils.data.get_worker_info()
if worker_info is None: # Single-process data loading, return the full iterator
iter_start = self.start
iter_end = self.end
else:
# in a worker process, split the workload
per_worker = int(math.ceil((self.end - self.start))/float(worker_info.num_workers))
worker_id = worker_info.id
iter_start = self.start + worker_id * per_worker
iter_end = min(iter_start + per_worker, self.end)
data_stream = stream.open(self.pb_file, 'rb')
# Block to skip the streaming data till the iter start for the current worker process
i = 0
for _ in data_stream:
i += 1
if i >= iter_start:
break
return iter(self.pb_stream)
I am expecting a mechanism by which a parallel data feeder could be designed on top of a large streaming data (protobuf)

The __iter__ method of the IterableDataset would yield your data samples one at a time. In a parallel setup, you have to choose the samples based on worker_id. And with respect to the DataLoader using this dataset, shuffle and sampler options would not work, as an IterableDataset is not going to have any indices. In other words, have your dataset yield one sample at a time and the data loader will take care of loading them. Does this answer?

Related

Pytorch: RAM explodes when using multiprocessing SharedMemory and CUDA

I would like to use multiprocessing to launch multiple training instances on CUDA device. Since the data is common between the processes, I want to avoid data copy for every process. I'm using python 3.8's SharedMemory from multiprocessing module to achieve this following this SO example.
I can allocate a memory block using SharedMemory and create as many processes as I'd like with constant memory (RAM) usage. However, when I try to send tensors to CUDA, the memory scales linearly with the number of processes. It appears as if when c.to(device) is called, the base data is copied for every process.
Does any one know why this is happening? Any ideas to mitigate this issue?
Here is the sample code I'm using:
import numpy as np
from multiprocessing import shared_memory, get_context
import time
import torch
import copy
dim = 10000
batch_size = 10
sleep_time = 2
npe = 1 # number of parallel executions
# cuda
if torch.cuda.is_available():
dev = 'cuda:0'
else:
dev = "cpu"
device = torch.device(dev)
def step(i, shr_name):
existing_shm = shared_memory.SharedMemory(name=shr_name)
np_arr = np.ndarray((dim, dim), dtype=np.float32, buffer=existing_shm.buf)
b = np_arr[i * batch_size: (i + 1) * batch_size, :]
b = torch.Tensor(b)
# This is just to explicitly copy the tensor so that it has nothing to do
# with the shared memory block
c = copy.deepcopy(b)
# If tensor c is sent to the cuda device, then RAM scales linearly
# with the number of parallel executions.
# If c is not sent to cuda device, memory consumption is constant.
c = c.to(device)
time.sleep(sleep_time)
existing_shm.close()
def create_shared_block():
a = np.random.random((dim, dim)).astype(np.float32)
shm = shared_memory.SharedMemory(create=True, size=a.nbytes, name='sha')
np_arr = np.ndarray(a.shape, dtype=np.float32, buffer=shm.buf)
np_arr[:] = a[:]
return shm, np_arr
if __name__ == '__main__':
# create shared memory block
shm, np_arr = create_shared_block()
# create list of inputs to be executed in parallel
inp = [[x, 'sha'] for x in range(npe)]
print(inp)
# sleep added before and after launching multiprocessing to monitor the memory consumption
print('before pool') # to check memory with top or htop
time.sleep(sleep_time)
context = get_context('spawn')
with context.Pool(npe) as pool:
print('after pool') # to check memory with top or htop
time.sleep(sleep_time)
pool.starmap(step, inp)
time.sleep(sleep_time)
shm.close()
shm.unlink()

Is it possible to read pdf/audio/video files(unstructured data) using Apache Spark?

Is it possible to read pdf/audio/video files(unstructured data) using Apache Spark?
For example, I have thousands of pdf invoices and I want to read data from those and perform some analytics on that. What steps must I do to process unstructured data?
Yes, it is. Use sparkContext.binaryFiles to load files in binary format and then use map to map value to some other format - for example, parse binary with Apache Tika or Apache POI.
Pseudocode:
val rawFile = sparkContext.binaryFiles(...
val ready = rawFile.map ( here parsing with other framework
What is important, parsing must be done with other framework like mentioned previously in my answer. Map will get InputStream as an argument
We had a scenario where we needed to use a custom decryption algorithm on the input files. We didn't want to rewrite that code in Scala or Python. Python-Spark code follows:
from pyspark import SparkContext, SparkConf, HiveContext, AccumulatorParam
def decryptUncompressAndParseFile(filePathAndContents):
'''each line of the file becomes an RDD record'''
global acc_errCount, acc_errLog
proc = subprocess.Popen(['custom_decrypt_program','--decrypt'],
stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
(unzippedData, err) = proc.communicate(input=filePathAndContents[1])
if len(err) > 0: # problem reading the file
acc_errCount.add(1)
acc_errLog.add('Error: '+str(err)+' in file: '+filePathAndContents[0]+
', on host: '+ socket.gethostname()+' return code:'+str(returnCode))
return [] # this is okay with flatMap
records = list()
iterLines = iter(unzippedData.splitlines())
for line in iterLines:
#sys.stderr.write('Line: '+str(line)+'\n')
values = [x.strip() for x in line.split('|')]
...
records.append( (... extract data as appropriate from values into this tuple ...) )
return records
class StringAccumulator(AccumulatorParam):
''' custom accumulator to holds strings '''
def zero(self,initValue=""):
return initValue
def addInPlace(self,str1,str2):
return str1.strip()+'\n'+str2.strip()
def main():
...
global acc_errCount, acc_errLog
acc_errCount = sc.accumulator(0)
acc_errLog = sc.accumulator('',StringAccumulator())
binaryFileTup = sc.binaryFiles(args.inputDir)
# use flatMap instead of map, to handle corrupt files
linesRdd = binaryFileTup.flatMap(decryptUncompressAndParseFile, True)
df = sqlContext.createDataFrame(linesRdd, ourSchema())
df.registerTempTable("dataTable")
...
The custom string accumulator was very useful in identifying corrupt input files.

Optimal way of creating a cache in the PySpark environment

I am using Spark Streaming for creating a system to enrich incoming data from a cloudant database. Example -
Incoming Message: {"id" : 123}
Outgoing Message: {"id" : 123, "data": "xxxxxxxxxxxxxxxxxxx"}
My code for the driver class is as follows:
from Sample.Job import EnrichmentJob
from Sample.Job import FunctionJob
import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from kafka import KafkaConsumer, KafkaProducer
import json
class SampleFramework():
def __init__(self):
pass
#staticmethod
def messageHandler(m):
return json.loads(m.message)
#staticmethod
def processData(rdd):
if (rdd.isEmpty()):
print("RDD is Empty")
return
# Expand
expanded_rdd = rdd.mapPartitions(EnrichmentJob.enrich)
# Score
scored_rdd = expanded_rdd.map(FunctionJob.function)
# Publish RDD
def run(self, ssc):
self.ssc = ssc
directKafkaStream = KafkaUtils.createDirectStream(self.ssc, QUEUENAME, \
{"metadata.broker.list": META,
"bootstrap.servers": SERVER}, \
messageHandler= SampleFramework.messageHandler)
directKafkaStream.foreachRDD(SampleFramework.processData)
ssc.start()
ssc.awaitTermination()
Code for the the Enrichment Job is as follows:
class EnrichmentJob:
cache = {}
#staticmethod
def enrich(data):
# Assume that Cloudant Connector using the available config
cloudantConnector = CloudantConnector(config, config["cloudant"]["host"]["req_db_name"])
final_data = []
for row in data:
id = row["id"]
if(id not in EnrichmentJob.cache.keys()):
data = cloudantConnector.getOne({"id": id})
row["data"] = data
EnrichmentJob.cache[id]=data
else:
data = EnrichmentJob.cache[id]
row["data"] = data
final_data.append(row)
cloudantConnector.close()
return final_data
My question is - Is there someway to maintain [1]"a global cache on the main memory that is accessible to all workers" or [2]"local caches on each of the workers such that they remain persisted in the foreachRDD setting"?
I have already explored the following -
Broadcast Variables - Here we go the [1] way. As I understand, they are meant to be read-only and immutable. I have checked out this reference but it cites an example of unpersisting/persisting the broadcasted variable. Is this a good practice?
Static Variables - Here we go the [2] way. The class that is being referred to ("Enricher" in this case) maintains a cache in the form of a static variable dictionary. But it turns out that the ForEachRDD function spawns a completely new process for each incoming RDD and this removes the previously initiated static variable. This is the one coded above.
I have two possible solutions right now -
Maintain an offline cache on the file system.
Do the entire computation of this enrichment task on my driver node. This would cause the entire data to end up on driver and be maintained there. The cache object will be sent to the enrichment job as an argument to the mapping function.
Here obviously the first one looks better than the second, but I wish to conclude that these two are the only ways around, before committing to them. Any pointers would be appreciated!
Is there someway to maintain [1]"a global cache on the main memory that is accessible to all workers"
No. There is no "main memory" which can be accessed by all workers. Each worker runs in a separate process and communicates with external world with sockets. Not to mention separation between different physical nodes in non-local mode.
There are some techniques that can be applied to achieve worker scoped cache with memory mapped data (using SQLite being the simplest one) but it takes some additional effort to implement the right way (avoid conflicts and such).
or [2]"local caches on each of the workers such that they remain persisted in the foreachRDD setting"?
You can use standard caching techniques with scope limited to the individual worker processes. Depending on the configuration (static vs. dynamic resource allocation, spark.python.worker.reuse) it may or may not be preserved between multiple tasks and batches.
Consider following, simplified, example:
map_param.py:
from pyspark import AccumulatorParam
from collections import Counter
class CounterParam(AccumulatorParam):
def zero(self, v: Counter) -> Counter:
return Counter()
def addInPlace(self, acc1: Counter, acc2: Counter) -> Counter:
acc1.update(acc2)
return acc1
my_utils.py:
from pyspark import Accumulator
from typing import Hashable
from collections import Counter
# Dummy cache. In production I would use functools.lru_cache
# but it is a bit more painful to show with accumulator
cached = {}
def f_cached(x: Hashable, counter: Accumulator) -> Hashable:
if cached.get(x) is None:
cached[x] = True
counter.add(Counter([x]))
return x
def f_uncached(x: Hashable, counter: Accumulator) -> Hashable:
counter.add(Counter([x]))
return x
main.py:
from pyspark.streaming import StreamingContext
from pyspark import SparkContext
from counter_param import CounterParam
import my_utils
from collections import Counter
def main():
sc = SparkContext("local[1]")
ssc = StreamingContext(sc, 5)
cnt_cached = sc.accumulator(Counter(), CounterParam())
cnt_uncached = sc.accumulator(Counter(), CounterParam())
stream = ssc.queueStream([
# Use single partition to show cache in work
sc.parallelize(data, 1) for data in
[[1, 2, 3], [1, 2, 5], [1, 3, 5]]
])
stream.foreachRDD(lambda rdd: rdd.foreach(
lambda x: my_utils.f_cached(x, cnt_cached)))
stream.foreachRDD(lambda rdd: rdd.foreach(
lambda x: my_utils.f_uncached(x, cnt_uncached)))
ssc.start()
ssc.awaitTerminationOrTimeout(15)
ssc.stop(stopGraceFully=True)
print("Counter cached {0}".format(cnt_cached.value))
print("Counter uncached {0}".format(cnt_uncached.value))
if __name__ == "__main__":
main()
Example run:
bin/spark-submit main.py
Counter cached Counter({1: 1, 2: 1, 3: 1, 5: 1})
Counter uncached Counter({1: 3, 2: 2, 3: 2, 5: 2})
As you can see we get expected results:
For "cached" objects accumulator is updated only once per unique key per worker process (partition).
For not-cached objects accumulator is update each time key occurs.

TensorFlow: Reading images in queue without shuffling

I have a training set of 614 images which have already been shuffled. I want to read the images in order in batches of 5. Because my labels are arranged in the same order, any shuffling of the images when being read into the batch will result in incorrect labelling.
These are my functions to read and add the images to the batch:
# To add files from queue to a batch:
def add_to_batch(image):
print('Adding to batch')
image_batch = tf.train.batch([image],batch_size=5,num_threads=1,capacity=614)
# Add to summary
tf.image_summary('images',image_batch,max_images=30)
return image_batch
# To read files in queue and process:
def get_batch():
# Create filename queue of images to read
filenames = [('/media/jessica/Jessica/TensorFlow/StreetView/training/original/train_%d.png' % i) for i in range(1,614)]
filename_queue = tf.train.string_input_producer(filenames,shuffle=False,capacity=614)
reader = tf.WholeFileReader()
key, value = reader.read(filename_queue)
# Read and process image
# Image is 500 x 275:
my_image = tf.image.decode_png(value)
my_image_float = tf.cast(my_image,tf.float32)
my_image_float = tf.reshape(my_image_float,[275,500,4])
return add_to_batch(my_image_float)
This is my function to perform the prediction:
def inference(x):
< Perform convolution, pooling etc.>
return y_conv
This is my function to calculate loss and perform optimisation:
def train_step(y_label,y_conv):
""" Calculate loss """
# Cross-entropy
loss = -tf.reduce_sum(y_label*tf.log(y_conv + 1e-9))
# Add to summary
tf.scalar_summary('loss',loss)
""" Optimisation """
opt = tf.train.AdamOptimizer().minimize(loss)
return loss
This is my main function:
def main ():
# Training
images = get_batch()
y_conv = inference(images)
loss = train_step(y_label,y_conv)
# To write and merge summaries
writer = tf.train.SummaryWriter('/media/jessica/Jessica/TensorFlow/StreetView/SummaryLogs/log_5', graph_def=sess.graph_def)
merged = tf.merge_all_summaries()
""" Run session """
sess.run(tf.initialize_all_variables())
tf.train.start_queue_runners(sess=sess)
print "Running..."
for step in range(5):
# y_1 = <get the correct labels here>
# Train
loss_value = sess.run(train_step,feed_dict={y_label:y_1})
print "Step %d, Loss %g"%(step,loss_value)
# Save summary
summary_str = sess.run(merged,feed_dict={y_label:y_1})
writer.add_summary(summary_str,step)
print('Finished')
if __name__ == '__main__':
main()
When I check my image_summary the images do not seem to be in sequence. Or rather, what is happening is:
Images 1-5: discarded, Images 6-10: read, Images 11-15: discarded, Images 16-20: read etc.
So it looks like I am getting my batches twice, throwing away the first one and using the second one? I have tried a few remedies but nothing seems to work. I feel like I am understanding something fundamentally wrong about calling images = get_batch() and sess.run().
Your batch operation is a FIFOQueue, so every time you use it's output, it advances the state.
Your first session.run call uses the images 1-5 in the computation of train_step, your second session.run asks for the computation of image_summary which pulls images 5-6 and uses them in the visualization.
If you want to visualize things without affecting the state of input, it helps to cache queue values in variables and define your summaries with variables as inputs rather than depending on live queue.
(image_batch_live,) = tf.train.batch([image],batch_size=5,num_threads=1,capacity=614)
image_batch = tf.Variable(
tf.zeros((batch_size, image_size, image_size, color_channels)),
trainable=False,
name="input_values_cached")
advance_batch = tf.assign(image_batch, image_batch_live)
So now your image_batch is a static value which you can use both for computing loss and visualization. Between steps you would call sess.run(advance_batch) to advance the queue.
Minor wrinkle with this approach -- default saver will save your image_batch variable to checkpoint. If you ever change your batch-size, then your checkpoint restore will fail with dimension mismatch. To work-around you would need to specify the list of variables to restore manually, and run initializers for the rest.

Streaming to HBase with pyspark

There is a fair amount of info online about bulk loading to HBase with Spark streaming using Scala (these two were particularly useful) and some info for Java, but there seems to be a lack of info for doing it with PySpark. So my questions are:
How can data be bulk loaded into HBase using PySpark?
Most examples in any language only show a single column per row being upserted. How can I upsert multiple columns per row?
The code I currently have is as follows:
if __name__ == "__main__":
context = SparkContext(appName="PythonHBaseBulkLoader")
streamingContext = StreamingContext(context, 5)
stream = streamingContext.textFileStream("file:///test/input");
stream.foreachRDD(bulk_load)
streamingContext.start()
streamingContext.awaitTermination()
What I need help with is the bulk load function
def bulk_load(rdd):
#???
I've made some progress previously, with many and various errors (as documented here and here)
So after much trial and error, I present here the best I have come up with. It works well, and successfully bulk loads data (using Puts or HFiles) I am perfectly willing to believe that it is not the best method, so any comments/other answers are welcome. This assume you're using a CSV for your data.
Bulk loading with Puts
By far the easiest way to bulk load, this simply creates a Put request for each cell in the CSV and queues them up to HBase.
def bulk_load(rdd):
#Your configuration will likely be different. Insert your own quorum and parent node and table name
conf = {"hbase.zookeeper.qourum": "localhost:2181",\
"zookeeper.znode.parent": "/hbase-unsecure",\
"hbase.mapred.outputtable": "Test",\
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",\
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",\
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\#Split the input into individual lines
.flatMap(csv_to_key_value)#Convert the CSV line to key value pairs
load_rdd.saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
The function csv_to_key_value is where the magic happens:
def csv_to_key_value(row):
cols = row.split(",")#Split on commas.
#Each cell is a tuple of (key, [key, column-family, column-descriptor, value])
#Works well for n>=1 columns
result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
(cols[0], [cols[0], "f2", "c2", cols[2]]),
(cols[0], [cols[0], "f3", "c3", cols[3]]))
return result
The value converter we defined earlier will convert these tuples into HBase Puts
Bulk loading with HFiles
Bulk loading with HFiles is more efficient: rather than a Put request for each cell, an HFile is written directly and the RegionServer is simply told to point to the new HFile. This will use Py4J, so before the Python code we have to write a small Java program:
import py4j.GatewayServer;
import org.apache.hadoop.hbase.*;
public class GatewayApplication {
public static void main(String[] args)
{
GatewayApplication app = new GatewayApplication();
GatewayServer server = new GatewayServer(app);
server.start();
}
}
Compile this, and run it. Leave it running as long as your streaming is happening. Now update bulk_load as follows:
def bulk_load(rdd):
#The output class changes, everything else stays
conf = {"hbase.zookeeper.qourum": "localhost:2181",\
"zookeeper.znode.parent": "/hbase-unsecure",\
"hbase.mapred.outputtable": "Test",\
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",\
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",\
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}#"org.apache.hadoop.hbase.client.Put"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
.flatMap(csv_to_key_value)\
.sortByKey(True)
#Don't process empty RDDs
if not load_rdd.isEmpty():
#saveAsNewAPIHadoopDataset changes to saveAsNewAPIHadoopFile
load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
"org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
conf=conf,
keyConverter=keyConv,
valueConverter=valueConv)
#The file has now been written, but HBase doesn't know about it
#Get a link to Py4J
gateway = JavaGateway()
#Convert conf to a fully fledged Configuration type
config = dict_to_conf(conf)
#Set up our HTable
htable = gateway.jvm.org.apache.hadoop.hbase.client.HTable(config, "Test")
#Set up our path
path = gateway.jvm.org.apache.hadoop.fs.Path("/tmp/hfiles" + startTime)
#Get a bulk loader
loader = gateway.jvm.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(config)
#Load the HFile
loader.doBulkLoad(path, htable)
else:
print("Nothing to process")
Finally, the fairly straightforward dict_to_conf:
def dict_to_conf(conf):
gateway = JavaGateway()
config = gateway.jvm.org.apache.hadoop.conf.Configuration()
keys = conf.keys()
vals = conf.values()
for i in range(len(keys)):
config.set(keys[i], vals[i])
return config
As you can see, bulk loading with HFiles is more complex than using Puts, but depending on your data load it is probably worth it since once you get it working it's not that difficult.
One last note on something that caught me off guard: HFiles expect the data they receive to be written in lexical order. This is not always guaranteed to be true, especially since "10" < "9". If you have designed your key to be unique, then this can be fixed easily:
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
.flatMap(csv_to_key_value)\
.sortByKey(True)#Sort in ascending order

Resources