So I have a project idea that requires me to process incoming real-time data and continuously track some metrics about it. Every now and then I want to request the metrics I am calculating and do something with them.
Currently I have a simple Python script that uses the socket library to get the realtime data. It is basically just...
metric1 = 0
metric2 = ''
while True:
    response = sock.recv(512).decode('utf-8')
    if response.startswith('PING'):
        sock.send("PONG\n".encode('utf-8'))
    else:
        process(response)
In the above, process(response) updates metric1 and metric2 with data from each response. (For example, they might be the mean len(response) and the most common response, respectively.)
What I want to do is run the above script constantly after starting up the project and occasionally query for metric1 and metric2 in a script I have running locally. I am guessing that I will have to look into running code on a server which I have very little experience with.
What are the most accessible tools to do what I want? I am pretty comfortable with a variety of languages so if there is a library or tool in another language that is better suited for all of this, please tell me about it
Thanks!
I worked on a similar project, not sure if it specifically can be applied to your case, but maybe it can give you a starting point.
Although I am well aware that it is not best practice to use pandas DataFrames for real-time purposes, in my case it is just fast enough (I am open to suggestions on how to improve my workflow!). Here is my code:
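import pandas as pd
from io import StringIO

all_prices = pd.DataFrame()

def readprice():
    global all_prices
    msg = mysock.recv(16384)
    msg_stringa = str(msg, 'utf-8')
    # on_bad_lines='skip' replaces error_bad_lines=False in pandas >= 1.3
    new_price = pd.read_csv(StringIO(msg_stringa), sep=";", on_bad_lines='skip',
                            index_col=None, header=None, engine='c', names=range(33),
                            decimal='.')
    ...
    ...
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    all_prices = pd.concat([all_prices, new_price], ignore_index=True)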
So the 'all_prices' pandas DataFrame is global, and new prices get appended to it. Other functions can then use this global DataFrame to read its content, etc. Be very careful about sharing a variable between two or more threads: it can lead to errors.
More info here: http://www.laurentluce.com/posts/python-threads-synchronization-locks-rlocks-semaphores-conditions-events-and-queues/
In my case, I don't share the DataFrame with a parallel thread: other threads are launched after the append, not while it is in progress.
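To make the thread-sharing warning concrete for the original question, here is a minimal sketch, not code from either post. The names Metrics and serve_metrics are hypothetical, and the choice of metrics (mean response length, most common response) just follows the example in the question. One lock guards the shared state, so the socket loop can call update() while another thread reads snapshot(). As one possible way to query the values from a local script, the sketch also exposes them as JSON through the standard library's HTTP server.

```python
import json
import threading
from collections import Counter
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Metrics:
    """Running metrics guarded by a lock so several threads can share them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0
        self._total_len = 0
        self._seen = Counter()

    def update(self, response):
        # Called from the socket loop for every non-PING response.
        with self._lock:
            self._count += 1
            self._total_len += len(response)
            self._seen[response] += 1

    def snapshot(self):
        # Returns a consistent copy of both metrics.
        with self._lock:
            mean_len = self._total_len / self._count if self._count else 0.0
            most_common = self._seen.most_common(1)[0][0] if self._seen else ''
            return {'metric1': mean_len, 'metric2': most_common}


def serve_metrics(metrics, host='127.0.0.1', port=0):
    """Start a background HTTP server that answers every GET with the metrics as JSON.

    port=0 lets the OS pick a free port; read it from server.server_address.
    """

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps(metrics.snapshot()).encode('utf-8')
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    server = ThreadingHTTPServer((host, port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With this in place, process(response) would reduce to a call to metrics.update(response), and the local script could fetch the JSON with any HTTP client whenever it needs the current values.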
I have hundreds of thousands of small csv files in hdfs. Before merging them into a single dataframe, I need to add an id to each file individually (or else in the merge it won't be possible to distinguish between data from different files).
Currently I am relying on YARN to distribute the processes I create, which add the id to each file and convert it to Parquet format. I find that no matter how I tune the cluster (in size/executor/memory), the throughput is limited to 2000-3000 files/h.
from multiprocessing.pool import ThreadPool
from pyspark.sql import functions as F

def addIdCsv(x):
    # x is assumed to be a dict-like record holding the log id and the file path
    logId = x['logId']
    filePath = x['filePath']
    fLogRaw = spark.read.option("header", "true").option('inferSchema', 'true').csv(filePath)
    fLogRaw = fLogRaw.withColumn('id', F.lit(logId))
    fLogRaw.write.mode('overwrite').parquet(filePath + '_added')

for i in range(0, numBatches):
    fileSlice = fileList[i*batchSize:((i+1)*batchSize)]
    p = ThreadPool(numNodes)
    logger.info('\n\n\n --------------- \n\n\n')
    logger.info('Starting Batch : ' + str(i))
    logger.info('\n\n\n --------------- \n\n\n')
    p.map(addIdCsv, fileSlice)
You can see that my cluster is underperforming on CPU. But on the YARN manager it is given 100% access to resources.
What is the best way to solve this part of a data pipeline? What is the bottleneck?
Update 1
The jobs are evenly distributed as you can see in the event timeline visualization below.
As per @cricket_007's suggestion, NiFi provides a good, easy solution to this problem that is more scalable and integrates better with other frameworks than plain Python. The idea is to read the files into NiFi before writing to HDFS (in my case they are in S3). There is still an inherent bottleneck in reading from and writing to S3, but the throughput is around 45k files/h.
The flow looks like this.
Most of the work is done in the ReplaceText processor that finds the end of line character '|' and adds the uuid and a newline.
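For readers who don't use NiFi, the transformation that ReplaceText performs can be sketched in plain Python. This is only an illustration under an assumption: each line of the input ends with the '|' character, and the id is appended right after it, followed by a newline. The function name add_id_to_lines is made up for the example and is not part of NiFi.

```python
def add_id_to_lines(text, file_id):
    """Append file_id after the closing '|' of every line (assumed record format)."""
    out = []
    for line in text.splitlines():
        if line.endswith('|'):
            line = line + file_id
        out.append(line + '\n')
    return ''.join(out)
```

A fresh id per file could come from str(uuid.uuid4()) in the standard library, passed in as file_id.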
I am working on a data wrangling problem using Python,
which processes a dirty Excel file into a clean Excel file.
I would like to process multiple input files by introducing concurrency/parallelism.
I have the following options: 1) the multithreading module, 2) the multiprocessing module, 3) the ParallelPython module.
I have a basic idea of the three methods; I would like to know which one is best, and why.
In brief: processing a SINGLE dirty Excel file currently takes 3 minutes.
Objective: introduce parallelism/concurrency to process multiple files at once.
I am looking for the best method of parallelism to achieve this objective.
Since your process is mostly CPU bound, multi-threading won't be fast because of the GIL...
I would recommend multiprocessing or concurrent.futures, since they are a bit simpler than ParallelPython (only a bit :) )
example:
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    for file_path, clean_file in zip(files, executor.map(data_wrangler, files)):
        print('%s is now clean!' % (file_path))
        # do something with clean_file if you want
Only if you need to distribute the load between servers then I would recommend ParallelPython .
Please excuse the broadness of this question. Maybe once I know more perhaps I can ask more specifically.
I have a performance-sensitive piece of TensorFlow code. From the perspective of someone who knows little about GPU programming, I would like to know what guides or strategies would be a "good place to start" for optimizing my code (single GPU).
Perhaps even a readout of how long was spent on each tensorflow op would be nice...
I have a vague understanding that:
Some operations go faster when assigned to a CPU rather than a GPU, but it's not clear which.
There is a piece of Google software called "EEG" that I read about in a paper that may one day be open sourced.
There may also be other common factors at play that I am not aware of.
I wanted to give a more complete answer about how to use the Timeline object to get the execution time of each node in the graph:
you use a classic sess.run() but specify the options and run_metadata arguments
you then create a Timeline object with the run_metadata.step_stats data
Here is some example code:
import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.random_normal([1000, 1000])
res = tf.matmul(x, y)

# Run the graph with full trace option
with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(res, options=run_options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(ctf)
You can then open Google Chrome, go to the page chrome://tracing and load the timeline.json file.
You should see something like:
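If a plain-text readout per op is more convenient than the Chrome view, the same timeline.json can be summarised with a few lines of standard-library Python. This is a sketch under an assumption: the file follows the Chrome Trace Event format, where complete events are marked with "ph": "X" and carry their duration in a "dur" field (microseconds). The function name total_time_per_op is made up for the example.

```python
import json
from collections import defaultdict


def total_time_per_op(trace_path):
    """Sum the 'dur' of all complete ('X') events per event name, longest first."""
    with open(trace_path) as f:
        trace = json.load(f)
    totals = defaultdict(int)
    for event in trace.get('traceEvents', []):
        if event.get('ph') == 'X':
            totals[event.get('name', '?')] += event.get('dur', 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Calling total_time_per_op('timeline.json') returns a list of (name, total duration) pairs, so the first few entries point at the most expensive ops.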
(from a Hadoop newbie)
I want to avoid files where possible in a toy Hadoop proof-of-concept example. I was able to read data from non-file-based input (thanks to http://codedemigod.com/blog/?p=120) - which generates random numbers.
I want to store the result in memory so that I can do some further (non-MapReduce) business logic processing on it. Essentially:
conf.setOutputFormat(InMemoryOutputFormat)
JobClient.runJob(conf);
Map result = conf.getJob().getResult(); // ?
The closest thing that seems to do what I want is store the result in a binary file output format and read it back in with the equivalent input format. That seems like unnecessary code and unnecessary computation (am I misunderstanding the premises which Map Reduce depends on?).
The problem with this idea is that Hadoop has no notion of "distributed memory". If you want the result "in memory" the next question has to be "which machine's memory?" If you really want to access it like that, you're going to have to write your own custom output format, and then also either use some existing framework for sharing memory across machines, or again, write your own.
My suggestion would be to simply write to HDFS as normal, and then for the non-MapReduce business logic just start by reading the data from HDFS via the FileSystem API, i.e.:
FileSystem fs = new JobClient(conf).getFs();
Path outputPath = new Path("/foo/bar");
FSDataInputStream in = fs.open(outputPath);
// read data and store in memory
fs.delete(outputPath, true);
Sure, it does some unnecessary disk reads and writes, but if your data is small enough to fit in-memory, why are you worried about it anyway? I'd be surprised if that was a serious bottleneck.
There are two large text files (Millions of lines) that my program uses. These files are parsed and loaded into hashes so that the data can be accessed quickly. The problem I face is that, currently, the parsing and loading is the slowest part of the program. Below is the code where this is done.
database = extractDatabase(@type).chomp("fasta") + "yml"
revDatabase = extractDatabase(@type + "-r").chomp("fasta.reverse") + "yml"
@proteins = Hash.new
@decoyProteins = Hash.new
File.open(database, "r").each_line do |line|
  parts = line.split(": ")
  @proteins[parts[0]] = parts[1]
end
File.open(revDatabase, "r").each_line do |line|
  parts = line.split(": ")
  @decoyProteins[parts[0]] = parts[1]
end
And the files look like the example below. It started off as a YAML file, but the format was modified to increase parsing speed.
MTMDK: P31946 Q14624 Q14624-2 B5BU24 B7ZKJ8 B7Z545 Q4VY19 B2RMS9 B7Z544 Q4VY20
MTMDKSELVQK: P31946 B5BU24 Q4VY19 Q4VY20
....
I've messed around with different ways of setting up the file and parsing them, and so far this is the fastest way, but it's still awfully slow.
Is there a way to improve the speed of this, or is there a whole other approach I can take?
List of things that don't work:
YAML.
Standard Ruby threads.
Forking off processes and then retrieving the hash through a pipe.
In my usage, reading all or part of the file into memory before parsing usually goes faster. If the database sizes are small enough, this could be as simple as
buffer = File.readlines(database)
buffer.each do |line|
  ...
end
If they're too big to fit into memory, it gets more complicated: you have to set up block reads of the data followed by a parse, or use threads with separate read and parse threads.
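The block-read variant can be sketched as follows. It is written in Python rather than Ruby purely for illustration, and the function name load_in_blocks and the 1 MiB default block size are arbitrary choices, not something from the question. The one subtle point is that a block can end in the middle of a line, so the unfinished tail is carried over to the next block.

```python
def load_in_blocks(path, block_size=1 << 20):
    """Parse 'key: value' lines from path, reading block_size characters at a time."""
    table = {}
    leftover = ''
    with open(path, 'r') as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            lines = (leftover + block).split('\n')
            leftover = lines.pop()  # last piece may be an incomplete line
            for line in lines:
                key, sep, value = line.partition(': ')
                if sep:
                    table[key] = value
    # Handle a final line that has no trailing newline
    key, sep, value = leftover.partition(': ')
    if sep:
        table[key] = value
    return table
```

Whether this beats a simple line-by-line read depends on the file and the runtime, so it is worth timing both before committing to the more complicated version.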
Why not use the solution devised through decades of experience: a database, say SQLite3?
(To be different, although I'd first recommend looking at (Ruby) BDB and other "NoSQL" backend-engines, if they fit your need.)
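As a rough illustration of the database route, here is a sketch using Python's built-in sqlite3 module; the idea itself is not tied to Python. The table name peptides, its column names, and the helper names build_index and lookup are all invented for the example. The point is that the text file is parsed once into an indexed table, and later runs only pay for the lookups they actually perform.

```python
import sqlite3


def build_index(lines, db_path=':memory:'):
    """Load 'peptide: protein ids' lines into an SQLite table keyed by peptide."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS peptides '
        '(peptide TEXT PRIMARY KEY, proteins TEXT)'
    )
    rows = (line.rstrip('\n').split(': ', 1) for line in lines if ': ' in line)
    with conn:  # one transaction for the whole bulk insert
        conn.executemany('INSERT OR REPLACE INTO peptides VALUES (?, ?)', rows)
    return conn


def lookup(conn, peptide):
    """Return the list of protein ids for a peptide, or [] if it is unknown."""
    row = conn.execute(
        'SELECT proteins FROM peptides WHERE peptide = ?', (peptide,)
    ).fetchone()
    return row[0].split() if row else []
```

Passing a file path instead of ':memory:' as db_path keeps the index on disk, so the parsing cost is paid only when the source files change.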
If fixed-sized records with a deterministic index are used then you can perform a lazy-load of each item through a proxy object. This would be a suitable candidate for a mmap. However, this will not speed up the total access time, but will merely amortize the loading throughout the life-cycle of the program (at least until first use and if some data is never used then you get the benefit of never loading it). Without fixed-sized records or deterministic index values this problem is more complex and starts to look more like a traditional "index" store (eg. a B-tree in an SQL back-end or whatever BDB uses :-).
The general problems with threading here are:
The IO will likely be your bottleneck around Ruby "green" threads
You still need all the data before use
You may be interested in the Widefinder Project, just in general "trying to get faster IO processing".
I don't know too much about Ruby but I have had to deal with the problem before. I found the best way was to split the file up into chunks or separate files then spawn threads to read each chunk in at a single time. Once the partitioned files are in memory combining the results should be fast. Here is some information on Threads in Ruby:
http://rubylearning.com/satishtalim/ruby_threads.html
Hope that helps.