I work on a project where we run a number of DataStage sequences can be run in parallel, one in particular is poorly performing and takes a lot of resources, impacting the shared environment. Performance tuning initiative is in progress but will take time.
In the meantime I was hopeful that we could throttle DataStage to restrict the resources that could be used by this particular job/sequence - however I'm not personally experienced with DataStage specifically.
Can anyone comment if this facility exists in DataStage (v8.5 I believe), and point me in the direction of some further detail.
Secondly, I know that we can at the throttle based on the user (I think this ties into AIX 'ulimit', but not sure). Is it easy/possbile to run different jobs/sequences as different users?
In this type of situations resources for a particular job can be restricted by specifying number of nodes and resources in a config file. Possible in 8.5 and you may find something at www.datastagetips.com
Revolution_In_Progress is right.
Datastage PX has the notion of a configuration file. That file can be specified for all the jobs you run or it can be overridden on a job by job basis. The configuration file can be used to limit the physical resources that are associated with a job.
In this case, if you have a 4-node config file for most of your jobs, you may want to write a 2-node config file for the job with performance issue. That way, you'll get the minimum amount of parallelism (without going completely sequential) and use the minimum amount of resources.
http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r1/index.jsp?topic=/com.ibm.swg.im.iis.ds.parjob.tut.doc/module5/lesson5.1exploringtheconfigurationfile.html
Sequence is a collection of individual jobs.
In most cases, jobs in a sequence can be rearranged to run serially. Please check the organisation of the sequence and do a critical path analyis to remove the jobs that need not run in parallel to critical jobs.
Related
I'm currently working on performance & memory tuning for a spark process. As part of this I'm performing multiple runs of different versions of the code and trying to compare their results side by side.
I've got a few questions to ask, so I'll post each separately so they can be addressed separately.
Currently, it looks like getOrCreate() is re-using the Spark Context each run. This is causing me two problems:
Caching from one run may be affecting the results of future runs.
All of the tasks are bundled into a single 'job', and I have to guess at which tasks correspond to which test run.
I'd like to ensure that I'm properly resetting all caches in each run to ensure that my results are comparable. I'd also ideally like some way of having each run show up as a separate job in the local job history server so that it's easier for me to compare.
I'm currently relying on spark.catalog.clearCache() but not sure if this is covering all of what I need. I'd also like a way to ensure that the tasks for each job run are clearly grouped in some fashion for comparison so I can see where I'm losing time and hopefully see total memory used by each run as well (this is one thing I'm currently trying to improve on).
I am currently in the process of writing an ElasticSearch Nifi processor. Individual inserts / writes to ES are not optimal, instead batching documents is preferred. What would be considered the optimal approach within a Nifi processor to track (batch) documents (FlowFiles) and when at a certain amount batch them in? The part I am most concerned about is if ES is unavailable, down, network partition, etc. prevents the batch from being successful. The primary point of the question, is given that Nifi has content storage for queuing / back-pressure, etc. is there a preferred method for using that to ensure no FlowFiles get lost if a destination is down? Maybe there is another processor I should look at for an example?
I have looked at the Mongo processor, Merge, etc. to try and get an idea of the preferred approach for batching inside of a processor, but can't seem to find anything specific. Any suggestions would be appreciated.
Good chance I am overlooking some basic functionality baked into Nifi. I am still fairly new to the platform.
Thanks!
Great question and a pretty common pattern. This is why we have the concept of a ProcessSession. It allows you to send zero or more things to an external endpoint and only commit once you know it has been ack'd by the recipient. In this sense it offers at least-once semantics. If the protocol you're using supports two-phase commit style semantics you can get pretty close to the ever elusive exactly-once semantic. Much of the details of what you're asking about here will depend on the destination systems API and behavior.
There are some examples in the apache codebase which reveal ways to do this. One way is if you can produce a merged collection of events prior to pushing to the destination system. Depends on its API. I think PutMongo and PutSolr operate this way (though the experts on that would need to weigh in). An example that might be more like what you're looking for can be found in PutSQL which operates on batches of flowfiles to send in a single transaction (on the destination DB).
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/PutSQL.java
Will keep an eye here but can get the eye of a larger NiFi group at users#nifi.apache.org
Thanks
Joe
I am evaluating various system monitoring tools to use one to monitor my hadoop cluster.
One of the tools I am impressed by is collectl. I have been playing around with it since a couple of days.
I am struggling to find how can we aggregate the metrics captured by collectl when using colmux?
Say, I have 10 nodes in my hadoop cluster each running collectl as a service. Using colmux I can see the
performance metrics of each node in a single view (in single and multi-line formats). Great!
But what if I am considering aggregate of CPU, IO etc on all the nodes in the cluster. That is I want to find
how my cluster as a whole is performing by aggregating the performance metrics from each node into corresponding
numbers, thereby giving me cluster-level metrics instead of node-level.
Any help is greatly appreciated. Thanks!
I had already answered this on the mailing list but for the benefit of those not on it I'll repeat myself here..
That's a cool idea. So if I understand you correctly you might see some sort of total line at the bottom? I can always add to my wish list but no promises. But I think I may also have a solution if you don't mind doing a little extra work on your own ;) btw - can I assume you've installed readkey so you can change sort columns with the arrow keys?
If you run colmux with --noesc, it will take it out of full screen more and simply print everything as scrolling output. If you then also include "--lines 99999" (or some big number) it will print all the output from all the remote systems so you don't miss anything. Finally you can pipe the output through perl, python, bash, or whatever your favorite scripting tool might be and do the totals yourself. Then whenever you see a new header fly by, print the totals and reset the counters to 0. You could even add timestamps and maybe even ultimately make it your own opensource project. I bet others would find it useful too.
-mark
I have a script server that runs arbitrary java script code on our servers. At any given time multiple scripts can be running and I would like to prevent one misbehaving script from eating up all the ram on the machine. I could do this by having each script run in its own process and have an off the shelf monitoring tool monitor the ram usage of each process, killing and restarting the ones that get out of hand. I don't want to do this because I would like to avoid the cost of restart the binary every time one of these scripts goes crazy. Is there a way in v8 to set a per context/isolate memory limit that I can use to sandbox the running scripts?
It should be easy to do now
context.EstimatedSize() to get estimated size of the context
isolate.TerminateExecution() when context goes out of acceptable memory/cpu usage/whatever
in order to get access if there is an infinite loop(or something else blocking, like high cpu calculation) I think you could use isolate.RequestInterrupt()
A single process can run multiple isolates, if you have a 1 isolate to 1 context ratio you can easily
restrict memory usage per isolate
get heap stats
See some examples in this commit:
https://github.com/discourse/mini_racer/commit/f7ec907547e9a6ea888b2587e4edee3766752dd3
In particular you have:
v8::HeapStatistics stats;
isolate->GetHeapStatistics(&stats);
There are also fancy features like memory allocation callbacks you can use.
This is not reliably possible.
All JavaScript contexts by this process share the same object heap.
WebKit/Chromium tries some stuff to disable contexts after context OOMs.
http://code.google.com/searchframe#OAMlx_jo-ck/src/third_party/WebKit/Source/WebCore/bindings/v8/V8Proxy.cpp&exact_package=chromium&q=V8Proxy&type=cs&l=361
Sources:
http://code.google.com/p/v8/source/browse/trunk/src/heap.h?r=11125&spec=svn11125#280
http://code.google.com/p/chromium/issues/detail?id=40521
http://code.google.com/p/chromium/issues/detail?id=81227
I have a design question. I have a 3-4 GB data file, ordered by time stamp. I am trying to figure out what the best way is to deal with this file.
I was thinking of reading this whole file into memory, then transmitting this data to different machines and then running my analysis on those machines.
Would it be wise to upload this into a database before running my analysis?
I plan to run my analysis on different machines, so doing it through database would be easier but if I increase the number machines to run my analysis on the database might get too slow.
Any ideas?
#update :
I want to process the records one by one. Basically trying to run a model on a timestamp data but I have various models so want to distribute it so that this whole process run over night every day. I want to make sure that I can easily increase the number of models and not decrease the system performance. Which is why I am planning to distributing data to all the machines running the model ( each machine will run a single model).
You can even access the file in the hard disk itself and reading a small chunk at a time. Java has something called Random Access file for the same but the same concept is available in other languages also.
Whether you want to load into the the database and do analysis should be purely governed by the requirement. If you can read the file and keep processing it as you go no need to store in database. But for analysis if you require the data from all the different area of file than database would be a good idea.
You do not need the whole file into memory, just the data you need for analysis. You can read every line and store only the needed parts of the line and additionally the index where the line starts in file, so you can find it later if you need more data from this line.
Would it be wise to upload this into a database before running my analysis ?
yes
I plan to run my analysis on different machines, so doing it through database would be easier but if I increase the number machines to run my analysis on the database might get too slow.
don't worry about it, it will be fine. Just introduce a marker so the rows processed by each computer are identified.
I'm not sure I fully understand all of your requirements, but if you need to persist the data (refer to it more than once,) then a db is the way to go. If you just need to process portions of these output files and trust the results, you can do it on the fly without storing any contents.
Only store the data you need, not everything in the files.
Depending on the analysis needed, this sounds like a textbook case for using MapReduce with Hadoop. It will support your requirement of adding more machines in the future. Have a look at the Hadoop wiki: http://wiki.apache.org/hadoop/
Start with the overview, get the standalone setup working on a single machine, and try doing a simple analysis on your file (e.g. start with a "grep" or something). There is some assembly required but once you have things configured I think it could be the right path for you.
I had a similar problem recently, and just as #lalit mentioned, I used the RandomAccess file reader against my file located in the hard disk.
In my case I only needed read access to the file, so I launched a bunch of threads, each thread starting in a different point of the file, and that got me the job done and that really improved my throughput since each thread could spend a good amount of time blocked while doing some processing and meanwhile other threads could be reading the file.
A program like the one I mentioned should be very easy to write, just try it and see if the performance is what you need.
#update :
I want to process the records one by one. Basically trying to run a model on a timestamp data but I have various models so want to distribute it so that this whole process run over night every day. I want to make sure that I can easily increase the number of models and not decrease the system performance. Which is why I am planning to distributing data to all the machines running the model ( each machine will run a single model).