Spark Performance Tuning Question - Resetting all caches for performance testing

I'm currently working on performance and memory tuning for a Spark process. As part of this I'm performing multiple runs of different versions of the code and trying to compare their results side by side.
I've got a few questions to ask, so I'll post each separately so they can be addressed separately.
Currently, it looks like getOrCreate() is re-using the Spark Context each run. This is causing me two problems:
Caching from one run may be affecting the results of future runs.
All of the tasks are bundled into a single 'job', and I have to guess at which tasks correspond to which test run.
I'd like to ensure that I'm properly resetting all caches in each run to ensure that my results are comparable. I'd also ideally like some way of having each run show up as a separate job in the local job history server so that it's easier for me to compare.
I'm currently relying on spark.catalog.clearCache(), but I'm not sure whether this covers everything I need. I'd also like a way to ensure that the tasks for each run are clearly grouped for comparison, so I can see where I'm losing time and, hopefully, see the total memory used by each run as well (this is one thing I'm currently trying to improve).

Related

How do I set up esrally for use with Elassandra and my own tests?

I'm wondering whether others have attempted benchmarking Elassandra (more specifically, I'm using express-cassandra) using esrally. I'm hoping not to spend too much more time on esrally if it's not a good solution for testing Elassandra.
Reading the documentation, it looks like Rally is capable of starting from scratch: Download Elasticsearch, install the source, build it, run it, connect, create a full schema, then start testing with data filling the schema (possibly done with some random data), do queries, ...
I already have everything in place, and I only really want to see a few things, such as:
Which of 10 different memory setups is fastest.
Which types of searches work, and whether options 1, 2 and 3 from my existing software create drastic slowdowns or not.
Whether inserting while searching has an effect on the speed of my searches.
I'm not going to change many parameters other than the memory (-Xmx, -Xms, maybe some others... like cached row in a separate heap.) For sure, I want to run all the tests with the latest Elassandra and not consider rebuilding or anything of the sort.
From reading the documentation, there is no mention of Elassandra. I found a total of TWO PAGES in Google about testing Elassandra with esrally, and that did not boost my confidence that it's doable...
I would imagine that I have to use the benchmark-only pipeline. That at least removes all the gathering of the source, building, etc. I guess it also reduces the number of parameters I get in the resulting benchmark, but I don't need all the details...
Have you had any experience with such a setup? (Elassandra + esrally)
Yes, esrally works with Elassandra by using the --benchmark-only option.
To automate the creation of Elassandra clusters to benchmark, you could use either ecm or the k8s Helm chart.
For instance, using ecm:
ecm create bench_cluster -v 6.2.3.10 -n 3 -s -e
esrally --pipeline=benchmark-only --target-hosts=127.0.0.1:9200,127.0.0.2:9200,127.0.0.3:9200
ecm remove bench_cluster
For testing specific scenarios, you can write custom tracks.

2 mappings with exactly the same session attributes behave completely differently after a slight change in logic

I am an Informatica developer.
I have a mapping in Informatica, shown below.
Original mapping:
AS400(DB2SQ)->EXP->RTR->AGG1->MPLT->TGT1(SQL Server) Pipeline 1.
| |->AGG2->TGT2(SQL Server)
| |
| |->TGT3(SQL Server)
->AGG3->EXP->TGT4(FlatFile) Pipeline 2.
The majority of records pass through pipeline 1, and I was asked to optimize the flow. My suggestions were as follows.
In pipeline 1, remove AGG1 and AGG2 and push the aggregation logic to the database. Since the flow is incremental and the incremental records are loaded into a temporary table, I expected performance to be better.
Remove the target TGT3, as it was not required.
This is how my optimized mapping looks now:
Optimized mapping (what I thought):
AS400(DB2SQ)->EXP->RTR->MPLT->TGT1(SQL Server) Pipeline 1.
| |->TGT2(SQL Server)
|
->AGG3->EXP->TGT4(FlatFile) Pipeline 2.
Just to investigate source performance optimizations, I changed the session properties of all targets to write to a file instead. I wanted to check whether I could optimize my source in any way.
But to my surprise, when I executed both sessions (in separate workflows, one after the other), I saw that the SQ throughput of the optimized session is much lower than that of the original session.
Everything else in the optimized solution is exactly the same, as I made a copy of the original mapping/session before removing two of the Aggregators and one of the targets.
Please note: the environment where I am developing has version control enabled. Does that have anything to do with it?
I tried to cross-check this multiple times and was unable to find an answer.
You can identify it better if you go through the session log in detail. You can also run the query in the source DB and check how long it takes. You can also tune the source side by using pushdown optimization (i.e. source-side pushdown optimization). But before that, check that everything is fine with your source DB and that it is not taking much time. You can also optimize the query and check the performance.
If that still does not work, then you can try session partitions on the SQ and check the performance.

How to deactivate safe mode in the mongo shell?

The short question is in the title: I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
Long question for those who want the context:
I am working on a huge set of data like
{
_id:ObjectId("azertyuiopqsdfghjkl"),
stringdate:"2008-03-08 06:36:00"
}
and some other fields, and there are about 250M documents like that (the whole database with its indexes weighs 36 GB). I want to convert the date into a real ISODate field. I searched for a way to write an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but did not find how to make this work, so I resigned myself to writing a script that takes the documents one after the other and makes an update to set a new field which takes new Date(stringdate) as its value. The query uses the _id, so the default index is used.
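The update query above cannot work as written: the shell evaluates new Date("$stringdate") on the client before sending the query, so the literal string "$stringdate" is parsed instead of each document's field. The per-document conversion that the script has to perform can be sketched as follows. This is only an illustration in Python, not the asker's script; the helper names are hypothetical, and it assumes the stored strings are in UTC, which the question does not state.

```python
from datetime import datetime, timezone

# Format of the `stringdate` field shown in the question,
# e.g. "2008-03-08 06:36:00".
STRINGDATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_stringdate(value):
    """Convert a stringdate value into a timezone-aware datetime.

    Assumption: the stored strings are in UTC (not stated in the question).
    """
    parsed = datetime.strptime(value, STRINGDATE_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def build_set_document(doc):
    """Return the {"$set": ...} body for one document.

    Returns None when the field is missing or malformed, so those
    documents can be collected in a later clean-up pass.
    """
    try:
        return {"$set": {"date": parse_stringdate(doc["stringdate"])}}
    except (KeyError, TypeError, ValueError):
        return None


if __name__ == "__main__":
    sample = {"_id": "azertyuiopqsdfghjkl", "stringdate": "2008-03-08 06:36:00"}
    print(build_set_document(sample))
```

Each returned body would then be sent as one update keyed on the document's _id, which matches the one-document-at-a-time approach described above.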
The problem is that it takes a very long time. I have already figured out that if I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added. I also set an index on a relevant field to process the database chunk by chunk. Finally, I ran several concurrent mongo clients on both the server and my workstation to ensure that the limiting factor is database lock availability and not something else like CPU or network costs.
I monitored the whole thing with mongotop, mongostat and the web monitoring interface, which confirmed that the write lock is taken 70% of the time. I am a bit disappointed MongoDB does not have a more precise granularity on its write lock: why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have sharded the collection across a dozen shards, even while staying on the same server, because there would have been an individual lock on each shard.
But since I can't change the current database structure right now, I looked for ways to improve performance so that I spend at least 90% of my time writing to mongo (up from 70% currently). I figured out that since I run my script in the default mongo shell, every update is followed by a getLastError() call. I don't want it, because there is a 99.99% chance of success, and even in case of failure I can still run an aggregation query after the big process ends to retrieve the individual exceptions.
I don't think I would gain much performance by deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behavior, but not the procedure for changing it. Any suggestion?
I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) ( http://docs.mongodb.org/manual/reference/method/db.getLastError/ ) to do what you want but it won't help.
This is because for one:
writing a script that takes the documents one after the other and makes an update to set a new field which takes new Date(stringdate) as its value.
When using the shell in non-interactive mode, such as within a loop, it doesn't actually call getLastError(). As such, lowering your write concern to 0 will do nothing.
I have already figured out that if I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added.
I did tell people, when they asked about this stuff, to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug but I do. That's an unfortunate side effect of being right when you were told you were wrong.
mongostat and the web monitoring interface, which confirmed that the write lock is taken 70% of the time
That's because of all the movement in your documents, kinda hard to fix that.
I am a bit disappointed mongodb does not have a more precise granularity on its write lock
The write lock doesn't actually denote the concurrency of MongoDB; this is another common misconception that stems from transactional SQL technologies.
Write locks in MongoDB are mutexes, for one.
Not only that, but there are numerous rules which dictate that operations will yield to queued operations under certain circumstances, one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have got yourself stuck between a rock and a hard place, and there is no easy way out. This does happen.

How to Throttle DataStage

I work on a project where a number of DataStage sequences can be run in parallel. One in particular performs poorly and takes a lot of resources, impacting the shared environment. A performance tuning initiative is in progress but will take time.
In the meantime, I was hoping we could throttle DataStage to restrict the resources that this particular job/sequence can use; however, I'm not personally experienced with DataStage.
Can anyone comment on whether this facility exists in DataStage (v8.5, I believe) and point me in the direction of some further detail?
Secondly, I know that we can throttle based on the user (I think this ties into AIX 'ulimit', but I'm not sure). Is it easy/possible to run different jobs/sequences as different users?
In this type of situation, the resources for a particular job can be restricted by specifying the number of nodes and resources in a config file. This is possible in 8.5, and you may find something at www.datastagetips.com
Revolution_In_Progress is right.
DataStage PX has the notion of a configuration file. That file can be specified for all the jobs you run, or it can be overridden on a job-by-job basis. The configuration file can be used to limit the physical resources that are associated with a job.
In this case, if you have a 4-node config file for most of your jobs, you may want to write a 2-node config file for the job with the performance issue. That way, you'll get the minimum amount of parallelism (without going completely sequential) and use the minimum amount of resources.
http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r1/index.jsp?topic=/com.ibm.swg.im.iis.ds.parjob.tut.doc/module5/lesson5.1exploringtheconfigurationfile.html
A sequence is a collection of individual jobs.
In most cases, jobs in a sequence can be rearranged to run serially. Please check the organisation of the sequence and do a critical path analysis to remove the jobs that need not run in parallel with critical jobs.

5GB file to read

I have a design question. I have a 3-4 GB data file, ordered by time stamp. I am trying to figure out what the best way is to deal with this file.
I was thinking of reading this whole file into memory, then transmitting this data to different machines and then running my analysis on those machines.
Would it be wise to upload this into a database before running my analysis?
I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running my analysis, the database might get too slow.
Any ideas?
Update:
I want to process the records one by one. Basically, I am trying to run a model on timestamped data, but I have various models, so I want to distribute the work so that the whole process runs overnight every day. I want to make sure that I can easily increase the number of models without decreasing system performance, which is why I am planning to distribute the data to all the machines running the models (each machine will run a single model).
You can even access the file on the hard disk directly and read a small chunk at a time. Java has a class called RandomAccessFile for this, and the same concept is available in other languages.
Whether you load the data into a database for analysis should be governed purely by the requirements. If you can read the file and process it as you go, there is no need to store it in a database. But if the analysis requires data from many different areas of the file, then a database would be a good idea.
You do not need the whole file in memory, just the data you need for the analysis. You can read every line and store only the needed parts of the line, plus the offset where the line starts in the file, so you can find it later if you need more data from that line.
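A minimal sketch of that idea in Python, assuming newline-delimited records. The function names and the extract callback are hypothetical, and the comma-separated layout in the demo is only an assumption, since the question does not describe the record format.

```python
import os
import tempfile


def build_line_index(path, extract):
    """Scan the file once, keeping (byte offset, extracted value) per line.

    Only the small extracted value is held in memory, not the whole line.
    """
    index = []
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            index.append((offset, extract(line)))
    return index


def read_line_at(path, offset):
    """Jump straight to a stored offset and return the full line."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.readline()


if __name__ == "__main__":
    # Tiny stand-in for the multi-gigabyte file (assumed layout: timestamp,payload).
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"2008-03-08 06:36:00,a\n2008-03-08 06:37:00,bb\n")
    index = build_line_index(tmp.name, lambda line: line.split(b",")[0])
    last_offset, last_timestamp = index[-1]
    print(last_timestamp, read_line_at(tmp.name, last_offset))
    os.remove(tmp.name)
```

The file is opened in binary mode so that tell() and seek() work with plain byte offsets.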
Would it be wise to upload this into a database before running my analysis ?
Yes.
I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running my analysis, the database might get too slow.
Don't worry about it, it will be fine. Just introduce a marker so the rows processed by each computer are identified.
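One way such a marker could look, sketched in Python with sqlite3 standing in for whatever database server is actually used. The table name records, the column processed_by, and the helper claim_batch are all hypothetical. With several machines hitting a real server at once, the claim step would additionally need that database's row locking, which this single-connection sketch does not show.

```python
import sqlite3


def create_schema(conn):
    """Create the illustrative table; processed_by is the marker column."""
    conn.execute(
        "CREATE TABLE records ("
        " id INTEGER PRIMARY KEY,"
        " ts TEXT NOT NULL,"
        " processed_by TEXT)"
    )


def claim_batch(conn, worker, batch_size):
    """Mark up to batch_size unclaimed rows with the worker name and return them."""
    with conn:  # commit the select-and-mark step as one transaction
        rows = conn.execute(
            "SELECT id, ts FROM records WHERE processed_by IS NULL"
            " ORDER BY ts LIMIT ?",
            (batch_size,),
        ).fetchall()
        conn.executemany(
            "UPDATE records SET processed_by = ? WHERE id = ?",
            [(worker, row[0]) for row in rows],
        )
    return rows


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.executemany(
        "INSERT INTO records (ts) VALUES (?)",
        [("2008-03-08 06:%02d:00" % minute,) for minute in range(5)],
    )
    print(claim_batch(conn, "machine-a", 2))
    print(claim_batch(conn, "machine-b", 2))
```

Rows that have already been claimed are skipped by later calls, so two machines never receive the same row.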
I'm not sure I fully understand all of your requirements, but if you need to persist the data (refer to it more than once), then a DB is the way to go. If you just need to process portions of these output files and trust the results, you can do it on the fly without storing any contents.
Only store the data you need, not everything in the files.
Depending on the analysis needed, this sounds like a textbook case for using MapReduce with Hadoop. It will support your requirement of adding more machines in the future. Have a look at the Hadoop wiki: http://wiki.apache.org/hadoop/
Start with the overview, get the standalone setup working on a single machine, and try doing a simple analysis on your file (e.g. start with a "grep" or something). There is some assembly required but once you have things configured I think it could be the right path for you.
I had a similar problem recently and, just as @lalit mentioned, I used a RandomAccessFile reader against my file on the hard disk.
In my case I only needed read access to the file, so I launched a bunch of threads, each starting at a different point in the file. That got the job done and really improved my throughput, since each thread could spend a good amount of time blocked on processing while other threads were reading the file.
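The approach described here, several readers each starting at a different point in the file, could be sketched in Python roughly as below. This is not the answerer's Java program. The function names are hypothetical, records are assumed to be newline-delimited, and the byte ranges are aligned to line boundaries so that no record is split between two threads.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(path, parts):
    """Split the file into up to `parts` byte ranges that end on line boundaries."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as handle:
        for i in range(1, parts):
            handle.seek(size * i // parts)
            handle.readline()  # move forward to the start of the next line
            position = handle.tell()
            if position > bounds[-1]:
                bounds.append(position)
    if bounds[-1] < size:
        bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def process_range(path, start, end, handle_line):
    """Feed every line in [start, end) to handle_line; return the line count."""
    count = 0
    with open(path, "rb") as handle:
        handle.seek(start)
        while handle.tell() < end:
            handle_line(handle.readline())
            count += 1
    return count


def process_in_threads(path, parts, handle_line):
    """Run one thread per byte range and return the total number of lines."""
    ranges = chunk_ranges(path, parts)
    if not ranges:
        return 0
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        counts = pool.map(
            lambda r: process_range(path, r[0], r[1], handle_line), ranges
        )
        return sum(counts)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.writelines(b"%d,value\n" % i for i in range(1000))
    seen = []
    total = process_in_threads(tmp.name, 4, seen.append)
    print(total, len(seen))
    os.remove(tmp.name)
```

In CPython, threads mainly help while they are blocked on I/O; for CPU-heavy per-record work, separate processes or machines may be the better fit.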
A program like the one I mentioned should be very easy to write, just try it and see if the performance is what you need.
