Disable Ray Tune parallel hyperparameter tuning - huggingface-transformers

I have an issue running hyperparameter optimization on my language model because my setup needs about 20GB of GPU memory to train. Since I am not training in a distributed fashion, every Ray Tune trial worker other than the first hits an OutOfMemoryError.
I figure this is because Population Based Training runs trials in parallel, and those concurrent trials are what triggers the out-of-memory errors.
As such, I figured I would tell Ray to run serially, and I believe I do this by setting tune.TuneConfig(max_concurrent_trials=1).
Does anyone know how to set this parameter in the HuggingFace transformers flow? At a high level, I run hyperparameter optimization through trainer.hyperparameter_search(), but I do not see where I can tell it not to run trials concurrently.
Thanks!

You should be able to pass it as a kwarg, e.g. trainer.hyperparameter_search(max_concurrent_trials=1).
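For example, something along these lines should work, assuming the Ray backend forwards extra keyword arguments of hyperparameter_search() on to ray.tune.run (model_init, training_args, and the dataset names below are placeholders for your own setup):

    # Sketch only: run PBT trials one at a time so a single ~20GB GPU is never shared.
    from ray import tune
    from ray.tune.schedulers import PopulationBasedTraining
    from transformers import Trainer

    trainer = Trainer(
        model_init=model_init,        # placeholder: factory returning a fresh model per trial
        args=training_args,           # placeholder: your TrainingArguments
        train_dataset=train_dataset,  # placeholder datasets
        eval_dataset=eval_dataset,
    )

    best_run = trainer.hyperparameter_search(
        backend="ray",
        n_trials=8,
        scheduler=PopulationBasedTraining(
            metric="eval_loss",
            mode="min",
            hyperparam_mutations={"learning_rate": tune.loguniform(1e-5, 1e-3)},
        ),
        # Forwarded to ray.tune.run: one trial at a time, each getting the whole GPU.
        max_concurrent_trials=1,
        resources_per_trial={"cpu": 4, "gpu": 1},
    )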

Related

Make Spark run faster when run locally, for training purposes

My team builds and maintains an API built on top of PySpark SQL. It is meant for production use cases, and does a good job at scaling to large data on clusters. We can also run it locally, which is useful for development, testing, and training people via interactive exercise sessions using Jupyter notebooks.
However, running fairly simple computations on Spark takes a little while, frequently a few dozen seconds, even on a dataframe of about 50k rows. Our library is for differential privacy, which involves some randomization, so training use cases involve running the same analysis multiple times to get average utility metrics. This means that runtimes quickly reach a few minutes, which is annoyingly long when you're trying to run a 1-2h exercise session.
My question is: are there Spark configuration options I could tweak to lower this runtime for small-data, single-machine use cases, and make teaching a little smoother?
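Not an exhaustive answer, but here is a sketch of settings commonly tried for this kind of local, small-data setup (the keys are standard Spark configuration properties; the values are only illustrative):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")                               # all local cores, no cluster
        .config("spark.sql.shuffle.partitions", "8")      # default is 200, far too many for ~50k rows
        .config("spark.default.parallelism", "8")         # same idea for RDD operations
        .config("spark.ui.showConsoleProgress", "false")  # less console noise in notebooks
        .getOrCreate()
    )

Lowering the shuffle partition count is usually the single biggest win for tiny dataframes, since each of the default 200 partitions carries fixed task-scheduling overhead; caching the input dataframe before running the same analysis repeatedly can also help.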

Improving Spark performance: is it enough to use a cluster?

I have developed a simple Spark application that analyzes a dataset. The data comes from a CSV of 2 million records with 25 attributes. The analysis consists of simple RDD transformations/actions, and I also used algorithms from the MLlib library.
Since this is my first experience with Spark, I've taken many pieces of code from the documentation or from examples found online. However, a complete run of even a simple algorithm, for example ALS for user recommendation, takes several minutes.
I run the application on a laptop (i7 2GHz, 12GB RAM).
I would like to know whether I simply need to run this application on a cluster of computers to increase performance (in terms of speed), and if so, whether it is normal that running a recommendation engine model locally takes this long.
If so, could I get results in real time with a good cluster of computers?
Thanks in advance!

Is my application running efficiently?

The question is generic and can be extended to other frameworks or contexts beyond Spark & Machine Learning algorithms.
Regardless of the details, from a high-level point of view, the code is applied to a large dataset of labeled text documents. It goes through 9 iterations of cross-validation to tune some parameters of a multi-class Logistic Regression classifier.
It is expected that this kind of Machine Learning processing will be expensive in terms of time and resources.
I am running the code now and everything seems to be OK, except that I have no idea whether my application is running efficiently or not.
I couldn't find guidelines saying that, for a certain type and amount of data and for a certain type of processing and computing resources, the processing time should be in the approximate order of...
Is there any method that helps in judging whether my application is running slow or fast, or is it purely a matter of experience?
I had the same question, and I didn't find a real answer/tool/way to test how good my performance was by looking "only inside" my application.
I mean, as far as I know, there's no tool like a speed test for an internet connection :-)
The only way I found was to rewrite my app (if possible) with another stack, in order to see if the difference (in terms of time) is THAT big.
Otherwise, I found two main resources very interesting, even if they are quite old:
1) A sort of 4-point guide to keep in mind when coding:
Understanding the Performance of Spark Applications, Spark Summit 2013
2) A two-episode article from the Cloudera blog on tuning your jobs:
episode1
episode2
Hoping it could help
FF
Your question is pretty generic, so I would also highlight a few generic areas where you can look for performance optimizations:
Scheduling delays - Are there significant delays in scheduling the tasks? If yes, analyze the reasons (maybe your cluster needs more resources, etc.).
Cluster utilization - Are your jobs utilizing the available cluster resources (CPU, memory)? If not, again look for the reasons; maybe creating more partitions helps execution finish faster, or significant time is spent in serialization, in which case you could switch to Kryo serialization.
JVM tuning - Consider analyzing the GC logs and tune if you find anomalies.
Executor configuration - Analyze the memory/cores given to your executors; they should be sufficient to hold the data processed by each task/job.
Driver configuration - Same as the executors: the driver should also have enough memory to hold the results of functions like collect().
Shuffling - See how much time is spent in shuffling and what kind of data locality your tasks achieve.
All of the above is needed for the preliminary investigation, and in some cases it can also improve the performance of your jobs to an extent, but there can be complex issues whose solution depends on the specific case. A rough configuration sketch for some of these settings follows below.
Please also see Spark Tuning Guide
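As a rough sketch of the kinds of settings mentioned above (the values are illustrative only and must be matched to your cluster and data; the keys are standard Spark configuration properties):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("tuning-sketch")
        # Serialization: Kryo is usually faster and more compact than Java serialization
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Executor sizing: enough memory/cores to hold the data each task processes
        .config("spark.executor.memory", "4g")
        .config("spark.executor.cores", "4")
        # Driver sizing: must be able to hold the results of actions like collect()
        .config("spark.driver.memory", "2g")
        # Parallelism: more partitions can improve utilization of otherwise idle cores
        .config("spark.sql.shuffle.partitions", "200")
        .getOrCreate()
    )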

Best method of having a single process distributed across a cluster

I'm very new to cluster computing, and I wanted to know more about the various software used for it and which tools are best suited for particular tasks. In particular, the problem I am trying to solve involves a Manager/Workers scenario, where a single Manager is responsible for creating hundreds to thousands of jobs. Each job, while relatively large overall, must execute on a small frame-by-frame basis, i.e. the Manager will tell each job, "advance one frame and report back to me". The execution of a single frame is very small, so the latency between the Manager and the worker machines must be very low, on the order of microseconds.
Thank you! Any information would be appreciated, even stuff that doesn't perfectly fit the scenario I described, just to give me a starting point. Some that I have researched so far are Hadoop, HTCondor, and Akka.
Since communication latency is important to you, you should probably consider using MPI. It's not too difficult to write simple Master/Worker programs using MPI, and it will probably give you the best performance, especially if your cluster has high-performance networking such as InfiniBand.
If, as it seems, you're using Java, you will have to do some research to determine a good Java/MPI package. You'll find some suggestions here: Java openmpi.
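Just to show the shape of the Manager/Worker pattern, here is a minimal sketch using mpi4py (Python rather than the Java bindings discussed above; the MPI calls map one-to-one onto any binding), run with something like mpiexec -n 4 python frames.py:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    NUM_FRAMES = 10  # illustrative

    if rank == 0:
        # Manager: tell every worker to advance one frame, then collect their reports.
        for frame in range(NUM_FRAMES):
            for worker in range(1, size):
                comm.send(("advance", frame), dest=worker, tag=1)
            reports = [comm.recv(source=MPI.ANY_SOURCE, tag=2) for _ in range(1, size)]
            # ... aggregate per-frame reports here ...
        for worker in range(1, size):
            comm.send(("stop", None), dest=worker, tag=1)
    else:
        # Worker: advance one frame per request and report back to the Manager.
        while True:
            cmd, frame = comm.recv(source=0, tag=1)
            if cmd == "stop":
                break
            comm.send((rank, frame, "done"), dest=0, tag=2)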

Hive find expected run time of query

I want to find the expected run time of a query in Hive. Using EXPLAIN gives the execution plan. Is there a way to find the expected time?
I need the Hive equivalent of the SQL EXPLAIN COSTS query.
There is no out-of-the-box feature at the moment that facilitates this. One way to get there would be to learn from history: gather patterns based on similar data and queries you have run previously and try to deduce some insights from them. You might find tools like Starfish helpful in the process.
I would not recommend deciding anything based on a subset of your data, as running queries on a small dataset and on the actual dataset are very different things. This is good for testing functionality, but not for any kind of cost approximation. The reason is that many factors are involved in the process, like system resources (disk, CPU slots, network, etc.), system configuration, other running jobs, and so on. You might see smooth operation on a small dataset, but as the data size increases all these factors start to play a much more important role, and even a small configuration parameter can matter. (You might have noticed that a Hive query sometimes runs fast initially but gradually gets slower.) Also, the execution of a Hive query is much more involved than a simple MR job.
See this JIRA to get some idea; it discusses developing cost-based query optimization for joins in Hive. You might also find this helpful.
I think that is not possible, because internally a MapReduce job gets executed for any particular Hive query, and a MapReduce job's execution time depends on the cluster load and its configuration, so it is tough to predict. One thing you can do is start a timer before running the query and, once it finishes, calculate the exact execution time it needed.
Maybe you could sample a small percentage of records from your table using partitions, bucketing, etc., then run the query against that small dataset. Note the execution time and multiply it by the factor (total_size/sample_size).
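As a rough illustration of the timing-plus-extrapolation idea from the two answers above (the table name, sample bucket, and sizes are hypothetical, and the linear scaling is only a crude first approximation):

    import subprocess
    import time

    # Hypothetical sampled query: roughly 1/100th of my_table via bucket sampling.
    SAMPLE_QUERY = "SELECT count(*) FROM my_table TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) t"
    TOTAL_SIZE_GB = 500.0   # size of the full table (hypothetical)
    SAMPLE_SIZE_GB = 5.0    # size of the sampled slice (hypothetical)

    start = time.time()
    subprocess.run(["hive", "-e", SAMPLE_QUERY], check=True)  # run via the Hive CLI
    elapsed = time.time() - start

    estimate = elapsed * (TOTAL_SIZE_GB / SAMPLE_SIZE_GB)
    print(f"sample ran in {elapsed:.1f}s; naive full-run estimate: {estimate:.1f}s")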
