Running Keras Transformer in Colab takes too much time - transformer-model

I am trying this tutorial (https://colab.research.google.com/drive/1f8bXV1t20jQIy2iEiZ40UTHJ2JNYnmIU#scrollTo=4Pnz5DC92eJh) with another dataset, but the execution time is very long in my Colab notebook, and I don't know why.
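One common first check is whether the Colab runtime is actually using a GPU, since a Transformer model trains very slowly on CPU. A minimal sketch of that check, assuming the tutorial uses TensorFlow/Keras 2.x:

    import tensorflow as tf

    # List the accelerators visible to TensorFlow; an empty list means the notebook
    # is running on CPU only (Runtime -> Change runtime type in Colab).
    gpus = tf.config.list_physical_devices('GPU')
    print("GPUs visible to TensorFlow:", gpus)

    # Optionally confirm which device a simple op is placed on.
    tf.debugging.set_log_device_placement(True)
    _ = tf.matmul(tf.random.normal((64, 64)), tf.random.normal((64, 64)))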

Related

SSAS Tabular Model occasionally takes hours to process

I have a fairly large tabular model, and most days it only takes 8-10 minutes to process, but every few days it will take 4-5 hours. Where would I even begin to troubleshoot this? I looked at the ETL logs, but all they show me is the step that requests the model to be processed.
Someone mentioned checking whether it is processing in parallel or sequential mode, but I can't seem to find any setting for that in VS 2017, which is what I'm using.
I should also mention that when I process it manually, it takes the normal amount of time (8-10 minutes). It's only when the ETL job that processes it executes that I sometimes see these long processing times.

Azure Data Factory: difference between duration times

I'm brand new to Azure Data Factory; previously I've been working with SSIS and Pentaho. I have recently started using this tool to create some ETL pipelines, and I've noticed some differences between the time values reported at the end of the process. I wonder what they mean (Duration, Processing Time, Time), and especially why there is such a big difference between Duration and Processing Time. Is this difference a standard preparation time for the tool, or something like that?
Regards.
The "Duration" time at the top of your screenshot is the end-to-end time for the pipeline activity. It takes into account all factors like marshalling your data flow script from ADF to the Spark cluster, cluster acquisition time, job execution, and I/O write time.
The bottom section of your screenshot shows the amount of time Spark spent in that stage of your transformation logic, which operates entirely on in-memory data frames.
The write time is shown in the data flow execution plan in the Sink transformation, and the cluster acquisition time is shown at the top.
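As a rough illustration of why Duration is larger than Processing Time, the end-to-end figure can be thought of as the sum of several components. The numbers below are purely illustrative, not taken from the screenshot:

    # Hypothetical breakdown of an ADF data flow activity's end-to-end Duration.
    cluster_acquisition_s = 240   # acquiring / spinning up the Spark cluster
    job_setup_s = 15              # marshalling the data flow script to Spark
    spark_processing_s = 95       # the per-stage "Processing Time" figure
    sink_write_s = 40             # I/O write time shown in the Sink transformation

    duration_s = cluster_acquisition_s + job_setup_s + spark_processing_s + sink_write_s
    print(f"Duration ~ {duration_s}s vs. Processing Time ~ {spark_processing_s}s")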

Improving Spark performance: is it enough to use a cluster?

I have developed a simple Spark application that analyzes a dataset. The data comes from a CSV with 2 million records and 25 attributes. The analysis consists of simple RDD transformations/actions, and I also used algorithms from the MLlib library.
Since this is my first experience, I've taken many pieces of code from the documentation or from examples found online. However, a complete run of a simple algorithm, for example ALS for user recommendation, takes several minutes.
I run the application on a laptop (i7 2 GHz, 12 GB RAM).
I would like to know whether I just need to run this application on a cluster of computers to increase performance (in terms of speed), and if so, whether it is normal that running a recommendation engine model locally takes so long.
If yes, can I get results in real time with a good cluster of computers?
Thanks in advance!
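For reference, a minimal local-mode version of such an MLlib ALS job looks roughly like the sketch below. The CSV path and the column layout (user, item, rating in the first three columns) are illustrative assumptions, not details from the question:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    # Local mode: everything runs in a single JVM on the laptop, using all cores.
    sc = SparkContext(master="local[*]", appName="als-demo")

    # Illustrative input: a headerless CSV with user, item, rating columns.
    lines = sc.textFile("ratings.csv")
    ratings = lines.map(lambda l: l.split(",")) \
                   .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))

    # Training ALS on ~2 million rows can easily take minutes on one machine.
    model = ALS.train(ratings, rank=10, iterations=10)

    sc.stop()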

What is the best practice to run long-running calculations with OpenCPU?

Hi, I want to use the power and flexibility of OpenCPU to start long-running calculations (several minutes or so). I am facing issues where OpenCPU terminates the processing of the given scripts. I have modified the timelimit.post option in /etc/opencpu/server.conf to 600 (seconds), but it does not seem to take effect. Is there any other config file that must be modified in order to increase the timeouts?
Or, more generally, what is the best practice for running long-running calculations with OpenCPU?
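For reference, /etc/opencpu/server.conf is a JSON file, and the entry mentioned above would look roughly like the sketch below (the timelimit.get value is illustrative; the OpenCPU service typically needs to be restarted before changes take effect):

    {
      "timelimit.get": 60,
      "timelimit.post": 600
    }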

Hive: find the expected run time of a query

I want to find the expected run time of a query in Hive. Using EXPLAIN gives the execution plan. Is there a way to find the expected run time?
I need the Hive equivalent of the SQL EXPLAIN COSTS query.
There is no out-of-the-box (OOTB) feature at the moment that facilitates this. One way to achieve it would be to learn from history: gather patterns based on similar data and queries you have run previously, and try to deduce some insights. You might find tools like Starfish helpful in the process.
I would not recommend deciding anything based on a subset of your data, as running queries on a small dataset and on the actual dataset are very different. That is good for testing functionality, but not for any kind of cost approximation. The reason is that a lot of factors are involved in the process, like system resources (disk, CPU slots, network, etc.), system configuration, other running jobs, and so on. You might see smooth operation on a small dataset, but as the data size increases all these factors start to play a much more important role. Even a small configuration parameter can matter. (You might have noticed that a Hive query sometimes runs fast initially but gradually starts getting slow.) Also, the execution of a Hive query is much more involved than a simple MR job.
See this JIRA to get some idea; it discusses developing cost-based query optimization for joins in Hive. You might also find this helpful.
I think that is not possible, because internally a MapReduce job gets executed for any particular Hive query, and a MapReduce job's execution time depends on the cluster load and its configuration, so it is tough to predict the execution time. One thing you can do is start a timer before running the query and, once it finishes, calculate the exact execution time that was needed.
Maybe you could sample a small percentage of records from your table using partitions, bucketing features, etc., and then run the query against the small dataset. Note the execution time and then multiply it by the factor (total_size/sample_size).
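A rough sketch of that timing-and-extrapolation idea in Python, where run_on_hive is a hypothetical helper that submits a HiveQL query and blocks until it completes, and the sizes are illustrative:

    import time

    def estimate_runtime(run_on_hive, sample_query, sample_size, total_size):
        """Time a query on a sampled subset and scale linearly to the full table.

        The linear scaling is only a crude approximation, since cluster load and
        configuration can change the real behaviour as the data grows.
        """
        start = time.time()
        run_on_hive(sample_query)
        elapsed = time.time() - start
        return elapsed * (total_size / sample_size)

    # Example: if a 1% sample took 42 seconds, the estimate for the full table
    # would be 42 * (100 / 1) = 4200 seconds.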
