Hive applications are slow to start - hadoop

Hive (Tez) queries are slow to start on YARN.
We run Hive queries on the Tez engine. As soon as we submit a query, its status shows as RUNNING, but the actual job does not start for 10 to 15 minutes.
I am not sure what code to include to help diagnose the problem, since Hive, Tez and YARN involve a lot of configuration. Please let me know which configurations are needed to investigate this issue further.
The expected behaviour is that the query starts executing as soon as it is submitted.

Related

How does Hive manage non-Tez and non-MapReduce queries?

Create table t1(id int)
I ran the above query on Hive 2.3.6 (MapR Hadoop Distribution 6.3.0).
The default Hive execution engine was Tez.
After running the query, I could not see any Tez application launched in the YARN ResourceManager web UI.
So I changed the execution engine to MapReduce:
set hive.execution.engine=mr
Then I ran the same query again.
Again, I could not see any MR application launched in the YARN ResourceManager web UI.
So my questions are: how does Hive manage these types of queries?
And where are the details of these queries stored, such as application ID, start time and so on?
create table is a metadata-only operation; no data is processed. It creates records in the metastore database, so no distributed processing framework like Tez or MR is needed and YARN is not used.
Where possible, the compiler translates DDL into metastore operations only.
Some simple DQL queries can also be answered from the metastore alone, without Tez or MR, if statistics exist and this feature is enabled: https://stackoverflow.com/a/41021682/2700344
Small tables can also be queried without a distributed framework, using a fetch-only task; see: Why is Fetch task in Hive works faster than Map-only task?
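As a rough sketch of the points above, the session below uses two standard Hive properties, hive.compute.query.using.stats and hive.fetch.task.conversion, and then asks Hive for the query plans. The table t1 is the one from the question; the specific settings and the ANALYZE step are assumptions about a typical setup, not something the question confirms. Whether a plan ends up containing only a fetch stage depends on the Hive version, the other settings in effect, and whether the table statistics are present and accurate.

```sql
-- Assumption: allow simple aggregates such as COUNT(*) to be answered
-- from metastore statistics instead of launching a Tez/MR job.
SET hive.compute.query.using.stats=true;

-- Assumption: allow simple SELECT / filter / LIMIT queries to run as a
-- fetch-only task (accepted values are none, minimal and more).
SET hive.fetch.task.conversion=more;

-- DDL: a metastore-only operation, so no YARN application is expected.
CREATE TABLE IF NOT EXISTS t1 (id INT);

-- Collect table statistics so the stats-based path has something to use.
-- Note that this step may itself launch a job.
ANALYZE TABLE t1 COMPUTE STATISTICS;

-- Inspect the plans rather than guessing how each query will be executed.
EXPLAIN SELECT COUNT(*) FROM t1;
EXPLAIN SELECT id FROM t1 LIMIT 10;
```

If the EXPLAIN output lists only a fetch stage and no Tez or MapReduce stage, the query is expected to run without a YARN application, which would match what the question observed in the ResourceManager web UI.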

Apache NiFi job is not terminating automatically

I am new to Apache NiFi. I am trying to import data from MongoDB and put it into HDFS. I have created two processors, one for MongoDB and one for HDFS, and configured them correctly. The job runs successfully and stores the data in HDFS, but it should terminate automatically on success. Instead it keeps running and creates too many files in HDFS. I want to know how to build an on-demand job in NiFi and how to determine that a job was successful.
GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection and Limit. It has no way of tracking what it has already processed, at least for now. What you can do, however, is change the Run Schedule and/or Scheduling Strategy. You can find them by right-clicking the processor and clicking Configure. By default, Run Schedule is 0 sec, which means the processor runs continuously. Changing it to, say, 60 min makes the processor run once every hour. This will still read the same documents from MongoDB again every hour, but since you mentioned that you only want to run it once, I'm suggesting this approach.

Hadoop MapReduce job is stuck and never completes

I am using Pig to store data into Hive. When I execute the program, it shows 0% complete, gets stuck and never finishes. I let it run for around 3 hours, but it never reported any error. I started searching and found that the problem might be in the yarn.xml and map_reduce.xml configuration. I changed the configuration, but it had no effect at all.

Oozie Hive action on AWS - unpredictable ip sources break the job

I've been having a few days of unalloyed torture getting Hive jobs to run via Oozie on a 5-machine AWS cluster. Even the simplest job that involves the live metastore succeeds or fails unpredictably. The error messages are pretty unhelpful:
Hive failed, error message[Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [1]]
Thanks Oozie!
After a lot of fun changing just about every imaginable setting, I studied hivemetastore.log carefully (we have MySQL as the metastore) and realised that every successful request came from 172.31.40.3. Unsuccessful requests came from 172.31.40.2, 172.31.40.4 and 172.31.40.5. The Hive console app makes requests without problems from 172.31.40.1.
This is getting somewhere after nearly a week of having no idea whatsoever what was going on. The question now is: what do I need to change to allow in all requests from 172.31.40.1-5? Or, alternatively, to funnel Oozie requests solely through 172.31.40.1 or 172.31.40.3?
Why would only 172.31.40.1 and 172.31.40.3 work?
All ideas and suggestions warmly received.
many thanks
Toby
This was so simple in the end: the Oozie client was only installed on 2 of the 5 machines in the cluster, corresponding, of course, to the 2 IP addresses that could make successful requests to the Hive metastore.
Once we installed the Oozie client on all the machines in the cluster, all the jobs were automatically accepted and ran OK.
Obvious when you know the answer ...

How to access Hive log information

I am trying to analyze the performance of Hive queries. I am able to run Hive queries from Java, but I still need to access the log information generated after each query. Rather than using a hack that reads the latest log on disk and extracts the numbers with a regex, I am looking for a more graceful method, if one is already available.
Any pointers will be helpful. Thanks in advance.
-lg
Query execution details such as Status, Finished at and Finished in are displayed in the JobTracker, which you can access programmatically. Related info at this link:
How could I programmatically get all the job tracker and tasktracker information that is displayed by Hadoop in the web interface?
Once a Hive query starts running, a corresponding MapReduce job starts. The logs of this Hadoop job can be found on the TaskTracker on which each task runs.
Use the JobClient API to retrieve these logs programmatically.

Resources