Athena query submission via shell script - bash

I am executing a shell script on an EMR cluster via script-runner to run an Athena query for every data date in a month, using the AWS CLI's aws athena command with --query-string.
Once I submit the step, the query is executed for only 10 of the days, the corresponding query IDs are displayed in the sysout log, and the step moves to completed status.
Question:
How can I make the job step execute the query for all days and only then mark the job status as completed?
Thanks in advance
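A minimal sketch of one way to structure the step's script (not from the original question; the table, date column, month, and S3 output location are hypothetical): submit one query per day with aws athena start-query-execution and poll aws athena get-query-execution until it finishes, so the script, and therefore the EMR step, only completes after every day has been processed.

#!/usr/bin/env bash
# Sketch: submit one Athena query per day of the month and wait for each to finish.
# my_db.my_table, data_date, the month, and the S3 output location are placeholders.
set -euo pipefail

MONTH="2023-07"
OUTPUT="s3://my-bucket/athena-results/"

for day in $(seq -w 1 31); do
  DATE="${MONTH}-${day}"
  date -d "${DATE}" >/dev/null 2>&1 || continue   # skip non-existent dates

  QID=$(aws athena start-query-execution \
          --query-string "SELECT * FROM my_db.my_table WHERE data_date = DATE '${DATE}'" \
          --result-configuration "OutputLocation=${OUTPUT}" \
          --query 'QueryExecutionId' --output text)
  echo "Submitted ${DATE} as query ${QID}"

  # Block until this day's query reaches a terminal state before moving on.
  while true; do
    STATE=$(aws athena get-query-execution --query-execution-id "${QID}" \
              --query 'QueryExecution.Status.State' --output text)
    [ "${STATE}" = "SUCCEEDED" ] && break
    if [ "${STATE}" = "FAILED" ] || [ "${STATE}" = "CANCELLED" ]; then
      echo "Query for ${DATE} ended in state ${STATE}" >&2
      exit 1
    fi
    sleep 5
  done
done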

Related

How to speed up this query to retrieve lastUpdateTime of all hive tables?

I have created a bash script (GitHub link) to query all Hive databases, query each table within them, parse the lastUpdateTime of those tables, and extract them to a CSV with the columns "tablename,lastUpdateTime".
This is slow, however, because in each iteration the call to "hive -e ..." starts a new Hive CLI process, which takes a noticeably significant amount of time to load.
Is there a way to speed up loading the Hive CLI, or to speed up the query in some other way that solves the same problem?
I have thought about loading the Hive CLI just once at the start of the script and calling bash commands from within it using the ! <command> syntax, but I am not sure how to write loops inside the CLI then. Also, if I put the loops inside a bash script file and execute that instead, I am not sure how to pass the results of queries executed within the Hive CLI as arguments to that script.
Without giving the specifications of the system I am running it on, the script can process about ~10 tables per minute, which I think is really slow considering there can be thousands of tables in the databases we want to apply it to.
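One way to avoid paying the CLI start-up cost per table (a minimal sketch, not from the linked script; it assumes the per-table query is a "show table extended" statement, since that is where lastUpdateTime is reported) is to generate all the per-table statements into one HQL file and start the Hive CLI only once with hive -f:

#!/usr/bin/env bash
# Sketch: collect all per-table queries into a single file so the Hive CLI is
# started once for the expensive part instead of once per table.
HQL_FILE=$(mktemp /tmp/last_update.XXXXXX.hql)
OUT_CSV=table_last_update.csv

for db in $(hive -e "show databases;" 2>/dev/null); do
  for tbl in $(hive -e "use ${db}; show tables;" 2>/dev/null); do
    echo "show table extended in ${db} like '${tbl}';" >> "${HQL_FILE}"
  done
done

# Single Hive CLI start-up for every per-table query; parse tableName and
# lastUpdateTime pairs out of the "show table extended" output.
hive -f "${HQL_FILE}" 2>/dev/null |
  awk -F: '/^tableName/ {name=$2} /^lastUpdateTime/ {print name "," $2}' > "${OUT_CSV}"

rm -f "${HQL_FILE}"

This still pays one CLI start-up per database for the "show tables" listing, but the per-table cost, which dominates when there are thousands of tables, is reduced to a single invocation.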

NiFi job to execute a spark-submit command not giving correct results

I have Spark code that appends data from a Hive table to Parquet files partitioned by date. The code runs absolutely correctly when executed from the spark shell, and the Parquet files show exactly the same number of rows as the Hive table for the corresponding date.
However, when the same code is packaged into a jar file that is invoked by a spark-submit command, and that spark-submit command is scheduled via NiFi to run daily at 9 AM, the number of rows in the Parquet partition files comes out lower. We are on the P_NO_SLA queue, and below are some of the facts and observations we have:
• Data in the source Hive table is updated by approximately 4 AM.
• Initially our NiFi job was scheduled to start at 4:45 AM, but the number of records did not match. On doing a manual update from the spark shell after 6 AM, the data was an exact match.
• Hence, we rescheduled the job to run at 7 AM. With this schedule, the data was updated correctly via the NiFi job only when the number of records was small (approx. 20,000 on weekends) compared to weekdays (in the range of 150,000 to more than 200,000 records); again a manual run was needed to backfill the missing data.
• We then postponed the job to 9 AM. After doing this there were 2 days when the number of records matched (between 160,000 and 200,000); however, since Jul-31 the data hasn't matched at all, irrespective of the number of records on any of the days, and we are having to do a manual backfill every day.
We are unable to figure out any specific reason why the code runs correctly from the spark shell at any time, but gives incorrect results when NiFi is simply scheduled to execute the spark-submit command that runs the jar file containing the same Spark code.
Please help me understand why this is happening and how I can fix it.
P.S.: I have checked the NiFi log files and could not find any of the scheduled jobs reporting an error.

Oozie workflow for hive action

I am using Oozie to execute a few Hive queries one after another, and if a query fails it sends an error email saying that the particular Hive query has failed.
Now I have to implement another email trigger based on the result of each Hive query. How can we do that? If a query returns any result, the results should be emailed and the remaining Hive queries should continue executing. The Oozie workflow execution should never stop, regardless of whether a query returns a value or not.
In short, if a query returns a value, send an email and continue; if it doesn't return a value, it should also continue executing.
Thank you in advance.
If you want to make decisions based on a previous step, it is better to use shell actions (with hive -e to execute the query) along with the capture-output tag in Oozie. Better still, use Java actions with a Hive JDBC connection to execute the Hive queries, where you can use Java for all the looping and decision making.
As Oozie doesn't support cycles/loops of execution, you might need to repeat the email action in the workflow based on the decision making and flow.
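A minimal sketch of what the shell action's script could look like (not from the original answer; the query, the has_results property name, and the node name are assumptions), with <capture-output/> enabled on the action so a following decision node can route to the email action and then continue either way:

#!/usr/bin/env bash
# Sketch: run the Hive query, report whether it returned rows via capture-output,
# and always exit 0 so the workflow keeps going whether rows came back or not.
ROWS=$(hive -e "SELECT col1, col2 FROM some_db.some_table WHERE dt='${1}';" 2>/dev/null)

if [ -n "${ROWS}" ]; then
  # With <capture-output/> on the shell action, this key=value line is readable
  # in the workflow as ${wf:actionData('shell-node')['has_results']} by a decision node.
  echo "has_results=true"
else
  echo "has_results=false"
fi

exit 0

Note that capture-output only allows a small amount of output data by default, so the query results themselves would typically be written to a file (for example on HDFS) by the script and referenced from the email action, rather than passed through the captured properties.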

Hive query in the JobTracker

Hi, we are running Hive queries in a CDH 4 environment, to which we recently upgraded. One thing I notice is that earlier, in CDH 3, we were able to track our queries in the JobTracker.
A link similar to "hostname:50030/jobconf.jsp?jobid=job_12345" would have a parameter "hive.query.string" or "mapred.jdbc.input.bounding.query" containing the actual query for which the MR job is executed.
But in CDH 4 I do not see where I can get the query. Many queries run in parallel, so we need a way to keep track of which query we are concerned with.
You can still view the Hive queries in the JobTracker.
Get the job information for the job id from the URL hostname:50030/jobtracker.jsp.
You will find details like the following at the top of the page.
Hadoop Job 4651 on History Viewer
User: xxxx
JobName: test.jar
JobConf: hdfs://domain:port/user/xxxx/.staging/job_201403111534_4651/job.xml
Job-ACLs: All users are allowed
Submitted At: 14-Mar-2014 03:15:19
Launched At: 14-Mar-2014 03:15:19 (0sec)
Finished At: 14-Mar-2014 03:18:04 (2mins, 44sec)
Status: FAILED
Analyse This Job
Now click the URL next to JobConf and you will find your submitted Hive query.
I see that the query parameters for each job can also be found in the .staging folder in HDFS itself, and the job.xml there can be parsed to get the query associated with a job id.
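For example, a minimal sketch of pulling the query out of a job's staging job.xml (the job id and path are taken from the example above; the availability of xmllint and the exact staging location are assumptions):

#!/usr/bin/env bash
# Sketch: fetch a job's job.xml from the HDFS staging directory and extract
# the hive.query.string property. Adjust the user and job id for your cluster.
JOB_ID=job_201403111534_4651
JOB_XML=/user/xxxx/.staging/${JOB_ID}/job.xml

hdfs dfs -cat "${JOB_XML}" > /tmp/${JOB_ID}.xml

# Requires xmllint (libxml2); prints the full Hive query that the MR job ran.
xmllint --xpath "string(//property[name='hive.query.string']/value)" /tmp/${JOB_ID}.xml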

Job action string too long

I'm trying to create a job that will sync two databases at midnight. There are 10 tables that need to be synced, and it's a very long PL/SQL script. When I set this script as the JOB_ACTION and try to create the job, I get "string value too long for attribute job action". What do you suggest I do? Should I separate the script into 10? Isn't there a way to make the job run the code as a script? If I do it manually, all 10 anonymous blocks get executed one after another. I need something that will, in effect, press F5 for me at midnight.
What you need is a DBMS_SCHEDULER chain, in which each action is a separate step and the steps can even be executed at the same time.
http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_sched.htm
