hiveconf parameter, is it possible to set a value? - hadoop

I have one hiveconf variable set as
set DATEHOUR = from_unixtime(unix_timestamp()-3000);
The idea is that log files are available in Hadoop 50 minutes (3000 seconds) after each hour, and this workflow will process them and store the transformed data into the correct partition. DATEHOUR is initially used to query the correct partition within the raw logs directory.
After the transformation process (which could take a variable amount of time), I want to store the result in a different directory, again in the correct partition. But if I use ${hiveconf:DATEHOUR} again, it grabs the current timestamp, not the timestamp from when I first set the variable.
I tried creating a new variable and setting it equal to DATEHOUR but it still returns the same problem. Is there a way I can "paste the value" of DATEHOUR somewhere so it remains constant for later retrieval?

You are getting that result because hiveconf variables are substituted as plain text, and unix_timestamp() returns the current timestamp each time the substituted expression is evaluated.
What you need is to subtract the 3000 seconds from the DATEHOUR you set, so you need to put that value into your query. Your changed query will look like this:
select from_unixtime(unix_timestamp('${hiveconf:datehour}')-3000) from <table>;
When this query runs it will use the last DATEHOUR value you set. You also need to set the DATEHOUR value from the output of this query.
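If the value must stay frozen across several statements, one option (a sketch of my own, not part of this answer) is to compute it once in the calling shell and pass it to Hive as a literal, so it is never re-evaluated:
DATEHOUR=$(hive -S -e "select from_unixtime(unix_timestamp()-3000);")
hive --hiveconf DATEHOUR="${DATEHOUR}" -f my_workflow.hql
Here my_workflow.hql is a placeholder for the transformation script, which can then reference '${hiveconf:DATEHOUR}' as a constant string for both the read and the write partitions (older Hive versions may require a FROM clause on the SELECT).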
Hope it helps...!!!

Related

Does sql_id change if bind variable values are changed

I have a sql_id. The corresponding SELECT SQL query has 4 bind variables.
There is a program I created which tells me that it ran 1000 times in the last month.
So basically I want to know whether the same bind variable values were used all 1000 times or not.
For the latest execution, I got the bind variable values from v$sql_bind_capture.
So is the latest value in v$sql_bind_capture the same value that was used all 1000 times?
Does sql_id generation take the bind values into account, or is the sql_id generated from the query text without the bind values?
Thanks
Tarun
No, passing a different bind value each time will not cause the SQL_ID to change. A different bind value may cause the SQL plan hash value (PHV) to change, but not the SQL_ID.
About your main question:
so basically I want to know that all 1000 times the same bind variable was used or not.
There are 2 standard ways to do that:
add the "monitor" hint to the query and check the bind variable values in v$sql_monitor (see the sketch below). I have my own script for that: https://github.com/xtender/xt_scripts/blob/master/rtsm/binds.sql
enable tracing for your sql_id:
alter system set events 'sql_trace [sql:&sqlid] bind=true, wait=false';
&sqlid is a substitution variable which you can set to your needed sql_id. Then you can periodically check the trace files for bind variable values, for example using grep.
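As a rough sketch (my own, not from the linked script), once the statement has been run with the monitor hint you can list every monitored execution of your sql_id together with its captured binds:
select sql_exec_id, sql_exec_start, binds_xml
from v$sql_monitor
where sql_id = '&sqlid'
and binds_xml is not null
order by sql_exec_start;
Each row is one execution, and the bind names and values are embedded in the binds_xml column, so you can compare them across runs (note that v$sql_monitor only keeps recent executions in memory).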

How to sort by a derived value that includes a moving date in ElasticSearch?

I have a requirement to sort the results returned by ElasticSearch by a special value I define, let's call it 'X'.
Now, the problem is that 'X' is a value derived from:
field A in the document (which is a 'term')
field B (which is a 'date')
the current date (UTC)
So, the problem is obviously item 3. The date always changes, therefore I'm not sure how to include it in the sort, since it's not part of the document.
From my initial reading it appears I can use a 'script' here, but I'm worried about the performance, since I could be searching and sorting over thousands of documents.
The only other idea that came to mind is to calculate the value nightly and store it in each document. But that has a few drawbacks:
I need to have something running in the background to update this value.
It could be a lot of documents to update (60%+ every night).
I lose precision for the value depending on how long between script runs (if I run nightly, the value is 23 hours 'stale').
Any advice?
Thanks
This can be done by having an ES script run nightly to calculate the value and store it in each document.
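As a rough sketch of that nightly job (the index name, field names, and the formula for 'X' are all made up here; it assumes field B is stored as epoch milliseconds), an _update_by_query with a Painless script could recompute X for every document:
POST my_index/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": "long ageDays = (params.now - ctx._source.b) / 86400000L; ctx._source.x = ctx._source.a == 'high' ? ageDays * 2 : ageDays;",
    "params": { "now": 1700000000000 }
  }
}
The client supplies params.now when it triggers the job, and queries then sort on the stored x field normally. If staleness matters more than per-document cost, a script-based sort that receives the current date as a script parameter avoids the nightly job.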

Get the current time in a dataflow

I'm building a dataflow where I want to filter rows based on the current time. I need to filter these based on the hour and minute.
I thought I could use a Date Time block. When I use that, the output value shows "today".
But when I bind the output of the Date Time block to the input on a Date Format block or my symbol, the value of the bound property is null.
I'm looking for a way to get the current date and time, preferably with a way to control how often the value is updated (once per minute would be enough for example).
Using a Script block works. The following script gets the current timestamp as a string with one-minute precision:
dateFormat(new DateTime(), "y-MM-dd HH:mm")
You can connect the output of the Script block to the input on a block that expects a "date", such as a Date Format block.
For the value to be updated, you must invoke the script block. To do this, a Stopwatch block can be used. In my case, I have it set to update every 10 seconds.

How to calculate a value using previous rows' values in Talend

I have a dataset like below.
Dataset:
Now the business logic is to find the last paid date for each of the loans. I tried using a tMap component that calls a Java routine with a static variable last_paid_dt, which stores the transaction date whenever the daily deposit is > 0. When the daily deposit is less than 0 the static variable is not changed. This works fine when the amount paid is 0.
Issue - see the red highlighted values in the table below.
When the amount paid is reversed a day or more later, the last paid date should come from the previous non-reversed positive amount. I was not able to get that done.
Also, when a new loan id starts processing, I need the static variable to be reset, which is not currently happening.
If my current methodology is wrong, please help me do this in a better and more efficient way. Thanks
Expected output:
First of all, you need to use a map keyed by the loanId.
You don't want to overwrite the value, i.e. if the key already exists in your map, do not overwrite it with a new value.
You can use the globalMap if you want; in that case I'd do:
globalMap.get("loan_"+loanId) == null ?
globalMap.put("loan_"+loanId, loanDate) : loanDate
then later:
globalMap.get("loan_"+loanId)
Not elegant, but it works. A more elegant approach would be to define your own map that you put into globalMap, and null it out after the process so you free up the memory. But this all depends on the complexity of your job.
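A minimal sketch of that "own map" idea as a Talend routine (my illustration, with assumed column names; it covers the per-loan tracking and the reset concern, but not the reversal case, which would need a small per-loan history):
package routines;

import java.util.HashMap;
import java.util.Map;

public class LastPaidDate {

    // one entry per loan id, so nothing needs to be reset when a new loan starts
    private static final Map<String, String> LAST_PAID = new HashMap<String, String>();

    // remember the transaction date whenever the daily deposit is positive and
    // return the last paid date seen so far for this loan (null before the first payment)
    public static String track(String loanId, String txnDate, double dailyDeposit) {
        if (dailyDeposit > 0) {
            LAST_PAID.put(loanId, txnDate);
        }
        return LAST_PAID.get(loanId);
    }

    // call this after the subjob (e.g. from a tJava in a post-job) to free the memory
    public static void clear() {
        LAST_PAID.clear();
    }
}
In the tMap expression you would then call something like routines.LastPaidDate.track(row1.loan_id, row1.txn_date, row1.daily_deposit), assuming the rows arrive sorted by loan id and transaction date.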

Time value as output

For a few columns from the source (a .csv file), we have values like 1:52:00 and 14:45:00.
I am supposed to load them into an Oracle table.
Which data type should I choose for the target as well as the source?
Should I be doing anything in the expression transformation?
Use SQL*Loader (sqlldr) to load the data into the database with the format described in the link:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements004.htm
i.e. 'HH24:MI:SS'
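A minimal control-file sketch (file, table, and column names are placeholders of mine, not from the question):
LOAD DATA
INFILE 'times.csv'
APPEND
INTO TABLE time_values
FIELDS TERMINATED BY ','
( event_time DATE "HH24:MI:SS" )
Note that Oracle defaults the missing date portion (to the first day of the current month), so consumers should ignore the date part of the loaded value.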
Oracle does not support time-only values; it supports dates (with a time component).
You have a few options:
Store the value as a string, perhaps providing a leading zero for the hour.
Store the value as the number of seconds (or minutes) past midnight.
Store the value as the time component of some arbitrarily defined date, for example 0001-JAN-01 01:52:00 and 0001-JAN-01 14:45:00. Tell your report writers to ignore the date portion of the value.
Your source datatype will be string(8). Use LPAD to add leading zeroes.
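To make the options concrete, a hedged SQL sketch for a raw value such as '1:52:00' (option 1 is where the LPAD comes in):
select
  lpad('1:52:00', 8, '0') as padded_string,                     -- option 1: '01:52:00'
  (to_date('1:52:00', 'HH24:MI:SS')
     - trunc(to_date('1:52:00', 'HH24:MI:SS'))) * 86400
    as seconds_past_midnight,                                   -- option 2: 6720
  to_date('0001-01-01 ' || lpad('1:52:00', 8, '0'),
          'YYYY-MM-DD HH24:MI:SS') as date_with_dummy_day       -- option 3
from dual;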
