Hive get_json_object(): How to check if JSON field exists? - hadoop

I am using Hive and the get_json_object() function to query data stored as JSON. The JSON has a coordinate key whose center object contains latitude and longitude fields, like the following:
"coordinate":{
"center":{
"lat":36.123413127558536,
"lng":-115.17381648045654
},
"precision":10
}
I am running my Hive query to retrieve data within some geocoordinate box, like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user.name/sample/sample1.txt'
SELECT * FROM mytable
WHERE
get_json_object(mytable.`value`, '$.coordinate.center.lat') > 36.115767
AND get_json_object(mytable.`value`, '$.coordinate.center.lng') > -115.314051
AND get_json_object(mytable.`value`, '$.coordinate.center.lat') < 36.285595
AND get_json_object(mytable.`value`, '$.coordinate.center.lng') < -115.085399
DISTRIBUTE BY rand()
SORT by rand()
LIMIT 10000;
However, the problem is that for some rows, the coordinate field is missing, or the center field is missing, or the lat and/or lng field is missing. How can I modify my Hive SELECT query to only get rows that have a complete valid coordinate entry with existing lat and lng?

I would make a separate VIEW of the table where you do
WHERE get_json_object(...) IS NOT NULL
for each field you're interested in.
Then run the given query over that view.
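For reference, a minimal sketch of such a view, reusing the table, column, and JSON paths from the question (the view name is made up); get_json_object() returns NULL when a path is missing, so the IS NOT NULL checks drop incomplete rows:
CREATE VIEW mytable_valid_coords AS
SELECT *
FROM mytable
WHERE get_json_object(mytable.`value`, '$.coordinate.center.lat') IS NOT NULL
  AND get_json_object(mytable.`value`, '$.coordinate.center.lng') IS NOT NULL;
The original SELECT then reads FROM mytable_valid_coords instead of mytable.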
Alternatively, fix your input source to generate consistent data, for example by using Avro instead.

Related

How to select multiple related columns via calculated fields in QuickSight using a parameter and ifelse?

I have a parameter 'type' in a table and it can have multiple values as follows -
human
chimpanzee
orangutan
I have 3 columns related to each type in the table -
human_avg_height, human_avg_weight, human_avg_lifespan
chimpanzee_avg_height, chimpanzee_avg_weight, chimpanzee_avg_lifespan
orangutan_avg_height, orangutan_avg_weight, orangutan_avg_lifespan
So if I select the type as human, the QuickSight dashboard should only display the three columns -
human_avg_height, human_avg_weight, human_avg_lifespan
and should not display the following columns -
chimpanzee_avg_height, chimpanzee_avg_weight, chimpanzee_avg_lifespan
orangutan_avg_height, orangutan_avg_weight, orangutan_avg_lifespan
I created the parameter type, and in Add calculated field I am trying to use ifelse to select the columns based on the selected parameter, as follows -
ifelse(${type}='human',{human_avg_height}, {human_avg_weight}, {human_avg_lifespan},{function})
I also tried -
ifelse(${type}='human',{{human_avg_height}, {human_avg_weight}, {human_avg_lifespan},{function}})
And -
ifelse(${type}='human',{human_avg_height, human_avg_weight, human_avg_lifespan},{function}})
But none of it is working. What am I doing wrong?
One way to do this would be to use three different calculated fields, one for all the heights, one for weights and one for lifespan. The heights one would look like this:
ifelse(
  ${type}='human', {human_avg_height},
  ifelse(
    ${type}='chimpanzee', {chimpanzee_avg_height},
    ifelse(
      ${type}='orangutan', {orangutan_avg_height},
      NULL
    )
  )
)
Make analogous calculated fields for weights and lifespans, then add these calculated fields to your table and filter by type.
To make it clear to the viewer what data is present, edit the Title of the visual to include the type:
${type} Data
You have to create one calculated field for each measure, using ifelse with the type parameter to choose the correct value, but it is not necessary to nest the ifelse calls as skabo did. The ifelse syntax is ifelse(if, then [, if, then ...], else), so you can define the calculated fields as follows:
avg_height = ifelse(${type}='human', {human_avg_height}, ${type}='chimpanzee', {chimpanzee_avg_height}, ${type}='orangutan', {orangutan_avg_height}, NULL)
avg_weight = ifelse(${type}='human', {human_avg_weight}, ${type}='chimpanzee', {chimpanzee_avg_weight}, ${type}='orangutan', {orangutan_avg_weight}, NULL)
avg_lifespan = ifelse(${type}='human', {human_avg_lifespan}, ${type}='chimpanzee', {chimpanzee_avg_lifespan}, ${type}='orangutan', {orangutan_avg_lifespan}, NULL)
Then use those calculated fields in your visuals.

Rails: compare one of an object's string fields with another field minus an interval in Active Record

I have two string fields that contain dates as strings, like field_1 = "2003.11.14", and I use them in the ORM and they work just fine. Now I want to compare one field's value with another field's value - 18.months. Here is an example:
User.where("users.field_1 > '#{Date.today - 18.months}' AND users.field_2 > (users.fields_1 - 18.months)")
Something like that. Can anyone help me?
Thanks in advance
Most databases support date calculations in SQL. Something like this should work:
query = User.where("users.field_1 > ?", 18.months.ago)
query.where("users.field_2 > users.field_1 - :time", time: 18.months.ago)
Edit: I just saw that the values are stored as strings; in that case you cannot use SQL date arithmetic directly.
I cannot do that because the table has millions of records.
I don't really understand why the size of the table prevents you from using the correct data type?
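If converting the columns ever becomes an option, the type change can be done in place. A sketch assuming PostgreSQL and the fixed 'YYYY.MM.DD' format from the question:
-- assumes PostgreSQL; USING rewrites each existing row during the type change
ALTER TABLE users ALTER COLUMN field_1 TYPE date USING to_date(field_1, 'YYYY.MM.DD');
ALTER TABLE users ALTER COLUMN field_2 TYPE date USING to_date(field_2, 'YYYY.MM.DD');
After that, the interval arithmetic in the answer above works directly against real date columns.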

Loop over attribute values for executing SQL in Nifi

I would like to know how I may accomplish the following use case in a NiFi flow:
I would like to execute a SQL query for each date range in a loop. The date ranges are provided as a list of attribute values.
For example, if my list of attributes is: 2013-01-01 2013-02-01 2013-03-01, I would like to execute the SQL operation in a loop such that:
select * from where startdate>=2013-01-01 and enddate<2013-02-01
followed by:
select * from where startdate>=2013-02-01 and enddate<2013-03-01
I roughly know the idea but can't implement it concretely:
UpdateAttribute (containing list of date values) -> SplitText-> RouteOnAttribute -> ExecuteSQL
Thanks
In NiFi 1.8.0, you can use DuplicateFlowFile for this (via NIFI-5454). You can start with UpdateAttribute to add the count of delineated values in your list (let's assume it is an attribute called datelist), perhaps set list.count to
${allDelineatedValues(${datelist}, " "):count()}
Then in DuplicateFlowFile you can set Number of Copies to ${list.count:minus(1)}. Each flow file downstream will have a copy.index attribute set (the original having index 0), so you can use that in ReplaceText in conjunction with getDelimitedField(), perhaps setting the content to the following:
select * from myTable where
startdate >= ${datelist:getDelimitedField(${copy.index:plus(1)}, ' ')} and
enddate < ${datelist:getDelimitedField(${copy.index:plus(2)}, ' ')}
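For illustration, with datelist = "2013-01-01 2013-02-01 2013-03-01", the flow file with copy.index = 0 would end up with content like the following (getDelimitedField is given ' ' as the delimiter since it splits on commas by default; in a real query the dates would typically need quoting, and the table name is a placeholder):
select * from myTable where
startdate >= 2013-01-01 and
enddate < 2013-02-01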

Pig Latin using two data sources in one FILTER statement

In my Pig script, I am reading data from more than 5 data sources (Hive tables), where one is the main data source and the rest are dimension tables. I am trying to filter the main data source relation (or alias) based on a value in one of the dimension relations.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
There are at least 20 places in my Pig script where I need to match a value between multiple data sources and produce a new relation. But I am getting an error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I also tried the "relation::field" approach, with no luck. Alternatively, I am joining these two relations (data sources) to get the filtered data, but I feel this will slow down execution and dump unnecessarily huge amounts of data.
Please guide me on how to use two or more data sources in one FILTER statement, something like in SQL, so that I can avoid JOIN statements and get it done from the FILTER statement itself:
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you would pretty much have to use a join, as such:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a
FILTER BY someID IN someIDList
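For what it's worth, a minimal sketch of the join approach in Pig, assuming the relation and field names from the question; USING 'replicated' keeps the small dimension relation in memory for a map-side join, which addresses the performance concern:
-- main_data's fields come first in the joined schema, so $0, $1, ... still address them
j1 = JOIN main_data BY deptID, dept_data BY departmentID USING 'replicated';
filtered_data2 = FOREACH j1 GENERATE $0, $1, $3, $7;
The same pattern can be chained once per dimension relation (sectionID, cityID, and so on).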

How do I split birt dataset column into multiple rows

My datasource has a column that contains a comma-separated list of numbers.
I want to create a dataset that takes those numbers and turns them into groupings to use in a bar chart.
requirements
numbers will be between 0-17 inclusive
groupings: 0-2,3-5,6-10,11-17
x-axis labels have to be the groupings
y-axis is the percent of rows that contain that grouping
note that because each row can contribute to multiple columns the percentages can add up to > 100%
Any help you can offer would be awesome... I'm very new to BIRT and have been stuck on this for a couple of days now.
Not sure that I understand the requirements exactly, but your basic question "split dataset column into multiple rows" can be solved either using a scripted dataset or with pure SQL (depending on your DB).
Either way, you will need a second dataset (i.e. your data model is master-detail), and in your layout you will need something like:
Table/List "Master" bound to the master DS
Table/List "Detail" bound to the detail DS (nested inside the master table)
The detail DS needs the comma-separated result column from the master DS as an input parameter of type "String".
Doing this with a scripted dataset is quite easy IFF you understand Javascript AND you understand how scripted datasets work: Create a report variable "myValues" of type object with a default value of null and a second report variable "myValuesIndex" of type integer with a default value of 0.
(Note: this is all untested!)
Create the dataset "detail" as a scripted DS, with one input parameter "csv" of type String and one output parameter "value" of type String.
In the open event of the scripted DS, code:
vars["myValues"] = this.getInputParameterValue("csv").split(",");
vars["myValuesIndex"] = 0;
In the fetch event, code:
var i = vars["myValuesIndex"];
var len = vars["myValues"].length;
if (i < len) {
    // emit the next token as this row's "value" and advance the index
    row["value"] = vars["myValues"][i];
    vars["myValuesIndex"] = i + 1;
    return true; // another row is available
} else {
    return false; // no more rows
}
For example, for the master DS result row with csv = "1,2,3-4,foo", the detail DS will result in 4 rows with
value = "1"
value = "2"
value = "3-4"
value = "foo"
Using an Oracle DB, this can be done without Javascript. The detail DS (with the same input parameter as above) would then look like:
select t.value as value from table(split(?)) t
For the definition of the split function, see RedFilter's answer on
Is there a function to split a string in PL/SQL?
If you get ORA-22813, you should change the original definition
create or replace type split_tbl as table of varchar2(32767);
to
create or replace type split_tbl as table of varchar2(4000);
as mentioned on https://community.oracle.com/thread/2288603?tstart=0
It's also possible with pure SQL in 11g using regexp_substr (see the same page).
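A common shape for that regexp_substr approach, sketched with the same csv input parameter bound to both placeholders:
-- splits a single comma-separated string into rows (Oracle 11g)
select regexp_substr(?, '[^,]+', 1, level) as value
from dual
connect by regexp_substr(?, '[^,]+', 1, level) is not null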
Create parameters in the scripted data set. You have to pass or link the actual dataset values to the scripted dataset parameters through the DataSet Parameter Binding, after assigning the scripted data set to the table.
