Distinct Column in Hive - hadoop

I am trying to get a query result in HiveQL with one column distinct, but the row counts do not match. There are almost 20 columns in the table.
create table uniq_us row format delimited fields terminated by ',' lines terminated by '\n' as select distinct(a),b,c,d,e,f,g,h,i,j from ctry_us_join;
The resulting number of rows: 513238
select count(distinct a) from ctry_us_join;
The resulting number of rows: 151616
How is this possible? Is something wrong in my first or second query?

You need to use a subselect with a GROUP BY clause.
select count(a) from (
select a, count(*) from ctry_us_join group by a) b
This is just one possible solution.

Distinct is a keyword, not a function. It applies to all columns you list in your select clause. It is quite reasonable that your table has only 151,616 distinct values in the column a, but multiple rows with the same value in the column a have different values in other columns. That might give you 513,238 distinct rows.
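The difference between the two counts can be reproduced on a tiny table. The sketch below uses Python's built-in sqlite3 module rather than Hive, and the two-column table and its rows are invented for illustration; only the table name and column a come from the question.

```python
import sqlite3

# Hypothetical miniature of ctry_us_join: only columns a and b, invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ctry_us_join (a TEXT, b TEXT)")
conn.executemany(
    "INSERT INTO ctry_us_join VALUES (?, ?)",
    [("x", "1"), ("x", "2"), ("x", "2"), ("y", "1")],
)

# DISTINCT covers the whole select list, so two rows are duplicates only
# when every selected column matches.
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT a, b FROM ctry_us_join)"
).fetchone()[0]

# COUNT(DISTINCT a) looks at column a alone.
distinct_a = conn.execute(
    "SELECT COUNT(DISTINCT a) FROM ctry_us_join"
).fetchone()[0]

# The GROUP BY subselect form gives the same per-column count.
grouped_a = conn.execute(
    "SELECT COUNT(a) FROM (SELECT a, COUNT(*) FROM ctry_us_join GROUP BY a) t"
).fetchone()[0]

print(distinct_rows, distinct_a, grouped_a)  # prints: 3 2 2
```

Here the value x appears with two different values of b, so it contributes two distinct rows but only one distinct value of a.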

Related

Combining CLOB columns in Query

I have a table with a CLOB column. What I need to do is query the table, and combine the CLOB column of each row into a single CLOB column.
So, say I have something like:
ABC CLOB_VALUE1
ABC CLOB_VALUE2
ABC CLOB_VALUE3
What I need at output is:
ABC Combined Value (CLOB_VALUE1, CLOB_VALUE2, CLOB_VALUE3)
LISTAGG will not work due to the length, and I'm not having any luck with XMLAGG (unless I am doing it wrong).
I tried this, but it is not retrieving all the records:
SELECT id, XMLAGG(XMLELEMENT(E,price_string||',') ORDER BY
price_date).EXTRACT('//text()').getclobval() AS daily_7d_prices
FROM daily_price_coll
WHERE price_date >= TRUNC(SYSDATE) - 7
GROUP BY id;
I'm only getting the most recent row, when there are actually 3 rows in the table.
Any ideas?

How to get distinct single column value while fetching data by joining two tables

$data['establishments2'] = Establishments::Join("establishment_categories",'establishment_categories.establishment_id','=','establishments.id')->where('establishments.city','LIKE',$location)->where('establishments.status',0)->whereIn('establishment_id',array($est_data))->get(array('establishments.*'));
This is the condition in my controller.
I have two tables. I match the id in table1 against table2 and then fetch data from table1, but table2 has multiple rows for the same table1 id. I want each table1 row only once, but because table2 has multiple rows for the same id, the data is repeated multiple times. Can anyone please tell me how to get each row only once, whether table2 has a single row or multiple rows for the same id? Thank you.
You can do it by selecting only the field you want.
Instead of getting all fields from the establishments table:
$data['establishments2'] = Establishments::Join("establishment_categories",'establishment_categories.establishment_id','=','establishments.id')->where('establishments.city','LIKE',$location)->where('establishments.status',0)->whereIn('establishment_id',array($est_data))->get(array('establishments.*'));
you can select a specific field from the establishments table, like this:
$data['establishments2'] = Establishments::Join("establishment_categories",'establishment_categories.establishment_id','=','establishments.id')->where('establishments.city','LIKE',$location)->where('establishments.status',0)->whereIn('establishment_id',array($est_data))->get('establishments.fieldName');
Or you can also do:
$data['establishments2'] = Establishments::Join("establishment_categories",'establishment_categories.establishment_id','=','establishments.id')->where('establishments.city','LIKE',$location)->where('establishments.status',0)->whereIn('establishment_id',array($est_data))->select('establishments.fieldName')->get();

HIVE GROUP_CONCAT with ORDER BY

I have a table like
I expect the output like this (group concat the results in one record, and the group_concat should sort the results by value DESC).
Here is the query I tried,
SELECT id,
CONCAT('{',CONCAT_WS(',',GROUP_CONCAT(CONCAT('"',key, '":"',value, '"'))), '}') AS value
FROM
table_name
GROUP BY id
I want the value in the destination table to be sorted in descending order by the source table's value.
To do that, I tried GROUP_CONCAT(... ORDER BY value).
It looks like Hive does not support this. Is there any other way to achieve this in Hive?
Try out this query.
Hive does not support the GROUP_CONCAT function, but you can use the collect_list function to achieve something similar. You will also need to use analytic window functions, because Hive does not support an ORDER BY clause inside the collect_list function.
select
    id,
    -- since we have a duplicate group_concat values against the same key
    -- we can pick any one value by using the min() function
    -- and grouping by the key 'id'
    -- Finally, we can use the concat and concat_ws functions to
    -- add the commas and the open/close braces for the json object
    concat('{', concat_ws(',', min(g)), '}')
from
    (
        select
            s.id,
            -- The window function collect_list is run against each row with
            -- the partition key of 'id'. This will create a value which is
            -- similar to the value obtained for group_concat, but this
            -- same/duplicate value will be appended for each row for the
            -- same key 'id'
            collect_list(s.c) over (partition by s.id
                                    order by s.v desc
                                    rows between unbounded preceding and unbounded following) g
        from
            (
                -- First, form the key/value pairs from the original table.
                -- Also, bring along the value column 'v', so that we can use
                -- it further for ordering
                select
                    id,
                    v,
                    concat('"', k, '":"', v, '"') as c
                from
                    table_name -- This it th
            )
            s
    )
    gs
-- Need to group by 'id' since we have duplicate collect_list values
group by
    id
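To make the intended result concrete, here is a minimal plain-Python sketch of the same grouping: one JSON-like string per id, with the pairs ordered by value descending. The sample rows and the function name grouped_json are hypothetical, since the original table is not shown. The values are assumed to be numeric; string values in Hive would sort lexicographically instead.

```python
from collections import defaultdict

# Hypothetical rows of (id, key, value); the real table is not shown in the question.
rows = [
    (1, "a", 10),
    (1, "b", 30),
    (1, "c", 20),
    (2, "d", 5),
]

def grouped_json(rows):
    """Build one '{"k":"v",...}' string per id, pairs sorted by value descending."""
    groups = defaultdict(list)
    for row_id, key, value in rows:
        groups[row_id].append((key, value))
    result = {}
    for row_id, pairs in groups.items():
        # Same role as 'order by s.v desc' inside the window definition.
        ordered = sorted(pairs, key=lambda kv: kv[1], reverse=True)
        body = ",".join('"{}":"{}"'.format(k, v) for k, v in ordered)
        result[row_id] = "{" + body + "}"
    return result

result = grouped_json(rows)
print(result[1])  # prints: {"b":"30","c":"20","a":"10"}
```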

optimize query with minus oracle

I want to optimize a query that uses MINUS because it takes too much time. Any help would be appreciated.
I have two tables A and B,
Table A: ID, value
Table B: ID
I want all of the Table A records that are not in Table B, showing the value.
I wrote something like this:
Select ID, value
FROM A
WHERE value> 70
MINUS
Select ID
FROM B;
However, this query is taking too long. Any tips on how best to write this simple query?
Thank you for your attention.
Are ID and Value indexed?
The relative performance of MINUS and NOT EXISTS depends on the situation:
It really depends on a bunch of factors.
A MINUS will do a full table scan on both tables unless there is some
criteria in the where clause of both queries that allows an index
range scan. A MINUS also requires that both queries have the same
number of columns, and that each column has the same data type as the
corresponding column in the other query (or one convertible to the
same type). A MINUS will return all rows from the first query where
there is not an exact match column for column with the second query. A
MINUS also requires an implicit sort of both queries.
NOT EXISTS will read the sub-query once for each row in the outer
query. If the correlation field (you are running a correlated
sub-query?) is an indexed field, then only an index scan is done.
The choice of which construct to use depends on the type of data you
want to return, and also the relative sizes of the two tables/queries.
If the outer table is small relative to the inner one, and the inner
table is indexed (preferably a unique index but not required) on the
correlation field, then NOT EXISTS will probably be faster since the
index lookup will be pretty fast, and only executed a relatively few
times. If both tables are roughly the same size, then MINUS might be
faster, particularly if you can live with only seeing fields that you
are comparing on.
Minus operator versus 'not exists' for faster SQL query - Oracle Community Forums
You could use NOT EXISTS like so:
SELECT a.ID, a.Value
From a
where a.value > 70
and not exists(
Select b.ID
From B
Where b.ID = a.ID)
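The two approaches can be compared on a small data set. This is only a sketch using Python's built-in sqlite3 module, not Oracle: SQLite spells the set-difference operator EXCEPT rather than MINUS, and the rows are invented. It illustrates the column-count point from the quote above, not performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, value INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1, 80), (2, 90), (3, 60), (4, 75);
    INSERT INTO b VALUES (2), (5);
""")

# Set difference: both branches must select the same number of columns,
# so only id can be compared. SQLite's EXCEPT stands in for Oracle's MINUS.
except_ids = conn.execute(
    "SELECT id FROM a WHERE value > 70 EXCEPT SELECT id FROM b ORDER BY id"
).fetchall()

# Anti-join via NOT EXISTS keeps access to a.value.
not_exists_rows = conn.execute(
    """
    SELECT a.id, a.value
    FROM a
    WHERE a.value > 70
      AND NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
    ORDER BY a.id
    """
).fetchall()

print(except_ids)       # prints: [(1,), (4,)]
print(not_exists_rows)  # prints: [(1, 80), (4, 75)]
```

Both queries agree on which ids survive, but only the NOT EXISTS form can return the value column alongside them.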
EDIT: I've produced some dummy data and two datasets for testing to prove the performance increases of indexing. Note: I did this in MySQL since I don't have Oracle on my Macbook.
Table A has 2600 records with 2 columns: ID, val.
ID is an autoincrement integer
Val varchar(255)
Table B has one column, ID, but more records than Table A. Its ID auto-increments in gaps of 3.
You can reproduce this if you wish: Pastebin - SQL Dummy Data
Here is the query I will be using:
select a.id, a.val from tablea a
where length(a.val) > 3
and not exists(
select b.id from tableb b where b.id = a.id
);
Without Indexes, the runtime is 986ms with 1685 rows.
Now we add the indexes:
ALTER TABLE `tablea` ADD INDEX `id` (`id`);
ALTER TABLE `tableb` ADD INDEX `id` (`id`);
With indexes, the runtime is 14ms with 1685 rows. That's 1.42% of the time it took without indexes!

Can I use the column in the ORDER BY clause

I have a specific requirement. This is my query. Here amount is a column in my table, but I did not include it in the SELECT list. Can I still use this column in the ORDER BY clause?
SELECT stud_name, stud_roll, stud_prg
FROM programcl
ORDER BY 3, amount, 1;
Yes, you can mix both positional and named assignments in your ORDER BY clause.
The positional assignments must appear in your SELECT list. The named assignments do not have to.
Can I use this column in the ORDER BY clause?
Yes, of course you can order by a column that was not selected in your SELECT statement.
For example
select col1 from tab1
order by col2;
This way you get the values of col1, displayed in the order of col2.
It's worth trying.
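The query from the question can be tried directly on a small table. The sketch below runs it with Python's built-in sqlite3 module, which also accepts positional and named ORDER BY terms together; the rows are invented, and behaviour in other databases should be checked separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE programcl (stud_name TEXT, stud_roll INTEGER, stud_prg TEXT, amount INTEGER);
    INSERT INTO programcl VALUES
        ('Asha', 1, 'BSc', 300),
        ('Ben',  2, 'BSc', 100),
        ('Cara', 3, 'BA',  200);
""")

# Position 3 is stud_prg and position 1 is stud_name; amount is sorted on
# even though it does not appear in the select list.
rows = conn.execute(
    "SELECT stud_name, stud_roll, stud_prg FROM programcl ORDER BY 3, amount, 1"
).fetchall()

print(rows)  # prints: [('Cara', 3, 'BA'), ('Ben', 2, 'BSc'), ('Asha', 1, 'BSc')]
```

Within the BSc group, Ben comes before Asha because his amount is lower, which shows the unselected column taking part in the sort.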
