How to show attributes with more than 50% missing values in Hive - hadoop

I have a dataset with the columns
**model, mileage, manufacture, engine_displacement, engine_power, body_type, color_slug, skt_year, transmission, door_count, seat_count, fuel_type, date_created, date_seen, price**
I want to see how many missing values there are in each attribute, and to show which columns have more than 50% missing values.
How can I achieve this in Hive?

You can write SQL in Hive to calculate how many nulls there are in a column. Unfortunately, you need to calculate this for each column separately.
select
    tot_count,
    null_model count_null_model,
    100 * null_model / tot_count percent_null_model,
    null_mileage count_null_mileage,
    100 * null_mileage / tot_count percent_null_mileage,
    ...
from
    (select count(*) tot_count,
            sum(if(mileage is null, 1, 0)) as null_mileage,
            sum(if(model is null, 1, 0)) as null_model,
            ...
     from my_table) rs
Here
sum(if(mileage is null, 1, 0)) as null_mileage calculates how many null values exist in the mileage column.
The outer 100 * null_mileage / tot_count percent_null_mileage calculates the percentage of nulls. If you want, you can put a filter on it, like 100 * null_mileage / tot_count > 50 (see the sketch below).
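Since a WHERE filter can only drop whole result rows (and the query above returns a single row), one way to make the 50% filter usable is to reshape the output to one row per attribute. A minimal sketch with two of the columns; the remaining columns follow the same pattern:
select col_name, pct_null
from (
    select 'model' as col_name,
           100 * sum(if(model is null, 1, 0)) / count(*) as pct_null
    from my_table
    union all
    select 'mileage',
           100 * sum(if(mileage is null, 1, 0)) / count(*)
    from my_table
) t
where pct_null > 50;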

Related

How to calculate longest period between two specific dates in SQL?

I have a problem with the following task. I have a table Warehouse containing a list of items that a company has in stock. This
table contains the columns ItemID, ItemTypeID, InTime and OutTime, where InTime (OutTime)
specifies the point in time at which a respective item entered (left) the warehouse. I have to calculate the longest period that the company has gone without an item entering or leaving the warehouse. I am trying to solve it this way:
select MAX(OutTime-InTime) from Warehouse where OutTime is not null
Is my understanding correct? Because I believe that it is not ;)
You want the greatest gap between any two consecutive actions (item entering or leaving the warehouse). One method is to unpivot the in and out times to rows, then use lag() to get the date of the "previous" action. The final step is aggregation:
select max(x_time - lag_x_time) as max_time_diff
from (
    select x_time,
           lag(x_time) over (order by x_time) as lag_x_time
    from (
        select in_time as x_time from warehouse
        union all
        select out_time from warehouse
    )
)
You can perform date calculations directly in Oracle.
The result is calculated in days.
If you want to do it in hours, multiply the result by 24.
To calculate the duration in days, and to check all the information in the table:
SELECT round(OutTime - InTime) AS periodDay, Warehouse.*
FROM Warehouse
WHERE OutTime is not null
ORDER BY periodDay DESC
To calculate the duration in hours:
SELECT round((OutTime - InTime) * 24) AS periodHour, Warehouse.*
FROM Warehouse
WHERE OutTime is not null
ORDER BY periodHour DESC
round() is used to remove the fractional digits.
Select only the record with maximum period.
SELECT *
FROM Warehouse
WHERE (OutTime - InTime) =
( SELECT MAX(OutTime - InTime) FROM Warehouse)
Select only the record with maximum period, with the period indicated.
SELECT (OutTime - InTime) AS period, Warehouse.*
FROM Warehouse
WHERE (OutTime - InTime) =
( SELECT MAX(OutTime - InTime) FROM Warehouse)
When finding the longest period, the condition WHERE OutTime IS NOT NULL is not needed: rows with a NULL OutTime yield a NULL period, which never equals the maximum.
SQL Server has DATEDIFF; in Oracle you can just subtract one date from the other.
The code looks OK. Oracle has a live SQL tool where you can test queries in your browser, which should help you:
https://livesql.oracle.com/

I tested a case about "Subquery in Order By" in SQL Developer

I have a question about a subquery in the ORDER BY clause. The query below returns an error. Does it mean that a subquery in the ORDER BY clause must be scalar?
select *
from employees
order by (select * from employees where first_name ='Steven' and last_name='King');
Error:
ORA-00913: too many values
00913. 00000 - "too many values"
Yes, it means that if you use a subquery in ORDER BY it must be scalar.
With select * your subquery returns multiple columns, and the DBMS would not know which of these to use for the sorting. And if you selected one column only, you would still have to make sure you only select one row, of course. (The difference is that Oracle sees the too-many-columns problem immediately, but detects too many rows only when fetching the data.)
This would be allowed:
select * from employees
order by (select birthdate from employees where employee_id = 12345);
This is a scalar subquery, because it returns only one value (one column, one row). But of course this still makes as little sense as your original query, because the subquery result is independent of the main query, i.e. it returns the same value for every row in the table, and thus no sorting takes effect.
A last remark: a subquery in ORDER BY seldom makes sense, because it would mean you order by something you don't display. The exception is when looking up a sort key. E.g.:
select *
from products p
where type = 'shirt' and color = 'blue' and size in ('S', 'M', 'L', 'XL')
order by (select sortkey from sizes s where s.size = p.size);
It means that the valid options for the ORDER BY clause are
an expression,
a position, or
a column alias.
A subquery is none of these.
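For illustration, a minimal sketch of the three valid forms, reusing the employees table from the question:
-- expression
select * from employees order by upper(last_name);
-- position (here, the 2nd column of the select list)
select first_name, last_name from employees order by 2;
-- column alias
select first_name || ' ' || last_name as full_name from employees order by full_name;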

Trying to display the top 3 amounts from a table using a SQL query in Oracle 11g; the column is of varchar type

I am trying to list the top 3 records from a table based on an amount stored in a column FTE_TMUSD, which is of varchar datatype.
Below is the query I tried:
SELECT * FROM
(
SELECT * FROM FSE_TM_ENTRY
ORDER BY FTE_TMUSD desc
)
WHERE rownum <= 3
ORDER BY FTE_TMUSD DESC ;
The output I got:
972, 9680, 963 --> FTE_TMUSD values, which are not in descending order
I am expecting output that displays the top 3 values.
That should work; the inline view is ordered by FTE_TMUSD in descending order, and you're selecting values from it.
What looks suspicious are the values you specified as the result. It appears that FTE_TMUSD's datatype is VARCHAR2 (ah, yes - it is, you said so). That means values are sorted as strings, not as numbers - and it seems that you expect numbers. So, apply TO_NUMBER to that column. Note that it will fail if the column contains anything but numbers (for example, if there's a value 972C).
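For example, a sketch of your query with the conversion applied (assuming every FTE_TMUSD value is numeric):
SELECT *
FROM
(
    SELECT * FROM FSE_TM_ENTRY
    ORDER BY TO_NUMBER(FTE_TMUSD) DESC
)
WHERE rownum <= 3;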
Also, an alternative to your query might be the use of analytic functions, such as row_number:
with temp as
(select f.*,
row_number() over (order by to_number(f.fte_tmusd) desc) rn
from fse_tm_entry f
)
select *
from temp
where rn <= 3;

Delete duplicate rows from a BigQuery table

I have a table with >1M rows of data and 20+ columns.
Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).
If possible I would like to retain the original table name and remove the duplicate records from my problematic column; otherwise, I could create a new table (tableXfinal) with the same schema but without the duplicates.
I am not proficient in SQL or any other programming language so please excuse my ignorance.
delete from Accidents.CleanedFilledCombined
where Fixed_Accident_Index in (
    select Fixed_Accident_Index
    from Accidents.CleanedFilledCombined
    group by Fixed_Accident_Index
    having count(Fixed_Accident_Index) > 1
);
You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).
A query that should work is here:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
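If you need to control which duplicate survives, add an ORDER BY inside the window. A sketch in standard SQL, assuming a hypothetical date_created column used to keep the newest row:
SELECT * EXCEPT(row_number)
FROM (
  SELECT
      *,
      ROW_NUMBER() OVER (
          PARTITION BY Fixed_Accident_Index
          -- date_created is hypothetical; substitute your own tie-breaker column
          ORDER BY date_created DESC
      ) AS row_number
  FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1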
UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see:
https://stackoverflow.com/a/57900778/132438
An alternative to Jordan's answer - this one scales better when there are many duplicates:
#standardSQL
SELECT event.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.created_at DESC LIMIT 1
)[OFFSET(0)] event
FROM `githubarchive.month.201706` t
# GROUP BY the id you are de-duplicating by
GROUP BY actor.id
)
Or a shorter version (takes any row, instead of the newest one):
SELECT k.*
FROM (
SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
FROM `fh-bigquery.reddit_comments.2017_01` x
GROUP BY id
)
To de-duplicate rows on an existing table:
CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k
FROM `deleting.deduplicating_table` row
GROUP BY id
)
Not sure why nobody has mentioned a DISTINCT query.
Here is a way to clean duplicate rows:
CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table
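Note that DISTINCT only removes rows that are identical in every column. A minimal demonstration:
SELECT DISTINCT *
FROM UNNEST(ARRAY<STRUCT<id INT64, v STRING>>[
  (1, 'a'),
  (1, 'a'),
  (1, 'b')
])
-- returns two rows, (1, 'a') and (1, 'b'): rows differing in any column are kept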
If your schema doesn't have any RECORD (nested) fields, the variation of Jordan's answer below will work well enough, writing over the same table, a new one, etc.
SELECT <list of original fields>
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos
FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1
In the more generic case - with a complex schema containing RECORD/nested fields, etc. - the above approach can be a challenge.
I would propose trying the tabledata.insertAll API with rows[].insertId set to the respective Fixed_Accident_Index for each row.
In that case duplicate rows will be eliminated by BigQuery.
Of course, this involves some client-side coding - so it might not be relevant for this particular question.
I haven't tried this approach myself either, but I feel it might be interesting to try :o)
If you have a large partitioned table and only have duplicates in a certain partition range, you don't want to scan or process the whole table. Use the MERGE SQL below with predicates on the partition range:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partition table, using surrogate_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `gcp_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
An easier answer, without a subselect:
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY row_number = 1
The WHERE TRUE is necessary because QUALIFY needs a WHERE, GROUP BY or HAVING clause.
Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:
CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT
Fixed_Accident_Index,
ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;
To be safe, make sure you back up the original table before you run this ^^
I don't recommend using the ROW_NUMBER() OVER() approach if possible, since you may run into BigQuery memory limits and get unexpected errors.
Update the BigQuery schema with a new column bq_uuid, NULLABLE and of type STRING.

Create duplicate rows, for example by running the same command 5 times:
insert into `beginner-290513.917834811114.messages` (id, type, flow, updated_at)
values (19999, "hello", "inbound", '2021-06-08T12:09:03.693646')
Check that duplicate entries exist:
select * from `beginner-290513.917834811114.messages` where id = 19999
Use the GENERATE_UUID function to generate a UUID for each message:

UPDATE `beginner-290513.917834811114.messages`
SET bq_uuid = GENERATE_UUID()
WHERE id > 0
Clean duplicate entries:
DELETE FROM `beginner-290513.917834811114.messages`
WHERE bq_uuid IN (
  SELECT bq_uuid
  FROM (
    SELECT bq_uuid,
           ROW_NUMBER() OVER (PARTITION BY updated_at
                              ORDER BY bq_uuid) AS row_num
    FROM `beginner-290513.917834811114.messages`
  ) t
  WHERE t.row_num > 1
);

MDX rather complicated sorting

I can't find a way to sort my query. This is the simple query:
SELECT {[Measures].[IB]}
ON COLUMNS,
{[Dim_Product_Models_new].[PLA].members } *
{[Dim Dates_new].[Date Full].&[2013-02-01]:[Dim Dates_new].[Date Full].&[2014-01-01]}
ON ROWS
FROM [cub_dashboard_spares]
The thing is, I get a result for 6 PLAs combined across 12 months (72 rows in total); however, it is sorted alphabetically by PLA.
What I need is to sort the PLAs based on a measure in the last month (2014-01-01 in this case).
Is there any way to perform this task so that the grouping (PLAs, dates from 2013-02 to 2013-12) is preserved, but only the order of the PLAs is different? (The PLA with the highest measure in the last month would come first, and so on.)
Thank you very much for any kind of help
Just put the sorted set on the rows, using the Order function. The third parameter of this function is DESC if you want to sort within each hierarchy level, but still want to get parents before children (like ALL before the single attribute members), or BDESC if you want to sort across all levels.
SELECT {[Measures].[IB]}
ON COLUMNS,
Order({[Dim_Product_Models_new].[PLA].members },
([Measures].[IB], [Dim Dates_new].[Date Full].&[2014-01-01]),
DESC)
*
{[Dim Dates_new].[Date Full].&[2013-02-01]:[Dim Dates_new].[Date Full].&[2014-01-01]}
ON ROWS
FROM [cub_dashboard_spares]
The Order function over a crossjoin should preserve the initial order of the first set, so reversing the order of the sets in the crossjoin will do the job:
SELECT
{
[Measures].[IB]
} ON COLUMNS,
order(
{[Dim Dates_new].[Date Full].&[2013-02-01]:[Dim Dates_new].[Date Full].&[2014-01-01]} *
{[Dim_Product_Models_new].[PLA].members } ,
[Measures].[IB],
desc
) ON ROWS
FROM [cub_dashboard_spares]
If you want to preserve the order of appearance of the column labels, you can use the Generate function, as in the following example from the Adventure Works cube:
SELECT
{[Measures].[Internet Sales Amount]} ON 0
,Generate
(
{[Customer].[Country].&[Australia]:[Customer].[Country].&[United Kingdom]}
,(
Order
(
[Date].[Calendar Year].[Calendar Year].MEMBERS
,(
[Customer].[Country].CurrentMember
,[Measures].[Internet Sales Amount]
)
,DESC
)
,[Customer].[Country].CurrentMember
)
) ON 1
FROM [Adventure Works];