Identify mismatched columns of two hive views - hadoop

I have two Hive views, hiveView1 and hiveView2.
I would like to compare them and find which columns mismatch.
Step 1:
I was able to compare the data between the two views with the commands below:
select sum(hash(*)) from hiveView1;
select sum(hash(*)) from hiveView2;
Step 2:
If the results from step 1 differ, I have to identify which columns mismatch. Is there a script to identify the mismatched columns?
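One way to narrow this down is to run the same checksum per column rather than over the whole row. The sketch below is only an illustration: it assumes both views expose the same column names (the list could be taken from DESCRIBE hiveView1), and the helper names and the `_hash` alias suffix are made up for this example. It builds the query text and compares the two result rows; running the queries against Hive is left to whatever client you already use.

```python
from typing import List, Sequence


def build_column_hash_query(view: str, columns: Sequence[str]) -> str:
    """Build a HiveQL query with one sum(hash(col)) expression per column."""
    # Backticks quote the identifiers; the `_hash` alias suffix is arbitrary.
    parts = [f"sum(hash(`{c}`)) AS `{c}_hash`" for c in columns]
    return f"SELECT {', '.join(parts)} FROM {view}"


def mismatched_columns(
    columns: Sequence[str], row1: Sequence[int], row2: Sequence[int]
) -> List[str]:
    """Return the columns whose per-column checksums differ between the views."""
    return [c for c, a, b in zip(columns, row1, row2) if a != b]


if __name__ == "__main__":
    cols = ["id", "name", "amount"]  # hypothetical column names
    print(build_column_hash_query("hiveView1", cols))
    print(build_column_hash_query("hiveView2", cols))
    # The tuples below stand in for the single row each query would return.
    print(mismatched_columns(cols, (10, 20, 30), (10, 21, 30)))
```

Equal checksums do not strictly prove that two columns are identical, since hash sums can collide, but a differing checksum does point to a column that is worth inspecting row by row.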

Related

Exclude one or more columns in Impala

Can I exclude one or more columns in Impala without specifying all the columns in the table? Something like:
SELECT * [except columnA] FROM tableA
Why not create a view of the data without ColumnA? You would have to list all the columns once, but after that you could keep querying the view without that column being included.
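If the table is wide, the column list for such a view can be generated instead of typed. This is a minimal sketch, assuming you already have the full column list (for example from DESCRIBE tableA); the function name, the view name and the ColumnB/ColumnC columns are hypothetical.

```python
from typing import Iterable, Sequence


def build_view_without(
    view: str, table: str, all_columns: Sequence[str], excluded: Iterable[str]
) -> str:
    """Build a CREATE VIEW statement that selects every column except `excluded`."""
    excluded_set = set(excluded)
    unknown = excluded_set - set(all_columns)
    if unknown:
        raise ValueError(f"columns not in table: {sorted(unknown)}")
    kept = [c for c in all_columns if c not in excluded_set]
    if not kept:
        raise ValueError("cannot exclude every column")
    return f"CREATE VIEW {view} AS SELECT {', '.join(kept)} FROM {table}"


if __name__ == "__main__":
    # Hypothetical column list for tableA.
    print(build_view_without(
        "tableA_without_columnA", "tableA",
        ["ColumnA", "ColumnB", "ColumnC"], ["ColumnA"],
    ))
```

The generated statement still has to be run once in Impala; after that, SELECT * against the view returns everything except the excluded column.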

Perform Incremental Sqoop on table that contains joins?

I have some very large tables that I am trying to sqoop from a Source System Data Warehouse into HDFS, but limited bandwidth to do so. I would like to only pull the columns I need, and minimize the run-time for getting the tables stood up.
The sqoop currently pulls something like this:
SELECT
ColumnA,
ColumnB,
....
ColumnN
FROM
TABLE_A
LEFT JOIN
TABLE_B
ON
...
LEFT JOIN
TABLE_N
....
Is it possible to perform an incremental sqoop, given that the data is stored in a star schema and the dimensions could update independently of the facts?
Or is the only solution to sqoop each table incrementally, for just the columns that I need, and perform the joins on the HDFS side?
For incremental imports you need to use the --incremental flag. See the link below for more information:
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
You need to specify --incremental to tell Sqoop that you want an incremental load, --check-column to specify which column is used for the incremental import, and --last-value to say from which value the next load should start.
This is just half the picture; there are more ways to do this. For example, you can use the --query option with a query like SELECT * FROM table WHERE column > 123. This is basically the same thing. You would need to record the last (maximum) value of the selected column and use it for the next import.
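The two variants described above can be sketched as argument lists for a wrapper script. This is an illustration under assumptions, not a tested job: the JDBC URL, target directories and the LOAD_ID check column are hypothetical, and the last value is assumed to be numeric, as in the `column > 123` example. Note that Sqoop requires the literal `$CONDITIONS` token in the WHERE clause of a free-form --query, and a --query import needs either --split-by or a single mapper (-m 1).

```python
from typing import List


def incremental_import_args(
    connect: str, table: str, target_dir: str, check_column: str, last_value: int
) -> List[str]:
    """Arguments for a table import using Sqoop's built-in incremental append mode."""
    return [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
        "--incremental", "append",
        "--check-column", check_column,
        "--last-value", str(last_value),
    ]


def query_import_args(
    connect: str, table: str, target_dir: str, check_column: str, last_value: int
) -> List[str]:
    """Arguments for a free-form query import that filters on the check column."""
    # $CONDITIONS must appear literally in the query; Sqoop substitutes it.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {check_column} > {last_value} AND $CONDITIONS"
    )
    return [
        "sqoop", "import",
        "--connect", connect,
        "--query", query,
        "--target-dir", target_dir,
        "-m", "1",  # single mapper, so no --split-by is needed
    ]


if __name__ == "__main__":
    url = "jdbc:mysql://dbhost/warehouse"  # hypothetical connection string
    print(incremental_import_args(url, "TABLE_A", "/data/table_a", "LOAD_ID", 123))
    print(query_import_args(url, "TABLE_A", "/data/table_a_delta", "LOAD_ID", 123))
```

Passing the list to subprocess.run(args, check=True) rather than building a shell string keeps the shell from expanding `$CONDITIONS`. The value to use for the next run still has to be recorded somewhere after each import.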

external table with partitions in hive

I have a bunch of TSV files in HDFS in a directory structure that follows the partition convention, where event_dt is the partition:
some_path/event_dt=2017-04-30
some_path/event_dt=2017-05-01
and so on.
The issue is that event_dt is also one of the columns in the files (the second one, in particular). But I cannot declare it, since event_dt cannot appear both in the table schema and in the PARTITIONED BY clause. That triggers:
Column repeated in partitioning columns
Is there a way around this other than using different names? It is, after all, the same information.
There are 3 options if you don't want to rename the column:
1. If event_dt is the last column in your data files, create the table without this column.
2. Drop this field from the data during ingestion, copying the data from one location to another where the target table is partitioned by event_dt (not the most efficient way).
3. Create a view on top of your table that excludes one of the columns; the underlying table will still need the rename, though.

Load multiple files content to table using SQL loader

How can I insert data from multiple files with different columns into a single Oracle table using SQL*Loader with a single control file?
Basically, we have 3 CSV files:
file 1 has columns a,b,c
file 2 has columns d,e,f
file 3 has columns g,h,i
We need to load these attributes into a table named "TableTest", which has columns a,b,c,d,e,f,g,h,i, using a single control file.
Thanks in advance.
You really can't. You can either splice the .csv files together (a lot of nasty work) or create 3 tables to load and then use PL/SQL or SQL to join them together into your target table.
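For the splicing route, a rough sketch is below. It rests on a strong assumption that the question does not confirm: the three files have the same number of rows, and row N of each file belongs to the same target record. If the rows have to be matched on a key instead, loading into separate tables and joining in SQL is the safer path. The file names and function names are hypothetical.

```python
import csv
from contextlib import ExitStack
from itertools import zip_longest
from typing import Iterable, Iterator, List, Sequence


def splice_rows(*row_sources: Iterable[Sequence[str]]) -> Iterator[List[str]]:
    """Concatenate the Nth row of every source into one wide row."""
    for rows in zip_longest(*row_sources):
        # zip_longest pads exhausted sources with None, which means the
        # inputs do not line up row for row.
        if any(r is None for r in rows):
            raise ValueError("input files have different row counts")
        merged: List[str] = []
        for r in rows:
            merged.extend(r)
        yield merged


def splice_csv_files(in_paths: Sequence[str], out_path: str) -> None:
    """Write one CSV whose rows are the side-by-side rows of the input CSVs."""
    with ExitStack() as stack:
        readers = [
            csv.reader(stack.enter_context(open(p, newline="")))
            for p in in_paths
        ]
        out = stack.enter_context(open(out_path, "w", newline=""))
        csv.writer(out).writerows(splice_rows(*readers))


if __name__ == "__main__":
    # In-memory rows standing in for file1 (a,b,c), file2 (d,e,f), file3 (g,h,i).
    for row in splice_rows([["1", "2", "3"]], [["4", "5", "6"]], [["7", "8", "9"]]):
        print(row)
```

The spliced file then has all nine fields per row, so a plain single-table control file can load it into TableTest.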

Loading multiple concatenated CSV files into Oracle with SQLLDR

I have a dump of several PostgreSQL tables in a self-contained CSV file which I want to import into an Oracle database with a matching schema. I found several posts on how to distribute data from one CSV "table" to multiple Oracle tables, but my problem is several DIFFERENT CSV "tables" in the same file.
Is it possible to specify table separators or somehow mark new tables in an SQLLDR control file, or do I have to split up the file manually before feeding it to SQLLDR?
That depends on your data. How do you determine which table a row is destined for? If you can determine the table based on data in the row, then it is fairly easy to do with a WHEN clause.
LOAD DATA
INFILE bunchotables.dat
INTO TABLE foo WHEN somecol = 'pick me, pick me' (
...column defs...
)
INTO TABLE bar WHEN somecol = 'leave me alone' (
... column defs
)
If you've got some sort of header row that determines the target table, then you are going to have to split the file beforehand with another utility.
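Such a splitting utility can be quite small. The sketch below assumes a made-up header format, a line of the form `# table: <name>` before each table's rows; the real marker in your dump will differ, so the regular expression has to be adapted. The function names are likewise hypothetical.

```python
import os
import re
from typing import Dict, Iterable, List

# Hypothetical header format: "# table: foo". Adjust to the real dump.
HEADER = re.compile(r"^# table: (\w+)\s*$")


def split_by_table(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group data lines under the most recent table header line."""
    tables: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        match = HEADER.match(line)
        if match:
            # Start (or resume) collecting rows for this table.
            current = tables.setdefault(match.group(1), [])
        elif current is None:
            if line.strip():
                raise ValueError("data row found before the first table header")
        else:
            current.append(line)
    return tables


def write_tables(tables: Dict[str, List[str]], out_dir: str) -> None:
    """Write each table's rows to <out_dir>/<table>.csv."""
    for name, rows in tables.items():
        with open(os.path.join(out_dir, f"{name}.csv"), "w", newline="") as f:
            f.writelines(rows)


if __name__ == "__main__":
    sample = ["# table: foo\n", "1,a\n", "2,b\n", "# table: bar\n", "x,y\n"]
    for table, rows in split_by_table(sample).items():
        print(table, rows)
```

Each resulting file can then be loaded with its own INTO TABLE clause, or with a separate control file per table.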
