How do I get the row count for all tables in the same folder in Hadoop Hive, and can it be done from a SAS server?

I want to get the row count for every table under a folder called "planning" in a Hadoop Hive database, but I couldn't figure out a way to do so. Most of these tables are not related to each other, so I can't use a full join on a common key.
Is there a way to count the rows of each table and write the results to one table, with each row representing one table name?
The table names I have:
add_on
sales
ppu
ssu
car
Secondly, I am a SAS developer. Is the above process doable in SAS? I tried the data dictionary, but "nobs" is completely blank for this library. All other SAS datasets display "nobs" properly. I wonder why, and how to get around it.
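One possible approach, sketched here in Python: generate a single UNION ALL statement that returns one (table_name, row_count) row per table. The sketch assumes "planning" is the name of the Hive database and uses the five table names from the question; build_count_query is a hypothetical helper, not part of any library. As for SAS, "nobs" is most likely blank because the library points at an external DBMS rather than at SAS datasets, so SAS has no stored row count to report; the generated text could be submitted from SAS through explicit SQL pass-through, which is not shown here.

```python
def build_count_query(database, tables):
    """Return one HiveQL statement yielding a (table_name, row_count) row per table."""
    if not tables:
        raise ValueError("at least one table name is required")
    selects = [
        f"SELECT '{t}' AS table_name, COUNT(*) AS row_count FROM {database}.{t}"
        for t in tables
    ]
    # UNION ALL stacks the per-table results; no common join key is needed.
    return "\nUNION ALL\n".join(selects)


# Table names from the question; "planning" as the database name is an assumption.
tables = ["add_on", "sales", "ppu", "ssu", "car"]
query = build_count_query("planning", tables)
print(query)
```

The printed statement can be wrapped in a CREATE TABLE ... AS if the counts should land in a table of their own.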

Related

Hive table or view? Which should be the right approach?

I am new to HDFS/Hive and need some advice. I have a background in RDBMS data modelling.
I have a requirement for a daily report. The report needs to fetch data from two Hive staging tables.
What if I create a table in Hive, write a view that fetches records from staging to populate that table, and then create a Hive view on top of the table with a WHERE clause that selects one day of data?
1. HIVE staging tables ---> 2. View to populate HIVE table ---> 3. HIVE table ---> 4. View to fetch data from HIVE table created in 3.
Or what if I create a view directly on top of the two staging Hive tables (joining the two tables, with a WHERE clause to fetch one day of data)?
1. HIVE staging tables ---> 2. View to fetch data from HIVE staging tables
I want to know Hive best practices and solution strategies.
With or without views, you need an ETL process to load the tables. The ETL process can join, aggregate, and so on, so you end up with joined and aggregated data in the form of a star/snowflake schema or a report table. Why would you need views here? Views are for reusing common queries, reducing the complexity of long queries, providing interfaces to data, creating logical entities, etc. You do not necessarily need a view simply to join tables and load data into another table. It all depends on your requirements. If reports must query data fast, the data should be precalculated by the ETL process. A view is just a wrapper over a query; it is computed each time you query it.
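The difference between a view and a precalculated table can be seen in a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example runs anywhere); the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical staging table with two rows for one day.
cur.execute("CREATE TABLE staging_orders (order_id INTEGER, day TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO staging_orders VALUES (?, ?, ?)",
    [(1, "2020-01-01", 10.0), (2, "2020-01-01", 5.0)],
)

# A view stores only the query text; it is evaluated on every read.
cur.execute(
    "CREATE VIEW v_daily AS "
    "SELECT day, SUM(amount) AS total FROM staging_orders GROUP BY day"
)

# An ETL-style step stores the computed rows once.
cur.execute("CREATE TABLE report_daily AS SELECT * FROM v_daily")

# New data lands in staging after the ETL step has run.
cur.execute("INSERT INTO staging_orders VALUES (3, '2020-01-01', 2.5)")

view_total = cur.execute("SELECT total FROM v_daily").fetchone()[0]
table_total = cur.execute("SELECT total FROM report_daily").fetchone()[0]
print(view_total, table_total)
conn.close()
```

The view picks up the late row because it recomputes the aggregate, while the report table keeps the value from the last load and is cheap to read.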
I think it's best if you have zero views and one single table, partitioned on the date field (but you can't partition on the date, so you have to store it as a string). This makes it easier for the end user, who has only one table to deal with.
This gives your users the ability to query only the latest date they want, or to use the full table.

How Hive Partition works

I want to know how Hive partitioning works. I know the concept, but I am trying to understand how it works internally and how it stores the data in the exact partition.
Let's say I have a table with a dynamic partition on year, and I have ingested data from 2013. How does Hive create the partitions and store each record in the right one?
If the table is not partitioned, all the data is stored in one directory without order. If the table is partitioned (e.g. by year), the data is stored separately in different directories. Each directory corresponds to one year.
For a non-partitioned table, when you want to fetch the data for year=2010, Hive has to scan the whole table to find the 2010 records. If the table is partitioned, Hive just goes to the year=2010 directory, which is faster and more I/O efficient.
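The directory-per-value layout described above can be sketched in a few lines of Python. This is a simplified in-memory model, not Hive's implementation; the warehouse path and table name are hypothetical.

```python
from collections import defaultdict

# Hypothetical warehouse path and table name, for illustration only.
TABLE_DIR = "/user/hive/warehouse/sales"


def write_partitioned(rows, table_dir, column):
    """Group rows into one directory per distinct value of the partition column."""
    layout = defaultdict(list)
    for row in rows:
        directory = f"{table_dir}/{column}={row[column]}"
        # The partition value lives in the directory name, not in the data rows.
        data = {k: v for k, v in row.items() if k != column}
        layout[directory].append(data)
    return dict(layout)


def read_partition(layout, table_dir, column, value):
    """Partition pruning: touch only the directory that matches the filter."""
    return layout.get(f"{table_dir}/{column}={value}", [])


rows = [
    {"id": 1, "amount": 10, "year": 2013},
    {"id": 2, "amount": 20, "year": 2014},
    {"id": 3, "amount": 30, "year": 2013},
]
layout = write_partitioned(rows, TABLE_DIR, "year")
print(sorted(layout))
print(read_partition(layout, TABLE_DIR, "year", 2013))
```

With dynamic partitioning, the value of the partition column in each incoming row decides which directory the row goes to, which is what write_partitioned imitates.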
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date.
Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy certain criteria.
Using partitions, it is easy to query a portion of the data.
Tables or partitions are sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of a hash function of some column of the table.
Suppose you need to retrieve the details of all employees who joined in 2012. Without partitions, a query searches the whole table for the required information. However, if you partition the employee data by year and store each year in a separate file, it reduces the query processing time.
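The bucketing idea mentioned above (hash of a column value, modulo the number of buckets) can be sketched as follows. This is a simplified model: treating an integer as its own hash value is an assumption made for readability, and the actual hash function depends on the Hive version and column type. The employee ids are made up.

```python
def bucket_for(key, num_buckets):
    """Map an integer key to a bucket number in the range [0, num_buckets)."""
    # Assumption: the integer is used as its own hash value; real hash functions differ.
    # The mask keeps the value non-negative before taking the modulo.
    return (key & 0x7FFFFFFF) % num_buckets


def assign_buckets(keys, num_buckets):
    """Distribute keys over a fixed number of buckets."""
    buckets = {n: [] for n in range(num_buckets)}
    for key in keys:
        buckets[bucket_for(key, num_buckets)].append(key)
    return buckets


# Hypothetical employee ids spread over 4 buckets.
employee_ids = [101, 102, 103, 104, 105]
buckets = assign_buckets(employee_ids, 4)
print(buckets)
```

Because the bucket is a pure function of the key, a lookup for one id only needs to read the single bucket that id maps to.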

Need some expert help in a Hadoop Hive Pig Scenario

I am still in the process of learning Hadoop and have come across a specific situation:
I have two tables: Table A in MySQL with columns email and address, and Table B in HDFS with columns id, email and address. I have to look up the email in both tables and update Table B with the new rows from Table A (emails that are not present in Table B are new entries in Table A and therefore have to be copied into Table B).
Can I solve this problem using Pig or using Hive script? Can someone please help me with this?
Currently, loading a MySQL table into HDFS takes some effort, using Sqoop or a custom load UDF. Look at this SO link.
Once you have the data in HDFS, it is a matter of doing a left (or right) join to get the difference in rows, creating a new relation as needed, and storing it in HDFS.
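The join-and-difference step can be illustrated with a small Python sketch of a left anti-join on email. In Hive or Pig the same logic is usually written as a left outer join that keeps only the rows where the right-hand side is null. The sample rows are made up, and assigning an id to the new rows is left out because the question does not say how ids are generated.

```python
def rows_missing_from_b(table_a, table_b):
    """Left anti-join on email: rows of A whose email does not occur in B."""
    known_emails = {row["email"] for row in table_b}
    return [row for row in table_a if row["email"] not in known_emails]


# Hypothetical sample data standing in for the MySQL table (A) and the HDFS table (B).
table_a = [
    {"email": "ann@example.com", "address": "1 Main St"},
    {"email": "bob@example.com", "address": "2 Oak Ave"},
]
table_b = [
    {"id": 1, "email": "ann@example.com", "address": "1 Main St"},
]

new_rows = rows_missing_from_b(table_a, table_b)
print(new_rows)
```

Only the rows in new_rows need to be appended to Table B.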

How can I find all the columns in an HBase table?

In the HBase shell, when I use describe 'table_name', only the column families are returned. How can I find out all the columns in each column family?
As @zsxwing said, you need to scan all the rows, since in HBase each row can have a completely different schema (that's part of the power of Hadoop: the ability to store poly-structured data). If you look at the HFile file structure, you can see that HBase doesn't track the columns.
Thus the column families and their settings are in fact the schema of the HBase table, and that's what you get when you 'describe' it.
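What "scan all the rows" amounts to can be sketched in Python. The rows here are a plain dictionary standing in for the result of a full table scan (row key mapped to cells keyed by "family:qualifier"); no HBase client library is used, and the sample data is made up.

```python
from collections import defaultdict


def columns_by_family(rows):
    """Collect every column qualifier seen in any row, grouped by column family."""
    found = defaultdict(set)
    for cells in rows.values():
        for column in cells:
            # HBase column names have the form "family:qualifier".
            family, _, qualifier = column.partition(":")
            found[family].add(qualifier)
    return {family: sorted(qualifiers) for family, qualifiers in found.items()}


# Hypothetical scan result: each row may carry a different set of columns.
scanned_rows = {
    "row1": {"info:name": "ann", "info:email": "ann@example.com"},
    "row2": {"info:name": "bob", "stats:visits": "3"},
}
columns = columns_by_family(scanned_rows)
print(columns)
```

Since each row can introduce new qualifiers, the result is only complete once every row has been visited.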

updating Hive external table with HDFS changes

Let's say I created a Hive external table "myTable" from the file myFile.csv (located in HDFS).
myFile.csv changes every day, so I would like to update "myTable" once a day too.
Is there a HiveQL query that tells Hive to update the table every day?
Thank you.
P.S.
I would like to know if it works the same way with directories: let's say I create a Hive partition from the HDFS directory "myDir" when "myDir" contains 10 files. The next day "myDir" contains 20 files (10 files were added). Do I need to update the Hive partition?
There are basically two types of tables in Hive.
One is the managed table, which is managed by the Hive warehouse: whenever you create such a table, the data is copied into the internal warehouse.
So the query output does not reflect the latest data in the source file.
The other is the external table, for which Hive does not copy the data into the internal warehouse.
Whenever you run a query on the table, it retrieves the data from the file.
So you always get the latest data in the query output.
That is one of the goals of external tables.
You can even drop the table and the data is not lost.
If you add a LOCATION '/path/to/myFile.csv' clause to your table create statement, you shouldn't have to update anything in Hive. It will always use the latest version of the file in queries.
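The "read at query time" behaviour can be modelled with a short Python sketch. It is not Hive code: a temporary local directory stands in for the table's HDFS location (Hive's LOCATION refers to a directory), and query_all is a hypothetical helper that re-reads every file in that directory on each call.

```python
import csv
import tempfile
from pathlib import Path


def query_all(location):
    """Read every file under the location at call time, like a scan of an external table."""
    rows = []
    for path in sorted(Path(location).iterdir()):
        with path.open(newline="") as handle:
            rows.extend(csv.reader(handle))
    return rows


with tempfile.TemporaryDirectory() as location:
    # Day 1: one file in the directory.
    Path(location, "day1.csv").write_text("1,alice\n")
    first = query_all(location)

    # Day 2: another file is added; no table definition is touched.
    Path(location, "day2.csv").write_text("2,bob\n")
    second = query_all(location)

print(len(first), len(second))
```

The second query sees the added file without any update step, which matches the case where files are added to a directory the table or partition already points at. A brand-new partition directory is a different case, because it has to be registered with Hive before queries can see it.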
