external table with partitions in hive - hadoop

I have a bunch of tsv files in HDFS in a directory structure that follows the partition convention where an event_dt is the partition.
some_path/event_dt=2017-04-30
some_path/event_dt=2017-05-01
and so on.
The issue is that event_dt is also one of the columns in the files (the second one, specifically). I cannot declare it as such, because event_dt cannot appear both in the table schema and in the PARTITIONED BY clause. Doing so triggers:
Column repeated in partitioning columns
Is there a way around this other than using different names? It is, after all, the same information.

There are 3 options if you don't want to rename the column:
If event_dt is the last column in your files, create the table without declaring this column (with delimited text, Hive ignores trailing fields that are not in the schema).
During ingestion, strip this field from the data by copying it from the source location into a target table partitioned by event_dt (not the most efficient approach).
Create a view on top of the table that excludes one of the two columns. The underlying table will still need the rename.
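The view option could look roughly like the sketch below. The table, view, and column names (events_raw, events, id, event_dt_in_file, payload) and the absolute LOCATION are assumptions for illustration only; only the event_dt partition layout and the tab-separated format come from the question.

```sql
-- Hypothetical schema: only event_dt and the TSV layout are from the question.
CREATE EXTERNAL TABLE events_raw (
  id               STRING,
  event_dt_in_file STRING,  -- second field of each row, renamed to avoid the clash
  payload          STRING
)
PARTITIONED BY (event_dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/some_path';      -- assumed absolute form of some_path

-- Register the existing event_dt=... directories as partitions.
MSCK REPAIR TABLE events_raw;

-- The view exposes a single event_dt and hides the duplicate field.
CREATE VIEW events AS
SELECT id, event_dt, payload
FROM events_raw;
```

Queries against the view that filter on event_dt should still benefit from partition pruning, since event_dt in the view is the partition column of the underlying table.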

Related

Change schema in an Impala/Hive table with a very large amount of data?

We have a Hive table stored on HDFS with 800+ columns and >65 billion rows (and growing) and need to:
1. Remove a column with a complex type (small array)
2. Add a column with a complex type (small array)
3. Possibly add a handful of other columns (simple type, e.g. string or int)
4. Modify the contents of 3 columns for every row in the database (effectively read it in, make a simple change, write it back out to the same column and row that it came from). I realise this is probably a separate operation to the other three requirements above.
We could set up a new empty table with the new schema and copy the data over (using CREATE TABLE xxxxx AS SELECT ... or INSERT INTO xxxx SELECT ...), but tests suggest it would take 1 - 3 weeks running non-stop. And it's possible we may need to make further minor similar modifications in future.
Is there an efficient, sensible alternative to copying the whole table? Would ALTER TABLE work (at least for the structural changes, items 1 - 3 above)? What are the pros and cons of either option(s)?
Table is going to be queried using Impala, in a Zeppelin-based interface.
Thanks for any advice.

Hive delete duplicate records

In Hive, how can I delete duplicate records? Below is my case.
First, I load data from the products table into products_rcfileformat. There are 25 rows in the products table.
FROM products INSERT OVERWRITE TABLE products_rcfileformat
SELECT *;
Second, I load the same 25 rows from the products table into products_rcfileformat again, but this time I'm NOT using the OVERWRITE clause.
FROM products INSERT INTO TABLE products_rcfileformat
SELECT *;
When I query the data, it gives me a total of 50 rows, which is expected.
Checking in HDFS, it seems a second file, xxx_copy_1, was created instead of the rows being appended to 000000_0.
Now I want to remove the records that come from xxx_copy_1. How can I achieve this with a Hive command? If I'm not mistaken, I can remove the xxx_copy_1 file using hdfs dfs -rm and then rerun the INSERT OVERWRITE command. But I want to know whether this can be done with a Hive command, for example something like a DELETE statement.
Partition your data such that the rows (use window function row_number) you want to drop are in a partition unto themselves. You can then drop the partition without impacting the rest of your table. This is a fairly sustainable model, even if your dataset grows quite large.
More detail about partitioning: www.tutorialspoint.com/hive/hive_partitioning.htm
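A minimal sketch of that idea follows. It assumes the duplicates are whole-row copies and that the table has columns product_id, name, and price; those column names and the products_dedup table are hypothetical. Support for comparison operators in DROP PARTITION may vary by Hive version, so treat the last statement as an assumption to verify.

```sql
-- Hypothetical target table, partitioned by the row_number() rank.
CREATE TABLE products_dedup (
  product_id INT,
  name       STRING,
  price      DOUBLE
)
PARTITIONED BY (rn INT)
STORED AS RCFILE;

-- Allow the partition value to come from the query result.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Each group of identical rows is ranked 1, 2, ...; the rank becomes the partition.
INSERT OVERWRITE TABLE products_dedup PARTITION (rn)
SELECT product_id, name, price,
       row_number() OVER (PARTITION BY product_id, name, price
                          ORDER BY product_id) AS rn
FROM products_rcfileformat;

-- The first copy of every row lives in rn=1; drop the partitions holding the extras.
ALTER TABLE products_dedup DROP IF EXISTS PARTITION (rn > 1);
```

After this, the rn=1 partition holds one copy of each distinct row, and the original table is left untouched until you decide to swap it out.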
"Check from hdfs, it seem hdfs make another copy of file xxx_copy_1 instead of append to 000000_0"
The reason is that files in HDFS are not edited in place. The Hive warehouse files (or whatever the table location is) live in HDFS, so an INSERT INTO writes a second file next to the existing one rather than modifying it.
"Now I want to remove those records that read from xxx_copy_1. How can I achieve this in hive command ?"
Please check this post - Removing DUPLICATE rows in hive based on columns.
Let me know if you are satisfied with the answer there. I have another method, which removes duplicate entries but may not be in the way you want.

How Hive Partition works

I want to know how Hive partitioning works. I know the concept, but I am trying to understand how it works internally and how it stores data in the right partition.
Let's say I have a table with a dynamic partition on year, and I have ingested data from 2013 onwards. How does Hive create the partitions and store each row in the correct one?
If the table is not partitioned, all the data is stored in one directory in no particular order. If the table is partitioned (e.g. by year), the data is stored separately in different directories, and each directory corresponds to one year.
For a non-partitioned table, when you want to fetch the data for year=2010, Hive has to scan the whole table to find the 2010 records. If the table is partitioned, Hive just goes to the year=2010 directory, which is faster and more I/O efficient.
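As a rough illustration of the dynamic case, the sketch below assumes a staging table sales_staging with an order_ts timestamp column; all table and column names here are hypothetical. The partition column is called yr to sidestep any keyword concerns with year.

```sql
-- Hypothetical partitioned table; yr is not stored in the data files,
-- it is encoded in the directory name (.../sales_by_year/yr=2013/).
CREATE TABLE sales_by_year (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (yr INT);

-- Let Hive derive the partition from the last column of the SELECT.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Each row is routed to the directory matching its yr value;
-- missing partitions are created on the fly.
INSERT INTO TABLE sales_by_year PARTITION (yr)
SELECT order_id, amount, year(order_ts) AS yr
FROM sales_staging;

-- Only the yr=2013 directory needs to be read for this query.
SELECT sum(amount) FROM sales_by_year WHERE yr = 2013;
```

With dynamic partitioning, the partition column must come last in the SELECT list, which is how Hive knows which value decides the target directory.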
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date.
Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy certain criteria.
Using partition, it is easy to query a portion of the data.
Tables or partitions are sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of hash function of some column of a table.
Suppose you need to retrieve the details of all employees who joined in 2012. Without partitioning, a query searches the whole table for the required information. However, if you partition the employee data by year and store each year separately, the query only has to read the 2012 data, which reduces the processing time.
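The employee example, combined with bucketing, might be declared along these lines. The column names, the bucket count of 8, and the ORC format are assumptions chosen for illustration.

```sql
-- Hypothetical table: one directory per join_year, and within each
-- directory rows are spread over 8 files by a hash of emp_id.
CREATE TABLE employee (
  emp_id INT,
  name   STRING
)
PARTITIONED BY (join_year INT)
CLUSTERED BY (emp_id) INTO 8 BUCKETS
STORED AS ORC;

-- Reads only the join_year=2012 partition.
SELECT emp_id, name
FROM employee
WHERE join_year = 2012;
```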

comparing data in two tables taking time

I need to query table 1 to find all orders and their created dates (the key is order number and date).
In table 2 (the key is also order number and date), I need to check whether the order exists for a given date.
For this I am scanning table 1 and, for each record, checking whether it exists in table 2. Is there a better way to do this?
In this situation in which your key is identical for both tables, it makes sense to have a single table in which you store both data for Table 1 and Table 2. In that way you can do a single scan on your data and know straight away if the data exists for both criteria.
Even more so, if you want to use this data in MapReduce, you would simply scan that single table. If you only want to get the relevant rows, you could define a filter on the Scan. For example, in the case where you will not be populating rows at all in Table 2, you would simply use a ColumnPrefixFilter.
If, however, you do need to keep this data separately in 2 tables, you could pre-split the tables with the same region boundaries for both tables. This will be helpful for the query you are aiming for: load all rows in Table 1 when the row exists in Table 2. Essentially this would be a map-side join. You could define multiple inputs in your MapReduce job, and since the region borders are the same, the splits will be such that each mapper will have corresponding rows from both tables. You would probably need to implement your own MultipleInput format for that (the MultiTableInputFormat class recently introduced in 0.96 does not seem to do that map-side join).

How do I partition in hive by a specific column?

I have 3 columns: user, datetime, and data.
My data is space delimited and each row is delimited by a new line.
Right now I'm using the RegexSerDe to read in my input, but I want to partition by user. If I do that, user can no longer be a column, correct? If so, how do I load my data into my tables?
In Hive, each partition corresponds to a folder in HDFS. You can reload the data from your unpartitioned Hive table into a new partitioned Hive table using a create-table-as-select (CTAS) statement, or, in Hive versions where CTAS cannot create a partitioned table, by creating the partitioned table first and filling it with INSERT ... SELECT. See https://cwiki.apache.org/Hive/languagemanual-ddl.html#LanguageManualDDL-CreateTable for more details.
You can organize the data in HDFS into sub-directories under the current directory. Each directory name has to be in the format PART_NAME=PART_VALUE.
If your data is split into files where each file contains only one "user", just create directories corresponding to the usernames (e.g. USERNAME=XYZ) and put all the files that match that username in its directory.
Next, you can create an external table with partitions.
The only problem is that you'll still have to define the column "user" that's in your data (but you can just ignore it) and query the partition column (USERNAME) instead, which will provide the needed partition pruning.
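A sketch of such an external table is shown below. The table name, column names, and paths are hypothetical, and it assumes plain space-delimited fields with no embedded spaces; if the datetime field itself contains a space, the RegexSerDe from the question would still be needed in place of ROW FORMAT DELIMITED.

```sql
-- Hypothetical external table over directories named USERNAME=<value>.
CREATE EXTERNAL TABLE user_events (
  user_in_file STRING,  -- the "user" field still present in each row; can be ignored
  event_ts     STRING,  -- the datetime field, kept as a string here
  payload      STRING   -- the data field
)
PARTITIONED BY (username STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/data/user_events';  -- assumed parent directory

-- Register one partition explicitly, pointing at its directory.
ALTER TABLE user_events ADD IF NOT EXISTS
  PARTITION (username = 'XYZ')
  LOCATION '/data/user_events/USERNAME=XYZ';

-- Filtering on the partition column reads only that user's directory.
SELECT event_ts, payload
FROM user_events
WHERE username = 'XYZ';
```

Adding partitions with an explicit LOCATION avoids relying on Hive to discover the directories by name.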
