Hive partitioned column doesn't appear in rdd via sc.textFile - hadoop

The Hive partition column is not part of the underlying saved data, and I need to know how it can be pulled via the sc.textFile(filePath) syntax so that it ends up in the RDD.
I know the other way of creating a Hive context and so on, but I was wondering whether there is a way to get it directly via the sc.textFile(filePath) syntax and use it.

When you partition the data by a column while saving, that column's values are stored in the directory structure and not in the actual files. Since sc.textFile(filePath) is made for reading individual files' contents, I do not believe it supports reading partitioned data.
I would recommend reading the data as a dataframe, for example:
val df = hiveContext.read.format("orc").load("path/to/table/")
The wholeTextFiles() method could also be used. It returns tuples of (file path, file contents), and from the path it should be possible to parse out the partition column's value and add it as a new column.
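A minimal sketch of that approach, assuming the standard Hive directory layout (colA=value subdirectories, with colA, the path, and the glob being placeholders) and plain-text files:
// Recover the partition value from the file path returned by wholeTextFiles(),
// then re-attach it to every line of the corresponding file.
// The glob is assumed to match the part files inside each colA=... directory.
val withPartitionCol = sc.wholeTextFiles("path/to/table/*/*")
  .flatMap { case (path, contents) =>
    val colA = path.split("/")
      .find(_.startsWith("colA="))
      .map(_.stripPrefix("colA="))
      .getOrElse("")
    contents.split("\n").map(line => (colA, line))
  }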
If storage size is not a problem, an alternative solution would be to store the information of the partition column twice: once in the directory structure (done by partitioning on that column), and once more in the data itself. This is achieved by duplicating the column before saving to file. Say the column in question is named colA:
val df2 = df.withColumn("colADup", $"colA")
df2.write.partitionBy("colADup").orc("path/to/save/")
This can also easily be extended to multiple columns.
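As a sketch of the multi-column variant (colA and colB are just placeholder names), the same duplication trick applies to each column you partition by:
val df3 = df.withColumn("colADup", $"colA").withColumn("colBDup", $"colB")
df3.write.partitionBy("colADup", "colBDup").orc("path/to/save/")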

Related

How to coalesce large partitioned data into a single directory in Spark/Hive

I have a requirement:
Huge data is partitioned and inserted into Hive. To bind this data, I am using DF.coalesce(10). Now I want to bind this partitioned data into a single directory. If I use DF.coalesce(1), will the performance decrease, or is there another way to do so?
From what I understand, you are trying to ensure that there are fewer files per partition. By using coalesce(10), you will get at most 10 files per partition. I would suggest using repartition($"COL"), where COL is the column used to partition the data. This will ensure that your "huge" data is split based on the partition column used in Hive:
df.repartition($"COL")
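A hedged sketch of how that could look when writing the result back in a Hive-style layout (the column name COL and the output path are placeholders):
// Repartition by the Hive partition column so that all rows for a given
// value land in one task, which then writes a single file into that
// partition's directory.
df.repartition($"COL")
  .write
  .partitionBy("COL")
  .orc("path/to/output/")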

Can I keep data of different file formats in the same Hive table?

I am receiving data in formats like CSV, XML, and JSON, and I want to keep all the files in the same Hive table. Is that achievable?
Hive expects all the files for one table to use the same delimiter, the same compression, and so on. So you cannot use a single Hive table on top of files with multiple formats.
The solution you may want to use is:
Create a separate table (JSON/XML/CSV) for each of the file formats.
Create a view for the UNION of the three tables created above.
This way the consumer of the data has to query only one view/object, if that's what you are looking for.
Yes, you can achieve this through a combination of different external tables.
Because different SerDes, with different specifications for how to read columns, will be needed for the different files, you will need to create one external table per type of file (and per table). The data from each of these external tables can then be combined into a view with UNION, as suggested by Ramesh. The view could then be used for reading from these tables, and you could, for example, insert the data into a managed table.
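As a rough sketch of that setup, assuming the per-format external tables have already been created with matching column lists (the table and view names here are made up):
// Hypothetical names: events_csv, events_json and events_xml are the
// per-format external tables; events_all is the single view consumers query.
hiveContext.sql("""
  CREATE VIEW IF NOT EXISTS events_all AS
  SELECT * FROM events_csv
  UNION ALL
  SELECT * FROM events_json
  UNION ALL
  SELECT * FROM events_xml
""")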

How to deal with primary key check like situations in Hive?

I have to load an incremental load into my base table (say table_stg) once every day. I get a snapshot of data every day from various sources in XML format. The id column is supposed to be unique, but since the data comes from different sources, there is a chance of duplicate data.
day1:
table_stg
id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
day2: (say the XML is parsed and loaded into table_inter as below)
id,col2,col3,ts,col4
4,a,b,2016-06-25 01:12:27.000417,l
2,e,f,2016-06-25 01:12:27.000417,k
5,w,c,2016-06-25 01:12:27.000417,f
5,w,c,2016-06-25 01:12:27.000417,f
When I put this data into table_stg, my final output should be:
id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
4,a,b,2016-06-25 01:12:27.000417,l
5,w,c,2016-06-25 01:12:27.000417,f
What could be the best way to handle this kind of situation (without deleting table_stg, the base table, and reloading the whole data)?
Hive does allow duplicates on primary and unique keys. You should have an upstream job doing the data cleaning before loading it into the Hive table.
You can write a Python script for that if the data is small, or use Spark if the data size is huge.
Spark provides the dropDuplicates() method to achieve this.
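A minimal sketch of the Spark route, assuming the base table and the parsed snapshot have been loaded as DataFrames named stg and inter (names taken from the question) and that id is the key:
// Union existing rows with the new snapshot and keep one row per id.
// Note: dropDuplicates keeps an arbitrary row among duplicates, so if a
// specific row must win (e.g. the oldest ts), clean the data upstream first.
val merged = stg.unionAll(inter).dropDuplicates(Seq("id"))
The merged result could then be written to a fresh location and exposed as the new table_stg.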

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to the Hadoop ecosystem and I need some suggestions from Big Data experts on achieving schema verification/validation before loading huge data into HDFS.
The scenario is:
I have a huge dataset with a given schema (having around 200 column headers in it). This dataset is going to be stored in Hive tables/HDFS. Before loading the data into the Hive table/HDFS, I want to perform schema-level verification/validation on the supplied data to avoid any unwanted errors/exceptions while loading the data into HDFS. For example, if somebody tries to pass a data file with fewer or more columns in it, the load should fail at this first level of verification.
What could be the best possible approach for achieving the same?
Since you have files, you can add them to HDFS and run MapReduce on top of that. There you have a hold on each row, so you can verify the number of columns, their types, and any other validations.
Regarding JSON/XML, there is a slight overhead in making MapReduce identify the records in those formats. However, with respect to validation, there is schema validation (e.g. an XML schema) which you can enforce, and you can also restrict a field to specific values using the schema. So once the schema is ready, do your parsing (XML to Java), and then store the records at another, final HDFS location for further use (like HBase). When you are sure the data is validated, you can create Hive tables on top of it.
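For the delimited-file case, a rough sketch of that first-level check, using Spark rather than hand-rolled MapReduce (the delimiter, expected column count, and staging path are all assumptions):
// Fail the load early if any row does not have the expected number of columns.
val expectedCols = 200  // assumed from the ~200-column schema in the question
val raw = sc.textFile("hdfs:///staging/incoming.csv")
val badCount = raw.filter(_.split(",", -1).length != expectedCols).count()
if (badCount > 0) {
  // Abort (or divert the bad rows to a quarantine directory for inspection)
  sys.error(s"Schema check failed: $badCount rows with an unexpected column count")
}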
Use the utility below to create temp tables each time, based on the schema you receive in CSV format in the staging directory, and then apply some conditions to identify whether you have valid columns or not. Finally, load into the original table.
https://github.com/enahwe/Csv2Hive

Not able to store data into HBase using Pig when I don't know the number of columns in a file

I have a text file with N columns (not sure; in the future I may have N+1).
Example:
1|A
2|B|C
3|D|E|F
I want to store the above data into HBase using Pig without writing a UDF. How can I store this kind of data without knowing the number of columns in the file?
Put it into a map, and then you can use cf1:*, where cf1 is your column family.
