Hive cannot query the tables save by calling saveAsTable in Spark - hadoop

I was trying to use Hive to query the tables I saved using saveAsTable() provided by Spark DataFrame. Everything works well when I query using hiveContext.sql(). However, when I switch to hive and describe the table, it becomes col, array, something like this and is no longer queryable.
Any ideas how to work it through? Is there a reliable way to make Hive understands the metadata defined in spark instead of explicitly defining the schema?
Sometimes I make use of spark to infer schema from the raw data or read schema from certain file formats like parquet so don't want to create these table that could be inferred automatically.
Thanks a lot for any advice!

Related

HBase Table Data Convert to CSV

How to HBase table data to convert .CSV file, im trying to convert table data to csv format , but i couldn't get any code
hbase001> list
Table
sample_data
Creating an external Hive table mapped on to HBase table using HBaseStorageHandler can solve your problem ,you can now use "select * from table_name" to get data into a csv table (stored as textfile fields terminted by ','). Please refer the below link for reference.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-Usage
There are plenty of ways to solve your task. You can use spark, regular mapreduce or special tools like sqoop. This task is rather trivial and you can implement it by yourself if you learn hadoop. The quickest way for starters to do it is probably sqoop. Please get youself familiar with this power tool and play with it.
Good luck!

Migrate hive table to Google BigQuery

I am trying to design a sort of data pipeline to migrate my Hive tables into BigQuery. Hive is running on an Hadoop on premise cluster. This is my current design, actually, it is very easy, it is just a shell script:
for each table source_hive_table {
INSERT overwrite table target_avro_hive_table SELECT * FROM source_hive_table;
Move the resulting avro files into google cloud storage using distcp
Create first BQ table: bq load --source_format=AVRO your_dataset.something something.avro
Handle any casting issue from BigQuery itself, so selecting from the table just written and handling manually any casting
}
Do you think it makes sense? Is there any better way, perhaps using Spark?
I am not happy about the way I am handling the casting, I would like to avoid creating the BigQuery table twice.
Yes, your migration logic makes sense.
I personally prefer to do the CAST for specific types directly into the initial "Hive query" that generates your Avro (Hive) data. For instance, "decimal" type in Hive maps to the Avro 'type': "type":"bytes","logicalType":"decimal","precision":10,"scale":2
And BQ will just take the primary type (here "bytes") instead of the logicalType.
So that is why I find it easier to cast directly in Hive (here to "double").
Same problem happens to the date-hive type.

Does any one know how to create dataframe in sparkR from hbase table?

I am trying to create a spark dataframe in sparkR using data stored in hbase.
Does any one know how to specify the data source parameters in SQLontext or any other way to get around this?
You might want to take a look at this package : http://spark-packages.org/package/nerdammer/spark-hbase-connector.
However, it seems that you can't use it with SparkR yet and the two others packages providing connection between Spark and HBase don't seem to ba as advanced as the first one.
So I guess you won't be able to create a dataframe directly from HBase to SparkR.

Is there a common place to store data schemas in Hadoop?

I've been doing some investigation lately around using Hadoop, Hive, and Pig to do some data transformation. As part of that I've noticed that the schema of data files doesn't seem to attached to files at all. The data files are just flat files (unless using something like a SequenceFile). Each application that wants to work with those files has its own way of representing the schema of those files.
For example, I load a file into the HDFS and want to transform it with Pig. In order to work effectively with it I need to specify the schema of the file when I load the data:
EMP = LOAD 'myfile' using PigStorage() as { first_name: string, last_name: string, deptno: int};
Now, I know that when storing a file using PigStorage, the schema can optionally be written out along side it, but in order to get a file into Pig in the first place it seems like you need to specify a schema.
If I want to work with the same file in Hive, I need to create a table and specify the schema with that too:
CREATE EXTERNAL TABLE EMP ( first_name string
, last_name string
, empno int)
LOCATION 'myfile';
It seems to me like this is extremely fragile. If the file format changes even slightly then the schema must be manually updated in each application. I'm sure I'm being naive but wouldn't it make sense to store the schema with the data file? That way the data is portable between applications and the barrier to using another tool would be lower since you wouldn't need to re-code the schema for each application.
So the question is: Is there a way to specify the schema of a data file in Hadoop/HDFS or do I need to specify the schema for the data file in each application?
It looks like you are looking for Apache Avro. With Avro your schema is embedded in your data, so you can read it without having to worry about schema issues and it makes schema evolution really easy.
The great thing about Avro is that it is completely integrated in Hadoop and you can use it with a lot of Hadoop sub-projects like Pig and Hive.
For example with Pig you could do:
EMP = LOAD 'myfile.avro' using AvroStorage();
I would advise looking at the documentation for AvroStorage for more details.
You can also work with Avro with Hive as described here but I have not used that personally but it should work the same way.
What you need is HCatalog which is
"Apache HCatalog is a table and storage management service for data
created using Apache Hadoop.
This includes:
Providing a shared schema and data type mechanism.
Providing a table abstraction so that users need not be concerned with where or how
their data is stored.
Providing interoperability across data processing tools such as Pig, Map Reduce, and Hive."
You can take a look at the "data flow example" in the docs to see exactly the scenario you are talking about
Apache Zebra seems to be the tool that could provide a common schema definition across mr, pig and hive. It has its own schema store. MR job can use its built in TableStore to write to HDFS.

Using Apache Hive as a MapReduce Input Format and/or Scraping Hive Metadata

Our environment is heavy into storing data in hive. I find myself currently working on something that it outside the scope though. I have a mapreduce written, but it requires a lot of direct user inputs for information that could easily be scraped from Hive. That said, when I query hive for extended table data, all of the extended information is thrown out in 1 or 2 columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information, or better yet, get it directly in a more direct manor?
Alternatively, if I could get pointed to documentation on manually using the CombinedHiveInputFormat, that would simplify my code a lot more. But it seems like that InputFormat is solely used inside of Hive, using it's custom structs.
Ultimately, what I want is to know table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.

Resources