Storing a DataFrame to a Hive partitioned table in Spark - hadoop

I'm trying to store a stream of data coming in from a Kafka topic into a Hive partitioned table. I was able to convert the DStream to a DataFrame and created a HiveContext. My code looks like this:
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
newdf.registerTempTable("temp") //newdf is my dataframe
newdf.write.mode(SaveMode.Append).format("osv").partitionBy("date").saveAsTable("mytablename")
But when I deploy the app on the cluster, it says:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/tmp/spark-3f00838b-c5d9-4a9a-9818-11fbb0007076/scratch_hive_2016-10-18_23-18-33_118_769650074381029645-1, expected: hdfs://
When I try to save it as a normal table and comment out the Hive configurations, it works. But with the partitioned table, it gives me this error.
I also tried registering the DataFrame as a temp table and then writing that table to the partitioned table. Doing that also gave me the same error.
Can someone please tell me how I can solve this?
Thanks.

You need to have Hadoop (HDFS) configured if you are deploying the app on the cluster.
With saveAsTable, the default location that Spark saves to is controlled by the Hive metastore (based on the docs). Another option would be to use saveAsParquetFile and specify the path, then later register that path with your Hive metastore, OR use the newer DataFrameWriter interface and specify the path option: write.format(source).mode(mode).options(options).saveAsTable(tableName).
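For example, a rough sketch of the DataFrameWriter route with an explicit HDFS path; newdf, the partition column and the table name come from the question, while the format and the path are placeholders of my own:

import org.apache.spark.sql.SaveMode

// Sketch only: write to an explicit HDFS path and register the result as a
// partitioned table in the Hive metastore. Format and path are assumptions.
newdf.write
  .mode(SaveMode.Append)
  .format("orc")                                         // or "parquet", "json", ...
  .option("path", "hdfs:///user/hive/warehouse/mytablename")
  .partitionBy("date")
  .saveAsTable("mytablename")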

I figured it out.
In the code for the Spark app, I declared the scratch dir location as below, and it worked.
sqlContext.sql("SET hive.exec.scratchdir=<hdfs location>")

sqlContext.sql("SET hive.exec.scratchdir=location")

Related

Create Hive table through Spark job

I am trying to create Hive tables as outputs of my Spark (version 1.5.1) job on a Hadoop cluster (BigInsights 4.1 distribution) and am facing permission issues. My guess is that Spark is using a default user (in this case 'yarn', not the job submitter's username) to create the tables and therefore fails to do so.
I tried to customize the hive-site.xml file to set an authenticated user that has permission to create Hive tables, but that didn't work.
I also tried to set the Hadoop user variable to an authenticated user, but that didn't work either.
I want to avoid saving text files and then creating Hive tables from them, in order to optimize performance and reduce the size of the outputs through ORC compression.
My questions are:
Is there any way to call the write function of the Spark DataFrame API with a specified user?
Is it possible to choose a username using Oozie's workflow file?
Does anyone have an alternative idea, or has anyone ever faced this problem?
Thanks.
Hatak!
Considering df holds your data, you can write:
In Java:
df.write().saveAsTable("tableName");
You can use a different SaveMode, such as Overwrite or Append:
df.write().mode(SaveMode.Append).saveAsTable("tableName");
In Scala:
df.write.mode(SaveMode.Append).saveAsTable(tableName)
Many other options can be specified depending on the type you would like to save: text, ORC (with buckets), JSON.
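For instance, a rough sketch of choosing the format explicitly; the table name and path are placeholders:

import org.apache.spark.sql.SaveMode

// Sketch: pick the storage format via format(). Names and paths are placeholders.
df.write.mode(SaveMode.Overwrite).format("orc").saveAsTable("ordersOrc")     // ORC-backed Hive table
df.write.mode(SaveMode.Append).format("json").save("/user/me/orders_json")   // JSON files at an HDFS path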

To Replace Name with Another Name in a file

I am very new to Hadoop, and I have a requirement to scrub a file containing account number, name, and address details. I need to replace the name and address details with other names and addresses that exist in another file.
I am fine with either MapReduce or Hive.
I need help on this.
Thank you.
You can write a simple mapper-only job (with the number of reducers set to zero), update the information, and store the output in some other location. Verify the output of your job, and if it is as you expected, remove the old files. Remember, HDFS does not support in-place editing or overwriting of files.
Hadoop - MapReduce Tutorial.
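A minimal sketch of such a mapper-only job in Scala, assuming comma-delimited records of the form accountNo,name,address; the replacement map is hard-coded here, whereas a real job would load it from the second file (for example via the distributed cache):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

class ScrubMapper extends Mapper[LongWritable, Text, NullWritable, Text] {
  // Placeholder lookup; in practice load it from the second file.
  private val replacements = Map("John Doe" -> "Jane Roe")

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, NullWritable, Text]#Context): Unit = {
    val fields = value.toString.split(",", -1)          // accountNo,name,address
    if (fields.length >= 2)
      fields(1) = replacements.getOrElse(fields(1), fields(1))  // swap the name column
    context.write(NullWritable.get(), new Text(fields.mkString(",")))
  }
}

object ScrubJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "scrub names")
    job.setJarByClass(classOf[ScrubMapper])
    job.setMapperClass(classOf[ScrubMapper])
    job.setNumReduceTasks(0)                             // mapper-only: output goes straight to HDFS
    job.setOutputKeyClass(classOf[NullWritable])
    job.setOutputValueClass(classOf[Text])
    FileInputFormat.addInputPath(job, new Path(args(0))) // input directory
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // new output location
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}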
You can also use Hive to accomplish this task:
1. Write a Hive UDF based on your scrubbing logic (a sketch follows below).
2. Use that UDF on each column of the Hive table you want to scrub and store the data in a new Hive table.
3. You can then remove the old Hive table.
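A minimal sketch of such a UDF in Scala, assuming the classic org.apache.hadoop.hive.ql.exec.UDF base class; the replacement map is a placeholder and would really be loaded from your lookup file:

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Sketch of a scrubbing UDF; the replacement data is hard-coded for illustration.
class ScrubName extends UDF {
  private val replacements = Map("John Doe" -> "Jane Roe")

  def evaluate(input: Text): Text = {
    if (input == null) return null
    val original = input.toString
    new Text(replacements.getOrElse(original, original))
  }
}

After packaging the class into a jar, it would be registered with ADD JAR and CREATE TEMPORARY FUNCTION and then applied in an INSERT ... SELECT into the new table.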

Does anyone know how to create a DataFrame in SparkR from an HBase table?

I am trying to create a Spark DataFrame in SparkR using data stored in HBase.
Does anyone know how to specify the data source parameters in SQLContext, or any other way to get around this?
You might want to take a look at this package: http://spark-packages.org/package/nerdammer/spark-hbase-connector.
However, it seems that you can't use it with SparkR yet, and the two other packages providing a connection between Spark and HBase don't seem to be as advanced as the first one.
So I guess you won't be able to create a DataFrame directly from HBase in SparkR.
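For reference, using that connector from Scala (not SparkR) looks roughly like the sketch below; the calls follow the package's README, so treat the exact API, the configuration key, and all table/column names as assumptions:

import it.nerdammer.spark.hbase._
import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch, Scala only. API names are taken from the connector's README and
// may differ between versions; table, column family and column are placeholders.
val sparkConf = new SparkConf()
  .setAppName("hbase-read")
  .set("spark.hbase.host", "my-hbase-host")   // where HBase/ZooKeeper runs
val sc = new SparkContext(sparkConf)

val hbaseData = sc.hbaseTable[(String, String)]("mytable")
  .select("col1")
  .inColumnFamily("cf")
// hbaseData can then be consumed like an RDD of (rowKey, col1) tuples,
// e.g. turned into a DataFrame with sqlContext.createDataFrame.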

How to integrate manually copied HBase files into an HBase instance?

I have copied the files associated with an HBase table to another cluster and stored them in the HBase folder. I can see the table when I do a list. When I do scan 'myTable', it can't find the table.
When I go through the HBase web UI, I see the table including its column-family information, but when I click on the table I get:
org.apache.hadoop.hbase.TableNotFoundException: hbTable
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1024)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:889)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:846)
How do I get the HBase region servers to manage the table?
P.S. For the purposes of this exercise I'm not interested in using the Export utility or the CopyTable utility.
Please refer to this link: Complete Bulk Load.
I am assuming here that you copied the HFiles from one cluster to the other.
Did you create the table in the other cluster? I think the error is due to the missing information in the .META. table.

Load data into Hive from flat files or an existing database

We are setting up Hadoop and Hive in our organization.
We will also have sample data created by a data generator tool. The data will be around 1 TB.
My question is: how do I load that data into Hive and Hadoop? What is the process I need to follow for this?
We will also have HBase installed with Hadoop.
We need to create the same database design that currently exists in SQL Server, but using Hive, because after this data is loaded into Hive we want to use Business Objects 4.1 as a front end to create the reports.
The challenge is to load the sample data into Hive.
Please help me, as we want to do all of this as soon as possible.
First, ingest your data into HDFS.
Use Hive external tables, pointing to the location where you ingested the data, i.e. your HDFS directory.
You are then all set to query the data from the tables you created in Hive.
Good luck.
For the first case you need to put the data in HDFS:
1. Transport your data file(s) to a client node (app node).
2. Put your files into the distributed file system (hdfs dfs -put ...).
3. Create an external table pointing to the HDFS directory into which you uploaded those files. Your data must be structured in some way, for instance delimited by a semicolon.
Now you can operate on the data with SQL queries, as sketched below.
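As an illustration, a minimal sketch of step 3 issued through a HiveContext (it could equally be run from the Hive CLI); the SparkContext sc is assumed to exist, and the path, table name, columns, and delimiter are placeholders:

import org.apache.spark.sql.hive.HiveContext

// Sketch: register already-uploaded, semicolon-delimited files as an external Hive table.
val hiveContext = new HiveContext(sc)
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS sample_data (
    |  id INT,
    |  name STRING,
    |  amount DOUBLE
    |)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    |STORED AS TEXTFILE
    |LOCATION 'hdfs:///data/sample_data'""".stripMargin)

// Query the data.
hiveContext.sql("SELECT COUNT(*) FROM sample_data").show()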
For the second case, you can create another Hive table (using the HBaseStorageHandler, https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration) and load it from the first table with an INSERT statement.
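Continuing the same sketch, a hedged example of an HBase-backed table following the HBaseIntegration page linked above; the table names and the column mapping are placeholders:

import org.apache.spark.sql.hive.HiveContext

// Sketch, assuming an existing SparkContext sc and the sample_data table from above.
val hiveContext = new HiveContext(sc)
hiveContext.sql(
  """CREATE TABLE IF NOT EXISTS sample_data_hbase (key INT, name STRING, amount DOUBLE)
    |STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    |WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name,cf:amount")
    |TBLPROPERTIES ("hbase.table.name" = "sample_data_hbase")""".stripMargin)

// Load the HBase-backed table from the external table.
hiveContext.sql("INSERT OVERWRITE TABLE sample_data_hbase SELECT id, name, amount FROM sample_data")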
I hope this can help you.
