Writing to multiple HBASE Tables, how do I use context.write(hkey, put)? - hadoop

I am new to Hadoop MapReduce. I would like to write to multiple tables from my reducer function. In other words, if anything is written to Table1, I want the same content written to Table2 as well.
I have gone through posts like Write to multiple tables in HBASE and checked MultiTableOutputFormat. But what I don't understand there is that, according to the post, in the reducer function I should just use
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName1")),put1);
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName2")),put2);
What I don't understand is: if we do this, where do we define the row we want to update? For example, I have seen code snippets that write into the table as context.write(hkey, put), where hkey is not the table name but rather represents a particular row in the table.
How should I deal with this?
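For reference, a minimal reducer sketch (the reducer input types, column family and qualifier are placeholders, not from the original post) showing that the row is defined inside the Put itself, while the key passed to context.write only selects the target table when MultiTableOutputFormat is used:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MultiTableReducer extends Reducer<Text, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            byte[] rowKey = Bytes.toBytes(key.toString());   // the row to update
            Put put = new Put(rowKey);                        // row key is defined here, inside the Put
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value.toString()));

            // The same Put (same row key and content) goes to both tables;
            // only the table name in the output key changes.
            context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName1")), put);
            context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName2")), put);
        }
    }
}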

Related

How to deal with primary key check like situations in Hive?

I have to load an incremental load into my base table (say table_stg) once every day. I get a snapshot of data every day from various sources in XML format. The id column is supposed to be unique, but since data is coming from different sources, there is a chance of duplicate data.
day1:
table_stg
id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
day2: (say the XML is parsed and loaded into table_inter as below)
id,col2,col3,ts,col4
4,a,b,2016-06-25 01:12:27.000417,l
2,e,f,2016-06-25 01:12:27.000417,k
5,w,c,2016-06-25 01:12:27.000417,f
5,w,c,2016-06-25 01:12:27.000417,f
When I put this data into table_stg, my final output should be:
id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
4,a,b,2016-06-25 01:12:27.000417,l
5,w,c,2016-06-25 01:12:27.000417,f
What would be the best way to handle these kinds of situations (without deleting table_stg (the base table) and reloading the whole data)?
Hive does not enforce primary or unique keys, so duplicates are allowed. You should have an upstream job doing the data cleaning before loading it into the Hive table.
You can write a Python script for that if the data is small, or use Spark if the data size is huge.
Spark provides a dropDuplicates() method to achieve this.
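For what it's worth, a minimal sketch of that idea in the Spark Java API (assuming Spark 2.x with Hive support; the table names follow the example above, and the left-anti join that keeps only ids missing from the base table is my own addition to match the desired output):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("dedup-load").enableHiveSupport().getOrCreate();

Dataset<Row> base = spark.table("table_stg");        // existing base table
Dataset<Row> incoming = spark.table("table_inter");  // parsed incremental data

// Drop duplicate ids within the incremental load (e.g. the repeated id=5 row).
Dataset<Row> deduped = incoming.dropDuplicates("id");

// Keep only ids that are not already present in the base table, then append them.
Dataset<Row> newRows = deduped.join(base, deduped.col("id").equalTo(base.col("id")), "left_anti");
newRows.write().mode("append").insertInto("table_stg");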

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to the Hadoop ecosystem and I need some suggestions from Big Data experts on achieving schema verification/validation before loading huge data into HDFS.
The scenario is:
I have a huge dataset with a given schema (around 200 column headers in it). This dataset is going to be stored in Hive tables/HDFS. Before loading the data into the Hive table/HDFS I want to perform schema-level verification/validation on the data supplied, to avoid any unwanted errors/exceptions while loading the data into HDFS. For example, in case somebody tries to pass a data file having fewer or more columns in it, then at this first level of verification the load should fail.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them to HDFS and run MapReduce on top of them. There you get a hold of each row, so you can verify the number of columns, their types, and any other validations.
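A minimal sketch of such a validation mapper (the expected column count, the comma delimiter, and the counter name are assumptions for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SchemaValidationMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final int EXPECTED_COLUMNS = 200;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", -1);
        if (fields.length == EXPECTED_COLUMNS) {
            // Valid row: pass it through to the clean output location.
            context.write(value, NullWritable.get());
        } else {
            // Invalid row: count it so the driver can fail the load after the job.
            context.getCounter("Validation", "BAD_RECORDS").increment(1);
        }
    }
}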
With JSON/XML there is a slight overhead in getting MapReduce to identify the records in that format. On the other hand, those formats support schema validation, which you can enforce, and the schema can also restrict a field to specific values. Once the schema is ready, you do your parsing (XML to Java) and then store the validated records at another, final HDFS location for further use (like HBase). When you are sure that the data is validated, you can create Hive tables on top of it.
Use the utility below to create temp tables each time, based on the schema you receive in CSV format in the staging directory, and then apply some conditions to check whether you have valid columns or not. Finally, load into the original table.
https://github.com/enahwe/Csv2Hive

Write Hive Table using Spark SQL and JDBC

I am new to Hadoop and I am using a single node cluster (for development) to pull some data from a relational database.
Specifically, I am using Spark (version 1.4.1), via the Java API, to pull data for a query and write it to Hive. I have run into various problems (and have read the manuals and tried searching online), but I think I might be misunderstanding some fundamental part of this.
First, I thought I'd be able to read data into Spark, optionally run some Spark methods to manipulate the data and then write it to Hive through a HiveContext object. But, there doesn't seem to be any way to write straight from Spark to Hive. Is that true?
So I need an intermediate step. I have tried a few different methods of storing the data before writing it to Hive and settled on writing an HDFS text file, since that seemed to work best for me. However, when writing the HDFS file, I get square brackets in the files, like this: [A,B,C]
So, when I load the data into Hive using the "LOAD DATA INPATH..." HiveQL statement, I get the square brackets in the Hive table!!
What am I missing? Or more appropriately, can someone please help me understand the steps I need to do to:
Run a SQL on SQL Server or Oracle DB
Write the data out to a Hive table that can be accessed by a dashboard tool.
My code right now, looks something like this:
DataFrame df = sqlContext.read().format("jdbc").options(getSqlContextOptions(driver, dburl, query)).load(); // This step seems to work fine.
JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsTextFile(getHdfsUri() + pathToFile); // This works, but writes the rows in square brackets, like: [1, AAA].
hiveContext.sql("CREATE TABLE BLAH (MY_ID INT, MY_DESC STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE");
hiveContext.sql("LOAD DATA INPATH '" + getHdfsUri() + hdfsFile + "' OVERWRITE INTO TABLE `BLAH`"); // Get's written like:
MY_INT MY_DESC
------ -------
AAA]
The INT column doesn't get written at all, because the leading [ means it is no longer a numeric value, and the last column shows the ] from the end of the row in the HDFS file.
Please help me understand why this isn't working or what a better way would be. Thanks!
I am not locked into any specific approach, so all options would be appreciated.
OK, I figured out what I was doing wrong. I needed to use the write function on the DataFrame (built with the HiveContext) and use com.databricks.spark.csv to write the table into Hive. This does not require the intermediate step of saving a file in HDFS, which is great, and it writes to Hive successfully.
DataFrame df = hiveContext.createDataFrame(rdd, struct);
df.select(cols).write().format("com.databricks.spark.csv").mode(SaveMode.Append).saveAsTable("TABLENAME");
I did need to create a StructType object, though, to pass into the createDataFrame method for the proper mapping of the data types (something like what is shown in the middle of this page: Support for User Defined Types for java in Spark). And the cols variable is an array of Column objects, which is really just an array of column names (i.e. something like Column[] cols = {new Column("COL1"), new Column("COL2")};).
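For illustration, a sketch of the StructType and column array described above (field names taken from the earlier CREATE TABLE example; nullability is an assumption):

import org.apache.spark.sql.Column;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Schema matching the target table's columns and types.
StructType struct = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("MY_ID", DataTypes.IntegerType, true),
        DataTypes.createStructField("MY_DESC", DataTypes.StringType, true)
});

// Column names selected before writing.
Column[] cols = { new Column("MY_ID"), new Column("MY_DESC") };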
I think "Insert" is not yet supported.
http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
To get rid of the brackets in the text file, you should avoid saveAsTextFile. Instead, try writing the contents using the HDFS API, i.e. FSDataOutputStream.
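A rough sketch of that approach (the output path is hypothetical; note this collects the rows to the driver, so it only makes sense for modest result sets):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Row;

FileSystem fs = FileSystem.get(new Configuration());
try (FSDataOutputStream out = fs.create(new Path("/tmp/blah/data.csv"))) {
    for (Row row : df.collectAsList()) {
        // Join the fields with commas ourselves instead of relying on Row.toString(),
        // which is what produces the [A,B,C] brackets.
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < row.length(); i++) {
            if (i > 0) line.append(',');
            line.append(row.get(i));
        }
        line.append('\n');
        out.write(line.toString().getBytes(StandardCharsets.UTF_8));
    }
}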

Bulk Update of Particular fields In Hbase

I have a scenario I was working on in HBase. Initially I had to bulk-upload a CSV file into an HBase table, which I could do successfully by using HBase bulk loading.
Now I want to update a particular field in the HBase table by comparing it to a new CSV that is provided, and if the value is updated I have to maintain a flag which says the row key was updated. Any hint on how I can do this easily?
Any help is really appreciated.
Thanks
HBase maintains versions for each cell. As long as you have the row key, you have a handle on the row, and you can just use a Put to add the updated column. Internally HBase maintains the versions, and you can access the history of the updated values too.
However, you need the comparison too, as I can see. So after bulk loading, the fastest way to do it is a MapReduce job that has HBase as both source and sink. Have a look at section 7.2.2 here.
The idea is to have MapReduce perform the scan, do the comparison in the map, and write the new updated Put to the output. It is like a basic fetch, modify and update sequence, but we use MapReduce's parallelism since we are dealing with a large amount of data.
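A sketch of what that mapper could look like (column family/qualifier names and the lookup of the new CSV value are placeholders; the driver would wire HBase as source and sink with TableMapReduceUtil.initTableMapperJob and initTableReducerJob, with zero reduce tasks):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

public class CompareAndUpdateMapper extends TableMapper<ImmutableBytesWritable, Put> {

    @Override
    protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
            throws IOException, InterruptedException {
        byte[] current = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field"));
        byte[] incoming = lookupNewValue(rowKey.get()); // e.g. from the new CSV shipped via the distributed cache

        if (incoming != null && !Bytes.equals(current, incoming)) {
            Put put = new Put(rowKey.get());
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field"), incoming);
            // Flag column marking that this row key was updated.
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("updated_flag"), Bytes.toBytes("Y"));
            context.write(rowKey, put);
        }
    }

    private byte[] lookupNewValue(byte[] rowKey) {
        // Placeholder: resolve the value for this row key from the new CSV.
        return null;
    }
}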

MultipleInputs with DBInputFormat in Hadoop

In my database I have multiple tables where each table is a different entity type. I have an Avro schema that I use in Hadoop which is a union of all the fields of these different entity types, plus an entity type field.
What I would like to do is something along the lines of setting up a DBInputFormat with a DBWritable for each entity type that maps the entity type to the combined Avro type. Then give each DBInputFormat to something like MultipleInputs so that I can create a composite input format. The composite input format could then be given to my map reduce job so that all of the data from all the tables could be processed at once by the same mapper class.
Data is constantly added to these database tables so I need to be able to configure the DBInputFormat for each entity type/dbtable to only grab the new data and to do the splits properly.
Basically I need the functionality of DBInputFormat or DataDrivenDBInputFormat but also the ability to make a composite of them similar to what you can do with paths and MultipleInputs.
Create a view over the N input tables and set the view in DBInputFormat#setInput, as the Cloudera article suggests. So, I guess the data should not be updated in the tables while the job runs:
Hadoop may need to execute the same query multiple times. It will need to return the same results each time. So any concurrent updates to your database, etc, should not affect the query being run by your MapReduce job. This can be accomplished by disallowing writes to the table while your MapReduce job runs, restricting your MapReduce’s query via a clause such as “insert_date < yesterday,” or dumping the data to a temporary table in the database before launching your MapReduce process.
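A sketch of what that looks like (the view name, connection settings, column names, and the CombinedEntityWritable class are hypothetical):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

Job job = Job.getInstance();
DBConfiguration.configureDB(job.getConfiguration(),
        "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/mydb", "user", "password");

// The view unions the N entity tables into the combined schema; the condition
// restricts the query to stable data, along the lines the article describes.
DBInputFormat.setInput(job, CombinedEntityWritable.class,
        "entity_view",                         // table or view name
        "insert_date < CURRENT_DATE",          // conditions
        "id",                                  // order-by column (used for splits)
        "id", "entity_type", "col1", "col2");  // field names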
Evaluate frameworks which support real-time processing, like Storm, HStreaming, S4 and StreamBase. Some of these sit on top of Hadoop and some don't; some are FOSS and some are commercial.
