Bulk Update of Particular fields In Hbase - hadoop

I have a scenario while was working on Hbase. Initially I have to bulkupload a csv file to Hbase table.Which I could do successfully by using Hbase bulkloading.
Now I want to update a particular field in hbase table by comparing to an new csv provided and if the value is updated have to maintain a flag which says the rowkey was updated. Any hint how I can do it easily.
Any help is really appreciated.
Thanks

HBase maintains versions for each cell. As long as you have the row key with you, you get a handle of the row, and you can just use put to add the updated column. Internally it maintains the versions, and you can have access to history of the updated values too.
However, you need comparing too, as I can see. So after bulk loading the fastest you can do it, use a map reduce as have HBase as source and sink. Look here at 7.2.2 section.
The idea is have mapreduce perform the scan, do comparision in map, and write the new updated put in output. Its like a basic fetch, modify and update sequence. But we are using map reduce parallel feature as we are dealing with large amount of data

Related

How to deal with primary key check like situations in Hive?

I have to load the incremental load to my base table (say table_stg) everyday once. I get the snapshot of data everyday from various sources in xml format. The id column is supposed to be unique but since data is coming from different sources, there is a chance of duplicate data.
day1:
table_stg
id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
day2: (say the xml is parsed and loaded into table_inter as below)
id,col2,col3,ts,col4
4,a,b,2016-06-25 01:12:27.000417,l
2,e,f,2016-06-25 01:12:27.000417,k
5,w,c,2016-06-25 01:12:27.000417,f
5,w,c,2016-06-25 01:12:27.000417,f
when i put this data ino table_stg, my final output should be:
id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
4,a,b,2016-06-25 01:12:27.000417,l
5,w,c,2016-06-25 01:12:27.000417,f
What could be the best way to handle these kind of situations(without deleting the table_stg(base table) and reloading the whole data)
Hive does allow duplicates on primary and unique keys.You should have an upstream job doing the data cleaning before loading it into the Hive table.
You can write a python script for that if data is less or use spark if data size is huge.
spark provides dropDuplicates() method to achieve this.

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to Hadoop Ecosystem and I need some suggestion from Bigdata experts on achieving schema verification/validation before loading the huge data into hdfs.
The scenario is:
I have a huge dataset with given schema (having around 200
column-header in it). This dataset is going to be stored in Hive
tables/HDFS. Before loading the data into hive table/hdfs I want to
perform a schema level verification/validation on the data supplied to
avoid any unwanted errors/exception while loading the data into hdfs.
Like in case somebody tries to pass a data file having fewer or more
number of columns in it then at the first level of verification this
load fail.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them into HDFS,and run map reduce on top of that. Here you would be having a hold on each row, so you can verify number of columns, their types and any other validations.
When i referred to jason/xml, there is slight overhead to make map reduce identify the records in that format. However with respect to validation there is schema validation which you can enforce and also define only specific values for a field using schema. So once the schema is ready, your parsing(xml to java) and then store them at another final HDFS location for further use(like HBase). When you are sure that data is validated, you can create Hive tables on top of that.
Use below utility to create temp tables every time based on the schema you receive in csv file format in staging directory and then apply some conditions to identify whether you have valid columns or not. Finally load into original table.
https://github.com/enahwe/Csv2Hive

Handling incremental data - Hadoop

We had 5 years of data in cluster and we are loading the data everyday. The data that gets added everyday might contain duplicate data , partially modified data etc ..
1 . How to handle duplicate data - should that be handled as part of highlevel programming interfaces pig, hive etc .. or any other alternatives.
Say if there is a usecase to find out what is changed between two records given the key to find out the row.
What is the best way to model the data, using which hadoop eco system components.
How to handle duplicate data
It's very hard to remove duplicates from HDFS raw data,
so I guess your approach is right: remove using pig or hive while loading those data.
Say if there is a usecase to find out what is changed between two records given the key to find out the row.
For this case, do you meaning that two records has the same key?
Then what kind of changes you want to capture?
When you say that, you need to remove duplicates and also the delta between two records when you know the key, you should have some criteria of which data to be removed in case of partial changed data.
In both scenarios, you can have a handle of the key and write logic to remove duplicates. Map reduce seems to be a good choice, given the parallelism, performance and ability to manage based on keys. Mostly your requirements could be handled in reducer
See if Sqoop-merge fits your use case.
From the doc:
The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.

Filter data on row key in Random Partitioner

I'm working on Cassandra Hadoop integration (MapReduce). We have used RandomPartitioner to insert data to gain faster write speed. Now we have to read that data from Cassandra in MapReduce and perform some calculations on it.
From the lots of data we have in cassandra we want to fetch data only for particular row keys but we are unable to do it due to RandomPartitioner - there is an assertion in the code.
Can anyone please guide me how should I filter data based on row key on the Cassandra level itself (I know data is distributed across regions using hash of the row key)?
Would using secondary indexes (still trying to understand how they works) solve my problem or is there some other way around it?
I want to use cassandra MR to calculate some KPI's on the data which is stored in cassandra continuously. So here fetching whole data from cassandra every time seems an overhead to me? The rowkey I'm using is like "(timestamp/60000)_otherid"; this CF contains reference of rowkeys of actual data stored in other CF. so to calculate KPI I will work for a particular minute and fetch data from other CF, and process it.
When using RandomPartitioner, keys are not sorted, so you cannot do a range query on your keys to limit the data. Secondary indexes work on columns not keys, so they won't help you either. You have two options for filtering the data:
Choose a data model that allows you to specify a thrift SlicePredicate, which will give you a range of columns regardless of key, like this:
SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(ByteBufferUtil.bytes(start), ByteBufferUtil.bytes(end), false, Integer.MAX_VALUE));
ConfigHelper.setInputSlicePredicate(conf, predicate);
Or use your map stage to do this by simply ignoring input keys that are outside your desired range.
I am unfamiliar with the Cassandra Hadoop integration but trying to understand how to use the hash system to query the data yourself is likely the wrong way to go.
I would look at the Cassandra client you are using (Hector, Astynax, etc.) and ask how to query by row keys from that.
Querying by the row key is a very common operation in Cassandra.
Essentially if you want to still use a RandomPartitioner and want the ability to do range slices you will need to create a reverse index (a.k.a. inverted index). I have answered a similar question here that involved timestamps.
Having the ability to generate your rowkeys programmatically allows you to emulate a range slice on rowkeys. To do this you must write your own InputFormat class and generate your splits manually.

How to keep a flat file on HDFS in sync with a large database table?

What's the best way of keeping a flat file on HDFS in sync with a large database table which may have row updates?
Tools such as sqoop seem like they'd be useful as they allow incremental extracts of new rows from tables, however I can't see an easy way of handling row updates.
What techniques can we use to handle row updates in an efficent manner? Dumping entire tables nightly is something we'd rather avoid.
Here are a couple suggestions:
Use DBInputFormat to make the database the input to your jobs, instead of having an intermediate file that you have to worry about synchronizing. If MySQL becomes a bottleneck, you can use some distributed/NoSQL database.
If you still want to use flat files, each night you can only dump the rows that changed in MySQL, along with a timestamp. Write a Hadoop job that outputs only the most recent version of each unique row.
I prefer having an updated_at field in mysql table to only get the changed data every night. After that I do a simple map reduce to apply changes on (merge with) old state.

Resources