Handling incremental data - Hadoop - hadoop

We had 5 years of data in cluster and we are loading the data everyday. The data that gets added everyday might contain duplicate data , partially modified data etc ..
1 . How to handle duplicate data - should that be handled as part of highlevel programming interfaces pig, hive etc .. or any other alternatives.
Say if there is a usecase to find out what is changed between two records given the key to find out the row.
What is the best way to model the data, using which hadoop eco system components.

How to handle duplicate data
It's very hard to remove duplicates from HDFS raw data,
so I guess your approach is right: remove using pig or hive while loading those data.
Say if there is a usecase to find out what is changed between two records given the key to find out the row.
For this case, do you meaning that two records has the same key?
Then what kind of changes you want to capture?

When you say that, you need to remove duplicates and also the delta between two records when you know the key, you should have some criteria of which data to be removed in case of partial changed data.
In both scenarios, you can have a handle of the key and write logic to remove duplicates. Map reduce seems to be a good choice, given the parallelism, performance and ability to manage based on keys. Mostly your requirements could be handled in reducer

See if Sqoop-merge fits your use case.
From the doc:
The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.

Related

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to Hadoop Ecosystem and I need some suggestion from Bigdata experts on achieving schema verification/validation before loading the huge data into hdfs.
The scenario is:
I have a huge dataset with given schema (having around 200
column-header in it). This dataset is going to be stored in Hive
tables/HDFS. Before loading the data into hive table/hdfs I want to
perform a schema level verification/validation on the data supplied to
avoid any unwanted errors/exception while loading the data into hdfs.
Like in case somebody tries to pass a data file having fewer or more
number of columns in it then at the first level of verification this
load fail.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them into HDFS,and run map reduce on top of that. Here you would be having a hold on each row, so you can verify number of columns, their types and any other validations.
When i referred to jason/xml, there is slight overhead to make map reduce identify the records in that format. However with respect to validation there is schema validation which you can enforce and also define only specific values for a field using schema. So once the schema is ready, your parsing(xml to java) and then store them at another final HDFS location for further use(like HBase). When you are sure that data is validated, you can create Hive tables on top of that.
Use below utility to create temp tables every time based on the schema you receive in csv file format in staging directory and then apply some conditions to identify whether you have valid columns or not. Finally load into original table.
https://github.com/enahwe/Csv2Hive

Bulk Update of Particular fields In Hbase

I have a scenario while was working on Hbase. Initially I have to bulkupload a csv file to Hbase table.Which I could do successfully by using Hbase bulkloading.
Now I want to update a particular field in hbase table by comparing to an new csv provided and if the value is updated have to maintain a flag which says the rowkey was updated. Any hint how I can do it easily.
Any help is really appreciated.
Thanks
HBase maintains versions for each cell. As long as you have the row key with you, you get a handle of the row, and you can just use put to add the updated column. Internally it maintains the versions, and you can have access to history of the updated values too.
However, you need comparing too, as I can see. So after bulk loading the fastest you can do it, use a map reduce as have HBase as source and sink. Look here at 7.2.2 section.
The idea is have mapreduce perform the scan, do comparision in map, and write the new updated put in output. Its like a basic fetch, modify and update sequence. But we are using map reduce parallel feature as we are dealing with large amount of data

Modeling Data in Hadoop

Currently I am bringing into Hadoop around 10 tables from an EDW (Enterprise Data Warehouse), these tables are closely related to a Star Schema model. I'm usig Sqoop to bring all these tables across, resulting in 10 directories containing csv files.
I'm looking at what are some better ways to store these files before striking off MR jobs. Should I follow some kind of model or build an aggregate before working on MR jobs? I'm basically looking at how might be some ways of storing related data together.
Most things I have found by searching are storing trivial csv files and reading them with opencsv. I'm looking for something a bit more involved and not just for csv files. If moving towards another format works better than csv, then that is no problem.
Boils down to: How best to store a bunch of related data in HDFS to have a good experience with MR.
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data+schema in the same file but is compact and efficient for fast serialization. It gives you versioning facilities which are useful when bringing in updated data with a different schema. Hive supports it in both reading and writing and Map Reduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
Storing these files in csv is fine. Since you will be able to process these files using text output format and could also read it through hive using specific delimiter. You could change the delimiter if you do not like comma to pipe("|") that's what I do most of the time. Also you generally need to have large files in hadoop but if its large enough that you can partition these files and each file partition is in the size of few 100 gigs then it would be a good to partition these files into separate directory based on your partition column.
Also it would be better idea to have most of the columns in single table than having many normalized small tables. But that varies depending on your data size. Also make sure whenever you copy , move or create data you do all the constraint check on your applications as it will be difficult to make small changes in the table later on, you will need to modify the complete file for even small change.
Hive Partitioning and Bucketing concepts can be used to effectively used to put similar data together (not in nodes, but in files and folders) based on a particular column. Here are some nice tutorials for Partitioning and Bucketing.

Filter data on row key in Random Partitioner

I'm working on Cassandra Hadoop integration (MapReduce). We have used RandomPartitioner to insert data to gain faster write speed. Now we have to read that data from Cassandra in MapReduce and perform some calculations on it.
From the lots of data we have in cassandra we want to fetch data only for particular row keys but we are unable to do it due to RandomPartitioner - there is an assertion in the code.
Can anyone please guide me how should I filter data based on row key on the Cassandra level itself (I know data is distributed across regions using hash of the row key)?
Would using secondary indexes (still trying to understand how they works) solve my problem or is there some other way around it?
I want to use cassandra MR to calculate some KPI's on the data which is stored in cassandra continuously. So here fetching whole data from cassandra every time seems an overhead to me? The rowkey I'm using is like "(timestamp/60000)_otherid"; this CF contains reference of rowkeys of actual data stored in other CF. so to calculate KPI I will work for a particular minute and fetch data from other CF, and process it.
When using RandomPartitioner, keys are not sorted, so you cannot do a range query on your keys to limit the data. Secondary indexes work on columns not keys, so they won't help you either. You have two options for filtering the data:
Choose a data model that allows you to specify a thrift SlicePredicate, which will give you a range of columns regardless of key, like this:
SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(ByteBufferUtil.bytes(start), ByteBufferUtil.bytes(end), false, Integer.MAX_VALUE));
ConfigHelper.setInputSlicePredicate(conf, predicate);
Or use your map stage to do this by simply ignoring input keys that are outside your desired range.
I am unfamiliar with the Cassandra Hadoop integration but trying to understand how to use the hash system to query the data yourself is likely the wrong way to go.
I would look at the Cassandra client you are using (Hector, Astynax, etc.) and ask how to query by row keys from that.
Querying by the row key is a very common operation in Cassandra.
Essentially if you want to still use a RandomPartitioner and want the ability to do range slices you will need to create a reverse index (a.k.a. inverted index). I have answered a similar question here that involved timestamps.
Having the ability to generate your rowkeys programmatically allows you to emulate a range slice on rowkeys. To do this you must write your own InputFormat class and generate your splits manually.

Using Hadoop to process data from multiple datasources

Does mapreduce and any of the other hadoop technologies (HBase, Hive, pig etc) lend themselves well to situations where you have multiple input files and where data needs to be compared between the different datasources.
In the past I've written a few mapreduce jobs using Hadoop and Pig. However these tasks were quite simple since they involved manipulating only a single dataset. The requirements we have now, dictates that we read data from multiple sources and perform comparisons on various data elements on another datasource. We then report on the differences. The datasets we are working with are in the region of 10million - 60million records and so far we haven't manage to make these jobs fast enough.
Is there a case for using mapreduce in order to solve such issues or am I going down the wrong route.
Any suggestions are much appreciated.
I guess I'd preprocess the different datasets into a common format (being sure to include a "data source" id column with a single unique value for each row coming from the same dataset). Then move the files into the same directory, load the whole dir and treat it as a single data source in which you compare the properties of rows based on their dataset id.
Yes, you can join multiple datasets in a mapreduce job. I would recommend getting a copy of the book/ebook Hadoop In Action which addresses joining data from multiple sources.
When you have multiple input files you can use MapReduce API FileInputFormat.addInputPaths() in which can take a comma separated list of multiple files, as below:
FileInputFormat.addInputPaths("dir1/file1,dir2/file2,dir3/file3");
You can also pass multiple inputs into a Mapper in hadoop using Distributed Cache, more info is described here: multiple input into a Mapper in hadoop
If i am not misunderstanding you are trying to normalize the structured data in records, coming in from several inputs and then process it. Based on this, i think you really need to look at this article which helped me in past. It included How To Normalize Data Using Hadoop/MapReduce as below:
Step 1: Extract the column value pairs from the original data.
Step 2: Extract column-value Pairs Not In Master ID File
Step 3: Calculate the Maximum ID for Each Column in the Master File
Step 4: Calculate a New ID for the Unmatched Values
Step 5: Merge the New Ids with the Existing Master IDs
Step 6: Replace the Values in the Original Data with IDs
Using MultipleInputs we can do this.
MutlipleInputs.addInputPath(job, Mapper1.class, TextInputFormat.class,path1);
MutlipleInputs.addInputPath(job, Mapper2.class, TextInputFormat.class,path2);
job.setReducerClass(Reducer1.class);
//FileOutputFormat.setOutputPath(); set output path here
If both classes have a common key, then they can be joined in reducer and do the necessary logics

Resources