Approach to upload multiple interconnected csv files to HBase - hadoop

I am new to HBase and still not sure which component of Hadoop ecosystem I will use in my case and how to analyse my data later so just exploring options.
I have an Excel sheet with a summary about all the customers like this but with ≈ 400 columns:
CustomerID Country Age E-mail
251648 Russia 27 boo#yahoo.com
487985 USA 30 foo#yahoo.com
478945 England 15 lala#yahoo.com
789456 USA 25 nana#yahoo.com
Also, I have .xls files created separately for each customer with an information about him (one customer = one .xls file), the number of columns and names of columns are the same in each file. Each of these files are named with a CustomerID. A one looks like this:
'customerID_251648.xls':
feature1 feature2 feature3 feature4
0 33,878 yes 789,598
1 48,457 yes 879,594
1 78,495 yes 487,457
0 94,589 no 787,475
I have converted all these files into .csv format and now feeling stuck which component of Hadoop ecosystem should I use for storing and querying such a data.
My eventual goal is to query some customerID and to get all the information about a customer from all the files.
I think that HBase fits perfectly for that because I can create such a schema:
row key timestamp Column Family 1 Column Family 2
251648 Country Age E-Mail Feature1 Feature2 Feature3 Feature4
What is the best approach to upload and query such a data in HBase? Should I first combine an information about a customer from different sources and then upload it to HBase? Or I can keep different .csv files for each customer and when uploading to HBase choose somehow which .csv to use for forming column-families?
For querying data stored in HBase I am going to write MapReduce tasks via Python API.
Any help would be very approciated!

You are correct with schema design, also remember that hbase loads the whole column family during scans, so if you need all the data at one time maybe its better to place everything in one column family.
A simple way to load the data will be to scan first file with customers and fetch the data from the second file on fly. Bulk CSV load could be faster in execution time, but you'll spend more time writing code.
Maybe you also need to think about the row key because HBase stores data in alphabetical order. If you have a lot of data, you'd better create table with given split-keys rather than let HBase do the splits because it can end up with unbalanced regions.

Related

Best way ho to validate ingested data

I am ingesting data daily from various external sources like GA, scrapers, Google BQ, etc.
I store created CSV file into HDFS, create stage table from it and then append it to historical table in Hadoop.
Can you share some best practices how to valide new data with historical one? Like for example compare row count of actual data with average of last 10 days or someting like that. Is there any prepared solution in spark or something?
Thanks for advices.

How to deal with primary key check like situations in Hive?

I have to load the incremental load to my base table (say table_stg) everyday once. I get the snapshot of data everyday from various sources in xml format. The id column is supposed to be unique but since data is coming from different sources, there is a chance of duplicate data.
day1:
table_stg
id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
day2: (say the xml is parsed and loaded into table_inter as below)
id,col2,col3,ts,col4
4,a,b,2016-06-25 01:12:27.000417,l
2,e,f,2016-06-25 01:12:27.000417,k
5,w,c,2016-06-25 01:12:27.000417,f
5,w,c,2016-06-25 01:12:27.000417,f
when i put this data ino table_stg, my final output should be:
id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
4,a,b,2016-06-25 01:12:27.000417,l
5,w,c,2016-06-25 01:12:27.000417,f
What could be the best way to handle these kind of situations(without deleting the table_stg(base table) and reloading the whole data)
Hive does allow duplicates on primary and unique keys.You should have an upstream job doing the data cleaning before loading it into the Hive table.
You can write a python script for that if data is less or use spark if data size is huge.
spark provides dropDuplicates() method to achieve this.

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to Hadoop Ecosystem and I need some suggestion from Bigdata experts on achieving schema verification/validation before loading the huge data into hdfs.
The scenario is:
I have a huge dataset with given schema (having around 200
column-header in it). This dataset is going to be stored in Hive
tables/HDFS. Before loading the data into hive table/hdfs I want to
perform a schema level verification/validation on the data supplied to
avoid any unwanted errors/exception while loading the data into hdfs.
Like in case somebody tries to pass a data file having fewer or more
number of columns in it then at the first level of verification this
load fail.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them into HDFS,and run map reduce on top of that. Here you would be having a hold on each row, so you can verify number of columns, their types and any other validations.
When i referred to jason/xml, there is slight overhead to make map reduce identify the records in that format. However with respect to validation there is schema validation which you can enforce and also define only specific values for a field using schema. So once the schema is ready, your parsing(xml to java) and then store them at another final HDFS location for further use(like HBase). When you are sure that data is validated, you can create Hive tables on top of that.
Use below utility to create temp tables every time based on the schema you receive in csv file format in staging directory and then apply some conditions to identify whether you have valid columns or not. Finally load into original table.
https://github.com/enahwe/Csv2Hive

How to do complex query on big data?

every one.
I have some data about 6G in hdfs that has been exported from mysql.And I have write mapreduces prehandling data to fill some key field that data can be easily queried.
As the business demands are different aggregation data group by day ,hour,hospital,area etc,
so I have to write many hive sqls exporting data to local disk,and then I write python script to parse files on local disk ,then get datas in demand.
Is there some good technique on hadoop to resolve my demand.I am considering.
Can you help me ,please.

How do I store data in multiple, partitioned files on HDFS using Pig

I've got a pig job that analyzes a large number of log files and generates a relationship between a group of attributes and a bag of IDs that have those attributes. I'd like to store that relationship on HDFS, but I'd like to do so in a way that is friendly for other Hive/Pig/MapReduce jobs to operate on the data, or subsets of the data without having to ingest the full output of my pig job, as that is a significant amount of data.
For example, if the schema of my relationship is something like:
relation: {group: (attr1: long,attr2: chararray,attr3: chararray),ids: {(id: chararray)}}
I'd really like to be able to partition this data, storing it in a file structure that looks like:
/results/attr1/attr2/attr3/file(s)
where the attrX values in the path are the values from the group, and the file(s) contain only ids. This would allow me to easily subset my data for subsequent analysis without duplicating data.
Is such a thing possible, even with a custom StoreFunc? Is there a different approach that I should be taking to accomplish this goal?
I'm pretty new to Pig, so any help or general suggestions about my approach would be greatly appreciated.
Thanks in advance.
Multistore wasn't a perfect fit for what I was trying to do, but it proved a good example of how to write a custom StoreFunc that writes multiple, partitioned output files. I downloaded the Pig source code and created my own storage function that parsed the group tuple, using each of the items to build up the HDFS path, and then parsed the bag of ids, writing one ID per line into the result file.

Resources