Best way ho to validate ingested data - validation

I am ingesting data daily from various external sources like GA, scrapers, Google BQ, etc.
I store created CSV file into HDFS, create stage table from it and then append it to historical table in Hadoop.
Can you share some best practices how to valide new data with historical one? Like for example compare row count of actual data with average of last 10 days or someting like that. Is there any prepared solution in spark or something?
Thanks for advices.

Related

Apache Spark: Which data storage and data format to choose

I'm going to write a sales analytics application with Spark. Therefore I get a delta-dataset every night with new sales data (the sellings of the day before). Later I want to realize some analytics like Association-Rules or popularity of products.
The sales data contains information about:
store-id
article-group
timestamp of cash-point
article GTIN
amount
price
So far I used a simple .textFile method and RDDs in my Applications. I heard something about DataFrame and Parquet, which is a table-like data format for text files, right? And what about storing the data once in a database (I have HBase installed in a Hadoop cluster) and later read this?
Can someone give a short overview of the different types of save-/load-possibilities in Spark? And give a recommendation what to use for this data?
The data-amount are actually about 6 GB, which represent data data for 3 stores for about 1 year. Later I will work with data of ~500 stores and time-period of ~5 years.
You can use spark to process that data without any problem. You can read from a csv file as well(there's a library from databricks that supports csv). You can manipulate it, from an rdd your one step closer to turning it into a dataframe. And you can throw the final dataframe dirrectly into HBASE.
All needed documentation you can find here:
http://spark.apache.org/docs/latest/sql-programming-guide.html
https://www.mapr.com/blog/spark-streaming-hbase
Cheers,
Alex

While creating table how to identify the data types in hive

I am learning to use Hadoop for performing Big Data related operations.
I need to perform some queries on a collection of data sets split across 8 csv files. Each csv file has multiple sheets and the query concerns only one of the sheets(Sheet Name: Table4)
The dataset can be downloaded here : http://www.census.gov/hhes/www/hlthins/data/utilization/tables.html
Sample Data snap shot attached for quick reference
I have already converted the above xls file to csv.
Am not sure how to group the data while creating table in Hive.
It will be really helpful if you can guide me here.
Note: I am a novice with Hadoop and Big Data, so if anyone could guide me with how to proceed further I'd be very grateful.
If you need information on the queries or anything else let me know.
Thanks!

Questions about migration, data model and performance of CDH/Impala

I have some questions about migration, data model and performance of Hadoop/Impala.
How to migrate Oracle application to cloudera hadoop/Impala
1.1 How to replace oracle stored procedure in impala or M/R or java/python app.
For example, the original SP include several parameters and sqls.
1.2 How to replace unsupported or complex SQL like over by partition from Oracle to impala.
Are there any existing examples or Impala UDF?
1.3 How to handle update operation since part of data has to be updated.
For example, use data timestamp? use the store model which can support update like HBase? or use delete all data/partition/dir and insert it again(insert overwrite).
Data store model , partition design and query performance
2.1 How to chose impala internal table or external table like csv, parquet, habase?
For example, if there are several kind of data like importing exsited large data in Oracle into hadoop, new business data into hadoop, computed data in hadoop and frequently updated data in hadoop, how to choose the data model? Do you need special attention if the different kind of data need to join?
We have XX TB's data from Oracle, do you have any suggestion about the file format like csv or parquet? Do we need to import the data results into impala internal table or hdfs fs after calculation. If those kind of data can be updated, how to we considered that?
2.2 How to partition the table /external table when joining
For example, there are huge number of sensor data and each one includes measuring data, acquisition timestamp and region information.
We need:
calculate measuring data by different region
Query a series of measuring data during a certain time interval for specific sensor or region.
Query the specific sensor data from huge number of data cross all time.
Query data for all sensors on specific date.
Would you please provide us some suggestion about how to setup up the partition for internal and directories structure for external table(csv) .
In addition, for the structure of the directories, which is better when using date=20090101/area=BEIJING or year=2009/month=01/day=01/area=BEIJING? Is there any guide about that?

Map Reduce - How to plan the data files

I would like to use AWS EMR to query large log files that I will write to S3. I can design the files any way I like. The data is created in a rate of 10K entries/minute.
The logs consist of dozens of data points and I'd like to collect data for very long period of time (years) to compare trends etc.
What are the best practices for creating such files that will be stored on S3 and queried by AWS EMR cluster?
Whats the optimal file sizes ?Should I create separate files for example on hourly basis?
What is the best way to name the files?
Should I place them in daily/hourly buckets or all in the same bucket?
Whats the best way to handle things like adding some data after a while or change in data structure that I use?
Should I compress things for example by leaving out domain names out of urls or keep as much data as possible?
Is there any concept like partitioning (the data is based on 100s of websites so I can use site ids for example). I must be able to query all the data together, or by partitions.
Thanks!
in my opinion you should use a hourly basis bucket to store data in s3 and then use a pipeline to schedule your mr job to clean the data.
once u have clean the data you can keep it to a location in s3 and then you can run a data pipeline on hourly basis on the lag of 1hour with respect to your MR pipeline to put this process data into redshift.
Hence at 3am on a day you will have 3 hour of processed data in s3 and 2 hour processed into redshift dB.
To do this you can have 1 machine dedicated for running pipelines and on that machine you can define you shell script/perl/python or so script to load data to your dB.
You can use AWS bucketing formatter for year,month,date,hour and so on. for e.g.
{format(minusHours(#scheduledStartTime,2),'YYYY')}/mm=#{format(minusHours(#scheduledStartTime,2),'MM')}/dd=#{format(minusHours(#scheduledStartTime,2),'dd')}/hh=#{format(minusHours(#scheduledStartTime,2),'HH')}/*

How to do complex query on big data?

every one.
I have some data about 6G in hdfs that has been exported from mysql.And I have write mapreduces prehandling data to fill some key field that data can be easily queried.
As the business demands are different aggregation data group by day ,hour,hospital,area etc,
so I have to write many hive sqls exporting data to local disk,and then I write python script to parse files on local disk ,then get datas in demand.
Is there some good technique on hadoop to resolve my demand.I am considering.
Can you help me ,please.

Resources