Import SAS data into Hadoop

We are buying third-party survey data. They are providing us the data in SAS format.
Source data format - SAS
Frequency - Daily
Data - Full one year data set (no delta)
We would like to bring this data into our Hadoop environment on a daily basis. What are our options?
We asked them to send the data in a text file, but their text file had 8,650 columns (for example, for Country alone they had 250 columns, one for each country). Our ETL tool failed to process that many columns. According to them, it is much easier to read the data in SAS format.
Any suggestions?
Thanks

The problem here is not a technology problem... it sounds like they are just being unhelpful. I do most of my work in SAS and would never hand someone a table with that many columns and expect them to import it.
Even if they sent it in SAS format, the SAS dataset would still have the same number of columns, and the ETL tool (even if it could read SAS datasets, which is unlikely) would still be likely to fail.
Tell them to transpose the data in SAS so that there are fewer columns, and then re-send it as a text file.
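For illustration only, here is a minimal sketch of the wide-to-long reshape being suggested, written in Python/pandas rather than SAS (in SAS itself this is typically done with PROC TRANSPOSE, as in the link the asker found below). The column names and the tiny sample are invented for the example:

    # A minimal sketch (not SAS) of the wide-to-long reshape suggested above,
    # using Python/pandas. Column names and values are invented; the real file
    # reportedly has ~8,650 columns.
    import pandas as pd

    wide = pd.DataFrame({
        "respondent_id": [1, 2],
        "country_US": [10, 20],
        "country_DE": [30, 40],
        "country_FR": [50, 60],
    })

    # Melt the many country_* columns into two columns: country and value.
    long_df = wide.melt(
        id_vars=["respondent_id"],
        value_vars=[c for c in wide.columns if c.startswith("country_")],
        var_name="country",
        value_name="value",
    )
    long_df["country"] = long_df["country"].str.replace("country_", "", regex=False)
    print(long_df)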

Thanks everyone. I think this would solve my issue:
http://www.ats.ucla.edu/stat/sas/modules/tolong.htm

Related

Best way to validate ingested data

I am ingesting data daily from various external sources like GA, scrapers, Google BQ, etc.
I store the resulting CSV file in HDFS, create a stage table from it, and then append it to a historical table in Hadoop.
Can you share some best practices for validating the new data against the historical data? For example, comparing the row count of the current load with the average of the last 10 days, or something like that. Is there a ready-made solution for this in Spark or elsewhere?
Thanks for any advice.
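As a rough illustration of the row-count check described above, here is a minimal PySpark sketch. The table and column names (staging_table, history, load_date) and the 20% threshold are assumptions for the example, not from the question:

    # Compare the row count of the newly ingested load with the average daily
    # row count of the last 10 days in the historical table.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    new_count = spark.table("staging_table").count()

    recent_avg = (
        spark.table("history")
             .where(F.col("load_date") >= F.date_sub(F.current_date(), 10))
             .groupBy("load_date")
             .count()
             .agg(F.avg("count").alias("avg_count"))
             .collect()[0]["avg_count"]
    )

    # Flag the load if today's volume deviates more than 20% from the recent average.
    if recent_avg and abs(new_count - recent_avg) / recent_avg > 0.2:
        print(f"WARNING: row count {new_count} deviates from 10-day average {recent_avg:.0f}")
    else:
        print(f"OK: {new_count} rows ingested")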

Read data from multiple tables, evaluate results, and generate a report

I have a question about an effective way of reading values from a DB and generating a report.
I use Hadoop to read data from multiple tables and do data analysis based on the results.
I want to know if there is an effective tool or way to read data from multiple tables, check whether the values of certain columns are the same across tables, and send a report if they are not. I have two options: I can either read the data from Hadoop or connect to the DB2 database and do it there. Without writing a new Java program, is there a tool that can help with this, like the Talend tool that reads XML and writes the output to a DB?
You can use Talend for this. With Talend you can read data from Hadoop as well as from a database, perform your operations on the fetched data in between, and generate the report.
If you work with a lot of data and do this sort of thing often, Elasticsearch is also a great help in this area. Use the ELK stack, although you would not necessarily need the 'L' (Logstash) part of it.
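As a rough sketch of the cross-table comparison being described (whether against Hive tables or against DB2 over JDBC), here is one way it could look in PySpark. The table and column names (table_a, table_b, id, amount) and the JDBC details are placeholders, not from the thread:

    # Check that a given column holds the same values in two tables and
    # report the mismatches.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    a = spark.table("table_a").select("id", "amount")

    # The second table could instead be pulled from DB2 over JDBC, e.g.:
    # b = spark.read.jdbc(url="jdbc:db2://host:50000/DB", table="SCHEMA.TABLE_B",
    #                     properties={"user": "...", "password": "...",
    #                                 "driver": "com.ibm.db2.jcc.DB2Driver"})
    b = spark.table("table_b").select("id", "amount")

    # Rows where the column values differ (or exist on only one side).
    mismatches = (
        a.withColumnRenamed("amount", "amount_a")
         .join(b.withColumnRenamed("amount", "amount_b"), on="id", how="full_outer")
         .where("NOT (amount_a <=> amount_b)")
    )

    if mismatches.count() > 0:
        # Write the discrepancies out so they can be attached to a report.
        mismatches.write.mode("overwrite").csv("/tmp/column_mismatch_report")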

Apache Spark: Which data storage and data format to choose

I'm going to write a sales analytics application with Spark. Every night I get a delta dataset with new sales data (the sales of the previous day). Later I want to implement some analytics such as association rules or product popularity.
The sales data contains information about:
store-id
article-group
timestamp of cash-point
article GTIN
amount
price
So far I have used the simple .textFile method and RDDs in my applications. I have heard something about DataFrames and Parquet, which is a table-like data format, right? And what about storing the data in a database instead (I have HBase installed in a Hadoop cluster) and reading it from there later?
Can someone give a short overview of the different save/load possibilities in Spark, and a recommendation for what to use for this data?
The data volume is currently about 6 GB, which represents data for 3 stores over about 1 year. Later I will work with data from ~500 stores over a period of ~5 years.
You can use Spark to process that data without any problem. You can read from a CSV file as well (there's a library from Databricks that supports CSV). You can manipulate it, and from an RDD you're one step away from turning it into a DataFrame. You can also write the final DataFrame directly into HBase.
You can find all the documentation you need here:
http://spark.apache.org/docs/latest/sql-programming-guide.html
https://www.mapr.com/blog/spark-streaming-hbase
Cheers,
Alex
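To make the DataFrame/Parquet suggestion concrete, here is a hedged sketch assuming Spark 2.x or later, where CSV support is built in (on Spark 1.x the Databricks spark-csv package mentioned above fills the same role). The paths, schema, and separator are illustrative only, and writing to HBase would additionally need a connector such as hbase-spark, which is not shown:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   TimestampType, IntegerType, DecimalType)

    spark = SparkSession.builder.appName("sales-load").getOrCreate()

    schema = StructType([
        StructField("store_id", StringType()),
        StructField("article_group", StringType()),
        StructField("cashpoint_ts", TimestampType()),
        StructField("gtin", StringType()),
        StructField("amount", IntegerType()),
        StructField("price", DecimalType(10, 2)),
    ])

    # Read the nightly delta file as a DataFrame instead of a plain-text RDD.
    delta = spark.read.csv("/data/sales/incoming/2016-01-01.csv",
                           schema=schema, header=True, sep=";")

    # Append it to a Parquet dataset partitioned by store; Parquet is a
    # compressed, columnar binary format that Spark SQL can query efficiently.
    delta.write.mode("append").partitionBy("store_id").parquet("/data/sales/parquet")

    # Later analytics can read the whole history back as a DataFrame:
    sales = spark.read.parquet("/data/sales/parquet")
    sales.groupBy("article_group").count().show()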

How to identify the data types when creating a table in Hive

I am learning to use Hadoop for performing Big Data related operations.
I need to perform some queries on a collection of data sets split across 8 CSV files. Each file has multiple sheets, and the query concerns only one of them (sheet name: Table4).
The dataset can be downloaded here : http://www.census.gov/hhes/www/hlthins/data/utilization/tables.html
A sample data snapshot is attached for quick reference.
I have already converted the above XLS files to CSV.
I am not sure how to group the data while creating the table in Hive.
It will be really helpful if you can guide me here.
Note: I am a novice with Hadoop and Big Data, so if anyone could guide me with how to proceed further I'd be very grateful.
If you need information on the queries or anything else let me know.
Thanks!
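Since the exact layout of Table4 is not reproduced here, the following is only a hedged illustration of how column types are declared when putting a Hive table over a CSV file (run through Spark's Hive support in this sketch). The column names, types, and location are placeholders to be replaced with the real ones:

    # Hedged illustration only: column names/types and the HDFS location are
    # placeholders, since the real Table4 layout is not shown in the question.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS table4 (
            state        STRING,    -- text columns -> STRING
            year         INT,       -- whole numbers -> INT or BIGINT
            visits       BIGINT,
            rate_percent DOUBLE     -- fractional numbers -> DOUBLE or DECIMAL(p,s)
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION '/user/hive/warehouse/census/table4'
    """)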

Modeling Data in Hadoop

Currently I am bringing around 10 tables from an EDW (Enterprise Data Warehouse) into Hadoop; these tables closely follow a star schema model. I'm using Sqoop to bring all of these tables across, resulting in 10 directories containing CSV files.
I'm looking at better ways to store these files before kicking off MR jobs. Should I follow some kind of model or build an aggregate before working on the MR jobs? I'm basically looking at ways of storing related data together.
Most of what I have found by searching covers storing trivial CSV files and reading them with opencsv. I'm looking for something a bit more involved, and not just for CSV files. If moving to another format works better than CSV, that is no problem.
It boils down to: what is the best way to store a bunch of related data in HDFS so that working with MR is a good experience?
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data plus schema in the same file while remaining compact and efficient for fast serialization. It gives you versioning facilities, which are useful when bringing in updated data with a different schema. Hive supports both reading and writing it, and MapReduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
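As a small follow-on sketch (an assumption, not part of the original answer): once Sqoop has written Avro files (for example with its --as-avrodatafile option), Spark can read them back through the spark-avro module, which ships with Spark 2.4+ but still has to be added to the job (e.g. via --packages). The HDFS path and column name below are placeholders:

    # Reading Sqoop-produced Avro files back into Spark. Assumes the spark-avro
    # module is available (bundled with Spark 2.4+, added via --packages).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    orders = spark.read.format("avro").load("/data/edw/orders_avro")

    # The schema embedded in the Avro files becomes the DataFrame schema.
    orders.printSchema()
    orders.groupBy("order_status").count().show()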
Storing these files as CSV is fine, since you will be able to process them as plain text and can also read them through Hive using a specific delimiter. You can change the delimiter if you do not like commas, for example to pipe ('|'), which is what I do most of the time. You generally want large files in Hadoop, but if the data is large enough to partition, with each partition in the range of a few hundred gigabytes, then it is a good idea to split the files into separate directories based on your partition column.
It is also usually better to have most of the columns in a single table than to have many small normalized tables, but that depends on your data size. Also, whenever you copy, move, or create data, make sure you do all the constraint checks in your applications, because it will be difficult to make small changes to the table later on; you would need to rewrite the complete file even for a small change.
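As a hedged sketch of the two points above (a pipe delimiter and one directory per partition value), here is how it could look in Spark; the paths and the load_date partition column are examples, not from the answer:

    # Read pipe-delimited files, then write them back out with one
    # sub-directory per value of the partition column
    # (.../load_date=2016-01-01/...), which Hive can map to partitions.
    # Paths and column names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.csv("/data/edw/orders_csv", sep="|", header=True, inferSchema=True)

    df.write.mode("overwrite").partitionBy("load_date").csv(
        "/data/edw/orders_partitioned", sep="|")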
Hive's partitioning and bucketing concepts can be used to effectively put similar data together (not on specific nodes, but in files and folders) based on a particular column. There are some nice tutorials available on partitioning and bucketing.
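In Hive DDL this is expressed with PARTITIONED BY and CLUSTERED BY ... INTO n BUCKETS; as a hedged alternative sketch, a similar layout can be produced from Spark with the DataFrameWriter (Spark 2.x+). The table, path, and column names are placeholders for whichever fact table is being stored:

    # Produce a partitioned + bucketed table from Spark (the Hive-DDL equivalent
    # uses PARTITIONED BY and CLUSTERED BY ... INTO n BUCKETS).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    fact = spark.read.parquet("/data/edw/fact_sales")   # placeholder input

    (fact.write
         .partitionBy("sale_date")        # one sub-directory per day
         .bucketBy(32, "customer_id")     # rows with the same key go to the same bucket file
         .sortBy("customer_id")
         .format("parquet")
         .mode("overwrite")
         .saveAsTable("fact_sales_bucketed"))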
