Questions about migration, data model and performance of CDH/Impala - Oracle

I have some questions about migration, the data model and performance of Hadoop/Impala.
1. How to migrate an Oracle application to Cloudera Hadoop/Impala
1.1 How to replace an Oracle stored procedure with Impala, MapReduce, or a Java/Python application?
For example, the original stored procedure includes several parameters and SQL statements.
1.2 How to replace unsupported or complex SQL, such as OVER (PARTITION BY ...), when moving from Oracle to Impala?
Are there any existing examples or Impala UDFs?
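For illustration, one common workaround is to rewrite a windowed aggregate as a join against a grouped subquery; a minimal sketch, assuming a hypothetical sensor_readings table:
-- Oracle: SELECT region, reading_value, MAX(reading_value) OVER (PARTITION BY region) FROM sensor_readings;
-- Without analytic functions: pre-aggregate per region, then join back.
SELECT r.region, r.reading_value, m.region_max
FROM sensor_readings r
JOIN (SELECT region, MAX(reading_value) AS region_max
      FROM sensor_readings
      GROUP BY region) m
ON r.region = m.region;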
1.3 How to handle update operations, since part of the data has to be updated?
For example, should we use a data timestamp, use a storage model that supports updates such as HBase, or delete the whole data set/partition/directory and insert it again (INSERT OVERWRITE)?
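As a rough sketch of the delete-and-reload (INSERT OVERWRITE) option; table, partition and column names here are hypothetical:
-- Recompute one day of data and replace that partition wholesale;
-- rows in other partitions are untouched.
INSERT OVERWRITE TABLE measurements PARTITION (day_id = 20141020)
SELECT sensor_id, reading_value, acquisition_ts
FROM measurements_staging
WHERE day_id = 20141020;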
2. Data storage model, partition design and query performance
2.1 How to choose between Impala internal tables and external tables over CSV, Parquet or HBase?
For example, there are several kinds of data: existing large data sets imported from Oracle into Hadoop, new business data arriving in Hadoop, data computed in Hadoop, and frequently updated data in Hadoop. How should we choose the data model for each? Does anything need special attention if the different kinds of data have to be joined?
We have XX TB of data from Oracle; do you have any suggestions about the file format, such as CSV or Parquet? Do we need to import the results into an Impala internal table or leave them on HDFS after the calculation? If those kinds of data can be updated, how should we account for that?
2.2 How to partition the internal/external tables when joining
For example, there is a huge volume of sensor data, and each record includes the measured value, an acquisition timestamp and region information.
We need:
Calculate the measured data by region.
Query a series of measurements during a certain time interval for a specific sensor or region.
Query a specific sensor's data across all time, out of a huge volume of data.
Query data for all sensors on a specific date.
Would you please give us some suggestions on how to set up the partitions for an internal table and the directory structure for an external (CSV) table?
In addition, for the directory structure, which is better: date=20090101/area=BEIJING or year=2009/month=01/day=01/area=BEIJING? Is there any guide about that?
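For illustration only, here is a minimal Impala/Hive sketch of the second layout as an internal table (table and column names are hypothetical; an external CSV table would add ROW FORMAT and LOCATION clauses to the same partition scheme):
CREATE TABLE sensor_data (
  sensor_id      STRING,
  reading_value  DOUBLE,
  acquisition_ts TIMESTAMP
)
PARTITIONED BY (year INT, month INT, day INT, area STRING)
STORED AS PARQUET;
-- Partition pruning then applies to queries such as:
-- SELECT ... FROM sensor_data WHERE year = 2009 AND month = 1 AND day = 1 AND area = 'BEIJING';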

Related

Cassandra for a data warehouse

Is Cassandra a good alternative to Hadoop as a data warehouse, where data is append-only and updates in the source databases should not overwrite existing rows in the warehouse but be appended instead? Is Cassandra really meant to act as a data warehouse, or just as a database to store the results of batch/stream queries?
Cassandra can be used both as a data warehouse (raw data storage) and as a database (final data storage). It depends more on what you want to do with the data.
You may even need both Hadoop and Cassandra for different purposes.
Assume you need to gather and process data from multiple mobile devices and provide a complex aggregation report to the user.
So first you need to save the data as fast as possible (as new portions arrive very often), so you use Cassandra here. As Cassandra is limited in aggregation features, you load the data into HDFS and do the processing via HQL scripts (assuming you are not great at coding but very good at complicated SQL). Then you move the report results from HDFS to Cassandra, into a dedicated reports table partitioned by user id.
So when the user wants an aggregation report about his activity in the last month, the application takes the id of the active user and returns the aggregated result from Cassandra (as it is a simple key-value lookup).
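As a loose sketch of the HDFS-side aggregation step in this scenario (table and column names are hypothetical), the HQL could look something like:
-- Aggregate raw device events per user for one month; the result set is what
-- would then be copied into the Cassandra reports table keyed by user id.
INSERT OVERWRITE TABLE monthly_user_report PARTITION (report_month = '2014-10')
SELECT user_id,
       COUNT(*)        AS event_count,
       SUM(duration_s) AS total_duration_s
FROM raw_device_events
WHERE event_month = '2014-10'
GROUP BY user_id;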
So, to answer your question: yes, it could be an alternative, but the selection strategy depends on the data types and your application's business cases.
You can read more information about Cassandra use cases here.

What is the best way to ingest data from Teradata into Hadoop with Informatica?

What is the best way to ingest data from a Teradata database into Hadoop with parallel data movement?
If we create a job that simply opens one session to the Teradata database, it will take a lot of time to load a huge table.
If we create a set of sessions to load the data in parallel, and also run a SELECT in each of the sessions, Teradata will perform a set of full table scans to produce the data.
What is the recommended best practice to load data in parallel streams without putting unnecessary workload on Teradata?
If Teradata supports table partitioning like Oracle, you could try reading the table based on partitioning points, which will enable parallelism in the read...
Another option is to split the table into multiple portions, for example by adding a WHERE clause on an indexed column. This ensures an index scan and lets you avoid a full table scan.
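For example (hypothetical table and column), each parallel session could read one disjoint key range so that together the sessions cover the table exactly once:
-- Session 1
SELECT * FROM sales_fact WHERE sale_id >= 1       AND sale_id < 1000000;
-- Session 2
SELECT * FROM sales_fact WHERE sale_id >= 1000000 AND sale_id < 2000000;
-- ...and so on, one contiguous, non-overlapping range per session.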
The most scalable way I have found to ingest data into Hadoop from Teradata is to use the Teradata connector for Hadoop. It is included in the Cloudera and Hortonworks distributions. I will show an example based on the Cloudera documentation, but the same works with Hortonworks as well:
Informatica Big Data Edition uses the standard Sqoop invocation via the command line, submitting a set of parameters to it. So the main question is which driver to use to make parallel connections between the two MPP systems.
Here is the link to the Cloudera documentation:
Using the Cloudera Connector Powered by Teradata
And here is a digest from that documentation (note that this connector supports different kinds of load balancing between connections):
Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:
split.by.amp
split.by.value
split.by.partition
split.by.hash
split.by.amp Method
This optimal method retrieves data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper subsequently retrieves data from each AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.
If you use partition names in the SELECT clause, PowerCenter will select only the rows within that partition, so there won't be duplicate reads (don't forget to choose Database partitioning at the Informatica session level). However, if you use key-range partitioning, you have to choose the ranges in the settings, as you mentioned. Usually we use the Oracle NTILE analytic function to split the table into multiple portions so that the reads are unique across the SELECTs. If you have a range/auto-generated/surrogate key column in the table, use it in the WHERE clause: write a sub-query to divide the table into multiple portions.
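A rough sketch of the NTILE-based split mentioned above (table, column and bucket count are hypothetical):
-- Each parallel session reads exactly one bucket, so the reads do not overlap.
SELECT *
FROM (SELECT t.*, NTILE(4) OVER (ORDER BY surrogate_key) AS bucket_no
      FROM   source_table t) s
WHERE s.bucket_no = 1;   -- session 2 filters bucket_no = 2, and so on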

How to do complex queries on big data?

Hi everyone,
I have about 6 GB of data in HDFS that was exported from MySQL, and I have written MapReduce jobs to pre-process the data and fill in some key fields so that it can be queried easily.
Since the business requires different aggregations grouped by day, hour, hospital, area, etc.,
I have to write many Hive SQL queries that export data to local disk, and then write Python scripts to parse the files on local disk to get the data I need.
Is there a good technique on Hadoop to meet this requirement? I am still considering my options.
Can you help me, please?
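As an illustration of the kind of consolidation being asked about here, Hive can compute several GROUP BY combinations in a single pass with GROUPING SETS; a minimal sketch, assuming a hypothetical visits table with pre-extracted day/hour columns:
-- One scan of the data produces several aggregation levels at once;
-- in the output, the unused grouping columns of each row are NULL.
SELECT visit_day, visit_hour, hospital, area, COUNT(*) AS cnt
FROM visits
GROUP BY visit_day, visit_hour, hospital, area
GROUPING SETS (
  (visit_day),
  (visit_day, visit_hour),
  (visit_day, hospital),
  (visit_day, area)
);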

How to optimize Hive queries with an external table and SerDe

Part 1: my environment
I have the following files uploaded to Hadoop:
They are plain text.
Each line contains JSON like:
{code:[int], customerId:[string], data:{[something more here]}}
code values are numbers from 1 to 3000,
customerId values total up to 4 million, with up to 0.5 million per day
All files are gzipped
In Hive I created an external table with a custom JSON SerDe (let's call it CUSTOMER_DATA)
All files from each date are stored in a separate directory, and I use those directories as partitions in the Hive table
Most of my queries filter by date, code and customerId. I also have a second file with the following format (let's call it CUSTOMER_ATTRIBUTES):
[customerId] [attribute_1] [attribute_2] ... [attribute_n]
which contains data for all my customers, so it has up to 4 million rows.
I query and filter my data in the following ways:
Filtering by date: partitions do the job here, using WHERE partitionDate IN (20141020,20141020)
Filtering by code, using a statement such as WHERE code IN (1,4,5,33,6784)
Joining table CUSTOMER_ATTRIBUTES with CUSTOMER_DATA, with a query like:
SELECT customerId
FROM CUSTOMER_DATA
JOIN CUSTOMER_ATTRIBUTES ON (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId)
WHERE CUSTOMER_ATTRIBUTES.attribute_1=[something]
Part 2: question
Is there any efficient way to optimize my queries? I have read about indexes and buckets, but I don't know whether I can use them with external tables and whether they will optimize my queries.
Performance on search:
Internal or external tables make no difference as far as performance is concerned. You can build indexes on both. Either way, building indexes on large data sets is counter-intuitive.
Bucketing the data on your search columns would give a lot of performance gains, but whether you can bucket your data or not depends on your use case.
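A minimal sketch of a bucketed layout for this case (table name, bucket count and column types are hypothetical):
-- Bucketing by customerId groups all rows for a customer into one bucket file,
-- which speeds up lookups and joins on customerId.
CREATE TABLE customer_data_bucketed (
  code       INT,
  customerId STRING,
  data       STRING
)
PARTITIONED BY (partitionDate INT)
CLUSTERED BY (customerId) INTO 64 BUCKETS;
-- Populate with INSERT ... SELECT while hive.enforce.bucketing is set to true.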
You can also consider further partitioning (if possible) on code/customerId to get more gains. Hopefully you don't have too many unique code or customerId values.
Rather than trying these things out on your textual JSON-formatted data, I would strongly suggest you move away from JSON text data. Parsing JSON (text) is a big performance killer.
These days there are a lot of file formats which work pretty well. If you can't change the component which produces the data, you can use a series of queries and tables to convert the data to other file formats. This is a one-time job for each partition of data. After that, your search queries will run faster on the newer file format.
For example, the RCFile format is supported by Hive. If you pull out code and customerId as separate columns in RCFile, the query engine can skip the data column entirely for rows that do not match code IN (1,4,5,33,6784), reducing IO heavily.
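A rough sketch of that one-time conversion for a single partition, reading from the external CUSTOMER_DATA table into a hypothetical RCFile-backed copy:
-- Columnar target table with the frequently-filtered keys as real columns.
CREATE TABLE customer_data_rc (
  code       INT,
  customerId STRING,
  data       STRING
)
PARTITIONED BY (partitionDate INT)
STORED AS RCFILE;
-- One-time load per partition; later queries hit the columnar copy instead of JSON text.
INSERT OVERWRITE TABLE customer_data_rc PARTITION (partitionDate = 20141020)
SELECT code, customerId, data
FROM CUSTOMER_DATA
WHERE partitionDate = 20141020;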
Also, storing data in RCFile, i.e. columnar storage, will help your joins. With RCFile, when you run a query with a join, the Hive execution engine will only read in the required columns, again significantly reducing IO. On top of this, if you bucket the columns which are part of the JOIN keys, it will lead to further performance gains.
If you need to keep JSON due to the nested nature of your data, then I would suggest you look at Parquet.
It will give you the performance gains of RCFile plus a binary format (Avro, Thrift, etc.).
At my work we had two columns of heavily nested JSON data. We tried storing it as compressed text and in sequence file format. We then broke up the complex nested JSON columns into multiple, less nested columns and pulled out some frequently searched keys into columns of their own. We stored this as RCFile, and the performance gains we observed on searches were huge.
Right now, with a further burst in data volume, we need to improve more. After trying a few more things and talking to the Cloudera folks, there is only one big area left to improve: move away from JSON parsing. Parquet seems to be the ideal candidate for this.
Yes, you can use indexes with external tables. Indexes do optimize search queries.
CREATE INDEX your_index_name ON TABLE your_table_name(field_you_want_to_index) AS 'COMPACT' WITH DEFERRED REBUILD;
Indexing takes a lot of time for a huge dataset, so we can do a deferred rebuild, i.e. after production hours :)
ALTER INDEX your_index_name ON your_table_name REBUILD;
You can even rebuild a specific partition:
ALTER INDEX your_index_name ON your_table_name PARTITION(your_field = 'any_thing') REBUILD;
When you JOIN two tables, BUCKETING is the best option to go with; it enables a lot of optimization.
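As a sketch of how bucketing feeds into the join (table names are hypothetical; both tables are assumed to be bucketed on customerId), the standard Hive switch for bucket map joins would be used:
-- With both sides bucketed on the join key, Hive can join bucket-to-bucket
-- instead of shuffling the full tables (older Hive versions may also need a MAPJOIN hint).
SET hive.optimize.bucketmapjoin = true;
SELECT d.customerId
FROM customer_data_bucketed d
JOIN customer_attributes_bucketed a
  ON (a.customerId = d.customerId)
WHERE a.attribute_1 = 'something';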

What is a data warehouse in this use case?

I'm trying to figure out the difference (in terms of tools/services/programs) between a data warehouse, clustered data processing, and the tools/infrastructure for querying a data warehouse.
So let's say I have the following setup to perform some data processing for a certain use case:
Hadoop Cluster for Distributed Data processing
Hive for providing the infrastructure and functions for querying data from a data warehouse
My data sitting in an RDBMS or a NoSQL database
In the above example, what exactly is the data warehouse? My naive brain thinks that the RDBMS or the NoSQL database in this context is the data warehouse. But by definition, isn't a data warehouse a database used for reporting and data analysis? (Definition shamelessly stolen from Wikipedia.) So can I call a traditional RDBMS/NoSQL database a data warehouse? Thanks.
You cannot call every relational database system a data warehouse, since one of a data warehouse's main features is to aggregate data from multiple databases (with different schemas). This is usually done with a "star schema" that allows combining multiple dimensions and multiple granularities.
Because NoSQL database systems (graph-based or map-reduce-based) are schema-less, they can indeed store data from different schemas. Moreover, MapReduce can be used to aggregate data at different granularities (e.g. aggregating daily data to compare it with monthly data).

Resources