I was trying to migrate data from a SQL database to Hadoop, and I have successfully done this by configuring Hive, HBase & Hadoop.
My problem is that I was using BIRT & Tableau with my SQL database and was able to load 10 million records in 5-10 minutes, but my newly configured Hadoop, Hive & HBase system takes around 50 minutes to fetch 10 million entries.
How can I improve this performance?
Since Hadoop is built specifically for processing huge amounts of data, why am I not able to do so?
Is there any special configuration for performance?
After a lot of research into this question I went through HDP as well. I came across the point that we cannot compare the performance of a SQL database with Hadoop, as the two are used for different purposes.
Also, Hadoop shows its strength only after the data grows beyond several TBs, i.e. the point at which a SQL database fails. So it is better to check the application's requirements first: if the requirement is low-latency query performance on modest data volumes, choosing Hadoop is not a good option; go for a SQL database. But if the application will accumulate a huge amount of data and you have to analyze it at a scale where a SQL database fails, then Hadoop is the better choice.
I am new to Cognos and am trying to create reports on top of Hadoop using the Hive JDBC driver. I am able to connect to Hive through JDBC and can generate reports, but the reports run very slowly. I did the same job while connecting to DB2, with the same data as in Hadoop, and the reports ran very quickly compared to the reports on top of Hive. I am using the same datasets in both Hadoop and DB2, but can't figure out why reports on top of Hadoop are so slow. I installed Hadoop in pseudo-distributed mode and connected through JDBC.
These are the software versions I installed:
IBM Cognos 10.2.1 with fix pack 11,
Apache Hadoop 2.7.2,
Apache Hive 0.12.
They are installed on different systems: Cognos on Windows 7 and Hadoop on Red Hat.
Can anyone tell me where I might have gone wrong in setting up Cognos or Hadoop? Is there any way to speed up report run times in Cognos on top of Hadoop?
When you say you installed Hadoop in pseudo-distributed mode, are you saying you are only running it on a single server? If so, it's never going to be as fast as DB2. Hadoop and Hive are designed to run on a cluster and scale. Get 3 or 4 servers running in a cluster and you should find that you start to see some impressive query speeds over large datasets.
Check that you have allowed the Cognos Query Service to access more than the default amount of memory for its Java heap (http://www-01.ibm.com/support/docview.wss?uid=swg21587457). I currently run an initial size of 8 GB and a max of 12 GB, but still manage to blow through this occasionally.
The next issue you will run into is that Cognos doesn't know the specifics of Hive SQL (or Impala SQL, which is what I am using). This means that any non-basic query is going to be converted to a SELECT FROM and maybe a GROUP BY. The big missing piece will be the WHERE clause, which means Cognos is going to try to suck in all the data from the Hive table and then do the filtering inside Cognos, rather than passing that work off to Hive where it belongs. Cognos knows how to write DB2 SQL and all its specifics, so it can pass that workload through.
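As a rough illustration of the difference (the table and column names here are made up), this is the kind of filtered query you want Hive to receive, versus the unfiltered one it may actually get when the predicate stays in Cognos:

    -- What you want Hive to execute: the filter is pushed down, so only a
    -- small slice of the table leaves the cluster.
    SELECT cust_id, amount
    FROM sales
    WHERE sale_date = '2016-01-01';

    -- What Hive may actually see when Cognos keeps the filtering to itself:
    -- the whole table gets streamed back and filtered inside Cognos.
    SELECT cust_id, amount, sale_date
    FROM sales;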
More complex queries and any platform-specific functions (date functions, analytic functions, etc.) will generally not get passed to Hive, so try to structure your data and queries so that such functions are not required in your filters.
Use the Hive query logs to monitor the queries that Cognos is actually running. Also try things like adding fields to the query and then dragging that field into the filter, rather than dragging it directly from the model into the filter. I have found this can help get Cognos to include the filter in a WHERE clause.
The other option is to use pass-through SQL queries in Report Studio and just write it all in Hive's SQL. I have just done this for a set of dashboards which required a stack of top 5s from a fact table with 5 million rows. For 5 rows, Cognos was extracting all 5 million rows and then ranking them within Cognos. Do this a number of times and all of a sudden Cognos is going to struggle. With a pass-through query I could use the Impala Rank() function and fetch only 5 rows, which was much, much faster, and faster than what DB2 would do, seeing as I am running on a proper (but small) cluster.
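As a minimal sketch of such a pass-through query, assuming a hypothetical fact table sales_fact(region, product, revenue) and an Impala version with analytic function support, both the ranking and the filter stay in Impala, so only the top 5 rows per region ever come back to Cognos:

    -- Top 5 products per region, computed entirely in Impala.
    SELECT region, product, revenue
    FROM (
        SELECT region,
               product,
               revenue,
               RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rnk
        FROM sales_fact
    ) ranked
    WHERE rnk <= 5;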
Another consideration with Hive is whether you are using Hive on MapReduce or Hive on Tez. From what a colleague has found, Hive on Tez is much faster at the type of queries Cognos runs than Hive on MapReduce.
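If your Hive version supports Tez (Hive 0.13 or later), switching engines is a single session setting, sketched below; the same property can also be set cluster-wide in hive-site.xml:

    -- Run subsequent queries in this session on Tez instead of MapReduce.
    SET hive.execution.engine=tez;

    -- Switch back to MapReduce to compare timings if needed.
    SET hive.execution.engine=mr;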
We have a data warehousing application which we are planning to convert to Hadoop.
Currently, we receive 20 feeds on a daily basis and load this data into a MySQL database.
As the data is getting large, we are planning to move to Hadoop for faster query processing.
As the first step, we are planning to load the data into Hive on a daily basis instead of MySQL.
Questions:
1. Can I use Hadoop like a DWH application to process files on a daily basis?
2. When I load the data onto the master node, will it be synced automatically?
It really depends on the size of your data. The question is a bit complex, but in general you will have to design your own pipeline.
If you are analyzing raw logs, HDFS will be a good choice to start from. You can use Java, Python or Scala to schedule the Hive jobs on a daily basis, and use Sqoop if you still need some MySQL data.
In Hive you will have to create a partitioned table so that the data is synced and available upon query execution. Partition creation can also be scheduled.
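As a minimal sketch of what that could look like (table, column and path names are made up), the ALTER TABLE step is the part you would schedule after each daily load:

    -- Daily-partitioned table for the incoming feeds.
    CREATE TABLE IF NOT EXISTS daily_feed (
        record_id STRING,
        payload   STRING
    )
    PARTITIONED BY (feed_date STRING)
    STORED AS ORC;

    -- Scheduled step after each day's files land on HDFS: register the new
    -- partition so it becomes visible to queries.
    ALTER TABLE daily_feed
    ADD IF NOT EXISTS PARTITION (feed_date = '2016-01-01')
    LOCATION '/data/daily_feed/feed_date=2016-01-01';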
I would suggest going with Impala instead of Hive, as it is more tunable, fault tolerant and easier to use.
I am working on migrating data from a SQL database to Hadoop, in which I have used HBase & Hadoop as well. I have successfully imported my data from the SQL database into Hadoop, HBase and Hive. But the problem is the performance of the system: I was getting results for millions of entries within 5-10 minutes in the SQL database, but it takes around 1 hour to fetch 10 million rows from HBase & Hive. Can anyone help me improve the performance of my Hadoop system?
Data in HBase is only 'indexed' by rowkey. If you're querying in Hive on anything other than rowkey prefixes, you will generally be performing a full table scan.
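To make that concrete, here is a sketch of a Hive table mapped onto an HBase table (all names are hypothetical). A predicate on the rowkey column can, depending on the Hive version, be translated into a bounded HBase scan, while a predicate on any other column forces a full scan:

    -- Hive table backed by an HBase table called 'events'.
    CREATE EXTERNAL TABLE hbase_events (
        rowkey STRING,
        detail STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:detail')
    TBLPROPERTIES ('hbase.table.name' = 'events');

    -- A rowkey-range filter can be served by a bounded scan...
    SELECT rowkey, detail
    FROM hbase_events
    WHERE rowkey >= '20160101' AND rowkey < '20160102';

    -- ...whereas a filter on a non-key column reads the whole table.
    SELECT rowkey, detail
    FROM hbase_events
    WHERE detail = 'some value';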
There are some optimizations that can be made with HBase filters e.g., when using a FamilyFilter, you may be able to skip entire regions, but I doubt Hive is doing that.
How to improve performance depends on how your data is shaped and what analysis you need to perform on it. If you are performing frequent ad-hoc analysis, you may be better served by exporting the data from HBase into something like Parquet files on HDFS and running your analysis against those with Hive (or Drill, Spark, Impala, etc.).
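One way to do that export from Hive itself is a CTAS into a Parquet table, sketched below using the hypothetical hbase_events mapping from above; the export still scans HBase once, but every subsequent ad-hoc query runs against the columnar copy instead:

    -- Snapshot the HBase-backed table into Parquet on HDFS
    -- (Hive 0.13+ supports STORED AS PARQUET natively).
    CREATE TABLE events_parquet
    STORED AS PARQUET
    AS
    SELECT rowkey, detail
    FROM hbase_events;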
We presently load CDRs into an Oracle warehouse using a combination of bash shell scripts and SQL*Loader with multiple threads. We are hoping to offload this process to Hadoop because we envisage that the growth in data from an increasing subscriber base will soon max out the current system. We also want to gradually introduce Hadoop into our data warehouse environment.
Will loading from Hadoop be faster?
If so, what is the best set of Hadoop tools for this?
Further info:
We usually get a continuous stream of pipe-delimited text files through FTP into a folder, add two more fields to each record, load them into temp tables in Oracle and run a procedure to load them into the final table. How would you advise the process flow to be in terms of tools to use? For example:
files are FTPed to the Linux file system (or is it possible to FTP straight to Hadoop?) and Flume loads them into Hadoop.
fields are added (what would be the best way to do this? Pig, Hive, Spark or any other recommendations; see the Hive sketch after this list).
files are then loaded into Oracle using Sqoop.
the final procedure is called (can Sqoop make an Oracle procedure call? If not, what tool would be best to execute the procedure and help control the whole process?)
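For the "fields are added" step, one possibility (among Pig, Hive and Spark) is to put a Hive external table over the raw pipe-delimited files and add the two fields in a CREATE TABLE AS SELECT; all table, column and path names below, and the two derived fields themselves, are made up purely for illustration:

    -- External table over the raw pipe-delimited files landed by FTP/Flume.
    CREATE EXTERNAL TABLE cdr_raw (
        call_id  STRING,
        caller   STRING,
        callee   STRING,
        duration INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION '/landing/cdr';

    -- Add the two extra fields on the way into a staging table that Sqoop
    -- can later export to Oracle. The derived values are examples only.
    CREATE TABLE cdr_enriched AS
    SELECT
        call_id,
        caller,
        callee,
        duration,
        from_unixtime(unix_timestamp(), 'yyyy-MM-dd') AS load_date,  -- extra field 1
        'FTP_FEED'                                    AS source_tag  -- extra field 2
    FROM cdr_raw;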
Also, how can one control the level of parallelism? Does it equate to the number of mappers running the job?
I had a similar task of exporting data from a < 6 node Hadoop cluster to an Oracle data warehouse.
I've tested the following:
Sqoop
OraOop
Oracle Loader for Hadoop from the "Oracle BigData Connectors" suite
Hadoop streaming job which uses sqlldr (SQL*Loader) as the mapper; in its configuration you can read from stdin using: load data infile "-"
Considering just speed, the Hadoop streaming job with sqlldr as the mapper was the fastest way to transfer the data, but you have to install sqlldr on each machine of your cluster. It was more of a personal curiosity; I would not recommend using this way to export data, since the logging capabilities are limited and it would have a bigger impact on your data warehouse's performance.
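For reference, the control file for that streaming + sqlldr pattern is only a few lines; this is a sketch with made-up table and column names, the key detail being INFILE "-" so that sqlldr reads whatever the streaming mapper pipes to its stdin:

    -- Minimal SQL*Loader control file.
    LOAD DATA
    INFILE "-"
    APPEND
    INTO TABLE cdr_export
    FIELDS TERMINATED BY '|'
    (call_id, caller, callee, duration)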
The winner was Sqoop. It is pretty reliable, it is the import/export tool of the Hadoop ecosystem, and it was the second fastest solution according to my tests (1.5x slower than first place).
Sqoop with OraOop (last updated in 2012) was slower than the latest version of Sqoop, and requires extra configuration on the cluster.
Finally, the worst time was obtained using Oracle's BigData Connectors; if you have a big cluster (>100 machines) then it should not be as bad as the time I obtained. The export was done in two steps: the first step involves processing the output and converting it to an Oracle format that plays nicely with the data warehouse; the second step was transferring the result to the data warehouse. This approach is better if you have a lot of processing power, and it does not impact the data warehouse's performance as much as the other solutions.
I'm trying to understand how to architect a big data solution. I have 400 TB of historic data, and 1 GB of new data is inserted every hour.
Since the data is confidential, I'm describing a sample scenario: the data contains information about all activities in a bank branch. Every hour, when new data is inserted (no updates) into HDFS, I need to find how many loans were closed, loans created, accounts expired, etc. (around 1,000 analytics to be performed). The analytics involve processing the entire 400 TB of data.
My plan was to use Hadoop + Spark, but I'm being advised to use HBase. Reading through all the documents, I'm not able to find a clear advantage.
What is the best way to go for data which will grow to 600 TB?
1. MR for analytics and Impala/Hive for queries
2. Spark for analytics and queries
3. HBase + MR for analytics and queries
Thanks in advance
About HBase:
HBase is a database that is built on top of HDFS; HBase uses HDFS to store its data.
Basically, HBase allows you to update records, keep versions and delete single records. HDFS does not support file updates, so HBase introduces what you can think of as "virtual" operations, merging data from multiple sources (original files, delete markers) when you ask it for data. Also, as a key-value store, HBase maintains indices to support selecting by key.
Your problem:
When choosing the technology in such situations, you should look at what you are going to do with the data: a single query in Impala (with an Avro schema) can be much faster than MapReduce (not to mention Spark). Spark will be faster for batch jobs where caching is involved.
You are probably familiar with the Lambda architecture; if not, take a look at it. From what I can tell you now, the third option you mentioned (HBase and MR only) won't be good. I did not try Impala + HBase, so I can't say anything about its performance, but HDFS (plain files) + Spark + Impala (with Avro) worked for me: Spark produced the reports for pre-defined queries (after that, the data was stored in objectFiles: not human-readable, but very fast), and Impala handled the custom queries.
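As a small sketch of the "plain files + Impala (with Avro)" piece (table, column and path names are hypothetical), the data sits on HDFS as Avro files and custom queries run straight against it from Impala:

    -- Avro-backed table over files already sitting on HDFS
    -- (create via Hive or Impala, query from Impala).
    CREATE EXTERNAL TABLE branch_activity (
        activity_type STRING,
        branch_id     STRING,
        event_time    BIGINT
    )
    STORED AS AVRO
    LOCATION '/warehouse/branch_activity';

    -- Example of the kind of custom query Impala then serves.
    SELECT activity_type, COUNT(*) AS events
    FROM branch_activity
    GROUP BY activity_type;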
Hope it helps at least a little.