how hadoop handle ram when execute query? - hadoop

in Relation-database model like mysql , when user send query to database like "SELECT message.message_id FROM message" whole table 'message' loading in RAM . when tables are very large and server hasn't enough memory, mysql crashed.
sorry about my question . i have no idea how describe my question . my database course in university ask search about how hadoop handle tables and query when query send to database and hadoop tries to execute query

Since this is homework I won't fully answer your question, but I will point you in the right direction. In a traditional relational database (MySQL, PostgreSQL, SQLite) all the processing for a single query is done on a single machine. Even with replication, one query runs on one machine.
Hadoop uses a distributed filesystem to spread the work across multiple machines. Using MapReduce, one query can be broken up into smaller pieces and performed in parallel on multiple machines.
This can be faster, it depends on your data and your queries. What it really buys you is the ability to scale to handle more and more data and more and more queries. Rather than having to buy more powerful and more expensive database servers (even with replication, your database hardware has to be beefy), you can add inexpensive machines to your Hadoop cluster.
As for this...
when user send query to database like "SELECT message.message_id FROM message" whole table 'message' loading in RAM. when tables are very large and server hasn't enough memory, mysql crashed
This assumption is wrong. The whole table is not loaded into MySQL memory (unless MySQL is even dumber than I give it credit for). The database will read the table row-by-row. Just like if you open a huge file, it's still read line-by-line. Even with an ORDER BY, the sorting will be done on disk.
I suspect your teacher is trying to emphasize the advantages of a distributed database, and being able to deal with enormous datasets is one of them, but MySQL will not crash just because you query a large table.

Unlike sql queries, in hadoop you need to write map reduce job to extract the data. Now a days, there are many wrappers are available on top of map reduce job such as hive, pig, phoenix, etc.
In these wrapper you can run sql like queries, but at the end, it will convert queries into map reduce job, and it will return output looking like sql query result. It is calld SQL on NoSQL.
If FileSystem and MapReduce are installed on a node, MapR allocates 20% of physical memory to FileSystem, about 5-8% of memory for OS and other applications and the rest would be given to MapReduce services
On an average about 75% of physical memory is assigned to MapReduce in this kind of setting. Note that for the mfs process MapR pre-allocates 20% of memory, which means mfs grabs 20% of memory immediately. On the other hand, MapReduce services starts off low and eventually grows up to 75% of physical memory because memory is not pre-allocated when you configure and start TaskTracker service.
For more detail, check below link:
https://www.mapr.com/developercentral/code/memory-management-basics#.VTEoVq2qqko

Related

Hive query having 15 tables join is expected to generate 1 Billion records, on 3 datanodes, 16GB RAM each Is this the right way to do?

My name is Vitthal.
The Hortonworks HDP 2.4 Cluster on Amazon is 3 Datanodes, Masters on different Instances.
7 Instances 16GB RAM each.
Total 1TB HDD Space
3 Data Nodes
Hadoop version 2.7
I have pulled data from Postgres into Hadoop Distributed Environment.
The Data is 15 Tables, Among them 4 tables are having 15 Million Records, rest are Masters.
I've pulled them in HDFS, compressed as ORC, and SnappyCodec. Created Hive External Tables with schema.
Now I'm firing a query which joins all the 15 tables and selects the columns which I need in a final flat table. The records expected are more than 1.5 Billion.
I have optimized Hive, Yarn, MapReduce Engine viz. Parallel Execution, Vectorization, Optimized Joins, Small Table Condition, Heap Size etc.
The query is running on Cluster / Hive / Tez since 20 hours & it's reached 90% where the last reducer is running. The 90% is reached long back like since 18 hours it's stuck at 90%.
Am I doing it the right way ?
If I understand, you have effectively copied tables in their raw form from your RDBMs into Hadoop in order to create a flattened view into one or more new tables. You're using Hive to do this. All of this sounds fine.
There are many possibilities why this is taking so long, but several come to mind.
First, YARN will allocate containers (one per CPU core, typically) that mappers and reducers will use to run the parallelized parts of the query. This should allow you to utilize all of the resources you have available.
I use Cloudera, but I assume Hortonworks has similar tools that let you see how many containers are in use, how many mappers and reducers are created by Hive, and so on. You should see that most or all of your available CPUs are in use constantly. Jobs should be finishing at some reasonable rate (perhaps every minute, or every 15 minutes). Depending on the query, Hive is often able to break it into distinct "stages" that are executed distinctly from others, then reassembled at the end.
If this is the case, everything may be fine, but your cluster may be under-resourced. But before you throw more AWS instances at the problem, consider the query itself.
First, Hive has several tools that are essential for optimizing performance, most importantly, partitioning. When you create tables, you should find some means of partitioning the resulting datasets into roughly equal subsets. A common method is to use dates, for example year+month+day (perhaps 20160417), or if you expect to have lots of historical data, maybe just year+month. This will also allow you to dramatically optimize queries that can be constrained by date. I seem to recall that Hive (or maybe it's YARN) will allocate partitions to different containers, so if you don't see all your workers working, then this would be a possible cause. Use the PARTITIONED BY clause in your CREATE TABLE statement.
The reason to choose something like date is that presumably your data is relatively evenly distributed over time (dates). We had chosen a customer_id as a partition key in an early implementation but as we grew, so did our customers. Hundreds of smaller customers would finish in a few minutes, then hundreds of mid-sized customers would finish in an hour, then a couple of our largest customers would take 10 or more hours to complete. We would see complete utilization of the cluster for that first hour, then only a couple containers in use for the last couple of customers. Not good.
This phenomenon is known as "data skew", so you want to carefully choose partitions to avoid skew. There are some options involving SKEW BY and CLUSTER BY that can help deal with getting evenly sized or smaller data files that you could consider.
Note that the raw import data should also be partitioned, as partitions act like indexes in a RDBMS, so are important for performance. In this case, choose partitions that use the keys that your larger query joins on. It is possible and common to have multiple partitions, so a date-based top-level partition, with a sub-partition on the join key could be helpful ... maybe ... depends on your data.
We have also found that it's very important to optimize the query itself. Hive has some hinting mechanisms that can direct it to run the query differently. While quite rudimentary compared to RDBMS, EXPLAIN is very helpful for understanding how Hive will break up the query and when it needs to scan a full dataset. It's hard to read the explain output, so get comfortable with the Hive documentation :-).
Lastly, if you can't make Hive do things in a sensible manner (if its optimizer still results in imbalanced stages) you can create intermediate tables with an additional Hive query that runs to create a partially transformed dataset before building the final one. This seems expensive since you're adding an additional write, and read of new tables, but in the case you describe it may be much faster overall. Also, it's sometimes useful to have intermediate tables just to test or sample data.
Writing Hive is a lot less like writing regular software -- you can get the Hive query done pretty quickly in most cases. Getting it to run fast has taken us 10 or 15 tries in a few cases. Good luck, and I hope this is helpful.

Hadoop comparison to RDBMS

I really do not understand the actual reason behind hadoop scaling better than RDBMS . Can anyone please explain at a granular level ? Has this got something to do with underlying datastructures & algorithms
RDBMS have challenges in handling huge data volumes of Terabytes & Peta bytes. Even if you have Redundant Array of Independent/Inexpensive Disks (RAID) & data shredding, it does not scale well for huge volume of data. You require very expensive hardware.
EDIT:
To answer, why RDBMS cannot scale, have a look at Overheads of RBDMS.
Logging. Assembling log records and tracking down all changes
in database structures slows performance. Logging may not be
necessary if recoverability is not a requirement or if recoverability
is provided through other means (e.g., other sites on the network).
Locking. Traditional two-phase locking poses a sizeable overhead
since all accesses to database structures are governed by a
separate entity, the Lock Manager.
Latching. In a multi-threaded database, many data structures
have to be latched before they can be accessed. Removing this
feature and going to a single-threaded approach has a noticeable
performance impact.
Buffer management. A main memory database system does not
need to access pages through a buffer pool, eliminating a level of
indirection on every record access.
How Hadoop handles?:
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment, which can run on commodity hardware. It is useful for storing & retrieval of huge volumes of data.
This scalability & efficiency are possible with Hadoop implementation of storage mechanism (HDFS) & processing jobs (YARN Map reduce jobs). Apart from scalability, Hadoop provides high availability of stored data.
Scalability, High Availability, Processing of huge volumes of data (Strucutred data, Unstructured data, Semi structured data) with flexibility are key to success of Hadoop.
Data is stored on thousands of nodes & processing is done on the node where data is stored (most of the times) through Map Reduce jobs. Data Locality on processing front is one key area of success of Hadoop.
This has been achieved with Name Node, Data Node & Resource Manager.
To understand how Hadoop achieve this, you should must visit these links : HDFS Architecture , YARN Architecture and HDFS Federation
Still RDBMS is good for multiple write/read/updates and consistent ACID transactions on Giga bytes of data. But not good for processing of Tera bytes & Peta bytes of data. NoSQL with two of Consistency ,Availability Partitioning attributes of CAP theory is good in some of use cases.
But Hadoop is not meant for real time transaction support with ACID properties. It is good for Business intelligence reporting with batch processing - "Write once, multiple read" paradigm.
From slideshare.net
Have a look at one more related SE question :
NoSql vs Relational database
First, hadoop IS NOT a DB replacement.
RDBMS scale vertical and hadoop scale horizontal.
This means that to scale twice a RDBMS you need to have hardware with the double memory, double storage and double cpu. That is very expensive and has limits. There isn't a server with 10TB of ram for example. With hadoop is different, you don't need expensive edge technology, instead of that you can use several commodity servers working together to simulate a bigger server (with some limitations). You can have a cluster with 10 Tb of ram distributed in several nodes.
Other advantage is that instead to have to buy a new more powerful server and drop the old one, to scale distributed systems only require to add new nodes into the cluster.
The one issue if have with the description above is that paralleled RDBMS required expensive hardware. Teridata and Netezza need special hardware. Greenplum and Vertica can be put on commodity hardware. (Now I will admit I am biased, like everyone else.) I have seen Greenplum scan petabytes of information daily. (Walmart was up to 2.5 petabytes last I hard.) I dealt with both Hawq and Impala. They both require about 30% more hardware to do the same job on structured data. Hbase is less efficient.
There is no magic silver spoon. It has been my experience that both structured and unstructured have their place. Hadoop is great for ingesting large amounts of data and scanning through it a small amount of times. We use it as part of our load procedures. RDBMS is grate at scanning the same data over and over with highly complex queries.
You always have to structure the data to make use of it. That structuring takes time somewhere. You ether structure before you put it in to an RDBMS or at query time .
In RDBMS , data is structured , rather it is indexed.
Retrieval of data of any particular 'nth' column is loading the entire database and then selecting the 'nth' column.
where as in Hadoop, say Hive, we load the only the particular column from the entire data set.
More so over the data loading is also done by Map reduce programs which is done in a distributed structure which reduce the overall time.
Hence, two advantages of using Hadoop and its tools.

The best way to filter large data sets

I have a query about how to filter relevant records from a large data set of financial transactions. We use Oracle 11g database and one of the requirements is to produce various end-of-day reports with all sorts of criteria.
The relevant tables look roughly like this:
trade_metadata 18m rows, 10 GB
trade_economics 18m rows, 15 GB
business_event 18m rows, 11 GB
trade_business_event_link 18m rows, 3 GB
One of our reports is now taking ages to run ( > 5 hours). The underlying proc has been optimized time and again but new criteria keep getting added so we start struggling again. The proc is pretty standard - join all the tables and apply a host of where clauses (20 at the last count).
I was wondering if I have a problem large enough to consider big data solutions to get rid of this optimize-the-query game every few months. In any case, the volumes are only going up. I have read up a bit about Hadoop + HBase, Cassandra, Apache Pig etc. but being very new to this space, am a little confused about the best way to proceed.
I imagine this is not a map-reduce problem. HBase does seem to offer Filters but I am not sure about their performance. Could the enlightened folks here please answer a few questions for me:
Is the data set large enough for big data solutions (Do I need entry into the billion club first?)
If it is, would HBase be a good choice to implement this?
We are not moving away from Oracle anytime soon even though the volumes are growing steadily. Am I looking at populating the HDFS every day with a dump from the relevant tables? Or is delta write possible everyday?
Thanks very much!
Welcome to the incredibly varied big data eco-system. If your dataset size is big enough that it is taxing your ability to analyze it using traditional tools, then it is big enough for big data technologies. As you have probably seen, there are a huge number of big data tools available with many of them having overlapping capabilities.
First of all, you did not mention if you have a cluster set-up. If not, then I would suggest looking into the products by Cloudera and Hortonworks. These companies provide Hadoop distributions that include many of the most popular big data tools(hbase, spark, sqoop, etc), and make it easier to configure and manage the nodes that will make up your cluster. Both companies provide their distributions free of charge, but you will have to pay for support.
Next you will need to get your data out of Oracle and into some format in the hadoop cluster to analyze it. The tool often used to get data from a relational database and into the cluster is Sqoop. Sqoop has the ability to load your tables into HBase, Hive, and files on the Hadoop Distributed Filesystem (HDFS). Sqoop also has the ability to do incremental imports for updates instead of whole table loads. Which of these destinations you choose affects which tools you can use in the next step. HDFS is the most flexible in that you can access it from PIG, MapReduce code you write, Hive, Cloudera Impala, and others. I have found HBase to be very easy to use, but others highly recommend Hive.
An aside: There is a project called Apache Spark that is expected to be the replacement for Hadoop MapReduce. Spark claims 100x speedup compared to traditional hadoop mapreduce jobs. Many projects including Hive will run on Spark giving you the ability to do SQL-like queries on big data and get results very quickly (Blog post)
Now that your data is loaded you need to run those end of day reports. If you choose Hive, then you can reuse a lot of your sql knowledge instead of having to program Java or learn Pig Latin (not that it’s very hard). Pig Translates Pig Latin to MapReduce jobs (as does Hive’s Query Language for now), but, like Hive, Pig can target Spark as well. Regardless of which tool you choose for this step, I recommend looking into Oozie to automate the ingestion, analaytics, and movement of results back out of the cluster (sqoop export for this). Oozie allows you to schedule recurring workflows like yours so you can focus on the results not the process. The full capabilities of Oozie are documented here.
There are a crazy number of tools at your disposal, and the speed of change in this eco-system can give you whip-lash. Both cloudera and Hortonworks provide Virtual Machines you can use to try their distributions. I strongly recommend spending less time deeply researching each tool and just trying some of the them (like Hive, Pig, Oozie,...) to see what works best for your application).

MySQL Cluster vs. Hadoop for handling big data

I want to know the advantages/disadvantages of using a MySQL Cluster and using the Hadoop framework.
What is the better solution. I would like to read your opinion.
I think the advantages of using a MySQL Cluster are:
high availability
good scalability
high performance / real time data access
you can use commodity hardware
And I don't see a disadvantage! Are there any disadvantages that Hadoop do not has?
The advantages of Hadoop with Hive on top of it are:
also good scalability
you can also use commodity hardware
the ability to run in heterogenous environments
parallel computing with the MapReduce framework
Hive with HiveQL
and the disadvantage is:
no real time data access. It may takes minutes or hours to analyze the data.
So in my opinion for handling big data a MySQL cluster is the better solution. Why Hadoop is the holy grail of handling big data? What is your opinion?
Both of the above answers miss a huge differentiation between mySQL and Hadoop. mySQL requires you to store data in a certain format. It likes heavily structured data - you declare the data type of each column in a table etc. Hadoop doesn't care about this at all.
Example - if you have a billion text log files, to make analysis even possible for mySQL you'd need to parse and load the data first into a mySQL table, typeing each column along the way. With hadoop and mapreduce, you define the function that is to scan/analyze/return the data from its raw source - you don't need pre-processing ETL to get it pre-structured.
If the data is already structured and in mySQL - then (hopefully) its well structured - why export it for hadoop to analyze? If it isn't, why spend the time to ETL the data?
Hadoop is not a replacement of MySQL, so I think they have their own scenario。
Every one know hadoop is better for batch job or offline compute, but there also have many related real time product, such as hbase.
If you wanna choose a offline compute & storage arch.
I suggest hadoop not MySQL cluster for offline compute & storage, because of :
Cost : obviously, hadoop cluster is more cheap than MySQL cluster
Scalability : hadoop support more than ten thousands machine in a cluster
Ecosystem : mapreduce, hive, pig, sqoop and etc.
So you can choose hadoop as offline compute & storage and MySQL as online compute & storage, you also can learn more from lambda architecture.
The other answer is good, but doesn't really explain why hadoop is more scalable for offline data crunching than MySQL Clusters. Hadoop is more efficient for large data sets that must be distributed across many machines because it gives you full control over the sharding of data.
MySQL clusters use auto-sharding, and it's designed to randomly distribute the data so no one machine gets hit with more of the load. On the other hand, Hadoop allows you to explicitly define the data partition so that multiple data points that require simultaneous access will be on the same machine, minimizing the amount of communication among the machines necessary to get the job done. This makes Hadoop better for processing massive data sets in many cases.
The answer to this question has a good explanation of this distinction.

How does Hive compare to HBase?

I'm interested in finding out how the recently-released (http://mirror.facebook.com/facebook/hive/hadoop-0.17/) Hive compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have implemented.
It's hard to find much about Hive, but I found this snippet on the Hive site that leans heavily in favor of HBase (bold added):
Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.
Since HBase and HyperTable are all about performance (being modeled on Google's BigTable), they sound like they would certainly be much faster than Hive, at the cost of functionality and a higher learning curve (e.g., they don't have joins or the SQL-like syntax).
From one perspective, Hive consists of five main components: a SQL-like grammar and parser, a query planner, a query execution engine, a metadata repository, and a columnar storage layout. Its primary focus is data warehouse-style analytical workloads, so low latency retrieval of values by key is not necessary.
HBase has its own metadata repository and columnar storage layout. It is possible to author HiveQL queries over HBase tables, allowing HBase to take advantage of Hive's grammar and parser, query planner, and query execution engine. See http://wiki.apache.org/hadoop/Hive/HBaseIntegration for more details.
Hive is an analytics tool. Just like pig, it was designed for ad hoc batch processing of potentially enourmous amounts of data by leveraging map reduce. Think terrabytes. Imagine trying to do that in a relational database...
HBase is a column based key value store based on BigTable. You can't do queries per se, though you can run map reduce jobs over HBase. It's primary use case is fetching rows by key, or scanning ranges of rows. A major feature is being able to have data locality when scanning across ranges of row keys for a 'family' of columns.
To my humble knowledge, Hive is more comparable to Pig. Hive is SQL-like and Pig is script based.
Hive seems to be more complicated with query optimization and execution engines as well as requires end user needs to specify schema parameters(partition etc).
Both are intend to process text files, or sequenceFiles.
HBase is for key value data store and retrieve...you can scan or filter on those key value pairs(rows). You can not do queries on (key,value) rows.
Hive and HBase are used for different purpose.
Hive:
Pros:
Apache Hive is a data warehouse infrastructure built on top of Hadoop.
It allows for querying data stored on HDFS for analysis via HQL, an SQL-like language, which will be converted into series of Map Reduce Jobs
It only runs batch processes on Hadoop.
it’s JDBC compliant, it also integrates with existing SQL based tools
Hive supports partitions
It supports analytical querying of data collected over a period of time
Cons:
It does not currently support update statements
It should be provided with a predefined schema to map files and directories into columns
HBase:
Pros:
A scalable, distributed database that supports structured data storage for large tables
It provides random, real time read/write access to your Big Data. HBase operations run in real-time on its database rather than MapReduce jobs
it supports partitions to tables, and tables are further split into column families
Scales horizontally with huge amount of data by using Hadoop
Provides key based access to data when storing or retrieving. It supports add or update rows.
Supports versoning of data.
Cons:
HBase queries are written in a custom language that needs to be learned
HBase isn’t fully ACID compliant
It can't be used with complicated access patterns (such as joins)
It is also not a complete substitute for HDFS when doing large batch MapReduce
Summary:
Hive can be used for analytical queries while HBase for real-time querying. Data can even be read and written from Hive to HBase and back again.
As of the most recent Hive releases, a lot has changed that requires a small update as Hive and HBase are now integrated. What this means is that Hive can be used as a query layer to an HBase datastore. Now if people are looking for alternative HBase interfaces, Pig also offers a really nice way of loading and storing HBase data. Additionally, it looks like Cloudera Impala may offer substantial performance Hive based queries on top of HBase. They are claim up to 45x faster queries over traditional Hive setups.
To compare Hive with Hbase, I'd like to recall the definition below:
A database designed to handle transactions isn’t designed to handle
analytics. It isn’t structured to do analytics well. A data warehouse,
on the other hand, is structured to make analytics fast and easy.
Hive is a data warehouse infrastructure built on top of Hadoop which is suitable for long running ETL jobs.
Hbase is a database designed to handle real time transactions

Resources