Flink Cluster Performance is much worse than standalone - hadoop

I use flink to process HDFS files or local files.
When I use standalone setup, the server can process the data at 500k/s.
But when I use cluster,the server can only process the data at 100k/s.
It is so weird,I can not figure out what is going on.
I found that when I use cluster(2 servers), there is always one server which has low speeds to read/write data. The flink cluster is based on hadoop.
Can anyone help me?

Related

Make spark environment for cluster

I made a spark application that analyze file data. Since input file data size could be big, It's not enough to run my application as standalone. With one more physical machine, how should I make architecture for it?
I'm considering using mesos for cluster manager but pretty noobie at hdfs. Is there any way to make it without hdfs (for sharing file data)?
Spark maintain couple cluster modes. Yarn, Mesos and Standalone. You may start with the Standalone mode which means you work on your cluster file-system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark built-in scripts that loads Spark cluster automatically.
If you are running on an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) on your cluster use the URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-In order to validate that the worker was added to the cluster, you may refer to the following URL: http://localhost:8080 on your master machine and get Spark UI that shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)

HBase standalone performance vs. running on an HDFS cluster

My Application is connected to an HBase and does a lot of communication (hundreds or thousands of reads/writes per second). This strongly affects performance, probably due to I/O operations HBase does on every request.
Doo.dle are calls to my code - the difference between blue and red is time consumed by HBase.
Currently, I've only tested in standalone mode, where HBase stores data using the local file system. I was wondering, whether using one in distributed mode with an actual HDFS could significantly improve performance, or just yield the same results. I'm trying to get a clue before losing too much time into getting a cluster up and running.
A second question I've asked myself is whether a standalone HBase could be configured to just persist data to memory (RAM) instead of writing it to the file system for performance measures.
In the standalone mode,HBase does not use HDFS and it runs all HBase daemons and a local ZooKeeper all up in the same JVM
In a Pseudo-distributed mode, Hbase can run against the local filesystem or it can run against an instance of the Hadoop Distributed File System. So there is no difference between standalone and pseudo-distributed considering the performance.
The Fully-distributed mode requires the use of HDFS which means that the tasks will run over jobs and that's take time according to my experience.
So using Hbase in fully-distributed mode with an actual HDFS could significantly improve performance.

Should the HBase region server and Hadoop data node on the same machine?

Sorry that I don't have the resource to set up a cluster to test it, I'm just wondering to know:
Can I deploy hbase region server on a separated machine other than the hadoop data node machine? I guess the answer is yes, but I'm not sure.
Is it good or bad to deploy hbase region server and hadoop data node on different machines?
When putting some data into hbase, where is this data eventually stored in, data node or region server? I guess it's data node, but what is the StoreFile and HFile in region server, isn't it the physical file to store our data?
Thank you!
RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance.
Very bad, that will work against the data locality principle (If you want to know a little more about data locality check this: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html)
Actual data will be stored in the HDFS (DataNode), RegionServers are responsible of serving and managing regions.
For more information about HBase architecture please check this excelent post from Lars' blog: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
BTW, as long as you have a PC with decent RAM you can set up a demo cluster with virtual machines. Do not ever try to set up a production environment without properly test the platform first in a development environment.
To go in more detail about this answer:
RegionServers should always run alongside? DataNodes in distributed clusters if you want decent performance."
I'm not sure how anyone would interpet the term alongside, so let's try to be even more precise:
What makes any physical server an "XYZ" server is that it's running a program called a daemon (think "eternally-running background event-handling" program);
What makes a "file" server is that it's running a file-serving daemon;
What makes a "web" server is that it's running a web-serving daemon;
AND
What makes a "data node" server is that it's running the HDFS data-serving daemon;
What makes a "region" server then is that it's running the HBase region-serving daemon (program);
So, in all Hadoop Distributions (eg Cloudera, MAPR, Hortonworks, others), the general best practice is that for HBase, the "RegionServers" are "co-located" with the "DataNodeServers".
This means that the actual slave (datanode) servers which form the HDFS cluster are each running the HDFS data-serving daemon (program)
and they're also running the HBase region-serving daemon (program) as well!
This way we ensure locality - the concurrent processing and storing of data on all the individual nodes in an HDFS cluster, with no "movement" of gigantic loads of big data from "storage" locations to "processing" locations. Locality is vital to the success of a Hadoop cluster, such that HBase region servers (data nodes running the HBase daemon as well) must also do all their processing (putting/getting/scanning) on each data node containing the HFiles which make up HRegions which make up HTables which make up HBases (Hadoop-dataBases) ... .
So, servers (VMs or physical on Windows, Linux, ..) can run multiple daemons concurrently, often, they run dozens of them regularly.

Hadoop Disaster Recovery

Anyone could help me to know about Hadoop Disaster Recovery ?
should i replicate data from cluster to another cluster as backup use distcp?
Or i can use copyToLocal to copy my data to my localmachine ?
Anyone idea about it ?
DRP plan goes beyond just the technology and the requirements can greatly affect the solution.
for instance if you can't afford to lose any data you'd want an active/active setup and send data to two hadoop clusters simultaneously. on the other side of the spectrum hadoop's replication (default is 3 copies but you can change that) and rack awareness can give you a copy on a secondary rack. In between you can use things like distcp that you mention to copy data from cluster to cluster.
Additionally you might want to follow project falcon which is a new initiative for hadoop data life-cycle management

read data from amazon hbase

Can anyone suggest me that whether I can read data from amazon hbase using the org.apache.hadoop.conf.Configuration and org.apache.hadoop.hbase.client.HTablePool.
We are migrating to Amazon's EMR framework having hbase running on top of it.
The present implementation is based on pure Apache hadoop and hbase distributions. I'm trying to verify that no code changes needed even we migrate to amazon's EMR.
Please share your thoughts.
While it should not happen, I would expect the problems and changes related to the nature of EC2 and its networking.
HBase relay on Regions able to renew their leases in timely manner. If Region servers are two busy - because of some massive operations over them, they can not do so and get kicked off the cluster.
In amazon performance of the EC2 instances are much less predictable then in dedicated cluster (unless you use cluster instances), so adjusting timeout parameters and/or nature of your loads might be needed to get cluster to work properly

Resources