How to run hadoop multithread way in single JVM? - hadoop

I have 4 core desktop and want to use all my cores for local data processing with hadoop.
(i.e. sometimes I have enough power to process data locally sometimes I submit same jobs to cluster).
By default hadoop local mode runs only one mapper and one reducer so my local jobs are really slow.
I do not want to setup cluster on single machine first because of "painful" configuration and second I have to create jar each time. So perfect solution is to how run embedded Hadoop on a single machine
PS pseudo-distributed mode is bad option since it will create cluster with Single node, so I will get only one mapper and I have to spend some time on additional configuration.

You need to use MultithreadedMapRunner - just set up it in JobConf's setMapRunnerClass method and don't forget to set mapred.map.multithreadedrunner.threads to desirable concurrency level.
Also there is an another way, you should:
set MultithreadedMapper as your mapper class in Job-typed object
call MultithreadedMapper.setMapperClass with you actual mapper class
call MultithreadedMapper.setNumberOfThreads with desirable concurrency level
But be careful, your mapper class should be thread safe and it's setup and cleanup methods would be called several times, so it isn't a smart idea to mix MultithreadedMapper with MultipulOutput, unless you implement you own MultithreadedMapper inspired class.

Hadoop purposely does not run more than one task at the same time in one JVM for isolation purposes. And in stand-alone (local) mode, only one JVM is ever used. If you want to make use of your four cores, you should run in pseudo-distributed mode, and increase the max number of concurrent tasks to four. You can do this with the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties.

Configuration conf = new Configuration();
Job job = new Job(conf, "SolerRandomHit");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(MultithreadedMapper.class);

Related

Execute a method only once at start of an Apache Storm topology

If I have a simple Apache Storm topology with a spout (set to a parallelism of 2) running on two separate nodes. How can I write a method that will be run once, and only once, at the start of the topology before any processing of tuples has begun?
Any implementation of a singleton/static class, or synchronized method alone will not work, as the two instances are running on separate nodes.
Perhaps there are some Storm methods that I can use to decide if I'm the first Spout to be instantiated, and run only then? I tried playing around with the getThisTaskId() & getThisWorkerTasks() methods, but was unsuccessful.
NOTE: The parallelism of 2 is to keep things simple. A solution should work for any number of nodes/workers.
Edit: Thought of an easier solution. I'll leave the original answer below in case it is helpful.
You can use TopologyContext.getThisTaskIndex to do this. If you make your spout open method run the code only if TopologyContext.getThisTaskIndex == 0, then your code will run only once, before any tuples are emitted.
If the worker that ran this code crashes, the code will be run again when the spout instance with task index 0 is restarted. In order to fix this, you can use Zookeeper to store state that should carry over across restarts, e.g. put a flag in Zookeeper once the only-once code has run, and have the spout open check that the flag is not set before running the code.
You can use TopologyContext.getStormId to get a constant unique string to identify the topology, so you can tell whether the flag was set by this topology or a previous deployment.
Original answer:
The easiest way to run some code only once on deployment of a topology, is to call the code when you submit the topology. You can call the only-once code at the same time as you wire your topology with TopologyBuilder. This will only get run once. The downside is it will run on the machine you're calling storm jar from.
If you for some reason can't do it this way or need to run the code from one of the worker nodes, there isn't anything built in to Storm to allow you to do this. The reason there isn't such a mechanism is that it requires extra coordination between the worker JVMs, and I don't think anyone has needed something like this.
The best option for you would probably be to look at Zookeeper/Curator to do this coordination (see https://curator.apache.org/curator-recipes/index.html). This should allow you to make only one worker in the cluster run your code. You'll have to consider what should happen if the worker chosen to run your code crashes/stalls.
Storm already uses Zookeeper for coordination, so you can just connect to that cluster.

How to write rows asynchronously in Spark Streaming application to speed up batch execution?

I have a spark job where I need to write the output of the SQL query every micro-batch. Write is a expensive operation perf wise and is causing the batch execution time to exceed the batch interval.
I am looking for ways to improve the performance of write.
Is doing the write action in a separate thread asynchronously like shown below a good option?
Would this cause any side effects because Spark itself executes in a distributed manner?
Are there other/better ways of speeding up the write?
// Create a fixed thread pool to execute asynchronous tasks
val executorService = Executors.newFixedThreadPool(2)
dstream.foreachRDD { rdd =>
import org.apache.spark.sql._
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
import spark.implicits._
import spark.sql
val records = rdd.toDF("record")
records.createOrReplaceTempView("records")
val result = spark.sql("select * from records")
// Submit a asynchronous task to write
executorService.submit {
new Runnable {
override def run(): Unit = {
result.write.parquet(output)
}
}
}
}
1 - Is doing the write action in a separate thread asynchronously like shown below a good option?
No. The key to understand the issue here is to ask 'who is doing the write'. The write is done by the resources allocated for your job on the executors in a cluster. Placing the write command on an async threadpool is like adding a new office manager to an office with a fixed staff. Will two managers be able to do more work than one alone given that they have to share the same staff? Well, one reasonable answer is "only if the first manager was not giving them enough work, so there's some free capacity".
Going back to our cluster, we are dealing with a write operation that is heavy on IO. Parallelizing write jobs will lead to contention for IO resources, making each independent job longer. Initially, our job might look better than the 'single manager version', but trouble will eventually hit us.
I've made a chart that attempts to illustrate how that works. Note that the parallel jobs will take longer proportionally to the amount of time that they are concurrent in the timeline.
Once we reach that point where jobs start getting delayed, we have an unstable job that will eventually fail.
2- Would this cause any side effects because Spark itself executes in a distributed manner?
Some effects I can think of:
Probably higher cluster load and IO contention.
Jobs are queuing on the Threadpool queue instead of on the Spark Streaming Queue. We loose the ability to monitor our job through the Spark UI and monitoring API, as the delays are 'hidden' and all is fine from the Spark Streaming point of view.
3- Are there other/better ways of speeding up the write?
(ordered from cheap to expensive)
If you are appending to a parquet file, create a new file often. Appending gets expensive with time.
Increase your batch interval or use Window operations to write larger chunks of Parquet. Parquet likes large files
Tune the partition and distribution of your data => make sure that Spark can do the write in parallel
Increase cluster resources, add more nodes if necessary
Use faster storage
Is doing the write action in a separate thread asynchronously like shown below a good option?
Yes. It's certainly something to consider when optimizing expensive queries and saving their results to external data stores.
Would this cause any side effects because Spark itself executes in a distributed manner?
Don't think so. SparkContext is thread-safe and promotes this kind of query execution.
Are there other/better ways of speeding up the write?
YES! That's the key to understand when to use the other (above) options. By default, Spark applications run in FIFO scheduling mode.
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
That means that to make a room for executing multiple writes asynchronously and in parallel you should configure your Spark application to use FAIR scheduling mode (using spark.scheduler.mode property).
You will have to configure so-called Fair Scheduler Pools to "partition" executor resources (CPU and memory) into pools that you can assign to jobs using spark.scheduler.pool property.
Quoting Fair Scheduler Pools:
Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool "local property" to the SparkContext in the thread that’s submitting them.

How to find the right portion between hadoop instance types

I am trying to find out how many MASTER, CORE, TASK instances are optimal to my jobs. I couldn't find any tutorial that explains how do I figure it out.
How do I know if I need more than 1 core instance? What are the "symptoms" I would see in EMR's console in the metrics that would hint I need more than one core? So far when I tried the same job with 1*core+7*task instances it ran pretty much like on 8*core, but it doesn't make much sense to me. Or is it possible that my job is so much CPU bound that the IO is such minor? (I have a map-only job that parses apache log files into csv file)
Is there such a thing to have more than 1 master instance? If yes, when is it needed? I wonder, because my master node pretty much is just waiting for the other nodes to do the job (0%CPU) for 95% of the time.
Can the master and the core node be identical? I can have a master only cluster, when the 1 and only node does everything. It looks like it would be logical to be able to have a cluster with 1 node that is the master and the core , and the rest are task nodes, but it seems to be impossible to set it up that way with EMR. Why is that?
The master instance acts as a manager and coordinates everything that goes in the whole cluster. As such, it has to exist in every job flow you run but just one instance is all you need. Unless you are deploying a single-node cluster (in which case the master instance is the only node running), it does not do any heavy lifting as far as actual MapReducing is concerned, so the instance does not have to be a powerful machine.
The number of core instances that you need really depends on the job and how fast you want to process it, so there is no single correct answer. A good thing is that you can resize the core/task instance group, so if you think your job is running slow, then you can add more instances to a running process.
One important difference between core and task instance groups is that the core instances store actual data on HDFS whereas task instances do not. In turn, you can only increase the core instance group (because removing running instances would lose the data on those instances). On the other hand, you can both increase and decrease the task instance group by adding or removing task instances.
So these two types of instances can be used to adjust the processing power of your job. Typically, you use ondemand instances for core instances because they must be running all the time and cannot be lost, and you use spot instances for task instances because losing task instances do not kill the entire job (e.g., the tasks not finished by task instances will be rerun on core instances). This is one way to run a large cluster cost-effectively by using spot instances.
The general description of each instance type is available here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/InstanceGroups.html
Also, this video may be useful for using EMR effectively:
https://www.youtube.com/watch?v=a5D_bs7E3uc

Hadoop: force 1 mapper task per node from jobconf

I want to run one task (mapper) per node on my Hadoop cluster, but I cannot modify the configuration with which the tasktracker runs (i'm just a user).
For this reason, I need to be able to push the option through the job configuration. I tried to set the mapred.tasktracker.map.tasks.maximum=1 at hadoop jar command, but the tasktracker ignores it as it has a different setting in its configuration file.
By the way, the cluster uses the Capacity Scheduler.
Is there any way I can force 1 task per node?
Edited:
Why? I have a memory-bound task, so I want each task to use all the memory available to the node.
when you set the no of mappers, either through the configuration files or by some other means, it's just a hint to the framework. it doesn't guarantee that you'll get only the specified no of mappers. the creation of mappers is actually governed by the no of Splits. and the split creation is carried out by the logic which your InputFormat holds. if you really want to have just one mapper to process the entire file, set "issplittable" to true in the InputFormat class you are using. but why would you do that?the power of hadoop actually lies in distributed parallel processing.

Why all the reduce tasks are ending up in a single machine?

I wrote a relatively simple map-reduce program in Hadoop platform (cloudera distribution). Each Map & Reduce write some diagnostic information to standard ouput besides the regular map-reduce tasks.
However when I'm looking at these log files, I found that Map tasks are relatively evenly distributed among the nodes (I have 8 nodes). But the reduce task standard output log can only be found in one single machine.
I guess, that means all the reduce tasks ended up executing in a single machine and that's problematic and confusing.
Does anybody have any idea what's happening here ? Is it configuration problem ?
How can I make the reduce jobs also distribute evenly ?
If the output from your mappers all have the same key they will be put into a single reducer.
If your job has multiple reducers, but they all queue up on a single machine, then you have a configuration issue.
Use the web interface (http://MACHINE_NAME:50030) to monitor the job and see the reducers it has as well as what machines are running them. There is other information that can be drilled into that will provide information that should be helpful in figuring out the issue.
Couple questions about your configuration:
How many reducers are running for the job?
How many reducers are available on each node?
Is the node running the reducer better
hardware than the other nodes?
Hadoop decides which Reducer will process which output keys by the use of a Partitioner
If you are only outputting a few keys and want an even distribution across your reducers, you may be better off implementing a custom Partitioner for your output data. eg
public class MyCustomPartitioner extends Partitioner<KEY, VALUE>
{
public int getPartition(KEY key, VALUE value, int numPartitions) {
// do something based on the key or value to determine which
// partition we want it to go to.
}
}
You can then set this custom partitioner in the job configuration with
Job job = new Job(conf, "My Job Name");
job.setPartitionerClass(MyCustomPartitioner.class);
You can also implement the Configurable interface in your custom Partitioner if you want to do any further configuration based on job settings.
Also, check that you haven't set the number of reduce tasks to 1 anywhere in the configuration (look for "mapred.reduce.tasks"), or in code, eg
job.setNumReduceTasks(1);

Resources