Why are locality levels all ANY in a Spark wordcount application running on HDFS? - hadoop

I ran a Spark cluster of 12 nodes (8G memory and 8 cores for each) for some tests.
I'm trying to figure out why data localities of a simple wordcount app in "map" stage are all "Any". The 14GB dataset is stored in HDFS.

I have run into the same problem and in my case it was a problem with the configuration. I was running on the EC2 and I had a name mismatch. Maybe the same thing happened to you.
When you check how HDFS sees you cluster it should be something along this lines:
hdfs dfsadmin -printTopology
Rack: /default-rack
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
And the same should be seen in executors' address in the UI (by default it's http://your-cluster-public-dns:8080/).
In my case I was using public hostname for spark slaves. I have changed my SPARK_LOCAL_IP in $SPARK/conf/spark-env.sh to use the private name as well, and after that change I get NODE_LOCAL most of the times.

I encounter the same problem today. This is my situation:
My cluster have 9 workers(each setup one executor by default) ,when i set --total-executor-cores 9, the Locality lever is NODE_LOCAL, but when i set the total-executor-cores below 9 such as --total-executor-cores 7, then Locality lever become ANY, and the total time cost is 10X than NODE_LOCAL lever. You can have a try.

I'm running my cluster on EC2s, and I fixed my problem by adding the following to spark-env.sh on the name node
SPARK_MASTER_HOST=<name node hostname>
and then adding the following to spark-env.sh on the data nodes
SPARK_LOCAL_HOSTNAME=<data node hostname>

Don't start slaves like this start-all.sh. u should start every slave alonely
$SPARK_HOME/sbin/start-slave.sh -h <hostname> <masterURI>

Related

Datastax - Cassandra Amazon EC2 Multiregion Setup - Cluster with 3 node

I have launched 3 Amazon EC2 instance and setup datastax cassandra as follows
1.Region - US EAST:
cassandra.yaml - configuration
a.listen_address as private IP of this instance
b.broadcast_address as public IP of this instance
c.seeds as 50.XX.XX.X1, 50.XX.XX.X2 (public-ip of node1,public-ip of node2)
cassandra-rackdc.properties - configuration
dc=DC1
rack=RAC1
dc_suffix=US_EAST_1
2.Region - US WEST:
I did same procedure as I did above.
3.Region - EU IRELAND:
The result of above configuration is
All the node working good individually. But when I do
$nodetool status on all the three node
It only listing the local node only.
I tried to achieve the following things.
1. Launch 3 cassandra node in three different region. For say, US-EAST,US-WEST,EU-IRELAND.
With Following configuration or methodology
a.Ec2MultiRegionSnitch
b.Replication staragey as SimpleStrategy
c.Replication Factor as 3
d. Read & write level as QUORUM.
I wish to attain only one thing i.e. if any two of the region is down or any two of the node down, I can survive with renaming one node.
My Questions here are
Where I did the mistake? and How to attain my requirements?
Any help or inputs are much appreciated.
Thanks.
This is what worked for me with cassandra 3.0
endpoint_snitch: Ec2MultiRegionSnitch
listen_address: <leave_blank>
broadcast_address: <public_ip_of_server>
rpc_address: 0.0.0.0
broadcast_rpc_address: <public_ip_of_server>
-seed: "one_ip_from_other_DC"
Finally, I found the resolution of my issue. I am using replication strategy as SimpleStrategy, hence I do not require to configure cassandra-rackdc.properties.
Once, I removed the file cassandra-rackdc.properties from all node, Everything working as expected.
Thanks

Mahout - ParallelALSFactorizationJob running too long?

I am trying to run Mahout ALS recommendation on AWS EMR cluster, however, it takes much longer than I expected.
The following is the command I run:
aws add-steps --cluster-id <cluster_id> \
--steps Type=CUSTOM_JAR,\
Name="Mahout ALS Factorization Job",\
Jar=s3://<my_bucket>/recproto/mahout-mr-0.10.0-job.jar,\
MainClass=org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob,\
Args=["--input","s3://<my_bucket>/recproto/trainingdata/userClicks.csv.gz",\
"--output","s3://<my_bucket>/recproto/als-output/",\
"--implicitFeedback","true",\
"--lambda","150",\
"--alpha","0.05",\
"--numFeatures","100",\
"--numIterations","3",\
"--numThreadsPerSolver","4",\
"--usesLongIDs","true"]
In the userClicks.csv file, there are 1,567,808 ratings from 335,636 users and 23,934 items.
The job is run on a 10-c3.xlarge nodes EMR cluster, and the job has been running for more than 2 hours. I would like to know is this normal? In the case of my rating file, which scale of EMR cluster and parameters should I use so I can get a more acceptable running time?
I solved this problem by simply use Spark ALS. The training process spends LESS THAN 2 MINUTES ON MY LAPTOP on the same dataset with the same parameters.
I can now understand why some machine learning algorithms are deprecated due to performance issues...(e.g., the Minhash algorithm)

Hadoop performance modeling

I am working on Hadoop performance modeling. Hadoop has 200+ parameters so setting them manually is not possible. So often we run our hadoop jobs with default parameter value(like using default value io.sort.mb, io.sort.record.percent, mapred.output.compress etc). But using default parameter value gives us sub optimal performance. There is some work done in this area by Herodotos Herodotou (http://www.cs.duke.edu/starfish/files/vldb11-job-optimization.pdf) to improve performance. But i have following doubt in their work --
They are fixing the value of parameters at the job start time( according to proportionality assumption of data) for all the phases( read, map, collect etc.) of MapReduce job. Can we set different value of these parameters for each phase at run time according to run time environment( like cluster configuration, underling file system etc.), by changing Hadoop configuration log files of a particular node to get optimal performance from a node ?
They are using white box model for Hadoop core are they still applicable for
current Hadoop ( http://arxiv.org/pdf/1106.0940.pdf) ?
No, you couldn't dynamically change MapReduce parameters per job per node.
Configuring set of nodes
Rather what you could do is change the configuration parameters per node statically in the configuration files (generally located in /etc/hadoop/conf), so that you could take the most out of your cluster with different h/w configurations.
Example: Assume you have 20 worker nodes with different hardware configurations like:
10 with configuration of 128GB RAM, 24 Cores
10 with configuration of 64GB RAM, 12 Cores
In that case you would want to configure each of identical servers to take most out of the hardware for example, you would want to run more child tasks (mappers & reducers) on worker nodes with more RAM and Cores, for example:
Nodes with 128GB RAM, 24 Cores => 36 worker tasks (mappers + reducers), JVM heap for each worker task would be around 3GB.
Nodes with 64GB RAM, 12 Cores => 18 worker tasks (mappers + reducers), JVM heap for each worker task would be around 3GB.
So, you would want to configure the set of nodes respectively with appropriate parameters.
Using ToolRunner to pass configuration parameters dynamically to a Job:
Also, you could dynamically change the MapReduce job parameters per job but these parameters would be applied to the entire cluster not just to a set of nodes. Provided your MapReduce job driver extends ToolRunner.
ToolRunner allows you to parse generic hadoop command line arguments. You'll be able to pass MapReduce configuration parameters using -D property.name=property.value.
You can pretty much pass almost all hadoop parameters dynamically to a job. But most commonly passed MapReduce configuration parameters dynamically to a job are:
mapreduce.task.io.sort.mb
mapreduce.map.speculative
mapreduce.job.reduces
mapreduce.task.io.sort.factor
mapreduce.map.output.compress
mapreduce.map.outout.compress.codec
mapreduce.reduce.memory.mb
mapreduce.map.memory.mb
Here is an example terasort job passing lots of parameters dynamically per job:
hadoop jar hadoop-mapreduce-examples.jar tearsort \
-Ddfs.replication=1 -Dmapreduce.task.io.sort.mb=500 \
-Dmapreduce.map.sort.splill.percent=0.9 \
-Dmapreduce.reduce.shuffle.parallelcopies=10 \
-Dmapreduce.reduce.shuffle.memory.limit.percent=0.1 \
-Dmapreduce.reduce.shuffle.input.buffer.percent=0.95 \
-Dmapreduce.reduce.input.buffer.percent=0.95 \
-Dmapreduce.reduce.shuffle.merge.percent=0.95 \
-Dmapreduce.reduce.merge.inmem.threshold=0 \
-Dmapreduce.job.speculative.speculativecap=0.05 \
-Dmapreduce.map.speculative=false \
-Dmapreduce.map.reduce.speculative=false \

 -Dmapreduce.job.jvm.numtasks=-1 \
-Dmapreduce.job.reduces=84 \

 -Dmapreduce.task.io.sort.factor=100 \
-Dmapreduce.map.output.compress=true \

 -Dmapreduce.map.outout.compress.codec=\
org.apache.hadoop.io.compress.SnappyCodec \
-Dmapreduce.job.reduce.slowstart.completedmaps=0.4 \
-Dmapreduce.reduce.merge.memtomem.enabled=fasle \
-Dmapreduce.reduce.memory.totalbytes=12348030976 \
-Dmapreduce.reduce.memory.mb=12288 \

 -Dmapreduce.reduce.java.opts=“-Xms11776m -Xmx11776m \
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
-XX:+CMSIncrementalPacing -XX:ParallelGCThreads=4” \

 -Dmapreduce.map.memory.mb=4096 \

 -Dmapreduce.map.java.opts=“-Xmx1356m” \
/terasort-input /terasort-output

How does communication between datanodes work in a Hadoop cluster?

I am new to Hadoop and help with this questions is appreciated.
The replication of blocks in a cluster is handled by individual data nodes having a copy of the block, but how does this transfer take place without considering namenode.
I found that ssh is setup from slaves to master and master to slaves unlike slave to slave.
Could someone explain?
Is it through hadoop data transfer protocol like Client to DN communication ?
http://blog.cloudera.com/blog/2013/03/how-to-set-up-a-hadoop-cluster-with-network-encryption/
After digging into hadoop source code,I find datanodes use BlockSender class to transfer block data.Actually Socket is under the hood.
Below is my hack way to find this.(hadoop version 1.1.2 used here)
DataNode Line 946 is offerService method, which is a main loop
for service.
codes above is datanode send heartbeat to namenode mainly to tell it is alive.the return value are some commands which datanode will process.this is where block copy happens.
digging into processCommand we come at Line 1160
here is a comment which we can be undoubtedly sure transferBlocks is what we want.
digging into transferBlocks, we come at Line 1257, a private method.At the end of the method,
new Daemon(new DataTransfer(xferTargets, block, this)).start();
so,we know datanode start a new thread to do block copy.
Look at DataTransfer in Line 1424,check at run method.
at the nearly end of run method,we find following snippets:
// send data & checksum
blockSender.sendBlock(out, baseStream, null);
from code above, we can know BlockSender is the actual worker.
I have done my work,It is up to you to find more,such as BlockReader
Whenever a block has to be written in HDFS, the NameNode will allocate space for this block on any datanode. It will also allocate space on other datanodes for the replicas of this block. Then it will instruct the first datanode to write the block and also to replicate the block on the other datanodes where space was allocated for the replicas.

Hector is unable to read Cassandra data when nodes reboot or terminate

We are trying to run a cassandra cluster on AWS/EC2 within a standard VPC footprint (cassandra nodes on private subnets). Because this is AWS there is always a chance that an EC2 instance will terminate or reboot with no warning. I have been simulating this case on a test cluster and I am seeing things with the cluster that I thought a cluster was suppose to prevent. Specifically if a node reboots some data will go temporarily missing until the node completes its reboot. If a node terminates it appears that some data is lost forever.
For my test I just did a bunch of writes (using QUORUM consistency) to some keyspaces then interrogate the contents of those keyspaces as I bring down nodes (either through reboot or terminate). I'm just using cqlsh SELECT to do the keyspace/column family interrogation of the cluster using ONE consistency level.
Note, even though I am performing no writes to the cluster while I am doing the SELECTs rows temporarily disappear when rebooting and can permanently go missing during termination.
I thought Netflix Priam might be able to help, but sadly it doesn't work in a VPC the last time I checked.
Also, because we are using ephemeral storage instances there is no equivalent of 'shutdown' so I cannot run any scripts during reboot/terminate of an instance to perform a nodetool decommission or nodetool removenode before an instance goes away. Terminate is the equivalent of kicking the plug out of the wall.
Since I am using a replication factor of 3 and quorum/write that should mean that all data is written to at least 2 nodes. So, unless I am totally misunderstanding things (which is possible), losing one node should not mean that I lose any data for any period of time when I am using consistency level ONE for the read.
Questions
Why wouldn't a 6 node cluster with a replication factor of 3 work?
Do I need to run something like a 12 node cluster with a replication factor of 7? Don't bother telling me that will fix the problem, because it doesn't.
Do I need to use consistency level of ALL on the writes then use ONE or QUORUM on the reads?
Is there something not quite right with virtual nodes? unlikely
Are there nodetool commands besides removenode that I need to run when a node terminates to recover missing data? As mentioned earlier, when a reboot occurs, eventually the missing data reappears.
Is there some cassandra savant who can look at my cassandra.yaml file below and send me on the path to salvation?
More Info added 7/19
I don't think this is a QUORUM vs ONE vs ALL is the issue. The test I set up performs no writes to the keyspaces after the initial population of the column families. So the data has had plenty of time (hours) to make it to all the nodes as required by the replication factor. Plus the test dataset is REALLY small (2 column families with about 300-1000 values each). So in other words, the data is completely static.
The behavior I am seeing seems to be tied to the fact that the ec2 instance is no longer on the network. The reason I say this is because if I log on to a node and just do a cassandra stop I see no loss of data. But if I do the reboot or terminate I start getting the following in a stack trace.
CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
CassandraHostRetryService - Downed Host retry shutdown complete
CassandraHostRetryService - Downed Host retry shutdown hook called
Caused by: TimedOutException()
Caused by: TimedOutException()
So it seems to be more of a networking communication issue in that the cluster is expecting, for example 10.0.12.74, to be on the network after it has joined the cluster. If that ip is suddenly unreachable either due to reboot or termination the timeouts start happening.
When I do a nodetool status under all three scenarios (cassandra stop, reboot or terminate) the status of the node shows up as DN. Which is what you would expect. Eventually nodetool status will return to UN with cassandra start or reboot, but obviously termination always stays DN.
Details of my Configuration
Here are some details of my configuration (cassandra.yaml is at the bottom of this posting):
Nodes are running in private subnets of a VPC.
Cassandra 1.2.5 with num_tokens: 256 (virtual nodes). initial_token: (blank). I am really hoping this works because all of our nodes run in autoscaling groups so the thought that redistribution could be handle dynamically is appealing.
EC2 m1.large one seed and one non-seed node in each availability zone. (so 6 total nodes in the cluster).
Ephemeral storage, not EBS.
Ec2Snitch with NetworkTopologyStrategy and all keyspaces have replication factor of 3.
Non-seed nodes are auto_bootstraped, seed nodes are not.
sample cassandra.yaml file
cluster_name: 'TestCluster'
num_tokens: 256
initial_token:
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
disk_failure_policy: stop
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
row_cache_provider: SerializingCacheProvider
saved_caches_directory: /opt/company/dbserver/caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "SEED_IP_LIST"
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 8
memtable_flush_queue_size: 4
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address: LISTEN_ADDRESS
start_native_transport: false
native_transport_port: 9042
start_rpc: true
rpc_address: 0.0.0.0
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
auto_bootstrap: AUTO_BOOTSTRAP
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
cross_node_timeout: false
endpoint_snitch: Ec2Snitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
server_encryption_options:
internode_encryption: none
keystore: conf/.keystore
keystore_password: cassandra
truststore: conf/.truststore
truststore_password: cassandra
client_encryption_options:
enabled: false
keystore: conf/.keystore
keystore_password: cassandra
internode_compression: all
I think http://www.datastax.com/documentation/cassandra/1.2/cassandra/dml/dml_config_consistency_c.html will clear up a lot of this. In particular, QUORUM/ONE is not guaranteed to return the most recent data. QUORUM/QUORUM is. So is ALL/ONE, but that will be intolerant to failure on write.
Edit to go with the new information:
CassandraHostRetryService is part of Hector. I assumed you were testing with cqlsh like a sane person would. Lessons:
Use cqlsh for testing
Use the DataStax Java Driver for building your application, which is faster, easier to use, and has more insight into the cluster state than Hector thanks to the native protocol it's built on.

Resources