Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected - hadoop

Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy
{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
We have a scenario where there are multiple hdfs files being written (order of 500-1000 files - at most 10-40 such files written concurrently) -- we don't call close immediately on each file for every write -- but keep writing till end and then call close.
It seems that sometimes we get the above error - and the write fails. We have set hdfs retries to 10 - but that does not seem to help.
We also increased dfs.datanode.handler.count to 200 - that did sometime helped but not always.
a) Would increasing dfs.datanode.handler.count help here? even if 10 are written concurrently..
b) What should be done so that we don't get error at application level -- as such hadoop monitoring page indicates that disks are healthy - but from the warning message, it did seemed that sometimes disks were not available -- org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy
{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy
Assuming that above happens only when we find failures to disks -- we also tried setting dfs.client.block.write.replace-datanode-on-failure.enable to false, so that for temporary failures, we don't get errors. But it does not seem to help either.
Any further suggestions here?

In my case this was fixed by opening the firewall port 50010 for the datanodes (on Docker)

Related

CoGroupByKey always failed on big data (PythonSDK)

I have about 4000 files (avg ~7MB each) input.
My pipeline always failed on the step CoGroupByKey when the data size reach about 4GB.
I tried to limit only use 300 file then it run just fine.
In case of fail, the logs on GCP dataflow only show:
Workflow failed. Causes: S24:CoGroup Geo data/GroupByKey/Read+CoGroup Geo data/GroupByKey/GroupByWindow+CoGroup Geo data/Map(_merge_tagged_vals_under_key) failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
store-migration-10212040-aoi4-harness-m7j7
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.
I digging through all logs in Logs Explorer. Nothing else indicate error other than the above, even my logging.info and try...except code.
Think this relate to the memory of the instances but I didn't digging into that direction. Because it kindna what I don't want to worry about when I am using GCP services.
Thanks.

Must hacking protobuf jar?

1.My namenode log always prints error log java.io.IOException: Requested data length 113675682 is longer than maximum configured RPC length 67108864. RPC came from 172.16.xxx.xxx
And datanode prints Unsuccessfully sent block report 0x706cd6d00df0effe, containing 1 storage report(s), of which we sent 0. The reports had 9016550 total blocks and used 0 RPC(s). This took 1734 msec to generate and 252 msecs for RPC and NN processing. Got back no commands
2.I set ipc.maximum.data.length to 134217728 and solved the problem,But unfortunately,i find after set length,my hdfs client often can't to write data,but just take a few minutes every time.Then i find the namenode throw a new exception,when client can't write,DatanodeProtocol.blockReport from 172.16.xxx.xxx:43410 Call#30074227 Retry#0
java.lang.IllegalStateException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
like Referring HDFS-5153,it says The NameSystem write lock is held during this time.`
I must hacking protobuf jar and set the limit?
EDIT:
I find a Same question,but no solution

Nodes loading, but Elasticsearch has 0 shards

I was testing out a 20 node cluster with default replicates, default sharding, and recently wanted to rename the cluster from the default "elasticsearch." So, I updated the config cluster name, and additionally renamed the data from
mylocation/data/OldName
to
mylocation/data/NewName
Which of course contain:
nodes/0
nodes/1
etc...
About a month later, I'm loading up my cluster again, and I see that while all 20 nodes come back online, it says 0 active shards, 0 primary shards, etc. where this should be several thousand. Status is green, nothing is initializing, nothing looks amiss except I have no data. I look in nodes/0 and I see nodes/0/indices/ are well populated with my index names: the data is actually on the disk. But it seems there's nothing I can do to get it to actually load the shards. The config is using the correct Des.path.data=mylocation/data/.
What could be wrong and how can I debug it? I'm fairly confident I ran this for a week after loading it, but it was some time ago and perhaps other things have changed. It just oddly seems to not be recognize any of the data it's pointing at, and it isn't giving me any kind of "I don't see your data" or "cannot read or write that data" error message.
Update
After it gets started it says:
Recovered [0] indices into cluster_state.
I googled this and it sounded like version compatibility. Checked my binaries and this did not appear to be an issue. I'm using 1.3.2 on all.
Update 2
One of 20 nodes repeatly fails with
ElasticsearchillegalStateException[failed to obtain node lock, is the following location writable?]
It lists the correct data dir, which is writable. Should I delete the node lock? Some node.locks are 664 and some are 640 when the cluster is off. Is this normal or possibly the relic of an unclean shutdown?
Are some of these replicates? I have 40 nodes, 20 are 640 and 20 are 664.
Update 3
There are write locks in place at the lucene level. So
data/NewName/nodes/1/indices/indexname/4/index/write.lock
exists. Is this why moving shards fails? Can i safely delete each of these write locks or is there shared state in the _state file that would lead to inconsistency?

Hadoop MapReduce job I/O Exception due to premature EOF from inputStream

I ran a MapReduce program using the command hadoop jar <jar> [mainClass] path/to/input path/to/output. However, my job was hanging at: INFO mapreduce.Job: map 100% reduce 29%.
Much later, I terminated and checked the datanode log (I am running in pseudo-distributed mode). It contained the following exception:
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:804)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
at java.lang.Thread.run(Thread.java:745)
5 seconds later in the log was ERROR DataXceiver error processing WRITE_BLOCK operation.
What problem might be causing this exception and error?
My NodeHealthReport said:
1/1 local-dirs are bad: /home/$USER/hadoop/nm-local-dir;
1/1 log-dirs are bad: /home/$USER/hadoop-2.7.1/logs/userlogs
I found this which indicates that dfs.datanode.max.xcievers may need to be increased. However, it is deprecated and the new property is called dfs.datanode.max.transfer.threads with default value 4096. If changing this would fix my problem, what new value should I set it to?
This indicates that the ulimit for the datanode may need to be increased. My ulimit -n (open files) is 1024. If increasing this would fix my problem, what should I set it to?
Premature EOF can occur due to multiple reasons, one of which is spawning of huge number of threads to write to disk on one reducer node using FileOutputCommitter. MultipleOutputs class allows you to write to files with custom names and to accomplish that, it spawns one thread per file and binds a port to it to write to the disk. Now this puts a limitation on the number of files that could be written to at one reducer node. I encountered this error when the number of files crossed 12000 roughly on one reducer node, as the threads got killed and the _temporary folder got deleted leading to plethora of these exception messages. My guess is - this is not a memory overshoot issue, nor it could be solved by allowing hadoop engine to spawn more threads. Reducing the number of files being written at one time at one node solved my problem - either by reducing the actual number of files being written, or by increasing reducer nodes.

Hector is unable to read Cassandra data when nodes reboot or terminate

We are trying to run a cassandra cluster on AWS/EC2 within a standard VPC footprint (cassandra nodes on private subnets). Because this is AWS there is always a chance that an EC2 instance will terminate or reboot with no warning. I have been simulating this case on a test cluster and I am seeing things with the cluster that I thought a cluster was suppose to prevent. Specifically if a node reboots some data will go temporarily missing until the node completes its reboot. If a node terminates it appears that some data is lost forever.
For my test I just did a bunch of writes (using QUORUM consistency) to some keyspaces then interrogate the contents of those keyspaces as I bring down nodes (either through reboot or terminate). I'm just using cqlsh SELECT to do the keyspace/column family interrogation of the cluster using ONE consistency level.
Note, even though I am performing no writes to the cluster while I am doing the SELECTs rows temporarily disappear when rebooting and can permanently go missing during termination.
I thought Netflix Priam might be able to help, but sadly it doesn't work in a VPC the last time I checked.
Also, because we are using ephemeral storage instances there is no equivalent of 'shutdown' so I cannot run any scripts during reboot/terminate of an instance to perform a nodetool decommission or nodetool removenode before an instance goes away. Terminate is the equivalent of kicking the plug out of the wall.
Since I am using a replication factor of 3 and quorum/write that should mean that all data is written to at least 2 nodes. So, unless I am totally misunderstanding things (which is possible), losing one node should not mean that I lose any data for any period of time when I am using consistency level ONE for the read.
Questions
Why wouldn't a 6 node cluster with a replication factor of 3 work?
Do I need to run something like a 12 node cluster with a replication factor of 7? Don't bother telling me that will fix the problem, because it doesn't.
Do I need to use consistency level of ALL on the writes then use ONE or QUORUM on the reads?
Is there something not quite right with virtual nodes? unlikely
Are there nodetool commands besides removenode that I need to run when a node terminates to recover missing data? As mentioned earlier, when a reboot occurs, eventually the missing data reappears.
Is there some cassandra savant who can look at my cassandra.yaml file below and send me on the path to salvation?
More Info added 7/19
I don't think this is a QUORUM vs ONE vs ALL is the issue. The test I set up performs no writes to the keyspaces after the initial population of the column families. So the data has had plenty of time (hours) to make it to all the nodes as required by the replication factor. Plus the test dataset is REALLY small (2 column families with about 300-1000 values each). So in other words, the data is completely static.
The behavior I am seeing seems to be tied to the fact that the ec2 instance is no longer on the network. The reason I say this is because if I log on to a node and just do a cassandra stop I see no loss of data. But if I do the reboot or terminate I start getting the following in a stack trace.
CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
CassandraHostRetryService - Downed Host retry shutdown complete
CassandraHostRetryService - Downed Host retry shutdown hook called
Caused by: TimedOutException()
Caused by: TimedOutException()
So it seems to be more of a networking communication issue in that the cluster is expecting, for example 10.0.12.74, to be on the network after it has joined the cluster. If that ip is suddenly unreachable either due to reboot or termination the timeouts start happening.
When I do a nodetool status under all three scenarios (cassandra stop, reboot or terminate) the status of the node shows up as DN. Which is what you would expect. Eventually nodetool status will return to UN with cassandra start or reboot, but obviously termination always stays DN.
Details of my Configuration
Here are some details of my configuration (cassandra.yaml is at the bottom of this posting):
Nodes are running in private subnets of a VPC.
Cassandra 1.2.5 with num_tokens: 256 (virtual nodes). initial_token: (blank). I am really hoping this works because all of our nodes run in autoscaling groups so the thought that redistribution could be handle dynamically is appealing.
EC2 m1.large one seed and one non-seed node in each availability zone. (so 6 total nodes in the cluster).
Ephemeral storage, not EBS.
Ec2Snitch with NetworkTopologyStrategy and all keyspaces have replication factor of 3.
Non-seed nodes are auto_bootstraped, seed nodes are not.
sample cassandra.yaml file
cluster_name: 'TestCluster'
num_tokens: 256
initial_token:
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
disk_failure_policy: stop
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
row_cache_provider: SerializingCacheProvider
saved_caches_directory: /opt/company/dbserver/caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "SEED_IP_LIST"
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 8
memtable_flush_queue_size: 4
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address: LISTEN_ADDRESS
start_native_transport: false
native_transport_port: 9042
start_rpc: true
rpc_address: 0.0.0.0
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
auto_bootstrap: AUTO_BOOTSTRAP
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
cross_node_timeout: false
endpoint_snitch: Ec2Snitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
server_encryption_options:
internode_encryption: none
keystore: conf/.keystore
keystore_password: cassandra
truststore: conf/.truststore
truststore_password: cassandra
client_encryption_options:
enabled: false
keystore: conf/.keystore
keystore_password: cassandra
internode_compression: all
I think http://www.datastax.com/documentation/cassandra/1.2/cassandra/dml/dml_config_consistency_c.html will clear up a lot of this. In particular, QUORUM/ONE is not guaranteed to return the most recent data. QUORUM/QUORUM is. So is ALL/ONE, but that will be intolerant to failure on write.
Edit to go with the new information:
CassandraHostRetryService is part of Hector. I assumed you were testing with cqlsh like a sane person would. Lessons:
Use cqlsh for testing
Use the DataStax Java Driver for building your application, which is faster, easier to use, and has more insight into the cluster state than Hector thanks to the native protocol it's built on.

Resources