Where do files moved to trash go? - hadoop

I have the following questions regarding the "move to trash" functionality in the Hue GUI:
Where do these files go?
How long are they stored?
Can I restore them?

1) They go to /user/hduser/.Trash in HDFS, where hduser is the operating-system user (it can also be a Windows user if you are using a Java client from Windows + Eclipse).
2) This depends on the following configuration in core-site.xml:
<property>
  <name>fs.trash.interval</name>
  <value>30</value>
  <description>Number of minutes after which the checkpoint
  gets deleted. If zero, the trash feature is disabled.
  </description>
</property>
3) For this recovery method to work, trash must be enabled in HDFS. Trash is enabled by setting the property fs.trash.interval (shown in the XML above) to a value greater than 0.
By default the value is zero, which disables the trash feature. Its value is the number of minutes after which a checkpoint gets deleted. This property has to be set in core-site.xml.
There is one more property related to the one above, called fs.trash.checkpoint.interval. It is the number of minutes between trash checkpoints, and it should be smaller than or equal to fs.trash.interval.
Every time the checkpointer runs, it creates a new checkpoint out of Current and removes checkpoints created more than fs.trash.interval minutes ago. For example, with fs.trash.interval=30 and fs.trash.checkpoint.interval=15, a checkpoint is taken every 15 minutes and any checkpoint older than 30 minutes is removed, so a trashed file survives for roughly 30 to 45 minutes.
The default value of this property is zero.
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>15</value>
  <description>Number of minutes between trash checkpoints.
  Should be smaller or equal to fs.trash.interval.
  Every time the checkpointer runs it creates a new checkpoint
  out of current and removes checkpoints created more than
  fs.trash.interval minutes ago.
  </description>
</property>
If the above properties are enabled in your cluster, deleted files will be present in the .Trash directory of HDFS. You have time to recover a file until the next checkpoint occurs; after a new checkpoint the deleted files will no longer be present in .Trash, so recover them before then. If this property is not enabled in your cluster, you can enable it for future recovery :)
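As a minimal sketch of such a recovery from a Java client (the file paths are hypothetical, and it assumes fs.trash.interval > 0 on the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashRecoveryExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // Delete through the trash rather than permanently; the file lands under
        // /user/<username>/.Trash/Current/<original path>.
        Trash.moveToAppropriateTrash(fs, new Path("/user/hduser/data.txt"), conf);

        // Restore it before the next checkpoint by moving it back out of the trash.
        fs.rename(new Path("/user/hduser/.Trash/Current/user/hduser/data.txt"),
                  new Path("/user/hduser/data.txt"));
    }
}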

Related

SNMPv3 user entries got deleted after reboot on SNMPv3 machine

After a reboot of our SNMPv3 server, the SNMPv3 user entries get deleted from the file /var/net-snmp/snmpd.conf.
It appears that on reboot, engineBoots is randomly set to 1 in /var/net-snmp/snmpd.conf, and whenever engineBoots is set to 1 the SNMPv3 user entries are erased from the snmpd.conf file.
Firstly, I want to understand why the engineBoots value is randomly set to 1. Per the standard SNMP documents, engineBoots should be incremented every time we reboot or engineTime exceeds its maximum value.
Secondly, we want to figure out the correlation between the engineBoots value being set to 1 and the deletion of the usmUser entries in /var/net-snmp/snmpd.conf.
Thanks, Ravi
This is normal behavior. Once snmpd is restarted, the createUser line is deleted from /var/net-snmp/snmpd.conf for security reasons, and the user is created in the usmUserTable.
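For illustration (the user name and passphrases here are hypothetical), a directive such as the following is consumed by snmpd on startup and replaced with a usmUser entry containing localized keys, so the cleartext passphrases never persist:

createUser myV3User SHA "myAuthPass" AES "myPrivPass"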

Nodes loading, but Elasticsearch has 0 shards

I was testing out a 20-node cluster with default replicas and default sharding, and recently wanted to rename the cluster from the default "elasticsearch". So I updated the cluster name in the config and additionally renamed the data directory from
mylocation/data/OldName
to
mylocation/data/NewName
which of course contains:
nodes/0
nodes/1
etc...
About a month later, I'm loading up my cluster again, and I see that while all 20 nodes come back online, it reports 0 active shards, 0 primary shards, etc., where there should be several thousand. Status is green, nothing is initializing, nothing looks amiss except that I have no data. I look in nodes/0 and I see that nodes/0/indices/ is well populated with my index names: the data is actually on the disk. But it seems there's nothing I can do to get it to actually load the shards. The config is using the correct -Des.path.data=mylocation/data/.
What could be wrong and how can I debug it? I'm fairly confident I ran this for a week after loading it, but it was some time ago and perhaps other things have changed. It just oddly seems not to recognize any of the data it's pointing at, and it isn't giving me any kind of "I don't see your data" or "cannot read or write that data" error message.
Update
After it gets started it says:
Recovered [0] indices into cluster_state.
I googled this and it sounded like version compatibility. Checked my binaries and this did not appear to be an issue. I'm using 1.3.2 on all.
Update 2
One of the 20 nodes repeatedly fails with
ElasticsearchIllegalStateException[failed to obtain node lock, is the following location writable?]
It lists the correct data dir, which is writable. Should I delete the node lock? Some node.lock files are 664 and some are 640 when the cluster is off. Is this normal, or possibly the relic of an unclean shutdown?
Are some of these replicas? I have 40 of them; 20 are 640 and 20 are 664.
Update 3
There are write locks in place at the Lucene level, so
data/NewName/nodes/1/indices/indexname/4/index/write.lock
exists. Is this why moving shards fails? Can I safely delete each of these write locks, or is there shared state in the _state file that would lead to inconsistency?

Apache Flume rolling over the HDFS files on an hourly basis

I'm new to Flume and I was exploring options to roll over my HDFS files on an hourly basis using Flume.
In my project, Apache Flume reads messages from RabbitMQ and writes them to HDFS.
hdfs.rollInterval - Closes the file after the given interval, counted from when the file was opened.
A new file is created only when Flume reads a message after the previous file has been closed. This option does not solve our problem.
hdfs.path = /%y/%m/%d/%H - This option works fine and creates a folder on an hourly basis. But the problem is that the new folder is created only when a new message comes.
For example: messages come in until 11:59, so the file is in the open state. Then the messages stop coming until 12:30, but the file is still in the open state. After 12:30 a new message comes; because of the hdfs.path configuration, the previous file is then closed and a new file is created in the new folder.
The previous file cannot be used for computation until it is closed.
We need an option to close the open files exactly on an hourly basis. I'm wondering if there are any options in Flume for doing that.
hdfs.rollInterval is described as
Number of seconds to wait before rolling current file
So this setting should cause each file to stay open for an hour at a time:
hdfs.rollInterval = 3600
And I would additionally ignore file size and event count, so add these as well:
hdfs.rollSize = 0
hdfs.rollCount = 0
hdfs.idleTimeout is described as
Timeout after which inactive files get closed (0 = disable automatic closing of idle files)
For example, you can set this property to 180; a file that receives no events for 180 seconds will then be closed automatically rather than staying open until the next message arrives.
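Putting these together, a minimal sketch of the HDFS sink configuration might look like the following (the agent name a1, sink name k1, and channel name c1 are hypothetical, and the 180-second idle timeout is just an example value):

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /%y/%m/%d/%H
# Roll strictly on time: one new file per hour, regardless of size or event count.
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
# Close a file that receives no events for 3 minutes, so a quiet period
# does not leave the previous hour's file open.
a1.sinks.k1.hdfs.idleTimeout = 180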

How to tune a Spark application with a custom Hadoop input format

My Spark application processes files (average size is 20 MB) with a custom Hadoop input format and stores the result in HDFS.
The following is the code snippet.
Configuration conf = new Configuration();
JavaPairRDD<Text, Text> baseRDD = ctx
    .newAPIHadoopFile(input, CustomInputFormat.class, Text.class, Text.class, conf);
JavaRDD<myClass> mapPartitionsRDD = baseRDD
    .mapPartitions(new FlatMapFunction<Iterator<Tuple2<Text, Text>>, myClass>() {
        // my logic goes here
    });
// few more transformations
result.saveAsTextFile(path);
This application creates one task (partition) per file, processes it, and stores the corresponding part file in HDFS,
i.e., for 10,000 input files, 10,000 tasks are created and 10,000 part files are stored in HDFS.
Both mapPartitions and map operations on baseRDD create one task per file.
The SO question
How to set the number of partitions for newAPIHadoopFile?
suggests setting
conf.setInt("mapred.max.split.size", 4); to configure the number of partitions.
But when this parameter is set, the CPU is utilized at maximum and no stage starts even after a long time.
If I don't set this parameter, the application completes successfully as mentioned above.
How do I set the number of partitions with newAPIHadoopFile and increase the efficiency?
What happens with the mapred.max.split.size option?
============
Update:
What happens with the mapred.max.split.size option?
In my use case the file size is small, so changing the split size options is irrelevant here.
More info in this SO question: Behavior of the parameter "mapred.min.split.size" in HDFS
Just use baseRDD.repartition(<a sane amount>).mapPartitions(...). That will move the resulting operation onto fewer partitions, which helps especially when your files are small.
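As a sketch of how that fits the snippet above (the partition count of 100 is an arbitrary example; tune it to your cluster):

JavaRDD<myClass> mapPartitionsRDD = baseRDD
    // Merge the many small per-file partitions into fewer, larger ones
    // before running the per-partition logic.
    .repartition(100)
    .mapPartitions(new FlatMapFunction<Iterator<Tuple2<Text, Text>>, myClass>() {
        // my logic goes here
    });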

hadoop: how to increase the limit of failed tasks

I want to run a job so that all task failures are just logged and are otherwise ignored (basically to test my input). Right now, when a task fails I get "# of failed Map Tasks exceeded allowed limit". How do I increase the limit?
I use Hadoop 1.2.1
Specify mapred.max.map.failures.percent and mapred.max.reduce.failures.percent in mapred-site.xml to set the failure thresholds. Both default to 0. Check the code of JobConf.java for more details.
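For example, to let a job succeed no matter how many of its tasks fail (the values are percentages of tasks allowed to fail, so 100 tolerates everything; pick a lower threshold if you want only partial tolerance):

<property>
  <name>mapred.max.map.failures.percent</name>
  <value>100</value>
</property>
<property>
  <name>mapred.max.reduce.failures.percent</name>
  <value>100</value>
</property>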
In order to increase the limit of map tasks, try adding the following to the mapred-site.xml file.
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>{cores}</value>
</property>
This sets the number of map tasks that can run to the maximum value. In place of {cores} you should substitute the number of cores you have; setting this to exactly the number of available cores is not considered good practice. Let me know if you have any questions.
Hope this helps.
Happy Hadooping!!!
