Life of distributed cache in Hadoop

When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after the job is completed? If they are deleted, which I presume they are, is there a way to make the cache remain for multiple jobs? Does this work the same way on Amazon's Elastic MapReduce?

I was digging around in the source code, and it looks like files are deleted by TrackerDistributedCacheManager about once a minute when their reference count drops to zero. The TaskRunner explicitly releases all its files at the end of a task. Maybe you should edit TaskRunner to not do this, and control the cache through more explicit means yourself?

I cross-posted this question at the AWS forum and got a good recommendation to use hadoop fs -get to transfer files in a way that persists across jobs.
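If you would rather do the same thing from Java at job setup time, here is a minimal sketch using the HDFS FileSystem API; the paths below are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FetchSideFile {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy a file out of HDFS onto this node's local disk.
            // The local copy is not managed by the distributed cache,
            // so it survives across jobs until you delete it yourself.
            fs.copyToLocalFile(new Path("/shared/lookup.dat"),   // hypothetical HDFS path
                               new Path("/var/tmp/lookup.dat")); // hypothetical local path
        }
    }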

Related

How to set HDFS as statebackend for flink

I want to store Flink state in HDFS so that after a crash I can recover the Flink state from HDFS. I am planning to write state to HDFS every 60 seconds. How can I achieve this?
Is this the config I need to follow?
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/state_backends.html#setting-default-state-backend
And where do I specify the checkpoint interval? Any link or sample code would be helpful.
Choosing where checkpoints are stored (e.g., HDFS) is separate from deciding which state backend to use for managing your working state (which can be on-heap, or in local files managed by the RocksDB library).
These two concepts were cleanly separated in Flink 1.12. In early versions of Flink, the two appeared to be more strongly related than they actually are because the filesystem and rocksdb state backend constructors took a file URI as a parameter, specifying where the checkpoints should be stored.
The best way to manage all of this is to leave this out of your code, and to specify the configuration you want in flink-conf.yaml, e.g.,
state.backend: filesystem
state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
execution.checkpointing.interval: 10s
Information about checkpointing and savepointing can be found at https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/
On how to configure HDFS as a filesystem, you should check https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/
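Since the question asks for sample code: if you do prefer to configure this in the job itself rather than in flink-conf.yaml, a rough equivalent sketch for Flink 1.13+ looks like this (the HDFS URI is the same placeholder used in the config above):

    import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Keep working state on the JVM heap (the post-1.12 equivalent of "filesystem")
            env.setStateBackend(new HashMapStateBackend());

            // Store checkpoints in HDFS (placeholder URI from the config above)
            env.getCheckpointConfig().setCheckpointStorage("hdfs://namenode-host:port/flink-checkpoints");

            // Checkpoint every 60 seconds, as the question asks
            env.enableCheckpointing(60_000);

            // ... build and execute the job as usual
        }
    }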

How to get the scheduler of an already finished job in Yarn Hadoop?

So I'm in this situation where I'm modifying mapred-site.xml and the configuration files specific to different schedulers for Hadoop, and I just want to make sure that the modifications I have made to the default scheduler (FIFO) have actually taken effect.
How can I check the scheduler applied to a job, or to a queue of jobs already submitted to Hadoop, using the job ID?
Sorry if this doesn't make much sense, but I've looked around quite extensively to wrap my head around it and read a lot of documentation, yet I still cannot seem to find this fundamental piece of information.
I'm simply running word count as a job, changing scheduler settings in mapred-site.xml and yarn-site.xml.
For instance, I'm changing the property "yarn.resourcemanager.scheduler.class" to "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler" based on this link: see this
I'm also moving the jar files specific to each scheduler to the correct directory.
For your reference, I'm using the "yarn" runtime mode with Cloudera and Hadoop 2.
Thanks a ton for your help
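One way to verify which scheduler the ResourceManager is actually running (a sketch, not from the original thread) is to query its REST API; the host and port below are placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CheckScheduler {
        public static void main(String[] args) throws Exception {
            // The ResourceManager's scheduler endpoint reports the active scheduler
            // (fifo, capacity or fair) and its queues as JSON.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://resourcemanager-host:8088/ws/v1/cluster/scheduler"))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());

            // For an already finished job, /ws/v1/cluster/apps/<application-id>
            // shows the queue the application ran in, which reflects the scheduler config.
        }
    }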

Hadoop: How to know HDFS file access details?

I am planning to do a project for which I need to monitor HDFS file creation, deletion and append operations in real time. Hadoop metrics tell the number of such operations performed but I need to know the files on which these operations are being done. Logs don't seem to be of much help for this. Is there any framework/technology that would easily allow me to monitor HDFS file operations?
If you use Hadoop 2.6+, the native way to do this is to use the HDFS inotify feature that was implemented in https://issues.apache.org/jira/browse/HDFS-6634
Take a look at this nice presentation from Cloudera: https://www.slideshare.net/Hadoop_Summit/keep-me-in-the-loop-inotify-in-hdfs
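For reference, a rough sketch of what consuming that inotify stream looks like via the HdfsAdmin API (Hadoop 2.7+ delivers events in batches; the namenode URI is a placeholder, and the call typically needs HDFS superuser privileges):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
    import org.apache.hadoop.hdfs.client.HdfsAdmin;
    import org.apache.hadoop.hdfs.inotify.Event;
    import org.apache.hadoop.hdfs.inotify.EventBatch;

    public class HdfsEventTail {
        public static void main(String[] args) throws Exception {
            HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), new Configuration());
            DFSInotifyEventInputStream stream = admin.getInotifyEventStream();

            while (true) {
                EventBatch batch = stream.take();   // blocks until events arrive
                for (Event event : batch.getEvents()) {
                    switch (event.getEventType()) {
                        case CREATE:
                            System.out.println("created: " + ((Event.CreateEvent) event).getPath());
                            break;
                        case APPEND:
                            System.out.println("appended: " + ((Event.AppendEvent) event).getPath());
                            break;
                        case UNLINK:
                            System.out.println("deleted: " + ((Event.UnlinkEvent) event).getPath());
                            break;
                        default:
                            break;
                    }
                }
            }
        }
    }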

Spark Architecture for processing small binary files saved in HDFS

I don't know how to build an architecture for the following use case:
I have a web application where users can upload files (pdf & pptx) and directories to be processed. After the upload is complete, the web application puts the files and directories in HDFS, then sends a message on Kafka with the paths to these files.
The Spark application reads the messages from Kafka streaming, collects them on the master (driver), and after that processes them. I collect the messages first because I need to move the code to the data, not move the data to where the message is received. I understood that Spark assigns tasks to executors that already have the file locally.
I have issues with Kafka because I was forced to collect the messages first for the above reason, and when I want to create a checkpoint the app crashes with "because you are attempting to reference SparkContext from a broadcast variable", even though the code ran before I added checkpointing (I use the SparkContext there because I need to save data to Elasticsearch and PostgreSQL). I don't know exactly how I can do code upgrades under these conditions.
I read about the Hadoop small-files problem, and I understand what the problems are in this case. I read that HBase is a better solution for saving small files than just saving them in HDFS. Another part of the small-files problem is the large number of mappers and reducers created for the computation, but I don't understand whether this problem exists in Spark.
What is the best architecture for this use case?
How should I do job scheduling? Is Kafka good for that, or do I need to use another service like RabbitMQ or something else?
Is there a way to add jobs to a running Spark application through some REST API?
What is the best way to save the files? Is it better to use HBase because I have small files (<100 MB)? Or should I use SequenceFile? I think SequenceFile isn't for my use case because I need to reprocess some files randomly.
What do you think is the best architecture for this use case?
Thanks!
There is no single "best" way to build an architecture. You need to make decisions and stick to them. Make the architecture flexible and decoupled so that you can easily replace components if needed.
Consider the following stages/layers in your architecture:
Retrieval/Acquisition/Transport of source data (files)
Data processing/transformation
Data archival
As a retrieval component, I would use Flume. It is flexible and supports a lot of sources, channels (including Kafka) and sinks. In your case you can configure a source that monitors a directory and extracts the newly received files.
For data processing/transformation, it depends on what task you are solving. You have probably decided on Spark Streaming. Spark Streaming can be integrated with a Flume sink (http://spark.apache.org/docs/latest/streaming-flume-integration.html). There are other options available, e.g. Apache Storm; Flume combines very well with Storm. Some transformations can also be applied in Flume itself.
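If you stay with Kafka (which the question already uses) instead of Flume, a rough sketch of the Spark Streaming side with the kafka-0-10 integration might look like the following; the broker, topic and per-file processing are placeholders, and only the small path strings ever reach the driver:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class FilePathStream {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("file-path-stream");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "broker:9092");   // placeholder
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "uploaded-files");

            JavaInputDStream<ConsumerRecord<String, String>> stream =
                    KafkaUtils.createDirectStream(
                            jssc,
                            LocationStrategies.PreferConsistent(),
                            ConsumerStrategies.<String, String>Subscribe(
                                    Arrays.asList("uploaded-files"), kafkaParams));

            stream.foreachRDD(rdd -> {
                // Collect only the path strings to the driver, never the file bytes.
                List<String> paths = rdd.map(ConsumerRecord::value).collect();
                // Re-derive the context from the RDD instead of capturing it in the
                // closure, which is what trips the "referencing SparkContext" error
                // once checkpointing is enabled.
                JavaSparkContext sc = JavaSparkContext.fromSparkContext(rdd.context());
                for (String path : paths) {
                    // binaryFiles keeps each small file as one (path, content) record
                    sc.binaryFiles(path).foreach(file -> {
                        // parse the pdf/pptx here and write to Elasticsearch / PostgreSQL
                    });
                }
            });

            jssc.start();
            jssc.awaitTermination();
        }
    }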
For data archival: do not store/archive the files directly in Hadoop unless they are bigger than a few hundred megabytes. One solution would be to put them in HBase.
Make your architecture more flexible. I would place processed files in a temporary HDFS location and have some job regularly archive them into zip, HBase, Hadoop Archive (there is such an animal) or any other solution.
Consider using Apache NiFi (aka HDF, Hortonworks Data Flow). Internally it uses queues and provides a lot of processors. It can make your life easier and get the workflow developed in minutes. Give it a try. There is a nice Hortonworks tutorial which, combined with the HDP Sandbox running on a virtual machine/Docker, can bring you up to speed in a very short time (1-2 hours?).

data backup and recovery in hadoop 2.2.0

I am new to Hadoop and very interested in Hadoop administration, so I installed Hadoop 2.2.0 on Ubuntu 12.04 in pseudo-distributed mode. The installation succeeded and I ran some example jar files. Now I am trying to learn further, starting with the data backup and recovery part. Can anyone tell me how to take a data backup and recover it in Hadoop 2.2.0, and also suggest good books for Hadoop administration and steps to learn it?
Thanks in advance.
There is no classic backup and recovery functionality in Hadoop. There are several reasons for this:
HDFS uses block level replication for data protection via redundancy.
HDFS scales out massively in size, and it is becoming more economical to back up to disk rather than tape.
The size of "Big Data" doesn't lend itself to being easily backed up.
Instead of backups, Hadoop uses data replication. Internally, it creates multiple copies of each block of data (by default, 3 copies). It also has a function called 'distcp', which allows you to replicate copies of data between clusters. This is what's typically done for "backups" by most Hadoop operators.
Some companies, like Cloudera, are incorporating the distcp tool into creating a 'backup' or 'replication' service for their distribution of Hadoop. It operates against a specific directory in HDFS, and replicates it to another cluster.
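distcp is normally run from the command line, but it can also be driven from Java if you want to wrap it in your own backup job; a rough sketch against the Hadoop 2.x API, with placeholder cluster URIs:

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class BackupJob {
        public static void main(String[] args) throws Exception {
            // Copy a directory from the primary cluster to a "backup" cluster.
            DistCpOptions options = new DistCpOptions(
                    Collections.singletonList(new Path("hdfs://primary-nn:8020/data/important")),
                    new Path("hdfs://backup-nn:8020/backups/important"));

            DistCp distCp = new DistCp(new Configuration(), options);
            distCp.execute();   // runs the underlying MapReduce copy job
        }
    }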
If you really wanted to create a backup service for Hadoop, you could create one manually yourself. You would need some mechanism for accessing the data (NFS gateway, WebHDFS, etc.), and could then use tape libraries, VTLs, etc. to create the backups.
