Hadoop - data block caching techniques - caching

I am running some experiments to benchmark the time it takes (by map-reduce) to read and process data stored on HDFS with varying parameters. I use pig script to launch map-reduce jobs. Since I am working with the same set of files frequently, my results may get affected because of file/block caching.
I want to understand the various caching techniques employed in a map-reduce environment.
Lets say that a file foo (contains some data to be procesed) stored on HDFS occupies 1 HDFS block and it gets stored in machine STORE. During a map-reduce task, machine COMPUTE reads that block over network and processes it. Caching can happen at two levels:
Cached in memory of machine STORE (in-memory file cache)
Cached in memory/disk of machine COMPUTE.
I am pretty sure that #1 caching happens. I want to ensure whether something like #2 happens? From the post here, it looks like there is no client level caching going on since it is very unlikely that the block cached by COMPUTE will be needed again in the same machine before the cache is flushed.
Also, is the hadoop distributed cache used only to distribute any application specific files (not task specific input data files) to all task tracker nodes? Or is the task specific input file data (like the foo file block) cached in the distributed cache? I assume local.cache.size and related parameters only control the distributed cache.
Please clarify.

The only caching that is ever applied within HDFS is the OS caching to minimize disk accesses.
So if you access a block from a datanode, it is likely to be cached if nothing else is going on there.
On your client side, this depends on what you do with the block. If you directly write it to disk, it is also very likely that your client OS caches it.
The distributed cache is just for jars and files that need to be distributed across the cluster where your job launches tasks. The name is thus a bit misleading, as it "caches" nothing.

Related

NVMeOF/RDMA sync file modifications

I just set up the NVMeOF/RDMA environment to play around. I have a target node which NVMe SSD is accessed by some client nodes. However, when I delete a file say test on one client node, the rest nodes cannot see this operation and can still read the content of test as normal. I know that RDMA bypasses the kernel, so I guess this is because of the cache? I then have tried to clean up the cache using these commands:
sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo sync; echo 1 | sudo tee /proc/sys/vm/drop_caches
sudo sync; echo 2 | sudo tee /proc/sys/vm/drop_caches
Unfortunately, other nodes still keep this file.
So actually I have two questions:
Does it exactly due to the cache? How does it work?
What is the correct way to clean up the cache so that other nodes can see the deletion without re-mount?
Any help will be greatly appreciated!
Relatively short answer
Like Boris said, you don't want to do that (distributed consistency on storage is a hard problem), and you need something else to do what you want. Flushing caches may not work because you've got multiple distinct views of the system + caching behaviors
Longer answer:
As Boris mentioned, NVMeoF is a block protocol. This means that at a broad level (with some hand-waving) all it can do is read and write blocks at a particular address. In practice, we usually have layers above the NVMe/NVMeoF communication layer like file systems that handle this abstraction.
I can't tell right now if you're using a file system or if you reading/writing the device directly, but in either case you are at least partially correct that the page cache may be getting in the way, even with RDMA.
Now, if you are using local file systems on your client nodes, you quickly get inconsistent views. The filesystem (and consequently overall operating system and its view of the state of the page cache and block storage) has no idea anyone else wrote anything. So even if you write and sync on one client, you may have to bypass the page cache on another (e.g. use O_DIRECT reads, which have their own sets of complexities) and make sure you target something that eventually refers to the same block addresses that were written on the NVMe target from your other client.
In theory, this will let you read data written by another if everything lines up correctly, in practice though this can cause confusion, especially if a file system or application on one client writes something at a location, and the other client attempts to read or write that location unknowingly. Now you have a consistency problem.
NVMeoF (with RDMA or any other transport) is a block level storage protocol and not a file level storage protocol. Thus, there is no guarantee to the atomicity of file operations across nodes in NVMeoF systems. Even if one node deletes a file, there is no guarantee that:
The delete operation was actually translated to block erase operations and sent to the storage server;
Even if the storage server deleted the blocks, there is no guarantee that other clients that have cached this data will not continue to read it. Moreover, another client can overwrite the deleted file.
Overall, I think that to have any guarantees at the file level, you should consider a distribute file systems, rather than NVMeoF.
What is the correct way to clean up the cache so that other nodes can see the deletion without re-mount?
There is no good way to do it. Flushing the cache on all nodes and only then reading may work, but it depends on the file system.

How can Spark take input after it is submitted

I am designing an application, which requires response very fast and need to retrieve and process a large volume of data (>40G) from hadoop file system, given one input (command).
I am thinking, if it is possible to catch such high amount of data in the distributed memory using spark, and let the application running all the time. If I give the application an command, it could start to process data based on the input.
I think catching such big data is not a problem. However, how can I let the application running, and take input?
As far as I know, there is nothing can be done after "spark-submit" command...
You can try spark job server and Named Objects to cache dataset in distributed memory and use it in various input commands.
The requirement is not clear!!!, but based on my understanding,
1) In spark-submit after the application.jar, you can provide application specific command line arguments. But if you want to send commands after the job was started, then you can write a spark streaming job which processes kafka messages.
2) HDFS is already optimised for processing large volume of data. You can cache intermediate reusable data so that they do not get re-computed. But for better performance you might consider using something like elasticsearch/cassandra, so that they can be fetched/stored even faster.

Understanding file handling in hadoop

I am new to Hadoop ecosystem with some basic idea. Please assist on following queries to start with:
If the file size (file that am trying to copy into HDFS) is very big and unable to accommodate with the available commodity hardware in my Hadoop ecosystem system, what can be done? Will the file wait until it gets an empty space or the there is an error?
How to find well in advance or predict the above scenario will occur in a Hadoop production environment where we continue to receive files from outside sources?
How to add a new node to a live HDFS ecosystem? There are many methods but I wanted to know which files I need to alter?
How many blocks does a node have? If I assume that a node is a CPU with storage(HDD-500 MB), RAM(1GB) and a processor(Dual Core). In this scenario is it like 500GB/64? assuming that each block is configured to hold 64 GB RAM
If I copyFromLocal a 1TB file into HDFS, which portion of the file will be placed in which block in which node? How can I know this?
How can I find which record/row of the input file is available in which file of the multiple files split by Hadoop?
What are the purpose of each xmls configured? (core-site.xml,hdfs-site.xml & mapred-site.xml). In a distributed environment, which of these files should be placed in all the slave Data Nodes?
How to know how many map and reduce jobs will run for any read/write activity? Will the write operation always have 0 reducer?
Apologize for asking some of the basic questions. Kindly suggest methods to find answers for all of the above queries.

Confusion about distributed cache in Hadoop

What does the distribute cache actually mean? Having a file in distributed cache means that is it available in every datanode and hence there will be no internode communication for that data, or does it mean that the file is in memory in every node?
If not, by what means can I have a file in memory for the entire job? Can this be done both for map-reduce, as well as for a UDF..
(In particular there is some configuration data, comparatively small that I would like to keep in memory as a UDF applies on hive query...? )
Thanks and regards,
Dhruv Kapur.
DistributedCache is a facility provided by the Map-Reduce framework to cache files needed by applications. Once you cache a file for your job, hadoop framework will make it available on each and every data nodes (in file system, not in memory) where you map/reduce tasks are running. Then you can access the cache file as local file in your Mapper Or Reducer job. Now you can easily read the cache file and populate some collection (e.g Array, Hashmap etc.) in your code.
Refer https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/filecache/DistributedCache.html
Let me know if still you have some questions.
You can read the cache file as local file in your UDF code. After reading the file using JAVA APIs just populate any collection (In memory).
Refere URL http://www.lichun.cc/blog/2013/06/use-a-lookup-hashmap-in-hive-script/
-Ashish

Hadoop Distributed Cache - modify file

I have a file in the distributed cache. The driver class, based on the output of a job, updates this file and starts a new job. The new job need these updates.
The way I currently do it is to replace the old Distributed Cache file with a new one (the updated one).
Is there a way of broadcasting the diffs (between the old file and the new one) to all the tasks trackers which need the file ?
Or is it the case that, after a job (the first one, in my case) is finished, all the directories/files specific to that job are deleted and consequently it doesn't even make sense to think in this direction ?
I think that distributed cache is not build with such scenario in mind. It simply put files locally.
In Your case I would suggest to put file in HDFS and make all interested parties to take it from there
As an optimization you can give this file high replication factor and it will be local to most of the tasks.

Resources