So while writing to HBase from a MapReduce job that uses TableOutputFormat, how often does it actually write to HBase? I don't imagine it doing a put command for every row.
How do we control AutoFlush and the Write Ahead Log (WAL) when using TableOutputFormat in MapReduce?
TableOutputFormat disables AutoFlush and uses the write buffer specified by hbase.client.write.buffer (default 2 MB); once the buffer is full it is automatically flushed to HBase. You can change the buffer size by adding the property to your job configuration:
config.setLong("hbase.client.write.buffer", 16777216); // 16 MB buffer
The WAL is enabled by default. It can be disabled per Put, but doing so is generally discouraged:
myPut.setWriteToWAL(false);
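In newer HBase client versions (0.95 and later) the same per-Put switch is expressed through the Durability enum; a minimal equivalent, assuming the newer API:
myPut.setDurability(Durability.SKIP_WAL);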
It does in fact issue a Put for every row; see the code. If you want to bypass the WAL entirely, write HFiles directly with HFileOutputFormat; see the example on GitHub.
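A minimal sketch of such a bulk-load job, assuming the older HTable/HFileOutputFormat API; the mapper, table name, and paths are hypothetical:
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "hfile-bulk-load");
job.setMapperClass(MyMapper.class); // hypothetical mapper emitting (ImmutableBytesWritable, Put) pairs
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
HTable table = new HTable(conf, "my_table");
HFileOutputFormat.configureIncrementalLoad(job, table); // wires reducer, partitioner and output format to the table's regions
FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/hfiles"));
job.waitForCompletion(true);
// then load the generated HFiles into the table, e.g.:
// hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /hfiles my_table
Because the job writes HFiles that are moved into the table afterwards, nothing goes through the MemStore or the WAL.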
Related
I am using Hadoop 1.0.3. Can the input split/block size be changed (increased/decreased) at runtime based on some constraint? Is there a class to override to accomplish this, like FileSplit/TextInputFormat? Can we have variable-size blocks in HDFS depending on a logical constraint within one job?
You're not limited to TextInputFormat; that's entirely configurable based on the data source you are reading. Most examples use line-delimited plaintext, but that obviously doesn't work for, say, XML.
No, block boundaries can't change at runtime, since your data should already be on disk and ready to read.
But the InputSplit depends on the InputFormat for the given job, which should remain consistent throughout a particular job; the Configuration object, on the other hand, is basically a hashmap, and that can certainly be changed while the job is running.
If you want to change the block size only for a particular run or application, you can do so by passing "-D dfs.block.size=134217728" on the command line. That changes the block size for your application only, instead of changing the overall block size in hdfs-site.xml:
-D dfs.block.size=134217728
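For example, assuming your driver goes through ToolRunner so generic options are parsed (jar name, class, and paths are hypothetical):
hadoop jar myjob.jar com.example.MyDriver -D dfs.block.size=134217728 /input /output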
Right now we are using a MapReduce job to convert data and store the result in the Parquet format.
A summary file (_metadata) is generated as well, but the problem is that it is too big (over 5 GB). Is there any way to reduce its size?
Credits to Alex Levenson and Ryan Blue:
Alex Levenson:
You can push the reading of the summary file to the mappers instead of reading it on the submitter node:
ParquetInputFormat.setTaskSideMetaData(conf, true);
(Ryan Blue: This is the default from 1.6.0 forward)
or setting "parquet.task.side.metadata" to true in your configuration. We
had a similar issue, by default the client reads the summary file on the
submitter node which takes a lot of time and memory. This flag fixes the
issue for us by instead reading each individual file's metadata from the
file footer in the mappers (each mapper reads only the metadata it needs).
Another option, which is something we've been talking about in the past, is to disable creating this metadata file at all, as we've seen creating it can be expensive too, and if you use the task side metadata approach, it's never used.
(Ryan Blue: There's an option to suppress the files, which I recommend. Now that file metadata is handled on the tasks, there's not much need for the summary files.)
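A minimal sketch of suppressing the summary files, assuming the parquet.enable.summary-metadata flag exposed by parquet-hadoop:
conf.setBoolean("parquet.enable.summary-metadata", false); // do not write _metadata summary files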
I have a file that gets aggregated and written into HDFS. This file stays open for an hour before it is closed. Is it possible to process this file with the MapReduce framework while it is open? I tried it, but it's not picking up all of the appended data. I can query the data in HDFS and it is available, but not when the read is done by MapReduce. Is there any way I could force MapReduce to read an open file? Perhaps by customizing the FileInputFormat class?
You can only read what has been physically flushed. Since close() makes the final flush of the data, your reads may miss some of the most recent data regardless of how you access it (MapReduce or command line).
As a solution I would recommend periodically closing the current file and then opening a new one (with an incremented index suffix). You can run your MapReduce job over multiple files. You would still end up with some data missing from the most recent file, but at least you can control how much by the frequency of your file "rotation".
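A minimal sketch of such a rotating writer, assuming the HDFS FileSystem API; the paths, roll interval, and moreRecords()/nextRecord() helpers are hypothetical:
FileSystem fs = FileSystem.get(new Configuration());
long rollIntervalMs = 10 * 60 * 1000L; // roll every 10 minutes
int index = 0;
FSDataOutputStream out = fs.create(new Path("/data/events-" + index));
long lastRoll = System.currentTimeMillis();
while (moreRecords()) { // hypothetical: true while the aggregator still has data
    out.write(nextRecord()); // hypothetical: next record as bytes
    if (System.currentTimeMillis() - lastRoll > rollIntervalMs) {
        out.close(); // closed files are fully visible to MapReduce
        index++;
        out = fs.create(new Path("/data/events-" + index));
        lastRoll = System.currentTimeMillis();
    }
}
out.close();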
What does the distributed cache actually mean? Does having a file in the distributed cache mean that it is available on every datanode, so there is no inter-node communication for that data, or does it mean that the file is in memory on every node?
If not, by what means can I have a file in memory for the entire job? Can this be done both for MapReduce and for a UDF?
(In particular, there is some configuration data, comparatively small, that I would like to keep in memory while a UDF is applied in a Hive query.)
Thanks and regards,
Dhruv Kapur.
DistributedCache is a facility provided by the MapReduce framework to cache files needed by applications. Once you cache a file for your job, the Hadoop framework will make it available on each of the nodes (on the local file system, not in memory) where your map/reduce tasks are running. You can then access the cache file as a local file in your Mapper or Reducer and read it to populate some collection (e.g. an array or HashMap) in your code.
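For example, here is a minimal sketch, assuming the Hadoop 2.x Job API and a hypothetical lookup file; the #lookup fragment makes the file show up under that name in the task's working directory:
// Driver (inside a run/main method that declares throws Exception):
job.addCacheFile(new URI("hdfs:///config/lookup.txt#lookup"));
// Mapper: read the symlinked local file once and keep it in memory
public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> lookup = new HashMap<String, String>();
    @Override
    protected void setup(Context context) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader("lookup"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
        reader.close();
    }
}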
Refer https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/filecache/DistributedCache.html
Let me know if you still have some questions.
You can read the cache file as a local file in your UDF code. After reading the file using the Java APIs, just populate a collection (in memory).
Refer to http://www.lichun.cc/blog/2013/06/use-a-lookup-hashmap-in-hive-script/
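A minimal sketch of such a UDF, assuming the lookup file is shipped with ADD FILE so it is available as lookup.txt in the task's working directory (class and file names are hypothetical):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.UDF;
public class LookupUDF extends UDF {
    private static Map<String, String> lookup; // loaded once per JVM and kept in memory
    public String evaluate(String key) throws IOException {
        if (lookup == null) {
            lookup = new HashMap<String, String>();
            BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
            reader.close();
        }
        return lookup.get(key);
    }
}
In Hive that would look something like: add file /path/to/lookup.txt; create temporary function my_lookup as 'LookupUDF'; (names hypothetical).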
-Ashish
In HBase, data is written into the WAL before it is written into the MemStore. But when I checked on my system,
the WAL files are not getting updated immediately after each Put operation; it takes a long time for them to update. Is there any parameter I need to set?
(The WAL is enabled.)
Do you know how long it takes to update the WAL files? Are you sure the time is actually spent in the write, or is it that by the time you check the WAL it has already been moved to the old logs? If the WAL is enabled, all entries must go to the WAL first and are then written to the particular region as the cluster is configured.
I know that WAL files are moved to .oldlogs fairly quickly, i.e. after 60 seconds, as defined in hbase-site.xml through the hbase.master.logcleaner.ttl setting.
In standalone mode writing into the WAL is taking a lot of time, whereas in pseudo-distributed mode it's working fine.