Hadoop per-file block size

In the Hadoop book it is said that we can specify a per-file block size at the time of creation of the file.
"The most natural way to increase the split size is to have larger blocks in HDFS, by setting dfs.block.size, or on a per-file basis at file construction time."
Any idea how to do this at file construction time? I hope that by setting this to value = file-size, the file will not be split.

You can use the CLI:
hadoop fs -D dfs.block.size=file-size -put local_name remote_location
(In Hadoop 2.x the property is named dfs.blocksize; the value is in bytes and must be a multiple of the checksum chunk size, 512 bytes by default.)
Or you can use the Java API to specify the block size when you create the file:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// create(path, overwrite, bufferSize, replication, blockSize) sets the block size for this file only
FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, fileSize);
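To see why a block size equal to the file size prevents splitting, here is FileInputFormat's default split sizing, max(minSize, min(maxSize, blockSize)), sketched in plain Java. The byte values are illustrative, not from the original question:

```java
public class SplitSize {
    // FileInputFormat's default split sizing: max(minSize, min(maxSize, blockSize)).
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // The file is carved into ceil(fileLen / splitSize) splits.
    static long numSplits(long fileLen, long splitSize) {
        return (fileLen + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long fileLen = 1_000_000_000L;          // a ~1 GB input file
        long defaultBlock = 128L * 1024 * 1024; // 128 MB blocks
        // With the default block size the file yields 8 splits ...
        System.out.println(numSplits(fileLen, splitSize(defaultBlock, 1L, Long.MAX_VALUE)));
        // ... but with blockSize = fileLen it yields exactly one split.
        System.out.println(numSplits(fileLen, splitSize(fileLen, 1L, Long.MAX_VALUE)));
    }
}
```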

Related

Having multiple reduce tasks assemble a single HDFS file as output

Is there any low-level API in Hadoop that allows multiple reduce tasks running on different machines to assemble a single HDFS file as the output of their computation?
Something like: a stub HDFS file is created at the beginning of the job, then each reducer creates, as output, a variable number of data blocks and assigns them to this file according to a certain order.
The answer is no, that would be an unnecessary complication for a rare use case.
What you should do
option 1 - add some code at the end of your Hadoop driver
int result = job.waitForCompletion(true) ? 0 : 1;
if (result == 0) { // status code OK
// ls job output directory, collect part-r-XXXXX file names
// create HDFS readers for files
// merge them in a single file in whatever way you want
}
All of the required methods are present in the Hadoop FileSystem API.
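The listing-and-merging loop from option 1 can be sketched against the local filesystem with java.nio; the HDFS FileSystem calls (listStatus, open, create) follow the same pattern. The part-r- prefix matches the default MapReduce output naming; everything else here is illustrative:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MergeParts {
    // Merge all part-r-* files in a job output directory into one file,
    // in lexicographic order (part-r-00000, part-r-00001, ...).
    static void merge(Path jobOutputDir, Path merged) throws IOException {
        try (Stream<Path> entries = Files.list(jobOutputDir);
             OutputStream out = Files.newOutputStream(merged)) {
            List<Path> parts = entries
                .filter(p -> p.getFileName().toString().startsWith("part-r-"))
                .sorted()
                .collect(Collectors.toList());
            for (Path part : parts) {
                Files.copy(part, out); // append each part's bytes to the output
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("job-output");
        Files.writeString(dir.resolve("part-r-00000"), "a\tb\n");
        Files.writeString(dir.resolve("part-r-00001"), "c\td\n");
        Files.writeString(dir.resolve("_SUCCESS"), ""); // skipped by the filter
        Path merged = dir.resolve("merged.txt");
        merge(dir, merged);
        System.out.print(Files.readString(merged));
    }
}
```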
option 2 - add job to merge files
You can create a generic Hadoop job that accepts a directory name as input and passes everything as-is to a single reducer, which merges the results into one output file. Call this job in a pipeline with your main job.
This would work faster for big inputs.
If you want the merged output file locally, you can use the getmerge command to combine multiple reduce-task files into one single local output file; below is the command for the same.
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

How to insert a header file as the first line of a data file in HDFS without using getmerge (performance issue while copying to local)?

I am trying to insert header.txt as the first line of data.txt without using getmerge. getmerge copies to the local filesystem and merges into a third file, but I want the result in HDFS only.
Header.txt
Head1,Head2,Head3
Data.txt
100,John,28
101,Gill,25
102,James,29
I want output in Data.txt file only like below :
Data.txt
Head1,Head2,Head3
100,John,28
101,Gill,25
102,James,29
Please suggest whether we can implement this in HDFS only.
HDFS supports a concat (short for concatenate) operation, in which existing files are merged into one without any data movement. It will do exactly what you are looking for. Judging by the File System Shell guide, it is not currently supported from the command line, so you will need to implement it in Java:
FileSystem fs = ...
Path data = new Path("Data.txt");
Path header = new Path("Header.txt");
Path dataWithHeader = new Path("DataWithHeader.txt");
fs.concat(dataWithHeader, new Path[] { header, data });
Note that fs.concat takes its sources as a Path array and requires the target file to already exist; older HDFS releases also required every source except the last to end on a full block.
After this, Data.txt and Header.txt both cease to exist, replaced by DataWithHeader.txt.
Thanks for your reply.
I found another way, like:
hadoop fs -cat hdfs_path/header.txt hdfs_path/data.txt | hadoop fs -put - hdfs_path/Merged.txt
This has the drawback that cat reads the complete data, which impacts performance.

How to merge CSV files in Hadoop?

I am new to the Hadoop framework and I would like to merge 4 CSV files into a single file.
All 4 CSV files have the same headers, and the column order is also the same.
I don't think Pig STORE offers such a feature.
You could use Spark's coalesce(1) function; however, there is little reason to do this, as almost all Hadoop processing tools prefer to read directories, not files.
You should ideally not be storing raw CSV in Hadoop for very long anyway; rather, convert it to a columnar format such as ORC or Parquet. Especially if you are already reading CSV to begin with, do not output CSV again.
If the idea is to produce one CSV to download later, then I would suggest using Hive + Beeline to do that.
This will store the result in a file on the local file system.
beeline -u 'jdbc:hive2://[databaseaddress]' --outputformat=csv2 -f yourSQlFile.sql > theFileWhereToStoreTheData.csv
Try the getmerge utility to merge the CSV files.
For example, say you have EMP_FILE1.csv, EMP_FILE2.csv, and EMP_FILE3.csv placed at some location on HDFS. You can merge all these files into one; note that getmerge writes the merged file to the local filesystem, not to a new HDFS location.
hadoop fs -getmerge /hdfsfilelocation/EMP_FILE* /localfilelocation/MERGED_EMP_FILE.csv

Hadoop job just ends

I'm having a rather strange problem with Hadoop.
I wrote an MR job that ends just like that, without executing the map or reduce code. It produces the output folder, but that folder is empty. I see no reason for such behavior.
I'm even trying this with the default Mapper and Reducer, just to find the problem, but I get no exception and no error; the job just finishes and produces an empty folder. Here's the simplest driver:
Configuration conf = new Configuration();
//DistributedCache.addCacheFile(new URI(firstPivotsInput), conf);
Job pivotSelection = new Job(conf);
pivotSelection.setJarByClass(Driver.class);
pivotSelection.setJobName("Silhoutte");
pivotSelection.setMapperClass(Mapper.class);
pivotSelection.setReducerClass(Reducer.class);
pivotSelection.setMapOutputKeyClass(IntWritable.class);
pivotSelection.setMapOutputValueClass(Text.class);
pivotSelection.setOutputKeyClass(IntWritable.class);
pivotSelection.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(pivotSelection, new Path("/home/pera/WORK/DECOMPRESSION_RESULT.csv"));
FileOutputFormat.setOutputPath(pivotSelection, new Path("/home/pera/WORK/output"));
pivotSelection.setNumReduceTasks(1);
pivotSelection.waitForCompletion(true);
What could be the problem in such a simple example?
The simplest explanation is that the input path ("/home/pera/WORK/DECOMPRESSION_RESULT.csv") does not contain anything on HDFS. You can verify that via the value of the MAP_INPUT_RECORDS counter. You can also check the size of this file on HDFS with hdfs dfs -ls /home/pera/WORK, or even see its first few lines with hdfs dfs -cat /home/pera/WORK/DECOMPRESSION_RESULT.csv | head (use -text instead of -cat if it is compressed).
Another possibility is that the reducer has a special (if) condition that fails for every mapper's output, but that should not happen with the identity mapper and reducer, so I believe the cause is the former one.

passing URI as a runtime variable to distributed cache in mapreduce hadoop

I am using the distributed cache in my MapReduce program, and I pass three arguments to it: the input file, the output dir, and a config file.
I want to add the third argument, i.e. the config file, to the distributed cache.
I am setting the parameter as follows in the run() method of the MapReduce driver:
conf.set("CONF_XML", args[2]);
How do I add this file to the distributed cache in the same method?
Usually we add it using a URI (new URI(file_path)):
DistributedCache.addCacheFile(new URI(file_path), conf); // << how do I pass the argument here?
Pass the file path argument to the DistributedCache API as a URI:
DistributedCache.addCacheFile(new Path(args[2]).toUri(), job.getConfiguration());
Note that in Hadoop 2.x the DistributedCache class is deprecated; the equivalent call is job.addCacheFile(new Path(args[2]).toUri()).
