Background: we are trying to read different file types (CSV or Parquet) into PySpark, and I have been tasked with writing a program that determines the file type.
It appears that a Parquet file always shows up in HDFS as a directory.
We have some csv files that are also directories, where the file name is the directory name and the directory contains several part files. What processes do this?
Why are some files plain files and others directories?
It will depend on what process produced those files. For example, when MapReduce produces output, it always produces a directory and then creates one output file per reducer within that directory. This is done so that each reducer can create its output independently.
Judging from Spark's CSV package, it expects to output to a single file. So perhaps the single-file CSVs are being generated by Spark and the directories by MapReduce.
To be as generic as possible, it may be a good idea to do the following: check if the file in question is a directory. If not, check the extension. If yes, look at the extension of the files inside of the directory. This should work for each of your situations.
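The check described above can be sketched as follows. This is only an illustration that uses the local filesystem through Python's os module as a stand-in; on HDFS you would swap in whatever directory-listing calls your client library offers. The names KNOWN_EXTENSIONS, format_from_name and detect_format are hypothetical, and the final fallback to the directory's own extension is an assumption added for directories whose part files carry no extension.

```python
import os

# Hypothetical mapping from file extension to format name; extend as needed.
KNOWN_EXTENSIONS = {".csv": "csv", ".parquet": "parquet"}


def format_from_name(name):
    """Return the format implied by a name's extension, or None if unknown."""
    extension = os.path.splitext(name)[1].lower()
    return KNOWN_EXTENSIONS.get(extension)


def detect_format(path):
    """Guess the format of a path that may be a file or a directory of part files."""
    if not os.path.isdir(path):
        return format_from_name(path)
    for entry in sorted(os.listdir(path)):
        # Skip marker and hidden files such as _SUCCESS or .part-00000.crc.
        if entry.startswith(("_", ".")):
            continue
        detected = format_from_name(entry)
        if detected is not None:
            return detected
    # Assumption: part files without an extension inherit the directory's extension.
    return format_from_name(path.rstrip(os.sep))
```

For example, detect_format("abc.csv") returns "csv" whether abc.csv is a single file or a directory holding extensionless part files.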
Note that some input formats (e.g. MapReduce input formats) will only accept directories as inputs, and some (e.g. Spark's textFile) will only accept files/globs of files. You need to be aware of what is expected from the libraries you are interacting with.
All the data on your hard drive consists of files and folders. The basic difference between the two is that files store data, while folders store files and other folders.
Hadoop execution engines generally create a directory and write multiple part files as output, based on the number of reducers or executors used.
When you name an output file abc.csv, it doesn't mean that it's a single file containing the data. It's just the output location, which MapReduce (generally) interprets as a new directory to create, and within which it writes the output files (part files).
In Spark, a write (for example with .saveAsTextFile) also produces a directory of part files, one per partition, so it may contain only a single part file if the data has a single partition.
rdd.saveAsTextFile("s3n://bucket-name/path") is creating an empty file named after the folder: [folder-name]_$folder$
It seems this empty file is used by the hadoop-aws jar (of org.apache.hadoop) to mimic a Hadoop filesystem on top of S3.
But my application writes thousands of files to S3. Because saveAsTextFile creates a folder (from the given path) to hold the data (from the RDD), my application ends up creating thousands of these empty files: [directory-name]_$folder$.
Is there a way to make rdd.saveAsTextFile not write these empty files?
Stop using s3n and switch to s3a. It's faster and actually supported. That will make this issue go away, along with the atrocious performance problems when reading large Parquet/ORC files.
Also, if your app is creating thousands of small files in S3, you are creating future performance problems: listing and opening files on S3 is slow. Try to combine source data into larger columnar-formatted files and use whatever SELECT mechanism your framework has to read only the bits you want.
We run our Hadoop MapReduce program with the command $ hadoop jar my.jar DriverClass input1.txt hdfsDirectory. How can we make MapReduce process multiple files (input1.txt and input2.txt) in a single run?
Like this:
hadoop jar my.jar DriverClass hdfsInputDir hdfsOutputDir
where
hdfsInputDir is the path on HDFS where your input files are stored (i.e., the parent directory of input1.txt and input2.txt)
hdfsOutputDir is the path on HDFS where the output will be stored (it should not exist before running this command).
Note that your input must be copied to HDFS before running this command.
To copy it to HDFS, you can run:
hadoop fs -copyFromLocal localPath hdfsInputDir
This is the small files problem: a mapper will run for every file.
A small file is one which is significantly smaller than the HDFS block size (64MB by default in Hadoop 1.x, 128MB in Hadoop 2.x and later). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
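The arithmetic behind the 3 gigabyte figure can be reproduced with a short sketch. It assumes the 150-byte rule of thumb quoted above, counts one object per file plus one per block, and ignores directories; the function name namenode_bytes is hypothetical.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure quoted above


def namenode_bytes(num_files, blocks_per_file=1, bytes_per_object=BYTES_PER_OBJECT):
    """Estimate namenode memory for file and block objects (directories ignored)."""
    file_objects = num_files
    block_objects = num_files * blocks_per_file
    return (file_objects + block_objects) * bytes_per_object


# 10 million single-block files: 20 million objects at 150 bytes each.
print(namenode_bytes(10_000_000) / 1e9)  # prints 3.0 (gigabytes)
```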
Solutions
HAR files
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. To a client using the HAR filesystem nothing has changed: all of the original files are visible and accessible (albeit using a har:// URL). However, the number of files in HDFS has been reduced.
Sequence Files
The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).
I have a reducer that needs to output results to different directories so that we can later use the output as input to Hive as a partitioned table. (Hive creates partitions based on folder name). In order to write out to these locations, we are currently not using any Hadoop framework to accomplish this, we are just writing out to separate locations "behind Hadoop's back", so to speak. In other words we are not using hadoop's API to output these files.
We had issues with mapred.reduce.tasks.speculative.execution set to true. I understand this to be the case because multiple task attempts for the same task are writing to the same location.
Is there a way to correctly use Hadoop's API to output to several different folders from the same reducer such that I can also use mapred.reduce.tasks.speculative.execution=true ? (I know about MultipleOutputs, which I'm not sure supports speculative execution.)
If so, is there a way to do that and output to S3?
The way Hadoop typically deals with speculative execution is to create an output folder for each task attempt (in a _temporary subfolder of the actual HDFS output directory).
The OutputCommitter for the OutputFormat then simply moves the contents of the temp task folder to the actual output folder when a task succeeds, and deletes the temp task folders of the attempts that failed or were aborted (this is the default behavior for most FileOutputFormats).
So for your case, if you are writing to a folder outside of the job output folder, then you'll need to extend / implement your own output committer. I'd follow the same principles when creating the files: include the full task id (including the attempt id) to avoid name collisions when speculatively executing. How you track the files created in your job and manage the deletion in the abort / fail scenarios is up to you (maybe some file globbing for the task ids?)
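The write-to-attempt-folder-then-move idea described above can be illustrated with a toy simulation on the local filesystem. This is not Hadoop's OutputCommitter API; the function names are hypothetical and the attempt ids are just strings that you would derive from the real task attempt id.

```python
import os
import shutil


def attempt_dir(output_dir, attempt_id):
    """Private scratch directory for one task attempt."""
    return os.path.join(output_dir, "_temporary", attempt_id)


def write_attempt_file(output_dir, attempt_id, relative_name, data):
    """Write a file under the attempt's scratch directory only."""
    path = os.path.join(attempt_dir(output_dir, attempt_id), relative_name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as handle:
        handle.write(data)


def commit_attempt(output_dir, attempt_id):
    """Move the winning attempt's files into the real output tree."""
    source = attempt_dir(output_dir, attempt_id)
    for root, _dirs, files in os.walk(source):
        relative = os.path.relpath(root, source)
        target_dir = os.path.normpath(os.path.join(output_dir, relative))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            os.replace(os.path.join(root, name), os.path.join(target_dir, name))
    shutil.rmtree(source)


def abort_attempt(output_dir, attempt_id):
    """Discard everything a failed or killed attempt wrote."""
    shutil.rmtree(attempt_dir(output_dir, attempt_id), ignore_errors=True)
```

Two speculative attempts can both write dt=2014-01-01/part-r-00000 without colliding, because each writes under its own attempt folder; only the committed attempt's file reaches the partition folder. Note that on S3 a rename is implemented as a copy followed by a delete, so the move step is neither cheap nor atomic there.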
You might be interested in this : http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
I have a lot of zip files that need to be processed by a C++ library. So I use C++ to write my hadoop streaming program. The program will read a zip file, unzip it, and process the extracted data.
My problem is that:
My mapper can't get the content of exactly one file. It usually gets something like 2.4 files or 3.2 files. Hadoop will send several files to my mapper, but at least one of the files is partial. Zip files can't be processed like this.
Can I get exactly one file per map? I don't want to use a file list as input and read the files from my program, because I want to keep the advantage of data locality.
I can accept the contents of multiple zip files per map if Hadoop doesn't split the zip files. I mean exactly 1, 2, or 3 files, not something like 2.3 files. Actually that would be even better, because my program needs to load about 800MB of data for processing the unzipped data. Can we do this?
You can find the solution here:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_each_of_a_job.27s_maps_to_work_on_one_complete_input-file_and_not_allow_the_framework_to_split-up_the_files.3F
The easiest way I would suggest is to set mapred.min.split.size to a large value so that your files do not get split.
If this does not work then you would need to implement an InputFormat which is not very difficult to do and you can find the steps at: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
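To see why a large minimum split size keeps files whole, here is a rough sketch of the split-size arithmetic. It is modelled on how Hadoop's FileInputFormat is generally understood to size splits (split size = max(min size, min(max size, block size)), with a small slop factor for the last split); treat the exact constants and the function names as assumptions rather than a reference implementation.

```python
SPLIT_SLOP = 1.1  # assumed slop: the last split may be up to 10% oversized
MB = 1024 * 1024


def compute_split_size(block_size, min_size, max_size):
    """Split size is the block size clamped between min_size and max_size."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size):
    """Count how many splits a single splittable file would be cut into."""
    splits = 0
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


# A 200MB file with 64MB blocks and default limits is cut into several pieces.
default_size = compute_split_size(64 * MB, 1, 2**63 - 1)
print(count_splits(200 * MB, default_size))  # prints 4

# Raising the minimum split size above the file size keeps the file in one split.
large_size = compute_split_size(64 * MB, 1024 * MB, 2**63 - 1)
print(count_splits(200 * MB, large_size))  # prints 1
```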
Rather than depending on the min split size, I would suggest an easier way: gzip your files. Gzip is not a splittable format, so Hadoop hands each .gz file to a single mapper.
You can compress files using gzip:
http://www.gzip.org/
If you are on Linux, you can compress the extracted data with
gzip -r /path/to/data
Now pass this data as the input to your Hadoop streaming job.
I have an MR streaming job. My code is in C++. It's a mapper-only job, with no reducer. The input to the job is a directory containing three files. The job creates 3 mappers. Each mapper processes one input file and produces one output file in a different format.
Input files are like:
MyDir/file1
MyDir/file2
MyDir/file3
Output files are like:
MyDir/Output/part-00000
MyDir/Output/part-00001
MyDir/Output/part-00002
I want to correlate input files to output files. For example, input file MyDir/file1 may correspond to output file MyDir/Output/part-00002, i.e. mapper that processed input file MyDir/file1 may have produced output file MyDir/Output/part-00002.
I want to know this relationship, i.e., which input file corresponds to which output file. Is there a simple way to know this?
One way I can think of is to give the job's input and output files the same names. Get the name of the input file the mapper is processing (the map.input.file environment property) and then use it in the MultipleOutputFormat#generateFileNameForKeyValue method.
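For a streaming mapper, reading the input file name might look like the sketch below. Hadoop streaming exports job configuration values as environment variables with dots replaced by underscores, so map.input.file appears as map_input_file; newer releases use mapreduce_map_input_file, and which one is set depends on your Hadoop version. The helper names are hypothetical, and the sketch is in Python rather than C++ purely for illustration.

```python
import os
import posixpath


def current_input_file(environ=os.environ):
    """Return the path of the file this mapper is reading, or None if unset."""
    for key in ("mapreduce_map_input_file", "map_input_file"):
        value = environ.get(key)
        if value:
            return value
    return None


def output_name_for(input_path):
    """Derive an output name from the input path (HDFS paths use '/')."""
    return posixpath.basename(input_path.rstrip("/"))
```

A mapper could log or emit output_name_for(current_input_file()) alongside its task id, which gives you the input-to-output mapping without relying on the part-file numbering.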
With how Hadoop is designed, the only relationship that you can rely on, without you expressly naming the output files as per the other answer, is that the number of output files corresponds to the number of final tasks being run, usually reducers (mappers in your case, since you're not running any reducers).
If Hadoop later decides to run more mappers/reducers instead of just 3 (larger input files, more nodes available), you'll get a different number of output files.