MapReduce processing multiple files in the same directory - Hadoop

If I have two files in my input folder, Hadoop MapReduce will process both of these files. Is there a way to specify different processing for these two files? Suppose, for example, that instead of firing a 1 for each word I encounter, I want to fire a 1 if the word was seen in file 1 and a 2 if it was seen in file 2, both files being in the same directory. How would you do that?

You should be able to get the file name as described in this post: How to get the input file name in the mapper in a Hadoop program?
Once you have the file name, you can add a condition that checks which file it is and, based on that, fire a 1 or a 2.
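For example, here is a minimal sketch of such a mapper (assuming the newer mapreduce API and placeholder input names file1 and file2; adjust to your actual file names):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Word-count style mapper that fires 1 or 2 depending on which input
// file the current split came from. "file1" and "file2" are placeholders.
public class FileAwareWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Name of the file this mapper's split belongs to
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        int count = fileName.equals("file1") ? 1 : 2;

        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), new IntWritable(count));
            }
        }
    }
}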

Related

Need to use 1 Processor instead of 5 FetchHDFS in NiFi

I have 5 XML files in HDFS which I am fetching using Apache NiFi. This is the flow: first, I am using a GenerateFlowFile processor and then I have to use 5 different FetchHDFS processors. I can't use GetHDFS because it deletes all the files from the directory and I don't have permission to ingest the files back. Hence, I am searching for a way to avoid using 5 FetchHDFS processors; what else can I do? All the files are in the same directory and I want to keep them there so that I can test multiple times.
I am ingesting those files into a TransformXML processor and converting them to JSON.
Instead of the GetHDFS processor, try the ListHDFS processor: it lists the entire directory and doesn't delete the files. As the ListHDFS description says, "Unlike GetHDFS, this Processor does not delete any data from HDFS."
Thanks everyone for answering. I am unable to upvote anyone's answer, so I am writing up what I did.
First I used the ListHDFS processor, which lists out all the filenames.
Then I used FetchHDFS and, in the HDFS Filename property, I put '${path}/${filename}'.
Change ${path} to the path of your directory and leave ${filename} as is; it is an attribute written by ListHDFS, and that is where the filenames are picked up from.
This way there is no need for loops or anything, and as soon as a new file is uploaded to the directory it will be picked up by the ListHDFS processor.
So you can just leave the whole flow running.

Download multiple files with multiple threads

I am trying to download a few files using 3 threads. My requirement is to download the files on 3 threads so that all files are downloaded 3 times into 3 different folders and don't overwrite each other. I am using __counter to append 1, 2, 3 to the folder names. The problem is that whether I set the thread count to 1, 2, or 3, it behaves the same in all scenarios: it always creates two folders, Folder1 and Folder2; Folder1 gets all the files, while Folder2 only gets the last file, with a size of 0 KB.
Number of threads = 1
Attaching what I have tried so far:
Please try it without the __counter function and with a filename prefix, using two threads. I am guessing this based on the information below.
https://jmeter.apache.org/usermanual/component_reference.html#Save_Responses_to_a_file
Please note that Filename Prefix must not contain Thread related data,
so don't use any Variable (${varName}) or functions like
${__threadNum} in this field
Or try to keep some delay/pacing between two threads.
Hope this helps.
Update:
Just give the folder path and the file name without an extension. It will save the file with the extension. I tried it with an image and it was saved as Myfile1.jpeg.

HDFS "files" that are directories

Background: we are trying to read different file types (CSV or Parquet) into PySpark, and I have the task of writing a program that will determine the file type.
It appears that Parquet files are always directories; a Parquet file appears in HDFS as a directory.
We have some CSV files that are also directories, where the file name is the directory name and the directory contains several part files. What processes produce this?
Why are some files 'files' and some files 'directories'?
It will depend on what process produced those files. For example, when MapReduce produces output, it always produces a directory and then creates one output file per reducer within that directory. This is done so that each reducer can create its output independently.
Judging from Spark's CSV package, it expects to output to a single file. So perhaps the single-file CSVs are being generated by Spark and the directories by MapReduce.
To be as generic as possible, it may be a good idea to do the following: check if the file in question is a directory. If not, check the extension. If yes, look at the extension of the files inside of the directory. This should work for each of your situations.
Note that some input formats (e.g. MapReduce input formats) will only accept directories as inputs, and some (e.g. Spark's textFile) will only accept files/globs of files. You need to be aware of what is expected from the libraries you are interacting with.
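A rough sketch of that check using the Hadoop FileSystem API (the class name and the extension handling here are my own assumptions; adapt them to your naming conventions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: report the extension of a path, looking inside it if it is a
// directory of part files. Part files do not always carry an extension,
// so "unknown" is a possible result.
public class FileTypeProbe {

    public static String detectType(String pathStr) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(pathStr);

        if (!fs.getFileStatus(path).isDirectory()) {
            // Plain file: go by its own extension
            return extensionOf(path.getName());
        }
        // Directory: look at the extension of the files inside it
        for (FileStatus status : fs.listStatus(path)) {
            String name = status.getPath().getName();
            // Skip markers like _SUCCESS and hidden files
            if (name.startsWith("_") || name.startsWith(".")) {
                continue;
            }
            return extensionOf(name);
        }
        return "unknown";
    }

    private static String extensionOf(String name) {
        int dot = name.lastIndexOf('.');
        return dot >= 0 ? name.substring(dot + 1) : "unknown";
    }
}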
All the data on your hard drive consists of files and folders. The basic difference between the two is that files store data, while folders store files and other folders.
Hadoop execution engines generally create a directory and write multiple part files as output, based on the number of reducers or executors used.
When you name an output file abc.csv, it doesn't mean it's a single file containing the data. It's just the output location, which MapReduce (generally) interprets as a new directory to be created, within which it creates the output files (part files).
In the case of Spark, when you write a file (maybe using .saveAsTextFile), it may create only a single file.
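As an illustration of the directory-of-part-files behaviour, here is a hedged sketch using the Spark Dataset API (the paths below are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Illustration only: the paths are made up. Even though the output
// path ends in ".csv", Spark creates it as a directory of part files.
public class CsvOutputExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-output-example")
                .getOrCreate();

        Dataset<Row> df = spark.read().option("header", "true").csv("hdfs:///data/input.csv");

        // Creates a directory named abc.csv with one part file per partition
        df.write().csv("hdfs:///data/abc.csv");

        // coalesce(1) leaves a single part file, but abc_single.csv is still a directory
        df.coalesce(1).write().csv("hdfs:///data/abc_single.csv");

        spark.stop();
    }
}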

Writing MapReduce job to concurrently download files?

Not sure if this is a suitable use case for MapReduce: Part of the OOZIE workflow I'm trying to implement is to download a series of files named with sequential numbers (e.g. 1 through 20). I wanted those files to be downloaded simultaneously (5 files at a time), so I created a python script that creates 5 text files as follows:
1.txt: 1,2,3,4
2.txt: 5,6,7,8
3.txt: 9,10,11,12
4.txt: 13,14,15,16
5.txt: 17,18,19,20
Then for the next step of the workflow, I created a download.sh shell script that consumes a comma-separated list of numbers and downloads the requested files. In the workflow, I set up a streaming action in Oozie, used the directory that contains the files generated above as input (mapred.input.dir), used download.sh as the mapper command, and "cat" as the reducer command. I assumed that Hadoop would spawn a different mapper for each of the input files above.
This seems to work sometimes: it downloads the files correctly, but sometimes it just gets stuck trying to execute and I don't know why. I noticed that this happens when I increase the number of simultaneous downloads (e.g. instead of a few files per txt file, I would do 20 and so forth).
So my question is: is this a correct way to implement parallel retrieval of files using MapReduce and Oozie? If not, how is this normally done with Oozie? I'm trying to get my CSV files into HDFS prior to running the Hive script and I'm not sure what the best way would be to achieve that.
After looking deeper into this, it seems that creating an Oozie "fork" node would be the best approach. So I created a fork node, under which I created 6 shell actions that execute download.sh and take the list of file numbers as an argument. I ended up modifying the python script so that it outputs the file numbers that need to be downloaded to STDOUT (instead of saving them on HDFS). I had Oozie capture that output and then pass it as arguments to the download.sh forks.
The Cloudera Hue interface does not provide a way to create fork nodes (at least not one that I was able to find), so I downloaded the workflow.xml file, added the fork nodes myself, and then re-imported it as a new workflow.

Correlating input files to output files

I have an MR streaming job. My code is in C++. It's a mapper-only job, with no reducer. The input to the job is a directory containing three files. The job creates 3 mappers. Each mapper processes one input file and produces one output file in a different format.
Input files are like:
MyDir/file1
MyDir/file2
MyDir/file3
Output files are like:
MyDir/Output/part-00000
MyDir/Output/part-00001
MyDir/Output/part-00002
I want to correlate input files to output files. For example, input file MyDir/file1 may correspond to output file MyDir/Output/part-00002, i.e. mapper that processed input file MyDir/file1 may have produced output file MyDir/Output/part-00002.
I want to know this relationship, i.e., which input file corresponds to which output file. Is there a simple way to know this?
One way I can think of is to make the input and output file names of the job the same. Get the input file name (the map.input.file configuration property) that the mapper is processing and then use it in the MultipleOutputFormat#generateFileNameForKeyValue method.
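A rough sketch of that idea with the old mapred API (this assumes the mapper emits the input path, taken from map.input.file, as the key of each record; for a streaming job the class would have to be packaged in a jar and passed via -outputformat):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Sketch only: writes each record to an output file named after the
// input file it came from, assuming the mapper put the input path in the key.
public class InputNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // key holds the full input path; keep only the last component,
        // so MyDir/file1 produces an output file called file1
        return new Path(key.toString()).getName();
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        // Drop the file-name key from the records that are actually written
        return null;
    }
}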
With how Hadoop is designed, the only relationship that you can rely on, without you expressly naming the output files as per the other answer, is that the number of output files corresponds to the number of final tasks being run, usually reducers (mappers in your case, since you're not running any reducers).
If Hadoop later decides to run more mappers/reducers instead of just 3 (larger input files, more nodes available), you'll get a different number of output files.
