Using multiple local folders as source in hadoop mapreduce job


I have data in multiple local folders, e.g. /usr/bigboss/data1, /usr/bigboss/data2, and many more. I want to use all of these folders as the input source for my MapReduce job and store the result in HDFS. I cannot find a working command that does this with the Hadoop Grep example.

The data needs to reside in HDFS before you can process it with the grep example. You can upload the folders to HDFS using the -put FsShell command:
hadoop fs -mkdir bigboss
hadoop fs -put /usr/bigboss/data* bigboss
This creates a folder in the current user's HDFS home directory and uploads each of the data directories to it.
Now you should be able to run the grep example over the data.
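The upload steps above plus the job launch could be wrapped in a small shell function. This is only a sketch under assumptions: the function name is hypothetical, the location of the examples jar (EXAMPLES_JAR) differs between distributions and versions, and the quoted glob passed as the job input relies on each uploaded data directory containing only plain files.

```shell
# Assumed location of the MapReduce examples jar; the real file name usually
# carries a version number, so set EXAMPLES_JAR to match your installation.
EXAMPLES_JAR="${EXAMPLES_JAR:-${HADOOP_HOME:-.}/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar}"

# Hypothetical helper: upload local folders to HDFS, then run the grep example over them.
# $1 = HDFS input dir, $2 = HDFS output dir, $3 = regex, remaining args = local folders
run_grep_over_local_dirs() {
    hdfs_in=$1
    hdfs_out=$2
    pattern=$3
    shift 3
    hadoop fs -mkdir -p "$hdfs_in" || return 1
    # -put accepts several local sources followed by one HDFS destination.
    hadoop fs -put "$@" "$hdfs_in" || return 1
    # The glob is quoted so that Hadoop, not the local shell, expands it.
    hadoop jar "$EXAMPLES_JAR" grep "$hdfs_in/*" "$hdfs_out" "$pattern"
}
```

For the folders in the question this would be called as run_grep_over_local_dirs bigboss bigboss-out 'some-regex' /usr/bigboss/data*, where the output directory must not exist yet in HDFS.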

Related

Hadoop error when outputting the grep results to a new file in a different directory

I'm trying to read the contents of a few files, use grep to find the lines that match my search query, and then write the results to a folder in another directory. I get an error "No such file or directory exists". I have created the folder structure and the text file.
hadoop fs -cat /Final_Dataset/c*.txt | grep 2015-01-* > /energydata/2015/01/01.txt
ERROR:
-bash: /energydata/2015/01/01.txt: No such file or directory

> /energydata/2015/01/01.txt means that the output is being redirected to a local file. hadoop fs -cat sends its output to your local machine, and at that point you're no longer operating within Hadoop. grep simply acts on a stream of data; it doesn't care (or know) where it came from.
You need to make sure that /energydata/2015/01/ exists locally before you run this command. You can create it with mkdir -p /energydata/2015/01/.
If you're looking to pull certain records from a file on HDFS and then write the new file back to HDFS, I'd suggest not cat-ing the file and instead keeping the processing entirely on the cluster, using something like Spark or Hive to transform the data efficiently. Failing that, just do a hadoop fs -put <local_path> /energydata/2015/01/01.txt.

The following CLI command worked:
hadoop fs -cat /FinalDataset/c*.txt | grep 2015-01-* | hadoop fs -put - /energydata/2015/01/output.txt
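One caveat about the pattern in the commands above: 2015-01-* is unquoted, so the local shell may expand it against file names in the current directory before grep ever sees it, and as a regular expression -* means "zero or more hyphens" rather than "anything". A minimal sketch on a local stream (the sample lines are made up for illustration) suggests that quoting the literal prefix is enough:

```shell
# Made-up sample lines standing in for the output of `hadoop fs -cat`.
sample_lines() {
    printf '%s\n' '2015-01-03,12.5' '2015-02-01,9.1' '2015-01-17,14.0'
}

# Quote the pattern so the shell passes it to grep unchanged;
# -F treats it as a fixed string, so no regex metacharacters are involved.
january=$(sample_lines | grep -F '2015-01-')
printf '%s\n' "$january"
```

On the cluster the same quoted pattern drops into the pipeline: hadoop fs -cat /FinalDataset/c*.txt | grep -F '2015-01-' | hadoop fs -put - /energydata/2015/01/output.txt.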

How files or directories are getting stored in hadoop hdfs

I created a file in HDFS using the command below:
hdfs dfs -touchz /hadoop/dir1/file1.txt
I can see the created file using the command below:
hdfs dfs -ls /hadoop/dir1/
But I could not find the location itself using Linux commands (find or locate). I searched the internet and found the following link:
"How to access files in Hadoop HDFS?". It says HDFS is virtual storage. In that case, how does it decide which partition to use and how much of it, and where is the metadata stored?
Does it use the DataNode location that I specified in hdfs-site.xml as the virtual storage for all the data?
I looked into the DataNode location and there are files there, but I could not find anything related to the file or folder that I created.
(I am using Hadoop 2.6.0.)

HDFS is a distributed storage system in which the storage location is virtual and is created from the disk space of all the DataNodes. While installing Hadoop, you must have specified paths for dfs.namenode.name.dir and dfs.datanode.data.dir. These are the locations where all the HDFS-related files are stored on the individual nodes.
Data stored in HDFS is split into blocks of a specified size (128 MB by default in Hadoop 2.x). When you use hdfs dfs commands you see complete files, but internally HDFS stores them as blocks. If you check the above-mentioned paths on your local file system, you will see a bunch of files that correspond to files on your HDFS. But again, you will not see them as actual files, because they are split into blocks.
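To make this concrete, here is a minimal sketch that lists the block files under a DataNode data directory. It assumes the usual DataNode on-disk naming, where each block is a file called blk_<id> with a companion .meta checksum file; the function name and the example path are hypothetical. Note also that a file created with -touchz is zero-length, so it has no blocks at all and exists only as metadata on the NameNode, which would explain finding nothing for it on the DataNode.

```shell
# List HDFS block files (not their .meta checksum companions) under a DataNode directory.
# $1 = the local path configured as dfs.datanode.data.dir (hypothetical example: /data/hdfs/dn)
list_block_files() {
    find "$1" -type f -name 'blk_*' ! -name '*.meta'
}
```

A call such as list_block_files /data/hdfs/dn prints one path per block. The block files are named by block ID, not by HDFS path; to see which blocks belong to a given file, hdfs fsck /hadoop/dir1/file1.txt -files -blocks -locations asks the NameNode for that mapping.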
Check the output of the command below for details on how much space each DataNode contributes to the virtual HDFS storage.
hdfs dfsadmin -report #Or
sudo -u hdfs hdfs dfsadmin -report
HTH

First create a directory in the local file system (LFS), for example $ mkdir MITHUN94, and enter it with cd MITHUN94.
In that directory create a new file, for example $ nano file1.log.
Now create a directory in HDFS, for example hdfs dfs -mkdir /mike90. Here "mike90" is the directory name.
After creating the directory, send the file from the LFS to HDFS with this command:
$ hdfs dfs -copyFromLocal /home/gopalkrishna/file1.log /mike90
Here '/home/gopalkrishna/file1.log' is the path of the file in the local file system and '/mike90' is the directory in HDFS. Running $ hdfs dfs -ls /mike90 then shows the list of files.

Shell Script to copy directories from hdfs to local

I'm looking for a shell script that copies a directory (with the files under it) from HDFS to the local system.

I think it is pointless to write a whole script when you only need to type one command into the terminal.
With
hadoop fs -ls /myDir/path
you can verify the name and path of the directory you want to copy, and then run
hadoop fs -get /myDir/path
to copy it to the local system. You can also specify a destination directory with
hadoop fs -get /myDir/path /myLocal/destDir
This copies the whole directory (with subdirectories) to your working directory or to the specified directory. You can also get it file by file (dir by dir) with
hadoop fs -get /myDir/path/*
or fetch specific dirs or files in one command with
hadoop fs -get /myDir/path/dir1 /myDir/path/dir2 .
where the trailing . is your current local directory. I tried it on my Hadoop VM and it works fine.
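If a script is still wanted, for example to call from cron, the command above can be wrapped with a couple of checks. This is a minimal sketch; the function name and its error messages are hypothetical, and it only uses the -test and -get FsShell commands.

```shell
# Hypothetical wrapper: copy an HDFS directory (with everything under it) to a local directory.
# $1 = HDFS directory, e.g. /myDir/path    $2 = local destination directory
copy_from_hdfs() {
    src=${1:-}
    dest=${2:-}
    if [ -z "$src" ] || [ -z "$dest" ]; then
        echo "usage: copy_from_hdfs <hdfs_dir> <local_dir>" >&2
        return 2
    fi
    # Fail early if the source does not exist in HDFS.
    hadoop fs -test -e "$src" || { echo "not found in HDFS: $src" >&2; return 1; }
    # Make sure the local destination exists before copying into it.
    mkdir -p "$dest" || return 1
    hadoop fs -get "$src" "$dest"
}
```

Calling copy_from_hdfs /myDir/path /myLocal/destDir then behaves like the plain -get command, but creates the local destination first and reports a missing source explicitly.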

Reading files from hdfs vs local directory

I am a beginner in Hadoop. I have two doubts:
1) How do I access files stored in HDFS? Is it the same as using a FileReader from java.io and giving the local path, or is it something else?
2) I have created a folder into which I copied the file to be stored in HDFS and the jar file of the MapReduce program. When I run the command in any directory
${HADOOP_HOME}/bin/hadoop dfs -ls
it just shows me all the files in the current dir. So does that mean all the files got added without me explicitly adding them?

Yes, it's pretty much the same, although you go through Hadoop's FileSystem API instead of java.io's FileReader. Read this post on reading files from HDFS.
You should keep in mind that HDFS is different from your local file system. With hadoop dfs you access HDFS, not the local file system. So, hadoop dfs -ls /path/in/HDFS shows you the contents of the /path/in/HDFS directory, not the local one. That's why the output is the same no matter where you run it from.
If you want to "upload" / "download" files to/from HDFS you should use the commands:
hadoop dfs -copyFromLocal /local/path /path/in/HDFS and
hadoop dfs -copyToLocal /path/in/HDFS /local/path, respectively.

Is it possible to run hadoop fs -getmerge in S3?

I have an Elastic MapReduce job that writes some files to S3, and I want to concatenate all the files to produce a single text file.
Currently I'm manually copying the folder with all the files to our HDFS (hadoop fs -copyFromLocal), then running hadoop fs -getmerge and hadoop fs -copyToLocal to obtain the file.
Is there any way to use hadoop fs directly on S3?

Actually, this response about getmerge is incorrect. getmerge expects a local destination and will not work with S3. If you try, it throws an IOException and responds with -getmerge: Wrong FS:.
Usage:
hadoop fs [generic options] -getmerge [-nl] <src> <localdst>

An easy way (if you are generating a small file that fits on the master machine) is to do the following:
Merge the file parts into a single file on the local machine:
hadoop fs -getmerge hdfs://[FILE] [LOCAL FILE]
Copy the result file to S3, and then delete the local file:
hadoop fs -moveFromLocal [LOCAL FILE] s3n://bucket/key/of/file
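The two steps above could be combined into one function, roughly as follows. This is a sketch, not a tested EMR recipe: the function name is hypothetical, it assumes the merged file fits on the local disk, and it assumes the S3 scheme used in the destination URI is configured on the node.

```shell
# Hypothetical helper: merge HDFS part files into one local file, then move it to S3.
# $1 = HDFS directory holding the part files
# $2 = full destination URI, e.g. s3n://bucket/key/of/file
merge_to_s3() {
    src=$1
    s3_dest=$2
    tmp_dir=$(mktemp -d) || return 1
    tmp_file="$tmp_dir/merged"
    # getmerge concatenates the files under $src into a single local file.
    hadoop fs -getmerge "$src" "$tmp_file" || { rm -rf "$tmp_dir"; return 1; }
    # moveFromLocal uploads the file and removes the local copy once it is copied.
    hadoop fs -moveFromLocal "$tmp_file" "$s3_dest"
    rc=$?
    rm -rf "$tmp_dir"
    return $rc
}
```

It would be invoked as merge_to_s3 hdfs://[FILE] s3n://bucket/key/of/file, with the placeholders filled in as in the commands above.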

I haven't tried the getmerge command myself, but hadoop fs commands on EMR cluster nodes support S3 paths just like HDFS paths. For example, you can SSH into the master node of your cluster and run:
hadoop fs -ls s3://<my_bucket>/<my_dir>/
The above command will list all the S3 objects under the specified directory path.
I would expect hadoop fs -getmerge to work the same way. So, just use full S3 paths (starting with s3://) instead of HDFS paths.
