I am trying to copy some files from Hadoop HDFS to my local filesystem. I used the following command:
hadoop fs -copyToLocal <hdfs path> <local path>
The size of the file is just 80 MB. I ran a job before where I had no issue copying 70 MB files to local. This time, however, I am getting an Input/output error:
copyToLocal: Input/output error
Can anyone tell me what could have gone wrong?
It might be a space constraint on your machine. I had the same issue because the file was too big to be moved to my local machine. Once I freed up space, I was able to perform the copyToLocal operation.
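If you want to confirm that disk space is the culprit, one quick check (the paths below are just placeholders) is to compare the file's size in HDFS with the free space at the local destination:
hdfs dfs -du -h /path/to/file/on/hdfs    # size of the file in HDFS
df -h /local/destination/directory       # free space on the local filesystem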
I'm trying to copy a file that is hosted both within Hive and on HDFS onto my local computer, but I can't seem to figure out the right call/terminology to use to refer to my local machine. All online explanations describe it solely as "path to local". I'm trying to copy this into a folder at C/Users/PC/Desktop/data and am using the following attempts:
In HDFS
hdfs dfs -copyToLocal /user/w205/staging /C/Users/PC/Desktop/data
In Hive
INSERT OVERWRITE LOCAL DIRECTORY 'c/users/pc/desktop/data/' SELECT * FROM lacountyvoters;
How should I be referring to the local directory in this case?
I have installed both Hadoop and Spark locally on a Windows machine.
I can access HDFS files in Hadoop; e.g.,
hdfs dfs -tail hdfs:/out/part-r-00000
works as expected. However, if I try to access the same file from the Spark shell, e.g.,
val f = sc.textFile("hdfs:/out/part-r-00000")
I get an error that the file does not exist. Spark can access files in the Windows file system using the file:/... syntax, though.
I have set the HADOOP_HOME environment variable to c:\hadoop, which is the folder containing the Hadoop install (in particular winutils.exe, which seems to be necessary for Spark, is in c:\hadoop\bin).
Because it seems that HDFS data is stored in the c:\tmp folder, I was wondering whether there would be a way to let Spark know about this location.
Any help would be greatly appreciated. Thank you.
If you are getting "file doesn't exist", that means your Spark application (code snippet) is able to connect to HDFS.
The HDFS file path that you are using seems wrong.
This should solve your issue:
val f = sc.textFile("hdfs://localhost:8020/out/part-r-00000")
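If you are unsure of the namenode host and port (8020 above is just a common default; yours may differ), you can ask the Hadoop client what fs.defaultFS is set to and use that prefix in sc.textFile:
hdfs getconf -confKey fs.defaultFS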
I want to copy/upload some files from a local system (a system not in the Hadoop cluster) onto Hadoop HDFS. The local system can be a Windows system too.
I tried the Flume spooling directory source. It works fine with text files, but for other document types the MIME type gets corrupted.
Please let me know about different approaches to load files into HDFS.
hadoop fs -copyFromLocal <localsrc> URI
Check Hadoop documentation: copyFromLocal
Keep in mind that Apache Flume wasn't created for simply copying files around.
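For illustration, a concrete invocation of the command above might look like this (the local and HDFS paths are placeholders):
hadoop fs -copyFromLocal /home/user/report.pdf /user/hadoop/docs/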
You can also use hadoop fs -put <localsrcpath> <hdfspath>
This is an alternative to copyFromLocal.
In hadoop 2.0 (YARN) you can do as follows to transfer local files to HDFS:
hdfs dfs -put "localsrcpath" "hdfspath"
where hdfs is the command located in the bin directory.
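As a quick sketch (paths are placeholders), you can follow the put with a listing to confirm the file arrived:
hdfs dfs -put /local/data/file.txt /user/hadoop/input/
hdfs dfs -ls /user/hadoop/input/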
Java code can do this easily; you don't need any extra tools for it. Below is the piece of code that worked for me:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*; // FileSystem, Path
Configuration conf = new Configuration();
try {
    conf.set("fs.defaultFS", <<namenode>>); // something like hdfs://server:9000, or copy from core-site.xml
    FileSystem fileSystem = FileSystem.get(conf);
    System.out.println("Uploading, please wait...");
    // args[0] = local source, e.g. C://file or a directory; args[1] = HDFS target, e.g. /imported
    fileSystem.copyFromLocalFile(false, new Path(args[0]), new Path(args[1].trim()));
} catch (java.io.IOException e) {
    e.printStackTrace();
}
Build a jar out of this and run it on any OS. Keep in mind that you don't need to have Hadoop running on the machine where you are going to run this code. If you need any help, add comments.
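For example, assuming you packaged the snippet into a main class (the jar and class names below are made up, and the Hadoop client jars are assumed to be on the classpath), you could run it with plain Java; on Windows use ; instead of : as the classpath separator:
java -cp uploader.jar:hadoop-client-libs/* HdfsUploader /local/path/file.txt /imported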
Don't forget to add the DNS resolver entries on the machine where you run this code. On Windows, open C:\Windows\System32\drivers\etc\hosts and add:
ip-address hadoopnamenode
ip-address slavenode
First you need to transfer the files from your Windows machine to a Linux machine using FileZilla or another tool.
Then you need to use:
hadoop fs -put localsrcpath hdfspath
The following command will also work:
hadoop fs -copyFromLocal localsrcpath hdfspath
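As a rough sketch of the two steps (host names and paths below are placeholders):
# from the Windows machine (needs an scp client such as pscp or OpenSSH):
scp C:\docs\report.pdf user@linuxhost:/tmp/report.pdf
# then on the Linux machine:
hadoop fs -put /tmp/report.pdf /user/hadoop/docs/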
I don't know what's going on here, but I am trying to copy a simple file from a directory in my local filesystem to the directory specified for HDFS.
In my hdfs-site.xml I have specified that the directory for HDFS will be /home/vaibhav/Hadoop/dataNodeHadoopData using the following properties:
<name>dfs.data.dir</name>
<value>/home/vaibhav/Hadoop/dataNodeHadoopData/</value>
and
<name>dfs.name.dir</name>
<value>/home/vaibhav/Hadoop/dataNodeHadoopData/</value>
I am using the following command -
bin/hadoop dfs -copyFromLocal /home/vaibhav/ml-100k/u.data /home/vaibhav/Hadoop/dataNodeHadoopData
to copy the file u.data from its local filesystem location to the directory that I specified as the HDFS directory. But when I do this, nothing happens: no error, nothing. And no file gets copied to HDFS. Am I doing something wrong? Could there be a permissions issue?
Suggestions needed.
I am using pseudo distributed single node mode.
Also, on a related note: in my MapReduce program I have set the configuration to point the inputFilePath to /home/vaibhav/ml-100k/u.data. Would it not automatically copy the file from the given location to HDFS?
I believe dfs.data.dir and dfs.name.dir have to point to two different, existing directories. Furthermore, make sure you have formatted the namenode filesystem after changing the directories in the configuration.
While copying to HDFS you're incorrectly specifying the target. The correct syntax for copying a local file to HDFS is:
bin/hadoop dfs -copyFromLocal <local_FS_filename> <target_on_HDFS>
Example:
bin/hadoop dfs -copyFromLocal /home/vaibhav/ml-100k/u.data my.data
This would create a file my.data in your user's home directory in HDFS.
Before copying files to HDFS, make sure you first master listing directory contents and creating directories, as in the commands below.
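For example, sticking with the bin/hadoop dfs style used above (the target directory names are just placeholders):
bin/hadoop dfs -ls /                          # list the HDFS root
bin/hadoop dfs -mkdir /user/vaibhav/input     # create a target directory
bin/hadoop dfs -copyFromLocal /home/vaibhav/ml-100k/u.data /user/vaibhav/input/
bin/hadoop dfs -ls /user/vaibhav/input        # verify the copy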
I gave the mapper an input from the local filesystem. It runs successfully from Eclipse, but not on the cluster, as it is unable to find the local input path, saying: input path does not exist. Can anybody please help me figure out how to give a local file path to a mapper so that it can run on the cluster and I can get the output in HDFS?
This is a very old question, but I recently faced the same issue.
I am not sure how correct this solution is, but it worked for me; please point out any drawbacks. Here's what I did.
Reading a solution from the mail archives, I realised that if I modify fs.default.name from hdfs://localhost:8020/ to file:///, the job can access the local file system. However, I didn't want this for all my MapReduce jobs, so I made a copy of core-site.xml in a local system folder (the same one from which I submit my MR jar via hadoop jar).
In my MR driver class I added:
Configuration conf = new Configuration();
conf.addResource(new Path("/my/local/system/path/to/core-site.xml"));
conf.addResource(new Path("/usr/lib/hadoop-0.20-mapreduce/conf/hdfs-site.xml"));
The MR job then takes its input from the local system and writes its output to HDFS.
Running in a cluster requires the data to be loaded into distributed storage (HDFS). Copy the data to HDFS first using hadoop fs -copyFromLocal, and then run your job again, giving it the path of the data in HDFS.
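For illustration (the jar name, driver class, and directory names below are placeholders), the two steps might look like:
hadoop fs -copyFromLocal /local/path/to/data /user/hadoop/input
hadoop jar my-job.jar MyJobDriver /user/hadoop/input /user/hadoop/output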
The question is an interesting one. One can have data on S3 and access this data without an explicit copy to HDFS prior to running the job. In the wordcount example, one would specify this as follows:
hadoop jar example.jar wordcount s3n://bucket/input s3n://bucket/output
What happens here is that the mappers read records directly from S3.
If this can be done with S3, why couldn't Hadoop do the same with the local filesystem, using this syntax instead of s3n:
file:///input file:///output
?
But empirically, this seems to fail in an interesting way: I see that Hadoop gives a file-not-found exception for a file that is indeed in the input directory. That is, it seems to be able to list the files in the input directory on my local disk, but when it comes time to open them to read the records, the file is not found (or accessible).
The data must be on HDFS for any MapReduce job to process it. So even if you have a source such as a local file system, a network path, or a web-based store (such as Azure Blob Storage or Amazon block storage), you would need to copy the data to HDFS first and then run the job.
The bottom line is that you need to push the data to HDFS first. There are several ways to do this depending on the data source; for example, from the local file system you would use the following command:
$ hadoop fs -copyFromLocal SourceFileOrStoragePath _HDFS__Or_directPathatHDFS_
Try setting the input path like this:
FileInputFormat.addInputPath(conf, new Path("file:///<the directory on your local filesystem>"));
If you give the file:// scheme, it can access files from the local system.
I have tried the following code and it solved the problem for me; please try it and let me know.
You need to get a FileSystem object for the local file system and then use the makeQualified method to return the path. Since we need to pass a path on the local filesystem (there is no other way to pass this to the InputFormat), I used makeQualified, which indeed returns only a local file system path.
The code is shown below:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path inputPath = fs.makeQualified(new Path("/usr/local/srini/")); // local path
FileInputFormat.setInputPaths(job, inputPath);
I hope this works for your requirement, though it's posted very late. It worked fine for me, and I believe it does not need any configuration changes.
You might want to try this by setting the configuration as follows:
Configuration conf=new Configuration();
conf.set("job.mapreduce.tracker","local");
conf.set("fs.default.name","file:///");
After this you can set the FileInputFormat with the local path and you are good to go.