Load a folder from LocalSystem to HDFS - hadoop

I have a folder in my LocalSystem. It contains 1000 files, and I would move or copy him from my LocalSystem to HDFS
I tried by these two commands:
hadoop fs copyFromLocal C:/Users/user/Downloads/ProjectSpark/ling-spam /tmp
And I also tried this command:
hdfs dfs -put /C:/Users/user/Downloads/ProjectSpark/ling-spam
/tmp/ling-spam
It displays an error message which says that my directory not found and yet I'm sure that correct.
I found a function getmerge() to move a folder from HDFS to LocalSystem, but I did not find the inverse.
Please, can you help me?

my VirtualBox on Windows, and i work by HDP2.3.2 with the console secure shell
You can't copy files from your Windows machine to HDFS. You have to first SCP the files into the VM (I recommend WinSCP or Filezilla) and only then can you use hadoop fs to put files onto HDFS.
The error was correct in that C:/Users/user/Downloads does not exist on the HDP sandbox because it's a Linux machine.
As noted, you can also try and use the Ambari HDFS file viewer, but I still standby by note that SCP is the official way because not all Hadoop systems have Ambari (or at least the HDFS file view for Ambari)

I would take the Mutual Information for classification of the word spam or ham. I have this operation: MI(Word)= ∑ Probabi(Occ,Class) * Log2 * (Probabi(Occuren,Class)/Probabi(Occurren) * Probabi(Class)).
I understand the function, I must compute 4 operation (true,ham), (false,ham), (true,spam) and (false,spam).
I do not understand who i do write exactly, in fact, I computed the number of the file in which in occur.
But I do not who exactly I must write in my function.
Thank you very much!
This isthe corps of my function:
def computeMutualInformationFactor(
probaWC:RDD[(String, Double)],// probability of occurrence of the word in a given class.
probaW:RDD[(String, Double)],// probability of occurrence of the word in whether class
probaC: Double, //probability an email appears in class (spam or ham)
probaDefault: Double // default value when a probability is missing
):RDD[(String, Double)] = {

Related

Input / Output error when using HDFS NFS Gateway

Getting "Input / output error" when trying work with files in mounted HDFS NFS Gateway. This is despite having set dfs.namenode.accesstime.precision=3600000 in Ambari. For example, doing something like...
$ hdfs dfs -cat /hdfs/path/to/some/tsv/file | sed -e "s/$NULL_WITH_TAB/$TAB/g" | hadoop fs -put -f - /hdfs/path/to/some/tsv/file
$ echo -e "Lines containing null (expect zero): $(grep -c "\tnull\t" /nfs/hdfs/path/to/some/tsv/file)"
when trying to remove nulls from a tsv then inspect for nulls in that tsv based on the NFS location throws the error, but I am seeing it in many other places (again, already have dfs.namenode.accesstime.precision=3600000). Anyone have any ideas why this may be happening or debugging suggestions? Can anyone explain what exactly "access time" is in this context?
From discussion on the apache hadoop mailing list:
I think access time refers to the POSIX atime attribute for files, the “time of last access” as described here for instance (https://www.unixtutorial.org/atime-ctime-mtime-in-unix-filesystems). While HDFS keeps a correct modification time (mtime), which is important, easy and cheap, it only keeps a very low-resolution sense of last access time, which is less important, and expensive to monitor and record, as described here (https://issues.apache.org/jira/browse/HADOOP-1869) and here (https://superuser.com/questions/464290/why-is-cat-not-changing-the-access-time).
However, to have a conforming NFS api, you must present atime, and so the HDFS NFS implementation does. But first you have to configure it on. [...] many sites have been advised to turn it off entirely by setting it to zero, to improve HDFS overall performance. See for example here ( https://community.hortonworks.com/articles/43861/scaling-the-hdfs-namenode-part-4-avoiding-performa.html, section "Don’t let Reads become Writes”). So if your site has turned off atime in HDFS, you will need to turn it back on to fully enable NFS. Alternatively, you can maintain optimum efficiency by mounting NFS with the “noatime” option, as described in the document you reference.
[...] check under /var/log, eg with find /var/log -name ‘*nfs3*’ -print

Merging small files into single file in hdfs

In a cluster of hdfs, i receive multiple files on a daily basis which can be of 3 types :
1) product_info_timestamp
2) user_info_timestamp
3) user_activity_timestamp
The number of files received can be of any number but they will belong to one of these 3 categories only.
I want to merge all the files(after checking whether they are less than 100mb) belonging to one category into a single file.
for eg: 3 files named product_info_* should be merged into one file named product_info.
How do i achieve this?
You can use getmerge toachieve this, but the result will be stored in your local node (edge node), so you need to be sure you have enough space there.
hadoop fs -getmerge /hdfs_path/product_info_* /local_path/product_inf
You can move them back to hdfs with put
hadoop fs -put /local_path/product_inf /hdfs_path
You can use hadoop archive (.har file) or sequence file. It is very simple to use - just google "hadoop archive" or "sequence file".
Another set of commands along the similar lines as suggested by #SCouto
hdfs dfs -cat /hdfs_path/product_info_* > /local_path/product_info_combined.txt
hdfs dfs -put /local_path/product_info_combined.txt /hdfs_path/

MapReduceIndexerTool output dir error "Cannot write parent of file"

I want to use Cloudera's MapReduceIndexerTool to understand how morphlines work. I created a basic morphline that just reads lines from the input file and I tried to run that tool using that command:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphline.conf \
--output-dir hdfs:///hostname/dir/ \
--dry-run true
Hadoop is installed on the same machine where I run this command.
The error I'm getting is the following:
net.sourceforge.argparse4j.inf.ArgumentParserException: Cannot write parent of file: hdfs:/hostname/dir
at org.apache.solr.hadoop.PathArgumentType.verifyCanWriteParent(PathArgumentType.java:200)
The /dir directory has 777 permissions on it, so it is definitely allowed to write into it. I don't know what I should do to allow it to write into that output directory.
I'm new to HDFS and I don't know how I should approach this problem. Logs don't offer me any info about that.
What I tried until now (with no result):
created a hierarchy of 2 directories (/dir/dir2) and put 777 permissions on both of them
changed the output-dir schema from hdfs:///... to hdfs://... because all the examples in the --help menu are built that way, but this leads to an invalid schema error
Thank you.
It states 'cannot write parent of file'. And the parent in your case is /. Take a look into the source:
private void verifyCanWriteParent(ArgumentParser parser, Path file) throws ArgumentParserException, IOException {
Path parent = file.getParent();
if (parent == null || !fs.exists(parent) || !fs.getFileStatus(parent).getPermission().getUserAction().implies(FsAction.WRITE)) {
throw new ArgumentParserException("Cannot write parent of file: " + file, parser);
}
}
In the message printed is file, in your case hdfs:/hostname/dir, so file.getParent() will be /.
Additionally you can try the permissions with hadoop fs command, for example you can try to create a zero length file in the path:
hadoop fs -touchz /test-file
I solved that problem after days of working on it.
The problem is with that line --output-dir hdfs:///hostname/dir/.
First of all, there are not 3 slashes at the beginning as I put in my continuous trying to make this work, there are only 2 (as in any valid HDFS URI). Actually I put 3 slashes because otherwise, the tool throws an invalid schema exception! You can easily see in this code that the schema check is done before the verifyCanWriteParent check.
I tried to get the hostname by simply running the hostname command on the Cent OS machine that I was running the tool on. This was the main issue. I analyzed the /etc/hosts file and I saw that there are 2 hostnames for the same local IP. I took the second one and it worked. (I also attached the port to the hostname, so the final format is the following: --output-dir hdfs://correct_hostname:8020/path/to/file/from/hdfs
This error is very confusing because everywhere you look for the namenode hostname, you will see the same thing that the hostname command returns. Moreover, the errors are not structured in a way that you can diagnose the problem and take a logical path to solve it.
Additional information regarding this tool and debugging it
If you want to see the actual code that runs behind it, check the cloudera version that you are running and select the same branch on the official repository. The master is not up to date.
If you want to just run this tool to play with the morphline (by using the --dry-run option) without connecting to Solr and playing with it, you can't. You have to specify a Zookeeper endpoint and a Solr collection or a solr config directory, which involves additional work to research on. This is something that can be improved to this tool.
You don't need to run the tool with -u hdfs, it works with a regular user.

Concatenating multiple text files into one very large file in HDFS

I have the multiple text files.
The total size of them exceeds the largest disk size available to me (~1.5TB)
A spark program reads a single input text file from HDFS. So I need to combine those files into one. (I cannot re-write the program code. I am given only the *.jar file for execution)
Does HDFS have such a capability? How can I achieve this?
What I understood from your question is you want to Concatenate multiple files into one. Here is a solution which might not be the most efficient way of doing it but it works. suppose you have two files: file1 and file2 and you want to get a combined file as ConcatenatedFile
.Here is the script for that.
hadoop fs -cat /hadoop/path/to/file/file1.txt /hadoop/path/to/file/file2.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt
Hope this helps.
HDFS by itself does not provide such capabilities. All out-of-the-box features (like hdfs dfs -text * with pipes or FileUtil's copy methods) use your client server to transfer all data.
In my experience we always used our own written MapReduce jobs to merge many small files in HDFS in distributed way.
So you have two solutions:
Write your own simple MapReduce/Spark job to combine text files with
your format.
Find already implemented solution for such kind of
purposes.
About solution #2: there is the simple project FileCrush for combining text or sequence files in HDFS. It might be suitable for you, check it.
Example of usage:
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
--input-format=text \
--output-format=text \
--compress=none \
/input/dir /output/dir 20161228161647
I had a problem to run it without these options (especially -Ddfs.block.size and output file date prefix 20161228161647) so make sure you run it properly.
You can do a pig job:
A = LOAD '/path/to/inputFiles' as (SCHEMA);
STORE A into '/path/to/outputFile';
Doing a hdfs cat and then putting it back to hdfs means, all this data is processed in the client node and will degradate your network

hadoop copy preserving the ownership/permissions

Is there any way to retain the ownership/permissions while copying files in hadoop?
Tried hadoop fs -cp -p <src> <dest> . Didn't work.
Yes of course you can. But I recommend you to use distcp, is an advanced tool to copy data between clusters or on the same cluster, you have a lot of option to optimize the execution. This command will run a mapreduce, so for a long copies it will take less time and you will can preserve all attributes.
Example:
hadoop distcp /source_dir/data \
/target_dir/data
hadoop distcp /source_dir/dataA \
/source_dir/dataB \
/target_dir/
For all attributes:
r: replication number
b: block size
u: user
g: group
p: permission
c: checksum-type
a: ACL
x: XAttr
t: timestamp
Another example, but preserving all attributes:
hadoop distcp -p rbugpcaxt \
/source_dir/data \
/target_dir/data
You can read more about this command on hadoop-distcp
The most important is not the owner and group or permissions, you can change it easy after copy command, the most important attributes are ACL, block size, replication number, and some times timestamp, this are extra properties that you can not change so easy after a simply copy (hdfs dfs -cp).
There is not, but you can (assuming you have the appropriate permissions) change the ownership after you copy the files.
It is currently not possible to create two copies of the file while copying permissions -- Depending on your use case, however, an option may be to move the files instead. For instance, I have had to change the location of a file and its permissions, and also wanted to keep a backup (permissions didn't matter) so I moved with permissions to the new location and copied back to the original without. I know that's not very helpful, but that's the best we have in Hadoop at the moment.

Resources