Why is dataproc outputting an unexpected value? - hadoop

I have created a jar file that uses Hadoop to count the number of bigrams found in a set of text files.
When I run the Hadoop job on my local setup, I receive an output file containing a count of the bigrams in the text file.
Correct output
However, when I run the exact same jar file using Dataproc on Google Cloud Platform, it outputs the following:
dataproc, incorrect output
Any ideas why this may be happening? Cheers

Related

"No such file or directory" in hadoop while executing WordCount program using jar command

I am new to Hadoop and am trying to execute the WordCount Problem.
Things I did so far -
Set up a Hadoop single-node cluster, referring to the link below:
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Wrote the word count program, referring to the link below:
https://kishorer.in/2014/10/22/running-a-wordcount-mapreduce-example-in-hadoop-2-4-1-single-node-cluster-in-ubuntu-14-04-64-bit/
The problem is that when I execute the last command to run the program -
hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output
Following is the error I get -
The directory seems to be present
The file is also present in the directory with contents
Finally, as a side note, I also tried the following directory structure in the jar command.
No avail! :/
I would really appreciate it if someone could guide me here!
Regards,
Paul Alwin
Your first image is using input from the local Hadoop installation directory, /usr
If you want to use that data on your local filesystem, you can specify file:///usr/...
Otherwise, if you're running in pseudo-distributed mode, HDFS has been set up, and /usr does not exist in HDFS unless you explicitly created it there.
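For example (a hedged sketch reusing the paths from your own command; adjust them to your setup), you could point the job explicitly at the local filesystem:
hadoop jar wordcount.jar file:///usr/local/hadoop/input file:///usr/local/hadoop/output
or first copy the input into HDFS with hdfs dfs -put and pass the HDFS paths instead.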
Based on the stack trace, I believe the error comes from the /app/hadoop/ staging directory path not existing, or from its permissions not allowing your current user to run commands against that path.
Suggestion: Hortonworks and Cloudera offer pre-built VirtualBox images and lots of tutorial resources. Most companies will run Hadoop from one of those vendors, so in my opinion it's better to get familiar with those than to mess around with installing Hadoop yourself from scratch.

No output files from mahout

I am running a Mahout RecommenderJob on Hadoop in Syncfusion. I get the following, but no output... it seems to run indefinitely.
Does anyone have an idea why I am not getting an output.txt from this? Why does this seem to run indefinitely?
I suspect this could be due to insufficient disk space on your machine; in that case, I'd suggest you clean up your disk space and try again.
Alternatively, I'd suggest you use the Syncfusion Cluster Manager, with which you can form a cluster with multiple nodes/machines, so that there will be sufficient memory available to execute your job.
-Ramkumar
I've tested the same map reduce job which you're trying to execute using Syncfusion BigData Studio and it worked for me.
Please find the input details which I've used from the following,
Command:
hadoop jar E:\mahout-examples-0.12.2-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input=/Input.txt --output=output
Sample input (Input.txt):
For input data, I've used the data available on the Apache Mahout site (refer to the link below) and saved it in a text file.
http://mahout.apache.org/users/recommender/userbased-5-minutes.html
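(For reference, RecommenderJob expects one rating per line in the form userID,itemID,preference; the lines below are purely illustrative values, not the actual tutorial data.)
1,101,5.0
1,102,3.0
2,101,2.0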
I've also noticed that "COOCCURRENCE" is misspelled in your command. Please correct it, or else you could face a "Class Not Found Exception".
Output:
Please find the generated output from below.
-Ramkumar :)

InvalidJobConfException: Output directory not set

I am using the Cloudera VM for MapReduce practice.
I just created the jar from the default wordcount classes provided by Cloudera.
I am getting this error when I run the MapReduce program. Can you tell me what I am missing?
InvalidJobConfException: Output directory not set.
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
To process data using a MapReduce program, you need:
Mapper class
Reducer class
Driver class (the main class that runs the MapReduce program)
Input data (path of the input data to analyse)
Output directory (path of the output directory where the program's output will be stored; this directory should not already exist in HDFS)
From the error, it seems you have not set the output directory path. If the output directory is not set in your code, then you have to pass it at runtime, provided your code accepts an argument for it. Here is a very good step-by-step guide to running your first WordCount program in MapReduce.
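As a hedged sketch (WordCountMapper and WordCountReducer are placeholder class names, not taken from your code), a driver that takes both paths from the command-line arguments might look like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // placeholder mapper
        job.setReducerClass(WordCountReducer.class);    // placeholder reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path, must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
You would then run it the same way as your command, passing the input and output paths as the two arguments.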

Spark yarn-cluster mode - read file passed with --files

I'm running my spark application using yarn-cluster master.
What does the app do?
External service generates a jsonFile based on HTTP request to a RESTService
Spark needs to read this file and do some work after parsing the json
The simplest solution that came to mind was to use --files to load that file.
In yarn-cluster mode, reading a file means it must be available on HDFS (if I'm right?), and my file is being copied to a path like this:
/hadoop_user_path/.sparkStaging/spark_applicationId/myFile.json
where I can of course read it, but I cannot find a way to get this path from any configuration / SparkEnv object. And hardcoding .sparkStaging in the Spark code seemed like a bad idea.
Why does the simple:
val jsonStringData = spark.textFile(myFileName)
sqlContext.read.json(jsonStringData)
not read the file passed with --files, and instead throw a FileNotFoundException? Why does Spark look for files in hadoop_user_folder only?
My solution which works for now:
Just before running Spark, I copy the file to the proper HDFS folder, pass the filename as a Spark argument, process the file from the known path, and after the job is done I delete the file from HDFS.
I thought passing the file with --files would let me forget about saving and deleting this file. Something like pass-process-and-forget.
How do you read a file passed with --files, then? Is the only solution to build the path by hand, hardcoding the ".sparkStaging" folder path?
The question is written very ambiguously. However, from what I gather, you want to read a file from any location on your local OS file system, and not just from HDFS.
Spark uses URIs to identify paths, and when a valid Hadoop/HDFS environment is available, it defaults to HDFS. In that case, to point to your local OS file system, for example on UNIX/Linux, you can use something like:
file:///home/user/my_file.txt
If you are using an RDD to read from this file, you are running in yarn-cluster mode, or the file is accessed within a task, you will need to take care of copying and distributing that file manually to all nodes in your cluster, using the same path. That is what makes it easier to first put it on HDFS, or what the --files option is supposed to do for you.
See more info on Spark, External Datasets.
For any files that were added through the --files option, or were added through SparkContext.addFile, you can get information about their location using the SparkFiles helper class, for example:
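A minimal sketch (the file name is the one from your --files example; in yarn-cluster mode the file is also localized into the container's working directory, so a plain relative path often works as well):
import org.apache.spark.SparkFiles;

// resolves to this node's local copy of a file shipped with --files or SparkContext.addFile
String localPath = SparkFiles.get("myFile.json");
String jsonString = new String(java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(localPath)), java.nio.charset.StandardCharsets.UTF_8);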
Answer from #hartar worked for me. Here is the complete solution.
Add the required files during spark-submit using --files:
spark-submit --name "my_job" --master yarn --deploy-mode cluster --files /home/xyz/file1.properties,/home/xyz/file2.properties --class test.main /home/xyz/my_test_jar.jar
Get the Spark session inside the main method:
SparkSession ss = new SparkSession.Builder().getOrCreate();
Since I am interested only in .properties files, I filter for them; if you know the exact name of the file you wish to read, it can be used directly in FileInputStream.
spark.yarn.dist.files stores the entries as file:/home/xyz/file1.properties,file:/home/xyz/file2.properties, so I split the string on (,) and (/) to eliminate everything except the file names.
// requires java.util.Properties, java.util.regex.Pattern and java.io.FileInputStream
Properties props = new Properties();
String[] files = Pattern.compile("/|,").splitAsStream(ss.conf().get("spark.yarn.dist.files")).filter(s -> s.contains(".properties")).toArray(String[]::new);
// load every localized .properties file (they sit in the container's working directory)
for (String f : files) {
    props.load(new FileInputStream(f));
}
I had the same problem as you. In fact, you should know that when you send an executable and files, they end up at the same level, so in your executable it is enough to use just the file name to access it, since your executable runs from its own working directory.
You do not need to use SparkFiles or any other class; just use the file name, something like readFile("myFile.json");
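For instance, a hedged sketch of that idea (readFile is not a real API, just shorthand; plain Java I/O works here because the --files copy sits in the container's working directory):
java.io.File jsonFile = new java.io.File("myFile.json");  // bare file name, resolved against the working directory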
I have come across an easy way to do it.
We are using Spark 2.3.0 on YARN in pseudo-distributed mode. We need to query a Postgres table from Spark, whose configuration is defined in a properties file.
I passed the properties file using the --files attribute of spark-submit. To read the file in my code I simply used the java.util.Properties class.
I just need to ensure that the path I specify when loading the file is the same as the one passed in the --files argument.
e.g. if the spark submit command looked like:
spark-submit --class --master yarn --deploy-mode client --files test/metadata.properties myjar.jar
Then my code to read the file will look like:
Properties props = new Properties();
props.load(new FileInputStream(new File("test/metadata.properties")));
Hope you find this helpful.

Output Folders for Amazon EMR

I want to run a custom jar whose main class runs a chain of MapReduce jobs, with the output of the first job going in as the input of the second job, and so on.
What do I set in FileOutputFormat.setOutputPath("what path should be here?");
If I specify -outputdir in the argument, I get a FileAlreadyExists error. If I don't specify it, then I do not know where the output will land. I want to be able to see the output from every job of the chained MapReduce jobs.
Thanks in adv. Pls help!
You are likely getting the FileAlreadyExists error because the output directory exists prior to the job you are running. Make sure to delete the directories that you specify as output for your Hadoop jobs before running them; otherwise you will not be able to run those jobs. For example:
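A hedged sketch of how the driver could clear a stale output directory before each job in the chain (the path is a placeholder, and job is assumed to be your already-configured Job instance):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path out = new Path("/user/hadoop/job1_output");  // placeholder output path for the first job
if (fs.exists(out)) {
    fs.delete(out, true);  // recursive delete so the job can write here again
}
FileOutputFormat.setOutputPath(job, out);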
Good practice is to take the output path from the command line, as it increases the flexibility of your code, and you will only need to compile your jar once as long as the only changes are to your paths.
For example, on EMR, once you launch your cluster and compile your jar:
dfs_ip_folder=HDFS_IP_DIR
dfs_op_folder=HDFS_OP_DIR
hadoop jar hadoop-examples-*.jar wordcount ${dfs_ip_folder} ${dfs_op_folder}
Note: you have to create dfs_ip_folder and store the input data inside it.
dfs_op_folder will be created automatically on HDFS, not on the local file system.
To access the HDFS output folder, you can either copy it to the local file system or cat it.
eg.
hadoop fs -cat ${dfs_op_folder}/<file_name>
hadoop fs -copyToLocal ${dfs_op_folder} ${your_local_input_dir_path}
