Pyspark on Windows: Input path does not exist

I am new to pyspark; I did some research on my issue, but none of the solutions I found worked for me.
I want to read a text file. I first put it in the same folder as my .py file in the Jupyter notebook, then ran the following command:
rdd = sc.textFile("Parcours client.txt")
print(rdd.collect())
I get this error:
Input path does not exist: file:/C:/Spark/spark-2.3.0-bin-hadoop2.7/Data Analysis/Parcours client.txt
This is exactly where I put the .txt file, though, and I launch pyspark from
C:/Spark/spark-2.3.0-bin-hadoop2.7
I also tried giving the full local path where my txt file exists:
rdd = sc.textFile("C:\\Users\\Jiji\\Desktop\\Data Analysis\\L'Output\\Parcours client.txt")
print(rdd.collect())
I get the same error:
Input path does not exist: file:/Users/Jiji/Desktop/Data Analysis/L'Output/Parcours client.txt

Try rdd = sc.textFile("Parcours\ client.txt") or rdd = sc.textFile(r"Parcours client.txt")
See also:
whitespaces in the path of windows filepath

Thank you everybody for your help.
I tried putting my txt file in a folder on the desktop whose name doesn't have any spaces, and that solved my issue. So I ran the following command:
rdd = sc.textFile('C:\\Users\\Jiji\\Desktop\\Output\\Parcours client.txt')
I think the issue was because of the spaces in the path.
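A quick way to rule out a wrong path before involving Spark is to check the file from plain Python first. This is only a minimal sketch: the path is the one from the workaround above (adjust it for your machine), and sc is assumed to be the SparkContext already provided by the pyspark shell or notebook.
import os

path = r"C:\Users\Jiji\Desktop\Output\Parcours client.txt"
print(os.path.exists(path))   # should print True before you hand the path to Spark

rdd = sc.textFile(path)       # sc: SparkContext from the pyspark shell/notebook
print(rdd.take(5))            # inspect a few lines instead of collecting everything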

Related

Issues with os.listdir when script is an executable

I have created a script that takes a file from one folder and produces another file in another folder. It is a project to convert one format into another, to be used by people who don't have a strong background in informatics, so I have created a folder containing the script plus an input folder and an output folder. The user just needs to put the input file in the input folder and take the results from the output folder.
The script works fine when I run it from Visual Studio Code, as well as when I run it from the terminal (python CSVtoVCFv3.py),
but when I convert the script into an executable with PyInstaller I get the following error:
File "CSVtoVCFv3.py", line 99, in <module>
FileNotFoundError: [Errno 2] No such file or directory: '/Users/manoldominguez/input/'
[99327] Failed to execute script CSVtoVCFv3
The code used in line 99 is:
97 actual_path = os.getcwd()
98 folder_input = '/input/'
99 input_file_name = os.listdir(actual_path+folder_input)
100 input_file_name= ''.join(input_file_name)
101 CSV_input = actual_path+folder_input+input_file_name
I have also tried this:
actual_path = (os.path.dirname(os.path.realpath('CSVtoVCFv3.py')))
So, in conclusion, as far as I can understand, the issue is:
if I run my script directly, these lines give me
'/Users/manoldominguez/Desktop/CSVtoVCF/input/'
but if the script is run from the executable, I get
'/Users/manoldominguez/input/'
os.getcwd() gives the Current Working Directory - that is, the folder in which the script was executed, which doesn't have to be the folder in which the script is saved. This way you can run the code from a different folder and have it work with files in that folder - and that can be useful.
But if you need to work with files in the folder where the script itself is located, then you can get that folder using
SCRIPT_PATH = os.path.dirname(os.path.realpath(__file__))
or
import sys
SCRIPT_PATH = os.path.dirname(os.path.realpath(sys.argv[0]))
rather than with the literal string 'CSVtoVCFv3.py'.
And then you can join them:
SCRIPT_PATH = os.path.dirname(os.path.realpath(sys.argv[0]))
folder_input = 'input'  # no leading '/', otherwise os.path.join() discards SCRIPT_PATH
full_folder_input = os.path.join(SCRIPT_PATH, folder_input)
all_filenames = os.listdir(full_folder_input)
for input_file_name in all_filenames:
    CSV_input = os.path.join(full_folder_input, input_file_name)
The only thing I don't like is your
input_file_name = os.listdir(actual_path+folder_input)
input_file_name= ''.join(input_file_name)
because listdir() may give more than one file, and then your join may create an incorrect path. Better to take input_file_name[0] for a single file, or use a for-loop to work with all files in the folder.
BTW: maybe you should use sys.argv to get the path as a parameter, so everyone can decide where to put the file - see the sketch below.
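A minimal sketch of that sys.argv suggestion; the command-line interface here is hypothetical, not part of the original script. The input folder is taken from the first argument if given, otherwise it falls back to an input folder next to the script or executable.
import os
import sys

# usage: python CSVtoVCFv3.py [input_folder]
script_dir = os.path.dirname(os.path.realpath(sys.argv[0]))
input_folder = sys.argv[1] if len(sys.argv) > 1 else os.path.join(script_dir, "input")

for input_file_name in os.listdir(input_folder):
    CSV_input = os.path.join(input_folder, input_file_name)
    print("would convert:", CSV_input)   # placeholder for the real conversion step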

hadoop MultipleOutputs to absolute path , but file is already being created by other attempt

I use MultipleOutputs to output data to some absolute paths, instead of a path relative to OutputPath.
Then I get this error:
Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/test/convert.bak/326/201505110030/326-m-00035] for [DFSClient_attempt_1425611626220_29142_m_000035_1001_-370311306_1] on client [192.168.7.146], because this file is already being created by [DFSClient_attempt_1425611626220_29142_m_000035_1000_-53988495_1] on [192.168.7.149]
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2320)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2083)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2012)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1963) at
https://issues.apache.org/jira/browse/MAPREDUCE-6357
Output files must be in ${mapred.output.dir}.
The design and implementation don't support outputting data to files outside of ${mapred.output.dir}.
Looking at the stack trace, it seems that the output file has already been created.
If you want to write your data into multiple files, then try to generate those file names dynamically and use them as shown in this code taken from the Hadoop Definitive Guide:
String basePath = String.format("%s/%s/part", parser.getStationId(), parser.getYear());
multipleOutputs.write(NullWritable.get(), value, basePath);
I hope this will help.
As the error clearly suggests, the path you are trying to create already exists. So check whether that path exists before creating it, and if it does, delete it:
FileSystem hdfs = FileSystem.get(new Configuration());  // obtain a handle to the configured filesystem
Path path = new Path(YourHadoopPath);
if (hdfs.exists(path)) {
    hdfs.delete(path, true);  // true = delete recursively
}

Hadoop read files with following name patterns

This may sound very basic but I have a folder in HDFS with 3 kinds of files.
eg:
access-02171990
s3.Log
catalina.out
I want my map/reduce job to read only the files which begin with access-. How do I do that programmatically, or by specifying it via the input directory path?
Please help.
You can set the input path as a glob:
FileInputFormat.addInputPath(jobConf, new Path("/your/path/access*"))
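The same wildcard idea carries over to PySpark (as in the first question above), since sc.textFile accepts glob patterns; a small sketch, with an illustrative path and sc assumed to be an existing SparkContext:
rdd = sc.textFile("/your/path/access-*")   # only files whose names start with access-
print(rdd.take(5))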

Import an *.xls file in R?

I am struggling to read an *.xls file into R:
I did the following:
I set my working directory to the folder containing the *.xls file and then:
> library(gdata) # load the gdata package
> mydata = read.xls("comprice.xls", sheet=1, verbose=FALSE)
Error in findPerl(verbose = verbose) : perl executable not found. Use perl= argument to specify the correct path. Error in file.exists(tfn) : invalid 'file' argument
However, my path is correct and the file is there! What's wrong?
UPDATE
I have already installed it; however, now I get: cannot find function "read.xls"...
This error message means that Perl is not installed on your computer, or it is not on your PATH.
If Perl is installed, then you can pass the perl= argument to the read.xls() function:
read.xls(xlsfile, perl="C:/perl/bin/perl.exe")
As an alternative, you could try the xlsx package:
read.xlsx("comprice.xls", 1) reads your file and makes the data.frame column classes nearly useful, but is very slow for large data sets.
read.xlsx2("comprice.xls", 1) is faster, but you'll have to define column classes manually. If you run the command twice, you won't need to count the columns yourself:
data <- read.xlsx2("comprice.xls", 1)
data <- read.xlsx2("comprice.xls", 1, colClasses= rep("numeric", ncol(data)))
Perl is either not installed or cannot be found. You can either install it, or specify the path where it is installed using
perl='path of perl installation'
in the call.

Mahout - Naive Bayes

I tried deploying the 20-newsgroups example with Mahout, and it seems to work fine. Out of curiosity I would like to dig deeper into the model statistics;
for example, the bayes-model directory contains the following subdirectories,
trainer-tfIdf trainer-thetaNormalizer trainer-weights
which contain part-00000 files. I would like to read the contents of these files for a better understanding; the cat command doesn't seem to work, it just prints garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
You can read part-00000 files using Hadoop's filesystem -text option. Just get into the Hadoop directory and type the following:
`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to set the HADOOP_CLASSPATH variable. For example, if running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the jar containing that class to the HADOOP_CLASSPATH variable:
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$ bin/mahout seqdumper -s ~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 -o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file
