Pyspark: Reading properties files on HDFS using configParser - hadoop

I am using ConfigParser to read key/value pairs that are passed to my pyspark program. The code works fine when I execute it from the edge node of a hadoop cluster, with the config file in a local directory of the edge node. It doesn't work if the config file is uploaded to an HDFS path and I try to access it using the parser.
The config file para.conf has below contents
[tracker]
port=9801
In local client mode, with para.conf in the local directory, I access the values as below.
from ConfigParser import SafeConfigParser
parser = SafeConfigParser()
parser.read("para.conf")
myport = parser.get('tracker', 'port')
The above works fine...
On the Hadoop cluster:
I uploaded the para.conf file to the HDFS directory path bdc/para.conf
parser.read("hdfs://clusternamenode:8020/bdc/para.conf")
This doesn't return anything, and neither does the variant below with extra slashes..
parser.read("hdfs:///clusternamenode:8020//bdc//para.conf")
Although using sc.textFile I can read this file, which returns a valid RDD:
sc.textFile("hdfs://clusternamenode:8020/bdc/para.conf")
though I am not sure whether ConfigParser can extract the key values from this..
Can anyone advise whether ConfigParser can be used to read files from HDFS? Or is there an alternative?

I have copied most of the code you provided in the comments. You were really close to the solution. Your problem is that sc.textFile produces one element in the RDD for every line (it splits on newline characters). When you call .collect() you get a list of strings, one per line of your document. StringIO does not expect a list, it expects a single string, so you have to restore the original document structure from the list. See the working example below:
import ConfigParser
import StringIO

# collect() returns the file as a list of lines
credstr = sc.textFile("hdfs://clusternamenode:8020/bdc/cre.conf").collect()

# re-join the lines into one string and wrap it in a file-like buffer
buf = StringIO.StringIO("\n".join(credstr))

parse_str = ConfigParser.ConfigParser()
parse_str.readfp(buf)
parse_str.get('tracker', 'port')
Output:
'9801'
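On Python 3 (where the module is renamed to configparser), the same idea works without StringIO by passing the joined lines to read_string. A minimal sketch, assuming the same HDFS path and [tracker] section as above:
import configparser

# collect the lines of the config file from HDFS and re-join them into one string
conf_lines = sc.textFile("hdfs://clusternamenode:8020/bdc/para.conf").collect()

parser = configparser.ConfigParser()
parser.read_string("\n".join(conf_lines))

print(parser.get('tracker', 'port'))   # -> '9801'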

Related

Using gitpython to get current hash does not work when using qsub for job submission on a cluster

I use python to do my data analysis and lately I came up with the idea to save the current git hash in a log file so I can later check which code version created my results (in case I find inconsistencies or whatever).
It works fine as long as I do it locally.
import git
import os

rep = git.Repo(os.getcwd(), search_parent_directories=True)
git_hash = rep.head.object.hexsha

with open('logfile.txt', 'w+') as writer:
    writer.write('Code version: {}'.format(git_hash))
However, I have a lot of heavy calculations that I run on a cluster to speed things up (running the analyses of the subjects in parallel), using qsub, which looks more or less like this:
qsub -l nodes=1:ppn=12 analysis.py -q shared
This always results in a git.exc.InvalidGitRepositoryError.
EDIT
Printing os.getcwd() showed me that on the cluster the current working directory is always my $HOME directory, no matter where I submit the job from.
My next solution was to get the directory where the file is located using some of the solutions suggested here.
However, these solutions result in the same error because (that's how I understand it) my file is somehow copied to a directory deep in the root structure of the cluster's headnode (/var/spool/torque/mom_priv/jobs).
I could of course write down the location of my file as a hardcoded variable, but I would like a general solution for all my scripts.
So after I explained my problem to IT in detail, they could help me solve the problem.
Apparently the $PBS_O_WORKDIR variable stores the directory from which the job was submitted.
So I adjusted my access to the githash as follows:
try:
    script_file_directory = os.environ["PBS_O_WORKDIR"]
except KeyError:
    script_file_directory = os.getcwd()

try:
    rep = git.Repo(script_file_directory, search_parent_directories=True)
    git_hash = rep.head.object.hexsha
except git.InvalidGitRepositoryError:
    git_hash = 'not-found'

# create a log file that saves some information about the run script
with open('logfile.txt', 'w+') as writer:
    writer.write('Code version: {}\n'.format(git_hash))
I first check whether the PBS_O_WORKDIR variable exists (i.e. whether I am running the script as a job on the cluster). If it does, I get the git hash from that directory; if it doesn't, I use the current working directory.
Very specific, but maybe one day someone has the same problem...

Pyspark on windows : Input path does not exist

As I am new to pyspark, I did some research about my issue but none of the solutions worked for me.
I want to read a text file, so I first put it in the same folder as my .py file in my jupyter notebook. Then I run the following command:
rdd = sc.textFile("Parcours client.txt")
print(rdd.collect())
I get this error:
Input path does not exist: file:/C:/Spark/spark-2.3.0-bin-hadoop2.7/Data Analysis/Parcours client.txt
Although this is exactly where I put the .txt file, and I launch my pyspark from
C:/Spark/spark-2.3.0-bin-hadoop2.7
I also tried to give the full local path where my txt file exists:
rdd = sc.textFile("C:\\Users\\Jiji\\Desktop\\Data Analysis\\L'Output\\Parcours client.txt")
print(rdd.collect())
I get the same error:
Input path does not exist: file:/Users/Jiji/Desktop/Data Analysis/L'Output/Parcours client.txt
Try rdd = sc.textFile("Parcours\ client.txt") or rdd = sc.textFile(r"Parcours client.txt")
See also:
whitespaces in the path of windows filepath
Thank you everybody for your help.
I have put my txt file in a folder on the desktop whose name doesn't contain any spaces, and that solved my issue. So I run the following command:
rdd = sc.textFile('C:\\Users\\Jiji\\Desktop\\Output\\Parcours client.txt')
I think the issue was because of the spaces in the path.
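If you run into the same error, it can help to verify the path on the Python side before handing it to Spark. A quick sketch, reusing the path from above:
import os

# the path from the workaround above -- adjust to wherever your file actually lives
path = r'C:\Users\Jiji\Desktop\Output\Parcours client.txt'

# check that the file really exists where you think it does
print(os.path.exists(path))

# sc.textFile resolves relative paths against the directory pyspark was launched from,
# so printing the current working directory also helps
print(os.getcwd())

rdd = sc.textFile(path)
print(rdd.count())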

import csv file on Neo4J with a Mac: which path to use for my file?

I am a new user of Neo4j. I would like to import a simple csv file into Neo4j on my Mac, but it seems I am doing something wrong with the path to my file. I have tried many different ways but it is not working. The only workaround I found is to upload it to dropbox....
Please see below the code I am using:
LOAD CSV WITH HEADERS FROM "file://Users/Cam/Documents/Neo4j/default.graphdb/import/node_attributes.csv" as line
RETURN count(*)
the error message is:
Cannot load from URL
'file://Users/Cam/Documents/Neo4j/default.graphdb/import/node_attributes.csv':
file URL may not contain an authority section (i.e. it should be
'file:///')
I already tried to add some /// in the path but it is not working.
If the CSV file is in your default.graphdb/import folder, then you don't need to provide the absolute path, just give the path relative to the import folder:
LOAD CSV WITH HEADERS FROM "file:///node_attributes.csv" as line
RETURN count(*)
I'd use neo4j-import from the terminal:
Example from https://neo4j.com/developer/guide-import-csv/
neo4j-import --into retail.db --id-type string \
--nodes:Customer customers.csv --nodes products.csv \
--nodes orders_header.csv,orders1.csv,orders2.csv \
--relationships:CONTAINS order_details.csv \
--relationships:ORDERED customer_orders_header.csv,orders1.csv,orders2.csv
What exactly is not working when you try
LOAD CSV WITH HEADERS FROM "file:///Users/Cam/Documents/Neo4j/default.graphdb/import/node_attributes.csv" as line RETURN count(*)
?

Running Simple Hadoop Command using Java code

I would like to list files using the hadoop command "hadoop fs -ls filepath". I want to write Java code to achieve this. Can I write a small piece of Java code, make a jar of it, and supply it to a Map Reduce job (Amazon EMR) to achieve this? Can you please point me to the code and steps with which I can achieve this?
You can list files in HDFS using Java code as below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import java.net.URI;
...
Configuration configuration = new Configuration();
// connect to the HDFS namenode
FileSystem hdfs = FileSystem.get(new URI("hdfs://localhost:54310"), configuration);
// list the contents of the given directory
FileStatus[] fileStatus = hdfs.listStatus(new Path("hdfs://localhost:54310/user/path"));
Path[] paths = FileUtil.stat2Paths(fileStatus);
for (Path path : paths) {
    System.out.println(path);
}
Use this in your MapReduce driver code (the main or run method) to get the list and pass it as arguments to your MapReduce job.
Option 2
create a shell script that reads the list of files using the hadoop fs -ls command
provide this script as part of the EMR bootstrap script to get the list of files
in the same script you can write code to save the paths in a text file under /mnt/
read this path from your map reduce code and provide it to the arg list for your mappers and reducers (a minimal sketch of the shell-out step follows below)
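For reference, here is a minimal Python sketch of that shell-out step, assuming the hadoop CLI is on the PATH; the HDFS path and the output file name /mnt/hdfs_file_list.txt are just placeholders:
import subprocess

# run "hadoop fs -ls" and capture its output (assumes the hadoop CLI is installed)
output = subprocess.check_output(["hadoop", "fs", "-ls", "/user/path"]).decode("utf-8")

# listing lines start with a permission string (e.g. -rw-r--r-- or drwxr-xr-x);
# the last column of each such line is the full path
paths = [line.split()[-1] for line in output.splitlines() if line.startswith(("-", "d"))]

# save the paths to a text file under /mnt/ so the job can read them later
with open("/mnt/hdfs_file_list.txt", "w") as f:
    f.write("\n".join(paths))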
Here is my Github repository.
Simple commands like making a folder, putting files to HDFS, reading, listing, and writing data are present in the JAVA API folder.
And you can explore the other folders to find MapReduce code in Java.

How to read gz files in Spark using wholeTextFiles

I have a folder which contains many small .gz files (compressed csv text files). I need to read them in my Spark job, but the thing is I need to do some processing based on info which is in the file name. Therefore, I did not use:
JavaRDD<String> input = sc.textFile(...)
since to my understanding I do not have access to the file name this way. Instead, I used:
JavaPairRDD<String, String> files_and_content = sc.wholeTextFiles(...);
because this way I get a pair of file name and the content.
However, it seems that this way the input reader fails to read the text from the gz files and instead reads binary gibberish.
So, I would like to know if I can set it to somehow read the text, or alternatively access the file name using sc.textFile(...)
You cannot read gzipped files with wholeTextFiles because it uses CombineFileInputFormat, which cannot handle gzipped files since they are not splittable (source proving it):
override def createRecordReader(
    split: InputSplit,
    context: TaskAttemptContext): RecordReader[String, String] = {

  new CombineFileRecordReader[String, String](
    split.asInstanceOf[CombineFileSplit],
    context,
    classOf[WholeTextFileRecordReader])
}
You may be able to use newAPIHadoopFile with wholefileinputformat (not built into hadoop but all over the internet) to get this to work correctly.
UPDATE 1: I don't think WholeFileInputFormat will work since it just gets the bytes of the file, meaning you may have to write your own class possibly extending WholeFileInputFormat to make sure it decompresses the bytes.
Another option would be to decompress the bytes yourself using GZipInputStream
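As a side note for PySpark users, the same decompress-it-yourself idea can be sketched with sc.binaryFiles and Python's gzip module (assuming Python 3; the path below is a placeholder, and this is not the Java GZipInputStream route mentioned above):
import gzip

# binaryFiles keeps the file name as the key and returns the raw bytes as the value
files = sc.binaryFiles("hdfs://clusternamenode:8020/some/dir/*.gz")

# decompress every file's bytes and decode them to text
texts = files.mapValues(lambda raw: gzip.decompress(raw).decode("utf-8"))

for name, content in texts.take(1):
    print(name, content[:100])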
UPDATE 2: If you have access to the directory name like in the OP's comment below you can get all the files like this.
Path path = new Path("");
FileSystem fileSystem = path.getFileSystem(new Configuration()); //just uses the default one
FileStatus [] fileStatuses = fileSystem.listStatus(path);
ArrayList<Path> paths = new ArrayList<>();
for (FileStatus fileStatus : fileStatuses) paths.add(fileStatus.getPath());
I faced the same issue while using spark to connect to S3.
My file was a gzipped csv with no extension.
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile);
This approach returned corrupted values.
I solved it by using the below code:
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile+".gz");
By adding .gz to the S3 URL, Spark automatically picked up the file and read it as a gz file. (This seems like a hacky approach, but it solved my problem.)
