How to use sqlContext to load multiple parquet files? - hadoop

I'm trying to load a directory of Parquet files in Spark but can't seem to get it to work. This works:
val df = sqlContext.load("hdfs://nameservice1/data/rtl/events/stream/loaddate=20151102")
but this doesn't work:
val df = sqlContext.load("hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*")
It gives me back this error:
java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*
How do I get it to work with a wildcard?

You can read in the list of files or folders using the FileSystem listStatus call, then go over the files/folders you want to read and use a reduce with union to merge them all into a single DataFrame.
Get the files/folders:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
Read in the data:
val parquetFiles = status.map(folder => {
  sqlContext.read.parquet(folder.getPath.toString)
})
Merge the data into a single DataFrame:
val mergedFile = parquetFiles.reduce((x, y) => x.unionAll(y))
You can also have a look at my past posts around the same topic.
Spark Scala list folders in directory
Spark/Scala flatten and flatMap is not working on DataFrame

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
Like:
val basePath = "hdfs://nameservice1/data/rtl/events/stream"
sparkSession.read.option("basePath", basePath).parquet(basePath + "/loaddate=201511*")
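If you are still on the sqlContext API from the question, the same idea should carry over (a sketch, assuming a Spark version whose DataFrameReader supports the "basePath" option):
val basePath = "hdfs://nameservice1/data/rtl/events/stream"
val df = sqlContext.read
  .option("basePath", basePath)            // tell Spark where the partitioned table starts
  .parquet(basePath + "/loaddate=201511*") // wildcard over the loaddate partitions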

Related

Pyspark: Reading properties files on HDFS using configParser

I am using ConfigParser to read key values which are passed to my pyspark program. The code works fine when I execute from the edge node of a Hadoop cluster, with the config file in a local directory of the edge node. This doesn't work if the config file is uploaded to an HDFS path and I try accessing it using the parser.
The config file para.conf has below contents
[tracker]
port=9801
In local client mode, with para.conf in the local directory, I access the values using the below.
from ConfigParser import SafeConfigParser
parser = SafeConfigParser()
parser.read("para.conf")
myport = parser.get('tracker', 'port')
The above works fine...
On the Hadoop cluster:
Uploaded the para.conf file to the HDFS directory path bdc/para.conf
parser.read("hdfs://clusternamenode:8020/bdc/para.conf")
This doesn't return anything, and neither does the below with extra escaping:
parser.read("hdfs:///clusternamenode:8020//bdc//para.conf")
Although using the Spark context I can read this file, which returns a valid RDD:
sc.textFile("hdfs://clusternamenode:8020/bdc/para.conf")
though I am not sure whether ConfigParser can extract the key values from this.
Can anyone advise if ConfigParser can be used to read files from HDFS? Or is there any alternative?
I have copied most of the code you have provided in the comments. You were really close to the solution. Your problem was that sc.textFile produces a row in the RDD for every newline character, so when you call .collect() you get a list of strings, one per line of your document. StringIO is not expecting a list; it is expecting a string, and therefore you have to restore the previous document structure from your list. See the working example below:
import ConfigParser
import StringIO
credstr = sc.textFile("hdfs://clusternamenode:8020/bdc/cre.conf").collect()
buf = StringIO.StringIO("\n".join(credstr))
parse_str = ConfigParser.ConfigParser()
parse_str.readfp(buf)
parse_str.get('tracker','port')
Output:
'9801'

How to count similar files in hdfs by groovy code?

I want to count files with similar filenames in an HDFS directory, but I don't want to use the File System API. Can you recommend something that would give me the result I get from the Groovy code below?
Is there any equivalent of the File System API which I can use for this task?
String filename = "2017-02-03";
def count = new File("file path").listFiles()
    .findAll { it.name.substring(0, 10) == filename }
    .size();

Specify Multiple Folders for Input in Pail Tap Hadoop Jobs

I am running a Hadoop MapReduce job using the Cascalog API. I want the job to take multiple input folders.
I have two folders in HDFS, rootPath/Folder_1 and rootPath/Folder_2, which contain files that are to be processed in a job.
I am giving the job the input folders through the PailTap function:
new PailTap(rootPath + "Folder_1",
    JcascalogUtils.getPailTapOptions());
Can I give multiple folders to the same job?
And can I give a wildcard folder path like rootPath + */ so that it will process all the folders in the rootPath folder?
Thank you for any help :)
You can use MultiSourceTap like this:
dataSource = new MultiSourceTap(
    new PailTap(rootPath + "Folder_1", JcascalogUtils.getPailTapOptions()),
    new PailTap(rootPath + "Folder_2", JcascalogUtils.getPailTapOptions())
);
Or use GlobHfs:
dataSource = new GlobHfs(new PailTap(rootPath, JcascalogUtils.getPailTapOptions()).getScheme(), rootPath + "*");

How to read gz files in Spark using wholeTextFiles

I have a folder which contains many small .gz files (compressed csv text files). I need to read them in my Spark job, but the thing is I need to do some processing based on info which is in the file name. Therefore, I did not use:
JavaRDD<String> input = sc.textFile(...)
since to my understanding I do not have access to the file name this way. Instead, I used:
JavaPairRDD<String, String> files_and_content = sc.wholeTextFiles(...);
because this way I get a pair of file name and the content.
However, it seems that this way the input reader fails to read the text from the gz files and instead reads binary gibberish.
So, I would like to know if I can set it to somehow read the text, or alternatively access the file name when using sc.textFile(...).
You cannot read gzipped files with wholeTextFiles because it uses CombineFileInputFormat, which cannot read gzipped files since they are not splittable (source proving it):
override def createRecordReader(
    split: InputSplit,
    context: TaskAttemptContext): RecordReader[String, String] = {

  new CombineFileRecordReader[String, String](
    split.asInstanceOf[CombineFileSplit],
    context,
    classOf[WholeTextFileRecordReader])
}
You may be able to use newAPIHadoopFile with WholeFileInputFormat (not built into Hadoop, but all over the internet) to get this to work correctly.
UPDATE 1: I don't think WholeFileInputFormat will work since it just gets the bytes of the file, meaning you may have to write your own class possibly extending WholeFileInputFormat to make sure it decompresses the bytes.
Another option would be to decompress the bytes yourself using GZIPInputStream.
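For example, here is a minimal Scala sketch of that option, assuming a Spark version that provides sc.binaryFiles (sc here is the Scala SparkContext, and the folder path is a placeholder):
import java.util.zip.GZIPInputStream
import scala.io.Source

// Read each .gz file as (file name, raw byte stream) and decompress it manually,
// keeping the file name so it can still drive the per-file processing.
val filesAndContent = sc.binaryFiles("hdfs:///path/to/gz/folder")
  .mapValues { stream =>
    val gzip = new GZIPInputStream(stream.open()) // decompress the raw bytes
    try Source.fromInputStream(gzip, "UTF-8").mkString
    finally gzip.close()
  }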
UPDATE 2: If you have access to the directory name, as in the OP's comment below, you can get all the files like this:
Path path = new Path("");
FileSystem fileSystem = path.getFileSystem(new Configuration()); //just uses the default one
FileStatus[] fileStatuses = fileSystem.listStatus(path);
ArrayList<Path> paths = new ArrayList<>();
for (FileStatus fileStatus : fileStatuses) paths.add(fileStatus.getPath());
I faced the same issue while using Spark to connect to S3.
My file was a gzipped CSV with no extension.
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile);
This approach returned corrupted values.
I solved it by using the below code:
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile+".gz");
By adding .gz to the S3 URL, Spark automatically picked up the file and read it as a gz file. (This seems like a wrong approach, but it solved my problem.)

How to transfer files between machines in Hadoop and search for a string using Pig

I have 2 questions:
I have a big file of records, a few million of them. I need to transfer this file from one machine to a Hadoop cluster machine. I guess there is no scp command in Hadoop (or is there?). How do I transfer files to the Hadoop machine?
Also, once the file is on my Hadoop cluster, I want to search for records which contain a specific string, say 'XYZTechnologies'. How do I do this in Pig? Some sample code would be great to give me a head start.
This is the very first time I am working on Hadoop/Pig. So please pardon me if it is a "too basic" question.
EDIT 1
I tried what Jagaran suggested and I got the following error:
2012-03-18 04:12:55,655 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " "(" "( "" at line 3, column 26.
Was expecting:
<QUOTEDSTRING> ...
Also, please note that I want to search for the string anywhere in the record, so I am reading the tab-separated record as one single column:
A = load '/user/abc/part-00000' using PigStorage('\n') AS (Y:chararray);
For your first question, I think that Guy has already answered it.
As for the second question: if you just want to search for records which contain a specific string, a bash script is better, but if you insist on Pig, this is what I suggest:
A = load '/user/abc/' using PigStorage(',') AS (Y:chararray);
B = filter A by CONTAINS(Y, 'XYZTechnologies');
store B into 'output' using PigStorage();
Keep in mind that PigStorage's default delimiter is tab, so put a delimiter that does not appear in your file.
Then you should write a UDF that returns a boolean for CONTAINS, something like:
public class Contains extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        return input.get(0).toString().contains(input.get(1).toString());
    }
}
I didn't test this, but this is the direction I would have tried.
For copying to Hadoop:
1. You can install the Hadoop client on the other machine and then run
hadoop dfs -copyFromLocal from the command line.
2. You could simply write Java code that uses the FileSystem API to copy to Hadoop (see the sketch below).
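For option 2, a minimal sketch using the Hadoop FileSystem API (shown in Scala here; the paths and the namenode address are placeholders, not taken from the question):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://namenode:8020") // assumed namenode address
val fs = FileSystem.get(conf)
// Equivalent of: hadoop dfs -copyFromLocal /local/path/records.txt /user/abc/records.txt
fs.copyFromLocalFile(new Path("/local/path/records.txt"), new Path("/user/abc/records.txt"))
fs.close()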
For Pig:
Assuming you know that field 2 may contain XYZTechnologies:
A = load '<input-hadoop-dir>' using PigStorage() as (X:chararray,Y:chararray);
-- There should not be "(" and ")" after 'matches'
B = Filter A by Y matches '.*XYZTechnologies.*';
STORE B into 'Hadoop=Path' using PigStorage();
Hi, you can use grep on the output of hadoop fs -text to find the specific string in the file.
For example, my file contains some data as follows:
Hi myself xyz. i like hadoop.
hadoop is good.
i am practicing.
So the Hadoop command is:
hadoop fs -text 'file name with path' | grep 'string to be found out'
Pig shell:
-- Load the file data into the pig relation
data = LOAD 'file with path' using PigStorage() as (text:chararray);
-- Find the required text
txt = FILTER data by ($0 MATCHES '.*string to be found out.*');
-- Display the data
dump txt; -- or use ILLUSTRATE txt;
-- Store it in another file
STORE txt into 'path' using PigStorage();
