Flink using S3AFileSystem does not read subfolders from S3 - hadoop

We are using Flink 1.2.0 with the suggested S3AFileSystem configuration. A simple streaming job works as expected when its source is a single folder within an S3 bucket.
The job runs without errors--but does not produce output--when its source is a folder which itself contains subfolders.
For clarity, below is a model of the S3 bucket. Running the job to point to s3a://bucket/folder/2017/04/25/01/ properly reads all three objects and any subsequent objects that appear in the bucket. Pointing the job to s3a://bucket/folder/2017/ (or any other intermediate folder) leads to a job that runs without producing anything.
In fits of desperation, we've tried the permutations that [in|ex]clude the trailing /.
.
`-- folder
    `-- 2017
        `-- 04
            |-- 25
            |   |-- 00
            |   |   |-- a.txt
            |   |   `-- b.txt
            |   `-- 01
            |       |-- c.txt
            |       |-- d.txt
            |       `-- e.txt
            `-- 26
Job code:
def main(args: Array[String]) {
  val parameters = ParameterTool.fromArgs(args)
  val bucket = parameters.get("bucket")
  val folder = parameters.get("folder")
  val path = s"s3a://$bucket/$folder"

  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val lines: DataStream[String] = env.readFile(
    inputFormat = new TextInputFormat(new Path(path)),
    filePath = path,
    watchType = FileProcessingMode.PROCESS_CONTINUOUSLY,
    interval = Time.seconds(10).toMilliseconds)

  lines.print()

  env.execute("Flink Streaming Scala API Skeleton")
}
core-site.xml is configured per the docs:
<configuration>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.buffer.dir</name>
    <value>/tmp</value>
  </property>
</configuration>
We have included all the jars for S3AFileSystem listed here: https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/aws.html#flink-for-hadoop-27
We are stumped. This seems like it should work; there are plenty of breadcrumbs on the internet that indicate that this did work. [e.g., http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Reading-files-from-an-S3-folder-td10281.html]
Help me, fellow squirrels... you're my only hope!

Answering my own question... with help from Steve Loughran above.
In Flink, when working with a file-based data source that is processed continuously, FileInputFormat does not enumerate nested files by default.
This is true whether the source is S3 or anything else.
You must set it like so:
def main(args: Array[String]) {
  val parameters = ParameterTool.fromArgs(args)
  val bucket = parameters.get("bucket")
  val folder = parameters.get("folder")
  val path = s"s3a://$bucket/$folder"

  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val textInputFormat = new TextInputFormat(new Path(path))

  // this is important!
  textInputFormat.setNestedFileEnumeration(true)

  val lines: DataStream[String] = env.readFile(
    inputFormat = textInputFormat,
    filePath = path,
    watchType = FileProcessingMode.PROCESS_CONTINUOUSLY,
    interval = Time.seconds(10).toMilliseconds)

  lines.print()

  env.execute("Flink Streaming Scala API Skeleton")
}

What version of Hadoop is under this?
If this has stopped working with Hadoop 2.8, it is probably a regression, meaning probably my fault. First file a JIRA at issues.apache.org under FLINK; then, if it's new in 2.8.0, link it as broken-by HADOOP-13208.
The code snippet here is a good example that could be used for a regression test, and it's time I did some for Flink.
That big listFiles() change moves the enumeration of files under a path from a recursive treewalk to a series of flat lists of all child entries under a path: it works fantastically for everything else (distcp, tests, Hive, Spark) and has been shipping in products since Dec '16. I'd be somewhat surprised if it is the cause, but I'm not denying blame. Sorry.
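To rule the listing behaviour in or out independently of Flink, it can help to check what the S3A client itself enumerates. Below is a minimal sketch, assuming a hypothetical bucket/prefix and that the Hadoop S3A jars and credentials are already on the classpath; listFiles with recursive = true exercises the flat-listing code path described above:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListS3aObjects {
  def main(args: Array[String]): Unit = {
    val prefix = new Path("s3a://bucket/folder/2017/")
    val fs = FileSystem.get(new URI("s3a://bucket/"), new Configuration())

    // recursive = true returns every object under the prefix as one flat listing,
    // regardless of how many "subfolder" levels sit in between.
    val it = fs.listFiles(prefix, true)
    while (it.hasNext) {
      println(it.next().getPath)
    }
  }
}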

As of Flink 1.7.x, Flink provides two file systems to talk to Amazon S3: flink-s3-fs-presto and flink-s3-fs-hadoop. Both register default FileSystem wrappers for URIs with the s3:// scheme; in addition, flink-s3-fs-hadoop registers for s3a:// and flink-s3-fs-presto registers for s3p://, so you can use both at the same time.
Sample code :
// Reading Data from S3
// This will print all the contents in the bucket line wise
final Path directory = new Path("s3a://husnain28may2020/");
final FileSystem fs = directory.getFileSystem();

// using input format
org.apache.flink.api.java.io.TextInputFormat textInputFormatS3 =
        new org.apache.flink.api.java.io.TextInputFormat(directory);
DataSet<String> linesS3 = env.createInput(textInputFormatS3);
linesS3.print();
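As a minimal illustration of the scheme registration described above (both plugins assumed to be available, the bucket name hypothetical, and the Presto wrapper assumed acceptable for reads in your setup), the same bucket can be read through either wrapper just by switching the URI scheme:

import org.apache.flink.streaming.api.scala._

object ReadViaBothS3Wrappers {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Served by flink-s3-fs-hadoop (which also handles plain s3://)
    val viaHadoop: DataStream[String] = env.readTextFile("s3a://my-bucket/input/")

    // Served by flink-s3-fs-presto
    val viaPresto: DataStream[String] = env.readTextFile("s3p://my-bucket/input/")

    viaHadoop.print()
    viaPresto.print()
    env.execute("read via both S3 wrappers")
  }
}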

Related

OraclePropertyGraphDataLoader loadData from HDFS

I'm using Spark+Hive to build graphs and relations and export flat OPV/OPE files to HDFS, one OPV/OPE CSV per reducer.
Our whole graph database is ready to be loaded on OPG/PGX for analytics, and that worked like a charm.
Now, we want to load those vertices/edges on Oracle Property Graph.
I've dumped the filenames from HDFS this way:
$ hadoop fs -find '/user/felipeferreira/dadossinapse/ops/*.opv/*.csv' | xargs -I{} echo 'hdfs://'{} > opvs.lst
$ hadoop fs -find '/user/felipeferreira/dadossinapse/ops/*.ope/*.csv' | xargs -I{} echo 'hdfs://'{} > opes.lst
And I'm experimenting on groovy shell with some issues and doubts:
opvs = new File('opvs.lst') as String[]
opes = new File('opes.lst') as String[]
opgdl.loadData(opg, opvs, opes, 72)
That doesn't work out of the box; I receive errors like
java.lang.IllegalArgumentException: loadData: part-00000-f97f1abf-5f69-479a-baee-ce0a7bcaa86c-c000.csv flat file does not exist
I'll manage that with an InputStream approach available in the loadData interface; hopefully that solves the problem, but I have some questions/suggestions:
Does loadData support vfs so I may load 'hdfs://...' files directly?
Wouldn't it be nice to have glob syntax in the filenames, so we could do something like:
opgdl.loadData(opg, 'hdfs:///user/felipeferreira/opvs/**/*.csv' ...
Thanks in advance!
You can use an alternate API from OraclePropertyGraphDataLoader where you can specify the InputStream objects for the OPV/OPE files used for loading. This way, you can use FSDataInputStream objects to read the files from your HDFS environment.
A small sample is the following:
// ====== Init HDFS File System Object
Configuration conf = new Configuration();
// Set FileSystem URI
conf.set("fs.defaultFS", hdfsuri);
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
// Set HADOOP user
System.setProperty("HADOOP_USER_NAME", "hdfs");
System.setProperty("hadoop.home.dir", "/");
// Get the filesystem - HDFS
FileSystem fs = FileSystem.get(URI.create(hdfsuri), conf);

// Read files into InputStreams using the HDFS FSDataInputStream Java APIs
Path pathOPV = new Path("/path/to/file.opv");
FSDataInputStream inOPV = fs.open(pathOPV);
Path pathOPE = new Path("/path/to/file.ope");
FSDataInputStream inOPE = fs.open(pathOPE);

cfg = GraphConfigBuilder.forPropertyGraphHbase().setName("sinapse").setZkQuorum("bda1node05,bda1node06").build()
opg = OraclePropertyGraph.getInstance(cfg)
opgdl = OraclePropertyGraphDataLoader.getInstance();
opgdl.loadData(opg, inOPV, inOPE, 100);
Let us know if this one works for you.
For the sake of tracking, here is the solution we've adopted:
Mounted HDFS through the NFS gateway on a folder below the groovy shell.
Exported the filenames to the OPV/OPE list-of-files:
$ find ../hadoop/user/felipeferreira/dadossinapse/ -iname "*.csv" | grep ".ope" > opes.lst
$ find ../hadoop/user/felipeferreira/dadossinapse/ -iname "*.csv" | grep ".opv" > opvs.lst
Then it was as simple as this to load the data on the opg/hbase:
cfg = GraphConfigBuilder.forPropertyGraphHbase().setName("sinapse").setZkQuorum("bda1node05,bda1node06").build()
opg = OraclePropertyGraph.getInstance(cfg)
opgdl = OraclePropertyGraphDataLoader.getInstance()
opvs = new File("opvs.lst") as String[]
opes = new File("opes.lst") as String[]
opgdl.loadData(opg, opvs, opes, 100)
This seems to get bottlenecked by the NFS gateway, but we will evaluate this next week.
Graph data loading is running just fine so far.
If anyone would suggest a better approach, please let me know!

How to recursively read Hadoop files from directory using Spark?

Inside the given directory I have many different folders and inside each folder I have Hadoop files (part_001, etc.).
directory
    -> folder1
        -> part_001...
        -> part_002...
    -> folder2
        -> part_001...
    ...
Given the directory, how can I recursively read the content of all folders inside this directory and load this content into a single RDD in Spark using Scala?
I found this, but it does not recursively enter sub-folders (I am using import org.apache.hadoop.mapreduce.lib.input):
var job: Job = null

try {
  job = Job.getInstance()
  FileInputFormat.setInputPaths(job, new Path("s3n://" + bucketNameData + "/" + directoryS3))
  FileInputFormat.setInputDirRecursive(job, true)
} catch {
  case ioe: IOException => ioe.printStackTrace(); System.exit(1);
}

val sourceData = sc.newAPIHadoopRDD(job.getConfiguration(), classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).values
I also found this web page that uses SequenceFile, but again I don't understand how to apply it to my case.
If you are using Spark, you can do this using wildcards as follows:
scala> sc.textFile("path/*/*")
sc is the SparkContext, which is initialized by default if you are using spark-shell; if you are creating your own program, you will have to instantiate a SparkContext yourself.
Be careful with the following flag:
scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
> res6: String = null
You should set this flag to true:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
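Putting the two pieces together, here is a minimal sketch (bucket and directory names are the hypothetical variables from the question, and the new-API TextInputFormat is used since it honours the recursive flag):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Enable recursive listing for Hadoop FileInputFormats, then read every part
// file under the top-level directory into a single RDD of lines.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
val lines = sc.newAPIHadoopFile(
    "s3n://" + bucketNameData + "/" + directoryS3,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text]
  ).map(_._2.toString)
lines.take(5).foreach(println)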
I have found that the parameters must be set in this way:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
connector_output=${basepath}/output/connector/*/*/*/*/*
works for me when I have a directory structure like
${basepath}/output/connector/2019/01/23/23/output*.dat
I didn't have to set any other properties, just used the following:
sparkSession.read().format("csv").schema(schema)
    .option("delimiter", "|")
    .load("/user/user1/output/connector/*/*/*/*/*");

Storing graphx vertices on HDFS and loading later

I create an RDD:
val verticesRDD: RDD[(VertexId, Long)] = vertices
I can inspect it and everything looks ok:
verticesRDD.take(3).foreach(println)
(4000000031043205,1)
(4000000031043206,2)
(4000000031043207,3)
I save this RDD to HDFS via:
verticesRDD.saveAsObjectFile("location/vertices")
I then try and read this file to make sure it worked:
val verticesRDD_check = sc.textFile("location/vertices")
This works fine, however when I try and inspect, something is wrong.
verticesRDD_check.take(2).foreach(println)
SEQ!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritablea��:Y4o�e���v������ur[Lscala.Tuple2;.���O��xp
srscala.Tuple2$mcJJ$spC�~��f��J _1$mcJ$spJ _2$mcJ$spxr
scala.Tuple2�}��F!�L_1tLjava/lang/Object;L_2q~xppp5���sq~pp5���sq~pp5���sq~pp5���sq~pp5���esq~pp5���hsq~pp5��୑sq~pp5���sq~pp5���q sq~pp5��ஓ
Is there an issue in how I save the RDD using saveAsObjectFile? Or is it reading via textFile?
Since you saved it with saveAsObjectFile, you need to read it back with objectFile rather than textFile, and specify the type:
val verticesRDD : RDD[(VertexId, Long)] = sc.objectFile("location/vertices")
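A minimal round-trip sketch under that assumption (paths and SparkContext hypothetical):

import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

// Save the vertex RDD as a SequenceFile of serialized objects...
val verticesRDD: RDD[(VertexId, Long)] = sc.parallelize(Seq((4000000031043205L, 1L), (4000000031043206L, 2L)))
verticesRDD.saveAsObjectFile("location/vertices")

// ...and read it back with objectFile, giving the element type explicitly.
// Reading the same path with sc.textFile just shows the raw SequenceFile bytes,
// which is the garbled output seen in the question.
val verticesRDD_check: RDD[(VertexId, Long)] = sc.objectFile[(VertexId, Long)]("location/vertices")
verticesRDD_check.take(2).foreach(println)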

Spark Streaming: HDFS

I can't get my Spark job to stream "old" files from HDFS.
If my Spark job is down for some reason (e.g. demo, deployment) while files keep being written/moved to the HDFS directory, I might skip those files once I bring the Spark Streaming job back up.
val hdfsDStream = ssc.textFileStream("hdfs://sandbox.hortonworks.com/user/root/logs")
hdfsDStream.foreachRDD(
rdd => logInfo("Number of records in this batch: " + rdd.count())
)
Output --> Number of records in this batch: 0
Is there a way for Spark Streaming to move the "read" files to a different folder, or do we have to program it manually? That way it would avoid reading already "read" files.
Is Spark Streaming the same as running the spark job (sc.textFile) in CRON?
As Dean mentioned, textFileStream uses the default of only using new files.
def textFileStream(directory: String): DStream[String] = {
  fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
}
So, all it is doing is calling this variant of fileStream
def fileStream[
    K: ClassTag,
    V: ClassTag,
    F <: NewInputFormat[K, V]: ClassTag
  ](directory: String): InputDStream[(K, V)] = {
  new FileInputDStream[K, V, F](this, directory)
}
And, looking at the FileInputDStream class we will see that it indeed can look for existing files, but defaults to new only:
newFilesOnly: Boolean = true,
So, going back into the StreamingContext code, we can see that there is an overload we can use by directly calling the fileStream method:
def fileStream[
    K: ClassTag,
    V: ClassTag,
    F <: NewInputFormat[K, V]: ClassTag
  ](directory: String, filter: Path => Boolean, newFilesOnly: Boolean): InputDStream[(K, V)] = {
  new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly)
}
So, the TL;DR is:
ssc.fileStream[LongWritable, Text, TextInputFormat](
  directory, FileInputDStream.defaultFilter, false).map(_._2.toString)
Are you expecting Spark to read files already in the directory? If so, this is a common misconception, one that took me by surprise. textFileStream watches a directory for new files to appear, then it reads them. It ignores files already in the directory when you start or files it's already read.
The rationale is that you'll have some process writing files to HDFS, then you'll want Spark to read them. Note that these files must appear atomically, e.g., they were slowly written somewhere else, then moved to the watched directory. This is because HDFS doesn't properly handle reading and writing a file simultaneously.
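A common way to get that atomic appearance is to write the file somewhere else on the same HDFS filesystem and rename it into the watched directory only once it is complete, since a rename within one HDFS namespace is atomic. A minimal sketch, with all paths hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PublishAtomically {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    val staging = new Path("/user/root/staging/log-001.txt") // written slowly here
    val watched = new Path("/user/root/logs/log-001.txt")    // directory the stream watches

    // The file becomes visible to the streaming job only at the moment of the rename.
    if (!fs.rename(staging, watched)) {
      sys.error(s"rename of $staging to $watched failed")
    }
  }
}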
val filterF = new Function[Path, Boolean] {
  def apply(x: Path): Boolean = {
    println("checking whether " + x + " should be considered or not")
    val flag =
      if (x.toString.split("/").last.split("_").last.toLong < System.currentTimeMillis) {
        // `list` is a mutable collection defined elsewhere that records the accepted paths
        println("considered " + x); list += x.toString; true
      } else {
        false
      }
    return flag
  }
}
This filter function is used to determine whether each path is actually one you want, so the logic inside apply should be customized to your requirements.
val streamed_rdd = ssc.fileStream[LongWritable, Text, TextInputFormat]("/user/hdpprod/temp/spark_streaming_output",filterF,false).map{case (x, y) => (y.toString)}
Now you have to set the third argument of the fileStream function to false; this makes sure that not only new files but also existing files in the streaming directory are considered.

Reading Distributed Files in Hadoop

I'm trying to do the following in Hadoop:
I have implemented a map-reduce job that outputs a file to directory "foo".
The foo files have a key=IntWritable, value=IntWritable format (I used a SequenceFileOutputFormat).
Now, I want to start another map-reduce job. The mapper is fine, but each reducer is required to read all the "foo" files at start-up (I'm using HDFS for sharing data between reducers).
I used this code in the "public void configure(JobConf conf)" method:
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i=0; i<status.length; ++i) {
Path currFile = status[i].getPath();
System.out.println("status: " + i + " " + currFile.toString());
try {
SequenceFile.Reader reader = null;
reader = new SequenceFile.Reader(fs, currFile, conf);
IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
IntWritable value = (IntWritable ) ReflectionUtils.newInstance(reader.getValueClass(), conf);
while (reader.next(key, value)) {
// do the code for all the pairs.
}
}
}
The code runs well on a single machine, but I'm not sure if it will run on a cluster.
In other words, does this code read files from the current machine, or does it read from the distributed file system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.
The URI passed to FileSystem.get() does not have a scheme defined, and hence the file system used depends on the configuration parameter fs.defaultFS. If none is set, the default, i.e. the local file system, will be used.
Your program writes to the local file system under workingDir/out/foo. It should work in the cluster as well, but it will look at the local file system.
That said, I'm not sure why you need all the files from the foo directory; you may want to consider other designs. If needed, these files should be copied to HDFS first and read from the overridden setup method of your reducer (and, needless to say, closed in the overridden cleanup method). While the files can be read in reducers, map/reduce programs are not designed for this kind of functionality.
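To make the choice of file system explicit rather than relying on fs.defaultFS, the path can carry its own scheme. A minimal sketch, with the NameNode address and path hypothetical:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListFooOnHdfs {
  def main(args: Array[String]): Unit = {
    // A fully qualified URI pins the access to HDFS regardless of fs.defaultFS.
    val uri = "hdfs://namenode:8020/user/arik/out/foo"
    val fs = FileSystem.get(URI.create(uri), new Configuration())

    fs.listStatus(new Path(uri)).foreach(status => println(status.getPath))
  }
}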
