How to use AvroParquetReader inside a Flink application?

I am having trouble using AvroParquetReader inside a Flink application (Flink >= 1.15).
Motivation (AKA why I want to use it)
According to the official docs, one can read Parquet files in Flink into a FileSource. However, I only want to write a function that loads a Parquet file into Avro records without creating a DataStreamSource. In particular, I want to load Parquet files into a FileInputFormat, which is a completely separate API (for some odd reason). (And digging one level deeper, I could not see an easy way to cast a BulkFormat or StreamFormat into it.)
Therefore, it would be much simpler to use org.apache.parquet.avro.AvroParquetReader to read the files directly.
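Concretely, the kind of helper I have in mind looks roughly like this (a minimal sketch of plain AvroParquetReader usage; the class name, path handling, and use of GenericRecord are just illustrative):

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ParquetToAvro {
    // Load every row of a Parquet file into Avro GenericRecords, outside of any Flink source.
    public static List<GenericRecord> load(String path) throws IOException {
        // Filesystem settings (e.g. s3a credentials) come from this Hadoop Configuration / core-site.xml.
        Configuration conf = new Configuration();
        List<GenericRecord> records = new ArrayList<>();
        try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(
                HadoopInputFile.fromPath(new Path(path), conf)).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                records.add(record);
            }
        }
        return records;
    }
}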
Error description
However, I hit this error when running the Flink application locally: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
This is quite unexpected, since the flink-s3-fs-hadoop jar has already been loaded via the plugin system (and its path has been added to HADOOP_CLASSPATH as well). So not only does Flink know where it is, the local Hadoop should as well.
Comments:
Without this AvroParquetReader, the Flink app can write to S3 without problem.
The Hadoop installation is not a Flink-shaded one, but a separate install of version 2.10.
Would love to hear if you have any insights about this.
AvroParquetReader should be able to read the Parquet files without problem.

There is an official Hadoop guide with some potential fixes for the issue, which can be found here. If I recall correctly, this issue was caused by some missing Hadoop AWS dependencies.
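A quick way to confirm whether that is the problem (my own diagnostic sketch, not something from the guide) is to check whether S3AFileSystem is visible to the application classloader at all, and where it was loaded from. Flink's flink-s3-fs-hadoop plugin lives in an isolated plugin classloader, so code that calls AvroParquetReader directly only sees whatever hadoop-aws jar is on the application/Hadoop classpath:

public class S3aClasspathCheck {
    public static void main(String[] args) {
        try {
            Class<?> c = Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem");
            // Prints the jar the class was loaded from, e.g. a hadoop-aws jar on the classpath.
            System.out.println("S3AFileSystem loaded from: "
                    + c.getProtectionDomain().getCodeSource().getLocation());
        } catch (ClassNotFoundException e) {
            System.out.println("S3AFileSystem is not on this classloader's classpath; "
                    + "add hadoop-aws (and the matching AWS SDK bundle) to the application dependencies.");
        }
    }
}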

Related

Magic committer not improving performance in a Spark3+Yarn3+S3 setup

What I am trying to achieve
I am trying to enable the S3A magic committer for my Spark 3.3.0 application running on a YARN (Hadoop 3.3.1) cluster, to see performance improvements during S3 writes. IIUC, my Spark application writes about 21 GB of data with 30 tasks in the corresponding Spark stage (as shown in the Spark UI for that stage).
My setup
I have a server which hosts the Spark client. The Spark client submits the application to the YARN cluster in client mode via PySpark.
What I tried
I am using the following config (setting via PySpark Spark-conf) to enable the committer:
"spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
"spark.sql.parquet.output.committer.class": "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"
"spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a": "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
"spark.hadoop.fs.s3a.committer.name": "magic"
"spark.hadoop.fs.s3a.committer.magic.enabled": "true"
I also downloaded the spark-hadoop-cloud jar into the jars/ directory of the Spark home on the NodeManagers and on my Spark client server.
Changes that I see after applying the aforementioned configs:
I see a PRE __magic/ entry when I run aws s3 ls <write-path> while the job is running.
I don't see the warning WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe. anymore.
A _SUCCESS file gets created with (JSON) content. One of the key-value pairs that I see in that file is "committer" : "magic".
Hence, I believe my configs are getting applied correctly.
What I expect
I have read in multiple articles that this committer is expected to give a performance boost (e.g. this article claims a 57-77% time reduction). Hence, I expect to see a significant reduction (from 39s) in the "duration" column of my "parquet" stage when I use the configs shared above.
Some other points that might be of value
When I use "spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol", my app fails with the error java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol.
I have not looked into enabling S3Guard, as S3 now provides strong consistency.
Correct, you don't need S3Guard.
The com.hortonworks binding was for the WIP committer work; the binding classes for wiring up Spark/Parquet are all in spark-hadoop-cloud and have org.apache.spark prefixes. You seem to be OK there.
The simple test for which committer is live is to print the JSON _SUCCESS file. If that is a 0-byte file, you are still using the old committer; given the JSON content you describe, it doesn't sound like you are.
Grab the latest Spark + Hadoop build you can get; there are always ongoing improvements, with Hadoop 3.3.5 bringing a big enhancement there.
You should see performance improvements compared to the v1 committer, with commit speed O(files) rather than O(data). It is also correct, which the v1 algorithm doesn't offer on S3 (and which v2 doesn't offer anywhere).
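For reference, here is a rough Java sketch of that check (my own illustration, not part of the answer), reading the _SUCCESS file through the Hadoop FileSystem API and printing either the zero-byte verdict or the committer's JSON summary:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SuccessFileCheck {
    public static void main(String[] args) throws Exception {
        // args[0] is the job output directory, e.g. s3a://bucket/output
        Path success = new Path(args[0], "_SUCCESS");
        FileSystem fs = success.getFileSystem(new Configuration());
        FileStatus status = fs.getFileStatus(success);
        if (status.getLen() == 0) {
            System.out.println("Zero-byte _SUCCESS: the classic FileOutputCommitter wrote this output.");
        } else {
            try (FSDataInputStream in = fs.open(success)) {
                byte[] data = new byte[(int) status.getLen()];
                in.readFully(data);
                // Look for "committer" : "magic" in the JSON summary.
                System.out.println(new String(data, StandardCharsets.UTF_8));
            }
        }
    }
}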

Run Pig with Lipstick on AWS EMR

I'm running an AWS EMR Pig job using script-runner.jar as described here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html
Now, I want to hook up Netflix's Lipstick to monitor my scripts. I set up the server, but from the wiki here: https://github.com/Netflix/Lipstick/wiki/Getting-Started I can't quite figure out how to do the last step:
hadoop jar lipstick-console-[version].jar -Dlipstick.server.url=http://$LIPSTICK_URL
Should I substitute script-runner.jar with this?
Also, after following the build process in the wiki, I ended up with 3 different console jars:
lipstick-console-0.6-SNAPSHOT.jar
lipstick-console-0.6-SNAPSHOT-withHadoop.jar
lipstick-console-0.6-SNAPSHOT-withPig.jar
What is the purpose of the latter two jars?
UPDATE:
I think I'm making progress, but it still does not seem to work.
I set the pig.notification.listener parameter as described here, along with the Lipstick server URL. There is more than one way to do this in EMR; since I am using the Ruby API, I had to specify a step:
hadoop_jar_step:
  jar: 's3://elasticmapreduce/libs/script-runner/script-runner.jar'
  properties:
    - pig.notification.listener.arg: com.netflix.lipstick.listeners.LipstickPPNL
    - lipstick.server.url: http://pig_server_url
Next, I added lipstick-console-0.6-SNAPSHOT.jar to hadoop classpath. For this, I had to create a bootstrap action as follows:
bootstrap_actions:
  - name: copy_lipstick_jar
    script_bootstrap_action:
      path: #s3 path to bootstrap_lipstick.sh
where the contents of bootstrap_lipstick.sh are:
#!/bin/bash
# Copy the Lipstick console jar from S3 onto the node, into a directory
# that is already on the Hadoop classpath on EMR.
hadoop fs -copyToLocal s3n://wp-data-west-2/load_code/java/lipstick-console-0.6-SNAPSHOT.jar /home/hadoop/lib/
The bootstrap action copies the lipstick jar to cluster nodes, and /home/hadoop/lib/ is already in hadoop classpath (EMR takes care of that).
It still does not work, but I think I am missing something really minor ... Any ideas appreciated.
Thanks!
Currently, Lipstick's Main class is a drop-in replacement for Pig's Main class. This is a hack (and far from ideal) to get access to the logical and physical plans for your script, before and after optimization, which are simply not accessible otherwise. As such, it unfortunately won't work to just register the LipstickPPNL class as a PPNL for Pig. You've got to run Lipstick Main as though it were Pig.
I have not tried to run lipstick on EMR but it looks like you're going to need to use a custom jar step, not a script step. See the docs here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-launch-custom-jar-cli.html
The jar to use would be lipstick-console-0.6-SNAPSHOT-withHadoop.jar; it contains all the necessary dependencies to run Lipstick. Additionally, lipstick.server.url will need to be set.
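In other words, the custom jar step would boil down to running something like the following (an untested sketch extending the command from the wiki; the script path is made up, and -f is just the standard Pig option for pointing at a script):

hadoop jar lipstick-console-0.6-SNAPSHOT-withHadoop.jar -Dlipstick.server.url=http://$LIPSTICK_URL -f s3://your-bucket/your-script.pig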
Alternatively, you might take a look at https://www.mortardata.com/ which runs on EMR and has lipstick integration built-in.

how to get multipleOutput in hadoop

I'm new to Hadoop and now have to process an input file. I want to process each line, and the output should be one file per line.
I surfed the internet and found MultipleOutputFormat and its generateFileNameForKeyValue method.
But most people write it with the JobConf class. As I'm using Hadoop 0.20.1, I think the Job class takes its place, and I don't know how to use the Job class to generate multiple output files by key.
Could anyone help me?
The Eclipse plugin is mainly used to submit and monitor jobs, as well as to interact with HDFS, against a real or 'pseudo' cluster.
If you're running in local mode, then I don't think the plugin gains you anything, seeing as your job will run in a single JVM. With this in mind, I would say include the most recent 1.x hadoop-core in your Eclipse project's classpath.
Either way, MultipleOutputFormat has not been ported to the new mapreduce package (neither in 1.1.2 nor 2.0.4-alpha), so you'll either need to port it yourself or find another way (maybe MultipleOutputs; the Javadoc page has some usage notes for MultipleOutputs).
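For what it's worth, a rough sketch of the new-API MultipleOutputs (the names, key/value types, and per-key file naming are just illustrative, and it assumes a release where org.apache.hadoop.mapreduce.lib.output.MultipleOutputs exists, e.g. the 1.x line):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PerKeyReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> outputs;

    // Call this from the driver to register the named output.
    public static void configure(Job job) {
        MultipleOutputs.addNamedOutput(job, "lines",
                TextOutputFormat.class, NullWritable.class, Text.class);
    }

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // The fourth argument is the base output path, so each key lands in its own file,
            // e.g. <output-dir>/<key>-r-00000.
            outputs.write("lines", NullWritable.get(), value, key.toString());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        outputs.close();
    }
}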

Hadoop streaming with zip input files

I'm trying to run a streaming job where the input files are csv inside zip files.
I tried using this; however, it doesn't seem to work with CDH4 (I get the error class com.cotdp.hadoop.ZipFileInputFormat not org.apache.hadoop.mapred.InputFormat).
Does anyone know of an input file reader I can use for streaming with zip files? If possible, I'm looking for a multi-file reader (one that can be given the top-level directory).
I ended up writing zipstream.
Note that it processes only the first file in the zip; I'll probably add support for multiple files later.
There are two Hadoop APIs for input formats: mapred.InputFormat and mapreduce.InputFormat.
mapreduce is the newer API and the one you should be using if you can.
I would check which InputFormat the ZipFileInputFormat actually implements. If it implements the mapreduce version, you'll need to move your job over to this second API.
For a bit of background: in an earlier Hadoop version, 'mapred' was deprecated in favor of 'mapreduce', a newer, faster, and cleaner implementation. Unfortunately this new API didn't include all the features of the old one, so in more recent versions of Hadoop 'mapred' was reinstated, and now there are two APIs that basically do the same thing.
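To make the package difference concrete, here is a minimal sketch (the class names are mine) of what the two base classes look like; the streaming error in the question is the old-API check failing because the zip format extends the new-API base class:

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

// Old API: what Hadoop Streaming's -inputformat option expects.
class OldApiZipInputFormat extends org.apache.hadoop.mapred.FileInputFormat<Text, BytesWritable> {
    @Override
    public org.apache.hadoop.mapred.RecordReader<Text, BytesWritable> getRecordReader(
            org.apache.hadoop.mapred.InputSplit split,
            org.apache.hadoop.mapred.JobConf job,
            org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        throw new UnsupportedOperationException("reader omitted from this sketch");
    }
}

// New API: what a format such as com.cotdp.hadoop.ZipFileInputFormat extends.
class NewApiZipInputFormat extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<Text, BytesWritable> {
    @Override
    public org.apache.hadoop.mapreduce.RecordReader<Text, BytesWritable> createRecordReader(
            org.apache.hadoop.mapreduce.InputSplit split,
            org.apache.hadoop.mapreduce.TaskAttemptContext context) {
        throw new UnsupportedOperationException("reader omitted from this sketch");
    }
}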

Writing MetaData inside HDFS

We are using Nutch to crawl our intranet site.
We are extracting the metadata into an XML file in the indexing phase (we modified the code of Indexer.java), and when run in local mode it gave us the required metadata.
Now we have moved to running Nutch in cluster mode (using Hadoop). When we crawl on the cluster, we are able to get the index but not the metadata we used to get previously. In local mode we used Java's IO classes to write the metadata to files; for Hadoop we changed this to the Hadoop FileSystem IO classes, yet we still do not get the metadata.
Is there any solution, or are we missing something?
Thanks in advance,
Geo
We are extracting the metadata into an XML file in the indexing phase (we modified the code of Indexer.java), and when run in local mode it gave us the required metadata.
Modifying the indexer is not the best option, as illustrated by the issue you've encountered.
You could:
add the metadata as part of the injection (if you want to do that for the seeds only)
or write a custom indexing plugin and, for example, get it to load the XML metadata from a file in conf/ (see the sketch below)
The content of conf/ is added to the job file and is distributed across the nodes of the cluster. There are quite a few examples of indexing plugins in the code.
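As a concrete illustration of the conf/ idea (a sketch only; the file name and class are hypothetical, and the exact plugin interface depends on your Nutch version), the plugin can pull the XML out of the job's configuration resources like this:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;

public class MetadataResource {
    // Reads an XML file that was dropped into Nutch's conf/ directory.
    // conf/ is packaged into the job file, so this also works on the cluster,
    // unlike writing to the local filesystem from inside a map/reduce task.
    public static InputStream open(Configuration conf) {
        return conf.getConfResourceAsInputStream("custom-metadata.xml");
    }
}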
Maybe you should use the Nutch user list to get a broader audience?
