Amazon S3N integration with Hadoop MapReduce not working - hadoop

I am trying to run a MapReduce job over files that are stored in Amazon S3. I saw http://wiki.apache.org/hadoop/AmazonS3 and followed it to do the integration. Here is my code, which sets the input directory for the MapReduce job:
FileInputFormat.setInputPaths(job, "s3n://myAccessKey:mySecretKey#myS3Bucket/dir1/dir2/*.txt");
When I run the MapReduce job I get this exception:
Exception in thread "main" java.lang.IllegalArgumentException:
Wrong FS: s3n://myAccessKey:mySecretKey#myS3Bucket/dir1/dir2/*.txt,
expected: s3n://myAccessKey:mySecretKey#myS3Bucket
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:294)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:352)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:321)
at com.appdynamics.blitz.hadoop.migration.DataMigrationManager.convertAndLoadData(DataMigrationManager.java:340)
at com.appdynamics.blitz.hadoop.migration.DataMigrationManager.migrateData(DataMigrationManager.java:300)
at com.appdynamics.blitz.hadoop.migration.DataMigrationManager.migrate(DataMigrationManager.java:166)
at com.appdynamics.blitz.command.DataMigrationCommand.run(DataMigrationCommand.java:53)
at com.appdynamics.blitz.command.DataMigrationCommand.run(DataMigrationCommand.java:21)
at com.yammer.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:58)
at com.yammer.dropwizard.cli.Cli.run(Cli.java:53)
at com.yammer.dropwizard.Service.run(Service.java:61)
at com.appdynamics.blitz.service.BlitzService.main(BlitzService.java:84)
I can't find any resources to help me with this. Any pointers would be deeply appreciated.

You're just going to have to keep playing with
Wrong FS: s3n://myAccessKey:mySecretKey#myS3Bucket/dir1/dir2/*.txt
The path you're giving Hadoop just isn't correct, and the job won't work until it can access the correct files.

So I found the problem. It was caused by this bug:
https://issues.apache.org/jira/browse/HADOOP-3733
Even though I replaced the "/" with "%2F", it kept giving the same problem. I regenerated the keys until I got one with no "/" in the secret key, and that fixed the issue.
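A related workaround (my own sketch, not part of the original answer): put the credentials into the job Configuration through the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties instead of embedding them in the URI, so the secret key never has to be URL-escaped at all:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Configuration conf = new Configuration();
conf.set("fs.s3n.awsAccessKeyId", "myAccessKey");       // placeholder values
conf.set("fs.s3n.awsSecretAccessKey", "mySecretKey");

Job job = Job.getInstance(conf, "s3n-input-job");
// The input path no longer has to carry the credentials.
FileInputFormat.setInputPaths(job, new Path("s3n://myS3Bucket/dir1/dir2/*.txt"));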

Related

java.lang.StackOverflowError when writing a DataFrame into PostgreSQL using JDBC

I'm trying to write the result of multiple operations into an AWS Aurora PostgreSQL cluster. All the calculations run fine, but when I try to write the result into the database I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o12179.jdbc.
: java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
I already tried increasing the cluster size (15 r4.2xlarge machines), changing the number of partitions for the data to 120, and changing the executor and driver memory to 4 GB each, and I'm still facing the same results.
The current SparkSession configuration is the following:
spark = pyspark.sql.SparkSession\
    .builder\
    .appName("profile")\
    .config("spark.sql.shuffle.partitions", 120)\
    .config("spark.executor.memory", "4g")\
    .config("spark.driver.memory", "4g")\
    .getOrCreate()
I don't know if it is a Spark configuration problem or a programming problem.
Finally I found the problem.
The problem was an iterative read from S3 that built up a really big DAG. I changed the way I read CSV files from S3 to the following:
df = spark.read\
    .format('csv')\
    .option('header', 'true')\
    .option('delimiter', ';')\
    .option('mode', 'DROPMALFORMED')\
    .option('inferSchema', 'true')\
    .load(list_paths)
Where list_paths is a precalculated list of paths to S3 objects.

Error in pig script while processing large file

I am trying to split a large file (15 GB) into multiple smaller files based on a key column inside the file. The same code works fine if I run it on a few thousand rows.
My code is as below.
REGISTER /home/auto/ssachi/piggybank-0.16.0.jar;
input_dt = LOAD '/user/ssachi/sywr_sls_ln_ofr_dtl/sywr_sls_ln_ofr_dtl.txt-10' USING PigStorage(',');
STORE input_dt into '/user/rahire/sywr_sls_ln_ofr_dtl_split' USING org.apache.pig.piggybank.storage.MultiStorage('/user/rahire/sywr_sls_ln_ofr_dtl_split','4','gz',',');
The error is as below:
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 6015: During execution, encountered a Hadoop error.
HadoopVersion 2.6.0-cdh5.8.2
PigVersion 0.12.0-cdh5.8.2
I tried setting the below parameters assuming it is a memory issue, but it did not help.
SET mapreduce.map.memory.mb 16000;
SET mapreduce.map.java.opts 14400;
With the above parameters set, I got the below error.
Container exited with a non-zero exit code 1
org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1486048646102_2613_m_000066_3 Info:Exception from container-launch.
What's the cardinality of your "key column"? Is it in the thousands?
If it is in the thousands, then you will get this error because your mappers are dying of an OOME (OutOfMemoryError).
Understand that each mapper now maintains ~1000 file pointers, plus an associated buffer for each file pointer, which is enough to occupy the whole of your heap.
Can you please provide the logs of your mappers for further investigation?
It is MultipleOutputs in MapReduce that is being called internally (by MultiStorage):
http://bytepadding.com/big-data/map-reduce/multipleoutputs-in-map-reduce/
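For reference, here is a rough reducer-side sketch of the MultipleOutputs pattern the linked page describes (my own illustration, not code from that article; class and output names are made up). The point is that one writer stays open per distinct key, which is where the heap goes:
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitByKeyReducer extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // One output file per key value; every open file costs task memory.
            out.write(NullWritable.get(), value, key.toString());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}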

Hadoop passing variables from reducer to main

I am working on a MapReduce program. I'm trying to pass parameters to the context configuration in the reduce method using the setLong method, and then read them in main after the job completes.
In the reducer:
context.getConfiguration().setLong(key, someLong);
In main, after the job completes, I try to read it using:
long val = job.getConfiguration().getLong(key, -1);
but I always get -1.
When I try reading it inside the reducer, I see that the value is set and I get the correct answer.
Am I missing something?
Thank you
You can use counters: set and update their value in the reducers, and then access them in your client application (main).
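For example, roughly (the group and counter names here are just illustrative):
// In the reducer: increment a counter in a user-defined group.
context.getCounter("myJob", "someLong").increment(someLong);

// In main, after job.waitForCompletion(true) returns:
long val = job.getCounters().findCounter("myJob", "someLong").getValue();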
You can pass configuration from main to the map and reduce tasks, but you cannot pass it back. The way configuration is propagated is:
A configuration file is generated on the MapReduce client from the configuration you set in main, and it is pushed to an HDFS path shared only by the job. That file is read-only.
When a map or reduce task is launched, the configuration file is pulled from that HDFS path, and the task initializes its configuration from the file.
If you want to pass information back, you can use another HDFS file: write it from the reducer, and read it after the job completes.
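A rough sketch of that write-back idea (the path is hypothetical, and it assumes a single reducer; with several reducers each should write to its own file):
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// In the reducer (e.g. in cleanup()): write the value to a job-specific HDFS file.
FileSystem fs = FileSystem.get(context.getConfiguration());
try (FSDataOutputStream out = fs.create(new Path("/tmp/myjob/result"), true)) {
    out.writeLong(someLong);
}

// In main, after the job completes: read it back.
FileSystem fs2 = FileSystem.get(job.getConfiguration());
try (FSDataInputStream in = fs2.open(new Path("/tmp/myjob/result"))) {
    long val = in.readLong();
}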

Create Snapshot of FS from Spark Job

I would like to create a snapshot of the underlying HDFS while running a Spark job. The particular step involves deleting the contents of some Parquet files. I want to create a snapshot, perform the delete operation, verify the operation's results, and proceed with the next steps.
However, I am unable to find a good way to access the HDFS API from my Spark job. The directory I want to snapshot is tagged/marked snapshottable in HDFS. The command-line method of creating the snapshot works; however, I need to do this programmatically.
I am running Spark 1.5 on CDH 5.5.
Any hints or clues as to how I can perform this operation?
Thanks
Ramdev
I have not verified this, but at least I do not get compile errors, and in theory this solution should work.
This is Scala code:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf())
val fs = FileSystem.get(sc.hadoopConfiguration)
// The directory must already be snapshottable (hdfs dfsadmin -allowSnapshot <dir>).
val snapshotDir = new Path("/path/to/create/snapshot/of")
val snapshotPath = fs.createSnapshot(snapshotDir, "snapshotName")
// ..... perform the delete and verify the result here .....
val conditionSatisfied = true // placeholder for your verification check
if (conditionSatisfied) {
  // deleteSnapshot takes the snapshottable directory and the snapshot name.
  fs.deleteSnapshot(snapshotDir, "snapshotName")
}
I assume this will work in theory.

Error in Hadoop MapReduce program

I am trying to write data from HBase to HDFS and encountered this error at compile time. Is it a problem with the reducer code or something else?
HbaseFile.java:36: setReducerClass(java.lang.Class) in org.apache.hadoop.mapreduce.Job cannot be applied to (java.lang.Class)
job.setReducerClass(CountWordReducer.class);
^
HbaseFile.java:38: setOutputPath(org.apache.hadoop.mapred.JobConf,org.apache.hadoop.fs.Path) in org.apache.hadoop.mapred.FileOutputFormat cannot be applied to (org.apache.hadoop.mapreduce.Job,org.apache.hadoop.fs.Path)
FileOutputFormat.setOutputPath(job, new Path(args[0]));
From the packages you are using, you are mixing the old and new APIs. To fix this problem you will have to pick one and use it consistently.
Notice that your Job is from the new API, org.apache.hadoop.mapreduce.Job, but you're trying to use the old API to set the output path; I can tell because that FileOutputFormat takes the old org.apache.hadoop.mapred.JobConf.
If you see "org.apache.hadoop.mapreduce" and "org.apache.hadoop.mapred" in your code at the same time, you are probably mixing the APIs and should change things around to pick just one.
