MLlib ALS cannot delete checkpoint RDDs: Wrong FS: hdfs://[url] expected: file:///

I am using Spark MLlib's ALS class to train a MatrixFactorizationModel. I have set up HDFS for checkpointing intermediate RDDs (as suggested by the ALS class). The RDDs are being saved, but I get an exception when Spark tries to delete them again: java.lang.IllegalArgumentException: Wrong FS: hdfs://[url], expected: file:///
Here is the stack trace:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://[url], expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)
at org.apache.spark.ml.recommendation.ALS$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(ALS.scala:568)
at org.apache.spark.ml.recommendation.ALS$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(ALS.scala:566)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.ml.recommendation.ALS$$anonfun$2.apply$mcV$sp(ALS.scala:566)
at org.apache.spark.ml.recommendation.ALS$$anonfun$train$1.apply$mcVI$sp(ALS.scala:602)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:596)
at org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:219)
at adapters.ALSAdapter.run(ALSAdapter.java:59) ...
The culprit seems to be:
org.apache.spark.ml.recommendation.ALS (ALS.scala, line 568):
FileSystem.get(sc.hadoopConfiguration).delete(new Path(file), true)
which appears to return a RawLocalFileSystem instead of a distributed filesystem object.
I have not touched sc.hadoopConfiguration. The only interaction I've had with it is calling myJavaStreamingContext.checkpoint("hdfs://[url + directory]");.
Is there anything further I need to do client-side to set up sc.hadoopConfiguration, or could the problem be on the HDFS server side?
I was using Spark 1.3.1, and the problem persists after trying 1.4.1.

Your sc.hadoopConfiguration is using the local filesystem ("fs.default.name" = file:///), and despite the checkpoint location being specified explicitly, the ALS code uses sc.hadoopConfiguration to determine the filesystem (e.g. for deleting previous checkpoints, where applicable).
You can try:
sc.hadoopConfiguration.set("fs.default.name", "hdfs://xx:port")
sc.setCheckpointDir(yourCheckpointDir)
and then call checkpoint with no explicit filesystem path, or see whether setCheckpointInterval works for you (i.e. let ALS take care of checkpointing).
If you don't want to call sc.hadoopConfiguration.set in code, you can add hdfs-site.xml and core-site.xml to the Spark classpath instead.
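For reference, a minimal sketch of that setup in Scala; the namenode URL, checkpoint directory, and ALS parameters are placeholders, and it assumes the MLlib ALS builder with the setCheckpointInterval mentioned above.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

def trainWithHdfsCheckpoints(sc: SparkContext, ratings: RDD[Rating]): MatrixFactorizationModel = {
  // Point the Hadoop configuration at HDFS so ALS's checkpoint cleanup resolves paths
  // against the distributed filesystem instead of RawLocalFileSystem.
  // ("fs.defaultFS" is the newer name for the same key.)
  sc.hadoopConfiguration.set("fs.default.name", "hdfs://namenode:8020") // placeholder URL
  // Let Spark own the checkpoint directory; no scheme is needed once fs.default.name points at HDFS.
  sc.setCheckpointDir("/checkpoints/als") // placeholder directory
  // setCheckpointInterval lets ALS checkpoint (and later delete) intermediate RDDs itself.
  new ALS()
    .setRank(10)
    .setIterations(20)
    .setCheckpointInterval(10)
    .run(ratings)
}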

Related

HBase HFile Corruption on AWS S3

I am running HBase on an EMR cluster (emr-5.7.0) backed by S3.
We are using the 'ImportTsv' and 'CompleteBulkLoad' utilities to import data into HBase.
During this process, we have intermittently observed failures reporting HFile corruption for some of the imported files. It happens sporadically and we could not deduce any pattern in the errors.
After a lot of research and going through many suggestions in blogs, I tried the fixes below, to no avail; we are still facing the problem.
Tech Stack:
AWS EMR Cluster (emr-5.7.0 | r3.8xlarge | 15 nodes)
AWS S3
HBase 1.3.1
Data Volume:
~960,000 lines (to be upserted) | ~7 GB TSV file
Commands used in sequence:
1) hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="|" -Dimporttsv.columns="<Column Names (472 Columns)>" -Dimporttsv.bulk.output="<HFiles Path on HDFS>" <Table Name> <TSV file path on HDFS>
2) hadoop fs -chmod 777 <HFiles Path on HDFS>
3) hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <HFiles Path on HDFS> <Table Name>
Fixes Tried:
Increasing S3 Max Connections:
We increased the property below, but it did not seem to resolve the issue. fs.s3.maxConnections: values tried -- 10000, 20000, 50000, 100000.
HBase Repair:
Another approach was to execute the HBase repair command, but it didn't seem to help either.
Command: hbase hbck -repair
Error Trace is as below:
[LoadIncrementalHFiles-17] mapreduce.LoadIncrementalHFiles: Received a CorruptHFileException from region server: row '00218333246' on table 'WB_MASTER' at region=WB_MASTER,00218333246,1506304894610.f108f470c00356217d63396aa11cf0bc., hostname=ip-10-244-8-74.ec2.internal,16020,1507907710216, seqNum=198
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file s3://wbpoc-landingzone/emrfs_test/wb_hbase_compressed/data/default/WB_MASTER/f108f470c00356217d63396aa11cf0bc/cf/2a9ecdc5c3aa4ad8aca535f56c35a32d_SeqId_200_
at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:497)
at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:525)
at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1170)
at org.apache.hadoop.hbase.regionserver.StoreFileInfo.open(StoreFileInfo.java:259)
at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:427)
at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:528)
at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:518)
at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:667)
at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:659)
at org.apache.hadoop.hbase.regionserver.HStore.bulkLoadHFile(HStore.java:799)
at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:5574)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.bulkLoadHFile(RSRpcServices.java:2034)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34952)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2339)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
Caused by: java.io.FileNotFoundException: File not present on S3
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem$NativeS3FsInputStream.read(S3NativeFileSystem.java:203)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:391)
at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:482)
Any suggestions for figuring out the root cause of this issue would be really helpful.
Appreciate your help! Thank you!
After much research and trial and error, I was finally able to find a resolution for this issue, thanks to the AWS support folks. It seems the issue occurs as a result of S3's eventual consistency. The AWS team suggested using the property below, and it worked like a charm; so far we haven't hit the HFile corruption issue again. Hope this helps if someone is facing the same problem!
Property (hbase-site.xml):
hbase.bulkload.retries.retryOnIOException : true
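For reference, the corresponding hbase-site.xml stanza would look like this (standard Hadoop-style configuration XML):

<property>
  <name>hbase.bulkload.retries.retryOnIOException</name>
  <value>true</value>
</property>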

Required field 'uncompressed_page_size' was not found in serialized data! Parquet

I am getting the error below while trying to save a Parquet file from a local directory using PySpark.
I tried Spark 1.6 and 2.2; both give the same error.
It displays the schema properly but fails at the time of writing the file.
base_path = "file:/Users/xyz/Documents/Temp/parquet"
reg_path = "file:/Users/xyz/Documents/Temp/parquet/ds_id=48"
df = sqlContext.read.option( "basePath",base_path).parquet(reg_path)
out_path = "file:/Users/xyz/Documents/Temp/parquet/out"
df2 = df.coalesce(5)
df2.printSchema()
df2.write.mode('append').parquet(out_path)
org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
In my own case, I was writing a custom Parquet parser for Apache Tika and experienced this error. It turned out that if the file is being used by another process, the ParquetReader cannot read uncompressed_page_size, which causes the error.
Verify that no other process is holding on to the file.
Temporarily resolved by the Spark config:
"spark.sql.hive.convertMetastoreParquet": "false"
It has extra cost, but it is a workaround for now.
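A minimal sketch of applying that setting, written in Scala for consistency with the other sketches here (the app name is a placeholder); from PySpark, sqlContext.setConf takes the same two arguments.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Works on Spark 1.6 and 2.x: turn off the metastore Parquet conversion path the workaround refers to.
val sc = new SparkContext(new SparkConf().setAppName("parquet-workaround"))
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")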

FileStream on cluster gives me an exception

I am writing a Spark Streaming application using fileStream...
val probeFileLines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP", filterF, false) //.persist(StorageLevel.MEMORY_AND_DISK_SER)
But I get a file/IO exception:
16/09/07 10:20:30 WARN FileInputDStream: Error finding new files
java.io.FileNotFoundException: /mapr/cellos-mapr/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP
at com.mapr.fs.MapRFileSystem.listMapRStatus(MapRFileSystem.java:1486)
at com.mapr.fs.MapRFileSystem.listStatus(MapRFileSystem.java:1523)
at com.mapr.fs.MapRFileSystem.listStatus(MapRFileSystem.java:86)
The directory does exist on my cluster.
I am running the job using spark-submit:
spark-submit --class "StreamingEngineSt" target/scala-2.11/sprkhbase_2.11-1.0.2.jar
This could be related to file permissions or ownership (you may need to run as the hdfs user). A quick way to check is to list the directory with the same Hadoop configuration the job uses, as in the sketch below.
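A minimal sketch (checkStreamDir is a hypothetical helper name); if the path is missing or not visible to the user submitting the job, the same error shows up immediately.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.StreamingContext

// List the directory fileStream will poll and print owner/permissions for each entry.
def checkStreamDir(ssc: StreamingContext, dir: String): Unit = {
  val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(dir)).foreach { status =>
    println(s"${status.getPath} owner=${status.getOwner} perms=${status.getPermission}")
  }
}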

pig + hbase + hadoop2 integration

Has anyone had successful experience loading data into hbase-0.98.0 from pig-0.12.0 on hadoop-2.2.0 (a hadoop-2.2.0 + hbase-0.98.0 + pig-0.12.0 combination) without encountering this error:
ERROR 2998: Unhandled internal error.
org/apache/hadoop/hbase/filter/WritableByteArrayComparable
with a line of log trace:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArra
I searched the web and found a handful of similar problems and solutions, but all of them refer to pre-hadoop2 and hbase-0.94.x, which were not applicable to my situation.
I have a 5-node hadoop-2.2.0 cluster, a 3-node hbase-0.98.0 cluster, and a client machine with hadoop-2.2.0, hbase-0.98.0, and pig-0.12.0 installed. Each of them functions fine separately: HDFS, MapReduce, the region servers, and Pig all work. To complete a "loading data into HBase from Pig" example, I have the following export:
export PIG_CLASSPATH=$HADOOP_INSTALL/etc/hadoop:$HBASE_PREFIX/lib/*.jar
:$HBASE_PREFIX/lib/protobuf-java-2.5.0.jar:$HBASE_PREFIX/lib/zookeeper-3.4.5.jar
When I try to run: pig -x local -f loaddata.pig
I get the following error: ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable (this must be the 100th time I have hit it over countless tries to figure out a working setup).
The trace log shows: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArrayComparable
the following is my pig script:
REGISTER /usr/local/hbase/lib/hbase-*.jar;
REGISTER /usr/local/hbase/lib/hadoop-*.jar;
REGISTER /usr/local/hbase/lib/protobuf-java-2.5.0.jar;
REGISTER /usr/local/hbase/lib/zookeeper-3.4.5.jar;
raw_data = LOAD '/home/hdadmin/200408hourly.txt' USING PigStorage(',');
weather_data = FOREACH raw_data GENERATE $1, $10;
ranked_data = RANK weather_data;
final_data = FILTER ranked_data BY $0 IS NOT NULL;
STORE final_data INTO 'hbase://weather' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:date info:temp');
I have successfully created the HBase table 'weather'.
Has anyone had success with this and would be generous enough to share?
Rebuild Pig against Hadoop 2 and a newer HBase:
ant clean jar-withouthadoop -Dhadoopversion=23 -Dhbaseversion=95
By default it builds against HBase 0.94; 94 and 95 are the only options.
If you know which jar file contains the missing class (e.g. org/apache/hadoop/hbase/filter/WritableByteArrayComparable), you can use the pig.additional.jars property when running the pig command to ensure that the jar file is available to all the mapper tasks.
pig -D pig.additional.jars=FullPathToJarFile.jar bulkload.pig
Example:
pig -D pig.additional.jars=/usr/lib/hbase/lib/hbase-protocol.jar bulkload.pig

my pig UDF runs in local mode but fails with "Deserialization error: could not instantiate" on my cluster

I have a pig UDF which runs perfectly in local mode, but fails with: could not instantiate 'com.bla.myFunc' with arguments 'null' when I try it on the cluster.
My mistake was not digging hard enough in the task logs.
When you dig into them through the JobTracker UI, you can see that the root cause was:
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Maps
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
so, besides the usual:
pigServer.registerFunction("myFunc", new FuncSpec("com.bla.myFunc"));
we should add:
registerJar(pigServer, Maps.class);
and so on for any jar used by the UDF.
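A minimal sketch of such a registerJar helper (a hypothetical implementation, in Scala for consistency with the other sketches; the same calls work from Java), assuming the class was loaded from a jar on the local filesystem:

import org.apache.pig.PigServer

// Register the jar that contains the given class so it ships with the Pig job.
def registerJar(pigServer: PigServer, clazz: Class[_]): Unit = {
  // Locate the jar (or class directory) the class was loaded from.
  val jarPath = clazz.getProtectionDomain.getCodeSource.getLocation.getPath
  pigServer.registerJar(jarPath)
}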
Another option is to build a single jar with dependencies, but then you have to put pig.jar before yours on the classpath, or else you'll run into this one: embedded hadoop-pig: what's the correct way to use the automatic addContainingJar for UDFs?
