HBase HFile Corruption on AWS S3 - hadoop

I am running HBase on an EMR cluster (emr-5.7.0) enabled on S3.
We are using 'ImportTsv' and 'CompleteBulkLoad' utilities for importing the data into HBase.
During our process, we have observed that intermittently there were failures stating that there were HFile corruption for some of the imported files. This happens sporadically and there is no pattern that we could deduce for the errors.
After lot of research and going through many suggestions in blogs, I have tried the below fixes but of no avail and we are still facing the discrepancy.
Tech Stack :
AWS EMR Cluster (emr-5.7.0 | r3.8xlarge | 15 nodes)
AWS S3
HBase 1.3.1
Data Volume:
~ 960000 lines (To be upserted) | ~ 7GB TSV file
Commands used in sequence:
1) hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="|" -Dimporttsv.columns="<Column Names (472 Columns)>" -Dimporttsv.bulk.output="<HFiles Path on HDFS>" <Table Name> <TSV file path on HDFS>
2) hadoop fs -chmod 777 <HFiles Path on HDFS>
3) hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <HFiles Path on HDFS> <Table Name>
Fixes Tried:
Increasing S3 Max Connections:
We increased the below property but it did not seem to resolve the issue. fs.s3.maxConnections : Values tried -- 10000, 20000, 50000, 100000.
HBase Repair:
Another approach was to execute the HBase repair command but it didn't seem to help either.
Command : hbase hbase hbck -repair
Error Trace is as below:
[LoadIncrementalHFiles-17] mapreduce.LoadIncrementalHFiles: Received a
CorruptHFileException from region server: row '00218333246' on table
'WB_MASTER' at
region=WB_MASTER,00218333246,1506304894610.f108f470c00356217d63396aa11cf0bc.,
hostname=ip-10-244-8-74.ec2.internal,16020,1507907710216, seqNum=198
org.apache.hadoop.hbase.io.hfile.CorruptHFileException:
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem
reading HFile Trailer from file
s3://wbpoc-landingzone/emrfs_test/wb_hbase_compressed/data/default/WB_MASTER/f108f470c00356217d63396aa11cf0bc/cf/2a9ecdc5c3aa4ad8aca535f56c35a32d_SeqId_200_
at
org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:497)
at
org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:525)
at
org.apache.hadoop.hbase.regionserver.StoreFile$Reader.(StoreFile.java:1170)
at
org.apache.hadoop.hbase.regionserver.StoreFileInfo.open(StoreFileInfo.java:259)
at
org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:427)
at
org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:528)
at
org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:518)
at
org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:667)
at
org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:659)
at
org.apache.hadoop.hbase.regionserver.HStore.bulkLoadHFile(HStore.java:799)
at
org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:5574)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.bulkLoadHFile(RSRpcServices.java:2034)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34952)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2339)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
Caused by: java.io.FileNotFoundException: File not present on S3 at
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem$NativeS3FsInputStream.read(S3NativeFileSystem.java:203)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at
java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at
java.io.BufferedInputStream.read(BufferedInputStream.java:345) at
java.io.DataInputStream.readFully(DataInputStream.java:195) at
org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:391)
at
org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:482)
Any suggestions in figuring out the root cause for this discrepancy would be really helpful.
Appreciate your help! Thank you!

After much research and trial & errors, I was finally able to find a resolution for this issue, thanks to AWS support folks. It seems the issue is an occurrence as a result of S3's eventual consistency. The AWS team suggested to use the below property and it worked like a charm, so far we haven't hit the HFile corruption issue. Hope this helps if someone is facing the same issue!
Property (hbase-site.xml):
hbase.bulkload.retries.retryOnIOException : true

Related

Sqoop through JAVA API

We are trying to sqoop data from mysql to HDFS. When we run the code the data gets stored in local file system. We want the data to be in HDFS. Can any one suggest us with the following code?
SqoopOptions options = new SqoopOptions();
options.setConnectString("jdbc:mysql:hostname/db_name");
options.setUsername("user");
options.setPassword("pass");
options.setTableName("table");
options.setDirectMode(true);
options.setNumMappers(4);
options.setDriverClassName("com.mysql.jdbc.Driver");
options.setSqlQuery("select * from table");
options.setWhereClause("value > 15.0");
options.setTargetDir("output");
options.doHiveImport();
System.out.println();
int ret=new ImportTool().run(options);
System.out.println(ret);
I ran the same program in hdfs and got the output :)
Here the issue is with options.setTargetDir("output");
You are not specifying a qualifying HDFS path. If you change "output" with a valid HDFS path, you should be able to run the code from anywhere and still get a proper result.

Spring-xd strange too many open files error

I upgraded from spring-xd 1.2.1 to 1.3.0, and have both under /opt on my system. After starting xd in single node (but configured to use Zookeeper), I tried to create another stream (e.g. "time | log"), and spring-xd throws the following exception:
java.io.FileNotFoundException: /opt/spring-xd-1.2.1.RELEASE/xd/config/modules/modules.yml (Too many open files)
I changed ulimit -n 60000, but it didn't solve the problem. The strange thing is why it still points to spring-xd-1.2.1.RELEASE? I have started both xd-singlenode and xd-shell under /opt/spring-xd-1.3.1.RELEASE
EDIT: add xd-singlenode running process output just to show it's pointing to 1.3.1:
/usr/java/default/bin/java -Dspring.application.name=admin
-Dlogging.config=file:/opt/spring-xd-1.3.0.RELEASE/xd/config//
/xd-singlenode-logback.groovy -Dxd.home=/opt/spring-xd-1.3.0.RELEASE/xd
-Dspring.config.location=file:/opt/spring-xd-1.3.0.RELEASE/xd/config//
-Dxd.config.home=file:/opt
/spring-xd-1.3.0.RELEASE/xd/config//
-Dspring.config.name=servers,application
-Dxd.module.config.location=file:/opt/spring-xd-1.3.0.RELEASE/xd/config//modules/
-Dxd.module.config.name=modules -classpath
/opt/spring-xd-1.3.0.RELEASE/xd/modules/processor/scripts:/opt/spring-xd
-1.3.0.RELEASE/xd/config:/opt/spring-xd-1.3.0.RELEASE/xd/lib/activation-
...
have you updated your environment variables? specifically XD_CONFIG_LOCATION based on the error shown above.

MLlib ALS cannot delete checkpoint RDDs Wrong FS: hdfs://[url] expected: file:///

I am using Spark MLlib's ALS class to train a MatrixFactorizationModel. I have setup a HDFS for checkpointing intermediate rdds (as suggested by ALS class). The rdds are begin saved but I get an exception when it tries to delete them again: java.lang.IllegalArgumentException: Wrong FS: hdfs://[url], expected: file:///
Here is the stack trace:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://[url], expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)
at org.apache.spark.ml.recommendation.ALS$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(ALS.scala:568)
at org.apache.spark.ml.recommendation.ALS$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(ALS.scala:566)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.ml.recommendation.ALS$$anonfun$2.apply$mcV$sp(ALS.scala:566)
at org.apache.spark.ml.recommendation.ALS$$anonfun$train$1.apply$mcVI$sp(ALS.scala:602)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:596)
at org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:219)
at adapters.ALSAdapter.run(ALSAdapter.java:59) ...
The culprit seems to be:
at org.apache.spark.ml.recommendation.ALS.scala 568
FileSystem.get(sc.hadoopConfiguration).delete(new Path(file), true)
which appears to return a RawLocalFilesystem instead of a distributed filesystem object.
I have not touched sc.hadoopConfiguration. The only interaction I've had is to call myJavaStreamingContext.checkpoint(hdfs://[url + directory]);.
Is there anything further I need to do client side to setup sc.hadoopConfiguration or would the problem be hdfs server side?
I was using spark 1.3.1 but tried 1.4.1 and the problem still persists.
your sc.hadoopConfiguration is using local filesystem ("fs.default.name" = file:///) and despite specifying explicitly the checkpoint location, the ALS code uses sc.hadoopConfiguration to determine filesystem (e.g. for deleting previous checkpoints if applicable et al)
you can try
set sc.hadoopConfiguration.set("fs.default.name", hfds://xx:port)
set sc.setCheckpointDir(yourCheckpointDir)
and just call checkpoint with no explicit filesystem path or see if setCheckpointInterval will work for you (i.e. let ALS take care of checkpointing)
you can add hdfs-site.xml and core-site.xml to spark classpath if you don't want sc.hadoopConfiguration.set in code

hbase ExportSnapshot from 0.94 to 0.98

We have two clusters with hbase 0.94, hadoop 1.04 and hbase 0.98, hadoop 2.4
I've created a snapshot from table on 0.94 snapshot and want to migrate it to cluster with hbase 0.98.
After run this command on 0.98 cluster:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshot-name -copy-from webhdfs://hadoops-master:9000/hbase -copy-to hdfs://solr1:8020/hbase
I see:
Exception in thread "main" org.apache.hadoop.hbase.snapshot.ExportSnapshotException: Failed to copy the snapshot directory: from=webhdfs://hadoops-master:9000/hbase/.hbase-snapshot/snapshot-name to=hdfs://solr1:8020/hbase/.hbase-snapshot/.tmp/snapshot-name
at org.apache.hadoop.hbase.snapshot.ExportSnapshot.run(ExportSnapshot.java:916)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.hbase.snapshot.ExportSnapshot.innerMain(ExportSnapshot.java:1000)
at org.apache.hadoop.hbase.snapshot.ExportSnapshot.main(ExportSnapshot.java:1004)
Caused by: java.net.SocketException: Unexpected end of file from server
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:772)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:769)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$Runner.getResponse(WebHdfsFileSystem.java:596)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$Runner.run(WebHdfsFileSystem.java:530)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.run(WebHdfsFileSystem.java:417)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:630)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:641)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.hbase.snapshot.ExportSnapshot.run(ExportSnapshot.java:914)
... 3 more
Lars Hofhansl himself (a principal HBase committer and the maintainer of 0.94) has stated that exports are not supported from 0.94 and 0.98. So you are likely going nowehere with this.
Here is an excerpt from his thread just this afternoon:
It's tricky from multiple angles:- replication between 0.94 and 0.98 does not work (there's a gateway process that supposedly does that, but it's not known to be very reliable)- snapshots cannot be exported from 0.94 and 0.98
source: hbase user's mailing list today 12/15/14
UPDATE On the HBase mailing list there was a report that one user was able to find a way to do the export. A piece of that info is:
If exporting from an HBase 0.94 cluster to an HBase 0.98 cluster, you will need to use the webhdfs protocol (or possibly hftp, though I couldn’t get that to work). You also need to manually move some files around because snapshot layouts have changed. Based on the example above, on the 0.98 cluster do the following:
check whether any imports already exist for the table:
hadoop fs -ls /apps/hbase/data/archive/data/default
It would not be correct for me to reproduce the entire discussion here: a link to nabble that contains the entire gory details is:
http://apache-hbase.679495.n3.nabble.com/0-94-going-forward-td4066883.html
I am digging into this issue and it has more to do with the underlying HDFS.
Once the streams (in my case for distcp) have been written, close is called:
public void close() throws IOException {
try {
super.close();
} finally {
try {
validateResponse(op, conn, true);
} finally {
conn.disconnect();
}
}
}
Wherein it fails in the validate response call (probably the other end of connection has been closed).
This could be due to incompatibility between HDFS 1.0 and 2.4!
I've repacked the hadoop and hbase libs with jarjar https://code.google.com/p/jarjar/
It needs to fix some class name issues.
Then I've write a mapreduce copyTable job. It reads rows from 94 culster and writes to 98 cluster.
Here is the code:
https://github.com/fiserro/copy-table-94to98
Thanks github.com/falsecz for the idea and help!

pig + hbase + hadoop2 integration

has anyone had successful experience loading data to hbase-0.98.0 from pig-0.12.0 on hadoop-2.2.0 in an environment of hadoop-2.20+hbase-0.98.0+pig-0.12.0 combination without encountering this error:
ERROR 2998: Unhandled internal error.
org/apache/hadoop/hbase/filter/WritableByteArrayComparable
with a line of log trace:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArra
I searched the web and found a handful of problems and solutions but all of them refer to pre-hadoop2 and base-0.94-x which were not applicable to my situation.
I have a 5 node hadoop-2.2.0 cluster and a 3 node hbase-0.98.0 cluster and a client machine installed with hadoop-2.2.0, base-0.98.0, pig-0.12.0. Each of them functioned fine separately and I got hdfs, map reduce, region servers , pig all worked fine. To complete an "loading data to base from pig" example, i have the following export:
export PIG_CLASSPATH=$HADOOP_INSTALL/etc/hadoop:$HBASE_PREFIX/lib/*.jar
:$HBASE_PREFIX/lib/protobuf-java-2.5.0.jar:$HBASE_PREFIX/lib/zookeeper-3.4.5.jar
and when i tried to run : pig -x local -f loaddata.pig
and boom, the following error:ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable (this should be the 100+ times i got it dying countless tries to figure out a working setting).
the trace log shows:lava.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArrayComparable
the following is my pig script:
REGISTER /usr/local/hbase/lib/hbase-*.jar;
REGISTER /usr/local/hbase/lib/hadoop-*.jar;
REGISTER /usr/local/hbase/lib/protobuf-java-2.5.0.jar;
REGISTER /usr/local/hbase/lib/zookeeper-3.4.5.jar;
raw_data = LOAD '/home/hdadmin/200408hourly.txt' USING PigStorage(',');
weather_data = FOREACH raw_data GENERATE $1, $10;
ranked_data = RANK weather_data;
final_data = FILTER ranked_data BY $0 IS NOT NULL;
STORE final_data INTO 'hbase://weather' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:date info:temp');
I have successfully created a base table 'weather'.
Has anyone had successful experience and be generous to share with us?
ant clean jar-withouthadoop -Dhadoopversion=23 -Dhbaseversion=95
By default it builds against hbase 0.94. 94 and 95 are the only options.
If you know which jar file contains the missing class, e.g. org/apache/hadoop/hbase/filter/WritableByteArray, then you can use the pig.additional.jars property when running the pig command to ensure that the jar file is available to all the mapper tasks.
pig -D pig.additional.jars=FullPathToJarFile.jar bulkload.pig
Example:
pig -D pig.additional.jars=/usr/lib/hbase/lib/hbase-protocol.jar bulkload.pig

Resources