How to decompress a Hadoop Snappy-compressed file in Java - hadoop

We are compressing our Flink job's S3 output using Hadoop Parquet + Snappy compression.
AvroParquetWriter.<T>builder(out)
.withSchema(schema)
.withCompressionCodec(CompressionCodecName.SNAPPY)
.withDataModel(dataModel)
.build();
Now we have tried both Hadoop's SnappyDecompressor and the snappy-java library in a Java service on EC2 to decompress this file, but both attempts (shown below) fail with the following exception:
"java.lang.UnsatisfiedLinkError: 'int org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect()'" .
Please let us know the correct way to decompress these files in a Java EC2 service.
// Attempt 1: Hadoop's SnappyDecompressor
final SnappyDecompressor decompressor = new SnappyDecompressor();
final byte[] data = IOUtils.toByteArray(s3ObjectInputStream);
decompressor.setInput(data, 0, data.length);
final byte[] uncompressed = new byte[10 * 1024 * 1024];
decompressor.decompress(uncompressed, 0, data.length);
// Attempt 2: snappy-java's SnappyInputStream
final SnappyInputStream snappyInputStream = new SnappyInputStream(s3ObjectInputStream);
final List<String> lines = IOUtils.readLines(snappyInputStream, StandardCharsets.UTF_8);
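Since the output is a Parquet file whose pages are Snappy-compressed (not a raw .snappy stream), one commonly suggested approach is to read it back through a Parquet reader, which performs the page decompression itself and does not need the Hadoop native snappy library. A rough sketch, assuming parquet-avro and an S3 filesystem connector are on the classpath; the s3a path below is a placeholder:
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Hedged sketch: read the Parquet/Snappy file back with AvroParquetReader.
// The path is a placeholder; S3 credentials/endpoint configuration is omitted.
Configuration conf = new Configuration();
Path path = new Path("s3a://my-bucket/output/part-00000.parquet");
try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(HadoopInputFile.fromPath(path, conf))
        .build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        System.out.println(record); // process each record
    }
}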

Related

Unable to write data in HDFS "datanode" - Node added in excluded list

I'm running the "namenode" and "datanode" in the same JVM; when I try to write data, I get the following exception:
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:836)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:724)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:631)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:591)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:490)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:421)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:297)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:148)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:164)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2127)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2771)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:876)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:567)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
final File file = new File("C:\\ManageEngine\\test\\data\\namenode");
final File file1 = new File("C:\\ManageEngine\\test\\data\\datanode1");
BasicConfigurator.configure();
final HdfsConfiguration nameNodeConfiguration = new HdfsConfiguration();
FileSystem.setDefaultUri(nameNodeConfiguration, "hdfs://localhost:5555");
nameNodeConfiguration.set(DFSConfigKeys.DFS_NAMENODE_NAME_DIR_KEY, file.toURI().toString());
nameNodeConfiguration.set(DFSConfigKeys.DFS_REPLICATION_KEY, "1" );
final NameNode nameNode = new NameNode(nameNodeConfiguration);
final HdfsConfiguration dataNodeConfiguration1 = new HdfsConfiguration();
dataNodeConfiguration1.set(DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY, file1.toURI().toString());
dataNodeConfiguration1.set(DFSConfigKeys.DFS_DATANODE_ADDRESS_KEY, "localhost:5556" );
dataNodeConfiguration1.set(DFSConfigKeys.DFS_REPLICATION_KEY, "1" );
FileSystem.setDefaultUri(dataNodeConfiguration1, "hdfs://localhost:5555");
final DataNode dataNode1 = DataNode.instantiateDataNode(new String[]{}, dataNodeConfiguration1);
final FileSystem fs = FileSystem.get(dataNodeConfiguration1);
Path hdfswritepath = new Path(fileName);
if(!fs.exists(hdfswritepath)) {
fs.create(hdfswritepath);
System.out.println("Path "+hdfswritepath+" created.");
}
System.out.println("Begin Write file into hdfs");
FSDataOutputStream outputStream=fs.create(hdfswritepath);
// Classical output stream usage
outputStream.writeBytes(fileContent);
outputStream.close();
System.out.println("End Write file into hdfs");
[Image: request data]
You cannot have the number of replicas higher than the number of datanodes.
If you want to run on a single node, set dfs.replication to 1 in your hdfs-site.xml.
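If editing hdfs-site.xml is awkward in this embedded setup, a rough alternative sketch (reusing fs, hdfswritepath and fileContent from the question's code) is to request a single replica explicitly when creating the file:
// Hedged sketch: ask for exactly one replica at create time so block placement
// can succeed with a single datanode; the 4096 buffer size is an arbitrary choice.
try (FSDataOutputStream outputStream = fs.create(
        hdfswritepath,
        true,                                   // overwrite
        4096,                                   // buffer size
        (short) 1,                              // replication
        fs.getDefaultBlockSize(hdfswritepath))) {
    outputStream.writeBytes(fileContent);
}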

Map Reduce job on EMR successfully running but no output data on S3

I'm running an MR job on the EMR master host.
My input file is in S3, and the output goes to a Hive table via HCatalog.
The job runs successfully and I do see the reducers' output rows, but looking at the new partition folders on S3 I can only see the 0-byte SUCCESS file and no actual data files.
Note: when the reducer stage starts I do see files being written to a temp folder on S3, but it seems the final operation moves the files somewhere else.
I don't see any errors in the MR logs.
Relevant MR driver code:
Job job = Job.getInstance();
job.setJobName("Build Events");
job.setJarByClass(LoggersApp.class);
job.getConfiguration().set("fs.defaultFS", "s3://my-bucket");
// set input paths
Path[] inputPaths = ...; // "file on s3"
FileInputFormat.setInputPaths(job, inputPaths);
// set input/output formats
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(HCatOutputFormat.class);
_configureOutputTable(job);

private void _setReducer(Job job) {
    job.setReducerClass(Reducer.class);
    job.setOutputValueClass(DefaultHCatRecord.class);
}

private void _configureOutputTable(Job job) throws IOException {
    OutputJobInfo jobInfo = OutputJobInfo.create(_cli.getOptionValue("hive-dbname"),
            _cli.getOptionValue("output-table"), null);
    HCatOutputFormat.setOutput(job, jobInfo);
    HCatSchema schema = HCatOutputFormat.getTableSchema(job.getConfiguration());
    HCatFieldSchema partitionDate = new HCatFieldSchema("date",
            TypeInfoFactory.stringTypeInfo, null);
    HCatFieldSchema partitionBatchId = new HCatFieldSchema("batch_id",
            TypeInfoFactory.stringTypeInfo, null);
    schema.append(partitionDate);
    schema.append(partitionBatchId);
    HCatOutputFormat.setSchema(job, schema);
}
Any help?

Tika text extraction not working on HDFS

I'm trying to use Tika to extract text from a bunch of simple txt files stored on HDFS. I have the following code in my reducer, but surprisingly Tika does not return anything. It works fine on my local machine, but as soon as I move everything to the Hadoop cluster, the result is empty.
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX+fileAdd);
InputStream stream = fs.open(pt);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata);
spaceContentBuffer.append(handler.toString());
The last line appends the extracted content to a StringBuilder, but it is always empty.
P.S. My Hadoop cluster is Azure HDInsight, so the HDFS layer is Blob Storage.
I also tried the following code
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler();
Parser parser = new TXTParser();
ParseContext con = new ParseContext();
parser.parse(stream, handler, metadata, con);
and I got the following error message:
Failed to detect the character encoding of a document
If the user does not specify Content-Type when uploading a blob, it will be set to “application/octet-stream” by default.
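One workaround that may help is to hand Tika an explicit type/charset hint instead of relying on the blob's Content-Type, and to lift BodyContentHandler's default write limit. A rough sketch, reusing stream and spaceContentBuffer from the question's code; the charset value is an assumption about how the files were uploaded:
// Hedged sketch: pass a Content-Type hint so the parser does not depend on the
// blob metadata. "text/plain; charset=UTF-8" is an assumption about the files.
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/plain; charset=UTF-8");
BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no character limit
AutoDetectParser parser = new AutoDetectParser();
parser.parse(stream, handler, metadata);
spaceContentBuffer.append(handler.toString());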

Apache Spark 1.2.1 standalone cluster giving java heap space error

I need to know how to figure out how much heap space (memory) is needed to operate on x MB (say x = 600 MB) of data in a Spark standalone cluster.
Scenario:
I have a standalone cluster with 14 GB of memory and 8 cores. I want to operate on 600 MB of data (reading it from files and writing it to Cassandra).
For this task my SparkConf is:
.set("spark.cassandra.output.throughput_mb_per_sec","800")
.set("spark.storage.memoryFraction", "0.3")
And --executor-memory=5g --total-executor-cores 6 --driver-memory 6g at the time of submitting the task.
In spite of the above configuration, I get a java heap space error while writing data to Cassandra.
Below is the Java code:
public static void main(String[] args) throws Exception {
    String fileName = args[0];
    Long now = new Date().getTime();
    SparkConf conf = new SparkConf(true)
            .setAppName("JavaSparkSQL_" + now)
            .set("spark.cassandra.connection.host", "192.168.1.65")
            .set("spark.cassandra.connection.native.port", "9042")
            .set("spark.cassandra.connection.rpc.port", "9160")
            .set("spark.cassandra.output.throughput_mb_per_sec", "800")
            .set("spark.storage.memoryFraction", "0.3");
    JavaSparkContext ctx = new JavaSparkContext(conf);
    JavaRDD<String> input = ctx.textFile(
            "hdfs://abc.xyz.net:9000/figmd/resources/" + fileName, 12);
    JavaRDD<PlanOfCare> result = input.mapPartitions(new ParseJson())
            .filter(new PickInputData());
    System.out.print("Count --> " + result.count());
    System.out.println(StringUtils.join(result.collect(), ","));
    javaFunctions(result).writerBuilder("ks", "pt_planofcarelarge",
            mapToRow(PlanOfCare.class)).saveToCassandra();
}
What configuration am I supposed to use? Am I missing anything?
Thanks in advance.
JavaRDD's collect method returns an array that contains all of the elements in the RDD.
So in your case it will create an array with 340,000 elements, which results in a Java heap space error. You may want to take a small sample of your data and collect that, or save the RDD directly to disk.
For more information about JavaRDD, you can always refer to the official documentation.
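For example, here is a hedged rewrite of the last lines of the question's main method, sampling instead of collecting (the sample size of 10 is arbitrary):
// Hedged sketch: avoid collect(), which pulls every element onto the driver.
System.out.print("Count --> " + result.count());
System.out.println(StringUtils.join(result.take(10), ",")); // small sample only
javaFunctions(result).writerBuilder("ks", "pt_planofcarelarge",
        mapToRow(PlanOfCare.class)).saveToCassandra();       // write stays distributed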

No data being written to S3 using Hadoop FileSystem and BouncyCastle

I'm using the following code to write encrypted data to Amazon S3:
byte[] bytes = compressFile(instr, CompressionAlgorithmTags.ZIP);
PGPEncryptedDataGenerator encGen = new PGPEncryptedDataGenerator(new JcePGPDataEncryptorBuilder(PGPEncryptedData.CAST5).setWithIntegrityPacket(withIntegrityCheck).setSecureRandom(new SecureRandom()).setProvider("BC"));
encGen.addMethod(new JcePublicKeyKeyEncryptionMethodGenerator(pubKey).setProvider("BC"));
OutputStream cOut = encGen.open(out, bytes.length);
cOut.write(bytes);
cOut.close();
If I set "out" to:
final OutputStream fsOutStr = new FileOutputStream(new File("/home/hadoop/encrypted.gpg"));
It writes the file just fine.
However, when I attempt to write it to S3 it does not give me any errors and appears to work, but there is no data on S3 when I check for it:
final FileSystem fileSys = FileSystem.get(new URI(GenericUtils.getAsEncodedStringIfEmbeddedSpaces(s3OutputDir)), new Configuration());
final OutputStream fsOutStr = fileSys.create(new Path(s3OutputDir)); // outputPath on S3
Any idea why it writes the data perfectly fine to the local disk but does not write the file to S3?
Closing fsOutStr solved the problem.
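For reference, a rough sketch of that fix using the names from the question, with try-with-resources so the stream is always closed (the original code may have been structured differently):
// Hedged sketch: close the Hadoop output stream so the S3 upload is finalized.
final FileSystem fileSys = FileSystem.get(
        new URI(GenericUtils.getAsEncodedStringIfEmbeddedSpaces(s3OutputDir)), new Configuration());
try (final OutputStream fsOutStr = fileSys.create(new Path(s3OutputDir))) {
    OutputStream cOut = encGen.open(fsOutStr, bytes.length);
    cOut.write(bytes);
    cOut.close();   // finish the PGP stream
}                   // closing fsOutStr flushes/commits the object to S3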
