ParquetWriter outputs empty parquet file in a java stand alone program - hadoop

I tried to convert existing avro file to parquet. But the output parquet file is empty. I am not sure what I did wrong...
My code snippet:
FileReader<GenericRecord> fileReader = DataFileReader.openReader(
new File("output/users.avro"), new GenericDatumReader<GenericRecord>());
Schema avroSchema = fileReader.getSchema();
// generate the corresponding Parquet schema
MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
// choose compression scheme
CompressionCodecName compressionCodecName = CompressionCodecName.UNCOMPRESSED;
// set Parquet file block size and page size values
int pageSize = 64 * 1024;
Path outputPath = new Path("output/users.parquet");
// create a parquet writer using builder
ParquetWriter parquetWriter = (ParquetWriter) AvroParquetWriter.builder(outputPath)
.withSchema(avroSchema)
.withCompressionCodec(compressionCodecName)
.withPageSize(pageSize)
.build();
// read avro, write parquet
while (fileReader.hasNext()) {
GenericRecord record = fileReader.next();
System.out.println(record);
parquetWriter.write(record);
}

I had the same problem and found that I needed to close the parquetWriter before the data was committed to the file. It just needs you to add
parquetWriter.close();
after the while loop.

Related

How to read a COMPRESS()-ed H2 blob column via JDBC?

I have a file-based H2 database (engine version 1.4.196) with a mediumblob column containing data returned by the COMPRESS() function:
create table foo (compressed_data mediumblob);
...
insert into foo (compressed_data) values (COMPRESS(STRINGTOUTF8('Test'), 'DEFLATE'));
(The table is created and filled by flyway.)
I'd like to read this data in a JDBC client without calling DECOMPRESS() first. (I want to do the decompression client-side for compatibility with another system). I've tried to read the data via an InflaterInputStream, which can uncompress DEFLATE data:
try (InputStream dbStream = rs.getBinaryStream("compressed_data");
InflaterInputStream inflaterStream = new InflaterInputStream(dbStream);
) {
inflaterStream.read();
...
But this causes an error:
java.util.zip.ZipException: incorrect header check
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
...
Is there any way I can get InflaterInputStream-compatible compressed data from a column in H2?
Since you are already using H2 JDBC to access the database you can simply retrieve the compressed data with getBytes and use the expand method of org.h2.tools.CompressTool to uncompress it:
// .java source file is Cp1252 encoded
String sql = "SELECT COMPRESS(STRINGTOUTF8('fermé'), 'DEFLATE') AS foo";
ResultSet rs = st.executeQuery(sql);
rs.next();
byte[] bytesOut = rs.getBytes(1);
byte[] expanded = org.h2.tools.CompressTool.getInstance().expand(bytesOut);
String strOut = new String(expanded, "UTF-8");

spark job performing poorly while converting text files to parquet format

I have a spark streaming application which is responsible for converting text files into parquet format on the fly, and then saving the data in an external hive table. Please refer the mentioned piece of code which is one of the classes for processing text files into parquet:
object HistTableLogic {val logger = Logger.getLogger("file")
def schemadef(batchId: String) {
println("process started!")
logger.debug("process started")
val sourcePath = "some path"
val destPath = "somepath"
println(s"source path :${sourcePath}")
println(s"dest path :${destPath}")
logger.debug(s"source path :${sourcePath}")
logger.debug(s"dest path :${destPath}")
// val sc = new SparkContext(new SparkConf().set("spark.driver.allowMultipleContexts", "true"))
val conf = new Configuration()
println("Spark Context created!!")
logger.debug("Spark Context created!!")
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
println("Spark session created!")
logger.debug("Spark session created!")
val schema = StructType.apply(spark.read.table("hivetable").schema.fields.dropRight(2))
try {
val fs = FileSystem.get(conf)
spark.sql("ALTER table hivetable drop if exists partition (batch_run_dt='"+batchId.substring(1,9)+"', batchid='"+batchId+"')")
fs.listStatus(new Path(sourcePath)).foreach(x => {
val df = spark.read.format("com.databricks.spark.csv").option("inferSchema","true").option("delimiter","\u0001").
schema(schema).csv(s"${sourcePath}/"+batchId).na.fill("").repartition(50).write.mode("overwrite").option("compression", "gzip")
.parquet(s"${destPath}/batch_run_dt="+batchId.substring(1,9)+"/batchid="+batchId)
spark.sql("ALTER table hivetable add partition (batch_run_dt='"+batchId.substring(1,9)+"', batchid='"+batchId+"')")
logger.debug("Partition added")
})
} catch {
case e: Exception => {
println("---------Exception caught---------!")
logger.debug("---------Exception caught---------!")
e.printStackTrace()
logger.debug(e.printStackTrace)
logger.debug(e.getMessage)
}
}
}}
I am passing schemadef method of above class in main method of another java class which has the logic of receiving batchIds 24x7, set via custom receiver.
Functionally the application runs fine, but is taking around 15 minutes to process even 1GB of data. And if I try to simply load the data into hive table through LOAD query, it happens within a minute.
referring below configuration for spark job:
SPARK_MASTER YARN
SPARK_DEPLOY-MODE CLUSTER
SPARK_DRIVER-MEMORY 13g
SPARK_NUM-EXECUTORS 6
SPARK_EXECUTOR-MEMORY 15g
SPARK_EXECUTOR-CORES 2
Please let me know if you find any flaw in this or any other optimization I can do to enhance this process. Thank you

Tika text extraction not working on HDFS

I'm trying to use Tika to extract text from a bunch of simple txt files stored on HDFS. I have the following code in my reducer, but surprisingly Tika does not return anything. It work fine in my local machine but as soon as I move everything to hadoop cluster, the result is empty.
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX+fileAdd);
InputStream stream = fs.open(pt);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata);
spaceContentBuffer.append(handler.toString());
The last line append the extreaxted content to a StringBuilder, but it is always empty.
p.s. my hadoop cluster is Azure HDInsight so the HDFS is Blob Storage.
I also tried the following code
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler();
Parser parser = new TXTParser();
ParseContext con = new ParseContext();
parser.parse(stream, handler, metadata, con);
and I got the following error message:
Failed to detect the character encoding of a document
If the user does not specify Content-Type when uploading a blob, it will be set to “application/octet-stream” by default.

hadoop-1.0.3 sequenceFile.Writer overwrites instead of appending images into a sequencefile

I am using hadoop 1.0.3 (I can't really upgrade right now,Thats for later. )
I have around 100 images in my HDFS and I am trying to combine them into a single sequencefile ( default no compression etc.. )
here's my code:
FSDataInputStream in = null;
BytesWritable value = new BytesWritable();
Text key = new Text();
Path inpath = new Path(fs.getHomeDirectory(),"/user/hduser/input");
Path seq_path = new Path(fs.getHomeDirectory(),"/user/hduser/output/file.seq");
FileStatus[] files = fs.listStatus(inpath);
SequenceFile.Writer writer = null;
for( FileStatus fileStatus : files){
inpath = fileStatus.getPath();
try {
in = fs.open(inpath);
byte bufffer[] = new byte[in.available()];
in.read(bufffer);
writer = SequenceFile.createWriter(fs,conf,seq_path,key.getClass(),value.getClass());
writer.append(new Text(inpath.getName()), new BytesWritable(bufffer));
}catch (Exception e) {
System.out.println("Exception MESSAGES = "+e.getMessage());
e.printStackTrace();
}}
This just goes through all the files in input/ and one by one appends them.
HOWEVER this just overwrites my sequence file instead of appending it , I see only the last image in sequencefile.
NOTE I am not closing the writer before the for loop ends , can anyone help me with this please.
I am not sure How can I append the images?
Your main issue is with the following line :
writer = SequenceFile.createWriter(fs, conf, seq_path, key.getClass(), value.getClass());
which is inside the for, creating a new writer in each pass. It replaces previous file at the path seq_path. Thus only last image is available.
Pull it out of the loop, and the problem should vanish.

Error Appending to IsolatedStorageFile

I am having some problems with Isolated file store , I am trying to append to a file, but when I use the code below, I get an error about invalid Arguments on this line
IsolatedStorageFileStream("Folder\\barcodeinfo.txt", FileMode.Append,
FileMode.OpenOrCreate, myStore))
I think it has something to do with the Filemode.Append.. I am trying to append to the file rather than create new
// Obtain the virtual store for the application.
IsolatedStorageFile myStore = IsolatedStorageFile.GetUserStoreForApplication();
// Create a new folder and call it "MyFolder".
myStore.CreateDirectory("Folder");
// Specify the file path and options.
using (var isoFileStream = new IsolatedStorageFileStream("Folder\\barcodeinfo.txt", FileMode.Append, FileMode.OpenOrCreate, myStore))
{
//Write the data
using (var isoFileWriter = new StreamWriter(isoFileStream))
{
isoFileWriter.WriteLine(textBox1.Text);
isoFileWriter.WriteLine(textBox2.Text);
isoFileWriter.WriteLine(textBox3.Text);
}
}
There is no overload that takes two FileModes. It should be
IsolatedStorageFileStream("Folder\\barcodeinfo.txt", FileMode.Append,
FileAccess.Write, myStore));
Important thing to note about FileMode.Append is:
[FileMode.Append] Opens the file if it exists and seeks to the end of the file, or
creates a new file. Append can only be used in conjunction with Write.
Attempting to seek to a position before the end of the file will throw
an IOException and any attempt to read fails and throws an
NotSupportedException.
which is why FileAccess.Write is used.
It looks like you have FileMode.Append, FileMode.OpenOrCreate. That is 2 file modes. The first parameter is be FileMode and the second should be FileAccess.
That should fix your problem.

Resources