In my use case, I need to find a way to append key/value pairs to an existing SequenceFile. How can I do that? Any clue would be greatly helpful. I am using Hadoop 2.x.
Also, I came across the documentation below. Can anyone tell me how to use it to append?
public static org.apache.hadoop.io.SequenceFile.Writer createWriter(FileContext fc,
Configuration conf,
Path name,
Class keyClass,
Class valClass,
org.apache.hadoop.io.SequenceFile.CompressionType compressionType,
CompressionCodec codec,
org.apache.hadoop.io.SequenceFile.Metadata metadata,
EnumSet<CreateFlag> createFlag,
org.apache.hadoop.fs.Options.CreateOpts... opts)
throws IOException
Construct the preferred type of SequenceFile Writer.
Parameters:
fc - The context for the specified file.
conf - The configuration.
name - The name of the file.
keyClass - The 'key' type.
valClass - The 'value' type.
compressionType - The compression type.
codec - The compression codec.
metadata - The metadata of the file.
**createFlag - gives the semantics of create: overwrite, append etc.**
opts - file creation options; see Options.CreateOpts.
Returns:
Returns the handle to the constructed SequenceFile Writer.
Throws:
IOException
UPDATE: The issue HADOOP-7139 is now closed, and since versions 2.6.1 / 2.7.2 it is possible to append to an existing SequenceFile :)
(I was using version 2.7.1 and was looking to append to a SequenceFile, so I downgraded to 2.6.1 because version 2.7.2 is not out yet.)
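For reference, appending now works via the appendIfExists writer option; a minimal sketch (the path and key/value types below are placeholders, and it assumes Hadoop 2.6.1+ where HADOOP-7139 added this option):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/data/example.seq"); // placeholder path

        // appendIfExists(true) reopens the existing file for append instead of
        // overwriting it; key/value classes must match the existing file.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(file),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.appendIfExists(true))) {
            writer.append(new Text("newKey"), new IntWritable(42));
        }
    }
}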
It's still not possible to append to an existing SequenceFile.
There is an open issue to work on that, but it's still unresolved.
I'm working on a tool for converting data from a homegrown format to Parquet and JSON (for use in different settings with Spark, Drill and MongoDB), using Avro with Specific Mapping as the stepping stone. I have to support conversion of new data on a regular basis and on client machines, which is why I'm trying to write my own standalone conversion tool with an (Avro|Parquet|JSON) switch instead of using Drill or Spark or other tools as converters, as I probably would if this were a one-time job. I'm basing the whole thing on Avro because this seems like the easiest way to get conversion to Parquet and JSON under one hood.
I used Specific Mapping to profit from static type checking, wrote an IDL, converted that to a schema.avsc, generated classes and set up a sample conversion with the specific constructor, but now I'm stuck configuring the writers. All Avro-to-Parquet conversion examples I could find [0] use AvroParquetWriter with deprecated signatures (mostly: Path file, Schema schema) and Generic Mapping.
AvroParquetWriter has only one non-deprecated constructor, with this signature:
AvroParquetWriter(
Path file,
WriteSupport<T> writeSupport,
CompressionCodecName compressionCodecName,
int blockSize,
int pageSize,
boolean enableDictionary,
boolean enableValidation,
WriterVersion writerVersion,
Configuration conf
)
Most of the parameters are not hard to figure out but WriteSupport<T> writeSupport throws me off. I can't find any further documentation or an example.
Staring at the source of AvroParquetWriter I see GenericData model pop up a few times but only one line mentioning SpecificData: GenericData model = SpecificData.get();.
So I have a few questions:
1) Does AvroParquetWriter not support Avro Specific Mapping? Or does it by means of that SpecificData.get() method? The comment "Utilities for generated Java classes and interfaces." over SpecificData.class seems to suggest that, but how exactly should I proceed?
2) What's going on in the AvroParquetWriter constructor, is there an example or some documentation to be found somewhere?
3) More specifically: the signature of the WriteSupport method asks for 'Schema avroSchema' and 'GenericData model'. What does GenericData model refer to? Maybe I can't see the forest for the trees here...
To give an example of what I'm aiming for, my central piece of Avro conversion code currently looks like this:
DatumWriter<MyData> avroDatumWriter = new SpecificDatumWriter<>(MyData.class);
DataFileWriter<MyData> dataFileWriter = new DataFileWriter<>(avroDatumWriter);
dataFileWriter.create(schema, avroOutput);
The Parquet equivalent currently looks like this:
AvroParquetWriter<SpecificRecord> parquetWriter = new AvroParquetWriter<>(parquetOutput, schema);
but this is no more than a beginning and is modeled after the examples I found, using the deprecated constructor, so it will have to change anyway.
Thanks,
Thomas
[0] Hadoop: The Definitive Guide, O'Reilly; https://gist.github.com/hammer/76996fb8426a0ada233e; http://www.programcreek.com/java-api-example/index.php?api=parquet.avro.AvroParquetWriter
Try AvroParquetWriter.builder:
MyData obj = ... // should be an Avro object
ParquetWriter<Object> pw = AvroParquetWriter.builder(file)
.withSchema(obj.getSchema())
.build();
pw.write(obj);
pw.close();
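If you want the specific-mapping variant, here is a hedged sketch. It assumes the org.apache.parquet package names (older releases used the parquet.* packages) and a parquet-avro version whose builder exposes withDataModel; MyData stands for the generated specific class from the question, and the output path is a placeholder:
import org.apache.avro.specific.SpecificData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class SpecificToParquet {
    public static void main(String[] args) throws Exception {
        Path parquetOutput = new Path("/tmp/mydata.parquet"); // placeholder output path

        ParquetWriter<MyData> writer = AvroParquetWriter.<MyData>builder(parquetOutput)
                .withSchema(MyData.getClassSchema())       // schema generated from the IDL
                .withDataModel(SpecificData.get())         // use the specific (generated-class) model
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build();

        MyData record = new MyData(); // generated specific class; populate its fields before writing
        writer.write(record);
        writer.close();
    }
}
The withDataModel call is what switches the writer from the generic to the specific representation, which is the part the deprecated (Path, Schema) examples never show.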
Thanks.
I'm using ZlibDecompressor in Hadoop, but I'm getting an "incorrect header check" exception.
This is how I instantiate it:
ZlibDecompressor inflater = new ZlibDecompressor(ZlibDecompressor.CompressionHeader.DEFAULT_HEADER,1024);
inflater.setInput(bytesCompressed, 0, bytesCompressed.length);
And here is how I use it for decompression
inflater.decompress(bytesDecompressedBuffer,0,bufferSizeInBytes);
I'm using Hadoop 0.20.2.
What could be the problem and how to solve it?
Thanks
d31efcf42e83e76d3df76d38db5d3c141f76135e7417de41d44dc50b507a07b03a07a03ad40f75db7f00038d7df02177db9dbbd01f02e35ef7eb60f6f77dfaebde3a0b7f75036d41dc3dc00c4e40136e3b044e83ec5d35f01044f050841011000c0df4d3ae40ec1079078101f02dfcd40dfbef9df5ec4db8e45d37d85102d350b8001d79f7de8303ce7a045efdd75e35dfc03b036f3c0f5e43034d78dfadb9e7ad7d0750c10c30bce7a103d04ef4000dbde01dfdf7a0c20b907df7def9d80137ef8
The problem reported is that there is no valid zlib header in the first two bytes. The problem with the data is that it does not appear to contain any deflate-compressed data anywhere in it, regardless of whether such data could be zlib-wrapped, gzip-wrapped, or raw.
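A quick way to confirm that is to inspect the first two bytes yourself. This is a small diagnostic sketch (not part of the Hadoop API) using the standard zlib/gzip header rules from RFC 1950 / RFC 1952, with bytesCompressed being the array from the question:
static String guessWrapper(byte[] bytesCompressed) {
    if (bytesCompressed == null || bytesCompressed.length < 2) {
        return "too short to contain a header";
    }
    int b0 = bytesCompressed[0] & 0xFF;
    int b1 = bytesCompressed[1] & 0xFF;

    if (b0 == 0x1F && b1 == 0x8B) {
        // gzip magic bytes; ZlibDecompressor's DEFAULT_HEADER will not accept this.
        return "gzip header";
    }
    // zlib (RFC 1950): the low nibble of CMF must be 8 (deflate) and
    // (CMF * 256 + FLG) must be a multiple of 31.
    if ((b0 & 0x0F) == 8 && ((b0 << 8) | b1) % 31 == 0) {
        // DEFAULT_HEADER should be able to read this.
        return "zlib header";
    }
    return "no zlib/gzip header (raw deflate at best, or not compressed data at all)";
}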
Are there any pointers to get Scalding to work with LZO Protobuf data on HDFS?
I am trying to read files that are stored in binary Protobuf and compressed in LZO using Scalding.
Can we use Elephant Bird to read those files? Any pointers will be appreciated!
I have looked at LzoTraits and LzoProtobufScheme, but I am not sure how I should be using them to read the data. Any examples would be great!
Here is an example:
case class SomeProto() extends FixedPathSource("/my/greatData/*")
with LzoProtobuf[MyProtoClassHere] {
override def column = classOf[MyProtoClassHere]
}
You can mix in other types of abstract base Sources (like TimePathedSource or MostRecentGoodSource) in a similar way. You can also mix in LocalTapSource if you want to use the Hadoop-inside-cascading-local trick (if you don't run in Cascading local mode, you don't need this).
I am new to Hadoop. Basically I am writing a program which takes two multi-FASTA files (ref.fasta, query.fasta) which are 3+ GB...
ref.fasta:
>gi|12345
ATATTATAGGACACCAATAAAATT..
>gi|5253623
AATTATCGCAGCATTA...
..and so on..
query.fasta:
>query
ATTATTTAAATCTCACACCACATAATCAATACA
AATCCCCACCACAGCACACGTGATATATATACA
CAGACACA...
Now, to each mapper I need to give a single part of the ref file and the whole query file,
i.e.
>gi|12345
ATATTATAGGACACCAATA....
(a single FASTA sequence from the ref file)
AND the entire query file, because I want to run an exe inside the mapper which takes both of these as input.
So do I process ref.fasta outside and then give it to the mapper? Or something else?
I just need the approach which will take minimum time.
Thanks.
The best approach for your use case may be to put the query file in the distributed cache and get the file object ready in configure()/setup() so it can be used in map(), and to have the ref file as the normal input.
You may do the following:
In your run() add the query file to the distributed cache:
DistributedCache.addCacheFile(new URI(queryFileHdfsOrS3Path), conf); // HDFS or S3 path of query.fasta
Now have the mapper class something like following:
public static class MapJob extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    File queryFile;

    @Override
    public void configure(JobConf job) {
        // The cached query file is available on the task's local disk.
        Path queryFilePath = DistributedCache.getLocalCacheFiles(job)[0];
        queryFile = new File(queryFilePath.toString());
    }

    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Use the queryFile object and [key, value] from your ref file here to run the exe file as desired.
    }
}
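For completeness, here is a hedged sketch of the run()/driver side this assumes, using the same old mapred API; the job name, paths and the MapJob reference are placeholders rather than code from the question:
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class RefVsQueryDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(RefVsQueryDriver.class);
        conf.setJobName("ref-vs-query");

        // Ship query.fasta to every task via the DistributedCache.
        DistributedCache.addCacheFile(new URI("/data/query.fasta"), conf);

        conf.setMapperClass(MapJob.class); // the MapJob mapper from the answer above
        conf.setNumReduceTasks(0);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // The ref file is the normal job input.
        FileInputFormat.setInputPaths(conf, new Path("/data/ref.fasta"));
        FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

        JobClient.runJob(conf);
    }
}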
I faced a similar problem.
I'd suggest you pre-process your ref file and split it into multiple files (one per sequence).
Then copy those files to a folder on the hdfs that you will set as your input path in your main method.
Then implement a custom input format class and a custom record reader class. Your record reader will just pass the name of the local file split path (as a Text value) to either the key or value parameter of your map method (a sketch of this idea follows this answer).
For the query file that is require by all map functions, again add your query file to the hdfs and then add it to the DistributedCache in your main method.
In your map method you'll then have access to both local file paths and can pass them to your exe.
Hope that helps.
I had a similar problem and eventually re-implemented the functionality of the blast exe file, so that I didn't need to deal with reading files in my map method and could instead deal entirely with Java objects (Genes and Genomes) that are parsed from the input files by my custom record reader and then passed as objects to my map function.
Cheers, Wayne.
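Here is a minimal sketch of that custom input format idea, written against the newer org.apache.hadoop.mapreduce API; the class and method names are illustrative, not the author's actual code. Each pre-split ref file becomes one unsplittable split, and the record reader emits a single record whose value is the file's path, which the mapper can then fetch locally and hand to the exe:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFilePathInputFormat extends FileInputFormat<NullWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // each pre-split ref file goes to exactly one mapper
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new RecordReader<NullWritable, Text>() {
            private Path path;
            private boolean done = false;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) {
                path = ((FileSplit) split).getPath();
            }

            @Override
            public boolean nextKeyValue() {
                if (done) {
                    return false;
                }
                done = true; // emit a single record: the path of the whole file
                return true;
            }

            @Override
            public NullWritable getCurrentKey() {
                return NullWritable.get();
            }

            @Override
            public Text getCurrentValue() {
                return new Text(path.toString());
            }

            @Override
            public float getProgress() {
                return done ? 1.0f : 0.0f;
            }

            @Override
            public void close() {
            }
        };
    }
}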
Does anyone know of, or has anyone used, the copyMerge function in the Hadoop API (FileUtil)?
copyMerge(FileSystem srcFS, Path srcDir, FileSystem dstFS, Path dstFile, boolean deleteSource, Configuration conf, String addString);
In the function, what is the addString parameter? And how do I control how those files are merged? For example, I have part files numbered 1, 2, 3, 4, 5, ...; I want to combine them into one file in ascending order. How can I do it?
Detail about the API: http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/api/org/apache/hadoop/fs/FileUtil.html
Thanks!
Looks like the addString is just written to the OutputStream in the FileUtil class, after each source file's contents have been copied:
if (addString != null)
    out.write(addString.getBytes("UTF-8"));
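To actually call it, a small sketch (the paths are placeholders). copyMerge concatenates whatever listStatus(srcDir) returns; on HDFS that usually comes back in lexicographic name order, so part-00000, part-00001, ... end up in ascending order, though the API itself does not promise an ordering. The addString (e.g. "\n") is written after each part:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path srcDir = new Path("/job/output");              // directory with part-00000, part-00001, ...
        Path dstFile = new Path("/job/merged/output.txt");  // single merged file

        // deleteSource=false keeps the part files; "\n" is appended after each part.
        FileUtil.copyMerge(fs, srcDir, fs, dstFile, false, conf, "\n");
    }
}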
When there is no documentation, the source code is the true and best source for details. I have written a few articles on how to set up Git here and here. Git gives you faster and easier access to the code.