How to use Hadoop API copyMerge function? What is the addString parameter? - hadoop

Has anyone used the copyMerge function in the Hadoop API (FileUtil)?
copyMerge(FileSystem srcFS, Path srcDir, FileSystem dstFS, Path dstFile, boolean deleteSource, Configuration conf, String addString);
In the function, what is the addString parameter? And how do I control the order in which the files are merged? For example, I have part files numbered 1, 2, 3, 4, 5, ... and I want to combine them into one file in ascending order. How can I do that?
Detail about the API: http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/api/org/apache/hadoop/fs/FileUtil.html
Thanks!

It looks like the addString is just written to the OutputStream in the FileUtil class:
if (addString != null)
    out.write(addString.getBytes("UTF-8"));
When there is no documentation, the source code is the true and best source of detail. I have written a few articles on how to set up Git here and here; Git makes it faster and easier to browse the code.
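For what it's worth, here is a minimal usage sketch (the paths are hypothetical). copyMerge concatenates the part files in the order FileSystem.listStatus returns them, which for names like part-00000, part-00001, ... is normally ascending lexicographic order, and writes addString after each file:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // hypothetical paths: a directory of part-00000, part-00001, ... and the merged target
        Path srcDir  = new Path("/user/me/job-output");
        Path dstFile = new Path("/user/me/merged.txt");

        // addString ("\n" here) is appended after each source file's contents
        FileUtil.copyMerge(fs, srcDir, fs, dstFile,
                false,   // deleteSource: keep the part files
                conf,
                "\n");   // addString
    }
}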

Related

Placing file inside folder of S3 bucket

I have a Spring Boot application where I am trying to place a file inside a folder of an S3 target bucket: target-bucket/targetsystem-folder/file.csv
The targetsystem-folder name will differ for each file and is retrieved from a YAML configuration file.
The targetsystem-folder has to be created via code if it does not exist, and the file should be placed under that folder.
As far as I know, there is no folder concept in an S3 bucket and everything is stored as objects.
I have read in some documents that to place the file under a folder, I have to give a key-expression like targetsystem-folder/file.csv and bucket = target-bucket.
But it does not work out. I would like to achieve this using spring-integration-aws without using the AWS SDK directly.
<int-aws:s3-outbound-channel-adapter id="filesS3Mover"
        channel="filesS3MoverChannel"
        transfer-manager="transferManager"
        bucket="${aws.s3.target.bucket}"
        key-expression="headers.targetsystem-folder/headers.file_name"
        command="UPLOAD">
</int-aws:s3-outbound-channel-adapter>
Can anyone guide me on this issue?
Your problem is that the SpEL in the key-expression is wrong. Just try to start from regular Java code and imagine how you would build such a value. Then you'll see that you are missing a concatenation operation in your expression:
key-expression="headers.targetsystem-folder + '/' + headers.file_name"
Also, please provide more information about the error in the future; in most cases the stack trace is very helpful.
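For completeness, a hedged sketch of how the headers used in that expression might be populated before sending to the adapter's channel (the class and method below are illustrative, not from the question). If SpEL has trouble with the hyphen in targetsystem-folder, bracket notation such as headers['targetsystem-folder'] can be used in the key-expression instead:
import java.io.File;

import org.springframework.integration.support.MessageBuilder;
import org.springframework.messaging.Message;

public class S3UploadMessageFactory {

    // Builds the message sent to filesS3MoverChannel; the header names match
    // the ones referenced by the key-expression in the question.
    public static Message<File> buildMessage(File csv, String targetFolder) {
        return MessageBuilder.withPayload(csv)
                .setHeader("targetsystem-folder", targetFolder)
                .setHeader("file_name", csv.getName())
                .build();
    }
}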
In a project I worked on before, I just used the AWS SDK for Java directly. In my implementation, I did something like this:
private void uploadFileTos3bucket(String fileName, File file) {
    // the key is the full object name; S3 has no real folders, the '/' is just part of the key
    s3client.putObject(new PutObjectRequest("target-bucket", "/targetsystem-folder/" + fileName, file)
            .withCannedAcl(CannedAccessControlList.PublicRead));
}
I didn't create any more configuration. It automatically "creates" /targetsystem-folder inside the bucket (and then puts the file inside it) if it does not exist; otherwise it just puts the file inside.
You can take this answer as a reference for further explanation of the subject:
There are no "sub-directories" in S3. There are buckets and there are
keys within buckets.
You can emulate traditional directories by using prefix searches. For
example, you can store the following keys in a bucket:
foo/bar1
foo/bar2
foo/bar3
blah/baz1
blah/baz2
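A minimal sketch of such a prefix search with the AWS SDK for Java v1 (the same SDK generation as the PutObjectRequest example above); the bucket and prefix names are the ones from the question:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class ListFolder {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Listing with a key prefix emulates "listing a directory":
        // every key starting with "targetsystem-folder/" is returned,
        // even though S3 itself stores a flat key space.
        ListObjectsV2Request request = new ListObjectsV2Request()
                .withBucketName("target-bucket")
                .withPrefix("targetsystem-folder/");

        ListObjectsV2Result result = s3.listObjectsV2(request);
        for (S3ObjectSummary summary : result.getObjectSummaries()) {
            System.out.println(summary.getKey());
        }
    }
}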

Parquet-MR AvroParquetWriter - how to convert data to Parquet (with Specific Mapping)

I'm working on a tool for converting data from a homegrown format to Parquet and JSON (for use in different settings with Spark, Drill and MongoDB), using Avro with Specific Mapping as the stepping stone. I have to support conversion of new data on a regular basis and on client machines which is why I try to write my own standalone conversion tool with a (Avro|Parquet|JSON) switch instead of using Drill or Spark or other tools as converters as I probably would if this was a one time job. I'm basing the whole thing on Avro because this seems like the easiest way to get conversion to Parquet and JSON under one hood.
I used Specific Mapping to profit from static type checking, wrote an IDL, converted that to a schema.avsc, generated classes and set up a sample conversion with specific constructor, but now I'm stuck configuring the writers. All Avro-Parquet conversion examples I could find [0] use AvroParquetWriter with deprecated signatures (mostly: Path file, Schema schema) and Generic Mapping.
AvroParquetWriter has only one non-deprecated constructor, with this signature:
AvroParquetWriter(
    Path file,
    WriteSupport<T> writeSupport,
    CompressionCodecName compressionCodecName,
    int blockSize,
    int pageSize,
    boolean enableDictionary,
    boolean enableValidation,
    WriterVersion writerVersion,
    Configuration conf
)
Most of the parameters are not hard to figure out but WriteSupport<T> writeSupport throws me off. I can't find any further documentation or an example.
Staring at the source of AvroParquetWriter I see GenericData model pop up a few times but only one line mentioning SpecificData: GenericData model = SpecificData.get();.
So I have a few questions:
1) Does AvroParquetWriter not support Avro Specific Mapping? Or does it, by means of that SpecificData.get() method? The comment "Utilities for generated Java classes and interfaces." on SpecificData seems to suggest that, but how exactly should I proceed?
2) What's going on in the AvroParquetWriter constructor, is there an example or some documentation to be found somewhere?
3) More specifically: the signature of the WriteSupport method asks for 'Schema avroSchema' and 'GenericData model'. What does GenericData model refer to? Maybe I just can't see the forest for the trees here...
To give an example of what I'm aiming for, my central piece of Avro conversion code currently looks like this:
DatumWriter<MyData> avroDatumWriter = new SpecificDatumWriter<>(MyData.class);
DataFileWriter<MyData> dataFileWriter = new DataFileWriter<>(avroDatumWriter);
dataFileWriter.create(schema, avroOutput);
The Parquet equivalent currently looks like this:
AvroParquetWriter<SpecificRecord> parquetWriter = new AvroParquetWriter<>(parquetOutput, schema);
but this is no more than a beginning; it is modeled after the examples I found and uses the deprecated constructor, so it will have to change anyway.
Thanks,
Thomas
[0] Hadoop - The definitive Guide, O'Reilly, https://gist.github.com/hammer/76996fb8426a0ada233e, http://www.programcreek.com/java-api-example/index.php?api=parquet.avro.AvroParquetWriter
Try AvroParquetWriter.builder:
MyData obj = ... // an instance of the Avro-generated class
ParquetWriter<MyData> pw = AvroParquetWriter.<MyData>builder(file)
        .withSchema(obj.getSchema())
        .build();
pw.write(obj);
pw.close();
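If you want the specific (generated) data model wired in explicitly, the builder also accepts one. A minimal sketch, assuming the generated class MyData from the question; package names follow recent parquet-mr (org.apache.parquet.avro), while older releases used parquet.avro:
import java.io.IOException;

import org.apache.avro.specific.SpecificData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class SpecificParquetWrite {

    // Writes one record; the builder sets up the Avro WriteSupport internally,
    // so the long deprecated constructor is not needed.
    public static void write(Path parquetOutput, MyData record) throws IOException {
        try (ParquetWriter<MyData> writer = AvroParquetWriter.<MyData>builder(parquetOutput)
                .withSchema(MyData.getClassSchema())      // schema compiled into the generated class
                .withDataModel(SpecificData.get())        // use the specific data model, not GenericData
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            writer.write(record);
        }
    }
}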
Thanks.

Xtext get the absolute path of the generated files

I want to access the file generated by Xtext in order to compile it automatically, so I need its absolute path. It would be enough to get the absolute path of the current project at run time. Any idea how I can get it?
I am working inside the "MyDslGenerator" class. I tried to get it from the "resource" in
override void doGenerate(Resource resource, IFileSystemAccess fsa)
but couldn't find it.
Help is highly appreciated.
I ended up using this code:
var uri = (fsa as IFileSystemAccessExtension2).getURI(fileName)
Maybe you can use the interface org.eclipse.xtext.generator.IFileSystemAccessExtension2; the passed IFileSystemAccess may implement this interface too.
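When the generator runs inside Eclipse, the URI returned by getURI is typically a platform:/resource URI, which can be resolved to an absolute path via the workspace. A minimal sketch, assuming an Eclipse (non-standalone) run; the helper class and method names are illustrative:
import org.eclipse.core.resources.IFile;
import org.eclipse.core.resources.ResourcesPlugin;
import org.eclipse.core.runtime.Path;
import org.eclipse.emf.common.util.URI;

public class GeneratedFileLocator {

    // Resolves a platform:/resource URI to an absolute file-system path.
    // toPlatformString returns null for non-platform URIs, and getLocation
    // can be null for non-local resources, so callers should check for null.
    public static String toAbsolutePath(URI uri) {
        IFile file = ResourcesPlugin.getWorkspace().getRoot()
                .getFile(new Path(uri.toPlatformString(true)));
        return file.getLocation().toOSString();
    }
}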

Scalding + LZO +Protobuf

Are there any pointers to get Scalding to work with LZO Protobuf data on HDFS?
I am trying to read files that are stored in binary Protobuf and compressed in LZO using Scalding.
Can we use Elephantbird to read those files? Any pointers will be appreciated!
I have looked at LzoTraits and LzoProtobufScheme, but I am not sure how I should be using them to read the data. Any examples would be great!
Here is an example:
case class SomeProto() extends FixedPathSource("/my/greatData/*")
    with LzoProtobuf[MyProtoClassHere] {
  override def column = classOf[MyProtoClassHere]
}
You can mix with other types of abstract base Sources (like TimePathedSource, or MostRecentGoodSource) in a similar way. You can mix in with LocalTapSource if you want to use the Hadoop-inside-cascading-local trick (if you don't run in cascading local mode, you don't need this).

HP UFT API Test - Saving Response/Checkpoint values

Is there a way to capture and store (or write to a file) the values returned in the Response? (Checkpoint values)
Using HP UFT 11.52
Thanks,
Lynn
I figured it out. In UFT API, under Standard Activities, there are File function modules, including "Write to File". I added the module to the test, set the path and other properties, passed the variable to the file, and it worked! Couldn't be easier.
I mentioned this in my other answer: you can also write it programmatically if you have a dynamic array response; please refer to the link below:
https://stackoverflow.com/a/28012383/3972994
After running a test, you can find a Snapshots/LastIteration directory in the test folder.
In it you can find the return value for each step, saved in a txt file.
Note that if you data-drive the step, only the last iteration will be saved to file.
However, in the test's log (Test dir/Log/vtd_user.log) you can find all the iterations persisted.
Thanks,
Yossi
You do not need to use the standard activities if you do this:
// C# custom code inside the UFT API test
var iResponse = this.Activity.responsebody;

// directoryPath and fileName are placeholders for your own values
System.IO.File.WriteAllText(directoryPath + fileName, iResponse);
The above writes the response to a file and rewrites it on every run.
