Apache Spark move/rename successfully processed files - hadoop

I would like to use Spark Streaming (1.1.0-rc2 Java API) to process some files and move/rename them once the processing is done successfully, in order to push them on to other jobs.
I thought about using the file path included in the name of the generated RDDs (newAPIHadoopFile), but how can we determine that processing of a file has finished successfully?
I'm also not sure this is the right way to achieve it, so any ideas are welcome.
EDIT:
Here is some pseudo code to make it clearer:
logs.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
    @Override
    public Void call(JavaRDD<String> log, Time time) throws Exception {
        String fileName = log.name();
        String newlog = Process(log);
        SaveResultToFile(newlog, time);
        // are we done with the file so we can move it ????
        return null;
    }
});

You aren't guaranteed that the input is backed by an HDFS file. But it doesn't seem like you need that given your question. You create a new file and write something to it. When the write completes, you're done. Move it with other HDFS APIs.
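For reference, a minimal sketch of that move step, assuming the RDD name really is the HDFS path of the input file (the target directory is hypothetical):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Inside the foreachRDD callback, after SaveResultToFile(...) has returned:
FileSystem fs = FileSystem.get(new Configuration());
Path source = new Path(fileName);                          // path taken from log.name()
Path target = new Path("/processed/" + source.getName());  // hypothetical archive directory
if (!fs.rename(source, target)) {                          // metadata-only move within HDFS
    throw new IOException("Could not move " + source + " to " + target);
}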

Multithreaded Use of Spring Pulsar

I am working on a project to read from our existing ElasticSearch instance and produce messages in Pulsar. If I do this in a highly multithreaded way without any explicit synchronization, I get many occurrences of the following log line:
Message with sequence id X might be a duplicate but cannot be determined at this time.
That is produced from this line of code in the Pulsar Java client:
https://github.com/apache/pulsar/blob/a4c3034f52f857ae0f4daf5d366ea9e578133bc2/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java#L653
When I add a synchronized block to my method, synchronizing on the pulsar template, the error disappears, but my publish rate drops substantially.
Here is the current working implementation of my method that sends Protobuf messages to Pulsar:
public <T extends GeneratedMessageV3> CompletableFuture<MessageId> persist(T o) {
    var descriptor = o.getDescriptorForType();
    PulsarPersistTopicSettings settings = pulsarPersistConfig.getSettings(descriptor);
    MessageBuilder<T> messageBuilder = Optional.ofNullable(pulsarPersistConfig.getMessageBuilder(descriptor))
            .orElse(DefaultMessageBuilder.DEFAULT_MESSAGE_BUILDER);
    Optional<ProducerBuilderCustomizer<T>> producerBuilderCustomizerOpt =
            Optional.ofNullable(pulsarPersistConfig.getProducerBuilder(descriptor));
    PulsarOperations.SendMessageBuilder<T> sendMessageBuilder;
    sendMessageBuilder = pulsarTemplate.newMessage(o)
            .withSchema(Schema.PROTOBUF_NATIVE(o.getClass()))
            .withTopic(settings.getTopic());
    producerBuilderCustomizerOpt.ifPresent(sendMessageBuilder::withProducerCustomizer);
    sendMessageBuilder.withMessageCustomizer(mb -> messageBuilder.applyMessageBuilderKeys(o, mb));
    synchronized (pulsarTemplate) {
        try {
            return sendMessageBuilder.sendAsync();
        } catch (PulsarClientException re) {
            throw new PulsarPersistException(re);
        }
    }
}
The original version of the above method did not have the synchronized(pulsarTemplate) { ... } block. It performed faster, but generated a lot of logs about duplicate messages, which I knew to be incorrect. Adding the synchronized block got rid of the log messages, but slowed down publishing.
What are the best practices for multithreaded access to the PulsarTemplate? Is there a better way to achieve very high throughput message publishing?
Should I look at using the reactive client instead?
EDIT: I've updated the code block to show the minimum synchronization necessary to avoid the log lines, which is just synchronizing during the .sendAsync(...) call.
Your usage without the synchronized block should work. I will look into it, though, to see if anything else is going on. In the meantime, it would be great if you could give the Reactive client a try.
This issue was initially tracked here, and the final resolution was that it was a Pulsar issue that has been fixed in Pulsar 2.11.
Please try updating to Pulsar 2.11.
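For reference, once the duplicate-detection logs are gone, the unsynchronized persist(...) above can simply be called from many threads and the returned futures awaited; a minimal sketch using only standard Java (the type and collection names are placeholders):
List<CompletableFuture<MessageId>> pending = new ArrayList<>();
for (MyProto doc : esResults) {       // 'MyProto' and 'esResults' stand in for your ElasticSearch results
    pending.add(persist(doc));        // sendAsync() is already non-blocking; no synchronized block needed
}
// Block only at the end of the batch, until the broker has acknowledged every message.
CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();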

Second value of array is not written by openCSV to file, especially when jar is executed within a shell file but working fine on IDE [duplicate]

I have two raw streams, and I am joining them. I then want to count the total number of events that have been joined and the number that have not. I am doing this with a map on joinedEventDataStream, as shown below:
joinedEventDataStream.map(new RichMapFunction<JoinedEvent, Object>() {
    @Override
    public Object map(JoinedEvent joinedEvent) throws Exception {
        number_of_joined_events += 1;
        return null;
    }
});
Question # 1: Is this the appropriate way to count the number of events in the stream?
Question # 2: I have noticed a weird behavior, which some of you might not believe. When I run my Flink program in the IntelliJ IDE, it shows the correct value of number_of_joined_events, but 0 when I submit the program as a JAR. So when I run the program as a JAR file I get the initial value of number_of_joined_events instead of the actual count. Why does this happen only when submitting a JAR file and not in the IDE?
Your approach does not work, and the behavior you noticed when executing the program from a JAR file is expected.
I don't know how number_of_joined_events is defined, but I assume it's a static variable in your program. When you run the program in your IDE, it runs in a single JVM, so all operators have access to the static variable. When you submit a JAR file to a remote process, the program is executed in a different JVM (possibly multiple JVMs), and the static variable in your client process is never updated.
You can use Flink's metrics or a ReduceFunction that sums 1s to count the number of processed records.
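For illustration, a minimal sketch of the metrics route, registering a counter in a RichMapFunction (the metric name and the pass-through return type are assumptions, not from the question):
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

joinedEventDataStream.map(new RichMapFunction<JoinedEvent, JoinedEvent>() {
    private transient Counter joinedCounter;

    @Override
    public void open(Configuration parameters) {
        // Registered per subtask; values are reported to the Flink web UI and metric reporters.
        joinedCounter = getRuntimeContext().getMetricGroup().counter("number_of_joined_events");
    }

    @Override
    public JoinedEvent map(JoinedEvent joinedEvent) {
        joinedCounter.inc();
        return joinedEvent; // pass the event through instead of returning null
    }
});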

Hadoop jobs using same reducer output to same file

I ran into an interesting situation, and now I am looking for how to do it intentionally. On my local single-node setup, I ran two jobs simultaneously from the terminal. Both jobs use the same reducer; they only differ in the map function (the aggregation key, i.e. the group by). The output of both jobs was written to the output of the first job (the second job did create its own folder, but it was empty). What I am working on is providing rollup aggregations across various levels, and this behavior is fascinating to me: the aggregation output from two different levels is available to me in one single file (and perfectly sorted).
My question is how to achieve the same on a real Hadoop cluster, where we have multiple data nodes, i.e. I programmatically initiate multiple jobs, all accessing the same input file, mapping the data differently but using the same reducer, and the output ends up in one single file rather than in 5 different output files.
Please advise.
I took a look at "merge output files after reduce phase" before I decided to ask my question.
Thanks and Kind regards,
Moiz Ahmed.
When different mappers consume the same input file, in other words the same data structure, the source code for all these different mappers can be placed into separate methods of a single Mapper implementation, and a parameter from the context can be used to decide which map functions to invoke. On the plus side, you then only need to start one MapReduce job. Example in pseudo code:
class ComplexMapper extends Mapper {
    protected BitSet mappingBitmap = new BitSet();

    protected void setup(Context context) ... {
        String params = context.getConfiguration().get("params");
        // analyze params and set bits in the mappingBitmap
    }

    protected void mapA(Object key, Object value, Context context) {
        .....
        context.write(keyA, value);
    }

    protected void mapB(Object key, Object value, Context context) {
        .....
        context.write(keyB, value);
    }

    protected void mapC(Object key, Object value, Context context) {
        .....
        context.write(keyC, value);
    }

    public void map(Object key, Object value, Context context) ..... {
        if (mappingBitmap.get(1)) {
            mapA(key, value, context);
        }
        if (mappingBitmap.get(2)) {
            mapB(key, value, context);
        }
        if (mappingBitmap.get(3)) {
            mapC(key, value, context);
        }
    }
}
Of course it can be implemented more elegantly with interfaces etc.
In the job setup just add:
Configuration conf = new Configuration();
conf.set("params", "AB");
Job job = new Job(conf);
As Praveen Sripati mentioned, having a single output file forces you into having just one reducer, which might be bad for performance. You can always concatenate the part-* files when you download them from HDFS. Example:
hadoop fs -text /output_dir/part* > wholefile.txt
Usually each reducer task produces a separate file in HDFS, so that the reduce tasks can operate in parallel. If the requirement is to have one output file from the reduce phase, then configure the job to have one reducer task. The number of reducers can be configured using the mapred.reduce.tasks property, which defaults to 1. The con of this approach is that there is only one reducer, which might be a bottleneck for the job to complete.
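For example, a single reduce task can be requested either through that property or directly on the job object (a small sketch in the same old-style API as the snippets above; whether one reducer is acceptable depends on your data volume):
Configuration conf = new Configuration();
conf.set("mapred.reduce.tasks", "1");  // property-based, or equivalently:
Job job = new Job(conf);
job.setNumReduceTasks(1);              // single reducer -> single part-r-00000 output file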
Another option is to use some other output format which allows multiple reducers to write to the same sink simultaneously, such as DBOutputFormat. Once the job is complete, the results can be exported from the DB into a flat file. This approach enables multiple reduce tasks to run in parallel.
Another option is to merge the output files, as mentioned in the OP. So, based on the pros and cons of each approach and the volume of data to be processed, one of the approaches can be chosen.

What do the sync and syncFs methods of SequenceFile.Writer mean?

Environment: Hadoop 0.20.2-cdh3u5
I am trying to upload log data (10 GB) to HDFS with a customized tool that uses SequenceFile.Writer.
SequenceFile.Writer w = SequenceFile.createWriter(
        hdfs,
        conf,
        p,
        LongWritable.class,
        Text.class,
        4096,
        hdfs.getDefaultReplication(),
        hdfs.getDefaultBlockSize(),
        compressionType,
        codec,
        null,
        new Metadata());
During the upload, if the tool crashes (without invoking the close() method explicitly), will the data that has already been uploaded be lost?
Should I invoke sync() or syncFs() periodically, and what do these two methods mean?
Yes, probably.
sync() creates a sync point. As stated in the book "Hadoop: The Definitive Guide" by Tom White (Cloudera),
a sync point is a point in the stream which can be used to resynchronize with a record boundary if the reader is "lost", for example after seeking to an arbitrary position in the stream.
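To make the idea of a sync point concrete, here is a small sketch of a reader using it (the path and seek offset are hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("/logs/data.seq"), conf);
reader.sync(1024 * 1024);           // skip to the first sync point after an arbitrary byte offset
LongWritable key = new LongWritable();
Text value = new Text();
while (reader.next(key, value)) {   // reading resumes cleanly at a record boundary
    // process the record
}
reader.close();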
Now, the implementation of syncFs() is pretty simple:
public void syncFs() throws IOException {
    if (out != null) {
        out.sync(); // flush contents to file system
    }
}
where out is an FSDataOutputStream. Again, the same book states:
HDFS provides a method for forcing all buffers to be synchronized to the datanodes via the sync() method on FSDataOutputStream. After a successful return from sync(), HDFS guarantees that the data written up to that point in the file is persisted and visible to all readers. In the event of a crash (of the client or HDFS), the data will not be lost.
But a footnote warns to look at bug HDFS-200, since the visibility mentioned above was not always honored.
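In practice that means data still sitting in client-side buffers when the tool crashes can be lost. A minimal sketch of calling syncFs() periodically during the upload, reusing the writer w from the question (the record source and flush interval are placeholders):
long written = 0;
for (String line : logLines) {                   // 'logLines' stands in for your 10 GB of log data
    w.append(new LongWritable(written), new Text(line));
    if (++written % 10000 == 0) {
        w.syncFs();                              // push buffered bytes out to the datanodes
    }
}
w.close();                                       // close() flushes whatever is left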

Multiple inputs into a Mapper in Hadoop

I am trying to send two files to a Hadoop reducer.
I tried DistributedCache, but anything I add with addCacheFile in main doesn't seem to be available via getLocalCacheFiles in the mapper.
Right now I am using FileSystem to read the file, but since I am running locally I can just send the file name. I am wondering how to do this if I were running on a real Hadoop cluster.
Is there any way to send values to the mapper other than the file it is reading?
I also had a lot of problems with the distributed cache and with sending parameters. The options that worked for me are below:
For distributed cache usage:
For me it was a nightmare to get the URL/path to a file on HDFS in Map or Reduce, but with a symlink it worked.
In the run() method of the job:
DistributedCache.addCacheFile(new URI(file+"#rules.dat"), conf);
DistributedCache.createSymlink(conf);
and then read it in Map or Reduce:
in the class header, before the methods:
public static FSDataInputStream in;
and then in the setup() method of Map or Reduce:
in = FileSystem.get(new Configuration()).open(new Path("rules.dat"));
For parameters:
To send some values to Map or Reduce (it could be a filename to open from HDFS):
public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    ...
    conf.set("level", otherArgs[2]); // sets the variable "level" from the command line; it could be a filename
    ...
}
then in the Map or Reduce class just:
int level = Integer.parseInt(conf.get("level")); // this is an int, but you can also read strings, etc.
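Inside the Map or Reduce class the Configuration comes from the context, typically in setup(); a small sketch (only the "level" property is from the original answer):
private int level;

@Override
protected void setup(Context context) {
    Configuration conf = context.getConfiguration();
    level = Integer.parseInt(conf.get("level")); // same property that was set in run()
}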
If the distributed cache suits your needs, it is the way to go.
getLocalCacheFiles works differently in local mode and in distributed mode (it actually does not work in local mode).
Look at this link: http://developer.yahoo.com/hadoop/tutorial/module5.html
and look for the phrase "As a cautionary note:".
