HDFS Short circuit reads - hadoop

According to the documentation, short-circuit reads are faster because they don't go through the DataNode. If that is the case:
Why isn't this enabled by default?
In which scenarios do we need short-circuit reads?

Take a look at this article: http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/
Summary of article:
One major downside of the original implementation was its security implications: it had to give clients direct read access to the block files. I guess this was bad for Kerberos-enabled HDFS.
The new implementation passes a file descriptor instead, which supposedly is more secure and faster.
I guess there were some downsides to the old method, but I don't see what the downsides of the new method are. I couldn't find a definitive answer on which version of Hadoop the new method first appeared in.
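For what it's worth, short-circuit reads are an opt-in client/DataNode setting. Below is a minimal sketch of turning them on programmatically, assuming the DataNode exposes a matching domain socket; the socket path is just an example, and these properties normally live in hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ShortCircuitClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Opt in to short-circuit local reads on the client side.
            conf.setBoolean("dfs.client.read.shortcircuit", true);
            // The DataNode must be configured with the same domain socket path
            // (example path; adjust to your deployment).
            conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Default FS: " + fs.getUri());
        }
    }

The read path only short-circuits when the client and the block replica are on the same machine; otherwise it falls back to going through the DataNode as usual.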

Related

Use Cases of NIFI

I have a question about Nifi and its capabilities as well as the appropriate use case for it.
I've read that NiFi is really aiming to create a space which allows for flow-based processing. After playing around with NiFi a bit, what I've also come to realize is its capability to model/shape the data in a way that is useful for me. Is it fair to say that NiFi can also be used for data modeling?
Thanks!
Data modeling is a bit of an overloaded term, but in the context of your desire to model/shape the data in a way that is useful for you, it sounds like it could be a viable approach. The rest of this is under that assumption.
While NiFi's principles and design are closely related to flow-based programming (FBP), its function is a matter of getting data from point A to point B (and possibly back again). Of course, systems aren't inherently talking in the same protocols, formats, or schemas, so there needs to be something to shape the data into what the consumer is anticipating from what the producer is supplying. This gets into common enterprise integration patterns (EIP) [1] such as mediation and routing. In a broader sense, though, it is simply getting the data to those that need it (systems, users, etc.) when and how they need it.
Joe Witt, one of the creators of NiFi, gave a great talk that may be in line with this idea of data shaping in the context of Data Science at a Meetup. The slides of which are available [2].
If you have any additional questions, I would point you to check out the community mailing lists [3] and ask any additional questions so you can dig in more and get a broader perspective.
[1] https://en.wikipedia.org/wiki/Enterprise_Integration_Patterns
[2] http://files.meetup.com/6195792/ApacheNiFi-MD_DataScience_MeetupApr2016.pdf
[3] http://nifi.apache.org/mailing_lists.html
Data modeling might well mean many things to many folks, so I'll be careful about using that term here. What is very clear from what you're asking, though, is that Apache NiFi is a great system for molding the data into the right format, schema, and content you need for your follow-on analytics and processing. NiFi has an extensible model, so you can add processors that do this, use the existing processors in many cases, or even use the ExecuteScript processors to write scripts on the fly to manipulate the data.
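As a rough illustration of that extensibility (this is not taken from the NiFi codebase), here is a minimal custom processor sketch that reshapes FlowFile content in onTrigger; the uppercase transformation is just a placeholder for whatever format/schema shaping you need:

    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    public class UppercaseContent extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .description("FlowFiles whose content was reshaped")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }
            // Rewrite the content in place -- a trivial transformation here,
            // but this is where real format/schema shaping logic would live.
            flowFile = session.write(flowFile, (in, out) -> {
                String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                out.write(text.toUpperCase().getBytes(StandardCharsets.UTF_8));
            });
            session.transfer(flowFile, REL_SUCCESS);
        }
    }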

Nifi processor batch insert - handle failure

I am currently in the process of writing an Elasticsearch NiFi processor. Individual inserts/writes to ES are not optimal; batching documents is preferred instead. What would be considered the optimal approach within a NiFi processor to track (batch) documents (FlowFiles) and, once a certain amount has accumulated, send them in as a batch? The part I am most concerned about is when ES is unavailable (down, network partition, etc.) and that prevents the batch from being successful. The primary point of the question is: given that NiFi has content storage for queuing/back-pressure, etc., is there a preferred method for using that to ensure no FlowFiles get lost if a destination is down? Maybe there is another processor I should look at for an example?
I have looked at the Mongo processor, Merge, etc. to try and get an idea of the preferred approach for batching inside of a processor, but can't seem to find anything specific. Any suggestions would be appreciated.
Good chance I am overlooking some basic functionality baked into Nifi. I am still fairly new to the platform.
Thanks!
Great question and a pretty common pattern. This is why we have the concept of a ProcessSession. It allows you to send zero or more things to an external endpoint and only commit once you know it has been ack'd by the recipient. In this sense it offers at-least-once semantics. If the protocol you're using supports two-phase-commit style semantics, you can get pretty close to the ever-elusive exactly-once semantics. Many of the details of what you're asking about here will depend on the destination system's API and behavior.
There are some examples in the Apache codebase which show ways to do this. One way, depending on the destination's API, is to produce a merged collection of events prior to pushing to the destination system. I think PutMongo and PutSolr operate this way (though the experts on those would need to weigh in). An example that might be more like what you're looking for can be found in PutSQL, which operates on batches of FlowFiles and sends them in a single transaction (on the destination DB).
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/PutSQL.java
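To make the pattern concrete, here is a rough sketch (not the actual PutSQL code) of batching in onTrigger: pull a batch of FlowFiles from the session, attempt the bulk send, and only commit once the destination acknowledges it; on failure, roll the session back so nothing is lost. BulkClient and bulkIndex are hypothetical stand-ins for whatever Elasticsearch client you use:

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    public class PutElasticsearchBatchSketch extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .description("Documents acknowledged by Elasticsearch")
                .build();

        // Hypothetical wrapper around whatever ES bulk API you choose.
        private BulkClient elasticSearchClient;

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            final List<FlowFile> batch = session.get(100); // pull up to 100 queued FlowFiles
            if (batch.isEmpty()) {
                return;
            }
            try {
                final List<String> documents = new ArrayList<>();
                for (FlowFile flowFile : batch) {
                    final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                    session.exportTo(flowFile, bytes);
                    documents.add(new String(bytes.toByteArray(), StandardCharsets.UTF_8));
                }
                elasticSearchClient.bulkIndex(documents); // hypothetical bulk call
                session.transfer(batch, REL_SUCCESS);
                session.commit(); // content leaves NiFi's repositories only after the ack
            } catch (Exception e) {
                // Nothing is lost: rollback (with penalty) returns the FlowFiles to the
                // incoming queue so they are retried once ES is reachable again.
                session.rollback(true);
                getLogger().error("Failed to send batch to Elasticsearch; will retry", e);
            }
        }

        interface BulkClient {
            void bulkIndex(List<String> jsonDocuments) throws Exception;
        }
    }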
Will keep an eye here but you can get the eye of a larger NiFi group at users@nifi.apache.org
Thanks
Joe

The best way to store restart information in spring-batch readers, processors, writers and tasklets

Currently I'm designing my first batch application with Spring Batch, using several tasklets and my own readers, writers, and processors, primarily doing input data checks and TIFF file handling (split, merge, etc.) depending on the input data, i.e. document metadata with the accompanying image files. I want to store and use restart information persisted in the batch_step_execution_context in the Spring Batch job repository. Unfortunately I did not find many examples of where and how to do this best. I want to make the application restartable so that it can continue after error correction at the point where it left off.
What I have done so far, checking in each case that the step information was persisted when an exception occurred:
Implemented ItemStream in a CustomItemWriter, using update() and open() to store and regain information to/from the step execution context, e.g. executionContext.putLong("count", count). Works well (a minimal sketch of this approach is shown below).
Used StepListeners and found that the context information written in beforeStep() was persisted. Also works.
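For reference, a minimal sketch of the ItemStream approach described above, using Spring Batch 4-style signatures; the "count" key and the writer body are purely illustrative:

    import java.util.List;

    import org.springframework.batch.item.ExecutionContext;
    import org.springframework.batch.item.ItemStreamException;
    import org.springframework.batch.item.ItemStreamWriter;

    public class CustomItemWriter implements ItemStreamWriter<String> {

        private static final String COUNT_KEY = "count"; // illustrative key name

        private long count;

        @Override
        public void open(ExecutionContext executionContext) throws ItemStreamException {
            // On a restart the previously persisted value is handed back to us,
            // so we can skip the work that was already done.
            count = executionContext.containsKey(COUNT_KEY)
                    ? executionContext.getLong(COUNT_KEY)
                    : 0L;
        }

        @Override
        public void update(ExecutionContext executionContext) throws ItemStreamException {
            // Called before each chunk commit; the value ends up in
            // BATCH_STEP_EXECUTION_CONTEXT in the job repository.
            executionContext.putLong(COUNT_KEY, count);
        }

        @Override
        public void close() throws ItemStreamException {
            // no resources to release in this sketch
        }

        @Override
        public void write(List<? extends String> items) throws Exception {
            for (String item : items) {
                // ... write the item (e.g. a TIFF page) to its destination ...
                count++;
            }
        }
    }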
I would appreciate help that gives or points to some examples, a "restart tutorial", or sources explaining how to do this in readers, processors, writers, and tasklets. Does it make sense in readers and processors? I'm aware that handling restart information might also depend on the commit interval, restartable flags, etc.
Remark: Maybe I need a deeper understanding of Spring Batch concepts beyond what I have read and tried so far; hints regarding this are also welcome. I consider myself intermediate level, lacking the details needed to make my application use some of the comforts of Spring Batch.

What are the file update requirements of HDFS?

Under the Simple Coherency Model section of the HDFS Architecture guide, it states (emphasis mine):
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.
I am confused by the use of "need not" here. Do they really mean "must not" or "should not"? If so, how can programs like HBase provide update support? If they really do mean "need not" (i.e. "doesn't have to"), what is being conveyed? What file system requires you to change a file once written?
As far as I know, the "need not" is part of the assumption that "simplifies data coherency issues and enables high throughput data access". It actually means "can't". But you can delete the whole file and create it again.
Since hadoop 0.20.2-append (as shown here) you can append data.
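If append is enabled on the cluster, the client-side call looks roughly like this (a sketch; the path is arbitrary and availability depends on your Hadoop version and configuration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Append adds bytes to the end of an existing file;
            // rewriting bytes in the middle is still not supported.
            try (FSDataOutputStream out = fs.append(new Path("/data/events.log"))) {
                out.writeBytes("another record\n");
            }
        }
    }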
From what I have read, I understand that HBase mainly uses memory (the WAL? section 11.8.3) and modifications get appended as markers. For example, to delete a column it writes a tombstone (see section 5.8.1.5) that just marks the delete, with periodic compaction.
Maybe I am wrong. Good moment for me to learn the exact explanation :)

Vectored Referencing buffer implementation

I was reading code from one of the projects on GitHub and came across something called a vectored referencing buffer implementation. Has anyone come across this? What are the practical applications of it? I did a quick Google search and wasn't able to find any simple sample implementation of this.
Some insight would be helpful.
http://www.ibm.com/developerworks/library/j-zerocopy/
http://www.linuxjournal.com/article/6345
http://www.seccuris.com/documents/whitepapers/20070517-devsummit-zerocopybpf.pdf
https://github.com/joyent/node/pull/304
I think some more insight on your specific project/usage/etc would allow for a more specific answer.
However, the term is generally used when changing or designing an interface/function/routine so that it does not allocate another copy of its input in order to perform its operations.
EDIT: OK, after reading the new title, I think you are simply talking about pushing buffers into a vector of buffers. This keeps your code clean, lets you pass any buffer you need to any function call with minimal overhead, and allows for easier cleanup if your code isn't managed.
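If that is the case, a closely related idea is scatter/gather ("vectored") I/O, where an array of buffers is handed to a single write call instead of first being copied into one big buffer. A small sketch using Java NIO's gathering write (the file name and contents are arbitrary):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class VectoredWriteExample {
        public static void main(String[] args) throws IOException {
            // Each piece of the message stays in its own buffer; no concatenation copy.
            ByteBuffer header = ByteBuffer.wrap("HEADER|".getBytes(StandardCharsets.UTF_8));
            ByteBuffer body   = ByteBuffer.wrap("payload bytes|".getBytes(StandardCharsets.UTF_8));
            ByteBuffer footer = ByteBuffer.wrap("FOOTER\n".getBytes(StandardCharsets.UTF_8));

            try (FileChannel channel = FileChannel.open(Paths.get("out.bin"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                // Gathering ("vectored") write: the channel walks the array of buffers
                // in order, similar to writev(2), instead of us merging them first.
                channel.write(new ByteBuffer[] { header, body, footer });
            }
        }
    }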
EDIT 2: Do you mean this http://cpansearch.perl.org/src/TYPESTER/Data-MessagePack-Stream-0.07/msgpack-0.5.7/src/msgpack/vrefbuffer.h
