Storing HDFS data only on specific nodes in a Hadoop Cluster - hadoop

We have a 30-node production cluster. We want to add 5 data nodes for additional storage to handle an interim spike of data (around 2 TB). This data is to be stored temporarily, and we want to get rid of it after 15 days.
Is it possible to make sure that the interim data (2 TB) coming in will be stored only on the newly added data nodes?
I am looking for something similar to YARN node labelling.
Thank you in advance.

Unfortunately I don't know a simple way to achieve this in the same HDFS cluster.
But I think you can achieve this behavior by implementing a custom "Block Placement Policy".
However, performing this task can be somewhat risky and complex.
Here is the HDFS JIRA ticket that added the ability to customize this policy (JIRA TICKET).
To understand what you would be customizing, you can read about the current datanode selection behavior here:
link 1
Also here you can find a post with several references that can be useful on how to implement a custom policy and the risks of it:
Other readings that I recommend if you want to go this way:
link 2
post 2
This is a good paper about an experiment with a custom block placement policy to place replicas in SSD or HDD (Hybrid cluster):
I think that, if possible, it will be simpler to use a second cluster. For example, you can evaluate ViewFs, which uses namespaces to reference each cluster:
viewFs reference
link 3


Nifi processor batch insert - handle failure

I am currently writing an Elasticsearch NiFi processor. Individual inserts/writes to ES are not optimal; batching documents is preferred. What would be considered the optimal approach within a NiFi processor to track documents (FlowFiles) and, once a certain number has accumulated, send them as a batch? The part I am most concerned about is ES being unavailable (down, network partition, etc.), which would prevent the batch from succeeding. The primary point of the question is this: given that NiFi has content storage for queuing, back-pressure, etc., is there a preferred method for using that to ensure no FlowFiles get lost if a destination is down? Maybe there is another processor I should look at for an example?
I have looked at the Mongo processor, Merge, etc. to try to get an idea of the preferred approach for batching inside a processor, but can't seem to find anything specific. Any suggestions would be appreciated.
There is a good chance I am overlooking some basic functionality baked into NiFi. I am still fairly new to the platform.
Great question and a pretty common pattern. This is why we have the concept of a ProcessSession. It allows you to send zero or more things to an external endpoint and only commit once you know it has been ack'd by the recipient. In this sense it offers at-least-once semantics. If the protocol you're using supports two-phase-commit style semantics, you can get pretty close to the ever elusive exactly-once semantic. Many of the details of what you're asking about here will depend on the destination system's API and behavior.
There are some examples in the Apache codebase which reveal ways to do this. One way is to produce a merged collection of events prior to pushing to the destination system, if its API allows that. I think PutMongo and PutSolr operate this way (though the experts on that would need to weigh in). An example that might be more like what you're looking for can be found in PutSQL, which operates on batches of FlowFiles to send in a single transaction (on the destination DB).
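To make the commit-only-after-ack idea concrete, here is a minimal sketch in plain Python. It is a toy model of the pattern, not NiFi's API: the `Session` class, `on_trigger`, and the `send_bulk` callable are all hypothetical names, and the assumption is that `send_bulk` raises `ConnectionError` when the destination is unreachable.

```python
from collections import deque


class Session:
    """Toy stand-in for a transactional session over an input queue (not NiFi code)."""

    def __init__(self, queue):
        self.queue = queue
        self.taken = []

    def get(self, max_items):
        # Pull up to max_items from the queue; they stay "owned" by the session.
        while self.queue and len(self.taken) < max_items:
            self.taken.append(self.queue.popleft())
        return list(self.taken)

    def commit(self):
        # The destination acknowledged the batch, so the items can be dropped.
        self.taken.clear()

    def rollback(self):
        # Put the items back at the front of the queue in their original order.
        self.queue.extendleft(reversed(self.taken))
        self.taken.clear()


def on_trigger(queue, send_bulk, batch_size=100):
    """Send one batch; return how many items were committed."""
    session = Session(queue)
    batch = session.get(batch_size)
    if not batch:
        return 0
    try:
        send_bulk(batch)  # hypothetical bulk call; assumed to raise when the destination is down
    except ConnectionError:
        session.rollback()  # nothing is lost; the batch will be retried later
        return 0
    session.commit()
    return len(batch)
```

Because a batch is only removed from the queue after a successful send, a failure between the send and the commit can lead to a resend, which is why this gives at-least-once rather than exactly-once delivery. In a real processor the corresponding calls would be made on the ProcessSession that the framework provides.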
Will keep an eye here but can get the eye of a larger NiFi group at

Monitor Hadoop Cluster using Collectl

I am evaluating various system monitoring tools to use one to monitor my hadoop cluster.
One of the tools I am impressed by is collectl. I have been playing around with it for a couple of days.
I am struggling to find out how to aggregate the metrics captured by collectl when using colmux.
Say, I have 10 nodes in my hadoop cluster each running collectl as a service. Using colmux I can see the
performance metrics of each node in a single view (in single and multi-line formats). Great!
But what if I want the aggregate of CPU, I/O, etc. across all the nodes in the cluster? That is, I want to find out
how my cluster as a whole is performing by aggregating the performance metrics from each node into corresponding
numbers, thereby giving me cluster-level metrics instead of node-level ones.
Any help is greatly appreciated. Thanks!
I had already answered this on the mailing list, but for the benefit of those not on it I'll repeat myself here.
That's a cool idea. So if I understand you correctly you might see some sort of total line at the bottom? I can always add to my wish list but no promises. But I think I may also have a solution if you don't mind doing a little extra work on your own ;) btw - can I assume you've installed readkey so you can change sort columns with the arrow keys?
If you run colmux with --noesc, it will take it out of full-screen mode and simply print everything as scrolling output. If you then also include "--lines 99999" (or some big number), it will print all the output from all the remote systems so you don't miss anything. Finally, you can pipe the output through perl, python, bash, or whatever your favorite scripting tool might be and do the totals yourself. Whenever you see a new header fly by, print the totals and reset the counters to 0. You could even add timestamps and maybe ultimately make it your own open-source project. I bet others would find it useful too.
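As a rough sketch of the piping approach in Python: the line format below is an assumption, not something taken from the colmux documentation, so check it against your own output first. The sketch assumes header lines start with `#` and every data line is a host name followed by whitespace-separated numeric columns. The function names are made up for illustration.

```python
import sys


def aggregate(lines):
    """Yield one list of column totals per block of data lines.

    Assumed format (verify against your own colmux output): header lines
    start with '#', and each data line is a host name followed by
    whitespace-separated numeric columns.
    """
    totals = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("#"):
            # A new header flew by: emit the totals so far and reset.
            if totals is not None:
                yield totals
            totals = None
            continue
        try:
            values = [float(f) for f in fields[1:]]
        except ValueError:
            continue  # skip lines with non-numeric columns
        if not values:
            continue
        if totals is None:
            totals = values
        elif len(values) == len(totals):
            totals = [t + v for t, v in zip(totals, values)]
    if totals is not None:
        yield totals


def main():
    # Intended to sit at the end of a pipe fed by colmux --noesc --lines 99999.
    for totals in aggregate(sys.stdin):
        print("TOTAL", *(f"{t:g}" for t in totals))
```

To use it as a filter, call `main()` at the bottom of the file and pipe the colmux output into the script. Note that a plain sum is only meaningful for counters such as KB read or written; for percentage columns such as CPU you would probably want to divide by the number of hosts instead.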

Hadoop use-case scenario

I would like to have some expert views on the use of a Big Data platform like Hadoop in one of my project scenarios. I am a complete novice in this technology although I understand databases like MySQL well.
We are creating a product which would be used to analyse data from social media. So the input data would be a large volume of tweets, Facebook posts, user profiles, YouTube data, data from blogs, etc. On top of this I would have a web application to help me view and analyse this data. As the requirement makes clear, I would need a sort of real-time system. So if I have a tweet coming in, I would like to have it readily available to my web app for processing. Batch data processing may not be a suitable choice for my application.
My questions are:
Is a Hadoop engine a good choice for me?
What parameters should I base my decision on?
Is it also a good option to use a Multi Cluster MySQL engine as opposed to Hadoop?
Is there any benchmarking in terms of Size and velocity of data in which Hadoop becomes a good choice?
Hadoop is not appropriate for near-real-time or interactive analysis. Hadoop was designed for big batch processing of, say, a few hours of data or more. I used to use Hadoop to process any dataset of around 10 GB or more (which is still a bit overkill); once it gets to 100 GB, you definitely want something like Hadoop.
Now my recommendation would be Spark, as it is much more modern, much faster, more flexible, more powerful, and has a Spark Streaming module for getting closer to real-time analysis. Read all about it!
In this case I prefer the Lambda Architecture.
With the Lambda Architecture you have two routes: a fast route with a NoSQL database for the current information, and a batch route with Hadoop/HDFS for the archived data. A merge component combines the two data sources in one query, so you get the whole data set back in near real time.
Image about lambda architecture:
We created a PoC project with the Lambda Architecture (also for Twitter analysis), and it's working fine.
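The two routes and the merge component can be modeled in a few lines of Python. This is only an illustration of the idea, assuming events are `(timestamp, tag)` pairs and the query is a simple count per tag. The function names and the in-memory lists are hypothetical stand-ins for the HDFS archive and the NoSQL store.

```python
from collections import Counter


def batch_view(archive):
    """Batch route: recompute counts over the full archive (slow but complete)."""
    return Counter(tag for _, tag in archive)


def speed_view(recent, batch_cutoff):
    """Fast route: count only events newer than the last batch run."""
    return Counter(tag for ts, tag in recent if ts > batch_cutoff)


def query(tag, batch, speed):
    """Merge component: answer one query from both data sources."""
    return batch[tag] + speed[tag]


# Events as (timestamp, tag); the last batch run covered everything up to ts 3.
archive = [(1, "hadoop"), (2, "spark"), (3, "hadoop")]
recent = [(3, "hadoop"), (4, "hadoop"), (5, "storm")]

batch = batch_view(archive)
speed = speed_view(recent, batch_cutoff=3)
print(query("hadoop", batch, speed))
```

The cutoff is what keeps an event from being counted twice: anything the batch route has already processed is ignored by the fast route.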
Spark will be the best solution for your problem. You can also look at other in-memory databases.

How to Throttle DataStage

I work on a project where a number of DataStage sequences can be run in parallel. One in particular performs poorly and takes a lot of resources, impacting the shared environment. A performance tuning initiative is in progress but will take time.
In the meantime I was hopeful that we could throttle DataStage to restrict the resources that could be used by this particular job/sequence - however I'm not personally experienced with DataStage specifically.
Can anyone comment if this facility exists in DataStage (v8.5 I believe), and point me in the direction of some further detail.
Secondly, I know that we can throttle based on the user (I think this ties into AIX 'ulimit', but I'm not sure). Is it easy/possible to run different jobs/sequences as different users?
In this type of situation, resources for a particular job can be restricted by specifying the number of nodes and resources in a config file. This is possible in 8.5 and you may find something at
Revolution_In_Progress is right.
DataStage PX has the notion of a configuration file. That file can be specified for all the jobs you run, or it can be overridden on a job-by-job basis. The configuration file can be used to limit the physical resources that are associated with a job.
In this case, if you have a 4-node config file for most of your jobs, you may want to write a 2-node config file for the job with the performance issue. That way, you'll get the minimum amount of parallelism (without going completely sequential) and use the minimum amount of resources.
A sequence is a collection of individual jobs.
In most cases, jobs in a sequence can be rearranged to run serially. Please check the organisation of the sequence and do a critical path analysis to identify the jobs that need not run in parallel with critical jobs.

Streaming data and Hadoop? (not Hadoop Streaming)

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I missing something? Is there a different MapReduce tool that works with data being read in from an open socket? Scalability is an issue here, so I'd prefer to let the MapReducer handle the messy parallelization stuff.
I've played around with Cascading and was able to run a job on a static file accessed via HTTP, but this doesn't actually solve my problem. I could use curl as an intermediate step to dump the data somewhere on a Hadoop filesystem and write a watchdog to fire off a new job every time a new chunk of data is ready, but that's a dirty hack; there has to be some more elegant way to do this. Any ideas?
The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system (for one thing, if there is no end to the data, Reducers can't ever start, as they must start after the map phase is finished).
Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
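A minimal sketch of such a watchdog in Python, under a few assumptions: `list_pending` and `submit_job` are hypothetical callables that you would supply (for example, one that lists rotated files not yet processed and one that launches the Hadoop job), and `list_pending` is responsible for no longer returning files that have already been submitted.

```python
import time


def pick_batch(pending, min_bytes):
    """Return the paths to process now, or [] if too little has accumulated.

    pending: list of (path, size_in_bytes) for rotated files not yet processed.
    """
    total = sum(size for _, size in pending)
    if total < min_bytes:
        return []  # not worth the job start-up overhead yet
    return [path for path, _ in pending]


def watch(list_pending, submit_job, min_bytes, poll_seconds=60, max_polls=None):
    """Poll the dumping ground and start a job whenever enough data is waiting."""
    polls = 0
    while max_polls is None or polls < max_polls:
        batch = pick_batch(list_pending(), min_bytes)
        if batch:
            submit_job(batch)
        polls += 1
        time.sleep(poll_seconds)
```

The `min_bytes` threshold is the knob that keeps jobs from being launched on inputs too small to justify the overhead. A distributed version would additionally need coordination (the ZooKeeper part) so that two watchdogs do not submit the same files.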
HBase is a BigTable clone in the Hadoop ecosystem that may be interesting to you, as it allows for a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
What about It's made for processing streaming data.
A new product is rising: Storm - Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more
I think you should take a look over Esper CEP ( ).
Yahoo S4
It provides real-time stream computing, similar to MapReduce.
Twitter's Storm is what you need; give it a try!
Multiple options here.
I suggest the combination of Kafka and Storm + (Hadoop or NoSql) as the solution.
We have already built our big data platform using those open-source tools, and it works very well.
Your use case sounds similar to the issue of writing a web crawler using Hadoop - the data streams back (slowly) from sockets opened to fetch remote pages via HTTP.
If so, then see Why fetching web pages doesn't map well to map-reduce. And you might want to check out the FetcherBuffer class in Bixo, which implements a threaded approach in a reducer (via Cascading) to solve this type of problem.
As you know, the main issues with using Hadoop for stream mining are, first, that it uses HDFS, which is disk-based, and disk operations add latency that results in missing data from the stream; and second, that the pipeline is not parallel. MapReduce generally operates on batches of data, not on individual instances as stream processing does.
I recently read an article about M3, which apparently tackles the first issue by bypassing HDFS and performing in-memory computations in an object database. For the second issue, they use incremental learners, which no longer run in batch. It is worth checking out: M3: Stream Processing on Main-Memory MapReduce. I could not find the source code or API of M3 anywhere; if somebody finds it, please share the link here.
Hadoop Online is another prototype that attempts to solve the same issues as M3: Hadoop Online
Apache Storm is the key solution to the issue, but it is not enough on its own. You still need some equivalent of MapReduce, which is why you need a library called SAMOA; it has great algorithms for online learning, which Mahout largely lacks.
Several mature stream processing frameworks and products are available on the market. Open source frameworks are e.g. Apache Storm or Apache Spark (which can both run on top of Hadoop). You can also use products such as IBM InfoSphere Streams or TIBCO StreamBase.
Take a look at this InfoQ article, which explains stream processing and all these frameworks and products in detail: Real Time Stream Processing / Streaming Analytics in Combination with Hadoop. The article also explains how this is complementary to Hadoop.
By the way: Many software vendors such as Oracle or TIBCO call this stream processing / streaming analytics approach "fast data" instead of "big data" as you have to act in real time instead of batch processing.
You should try Apache Spark Streaming.
It should work well for your purposes.
