Hadoop use-case scenario - hadoop

I would like to have some expert views on the use of a Big Data platform like Hadoop in one of my project scenarios. I am a complete novice in this technology although I understand databases like MySQL well.
We are creating a product which would be used to analyse data from social media. So the input data would be a large volume of tweets, facebook posts, user profiles, YouTube data and data from blogs etc. On top of this I would be having a web application to help me view and analyse this data. As the requirement makes it clear, I would be needing a sort of real time system. So if I have a tweet coming in, I would like to have it available to my web app readily for processing. Batch data processing may not be a suitable choice for my application.
My questions are:
Is a Hadoop engine a good choice for me?
What are the parameters I should base my decision on?
Is it also a good option to use a Multi Cluster MySQL engine as opposed to Hadoop?
Is there any benchmarking in terms of Size and velocity of data in which Hadoop becomes a good choice?

Hadoop is not appropriate for near real time / interactive analysis. Hadoop was designed for big batch processing of, say, a few hours of data or more. I used to use Hadoop to process any dataset of around 10 GB or more (which is still a bit overkill); once it gets to 100 GB you definitely want something like Hadoop.
Now my recommendation would be Spark, as it is much more modern, much faster, more flexible, more powerful, and has a Spark Streaming module for getting closer to real-time analysis. Read all about it! https://spark.apache.org/

In this case I prefer the Lambda Architecture.
With Lambda Architecture you have two routes: a fast route with a NoSQL database for the current information, and a batch route with Hadoop HDFS for the archive data. A merge component combines the two data sources in one query, so you get the complete data set in near real time.
Image about lambda architecture: http://i.stack.imgur.com/eofRW.png
We created a PoC project with Lambda Architecture (also for Twitter analysis), and it's working fine.
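To make the merge step concrete, here is a minimal sketch in plain Python. It assumes both routes expose simple hashtag counts; the dictionaries and the `merge_views` helper are hypothetical stand-ins for the NoSQL store and the HDFS-backed batch output, not part of any particular framework.

```python
from collections import Counter

def merge_views(batch_view, speed_view):
    """Combine precomputed batch counts with recent real-time counts."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # adds counts key by key
    return dict(merged)

# Batch route: counts recomputed periodically over the archive.
batch_view = {"#hadoop": 120, "#spark": 95}
# Fast route: counts for tweets that arrived since the last batch run.
speed_view = {"#spark": 5, "#storm": 2}

result = merge_views(batch_view, speed_view)
print(result)
```

The design point is that the batch view can be thrown away and rebuilt at any time, while the fast view only has to cover the gap since the last batch run.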

Spark will be the best solution for your problem. You can also look at other in-memory databases.


Separate Data Access Layers for Distributed Compute

Currently my product maintains a DAL that is separated from business logic and exposed via a set of services where each service generally corresponds to an element i.e. Car objects are accessed via the CarService. These services are powered through Spring Data Repositories and access data (models) stored in both PostgreSQL and Elasticsearch.
We are now processing more and more data (documents in, our models out or documents in, clustering, models out) and have realized that computation has become a bottleneck. To overcome this we are evaluating Spark or Apache Beam to distribute the computation horizontally which would solve the problem.
After looking into the Spark (and Beam) frameworks I have found that they generally provide their own integration (or plugin) for reading/writing from/to datasources, which in and of itself is great. The problem for me is that I can't find any way for these frameworks to support distributed reading/writing through our current set of services. Spark requires RDD and Beam requires PCollection, and I'd rather not support 2 methods of reading/writing from our datastores to accommodate them.
My Question
Has anyone encountered this before? What was your strategy?
Did you go ahead and support 2 types of DAL? If so, were there any caveats with this especially with regards to the ongoing maintenance of the code?
In software engineering, multi-tier architecture is a client-server architecture in which the presentation, the application processing and the data management are logically separate processes. This logical separation (of crosscutting concerns) helps with performance, scalability and maintenance.
Keep in mind that tiers are logical levels, which means there may or may not be as many physical layers.
If you are going with Image 1 then there is no need for new DAO layers, but for Image 2 I would suggest creating a separate project and using an EAI pattern to communicate between the two projects.
Image 1:
In Image 1 you can process the data, store it in the database and use the same DAO layer to read it.
Image 2:
In Image 2 you can create a separate layer where you submit your jobs and collect the results directly in your Spring code.
Apache Spark and big data systems have different architecture styles; please read the following links.
What are the differences between kappa-architecture and lambda-architecture

Use Cases of NIFI

I have a question about Nifi and its capabilities as well as the appropriate use case for it.
I've read that Nifi is really aiming to create a space which allows for flow-based processing. After playing around with Nifi a bit, what I've also come to realize is its capability to model/shape the data in a way that is useful for me. Is it fair to say that Nifi can also be used for data modeling?
Data modeling is a bit of an overloaded term, but in the context of your desire to model/shape the data in a way that is useful for you, it sounds like it could be a viable approach. The rest of this is under that assumption.
While NiFi employs dataflow through principles and design closely related to flow based programming (FBP) as a means, the function is a matter of getting data from point A to B (and possibly back again). Of course, systems aren't inherently talking in the same protocols, formats, or schemas, so there needs to be something to shape the data into what the consumer is anticipating from what the producer is supplying. This gets into common enterprise integration patterns (EIP) [1] such as mediation and routing. In a broader sense though, it is simply getting the data to those that need it (systems, users, etc) when and how they need it.
Joe Witt, one of the creators of NiFi, gave a great talk that may be in line with this idea of data shaping in the context of Data Science at a Meetup. The slides of which are available [2].
If you have any additional questions, I would point you to check out the community mailing lists [3] and ask any additional questions so you can dig in more and get a broader perspective.
[1] https://en.wikipedia.org/wiki/Enterprise_Integration_Patterns
[2] http://files.meetup.com/6195792/ApacheNiFi-MD_DataScience_MeetupApr2016.pdf
[3] http://nifi.apache.org/mailing_lists.html
Data modeling might well mean many things to many folks, so I'll be careful about using that term here. What I do think is very clear in what you're asking is that Apache NiFi is a great system to help mold the data into the right format, schema and content you need for your follow-on analytics and processing. NiFi has an extensible model, so you can add processors that do this, or in many cases use the existing processors. You can even use the ExecuteScript processors to write scripts on the fly to manipulate the data.
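As an illustration of the kind of reshaping meant here, a minimal plain-Python sketch follows. It deliberately does not use the NiFi scripting API; the field names, the `FIELD_MAP` table and the `reshape` function are all hypothetical, and the logic is only the sort of thing one might place inside a script or custom processor.

```python
import json

# Hypothetical mapping from the producer's field names to the consumer's.
FIELD_MAP = {"user_name": "user", "ts": "timestamp"}

def reshape(record_json):
    """Rename fields and fix types so the record matches the consumer's schema."""
    record = json.loads(record_json)
    shaped = {FIELD_MAP.get(key, key): value for key, value in record.items()}
    shaped["timestamp"] = int(shaped["timestamp"])  # producer sends it as a string
    return json.dumps(shaped, sort_keys=True)

incoming = '{"user_name": "alice", "ts": "1461000000", "text": "hello"}'
outgoing = reshape(incoming)
print(outgoing)
```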

Hadoop and Stata

Does anyone have any experience using Stata and Hadoop? Stata 13 now has a Java Plugin API, so I think it should be straightforward to get them to play nice.
I am particularly interested in being able to parse weblog data to get it into a form suitable for statistical analysis.
This question came up on Statalist recently, but there was no response, so I thought I would try it here where the audience is more likely to have experience with this technology.
I think it would be easier to do something like this using the ELK Stack (http://www.elastic.co). Logstash (the middle layer) has several parsers/tokenizers/analyzers built on the Apache Lucene engine for cleaning and formatting log data and can push the resulting data into Elasticsearch, which exposes an HTTP API that you can curl fairly easily to get results (e.g., use insheetjson and pass the HTTP GET request as the URL and it should be imported into Stata without much problem).
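If importing the JSON directly is awkward, another option is to flatten the search response to CSV first. The sketch below assumes the usual Elasticsearch search response layout, where documents sit under `hits.hits[]._source`; the sample response and the field names (`clientip`, `response`, `bytes`) are made up for illustration, and no HTTP call is made.

```python
import csv
import io
import json

def hits_to_csv(response_text, fields):
    """Flatten the `_source` of each search hit into CSV text."""
    response = json.loads(response_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fields, lineterminator="\n")
    writer.writeheader()
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        # Missing fields become empty cells.
        writer.writerow({name: source.get(name, "") for name in fields})
    return out.getvalue()

# Hard-coded stand-in for the body of a search response.
sample = json.dumps({"hits": {"hits": [
    {"_source": {"clientip": "10.0.0.1", "response": 200, "bytes": 512}},
    {"_source": {"clientip": "10.0.0.2", "response": 404}},
]}})

csv_text = hits_to_csv(sample, ["clientip", "response", "bytes"])
print(csv_text)
```

The resulting text can be written to a file and read into Stata as a delimited file.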
I've been trying to cobble together a program to use the Jackson JSON library to build out more robust JSON I/O capabilities from within Stata and would definitely not mind trying to work with others to get it done.
Hope this helps,
I'll take an (un?)educated stab at this. From the looks of the java API, the caller seems to treat Stata as essentially a datastore. If that's the case, then I would imagine Stata would fit in to the hadoop world as a database and would be accessed by its own InputFormat and OutputFormat. In your specific case I'd imagine you'd write a StataOutputFormat which your reducer would use to write the parsed data. The only drawback seems to be your referenced comments that Stata apps tend to be I/O bound so I don't know that using hadoop is really going to help you since
you'll have to write all that data anyway, and
that write will be I/O bound, whether you use hadoop or not.
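Whichever route is taken, the parsing step itself is small. Here is a minimal sketch, assuming the weblogs are in Common Log Format (the question does not say which format is in use); `LOG_PATTERN` and `parse_line` are hypothetical names. In a Hadoop job this per-line logic would sit in the mapper.

```python
import re

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
row = parse_line(line)
print(row)
```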

Streaming data and Hadoop? (not Hadoop Streaming)

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I missing something? Is there a different MapReduce tool that works with data being read in from an open socket? Scalability is an issue here, so I'd prefer to let the MapReducer handle the messy parallelization stuff.
I've played around with Cascading and was able to run a job on a static file accessed via HTTP, but this doesn't actually solve my problem. I could use curl as an intermediate step to dump the data somewhere on a Hadoop filesystem and write a watchdog to fire off a new job every time a new chunk of data is ready, but that's a dirty hack; there has to be some more elegant way to do this. Any ideas?
The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system (for one thing, if there is no end to the data, Reducers can't ever start, as they must start after the map phase is finished).
Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
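The "large enough" check can be sketched as follows. This is only an illustration: it uses a local directory as a stand-in for the HDFS dumping ground, and `pick_batch` and the 1000-byte threshold are hypothetical. A real watchdog would list HDFS paths and submit a job instead of just returning the file list.

```python
import os
import tempfile

def pick_batch(directory, min_bytes):
    """Return the files to process once their combined size reaches min_bytes, else []."""
    files = sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    total = sum(os.path.getsize(path) for path in files)
    return files if total >= min_bytes else []

# Demo: one rotated log is too small, two together cross the threshold.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "log.1"), "w") as handle:
    handle.write("x" * 600)
too_small = pick_batch(demo_dir, min_bytes=1000)

with open(os.path.join(demo_dir, "log.2"), "w") as handle:
    handle.write("x" * 600)
ready = pick_batch(demo_dir, min_bytes=1000)
print(too_small, ready)
```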
Hbase is a BigTable clone in the hadoop ecosystem that may be interesting to you, as it allows for a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
What about http://s4.io/. It's made for processing streaming data.
A new product is rising: Storm - Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more
I think you should take a look over Esper CEP ( http://esper.codehaus.org/ ).
Yahoo S4 http://s4.io/
It provides real-time stream computing, similar to MapReduce.
Twitter's Storm is what you need; give it a try!
Multiple options here.
I suggest the combination of Kafka and Storm + (Hadoop or NoSql) as the solution.
We have already built our big data platform using these open source tools, and it works very well.
Your use case sounds similar to the issue of writing a web crawler using Hadoop - the data streams back (slowly) from sockets opened to fetch remote pages via HTTP.
If so, then see Why fetching web pages doesn't map well to map-reduce. And you might want to check out the FetcherBuffer class in Bixo, which implements a threaded approach in a reducer (via Cascading) to solve this type of problem.
As you know, the main issues with using Hadoop for stream mining are, first, that it uses HDFS, which is disk-based, and disk operations add latency that will result in missing data from the stream; and second, that the pipeline is not parallel. MapReduce generally operates on batches of data, not on individual instances as stream processing does.
I recently read an article about M3, which apparently tackles the first issue by bypassing HDFS and performing in-memory computations in an object database. For the second issue, they use incremental learners, which no longer run in batch. It is worth checking out: M3: Stream Processing on Main-Memory MapReduce. I could not find the source code or API of M3 anywhere; if somebody finds it, please share the link here.
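To show what "incremental rather than batch" means in practice, here is a small sketch of a running mean and variance that is updated one instance at a time (Welford's method). It is a simple stand-in for an incremental learner and is not taken from M3; the `RunningStats` class is hypothetical.

```python
class RunningStats:
    """Mean and sample variance updated per instance, without storing the stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # each instance is seen once and then discarded
print(stats.mean, stats.variance)
```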
Hadoop Online is another prototype that attempts to solve the same issues as M3: Hadoop Online
Apache Storm is the key solution to the issue, but it is not enough on its own. You need some equivalent of MapReduce, which is why you need a library called SAMOA, which has great algorithms for online learning that Mahout rather lacks.
Several mature stream processing frameworks and products are available on the market. Open source frameworks are e.g. Apache Storm or Apache Spark (which can both run on top of Hadoop). You can also use products such as IBM InfoSphere Streams or TIBCO StreamBase.
Take a look at this InfoQ article, which explains stream processing and all these frameworks and products in detail: Real Time Stream Processing / Streaming Analytics in Combination with Hadoop. The article also explains how this is complementary to Hadoop.
By the way: Many software vendors such as Oracle or TIBCO call this stream processing / streaming analytics approach "fast data" instead of "big data" as you have to act in real time instead of batch processing.
You should try Apache Spark Streaming.
It should work well for your purposes.

Where is Pentaho Kettle's architecture?

Where can I find Pentaho Kettle architecture? I'm looking for a short wiki, design document, blog post, anything to give a good overview on how things work. This question is not meant for specific "how to" starting guides but rather a good view at the technology and architecture.
Specific questions I have are:
How does data flow between steps? It would seem everything is in memory - am I right about this?
Is the above true about different transformations as well?
How are the Collect steps implemented?
Any specific performance guidelines to using it?
Is the ftp task reliable and performant?
Any other "Dos and Don'ts" ?
See this PDF.
How does data flow between steps? It would seem everything is in memory - am I right about this?
Data flow is row-based. In a transformation every step produces a 'tuple', that is, a row with fields. Every field is a pair of data and metadata. Every step has an input and an output: a step takes rows from its input, modifies them and sends them to its outputs. In most cases all of this information is in memory. However, steps read data in a streaming fashion (from JDBC or another source), so typically only part of the data from a stream is in memory at any time.
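As an analogy only (this is not how Kettle itself is implemented), the row-by-row flow can be sketched with Python generators, where each "step" pulls rows from the previous one and passes them on without holding the whole data set. The step names and sample rows are hypothetical.

```python
def read_rows(lines):
    """Input step: emit one row (a dict of fields) per line."""
    for line in lines:
        name, amount = line.split(",")
        yield {"name": name, "amount": amount}

def to_number(rows):
    """Modify step: convert the amount field to a float."""
    for row in rows:
        yield {**row, "amount": float(row["amount"])}

def filter_large(rows, threshold):
    """Filter step: pass on only rows at or above the threshold."""
    for row in rows:
        if row["amount"] >= threshold:
            yield row

source = ["a,10", "b,250", "c,75"]
pipeline = filter_large(to_number(read_rows(source)), threshold=50)
result = list(pipeline)  # rows flow through one at a time
print(result)
```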
Is the above true about different transformations as well?
There is a 'job' concept and a 'transformation' concept. Everything written above is mostly true for transformations. 'Mostly' because a transformation can contain very different steps, and some of them, like collect steps, may try to collect all the data from a stream. Jobs are a way to perform actions that do not follow the 'streaming' concept, such as sending an email on success, loading files from the network, or executing different transformations one by one.
How are the Collect steps implemented?
It depends on the particular step. Typically, as said above, collect steps may try to collect all the data from a stream, which can cause OutOfMemory exceptions. If the data is too big, consider replacing 'collect' steps with a different approach (for example, steps that do not collect all the data).
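The difference between collecting everything and processing as rows arrive can be illustrated outside Kettle. In this hypothetical Python sketch, a "top N" result is computed with `heapq.nlargest`, which keeps only N rows in memory, instead of sorting the full stream.

```python
import heapq

def generate_rows(count):
    """Stand-in for a large input stream; rows are produced lazily."""
    for i in range(count):
        yield {"id": i, "score": (i * 37) % 101}

def top_n(rows, n):
    """Keep only the n highest-scoring rows seen so far, not the whole stream."""
    return heapq.nlargest(n, rows, key=lambda row: row["score"])

best = top_n(generate_rows(1000), 3)
print(best)
```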
Any specific performance guidelines to using it?
There are a lot. It depends on the steps the transformation consists of and on the data sources used. I would rather speak about an exact scenario than give general guidelines.
Is the ftp task reliable and performant?
As far as I remember, FTP is backed by the EdtFTP implementation, and there may be some issues with those steps, such as some parameters not being saved or the HTTP-FTP proxy not working. I would say Kettle in general is reliable and performant, but for some less commonly used scenarios it may not be.
Any other "Dos and Don'ts" ?
I would say the 'Do' is to understand the tool before you start using it intensively. As mentioned in this discussion, there is some literature on Kettle/Pentaho Data Integration; you can search for it on specific sites.
One of the advantages of Pentaho Data Integration/Kettle is its relatively big community, which you can ask about specific aspects.
