Hadoop and Stata - hadoop

Does anyone have any experience using Stata and Hadoop? Stata 13 now has a Java Plugin API, so I think it should be straightforward to get them to play nice.
I am particularly interested in being able to parse weblog data to get it into a form suitable for statistical analysis.
This question came up on Statalist recently, but there was no response, so I thought I would try it here where the audience is more likely to have experience with this technology.

Dimitry,
I think it would be easier to do something like this using the ELK Stack (http://www.elastic.co). Logstash (the middle layer) has several parsers/tokenizers/analyzes built on the Apache Lucene engine for cleaning and formatting log data and can push the resulting data into elasticsearch, which exposes an HTTP API that you can curl fairly easily to get results (e.g., use insheetjson and pass the HTTP GET request as the URL and it should be imported into Stata without much problem).
I've been trying to cobble together a program to use the Jackson JSON library to build out more robust JSON I/O capabilities from within Stata and would definitely not mind trying to work with others to get it done.
Hope this helps,
Billy

I'll take an (un?)educated stab at this. From the looks of the java API, the caller seems to treat Stata as essentially a datastore. If that's the case, then I would imagine Stata would fit in to the hadoop world as a database and would be accessed by its own InputFormat and OutputFormat. In your specific case I'd imagine you'd write a StataOutputFormat which your reducer would use to write the parsed data. The only drawback seems to be your referenced comments that Stata apps tend to be I/O bound so I don't know that using hadoop is really going to help you since
you'll have to write all that data anyway, and
that write will be I/O bound, whether you use hadoop or not.

Related

Integrating / Transforming data from different / disparate sources without storing it

I have a usecase. I want to Integrate / Transform data from different / disparate sources without storing it. Data sources are database(oracle,db2,etc), Webservice(Rest/Soap), Flat files(CSV, XML, JSON), MQ dumps, mainframe systems. I want to pull data from these sources and do some kind of intelligent transformation and integration and provide it our customers. It looks like typical ETL scenario, but my situation is different. I am not allowed to store the data given by the desperate sources, that means, for simple example, i pull data from oracle, soap and a rest, and do all my intelligent transformations and integrations on the fly.
I browsed through google and technical stuffs but could not get convincing solution to my problem.
If you guys can help me giving some valuable insight on this problem and give suggestion and probable approaches to it.
Note: Data size from these sources can sometime be really huge.
Thanks in Advance
Take a look at htto://teiid.org
Thst is exactly what it does, and it is Open Source.
Talend Open Studio y a great solution as well, I'm using it and it's great and easy to make the ETL workflow.
https://www.talend.com/products/data-integration/data-integration-manuals-release-notes/
You can see a lot of help videos: https://www.youtube.com/results?search_query=talend+studio

Use Cases of NIFI

I have a question about Nifi and its capabilities as well as the appropriate use case for it.
I've read that Nifi is really aiming to create a space which allows for flow-based processing. After playing around with Nifi a bit, what I've also come to realize is it's capability to model/shape the data in a way that is useful for me. Is it fair to say that Nifi can also be used for data modeling?
Thanks!
Data modeling is a bit of an overloaded term, but in the context of your desire to model/shape the data in a way that is useful for you, it sounds like it could be a viable approach. The rest of this is under that assumption.
While NiFi employs dataflow through principles and design closely related to flow based programming (FBP) as a means, the function is a matter of getting data from point A to B (and possibly back again). Of course, systems aren't inherently talking in the same protocols, formats, or schemas, so there needs to be something to shape the data into what the consumer is anticipating from what the producer is supplying. This gets into common enterprise integration patterns (EIP) [1] such as mediation and routing. In a broader sense though, it is simply getting the data to those that need it (systems, users, etc) when and how they need it.
Joe Witt, one of the creators of NiFi, gave a great talk that may be in line with this idea of data shaping in the context of Data Science at a Meetup. The slides of which are available [2].
If you have any additional questions, I would point you to check out the community mailing lists [3] and ask any additional questions so you can dig in more and get a broader perspective.
[1] https://en.wikipedia.org/wiki/Enterprise_Integration_Patterns
[2] http://files.meetup.com/6195792/ApacheNiFi-MD_DataScience_MeetupApr2016.pdf
[3] http://nifi.apache.org/mailing_lists.html
Data modeling might well mean many things to many folks so I'll be careful to use that term here. What I do think in what you're asking is very clear is that Apache NiFi is a great system to use to help mold the data into the right format and schema and content you need for your follow-on analytics and processing. NiFi has an extensible model so you can add processors that can do this or you can use the existing processors in many cases and you can even use the ExecuteScript processors as well so you can write scripts on the fly to manipulate the data.

Nifi processor batch insert - handle failure

I am currently in the process of writing an ElasticSearch Nifi processor. Individual inserts / writes to ES are not optimal, instead batching documents is preferred. What would be considered the optimal approach within a Nifi processor to track (batch) documents (FlowFiles) and when at a certain amount batch them in? The part I am most concerned about is if ES is unavailable, down, network partition, etc. prevents the batch from being successful. The primary point of the question, is given that Nifi has content storage for queuing / back-pressure, etc. is there a preferred method for using that to ensure no FlowFiles get lost if a destination is down? Maybe there is another processor I should look at for an example?
I have looked at the Mongo processor, Merge, etc. to try and get an idea of the preferred approach for batching inside of a processor, but can't seem to find anything specific. Any suggestions would be appreciated.
Good chance I am overlooking some basic functionality baked into Nifi. I am still fairly new to the platform.
Thanks!
Great question and a pretty common pattern. This is why we have the concept of a ProcessSession. It allows you to send zero or more things to an external endpoint and only commit once you know it has been ack'd by the recipient. In this sense it offers at least-once semantics. If the protocol you're using supports two-phase commit style semantics you can get pretty close to the ever elusive exactly-once semantic. Much of the details of what you're asking about here will depend on the destination systems API and behavior.
There are some examples in the apache codebase which reveal ways to do this. One way is if you can produce a merged collection of events prior to pushing to the destination system. Depends on its API. I think PutMongo and PutSolr operate this way (though the experts on that would need to weigh in). An example that might be more like what you're looking for can be found in PutSQL which operates on batches of flowfiles to send in a single transaction (on the destination DB).
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/PutSQL.java
Will keep an eye here but can get the eye of a larger NiFi group at users#nifi.apache.org
Thanks
Joe

Hadoop use-case scenario

I would like to have some expert views on the use of a Big Data platform like Hadoop in one of my project scenarios. I am a complete novice in this technology although I understand databases like MySQL well.
We are creating a product which would be used to analyse data from social media. So the input data would be a large volume of tweets, facebook posts, user profiles, YouTube data and data from blogs etc. On top of this I would be having a web application to help me view and analyse this data. As the requirement makes it clear, I would be needing a sort of real time system. So if I have a tweet coming in, I would like to have it available to my web app readily for processing. Batch data processing may not be a suitable choice for my application.
My questions are:
Is a Hadoop engine a good choice for me?
What are the parameter I should base my decision on?
Is it also a good option to use a Multi Cluster MySQL engine as opposed to Hadoop?
Is there any benchmarking in terms of Size and velocity of data in which Hadoop becomes a good choice?
Hadoop is not appropriate for near real time / interactive analysis. Hadoop was designed to do big batch processing of say a few hours of data plus. I used to use Hadoop to process any dataset that was around 10 GB or more (which is still a bit overkill), once it get's to 100 GB then you defo want something like Hadoop.
Now my recommendation would be for Spark as this is much more modern, much faster, more flexible, more powerful, and has a SparkStreaming module for achieving closer to real time analysis. Read all about it! https://spark.apache.org/
In this case I prefer the Lambda Architecture.
With Lambda Architecture you have two routes: A fast route with a noSQL database for the current informations, and a batch route with hadoop-hdfs for the archive data, and with a merge component you can merge the two datasources in one query, so you receive a whole amount of data, which is near real time.
http://lambda-architecture.net/
Image about lambda architecture: http://i.stack.imgur.com/eofRW.png
We created a PoC Project with Lambda Architecture (also for Twitter analysis), and its working fine.
Spark will be the best solution for your problem.You can also look other in-memory databases.

Is avoiding the T in ETL possible?

ETL is pretty common-place. Data is out there somewhere so you go get it. After you get it, it's probably in a weird format so you transform it into something and then load it somewhere. The only problem I see with this method is you have to write the transform rules. Of course, I can't think of anything better. I supposed you could load whatever you get into a blob (sql) or into a object/document (non-sql) but then I think you're just delaying the parsing. Eventually you'll have to parse it into something structured (assuming you want to). So is there anything better? Does it have a name? Does this problem have a name?
Example
Ok, let me give you an example. I've got a printer, an ATM and a voicemail system. They're all network enabled or I can give you connectivity. How would you collect the state from all these devices? For example, the printer dumps a text file when you type status over port 9000:
> status
===============
has_paper:true
jobs:0
ink:low
The ATM has a CLI after you connect on port whatever and you can type individual commands to get different values:
maint-mode> GET BILLS_1
[$1 bills]: 7
maint-mode> GET BILLS_5
[$5 bills]: 2
etc ...
The voicemail system requires certain key sequences to get any kind of information over a network port:
telnet> 7,9*
0 new messages
telnet> 7,0*
2 total messages
My thoughts
Printer - So this is pretty straight-forward. You can just capture everything after sending "status", split on lines and then split on colons or something. Pretty easy. It's almost like getting a crap-formatted result from a web service or something. I could avoid parsing and just dump the whole conversation from port 9000. But eventually I'll want to get rid of that equal signs line. It doesn't really mean anything.
ATM - So this is a bit more of a pain because it's interactive. Now I'm approaching expect or a protocol territory. It'd be better if they had a service that I could query these values but that's out of scope for this post. So I write a client that gets all the values. But now if I want to collect all the data, I have to define what all the questions are. For example, I know that the ATM has more bills than $1 and $5 so I'd have a complete list like "BILLS_1 BILLS_5 BILLS_10 BILLS_20". If I ask all the questions then I have an inventory of the ATM machine. Of course, I still have to parse out the results and clean up the text if I wanted to figure out how much money is left in the ATM machine. So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Voicemail - This is similar to the ATM machine where it's interactive. It's just a bit weirder because the key sequences/commands aren't "get key". But essentially it's the same problem and solution.
Future Proof
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster. Or anything? You'd have to write "connectors" ahead of time or write a parser afterwards against some raw field you stored earlier. Maybe in the case of these very limited examples there's no alternative. There's no way to future-proof. You just have to understand the new device and parse it at collection or parse it after the fact (your stored blob/object/document).
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer that simply requires the device to split out lines. Then you could have a text processing piece that parses based on rules. For the ATM device, you'd have to write something that "speaks ATM" and turns it into lines which the iterator would then take care of. At this point, hopefully you'd be able to say "I can handle anything that has lines of text".
But then what will you call these rules for parsing the text? "Printer rules" might as well be called "printer parser" which is the same to me as "printer transform". Is there a better term for all of this?
I apologize for this question being so open ended. :)
When your sources of information are as disparate as what you illustrate then you have no choice but to implement the Transform in order to bring the items into a common data repository. Usually your data sources won't be this extreme, the data will all be related in some way but you may be retrieving it from different sources (some might come from a nicely structured database, some more might come from an Excel or XML or text file, some more might come from a web service call, etc).
When coding up a custom ETL application, a common pattern that is used is the Provider model, this enables you to write a whole bunch of custom providers to load/query and then transform the data. All the providers will implement a common interface with some relatively common function definitions (for example QueryData(), TransformData()), but the implementation of those methods will be wildly different depending on the data source being dealt with - the interface just gives a common way to deal with all the different providers. You can then use an XML configuration file to dictate which providers to run and any other initial settings they may require. Tools like SSIS abstract this stuff away for you by giving you a nice visual designer, but you can still get down and dirty and write your own code which it calls.
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster.
No problem, i would just write a new provider, which can sit in its very own assembly (dll), so it can be shipped (or modified, upgraded, etc) in isolation to any other providers i already have. Or if i was using SSIS then i would write a new DTS package.
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer ... Then you could have a text processing piece that parses based on rules.
Absolutely - you can have a base class containing common functionality which several different providers can implement, and each provider can use its own set of rules which could be coded into it or they can be contained in an external configuration file.
So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Use whichever approach makes sense for the data you are grabbing. It is also quite common for an ETL process to dump its data into a staging area (like some staging tables in a database) while the data is all being aggregated and accumulated, and then further process it to link related data and perform calculations. In the case of your ATM it may not be necessary to calculate a cash balance at ETL time because you can easily calculate it at any time in the future.

Resources