How do I get from "Big Data" to a webpage? - hadoop

I've spent a lot of time reading and watching videos of people talking about how they use tools designed for handling huge datasets and real-time processing in their architectures. And while I understand what it is that tools like Hadoop/Cassandra/Kafka etc do, no one seems to explain how the data gets from these large processing tools to rendering something on a client/webpage.
From what I understand of big data tools, you can't build your application the same way you would a standard web app querying MySQL, which makes sense given the size of the data that flows through these tools. However, for all this talk of "realtime data analytics", I cannot find any explanation of how the actual analytics get put in front of someone as a chart/table/etc.?

explain how the data gets from these large processing tools to rendering something on a client/webpage.
With respect to this, one way would be to process the big data using Spark or Hadoop and store the results in an RDBMS. Then have your web app pull data from the RDBMS to render charts, tables, etc. I can provide examples that I have done myself if you need more information.
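To make that concrete, here is a rough sketch of the pattern, assuming a PySpark batch job, MySQL as the RDBMS, and a small Flask endpoint; the hosts, paths, table names and credentials are placeholders rather than anything from a real deployment:

```python
# Sketch only: Spark aggregates the "big" raw data into a small summary
# table in MySQL; the web tier then serves that table as JSON.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-rollup").getOrCreate()

events = spark.read.parquet("hdfs:///data/events/")          # raw event data
daily = (events
         .groupBy(F.to_date("ts").alias("day"), "event_type")
         .agg(F.count("*").alias("n_events")))

# Persist the (now small) aggregate into an ordinary RDBMS via JDBC.
(daily.write
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/analytics")
      .option("dbtable", "daily_event_counts")
      .option("user", "reporter")
      .option("password", "secret")
      .mode("overwrite")
      .save())

# --- web tier (in practice a separate process from the Spark job) ---
from flask import Flask, jsonify
import mysql.connector

app = Flask(__name__)

@app.route("/api/daily-counts")
def daily_counts():
    conn = mysql.connector.connect(host="dbhost", database="analytics",
                                   user="reporter", password="secret")
    cur = conn.cursor()
    cur.execute("SELECT day, event_type, n_events FROM daily_event_counts")
    rows = [{"day": str(d), "type": t, "count": n} for (d, t, n) in cur]
    conn.close()
    return jsonify(rows)      # client-side JS turns this into a chart/table
```

The point is that the browser-facing app only ever touches the small, pre-aggregated table, never the raw data.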

Impala supports ODBC/JDBC interfaces. So, you actually could hook up a web app to it the same way you do with MySQL.
Other stuff you might want to check out is HBase, Kudu or Solr. In some realtime architectures data ends up in one of those. And all of them have some sort of an API that you can use in your web app to access their data.
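For example, here is a hedged sketch of the Impala route using the impyla DB-API client (my assumption for the client library; the host, port and table names are also just placeholders):

```python
# Sketch: query Impala from the web tier roughly the way you would MySQL.
# Assumes the impyla package; host, port and table names are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()
cur.execute("""
    SELECT event_type, COUNT(*) AS n
    FROM analytics.events
    WHERE event_date = '2016-01-01'
    GROUP BY event_type
""")
rows = cur.fetchall()    # hand these to your templating/JSON layer
conn.close()
```

The ODBC/JDBC drivers work the same way from Java or any other stack, which is why the web tier doesn't have to care that Impala rather than MySQL is answering the query.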

If you want a simple solution for realtime data processing and analytics, check out the new Stride API, which enables developers to collect, process, and analyze streaming data and then either visualize summary data in Stride or push processed data out to applications in realtime. This is a very easy way to build the kind of realtime reporting dashboards and monitoring / alerting systems you described above.
Take a look at the Stride API technical docs for examples and more info on how to implement this.

Related

Difference between Apache NiFi and StreamSets

I am planning to do a class project and was going through a few technologies that can automate or set up the flow of data between systems, and found that there are a couple of them, i.e. Apache NiFi and StreamSets (to my knowledge). What I couldn't understand is the difference between them and the use cases where each can be used. I am new to this, and if anyone could explain a bit it would be highly appreciated. Thanks
Suraj,
Great question.
My response is as a member of the open source Apache NiFi project management committee and as someone who is passionate about the dataflow management domain.
I've been involved in the NiFi project since it was started in 2006. My knowledge of Streamsets is relatively limited so I'll let them speak for it as they have.
The key thing to understand is that NiFi was built to do one really important thing really well, and that is 'Dataflow Management'. Its design is based on a concept called Flow Based Programming, which you may want to read about and reference for your project: https://en.wikipedia.org/wiki/Flow-based_programming
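If it helps for the project, here is a tiny toy illustration of the flow-based programming idea: independent processors connected by bounded queues. This is deliberately not NiFi's API or implementation, just the concept.

```python
# Toy flow-based-programming sketch: "processors" connected by queues.
# Illustrative only; NiFi's real model is far richer than this.
from queue import Queue
from threading import Thread

def generate(out_q):
    for i in range(5):
        out_q.put({"id": i, "payload": f"event-{i}"})
    out_q.put(None)                          # end-of-stream marker

def enrich(in_q, out_q):
    while (item := in_q.get()) is not None:
        item["enriched"] = True              # a transform step
        out_q.put(item)
    out_q.put(None)

def deliver(in_q):
    while (item := in_q.get()) is not None:
        print("delivered:", item)            # stand-in for "send to target system"

q1, q2 = Queue(maxsize=10), Queue(maxsize=10)  # bounded queues ~ back-pressure
for fn, args in [(generate, (q1,)), (enrich, (q1, q2)), (deliver, (q2,))]:
    Thread(target=fn, args=args).start()
```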
There are already many systems which produce data such as sensors and others. There are many systems which focus on data processing like Apache Storm, Spark, Flink, and others. And finally there are many systems which store data like HDFS, relational databases, and so on. NiFi purely focuses on the task of connecting those systems and providing the user experience and core functions necessary to do that well.
What are some of those key functions and design choices made to make that effective:
1) Interactive command and control
The job of someone trying to connect systems is to be able to rapidly and efficiently interact with the constant streams of data they see. NiFi's UI allows you to do just that: as the data is flowing, you can add features to operate on it, fork off copies of data to try new approaches, adjust current settings, see recent and historical stats, view helpful in-line documentation, and more. Almost all other systems by comparison have a model that is design-and-deploy oriented, meaning you make a series of changes and then deploy them. That model is fine and can be intuitive, but for the dataflow management job it means you don't get the interactive change-by-change feedback that is so vital to quickly building new flows or to safely and efficiently correcting or improving the handling of existing data streams.
2) Data Provenance
A unique capability of NiFi is its ability to generate fine-grained and powerful traceability details for where your data comes from, what is done to it, where it's sent, and when each step happens in the flow. This is essential to effective dataflow management for a number of reasons, but for someone in the early exploration phases and working on a project, the most important thing this gives you is awesome debugging flexibility. You can set up your flows and let things run, then use provenance to actually prove that it did exactly what you wanted. If something didn't happen as you expected, you can fix the flow and replay the object, then repeat. Really helpful.
3) Purpose built data repositories
NiFi's out-of-the-box experience offers very powerful performance even on really modest hardware or virtual environments. This is because of the flowfile and content repository design, which gives us the high-performance but transactional semantics we want as data works its way through the flow. The flowfile repository is a simple write-ahead log implementation and the content repository provides an immutable versioned content store. That in turn means we can 'copy' data by only ever adding a new pointer (not actually copying bytes), or we can transform data by simply reading from the original and writing out a new version. Again very efficient. Couple that with the provenance stuff I mentioned a moment ago and it just provides a really powerful platform.
Another really key thing to understand here is that in the business of connecting systems you don't always get to dictate things like the size of the data involved. The NiFi API was built to honor that fact, and so our API lets processors do things like receive, transform, and send data without ever having to load the full objects in memory. These repositories also mean that in most flows the majority of processors do not even touch the content at all. However, you can easily see from the NiFi UI precisely how many bytes are actually being read or written, so again you get really helpful information in establishing and observing your flows.
This design also means NiFi can support back-pressure and pressure-release naturally, and these are really critical features for a dataflow management system.
It was mentioned previously by the folks from the Streamsets company that NiFi is file-oriented. I'm not really sure what the difference is between a file or a record or a tuple or an object or a message in generic terms, but the reality is that when data is in the flow it is 'a thing that needs to be managed and delivered'. That is what NiFi does. Whether you have lots of really high-speed tiny things or you have large things, and whether they came from a live audio stream off the Internet or from a file sitting on your hard drive, it doesn't matter. Once it is in the flow it is time to manage and deliver it. That is what NiFi does.
It was also mentioned by the Streamsets company that NiFi is schemaless. It is accurate that NiFi does not force conversion of data from whatever it is originally into some special NiFi format, nor do we have to reconvert it back into some format for follow-on delivery. It would be pretty unfortunate if we did that, because it would mean even the most trivial cases would have problematic performance implications; luckily, NiFi does not have that problem. Further, had we gone that route, handling diverse datasets like media (images, video, audio, and more) would be difficult, but we're on the right track and NiFi is used for things like that all the time.
Finally, as you continue with your project and if you find there are things you'd like to see improved or that you'd like to contribute code we'd love to have your help. From https://nifi.apache.org you can quickly find information on how to file tickets, submit patches, email the mailing list, and more.
Here are a couple of fun recent NiFi projects to checkout:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer
https://twitter.com/KayLerch/status/721455415456882689
Good luck on the class project! If you have any questions, the users@nifi.apache.org mailing list would love to help.
Thanks
Joe
Both Apache NiFi and StreamSets Data Collector are Apache-licensed open source tools.
Hortonworks does have a commercially supported variant of NiFi called Hortonworks DataFlow (HDF).
While the two have a lot of similarities, such as a web-based UI and being used for ingesting data, there are a few key differences. They also both consist of processors linked together to perform transformations, serialization, etc.
NiFi processors are file-oriented and schemaless. This means that a piece of data is represented by a FlowFile (this could be an actual file on disk, or some blob of data acquired elsewhere). Each processor is responsible for understanding the content of the data in order to operate on it. Thus if one processor understands format A and another only understands format B, you may need to perform a data format conversion in between those two processors.
NiFi can be run standalone, or as a cluster using its own built-in clustering system.
StreamSets Data Collector (SDC), however, takes a record-based approach. What this means is that as data enters your pipeline (whether it's JSON, CSV, etc.) it is parsed into a common record format, so the responsibility of understanding the data format is no longer placed on each individual processor and any processor can be connected to any other processor.
SDC also runs standalone, and also has a clustered mode, but in that mode it runs atop Spark on YARN/Mesos, leveraging existing cluster resources you may have.
NiFi has been around for about the last 10 years (but less than 2 years in the open source community).
StreamSets was released to the open source community a little bit later in 2015. It is vendor agnostic, and as far as Hadoop goes Hortonworks, Cloudera, and MapR are all supported.
Full Disclosure: I am an engineer who works on StreamSets.
They are very similar for data ingest scenarios.
Apache NiFi (HDF) is more mature and StreamSets is more lightweight.
Both are easy to use and both have strong capabilities. And StreamSets could easily
They both have companies behind them: Hortonworks and Cloudera.
Obviously there are more contributors working on NiFi than on StreamSets, and of course NiFi has more enterprise deployments in production.
Two of the key differentiators between the two, IMHO, are:
Apache NiFi is a Top Level Apache project, meaning it has gone through the incubation process described here, http://incubator.apache.org/policy/process.html, and can accept contributions from developers around the world who follow the standard Apache process, which ensures software quality. StreamSets is Apache LICENSED, meaning anyone can reuse the code, etc., but the project is not managed as an Apache project. In fact, in order to even contribute to StreamSets, you are REQUIRED to sign a contract: https://streamsets.com/contributing/ . Contrast this with the Apache NiFi contributor guide, which wasn't written by a lawyer: https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-HowtocontributetoApacheNiFi
StreamSets "runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have." which imposes a bit of restriction if you want to deploy your dataflows further toward the Edge where the Devices that are generating the data live. Apache MiniFi, a sub-project of NiFi can run on a single Raspberry Pi, while I am fairly confident that StreamSets cannot, as YARN or Mesos require more resources than a Raspberry Pi provides.
Disclosure: I am a Hortonworks employee

Azure Technology Choice for Project

There is a lot of information out there about the various Azure data storage flavors; however, I'd like to ask for some advice for my particular scenario.
I'm putting together a pet project to become more familiar with Azure technology, in particular, Service Bus/Event Hubs and data storage platforms. The system I want to create is fairly simple: accept a moderate load of events (not IoT scale), persist them, and make aggregated data available such as 'User A had N events of type X in the past day/week/month/etc.' as reports.
Given that the data will be quite structured (e.g. users, user groups, events, etc.), and I will need aggregation capabilities, this suggests that relational storage may be the best fit, although it is more expensive.
Another alternative I've considered is to maintain the aggregated data in near real-time using something like Stream Analytics, but I'm not sure if this is overkill compared to a more data-warehouse-esque solution.
Any suggestions/help would be greatly appreciated.
John
John,
Azure SQL would be a decent choice, or if that proves to be too expensive, regular SQL Server hosted on a VM. You can create an Azure Service Bus queue to hold the incoming requests, and then create competing consumers on one or more worker roles to monitor and process the messages. Each consumer can run the SQL and persist the data in a new table that is created and "pre-aggregated" for the caller, or you could persist the information to Azure Blob storage in a structured format that matches your reporting tool (e.g. JSON). Blob storage of the aggregated information will be the most cost-effective and will relieve strain on SQL.
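To sketch the competing-consumer part of that (using the current Python SDKs purely for illustration; the queue name, container name, connection strings and aggregation logic below are placeholders and heavily simplified):

```python
# Sketch of one competing consumer: drain the Service Bus queue, keep a
# running per-user/per-event-type count, and write the summary to Blob
# storage as JSON for the reporting layer. All names/strings are made up.
import json
from collections import Counter
from azure.servicebus import ServiceBusClient
from azure.storage.blob import BlobServiceClient

SB_CONN = "<service-bus-connection-string>"
BLOB_CONN = "<storage-account-connection-string>"

counts = Counter()

with ServiceBusClient.from_connection_string(SB_CONN) as sb:
    with sb.get_queue_receiver(queue_name="events", max_wait_time=5) as receiver:
        for msg in receiver:
            event = json.loads(str(msg))
            counts[(event["user_id"], event["event_type"])] += 1
            receiver.complete_message(msg)     # remove the message from the queue

summary = [{"user": u, "type": t, "count": c} for (u, t), c in counts.items()]

blob = (BlobServiceClient.from_connection_string(BLOB_CONN)
        .get_blob_client(container="reports", blob="daily-summary.json"))
blob.upload_blob(json.dumps(summary), overwrite=True)
```

Because several consumers compete on the same queue, you can scale the processing out simply by adding instances.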
An alternative would be HDInsight which can aggregate the information in batch processing mode as well. I guess the choice between SQL/HDInsight depends on the native format of the base (non-aggregated) information.
I agree with Daniel. SQL Azure may be the way to go for your relational data needs. Another option to investigate for larger streaming and analytics workloads is Azure Data Lake (https://azure.microsoft.com/en-us/solutions/data-lake/)

Using elasticsearch as central data repository

We are currently using elasticsearch to index and perform searches on about 10M documents. It works fine and we are happy with its performance. My colleague who initiated the use of elasticsearch is convinced that it can be used as the central data repository and other data systems (e.g. SQL Server, Hadoop/Hive) can have data pushed to them. I didn't have any arguments against it because my knowledge of both is too limited. However, I am concerned.
I do know that data in elasticsearch is stored in a manner that is efficient for text searching. Hadoop stores data just as a file system would, but in a manner that is efficient for scaling/replicating blocks over multiple data nodes. Therefore, in my mind it seems more beneficial to use Hadoop (as it is more agnostic w.r.t. its view on data) as a central data repository. Then push data from Hadoop to SQL, elasticsearch, etc...
I've read a few articles on Hadoop and elasticsearch use cases and it seems conventional to use Hadoop as the central data repository. However, I can't find anything that would suggest that elasticsearch wouldn't be a decent alternative.
Please Help!
As is the case with all database deployments, it really depends on your specific application.
Elasticsearch is a great open source search engine built on top of Apache Lucene. Its features and upgrades allow it to basically function just like a schema-less JSON datastore that can be accessed using both search-specific methods and regular database CRUD-like commands.
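As a quick illustration of that CRUD-like usage, here is a minimal sketch with the Python client; the index and field names are made up, and the exact parameter names vary a little between client versions:

```python
# Sketch: Elasticsearch used like a schema-less JSON document store.
# Index/field names are examples; some client versions use `document=`
# instead of `body=`.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# "Create/Update": index a JSON document under an id
es.index(index="articles", id="42",
         body={"title": "Big data to webpage", "status": "unread"})

# "Read": fetch it back by id, like a key-value lookup
doc = es.get(index="articles", id="42")["_source"]

# Search: what Elasticsearch is actually built for
hits = es.search(index="articles",
                 body={"query": {"match": {"title": "webpage"}}})

# "Delete"
es.delete(index="articles", id="42")
```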
Notwithstanding all the advantages that Elasticsearch brings, there are still some main disadvantages:
Security - Elasticsearch does not provide any authentication or access control functionality out of the box. This has been supported since they introduced Shield.
Transactions - There is no support for transactions or processing on data manipulation, although data manipulation at ingest time is now handled with Logstash.
Durability - ES is distributed and fairly stable but backups and durability are not as high priority as in other data stores.
Maturity of tools - ES is still relatively new and has not had time to develop mature client libraries and 3rd-party tools, which can make development much harder. We can consider it quite mature now, with a variety of connectors and tools around it like Kibana.
Large computations - It's still not suited for large computations: commands for searching data are not suited to "large" scans of data and advanced computation on the db side.
Data Availability - ES makes data available in "near real-time", which may require additional considerations in your application (i.e. a comments page where a user adds a new comment; refreshing the page might not actually show the new post because the index is still updating).
If you can deal with these issues then there's certainly no reason why you can't use Elasticsearch as your primary data store. It can actually lower complexity and improve performance by not having to duplicate your data but again this depends on your specific use case.
As always, weigh the benefits, do some experimentation and see what works best for you.
DISCLAIMER: This answer was written a while ago for the Elasticsearch 1.x series. These criticisms still somewhat stand with the 2.x series, but Elastic is working on them, as the 2.x series comes with more mature tools, APIs and plugins: for example, security-wise, Shield, or even transport clients like Logstash or Beats, etc.
I'd highly discourage most users from using elasticsearch as your primary datastore. It will work great until your cluster melts down due to a network partition. Even settings such as minimum_master_nodes that the ES pros always set won't save you. See this excellent analysis by Aphyr with his Call Me Maybe series:
http://aphyr.com/posts/317-call-me-maybe-elasticsearch
eliasah is right, it depends on your use case, but if your data (and your job) is important to you, stay away.
Keep the golden record of your data stored in something really focused on persistence, and sync your data out to search from there. It adds extra complexity and resources, but will result in a better night's rest :)
There are plenty of ways to go about this, and if Elasticsearch does everything you need, you can look into Kafka for persisting all the events going into the cluster, which would allow replaying if things go wrong. I like this approach as it provides an async ingestion pipeline into Elasticsearch that also does the persistence.
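A minimal sketch of that Kafka-in-front-of-Elasticsearch idea, assuming kafka-python and the Elasticsearch Python client; the topic, index and field names are invented:

```python
# Sketch: events are persisted in a Kafka topic first (replayable), and a
# small consumer indexes them into Elasticsearch asynchronously.
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

consumer = KafkaConsumer(
    "events",                                  # topic holding the raw events
    bootstrap_servers=["localhost:9092"],
    group_id="es-indexer",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",              # lets you replay from the start
)

for message in consumer:
    event = message.value
    # A stable id means replays overwrite documents instead of duplicating them.
    es.index(index="events", id=event["event_id"], body=event)
```

If Elasticsearch ever needs to be rebuilt, you re-run the consumer from the beginning of the topic instead of treating the search index as the only copy of the data.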

Measuring Application Performance

I was wondering if there is a tool to keep track of application performance. What I have in mind is a tool that will listen for updates and register performance metrics published by an application, e.g. the time to serve a request, or the time a certain operation took to finish. This tool would then aggregate the data and measure performance trends.
If you want to measure your application from outside, then you can use RRDtool to collect the data.
You can use SLAMD for web apps written in Java.
For Django use hotshot.
Search for "profiler" plus your language or framework.
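For the RRDtool suggestion above, here is a rough sketch of the push side; the file name and data-source layout are arbitrary, and it simply shells out to the rrdtool CLI rather than using any particular binding:

```python
# Sketch: time an operation and push the measurement into an RRD file,
# which RRDtool then aggregates into trends you can graph.
import subprocess
import time

RRD = "request_latency.rrd"

def create_rrd():
    # one GAUGE data source, 60-second step, keep 24h of 1-minute averages
    subprocess.run(["rrdtool", "create", RRD, "--step", "60",
                    "DS:latency:GAUGE:120:0:U",
                    "RRA:AVERAGE:0.5:1:1440"], check=True)

def timed(func, *args, **kwargs):
    start = time.time()
    result = func(*args, **kwargs)
    elapsed = time.time() - start
    # "N" means "now"; rrdtool takes care of the aggregation over time
    subprocess.run(["rrdtool", "update", RRD, f"N:{elapsed}"], check=True)
    return result
```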
Take a look at HP SiteScope. Its ability to drive the system with a web user script, to monitor the metrics on the backend (even to the extent of creating custom shell scripts and database queries), plus the ability to add logic for reporting/alerting against these combined data sets, appears to be what you need.
Other mechanisms you might consider would be a roll-your-own service: use cURL to push information in, query the systems involved to pull metrics or database information, and then build your own interface for alerting and reporting (a minimal sketch of the push side appears below).
Then it becomes a cost question: can you roll that level of functionality for less money than you can purchase an already existing solution on the open market?
Ref:
HP SiteScope Wiki Page
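As a sketch of that roll-your-own push approach, here is what the client side might look like; the endpoint URL and payload shape are entirely hypothetical:

```python
# Sketch: an application times an operation and pushes the metric to a
# hypothetical collection endpoint over HTTP. URL and payload are made up.
import time
import requests

METRICS_ENDPOINT = "http://metrics.internal.example.com/api/metrics"

def report(name, seconds):
    requests.post(METRICS_ENDPOINT, json={
        "metric": name,
        "value_seconds": seconds,
        "timestamp": time.time(),
    }, timeout=2)

start = time.time()
# ... serve the request / run the operation being measured ...
report("serve_request", time.time() - start)
```

The collection service behind that endpoint is then free to aggregate, trend and alert however you like.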

Core Data's Limits, can Core Data be used as a Serverside Technology?

I've found no clear answer so far, but maybe I've searched the wrong way.
My question is, can Core Data be used as a persistence store for a server project? Where are Core Data's limits, and how much data can be handled with Core Data and SQLite? SQLite should handle a lot of data very well according to their website. I know of a proprietary Java persistence manager with an Oracle DB as storage that handles millions of entries and 3000 clients without problems. For my own project I wonder if I can use Core Data on the server side for user management and internal microblogging/texting with up to 5000 clients. Will it handle such big amounts of data, or do I have to manage something like that myself? Does anyone happen to have experience with huge amounts of data and Core Data?
Thank you
twickl
I wouldn't advise using Core Data for a server-side project. Core Data was designed to handle the data of individual, object-oriented applications; therefore it lacks many of the common features of dedicated server software, such as easily handling multiple simultaneous accesses.
Really, the only circumstance where I would advise using it is when the server side logic is very complex and the number of users small. For example, if you wanted to write an in house web app and have almost all the logic on the server, then Core Data might serve well.
Apple used to have WebObjects which was a package to manage servers using an object-oriented DB much like Core Data. (Core Data was inspired by a component of WebObjects called Enterprise Objects.) However, IIRC Apple no longer supports WebObjects for external use.
You're better off using one of the many dedicated server packages out there than trying to roll your own.
I have no experience using Core Data in the manner you describe, but my understanding of the architecture leads me to believe that it could be used, depending on how you plan to query and manipulate the data.
Core Data is very good at maintaining an object graph and using faults to bring parts into memory as needed. In that manner, it could be good on a server for reducing memory requirements even with a large data set.
Core Data is not very good at manipulating collections of objects without loading them into memory, making a change, and writing them back out to disk. Brent Simmons wrote a blog post about this, where he decided to stop using Core Data for some of his RSS reader's model objects because an operation like "mark all as read" didn't scale. While you would like to be able to say something like UPDATE articles SET status = 'read', Core Data must load each article, set its status property, then write it back to disk.
This isn't because Apple engineers are stupid, but because the query layer can't make assumptions about the storage layer (you could be using XML instead of SQLite) and it also must take into account cascading changes and the fact that some article objects may already be loaded into memory and will need to be updated there.
Note that you can also write your own storage providers for Core Data, see Aaron Hillegass's BNRPersistence project. So if Core Data was "mostly good" you might be able to improve on it for your application.
So, a possible answer to your question is that Core Data may be appropriate for your application, as long as you do not need to rely on batch updates to large numbers of objects. In general, no algorithm or data structure is appropriate for every scenario. Engineering is about wisely choosing between trade-offs. You won't find anything that works well for many clients in every case. It always matters what you are doing.
