I am working on Apache Zeppelin for data visualization, and I am able to visualize data from different sources such as HAWQ.
Now the requirement is to visualize data on a near real-time basis; e.g. I need to visualize a table in HAWQ every 15 seconds. I don't see such a feature in Zeppelin. If anyone has implemented this kind of use case on Zeppelin, please share some pointers.
I have notebooks that perform transformations on tables stored in DBFS (Databricks File System). I want to capture and display the data lineage. Additionally, I want to know how to do the same in HDInsight.
Spline is derived from the words Spark and Lineage. It is a tool used to visualize and track how data changes over time. Spline provides a GUI where the user can view and analyze how the data is transformed to produce insights.
You may check out the articles Spark Data Lineage on Databricks Notebook using Spline and Data Lineage Tracking And Visualization Solution.
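As a rough illustration, here is a minimal Scala sketch of enabling Spline from a Spark job, assuming the Spline Spark agent is on the classpath and a Spline server is reachable; the input and output paths are hypothetical:

    // Minimal sketch: enable Spline lineage tracking in a Spark job.
    // Assumes the Spline Spark agent and a reachable Spline server
    // are configured for your cluster; paths below are placeholders.
    import org.apache.spark.sql.SparkSession
    import za.co.absa.spline.harvester.SparkLineageInitializer._

    object LineageDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("lineage-demo")
          .getOrCreate()

        // Hook Spline's listener into this session; every write action
        // performed afterwards is captured and sent to the Spline server.
        spark.enableLineageTracking()

        // Any transformation-then-write is tracked as a lineage graph.
        spark.read.parquet("dbfs:/input/events")   // hypothetical input path
          .groupBy("region").count()
          .write.mode("overwrite").parquet("dbfs:/output/event_counts")
      }
    }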
I'm currently interested in performing real-time data analytics using real-time aircraft performance data for predictive analysis. What tools and technologies could be used to implement such a system at the research level?
For real-time data analytics, if I were in your place I would choose the following technologies (a minimal sketch of the first two steps follows the list):
1) Kafka for real-time data ingestion.
2) Spark Streaming for stream processing.
3) Spark ML for machine learning algorithms (prediction).
4) Apache Zeppelin for visualization.
5) For data storage, Hive or HDFS as per your needs.
6) Ganglia for performance monitoring.
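Here is a minimal Scala sketch of how steps 1 and 2 could fit together, using Spark Structured Streaming's Kafka source; the broker address, topic name, and message layout are assumptions for illustration:

    // Minimal sketch wiring steps 1 and 2: read aircraft telemetry from
    // Kafka and aggregate it with Spark Structured Streaming. The broker
    // address and topic name ("aircraft-telemetry") are placeholders.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object TelemetryStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("telemetry-stream").getOrCreate()
        import spark.implicits._

        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "aircraft-telemetry")
          .load()

        // Kafka values arrive as bytes; decode and count events per aircraft
        // as a stand-in for a real feature pipeline feeding Spark ML (step 3).
        val counts = raw.selectExpr("CAST(value AS STRING) AS line")
          .select(split($"line", ",").getItem(0).as("aircraft_id"))
          .groupBy($"aircraft_id").count()

        counts.writeStream
          .outputMode("complete")
          .format("console")   // in practice, sink to Hive/HDFS (step 5)
          .start()
          .awaitTermination()
      }
    }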
Hope this helps!
I am working with Elasticsearch and Kibana for analyzing data.
Does anybody know whether it is possible to cluster data based on ... (whatever) in Elasticsearch or Kibana?
Clustering, classification, or grouping.
For example, like machine learning: give it some samples and it can then recognize the trend for other data.
Thanks
Elasticsearch Hadoop contains an Elasticsearch Spark connector. You could run a Spark job on Elasticsearch data and use Spark MLlib for the machine learning part.
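As a rough sketch of that combination, assuming the elasticsearch-hadoop jar is on the classpath, the following Scala job reads an index into an RDD and clusters it with MLlib's KMeans; the index name and field names are placeholders:

    // Minimal sketch, assuming elasticsearch-hadoop is on the classpath
    // and es.nodes points at your cluster. The index name ("metrics")
    // and numeric fields ("cpu", "mem") are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.elasticsearch.spark._

    object EsClustering {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("es-clustering")
          .set("es.nodes", "localhost:9200")
        val sc = new SparkContext(conf)

        // esRDD returns (docId, Map[field -> value]) pairs per document.
        val docs = sc.esRDD("metrics")

        // Turn two numeric fields into MLlib vectors.
        val points = docs.map { case (_, fields) =>
          Vectors.dense(
            fields("cpu").toString.toDouble,
            fields("mem").toString.toDouble)
        }.cache()

        // Group the documents into 5 clusters, 20 iterations.
        val model = KMeans.train(points, 5, 20)
        model.clusterCenters.foreach(println)
      }
    }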
I have a lot of data saved into Cassandra on a daily basis, and I want to compare one data point with the last 5 versions of the data for different regions.
Let's say there is a price data point for a product, and there are 2000 products in a context/region (say US). I want to show a heat-map dashboard showing when the price change happened for different regions.
I am new to Hadoop, Hive, and Pig. Which path would help me achieve my goal? Some details would be appreciated.
Thanks.
This sounds like a good use case for either traditional MapReduce or Spark. You have relatively infrequent updates, so a batch job running over the data and updating a table that in turn provides the data for the heat map seems like the right way to go. Since the updates are infrequent, you probably don't need to worry about Spark Streaming; a traditional batch job run a few times a day is fine.
Here's some info from DataStax on reading from Cassandra in a Spark job: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSCcontext.html
For either Spark or MapReduce, you will want to leverage the framework's ability to partition the task: if you are manually connecting to Cassandra and reading/writing the data like you would from a traditional RDBMS, you are probably doing something wrong. If you write your job correctly, the framework will be responsible for spinning up multiple readers (one for each node that contains the source data you are interested in), distributing the calculation tasks, and routing the results to the appropriate machine to store them.
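As a minimal sketch of that pattern, assuming the spark-cassandra-connector is available, a Scala batch job could look like this; the keyspace, tables, and columns are placeholders for your schema:

    // Minimal sketch, assuming the spark-cassandra-connector is on the
    // classpath. The keyspace ("store"), tables, and column names are
    // placeholders for your schema.
    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object PriceChanges {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("price-changes")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)

        // The connector partitions reads across the Cassandra nodes for
        // you; no manual driver connections needed.
        val changes = sc.cassandraTable("store", "prices")
          .map(row => ((row.getString("region"), row.getString("product_id")),
                       row.getDouble("price")))
          .groupByKey()
          // Count distinct prices per (region, product) as a crude
          // "price changed" signal for the heat map.
          .mapValues(prices => prices.toSet.size - 1)

        // Write the aggregate back to a table the dashboard reads.
        changes.map { case ((region, product), numChanges) =>
          (region, product, numChanges)
        }.saveToCassandra("store", "price_change_counts",
          SomeColumns("region", "product_id", "num_changes"))
      }
    }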
Some more examples are here:
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkIntro.html
and
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/byoh/byohIntro.html
Either way, MapReduce is probably a little simpler, and Spark is probably a little more future-proof.
I want to know the advantages/disadvantages of using a MySQL Cluster versus using the Hadoop framework.
Which is the better solution? I would like to read your opinions.
I think the advantages of using a MySQL Cluster are:
high availability
good scalability
high performance / real-time data access
you can use commodity hardware
And I don't see a disadvantage! Are there any disadvantages that Hadoop does not have?
The advantages of Hadoop with Hive on top of it are:
also good scalability
you can also use commodity hardware
the ability to run in heterogeneous environments
parallel computing with the MapReduce framework
Hive with HiveQL
and the disadvantage is:
no real-time data access; it may take minutes or hours to analyze the data.
So in my opinion, for handling big data, a MySQL Cluster is the better solution. Why is Hadoop the holy grail of handling big data? What is your opinion?
Both of the above answers miss a huge differentiation between MySQL and Hadoop. MySQL requires you to store data in a certain format. It likes heavily structured data: you declare the data type of each column in a table, etc. Hadoop doesn't care about this at all.
Example: if you have a billion text log files, to make analysis even possible for MySQL you'd need to parse and load the data into a MySQL table first, typing each column along the way. With Hadoop and MapReduce, you define the function that scans/analyzes/returns the data from its raw source; you don't need a pre-processing ETL step to get it pre-structured. A sketch of that idea follows below.
If the data is already structured and in MySQL, then (hopefully) it's well structured, so why export it for Hadoop to analyze? If it isn't, why spend the time on ETL-ing the data?
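To illustrate, here is a minimal Scala sketch of that idea, using Spark (which runs on Hadoop) rather than raw MapReduce for brevity; the path and log layout are assumptions:

    // Illustration of the point above: the scan/analyze function is
    // defined against the raw text files directly; no schema declaration
    // or ETL load step first. The path and the assumption that each line
    // starts with a date and may contain "ERROR" are hypothetical.
    import org.apache.spark.{SparkConf, SparkContext}

    object RawLogScan {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("raw-log-scan"))

        // A billion untyped log lines, read as-is.
        val logs = sc.textFile("hdfs:///logs/*")

        // Count error lines per day without ever declaring column types.
        val errorsPerDay = logs
          .filter(_.contains("ERROR"))
          .map(line => (line.take(10), 1))   // assume a leading date prefix
          .reduceByKey(_ + _)

        errorsPerDay.saveAsTextFile("hdfs:///reports/errors_per_day")
      }
    }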
Hadoop is not a replacement for MySQL, so I think each has its own scenario.
Everyone knows Hadoop is better for batch jobs or offline compute, but there are also many related real-time products, such as HBase.
If you want to choose an offline compute & storage architecture,
I suggest Hadoop rather than a MySQL Cluster for offline compute & storage, because of:
Cost: obviously, a Hadoop cluster is cheaper than a MySQL Cluster
Scalability: Hadoop supports more than ten thousand machines in a cluster
Ecosystem: MapReduce, Hive, Pig, Sqoop, etc.
So you can choose Hadoop for offline compute & storage and MySQL for online compute & storage; you can also learn more from the lambda architecture.
The other answer is good, but doesn't really explain why Hadoop is more scalable for offline data crunching than MySQL Cluster. Hadoop is more efficient for large data sets that must be distributed across many machines because it gives you full control over the sharding of the data.
MySQL Cluster uses auto-sharding, and it's designed to randomly distribute the data so that no one machine gets hit with more of the load. Hadoop, on the other hand, allows you to explicitly define the data partitioning, so that multiple data points that require simultaneous access will be on the same machine, minimizing the amount of communication among machines necessary to get the job done. This makes Hadoop better for processing massive data sets in many cases.
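As a hedged sketch of that explicit-partitioning point, here is a Scala example using Spark's Partitioner API (Hadoop MapReduce's Partitioner class works the same way); the region names and data are made up:

    // The idea: route all records for one region to the same machine so
    // the job never shuffles them across the network. Region names and
    // sample data are invented for illustration.
    import org.apache.spark.{Partitioner, SparkConf, SparkContext}

    class RegionPartitioner(regions: Seq[String]) extends Partitioner {
      private val index = regions.zipWithIndex.toMap
      override def numPartitions: Int = regions.size
      // All keys for one region map to one partition (one machine).
      override def getPartition(key: Any): Int =
        index.getOrElse(key.toString, 0)
    }

    object ExplicitSharding {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("explicit-sharding"))

        val sales = sc.parallelize(Seq(
          ("us", 10.0), ("eu", 7.5), ("us", 3.2), ("apac", 9.9)))

        // Unlike MySQL Cluster's random auto-sharding, we decide the layout:
        // co-located records aggregate without cross-machine traffic.
        val byRegion = sales.partitionBy(
          new RegionPartitioner(Seq("us", "eu", "apac")))
        val totals = byRegion.reduceByKey(_ + _)   // runs within each partition

        totals.collect().foreach(println)
      }
    }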
The answer to this question has a good explanation of this distinction.