I have several GBs of data in .csv format in Hadoop HDFS. It is flight data for one airport, and there are different delay types such as carrier delay, weather delay, NAS delay, etc.
I want to create a dashboard that reports on this data, e.g. maximum delay on a particular route, maximum delay per flight, etc.
I am new to the Hadoop world.
Thank you.
You can try Hive. Its query language is similar to SQL.
You can expose the data in HDFS as tables using simple CREATE TABLE statements.
Hive also provides built-in functions that you can use to get the necessary results.
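As a rough sketch of what that could look like, here expressed through PySpark's Hive support (plain HiveQL in the Hive CLI would work the same way); the HDFS path and column names are assumptions about your CSV layout:

```python
from pyspark.sql import SparkSession

# Spark session with Hive support so the table is registered in the metastore
spark = SparkSession.builder.appName("flight-delays").enableHiveSupport().getOrCreate()

# External table over the CSV files already sitting in HDFS
# (path and columns are hypothetical -- adjust to your actual schema)
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS flight_delays (
        flight_id STRING,
        origin STRING,
        destination STRING,
        carrier_delay INT,
        weather_delay INT,
        nas_delay INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///data/flights/'
""")

# Maximum total delay per route -- the kind of aggregate a dashboard would show
max_delay_per_route = spark.sql("""
    SELECT origin, destination,
           MAX(carrier_delay + weather_delay + nas_delay) AS max_total_delay
    FROM flight_delays
    GROUP BY origin, destination
    ORDER BY max_total_delay DESC
""")
max_delay_per_route.show()
```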
Many data visualization tools are available; some commonly used ones are:
Tableau
Qlik
Splunk
These tools provide the capabilities to create your own dashboards.
Is Cassandra a good alternative to Hadoop as a data warehouse where data is append-only and updates in the source databases should not overwrite existing rows in the warehouse but be appended? Is Cassandra really meant to act as a data warehouse, or just as a database to store the results of batch/stream queries?
Cassandra can be used both as a data warehouse (raw data storage) and as a database (for final data storage). It depends mostly on what you want to do with the data.
You may even need both Hadoop and Cassandra for different purposes.
Assume you need to gather and process data from multiple mobile devices and provide a complex aggregation report to the user.
First, you need to save the data as fast as possible (new portions arrive very often), so you use Cassandra there. Because Cassandra is limited in its aggregation features, you load the data into HDFS and do the processing via HQL scripts (assuming you are not keen on coding but comfortable with complicated SQL). Then you move the report results from HDFS back into Cassandra, into a dedicated reports table partitioned by user id.
When the user wants an aggregation report about their activity over the last month, the application takes the id of the active user and returns the aggregated result from Cassandra (since it is a simple key-value lookup).
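As a rough illustration of that last lookup step, here is a minimal sketch using the Python cassandra-driver; the host, keyspace, table, and column names are assumptions made for the example:

```python
from cassandra.cluster import Cluster

# Connect to the cluster and the keyspace holding the precomputed reports
# (host, keyspace, and table names are hypothetical)
cluster = Cluster(["cassandra-host"])
session = cluster.connect("analytics")

# The reports table is partitioned by user_id, so this is a cheap key lookup
rows = session.execute(
    "SELECT month, total_events, avg_session_minutes "
    "FROM monthly_activity_report WHERE user_id = %s",
    (42,),
)
for row in rows:
    print(row.month, row.total_events, row.avg_session_minutes)
```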
So, for your question: yes, it could be an alternative, but the selection strategy depends on your data types and your application's business cases.
You can read more about the usage of Cassandra here.
I am a new Hadoop developer and I have been able to install and run Hadoop services in a single-node cluster. The problem comes with data visualization. What role does the MapReduce jar file play when I need to use a data visualization tool like Tableau? I have a structured data source to which I need to add a layer of logic so that the data makes sense during visualization. Do I need to write MapReduce programs if I am going to visualize with other tools? Please shed some light on how I could go about this.
This probably depends on what distribution of Hadoop you are using and which tools are present. It also depends on the actual data preparation task.
If you don't want to actually write MapReduce or Spark code yourself, you could try SQL-like queries using Hive (which translates to MapReduce) or the even faster Impala. Using SQL you can create tabular data (Hive tables) which can easily be consumed. Tableau has connectors for both of them that automatically translate your Tableau configurations/requests to Hive/Impala. I would recommend connecting with Impala because of its speed.
If you need to do work that requires more programming, or where SQL just isn't enough, you could try Pig. Pig is a high-level scripting language that compiles to MapReduce code. You can try all of the above in their respective editors in Hue or from the CLI.
If you feel that all of the above still don't fit your use case, I would suggest writing MapReduce or Spark code. Spark does not need to be written in Java only and has the advantage of being generally faster.
Most tools can integrate with Hive tables, meaning you don't need to rewrite code. If a tool does not support this, you can make CSV extracts from the Hive tables, or you can keep the tables stored as CSV/TSV. You can then import these files into your visualization tool.
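If you end up going the extract route, a minimal sketch of that last step could look like this in PySpark (the table name and output path are assumptions for the example):

```python
from pyspark.sql import SparkSession

# Hive support lets Spark see the tables you created with Hive/Impala
spark = SparkSession.builder.appName("csv-extract").enableHiveSupport().getOrCreate()

# Read the prepared Hive table and dump it as CSV for the visualization tool
# (table and path are hypothetical)
report = spark.table("analytics.daily_report")
report.coalesce(1).write.mode("overwrite").option("header", True).csv(
    "hdfs:///exports/daily_report_csv"
)
```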
The existing answer already touches on this but is a bit broad, so I decided to focus on the key part:
Typical steps for data visualisation
Do the complex calculations using any Hadoop tool that you like
Offer the output in a (hive) table
Pull the data into the memory of the visualisation tool (e.g. Tableau), for instance using JDBC
If the data is too big to be pulled into memory, you could pull it into a normal SQL database instead and work on that directly from your visualisation tool. (If you work directly on Hive, you will go crazy, as even the simplest queries take 30+ seconds.)
In case it is not possible/desirable to connect your visualisation tool for some reason, the workaround would be to dump output files, for instance as CSV, and then load these into the visualisation tool.
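For the "pull it into a normal SQL database" option above, a minimal sketch in PySpark could look like this; the JDBC URL, credentials, and table names are assumptions, and the appropriate JDBC driver would need to be on Spark's classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("push-to-sql").enableHiveSupport().getOrCreate()

# Take the aggregated output that was offered as a Hive table...
result = spark.table("analytics.dashboard_output")

# ...and push it into a regular PostgreSQL database that the
# visualisation tool can query with low latency
# (URL, table, and credentials are hypothetical)
(result.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/reporting")
    .option("dbtable", "public.dashboard_output")
    .option("user", "reporting_user")
    .option("password", "secret")
    .mode("overwrite")
    .save())
```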
Check out some end-to-end solutions for data visualization.
For example, Metatron Discovery uses Druid as its OLAP engine. You just link your Hadoop cluster with Druid, and then you can manage and visualize your Hadoop data accordingly. It is open source, so you can also look at the code behind it.
We have billions of records in a relational format (e.g. transaction id, user name, user id, and some other fields). My requirement is to create a system where a user can request a data export from this data store (the user will provide filters like user id, date, and so on). The exported file will typically contain anywhere from thousands to hundreds of thousands to millions of records depending on the selected filters (the output file will be CSV or a similar format).
Besides the raw data, I am also looking to do some dynamic aggregation on a few of the fields during the export.
The typical time between the user submitting a request and the exported data file being available should be within 2-3 minutes (4-5 minutes at most).
I am seeking suggestions on backend NoSQL stores for this use case. I've used Hadoop MapReduce so far, but batch MapReduce jobs over HDFS data might not meet the expected SLA in my opinion.
Another option is to use Spark, which I've never used, but it should be much faster than a typical Hadoop MapReduce batch job.
We've already tried a production-grade RDBMS/OLTP instance, but it clearly isn't the right option due to the size of the data we are exporting and the dynamic aggregation.
Any suggestions on using Spark here? Or a better-suited NoSQL store?
In summary, the SLA, dynamic aggregation, and raw data volume (millions of records) are the key requirements here.
If the system only needs to export data after doing some ETL (aggregations, filtering, and transformations), then the answer is very straightforward: Apache Spark is the best fit. You would have to fine-tune the system and decide whether you want to use memory only, memory + disk, serialization, etc. However, most of the time one needs to think about other aspects too; I consider them below as well.
This is a wide topic and it involves many aspects, such as the aggregations involved, search-related queries (if any), and development time. From the description, it seems to be an interactive/near-real-time system. Another aspect is whether any analysis is involved, and another important point is the type of system (OLTP/OLAP, reporting only, etc.).
I see there are two questions involved -
Which computing/data processing engine to use?
Which data storage/NoSQL?
- Data processing -
Apache Spark would be the best choice for computing. We are using it for the same purpose; along with filtering, we also have XML transformations to perform, which are also done in Spark. It is much faster than Hadoop MapReduce. Spark can run standalone and it can also run on top of Hadoop.
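As a rough sketch of the filter-plus-dynamic-aggregation export described in the question (the input path, column names, and filter values are assumptions for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filtered-export").getOrCreate()

# Source data kept in a columnar format on HDFS (path and schema are hypothetical)
transactions = spark.read.parquet("hdfs:///data/transactions")

# Filters coming from the user's export request
filtered = transactions.where(
    (F.col("user_id") == 12345) &
    (F.col("tx_date").between("2017-01-01", "2017-03-31"))
)

# Dynamic aggregation requested alongside the raw export
summary = filtered.groupBy("user_id").agg(
    F.count("*").alias("tx_count"),
    F.sum("amount").alias("total_amount"),
)

# Write both outputs as CSV for the user to download
filtered.write.mode("overwrite").option("header", True).csv("hdfs:///exports/raw")
summary.write.mode("overwrite").option("header", True).csv("hdfs:///exports/summary")
```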
- Storage -
There are many NoSQL solutions available. The selection depends on many factors such as volume, the aggregations involved, search-related queries, etc.
Hadoop - You can go with Hadoop, using HDFS as the storage system. It has many benefits, as you get the entire Hadoop ecosystem. If you have analysts/data scientists who need to get insights from the data or play with it, then this would be a good choice, as you would get tools such as Hive and Impala. Also, resource management would be easy. But for some applications it can be too much.
Cassandra - Cassandra is a storage engine that has solved the problems of distribution and availability while maintaining scale and performance. It works wonders when used with Spark, for example for performing complex aggregations. By the way, we are using it. For visualization (to view and analyze the data), options include Apache Zeppelin and Tableau (there are lots of options).
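To give an idea of the Spark-plus-Cassandra combination, here is a minimal sketch assuming the spark-cassandra-connector package is on the classpath; the host, keyspace, and table names are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Cassandra contact point configured on the Spark session
spark = (SparkSession.builder
         .appName("cassandra-aggregation")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# Read a Cassandra table as a DataFrame via the spark-cassandra-connector
# (keyspace and table are hypothetical)
events = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="analytics", table="events")
          .load())

# A complex aggregation that plain CQL cannot express easily
per_user = events.groupBy("user_id").agg(F.count("*").alias("event_count"))
per_user.show()
```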
Elasticsearch - Elasticsearch is also a suitable option if your storage is in the range of a few TBs up to 10 TB. It comes with Kibana (a UI) which provides limited analytics capabilities, including aggregations. Development time is minimal; it's very quick to implement.
So, depending on your requirements, I would suggest Apache Spark for data processing (transformations/filtering/aggregations), and you may also need to consider another technology for storage and data visualization.
I'm going to write a sales analytics application with Spark. Every night I get a delta dataset with new sales data (the sales of the previous day). Later I want to implement some analytics such as association rules or product popularity.
The sales data contains information about:
store-id
article-group
timestamp of cash-point
article GTIN
amount
price
So far I have used a simple .textFile method and RDDs in my applications. I have heard something about DataFrames and Parquet, which is a table-like data format, right? And what about storing the data once in a database (I have HBase installed in a Hadoop cluster) and reading it from there later?
Can someone give a short overview of the different save/load possibilities in Spark, and a recommendation on what to use for this data?
The data volume is currently about 6 GB, which represents data for 3 stores over about 1 year. Later I will work with data from ~500 stores over a time period of ~5 years.
You can use Spark to process that data without any problem. You can read from a CSV file as well (there's a library from Databricks that supports CSV). You can manipulate it, and from an RDD you're one step closer to turning it into a DataFrame. And you can write the final DataFrame directly into HBase.
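As a minimal sketch of the CSV-to-DataFrame part (writing to HBase additionally needs an HBase connector on the classpath, so only the CSV and Parquet steps are shown here; the path and column names are assumptions based on the fields you listed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-load").getOrCreate()

# Read the nightly delta file into a DataFrame instead of a raw RDD
# (path and column names are hypothetical)
sales = (spark.read
         .option("header", True)
         .option("inferSchema", True)
         .csv("hdfs:///data/sales/delta_2016-05-01.csv"))

# Parquet keeps the schema and is much faster to scan than plain text
sales.write.mode("append").partitionBy("store_id").parquet("hdfs:///data/sales/parquet")

# Later analytics can read the accumulated Parquet data back
all_sales = spark.read.parquet("hdfs:///data/sales/parquet")
all_sales.groupBy("article_group").count().show()
```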
You can find all the documentation you need here:
http://spark.apache.org/docs/latest/sql-programming-guide.html
https://www.mapr.com/blog/spark-streaming-hbase
Cheers,
Alex
Hi everyone.
I have about 6 GB of data in HDFS that has been exported from MySQL, and I have written MapReduce jobs to pre-process the data and fill in some key fields so that it can be queried easily.
Since the business requirements call for different aggregations grouped by day, hour, hospital, area, etc.,
I have to write many Hive SQL queries that export data to local disk, and then write Python scripts to parse the files on local disk and extract the data that is needed.
Is there some good technique on Hadoop to address this requirement? I am still considering options.
Can you help me, please?