I am studying a use case where we are going to move data from a SQL database (600 TB, ~100 tables) into Hadoop in a transformed format. We don't have logs enabled in the SQL DB. We decided to copy the data as a datamart view and to refresh this view every week; the copied data will be erased every week and rewritten.
This SQL DB is used for reporting purposes and is itself derived from the data lake. This OLTP database is an old system we are replacing progressively. The copied dataset is deleted and copied again (refreshed) every week.
80% of the data copy is straightforward, with no transformation.
20% requires redesign.
We identified 3 options:
Airflow + Beam for the processing
ETL (Informatica), which was excluded
Kafka (Connect, Streams, sink into Hadoop), optionally with Debezium for CDC
What do you think is the best approach with regard to performance, overall time to deliver, and data architecture?
Thanks for your help!
My thoughts - for what they are worth:
I would definitely not be looking to copy 600 TB per week. Given that the majority of this data will not have changed from week to week (I assume), you should be looking to copy only the data that has changed. As your data in Hadoop will be partitioned, you would mainly be inserting new data into new partitions; for those records that have changed, you would just be dropping and reloading a few partitions (see the sketch after these points).
I would copy all the necessary data into a staging area in Hadoop as-is (without transformation) and then process it on the Hadoop platform to produce the data you actually need - you can then drop the staging area data if you want
Data processing tool - if you already have experience of a specific toolset within your company then use that; don't multiply the toolsets in use unless there is critical functionality required that is not available within existing tools. If this one process is all you are going to be using this toolset for then it probably doesn't matter which one you use - pick one that is quickest to learn/deploy. If this toolset is going to be expanded to other use cases then I would definitely use a dedicated ETL/ELT tool rather than use a coding solution (why have you discarded Informatica as a solution?)
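To make the partition-level refresh from the first point concrete, here is a minimal Hive-style sketch; the database, table and column names (dw.orders, staging.orders, order_date) are made up for illustration, and it assumes the target table is partitioned by day:
-- Overwrite only the day partition whose source data changed,
-- instead of re-copying the full 600 TB every week.
INSERT OVERWRITE TABLE dw.orders PARTITION (order_date = '2021-06-01')
SELECT order_id, customer_id, amount
FROM staging.orders
WHERE order_date = '2021-06-01';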
The following is definitely an opinion...
If you are building a new analytical platform, I am surprised that you are using Hadoop. Hadoop is legacy technology that has been superseded by more modern and capable Cloud data platforms (Snowflake, etc.).
Also, Hadoop is a horrible platform to try and run analytics on (it's OK as just a data lake to hold data while you decide what you want to do with it). Trying to run queries that don't align with how the data is partitioned gives really bad performance (for non-trivial dataset sizes). For example, if your transactions are partitioned by date, then running a query to sum transaction values in the last week will run quickly. However, running a query to sum transactions for a specific account (or group of accounts) will perform very badly.
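As a rough illustration, with made-up table and column names and assuming the transactions table is partitioned by transaction_date:
-- Aligned with the partitioning: prunes down to one week's partitions.
SELECT SUM(amount)
FROM transactions
WHERE transaction_date BETWEEN '2021-06-01' AND '2021-06-07';

-- Not aligned with the partitioning: has to scan every date partition.
SELECT SUM(amount)
FROM transactions
WHERE account_id = 12345;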
I have a (desktop) application that logs high-frequency data in SQLite. Our analysts have asked to move to Parquet (for domain-specific reasons). I have ported our application, and I am getting terrible write performance (very similar to committing SQLite on every update, without controlling transactions).
Does Parquet have similar transaction control, or an analogous mechanism?
Additional background information:
In every transaction I have ~1200 columns of data to update.
I defined an entirely "flat" Parquet message schema, where every entry is required.
Additionally, I believe I've ruled out filesystem journaling-like bottlenecks, but if it's relevant, I am testing on xfs and would deploy on ext4.
And finally, this is implemented with the Rust implementation of Parquet ("parquet = 0.16.0").
I'm happy to fill in any gaps; where have I gone wrong in this port?
After researching this further: parameters such as row_group_size, compression, encoding, page_size, etc. can all be set using the WriterPropertiesBuilder. They can even be configured on a per-column basis.
This did not actually solve my problem, but it answers the gist of my question above about what can be configured on Parquet FileWriters, and where.
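For reference, a minimal sketch of that configuration with the Rust parquet crate's WriterPropertiesBuilder; the builder method names follow the 0.16-era API as I remember it and may differ slightly between releases, and the column name is hypothetical:
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

fn writer_props() -> WriterProperties {
    WriterProperties::builder()
        // Larger row groups mean fewer, bigger flushes to disk.
        .set_max_row_group_size(1_000_000)
        // Compress data pages for all columns by default.
        .set_compression(Compression::SNAPPY)
        // Settings can also be overridden per column (hypothetical column name).
        .set_column_compression(ColumnPath::from("sensor_value"), Compression::UNCOMPRESSED)
        .build()
}
The resulting properties are passed to the file writer when it is constructed; the main effect here is that many rows are buffered per row group, which avoids the commit-per-update pattern described above.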
I am currently building an ETL pipeline which outputs tables of data (on the order of ~100+ GB) to a downstream interactive dashboard that allows filtering the data dynamically (based on pre-defined and indexed filters).
I have zeroed in on using PySpark / Spark for the initial ETL phase.
Next, this processed data will be summarised (simple counts, averages, etc.) and then visualised in the interactive dashboard.
For the interactive querying part, I was wondering which tool might work best with my structured and transactional data (stored in Parquet format):
Spark SQL (in memory dynamic querying)
AWS Athena (Serverless SQL querying, based on Presto)
Elastic Search (search engine)
Redis (Key Value DB)
Feel free to suggest alternative tools, if you know of a better option.
Based on the information you've provided, I am going to make several assumptions:
You are on AWS (hence Elastic Search and Athena being options). Therefore, I will steer you to AWS documentation.
As you have pre-defined and indexed filters, you have well ordered, structured data.
Going through the options listed:
Spark SQL - If you are already considering Spark and you are already on AWS, then you can leverage AWS Elastic Map Reduce.
AWS Athena (Serverless SQL querying, based on Presto) - Athena is a powerful tool. It lets you query data stored on S3, which is quite cost effective. However, building workflows in Athena can require a bit of work, as you'll spend a lot of time managing files on S3. Historically, Athena could only produce CSV output, so it often worked best as the final stage in a big data pipeline. However, with support for CTAS statements, you can now output data in multiple formats such as Parquet, with a choice of compression algorithms (see the sketch at the end of this list).
Elastic Search (search engine) - This is not really a query tool, so it is likely not part of the core of this pipeline.
Redis (Key Value DB) - Redis is an in memory key-value data store. It is generally used to provide small bits of information to be rapidly consumed by applications in use cases such as caching and session management. Therefore, it does not seem to fit your use case. If you want some hands on experience with Redis, I recommend Try Redis.
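To illustrate the CTAS point under Athena above, a sketch of writing Parquet output from a query (the table, column and bucket names are made up):
CREATE TABLE curated_events
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-bucket/curated/events/'
) AS
SELECT event_id, user_id, event_time, payload
FROM raw_events
WHERE event_time >= DATE '2020-01-01';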
I would also look into Amazon Redshift.
For further reading, read Big Data Analytics Options on AWS.
As @Damien_The_Unbeliever recommended, there will be no substitute for your own prototyping and benchmarking.
Athena is not limited to .csv. In fact, using binary compressed formats like Parquet is a best practice with Athena, because it substantially reduces query times and cost. I have used AWS Firehose, Lambda functions and Glue crawlers to convert text data to a compressed binary format for querying via Athena. When I have had issues with processing large data volumes, the issue was forgetting to raise the default Athena limits set for the account. I have a friend who processes gigantic volumes of utility data for predictive analytics, and he did encounter scaling problems with Athena, but that was in its early days.
I also work with ElasticSearch and Kibana as a text search engine, and we use the AWS Log Analytics "solution" based on ElasticSearch and Kibana. I like both. Athena is best for working with huge volumes of log data, because it is more economical to work with it in a compressed binary format. A terabyte of JSON text data reduces down to approximately 30 GB or less in Parquet format. Our developers are more productive when they use ElasticSearch/Kibana to analyze problems in their log files, because ElasticSearch and Kibana are so easy to use. The Curator Lambda function that controls log retention times and is part of AWS Centralized Logging is also very convenient.
You can use Amazon QuickSight; it has SPICE to do the querying, and it can do visualisation at the same time.
I'm having fun learning about Hadoop and the various projects around it, and I currently have 2 different strategies I'm thinking about for building a system to store a large collection of market tick data. I'm just getting started with both Hadoop/HDFS and HBase, but I'm hoping someone can help me plant a system seed that I won't have to junk later using these technologies. Below is an outline of my system and requirements, with some query and data usage use cases, and lastly my current thinking about the best approach from the little documentation I have read. It is an open-ended question and I'll gladly take any answer that is insightful and accept the best one; feel free to comment on any or all of the points below. - Duncan Krebs
System Requirements - Be able to leverage the data store for historical back testing of systems, historical data charting and future data mining. Once stored, data will always be read-only, fast data access is desired but not a must-have when back testing.
Static Schema - Very Simple, I want to capture 3 types of messages from the feed:
Timestamp, including date, day, time
Quote, including Symbol, timestamp, ask, askSize, bid, bidSize, volume, ... (about 40 columns of data)
Trade, including Symbol, timestamp, price, size, exchange, ... (about 20 columns of data)
Data Insert Use Cases - Either from a live market stream of data or lookup via broker API
Data Query Use Cases - Below demonstrates how I would like to logically query my data.
Get me all Quotes,Trades,Timestamps for GOOG on 9/22/2014
Get me all Trades for GOOG,FB BEFORE 9/1/2014 AND AFTER 5/1/2014
Get me the number of trades for these 50 symbols for each day over the last 90 days.
The Holy Grail - Can MapReduce be used for use cases like the ones below?
Generate metadata from the raw market data through distributed agents. For example, write a job that computes the average trading volume on a 1-minute interval for all stocks and all sessions stored in the database, and create the job so that there is an agent for each stock/session that I tell which stock and session it should compute this value for. (Is this what MapReduce can do?)
Can I add my own util code to the classpath of the agents so that, for example, the use case above could publish its value into a central repo or messaging server? Can I deploy an agent as an OSGi bundle?
Create different types of agents for different types of metrics and scores that are executed every morning before pre-market trading?
High Frequency Trading
I'm also interested if anyone can share some experience using Hadoop in the context of high-frequency trading systems. Just getting into this technology, my initial sense is that Hadoop can be great for storing and processing large volumes of historic tick data; if anyone is using this for real-time trading I'd be interested in learning more! - Duncan Krebs
Based on my understanding of your requirements, Hadoop would be a really good solution to store your data and run your queries on it using Hive.
Storage: You can store the data in Hadoop in a directory structure like:
~/stock_data/years=2014/months=201409/days=20140925/hours=01/file
Inside the hours folder, the data specific to that hour of the day can reside.
One advantage of using such structure is that you can create external tables in Hive over this data with your partitions on years, months, days and hours. Something like this:
CREATE EXTERNAL TABLE stock_data (schema)
PARTITIONED BY (years BIGINT, months BIGINT, days BIGINT, hours INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '~/stock_data';
Coming to the queries part, once you have the data stored in the format mentioned above you can easily run simple queries.
Get me all Quotes,Trades,Timestamps for GOOG on 9/22/2014
select * from stock_data where stock = 'GOOG' and days = 20140922
Get me all Trades for GOOG,FB BEFORE 9/1/2014 AND AFTER 5/1/2014
select * from stock_data where stock in ('GOOG', 'FB') and days > 20140501 and days < 20140901
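For the third query use case (the number of trades per symbol for each day over the last 90 days), an aggregation along the same lines might look like this, assuming the day partitions are yyyyMMdd integers and taking 20140627 (90 days before 2014-09-25) as an example cutoff:
select stock, days, count(*) as trade_count
from stock_data
where stock in ('GOOG', 'FB')
and days >= 20140627
group by stock, days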
You can run any such aggregation queries once a day and use the output to come up with the metrics before pre-market trading. Since Hive internally runs MapReduce, these queries won't be very fast.
In order to get faster results, you can use one of the in-memory projects like Impala or Spark. I have used Impala myself to run queries on my Hive tables, and I have seen a major improvement in the run time of my queries (around 40x). Also, you wouldn't need to make any changes to the structure of the data.
Data Insert Use Cases : You can use tools like Flume or Kafka for inserting data in real time to Hadoop (and thus to the hive tables). Flume is linearly scalable and can also help in processing events on the fly while transferring.
Overall, a combination of multiple big data technologies can provide a really decent solution to the problem you proposed, and these solutions would scale to huge amounts of data.
I just started exploring Hive. It has structures similar to an RDBMS: tables, joins, partitions. What I understand is that Hive still uses HDFS for storage and that it is an SQL abstraction over HDFS. From this I am not sure whether Hive itself is a database solution like HBase or Cassandra, or simply a query system on top of HDFS. I don't think it is simply a query language, because it has tables, joins and partitions.
Hive is a data warehousing package/infrastructure built on top of Hadoop. It provides an SQL dialect called Hive Query Language (HQL) for querying data stored in a Hadoop cluster. Like all SQL dialects in widespread use, HQL doesn't fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to MySQL's dialect, but with significant differences. Hive offers no support for row-level inserts, updates, and deletes, and it doesn't support transactions, so we can't compare it with an RDBMS. Hive adds extensions to provide better performance in the context of Hadoop and to integrate with custom extensions and even external programs. It is well suited for batch processing of data such as: log processing, text mining, document indexing, customer-facing business intelligence, predictive modeling, hypothesis testing, etc.
Hive is not designed for online transaction processing and does not offer real-time queries.
I need to store a large number of small data objects (millions of rows per month). Once they're saved, they won't change. I need to:
store them securely
use them for analysis (mostly time-oriented)
retrieve some raw data occasionally
It would be nice if it could be used with JasperReports or BIRT
My first shot was Infobright Community Edition - a column-oriented, read-only storage engine for MySQL.
On the other hand, people say that a NoSQL approach could be better. Hadoop + Hive looks promising, but the documentation looks poor and the version number is less than 1.0.
I heard about Hypertable, Pentaho, MongoDB ....
Do you have any recommendations?
(Yes, I found some topics here, but they were from a year or two ago.)
Edit:
Other solutions: MonetDB, InfiniDB, LucidDB - what do you think?
I am having the same problem here and have done some research; there are several types of storage for BI:
Column-oriented, free and well known: MonetDB, LucidDB, Infobright, InfiniDB
Distributed: hTable, Cassandra (also column-oriented, theoretically)
Document-oriented: MongoDB, CouchDB
The answer depends on what you really need:
If your millions of rows are loaded at once (nightly batch or so), InfiniDB or other column-oriented DBs are the best; they have great performance and are "BI oriented". http://www.d1solutions.ch/papers/d1_2010_hauenstein_real_life_performance_database.pdf
And they won't require a setup of "nodes", "sharding" and other stuff that comes with distributed/"NoSQL" DBs.
http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/
If the rows are added in real time, then column-oriented DBs are bad. You can either have two separate DBs (that's my choice: one NoSQL DB for real-time feeding of the stats from the front end and for real-time stats, and another column-oriented DB for BI), or turn towards something that mixes column orientation (for read requests) and distribution (for writes), like Cassandra.
Document-oriented DBs are not suited for BI; they are more useful for CRM/CMS use cases where you need frequent access to a particular row.
As for the exact choice inside a category, I'm still undecided: Cassandra for distributed, and MonetDB or InfiniDB for column-oriented DBs, are the leaders. MonetDB is reported to have problems loading very big tables because it keeps its indexes in memory.
You could also consider GridSQL. Even for a single server, you can create multiple logical "nodes" to utilize multiple cores when processing queries.
GridSQL uses PostgreSQL, so you can also take advantage of partitioning tables into subtables to evaluate queries faster. You mentioned the data is time-oriented, so that would be a good candidate for creating subtables.
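A minimal sketch of that subtable approach on PostgreSQL, using inheritance-based partitioning (all table and column names are made up):
-- Parent table plus one child "subtable" per month, constrained to its time range.
CREATE TABLE events (
    event_time timestamp NOT NULL,
    payload    text
);

CREATE TABLE events_2011_01 (
    CHECK (event_time >= DATE '2011-01-01' AND event_time < DATE '2011-02-01')
) INHERITS (events);

-- With constraint exclusion enabled, a time-bounded query only scans
-- the child tables whose CHECK constraints match the predicate.
SET constraint_exclusion = partition;
SELECT count(*)
FROM events
WHERE event_time >= DATE '2011-01-01' AND event_time < DATE '2011-02-01';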
If you're looking for compatibility with reporting tools, something based on MySQL may be your best choice. As for what will work for you, Infobright may work. There are several other solutions as well; however, you may also want to look at plain old MySQL and the ARCHIVE storage engine. Each record is compressed as it is stored and, IIRC, it's designed for your type of workload, although I think Infobright is supposed to get better compression. I haven't really used either, so I'm not sure which will work best for you.
As for the key-value stores (E.g. NoSQL), yes, they can work as well and there are plenty of alternatives out there. I know CouchDB has "views", but I haven't had the opportunity to use any, so I don't know how well any of them work.
My only concern with your data set is that since you mentioned time, you may want to ensure that whatever solution you use will allow you to archive data past a certain time. It's a common data warehouse practice to only keep N months of data online and archive the rest. This is where partitioning, as implemented in an RDBMS, comes in very useful.