I am building a new application where I am expecting a high volume of geo location data something like a moving object sending geo coordinates every 5 seconds. This data needs to be stored in some database so that it can be used for tracking the moving object on a map anytime. So, I am expecting about 250 coordinates per moving object per route. And each object can run about 50 routes a day. and I have 900 such objects to track. SO, that brings to about 11.5 million geo coordinates to store per day. I have to store about one week of data at least in my database.
This data will be basically used for simple queries like find all the geocoordates for a particular object and a particular route. so, the query is not very complicated and this data will not be used for any analysis purpose.
SO, my question is should I just go with normal Oracle database like 12C distributed over two VMs or should I think about some big data technologies like NO SQL or hadoop?
One of the key requirement is to have high performance. Each query has to respond withing 1 second.
Since you know the volume of data (11.5 million) you can easily simulate the all your scenario in Oracle DB and test it well before.
My suggestions are you need to go for day level partitions & 2 sub partitions like objects & routs. All your business SQL has to hit right partitions always.
and also you might required to clear older days data. or Some sort of aggregation you can created with past days and delete your raw data would help.
its well doable 12C.
Related
I am doing some research about what the best possible state that data should be in so that reporting and BI analytics perform well but can be produced by business users from a set of various data collections which align with a business data glossary that I have worked through.
We have not chosen a specific BI tool but have been playing around with Power BI and Sisense
We have not decided on a data store technology to use for reporting purposes
Origin Data
Our business application that the data will originate from has a normalised SQL relational database. There are quite a few tables and joins to consider which work fine from an application perspective but I have recommended supplying the output of those queries as a flat denormalised set of data to increase redundancy and remove the joins entirely.
Business Data Glossary
As we go through defining the business data glossary, the number of columns increases but I do not anticipate there being any more than 100 columns per row as a complete reporting set of data. I wanted to ensure that each row of data is at a transactional depth (level 0) and that the roll up through the data would be done through aggregations by distinct key values and dimensional taxonomy.
Architecture
I want some advice around what a modern architecture looks like and what works for business users rather than users who are comfortable with SQL queries and a myriad of joins on a physical data model.
I read an article about setting up data flows for Power BI which looked like they type of thing I want to do from a data availability perspective but it doesn't advice on how the data should be stored and what type of database to use.
Data Sets
The data we have that needs to be reported on are transactions where level 0 is trade positions (individual transactions from either a local or counterparty entity), level 1 is reconciliations (relating local and counterparty entities and trade linking identifier) and level 2 would be where it can be rolled up by taxonomy like asset type or status.
The current data set size would be a snapshot of positions every business day so, its duplicated every day with a snapshot date applied. The reports would be able to move across dates and show changes over time.
Any advice would be greatly appreciated on how to tackle reporting and BI in 2020. Oooh, one last thing, there is the possibility that we won't be allowed to process this type of data in the public cloud, we have our own infrastructure which is on private cloud so, that might need to be a consideration. Thanks
My program receives thousands of events in a second from different types. For example 100k API access in a second from users with millions of different IP addresses. I want to keep statistics and limit number of accesses in 1 minute, 1 hour, 1 day and so on. So I need event counts in last minute, hour or day for every user and I want it to be like a sliding window. In this case, type of event is the user address.
I started using a time series database, InfluxDB; but it failed to insert 100k events per second and aggregate queries to find event counts in a minute or an hour is even worse. I am sure InfluxDB is not capable of inserting 100k events per second and performing 300k aggregate queries at the same time.
I don't want events retrieved from the database because they are just a simple address. I just want to count them as fast as possible in different time intervals. I want to get the number of events of type x in a specific time interval (for example, past 1 hour).
I don't need to store statistics in the hard disk; so maybe a data structure to keep event counts in different time intervals is good for me. On the other hand, I need it to be like a sliding window.
Storing all the events in RAM in a linked-list and iterating over it to answer queries is another solution that comes to my mind but because the number of events is too high, keeping all of the events in RAM could not be a good idea.
Is there any good data structure or even a database for this purpose?
You didn't provide enough details on events input format and how events can be delivered to statistics backend: is it a stream of udp messages, http put/post requests or smth else.
One possible solution would be to use Yandex Clickhouse database.
Rough description of suggested pattern:
Load incoming raw events from your application into memory-based table Events
with Buffer storage engine
Create materialized view with per-minute aggregation in another
memory-based table EventsPerMinute with Buffer engine
Do the same for hourly aggregation of data in EventsPerHour
Optionally, use Grafana with clickhouse datasource plugin to build
dashboards
In Clickhouse DB Buffer storage engine not associated with any on-disk table will be kept entirely in memory and older data will be automatically replaced with fresh. This will give you simple housekeeping for raw data.
Tables (materialized views) EventsPerMinute and EventsPerHour can be also created with MergeTree storage engine if case you want to keep statistics on disk. Clickhouse can easily handle billions of records.
At 100K events/second you may need some kind of shaper/load balancer in front of database.
you can think of a hazelcast cluster instead of simple ram. I also think a graylog or simple elastic seach but with this kind of load you shoud test. You can think about your data structure as well. You can construct a hour map for each address and put the event into the hour bucket. And when the time passes the hour you can calculate the count and cache in this hour's bucket. When you need a minute granularity you go to hours bucket and count the events under the list of this hour.
There is a situation in our systems in which the user can view and "close" a report. After they close it, the report is moved to a temporary table inside the database where it is kept for 24 hrs, and then moved to an archives table(where the report is stored for next 7 years). At any point during the 7 years, a user can "reopen" the report and work on it. The problem is that archives storage is getting large and finding/reopening reports tend to be time consuming. And I need to get statistics on the archives from time to time(i.e. report dates, clients, average length "opened", etc). I want to use a big data approach but I am not sure whether to use Hadoop, Cassandra, or something else ? Can someone provide me with some guidelines how to get started and decide on what to use ?
If you archive is large and you'd like to get reports from it, you won't be able to use just Cassandra, as it has no easy means of aggregating the data. You'll end up collocating Hadoop and Cassandra on the same nodes.
From my experience archives (write once - read many) is not the best use case for Cassandra if you're having a lot of writes (we've tried it for a backend for a backup sysyem). Depending on your compaction strategy you'll pay either in space or in iops for having that. Added changes are propagated through the SSTable hierarchies resulting in a lot more writes than the original change.
It is not possible to answer your question in full without knowing other variables: how much hardware (servers, their ram/cpu/hdd/ssd) are you going to allocate? what is the size of each 'report' entry? how many reads / writes you usually serve daily? How large is your archive storage now?
Cassandra might work fine. Keep two tables, reports and reports_archive. Define the schema using a TTL of 24 hours and 7 years:
CREATE TABLE reports (
...
) WITH default_time_to_live = 86400;
CREATE TABLE reports_archive (
...
) WITH default_time_to_live = 86400 * 365 * 7;
Use the new Time Window Compaction Strategy (TWCS) to minimize write amplification. It could be advantageous to store the report metadata and report binary data in separate tables.
For roll-up analytics, use Spark with Cassandra. You don't mention the size of your data, but roughly speaking 1-3 TB per Cassandra node should work fine. Using RF=3 you'll need at least three nodes.
I'm having fun learning about Hadoop and the various projects around it and currently have 2 different strategies I'm thinking about for building a system to store a large collection of market tick data, I'm just getting started with both Hadoop/HDSF and HBase but hoping someone can help me plant a system seed that I won't have to junk later using these technologies. Below is an outline of my system and requirements with some query and data usage use cases and lastly my current thinking about the best approach from the little documentation I have read. It is an open ended question and I'll gladly like any answer that is insightful and accept the best one, feel free to comment on any or all of the points below. - Duncan Krebs
System Requirements - Be able to leverage the data store for historical back testing of systems, historical data charting and future data mining. Once stored, data will always be read-only, fast data access is desired but not a must-have when back testing.
Static Schema - Very Simple, I want to capture 3 types of messages from the feed:
Timestamp including date,day,time
Quote including Symbol,timestamp,ask,askSize,bid,bidSize,volume....(About 40 columns of data)
Trade including Symbol,timestamp,price,size,exchange.... (About 20 columns of data)
Data Insert Use Cases - Either from a live market stream of data or lookup via broker API
Data Query Use Cases - Below demonstrates how I would like to logically query my data.
Get me all Quotes,Trades,Timestamps for GOOG on 9/22/2014
Get me all Trades for GOOG,FB BEFORE 9/1/2014 AND AFTER 5/1/2014
Get me the number of trades for these 50 symbols for each day over the last 90 days.
The Holy Grail - Can MapReduce be used for uses cases like these below??
Generate meta-data from the raw market data through distributed agents. For example, Write a job that will compute the average trading volume on a 1 minute interval for all stocks and all sessions stored in the database. Create the job to have an agent for each stock/session that I tell what stock and session it should compute this value for. (Is this what MapReduce can do???)
On the classpath of the agents can I add my own util code so that the use case above for example could publish its value into a central repo or Messaging server? Can I deploy an agent as an OSGI bundle?
Create different types of agents for different types of metrics and scores that are executed every morning before pre-market trading?
High Frequency Trading
I'm also interested if anyone can share some experience using Hadoop in the context of high frequency trading systems. Just getting into this technology my initial sense is Hadoop can be great for storing and processing large volumes of historic tick data, if anyone is using this for real-time trading I'd be interested in learning more! - Duncan Krebs
Based of my understanding of your requirements, Hadoop would be really good solution to store your data and run your queries on it using Hive.
Storage: You can store the data in Hadoop in a directory structure like:
~/stock_data/years=2014/months=201409/days=20140925/hours=01/file
Inside the hours folder, the data specific to that hour of the day can reside.
One advantage of using such structure is that you can create external tables in Hive over this data with your partitions on years, months, days and hours. Something like this:
Create external table stock_data (schema) PARTITIONED BY (years bigint, months bigint, days bigint, hours int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION
'~/stock_data'
Coming to the queries part, once you have the data stored in the format mentioned above you can easily run simple queries.
Get me all Quotes,Trades,Timestamps for GOOG on 9/22/2014
select * from stock_data where stock = 'GOOG' and days = 20140922
Get me all Trades for GOOG,FB BEFORE 9/1/2014 AND AFTER 5/1/2014
select * from stock_data where stock in ('GOOG', 'FB') and days > 20140501 and days < 20140901)
You can run any such aggregation queries once in a day and use the output to come up with the metrics before pre-market trading. Since Hive internally runs mapreduce these queries won't be very fast.
In order to get faster results, you can use some of the in memory projects like Impala or Spark. I have myself used Impala to run queries on my hive tables and I have seen a major improvement in the run time for my queries (around 40x). Also you wouldn't need to make any changes to the structure of the data.
Data Insert Use Cases : You can use tools like Flume or Kafka for inserting data in real time to Hadoop (and thus to the hive tables). Flume is linearly scalable and can also help in processing events on the fly while transferring.
Overall, a combination of multiple big data technologies can provide a really decent solution to the problem you proposed and these solution would scale to huge amounts of data.
My team needs to find a solution to the following problem:
Our application allows users to view total sales for the enterprise, totals by product, totals by region, totals by region x product, totals by regions x division, etc. You get the idea. There are so many values that need to be aggregated to get many of those totals that they cannot be computed on the fly - we have to pre-aggregate them to provide decent response times, a process that takes about 5 minutes.
The problem, which we thought was a common one but can find no references to, is how to allow updates to various sales without shutting off the users. Also, the users cannot accept eventual consistency - if they drill down on a total of 12 they better see numbers that add up to 12. So we need Consistency + Availability.
The best solution we've come up with so far is to direct all queries to a redundant database, "B" (optimized for queries) while updates are directed to the primary database, "A". When we decide to spend the 5 minutes to update all the aggregates, we update database "C", which is yet another redundant database just like "B". Then, new user sessions get directed to "C", while existing user sessions continue to use "B". Eventually, warning anyone left using "B", we kill the sessions on "B" and re-aggregate there, swapping the roles of "B" and "C". Typical drain-stop scenario.
We are surprised that we cannot find any discussion of this and are concerned that we are over-engineering this problem or maybe it's not the problem we think it is. Any advice is greately appreciated.
This was an interesting problem so I thought about it on the train, and I came up with the idea of storing a timestamp for each row in the database that you aggregate over. (I think this technique has a name, but it escapes me and googling isn't finding it...)
The timestamp would indicate when this row was inserted. In addition:
-If rows can be updated, then you will have two 'versions' of the row at once, one more recent than the other.
-If rows can be deleted, then there will need to be a 'deleted version' row that specifies when it was deleted.
Now you can do things such as:
1) Say you update the aggregates at Jan 1 2000 midnight. You can have views of the table return the table's data as though it was Jan 1 2000 midnight, ignoring all inserts/updates/deletes more recent than that. Now the aggregates are as up to date as the data in the view AND you can keep adding data to the underlying table.
2) I don't know how feasible/easy to guarantee it's reliable this would be, but you could have 'differentially computed aggregates' where on Jan 2 2000 midnight, you take the aggregates of Jan 1 2000 midnight and update them only with the data that has been changed since that time - saving you from recomputing so much historical data. (Of course, it gets hairier once you consider rows being updated or deleted that are older than 24 hours)
3) Whenever you bring your aggregates up to date, you can merge updated and deleted rows with their older version and get rid of the older version, so you only have to keep duplicates of rows around when you need them to separate rows that have been aggregated and rows that aren't (this also means that, for instance, if all your aggregates run at once, and you update a row three times in quick succession, you only need to keep the most recent update-indicating row)
If updates cannot be computed on the fly, then caching of results sets as you are doing in another database helps solve the issue of availability with faster response times.
For consistency, you may be able to make use of some form of transaction isolation. For example, MySQL supports a number of different transaction levels, of which REPEATABLE READ may go close to providing you with some consistency in a single transaction. If a transaction can be left open for multiple requests as the users drill down to see the data, they effectively see a snapshot of the database state as of the first request.
In a more generic sense, you're just after a handle which to the data which is provided by the client to indicate a consistent set. As in Patashu's answer, the handle for a client requesting a set of aggregates could be time based. The first stage of client interaction would be to get a handle to the latest aggregate data, eg the current time. If would then pass that handle with each request. As requests are made of the server, it uses the handle to determine which set of aggregate data to return. Rather than having both server "B" and "C", all aggregate data could be stored in server "B", with all aggregate data containing the handle information. This then allows requests to a single server for aggregate data both new and old. At some point, old aggregate data could be purged from "B".
Perhaps a search on transaction isolation will turn up more results for discussion on consistency.
I think you're looking for Data Warehousing concepts
In computing, a data warehouse or enterprise data warehouse (DW, DWH,
or EDW) is a database used for reporting and data analysis. It is a
central repository of data which is created by integrating data from
one or more disparate sources. Data warehouses store current as well
as historical data and are used for creating trending reports for
senior management reporting such as annual and quarterly comparisons.
...
Unlike the ETL-based data warehouse, the integrated source data
systems and the data warehouse are all integrated since there is no
transformation of dimensional or reference data. This integrated data
warehouse architecture supports the drill down from the aggregate data
of the data warehouse to the transactional data of the integrated
source data systems.