I am using VieleRETS as a client application to fetch data into our MySQL database. If I use a different RETS client application, such as RETS Connector, to fetch the data, will the same amount of data be updated, or will it vary?
Not 100% sure what you are asking.
If you are asking whether 2 different MLS Systems have the same schema, the answer is no. Well, not likely at least.
If you are asking whether two clients, VieleRETS and some other client, will see the same schema from the same MLS system, the answer is yes. And in this scenario you should see the same record counts, assuming no defects in the data.
A simple example of a data defect: the DMQL queries ListingID=0+ and ModificationTimestamp=1900-01-01T00:00:00+ should both return all listings. If a listing's timestamp or ID were null, the counts would differ.
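For illustration, here's a rough Python sketch of how you could run that consistency check yourself against a RETS search endpoint. The URL, class name and credentials are placeholders, and a real session would also need the RETS Login transaction and version headers first:

import re
import requests
from requests.auth import HTTPDigestAuth

SEARCH_URL = "https://rets.example.com/rets/search"   # placeholder endpoint
AUTH = HTTPDigestAuth("user", "pass")                 # placeholder credentials

def record_count(dmql_query):
    # Count=2 asks the server for a count only, with no record payload.
    params = {
        "SearchType": "Property",
        "Class": "RES",            # placeholder class name; varies by MLS
        "QueryType": "DMQL2",
        "Query": dmql_query,
        "Count": "2",
    }
    xml = requests.get(SEARCH_URL, params=params, auth=AUTH).text
    return int(re.search(r'<COUNT\s+Records="(\d+)"', xml).group(1))

# Both of these should return every listing; a mismatch points at null ids or timestamps.
print(record_count("(ListingID=0+)"))
print(record_count("(ModificationTimestamp=1900-01-01T00:00:00+)"))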
Background information
We sell an API to users that analyzes and presents corporate financial-portfolio data derived from public records.
We have an "analytical data warehouse" that contains all the raw data used to calculate the financial portfolios. This data warehouse is fed by an ETL pipeline, and so isn't "owned" by our API server per se. (E.g. the API server only has read-only permissions to the analytical data warehouse; the schema migrations for the data in the data warehouse live alongside the ETL pipeline rather than alongside the API server; etc.)
We also have a small document store (actually a Redis instance with persistence configured) that is owned by the API layer. The API layer runs various jobs to write into this store, and then queries data back as needed. You can think of this store as a shared persistent cache of various bits of the API layer's in-memory state. The API layer stores things like API-key blacklists in here.
Problem statement
All our input data is denominated in USD, and our calculations occur in USD. However, we give our customers the query-time option to convert the response just-in-time to another currency. We do this by having the API layer run a background job to scrape exchange-rate data, and then cache it in the document store. Individual API-layer nodes then do (in-memory-cached-with-TTL) fetches from this exchange-rates key in the store, whenever a query result needs to be translated into a specific currency.
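Roughly, each node's read path looks like this (the Redis key name, host and TTL here are illustrative, not our real values):

import json, time
import redis  # redis-py

store = redis.Redis(host="document-store", port=6379)  # our API-layer document store
TTL_SECONDS = 300
local_cache = {"rates": None, "fetched_at": 0.0}

def get_exchange_rates():
    # Per-node in-memory cache with a TTL, in front of the shared Redis key.
    now = time.time()
    if local_cache["rates"] is None or now - local_cache["fetched_at"] > TTL_SECONDS:
        raw = store.get("exchange_rates")   # written by the background scraping job
        local_cache["rates"] = json.loads(raw)
        local_cache["fetched_at"] = now
    return local_cache["rates"]

def convert(amount_usd, currency):
    # Just-in-time conversion of a USD-denominated query result.
    return amount_usd * get_exchange_rates()[currency]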
At first, we thought that this unit conversion wasn't really "about" our data, just about the API's UX, and so we treated it as an entirely API-layer concern, where it made sense to store the exchange-rate data in our document store.
(Also, we noticed that, by not pre-converting our DB results into a specific currency on the DB side, the calculated results of a query for a particular portfolio became more cache-friendly; the way we're doing things, we can cache and reuse the portfolio query results between queries, even if the queries want the results in different currencies.)
But recently we've been expanding into allowing partner clients to execute complex data-science/business-intelligence (BI) queries directly against our analytical data warehouse. And it turns out that they, too, will often need to do final exchange-rate conversions in their BI queries, despite there being no API layer involved here.
It seems like, to serve the needs of BI querying, the exchange-rate data "should" actually live in the analytical data warehouse alongside the financial data; and the ETL pipeline "should" be responsible for doing the API scraping required to fetch and feed in the exchange-rate data.
But this feels wrong: the exchange-rate data has a different lifecycle and different integrity constraints from our financial data. The exchange rates are dirty, ephemeral point-in-time samples obtained by scraping, whereas the financial data is a reliable historical event stream. The exchange rates get constantly updated/overwritten, while the financial data is append-only. And so on.
What is the best practice for serving the needs of analytical queries that need to access backend "application state" for "query result presentation" needs like this? Or am I wrong in thinking of this exchange-rate data as "application state" in the first place?
What I find interesting about your scenario is when the exchange-rate data is applicable.
In the case of the API, it's all about the real-time value in the other currency, and it makes sense to have the most recent value in your API app scope (Redis).
However, I assume your analytical data warehouse has tables with purchases that were made at a certain time. In those cases, the current exchange rate is not really relevant to the value of the transaction.
This might mean that you want to store the exchange-rate history in your warehouse, or expand the "purchases" table to store the values in all the currencies at that moment.
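As a tiny illustration of the difference (the rates and dates below are made up), a point-in-time conversion keys the lookup on the transaction date rather than on "now":

from datetime import date

# Hypothetical rate history: (currency, as_of_date) -> units per USD.
rate_history = {
    ("EUR", date(2023, 3, 1)): 0.94,
    ("EUR", date(2023, 3, 2)): 0.95,
}

def convert_at(amount_usd, currency, as_of):
    # Use the rate that was in effect when the transaction happened, not the current rate.
    return amount_usd * rate_history[(currency, as_of)]

# A purchase made on 2023-03-01 is always valued with that day's rate.
print(convert_at(100.0, "EUR", date(2023, 3, 1)))  # 94.0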
In an application, we have to send a sensor data stream from multiple clients to a central server over the internet. One obvious solution is to use a MOM (message-oriented middleware) such as Kafka, but I recently learned that we can also do this with database synchronization tools such as Oracle materialized views.
The latter approach works in some applications (sending data from a central server to multiple clients, the inverse direction of ours), but what are its pros and cons in our application? Which one is better for sending a sensor data stream from multiple (~100) clients to a server, in terms of speed, security, etc.?
Thanks.
P.S.
For more detail, consider an application in which many (about 100) clients have to send streaming data (1 MB of data per minute) to a central server over the internet. The data is needed on the server for online monitoring, analysis, and computation such as machine-learning and data-mining tasks.
My question is about the difference between a DB-to-DB connection and streaming solutions such as Kafka for transferring data from the clients to the server.
Prologue
I'm going to try to break your question down in order to get a clearer understanding of your current requirements and then build it back up again. This has taken a long time to write, so I'd really appreciate it if you do two things off the back of it:
Be sceptical - there's absolutely no substitute for testing things yourself. The internet is very useful as a guide but there's no guarantee that the help you receive (if this answer is even helpful!) is the best thing for your specific situation. It's impossible to completely describe your current situation in the space allotted and so any answer is, of necessity, going to be lacking somewhere.
Look again at how you explained yourself - this is a valid question that's been partially held back by a lack of clarity in your description of the system and what you're trying to achieve. Getting someone unfamiliar with your system to look over a complex question before you post it may help.
Problem definition
sensory data stream from multiple clients to a central server
You're sending data from multiple locations to a single persistence store
online monitoring
You're going to be triggering further actions based off the raw data and potentially some aggregated data
analysis and some computation such as machine learning and data mining tasks
You're going to be performing some aggregations on the clients' data, i.e. you require aggregations of all of the clients' data to be persisted (however temporarily) somewhere
Further assumptions
Because you're talking about materialized views we can assume that all the clients persist data in a database, probably Oracle.
The data coming in from your clients is about the same topic.
You've got ~100 clients; at that number we can assume that:
the number of clients might change
you want to be able to add clients without increasing the number of methods of accessing data
You don't work for one of Google, Amazon, Facebook, Quantcast, Apple etc.
Architecture diagram
Here, I'm not making any comment on how it's actually going to work - it's the start of a discussion based on my lack of knowledge of your systems. The "raw data persistence" can be files, Kafka, a database, etc. This is a description of the components that are going to be required and a rough guess at how they will have to connect.
Applying assumed architecture to materialized views
A materialized view is a persisted query. Therefore you have two choices:
Create a query that unions all 100 clients' data together. If you add or remove a client you must change the query, and if a network issue occurs at any one of your clients then everything fails.
Write and maintain 100 materialized views, so that the Oracle database at your central location has 100 incoming connections.
As you can probably guess from the tradeoffs you'd have to make, I do not like materialized views as the sole solution. We should be trying to reduce the amount of repeated code and the number of single points of failure.
You can still use materialized views, though. If we take our diagram and remove all the duplicated arrows in your central location, it implies two things:
There is a single service that accepts incoming data
There is a single service that puts all the incoming data into a single place
You could then use a single materialized view for your aggregation layer (if your raw data persistence isn't in Oracle you'll first have to put the data into Oracle).
Consequences of changes
Now that we've decided you have a single data pipeline, your decisions actually become harder. We've decoupled your clients from the central location, and the aggregation layer from our raw data persistence. This means that the choices are now yours, but they're also considerably easier to change.
Reimagining architecture
Here we need to work out what technologies aren't going to change.
Oracle databases are expensive, and you're pushing roughly 140 GB/day into yours (100 clients × 1 MB/minute ≈ 144 GB/day, or about 50 TB/year - quite a bit). I don't know if you're actually storing all the raw data, but at those volumes it's less likely that you are - you're only storing the aggregations.
I'm assuming you've got some preferred technologies where your machine learning and data mining happen. If you don't, then consider getting some, to prevent the madness of supporting everything.
Putting all of this together we end up with the following. There's actually only one question that matters:
How many times do you want to read your raw data off your database?
If the answer to that is once, then we've just described middleware of some description. If the answer is more than once, then I would reconsider unless you've got some very good disks. Whether you use Kafka for this middle layer is completely up to you. Use whatever you're most familiar with and whatever you're most willing to invest the time into learning and supporting. The amount of data you're dealing with is non-trivial and there's going to be some trial and error in getting this right.
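If you do go the Kafka route for that middle layer, the client side can stay very small. Here's a rough sketch in Python (the broker address and topic name are placeholders; confluent-kafka would look much the same):

from kafka import KafkaProducer  # kafka-python
import json, time

producer = KafkaProducer(
    bootstrap_servers=["central-broker:9092"],   # placeholder for your central brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    compression_type="gzip",                     # sensor payloads usually compress well
)

def publish_reading(client_id, payload):
    # One topic for all ~100 clients; keying by client id keeps each
    # client's readings ordered within a partition.
    producer.send("sensor-readings", key=client_id.encode("utf-8"), value=payload)

publish_reading("client-042", {"ts": time.time(), "temperature": 21.7})
producer.flush()

Consumers at your central location then read from that one topic and write into your raw data persistence layer.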
One final point about this: we've defined a data pipeline, a single method of data flowing through your system. In doing so, we've increased the flexibility of the system. Want to add more clients? No need to do anything. Want to change the technology behind part of the system? As long as the interface remains the same, there's no issue. Want to send data elsewhere? No problem - it's all in the raw data persistence layer.
When I run a query to copy data from one schema to another, does it perform all the SQL on the server end, or does it copy data to a local application and then push it back out to the DB?
The two tables sit in the same DB, but the DB is accessed through a VPN. Would it change if it were across databases?
For instance (Running in Toad Data Point):
create table schema2.table
as
select
  sum(row1) as row1_total  -- a CTAS needs a column alias for an expression
  ,row2
from schema1.table         -- source table, not just the schema
group by row2
The reason I ask is that I'm getting quotes for a virtual machine in Azure and want to make sure that I'm not going to break the bank on data-transfer costs.
The processing of SQL statements on the same database usually takes place entirely on the server and generates little network traffic.
In Oracle, schemas are a logical object. There is no physical barrier between them. In a SQL query using two tables it makes no difference if those tables are in the same schema or in different schemas (other than privilege issues).
Some exceptions:
Real Application Clusters (RAC) - RAC may share a huge amount of data between the nodes. For example, if the table was cached on one node and the processing happened on another, it could send all the table data through the network. (I'm not sure how this works on the cloud though. Normally the inter-node traffic is done with a separate, dedicated network connection.)
Database links - It should be obvious if your application is using database links though.
Oracle Reports and Forms(?) - A few rare tools have client-side PL/SQL processing. Possibly those programs might send data to the client for processing. But I still doubt it would do something crazy like send an entire table to the client to be sorted, and then return the results to the server.
Backups/archive logs - I assume all the data will be backed up. I'm not sure how that's counted, but possibly that means all data written will also be counted as network traffic eventually.
The queries below are examples of different ways to check the network traffic being generated.
--SQL*Net bytes sent for a session.
select *
from gv$sesstat
join v$statname
on gv$sesstat.statistic# = v$statname.statistic#
--You probably also want to filter for a specific INST_ID and SID here.
where lower(display_name) like '%sql*net%';
--SQL*Net bytes sent for the entire system.
select *
from gv$sysstat
where lower(name) like '%sql*net%'
order by value desc;
I have been using caching for a long time. We store data against a key and fetch it from the cache whenever required. I know that Stack Overflow and many other sites rely heavily on caching. My question is: do they always use a key-value mechanism for caching, or do they form some SQL-like query within the cache? For instance, say I want to view last week's report, and the report's content varies each day. Do I need to store a separate report against each day (with the day as the key), or can I get this result by forming some query that aggregates results across different keys? Does any caching product (like Redis) provide this functionality?
Thanks In Advance
A cache is always done as a key-value hash table; this is how it stays so fast. If you're doing querying, then you're not doing caching.
What you may be trying to ask is this: you could have a table in your database that contains aggregated report data, and you could query against that pre-calculated table.
One of the reasons a cache (e.g. memcached) is fast is the simplicity of its data-access and querying protocol.
The more functionality you add, the more you will have to trade off in efficiency. A full-fledged SQL engine in a "caching" database is not a good design. You can, however, use a data-structure-oriented database like Redis to design your cached data to suit your querying needs, for example one set or one hash for each date.
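As a rough sketch of that last idea (the key names here are made up), one Redis hash per day still lets you build a last-week report by folding the daily hashes together on the application side:

from datetime import date, timedelta
import redis

r = redis.Redis()

def record_page_view(day, page):
    # One hash per day, as suggested above; the fields are whatever you aggregate on.
    r.hincrby("report:" + day.isoformat(), page, 1)

def last_week_report(today):
    # Aggregate the last 7 daily hashes client-side into one report.
    totals = {}
    for offset in range(7):
        day = today - timedelta(days=offset)
        for field, value in r.hgetall("report:" + day.isoformat()).items():
            totals[field.decode()] = totals.get(field.decode(), 0) + int(value)
    return totals

print(last_week_report(date.today()))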
Going a step further, you can use databases like MongoDB or MemSQL, which are pretty fast and have rich querying support, so an occasional aggregation report won't be an issue.
However, as a design decision, you will have to accept that their caching throughput will not be as high as that of memcached or Redis.
We have a need coming up in an application where the following is true:
A web page uses AJAX to request data from a server.
The specification of the data (e.g. table name) requested from the server will not be known until run-time.
The configuration of the data view is itself data-driven, and configurable by an administrator.
Data updates and inserts must be supported, not just views.
Prototyping this was very easy - we could pass in the appropriate information (table name, changeset, whatever) to a generic data service that just did what it was told (using JSON as the data storage mechanism). The data service could do basic validation on the parameters to ensure the current user can perform the requested operation (read the data, insert a row, read the row).
The issue we have now is that we are looking to do this in a secure, production-ready manner, and the idea of passing table names and column names around is frightening. Everything we think of to deal with this devolves into trusting the client in some significant way, or seems to involve substantial bookkeeping on the server. For example:
User requests a viewing page.
The server notes the table name and saves it server side with a request ID
The server notes the column names and saves them, replacing them with "col1, col2", etc., and stores the mapping with the request ID data.
The client page sends the request ID to the service, which looks up the server storage by ID
The service returns col1, col2, etc.
This would work, we think, but feels very messy.
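For what it's worth, the bookkeeping itself is fairly small. Here's a sketch of the flow above (the names are hypothetical, and a real version would need expiry and per-user scoping on the stored mappings):

import secrets

# Server-side store of request-scoped mappings; the client only ever sees the opaque id.
request_store = {}

def create_view_request(table, columns):
    # Called when the viewing page is rendered; returns only what the client may see.
    request_id = secrets.token_urlsafe(16)
    aliases = {"col%d" % (i + 1): col for i, col in enumerate(columns)}
    request_store[request_id] = {"table": table, "aliases": aliases}
    return {"request_id": request_id, "columns": list(aliases)}

def fetch_rows(request_id):
    mapping = request_store[request_id]   # unknown id -> KeyError -> reject the request
    real_columns = ", ".join(mapping["aliases"].values())
    # Only server-stored names ever reach the SQL text; the client contributed nothing but the id.
    return "SELECT %s FROM %s" % (real_columns, mapping["table"])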
Does anyone have experience with this kind of problem and can offer a solution?
Do you need to give them access to raw tables?
Perhaps you can go meta, and make a meta-table that stores the tabular data in a secure manner (i.e., only the system knows the real table/schema, while the user's concepts of schema and table are just abstractions that all map back to the same schema/table)...
Again, more information is needed as to what can be abstracted. Allowing DDL operations by the end-users is asking for trouble, as you rightfully assessed, and I would just abstract that so that "DDL" becomes DML.
However, mapping actual SQL that is written against this data would be much more difficult to abstract, if that is a requirement.
If I had to expose back-end information to end customers, I'd probably hide the actual physical representation using metadata that remaps table names and columns to more user-friendly text. That would also enable me to provide views on the tables that are a bit more advanced than plain table/column names, such as properly modeling associations between tables and so on...
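To make that concrete, here is a minimal sketch of the kind of metadata mapping I mean (all names are hypothetical). The client only ever deals in the friendly names, and anything not in the whitelist is simply rejected:

# Whitelist mapping user-facing entity names to physical tables/columns.
# Nothing client-supplied is ever interpolated into SQL directly.
ENTITIES = {
    "customers": {
        "table": "app.customer_master",
        "columns": {"name": "cust_name", "created": "created_dt"},
    },
}

def build_select(entity, requested_columns):
    meta = ENTITIES[entity]                                 # unknown entity -> KeyError -> reject
    cols = [meta["columns"][c] for c in requested_columns]  # unknown column -> KeyError -> reject
    return "SELECT %s FROM %s" % (", ".join(cols), meta["table"])

print(build_select("customers", ["name", "created"]))
# SELECT cust_name, created_dt FROM app.customer_master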