UI for querying large sets of non-cloud financial data

We have a large amount of financial data, stored locally, and cloud is not an option for us right now. We need to give a non-technical user a way to run a few standard queries, with the results of those queries stored in a file.
We can definitely write something in-house: a web page through which the user enters the query and corresponding parameters, which creates a job that queries the data, writes it to a file, and lets the user know when it's done.
However, I feel there might already be something that performs similar tasks. Is there a package/tech out there that provides a UI for querying large sets of financial data and dumps the results into a file?

Where is your data stored? In a database or files? What kind of queries do you want to run? SQL?
From your description of "large data" and Web UI, my immediate thoughts are Hadoop HDFS for data storage and Hive for queries and maybe Hue for a web frontend.
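If you did end up on that stack, a rough sketch of the job behind your web page - running a Hive query over JDBC and dumping the result to a CSV file - might look like the following. The host, login, table and the query itself are placeholders, not anything from your setup:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryToCsv {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Placeholder connection details for a HiveServer2 instance.
            try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 // In practice the query would be built from the parameters the user entered in the web UI.
                 ResultSet rs = stmt.executeQuery("SELECT trade_date, symbol, SUM(amount) FROM trades GROUP BY trade_date, symbol");
                 BufferedWriter out = new BufferedWriter(new FileWriter("result.csv"))) {
                int cols = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    StringBuilder line = new StringBuilder();
                    for (int i = 1; i <= cols; i++) {
                        if (i > 1) line.append(',');
                        line.append(rs.getString(i));
                    }
                    out.write(line.toString());
                    out.newLine();
                }
            }
        }
    }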


Lambda Architecture - Why batch layer

I am going through the Lambda Architecture and understanding how it can be used to build fault-tolerant big data systems.
I am wondering how the batch layer is useful when everything can be stored in the realtime view and the results generated out of it. Is it because realtime storage can't be used to store all of the data, or because it would then no longer be realtime, since the time taken to retrieve the data depends on how much data is stored?
Why batch layer
To save time and money!
It basically has two functions:
To manage the master dataset (assumed to be immutable)
To pre-compute the batch views for ad-hoc querying
"Everything can be stored in the realtime view and the results generated out of it" - NOT TRUE
The above is certainly possible, but not feasible, as the data could be hundreds or thousands of petabytes and generating results could take time... a lot of time!
The key here is to attain low-latency queries over a large dataset. The batch layer is used for creating batch views (so queries are served with low latency), and the realtime layer is used for recent/updated data, which is usually small. Now, any ad-hoc query can be answered by merging results from the batch views and the realtime views instead of computing over the whole master dataset.
Also, think of the same query running again and again over a huge dataset - a loss of time and money!
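To make the merge concrete, here is a rough, purely illustrative sketch of a serving layer answering a "page views per URL" query by combining a pre-computed batch view with a small realtime view. Both views are modelled as in-memory maps; in a real system they would be key/value stores fed by the batch and speed layers:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ServingLayer {

        private final Map<String, Long> batchView = new ConcurrentHashMap<>();    // covers all data up to the last batch run
        private final Map<String, Long> realtimeView = new ConcurrentHashMap<>(); // covers only data since the last batch run

        /** Low-latency answer: batch result plus the recent delta. */
        public long pageViews(String url) {
            long batch = batchView.getOrDefault(url, 0L);
            long recent = realtimeView.getOrDefault(url, 0L);
            return batch + recent;
        }

        /** Called by the speed layer as new events stream in. */
        public void recordRealtimeHit(String url) {
            realtimeView.merge(url, 1L, Long::sum);
        }

        /** Called after a batch run: swap in fresh batch results, drop the delta now covered by them (simplified). */
        public void loadBatchView(Map<String, Long> freshBatchResults) {
            batchView.clear();
            batchView.putAll(freshBatchResults);
            realtimeView.clear();
        }
    }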
Further to the answer provided by @karthik manchala, data processing can be handled in three ways - batch, interactive and real-time/streaming.
I believe your reference to real-time is more about interactive responses than streaming, as not all use cases are streaming-related.
Interactive responses are where the response can be expected anywhere from sub-second to a few seconds to minutes, depending on the use case. The key here is to understand that processing is done on data at rest, i.e. data already stored on a storage medium. The user interacts with the system while it processes and hence waits for the response. All the efforts around Hive on Tez, Impala, Spark core etc. are to address this issue and make the responses as fast as possible.
Streaming, on the other hand, is where data streams into the system in real time - for example Twitter feeds, click streams etc. - and processing needs to be done as soon as the data is generated. Frameworks like Storm and Spark Streaming address this space.
The case for batch processing is to address scenarios where some heavy lifting needs to be done on a huge dataset beforehand, so that the user is made to believe that the responses they see are real-time. For example, indexing a huge collection of documents into Apache Solr is a batch job, where indexing would run for minutes or possibly hours depending on the dataset. However, a user who queries the Solr index gets a response with sub-second latency. As you can see, the indexing cannot be achieved in real time, as there may be huge amounts of data. The same is the case with Google search, where indexing is done in batch mode and the results are presented in interactive mode.
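For illustration, a minimal SolrJ sketch of that split - the indexing runs as a batch job, while queries against the finished index come back quickly. The Solr URL, collection, field names and document loading are placeholders:

    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrBatchIndexAndQuery {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build()) {
                // Batch side: index a (potentially huge) collection of documents.
                // This loop is where the minutes-to-hours of work happens.
                for (MyDoc d : loadDocuments()) {           // loadDocuments() is a placeholder
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", d.id);
                    doc.addField("text", d.text);
                    solr.add(doc);
                }
                solr.commit();

                // Interactive side: queries against the finished index return in sub-second time.
                QueryResponse rsp = solr.query(new SolrQuery("text:finance"));
                System.out.println("hits: " + rsp.getResults().getNumFound());
            }
        }

        static class MyDoc { String id; String text; }
        static List<MyDoc> loadDocuments() { return java.util.Collections.emptyList(); }
    }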
All three modes of data processing are likely involved in any organisation grappling with data challenges. Lambda Architecture addresses this challenge effectively by using the same data sources for multiple data processing requirements.
You can check out the Kappa Architecture, where there is no separate batch layer.
Everything is analyzed in the stream layer. You can use Kafka in the right configuration as the master dataset storage and save the computed data in a database as your view.
If you want to recompute, you can start a new stream-processing job, recompute your view from Kafka into your database, and replace your old view.
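A rough sketch of such a recompute job, assuming the full history sits in a Kafka topic called "events" and using a placeholder saveToView() for the write into your serving database:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class RecomputeViewJob {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "recompute-view-v2");   // new group id => fresh offsets
            props.put("auto.offset.reset", "earliest");   // start from the beginning of the log
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // topic name is an assumption
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // Recompute the view row for this event and upsert it into the
                        // database that serves queries; saveToView() is a placeholder.
                        saveToView(record.key(), record.value());
                    }
                }
            }
        }

        private static void saveToView(String key, String value) {
            // e.g. an UPSERT into the view table of your serving database
        }
    }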
It is possible to use only the realtime view as the main storage for ad-hoc queries, but as already mentioned in other answers, if you have a lot of data it is faster to keep batch processing and stream processing separate instead of running batch jobs as stream jobs. It depends on the size of your data.
Also, it is cheaper to have storage like HDFS instead of a database for batch computing.
And the last point: in many cases you have different algorithms for batch and stream processing, so you need to keep them separate. But basically it is possible to use only the "realtime view" as your batch and stream layer, even without using Kafka as the master dataset. It depends on your use case.

Oracle user_constraints, user_tables etc. views for production

Is it OK to use those views in production? I mean, are queries against the data dictionary intended to be called frequently, or is it designed only for occasional use with tools like SQL Navigator, SQL Developer, etc.?
It depends on your definition of "frequently", the size of those objects in your database, and why you need to query them.
In general, it's fine to query data dictionary tables on a regular basis in production-- tons of database monitoring tools, for example, will regularly query a bunch of data dictionary tables to gather performance data. At the same time, though, you can easily configure most of these tools to put a tremendous load on your database by gathering too much data too frequently, so that your performance monitoring tool becomes the source of performance problems. Normally, you can just dial back the amount of data being captured and the frequency at which it is captured to get 99% of the monitoring benefit without creating a bunch of issues.
I'm not sure why any tool would frequently need to query user_tables-- since tables aren't getting created or destroyed at runtime in a proper system, there aren't too many reasons why you'd really need to query that particular view all that frequently.
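For what it's worth, the kind of dictionary query a monitoring job runs is just an ordinary SELECT against views like user_tables; a small JDBC sketch (connection details are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DictionaryCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details.
            try (Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@//localhost:1521/ORCL", "scott", "tiger");
                 Statement stmt = conn.createStatement();
                 // user_tables is an ordinary dictionary view; occasional queries like this are harmless.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT table_name, num_rows, last_analyzed FROM user_tables ORDER BY table_name")) {
                while (rs.next()) {
                    System.out.printf("%s rows=%s analyzed=%s%n",
                        rs.getString(1), rs.getString(2), rs.getString(3));
                }
            }
        }
    }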

Is a NoSQL database a good solution for a small application?

I am developing an internal web application that needs a back end. The data being stored is not really relational. Currently it is in XML documents that the application parses (with XQuery) to display HTML tables and other types of fields.
It is likely that I will have a few more different types of XML documents and CSV (comma-separated values) files coming up. Given the scenario, I could always back the data with a MySQL database, changing the process that generates the XML or CSV to insert straight into the database.
Is a NoSQL database a good choice in this scenario, or is MySQL still better? I do not see any need for clustering/high-availability/distributed-processing scenarios.
Define "better".
I think the choice should be made based on how relational (MySQL) or document-based (NoSQL) your data is.
A good way to know is to analyze typical use cases. Better yet, write two prototypes and measure.

How can I limit memory usage when generating a CSV from a large resultset?

I have a web application in Spring that has a functional requirement for generating a CSV/Excel spreadsheet from a result set coming from a large Oracle database. The expected rows are in the 300,000 - 1,000,000 range. Time to process is not as large of an issue as keeping the application stable -- and right now, very large result sets cause it to run out of memory and crash.
In a normal situation like this, I would use pagination and have the UI display a limited number of results at a time. However, in this case I need to be able to produce the entire set in a single file, no matter how big it might be, for offline use.
I have isolated the issue to the ParameterizedRowMapper being used to convert the result set into objects, which is where I'm stuck.
What techniques might I be able to use to get this operation under control? Is pagination still an option?
A simple answer:
Use a JDBC result set (or something similar, with an appropriate array/fetch size) and write the data back to a LOB, either temporary or back into the database.
Another choice:
Use PL/SQL in the database to write a file for your recordset in CSV format using UTL_FILE. As the file will be on the database server, not on the client, use UTL_SMTP or JavaMail via Java Stored Procedures to mail the file. After all, I'd be surprised if someone was going to watch the hourglass turn over repeatedly waiting for a 1-million-row recordset to be generated.
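A bare-bones sketch of the first suggestion, using a plain JDBC connection and writing to a file rather than a LOB to keep it short; the important part is the fetch size, which bounds how many rows are held in memory at once (connection details, query and table name are placeholders):

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class CsvExport {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details and query.
            try (Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/ORCL", "app", "secret");
                 PreparedStatement ps = conn.prepareStatement("SELECT id, trade_date, amount FROM report_data")) {
                ps.setFetchSize(1000); // fetch 1000 rows per round trip; only this window is held in memory
                try (ResultSet rs = ps.executeQuery();
                     BufferedWriter out = new BufferedWriter(new FileWriter("report.csv"))) {
                    out.write("id,trade_date,amount");
                    out.newLine();
                    while (rs.next()) {
                        out.write(rs.getLong(1) + "," + rs.getDate(2) + "," + rs.getBigDecimal(3));
                        out.newLine();
                    }
                }
            }
        }
    }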
Instead of loading the entire result into memory, you can process each row individually and use an output stream to send the output directly to the web browser. For example, in the Servlet API you can get the output stream from ServletResponse.getOutputStream() and then simply write the resulting CSV lines to that stream.
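In the Spring setup described in the question, the same idea can be expressed with a RowCallbackHandler instead of a ParameterizedRowMapper, so each row is written straight to the response as it is read rather than being collected into a list. Column names, table and wiring below are assumptions:

    import java.io.PrintWriter;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.servlet.http.HttpServletResponse;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.jdbc.core.RowCallbackHandler;

    public class CsvStreamingService {

        private final JdbcTemplate jdbcTemplate;

        public CsvStreamingService(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
            this.jdbcTemplate.setFetchSize(1000); // keep only a window of rows in memory
        }

        public void writeCsv(HttpServletResponse response) throws Exception {
            response.setContentType("text/csv");
            response.setHeader("Content-Disposition", "attachment; filename=\"report.csv\"");
            final PrintWriter out = response.getWriter();
            out.println("id,trade_date,amount");

            // Each row is written straight to the response; nothing is accumulated.
            jdbcTemplate.query("SELECT id, trade_date, amount FROM report_data",
                new RowCallbackHandler() {
                    public void processRow(ResultSet rs) throws SQLException {
                        out.println(rs.getLong("id") + "," + rs.getDate("trade_date") + "," + rs.getBigDecimal("amount"));
                    }
                });
            out.flush();
        }
    }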
I would push back on those requirements - they sound pretty artificial.
What happens if your application fails, or the power goes out, before the user looks at that data?
From your comment above, it sounds like you know the answer - you need filesystem or Oracle access in order to do your job.
You are being asked to generate some data - something that is not repeatable by SQL?
If it were repeatable, you would just send pages of data back to the user at a time.
Since this report, I'm guessing, has something to do with the current state of your data, you need to store that result somewhere if you can't stream it out to the user. I'd write a stored procedure in Oracle - it's much faster not to send data back and forth across the network. If you have special tools or it's just easier, there's nothing wrong with doing it on the Java side instead.
Can you schedule this report to run once a week?
Have you considered the performance of an Excel spreadsheet with 1,000,000 rows?

How to access data in Dynamics CRM?

What is the best way, in terms of speed of the platform and maintainability, to access data (read only) in Dynamics CRM 4? I've done all three, but I'm interested in the opinions of the crowd.
Via the API
Via the webservices directly
Via DB calls to the views
...and why?
My thoughts normally center around DB calls to the views but I know there are purists out there.
Given both requirements I'd say you want to call the views. Properly crafted SQL queries will fly.
Going through the API is required if you plan to modify data, but it isn't the fastest approach around because it doesn't allow deep loading of entities. For instance, if you want to look at customers and their orders, you'll have to load both individually and then join them manually, whereas a SQL query will already have the data joined.
Never mind that the TDS stream is a lot more efficient than the SOAP messages used by the API and web services.
UPDATE
I should point out, in regard to the views and the CRM database in general: CRM does not optimize the indexes on the tables or views for custom entities (how could it?). So if you have a truckload entity that you look up by destination all the time, you'll need to add an index for that property. Depending on your application, it could make a huge difference in performance.
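For example - with purely illustrative names, a custom new_truckload entity whose new_destination attribute lands in its ExtensionBase table - adding the index is just ordinary SQL against the CRM database:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AddTruckloadIndex {
        public static void main(String[] args) throws Exception {
            // Illustrative names only: new_truckloadExtensionBase / new_destination and the
            // Contoso_MSCRM database would be whatever your org and custom entity are actually called.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:sqlserver://crmsql;databaseName=Contoso_MSCRM", "sa", "secret");
                 Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE NONCLUSTERED INDEX IX_new_truckload_destination "
                           + "ON new_truckloadExtensionBase (new_destination)");
            }
        }
    }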
I'll add to Jake's comment by saying that querying against the tables directly instead of the views (*Base and *ExtensionBase) will be even faster.
In order of speed it'd be:
direct table query
view query
filtered view query
api call
Direct table updates:
I disagree with Jake that all updates must go through the API. The correct statement is that going through the API is the only supported way to do updates. There are in fact several instances where directly modifying the tables is the most reasonable option:
One time imports of large volumes of data while the system is not in operation.
Modification of specific fields across large volumes of data.
I agree that this sort of direct modification should only be a last resort when the performance of the API is unacceptable. However, if you want to modify a boolean field on thousands of records, doing a direct SQL update to the table is a great option.
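As a concrete example of that kind of bulk flag update done directly in SQL rather than through the API - the entity, table and column names are made up, and since this bypasses the supported API it should be treated as a last resort, run with the system offline and a fresh backup:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BulkFlagUpdate {
        public static void main(String[] args) throws Exception {
            // Hypothetical custom boolean attribute new_processed stored in the
            // extension table of a custom entity; all names below are placeholders.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:sqlserver://crmsql;databaseName=Contoso_MSCRM", "sa", "secret");
                 PreparedStatement ps = conn.prepareStatement(
                     "UPDATE new_truckloadExtensionBase SET new_processed = 1 WHERE new_processed = 0")) {
                int updated = ps.executeUpdate();
                System.out.println(updated + " rows updated");
            }
        }
    }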
Relative Speed
I agree with XVargas as far as relative speed goes.
Unfiltered Views vs Tables: I have not found the performance advantage to be worth the hassle of manually joining the base and extension tables.
Unfiltered views vs Filtered views: I recently was working with a complicated query which took about 15 minutes to run using the filtered views. After switching to the unfiltered views this query ran in about 10 seconds. Looking at the respective query plans, the raw query had 8 operations while the query against the filtered views had over 80 operations.
Unfiltered Views vs API: I have never compared querying through the API against querying views, but I have compared the cost of writing data through the API vs inserting directly through SQL. Importing millions of records through the API can take several days, while the same operation using insert statements might take several minutes. I assume the difference isn't as great during reads but it is probably still large.
