Pentaho report: how to pass db instance via URL? - reporting

I have a few apps, each one connected to its own database (MSSQL):
=> (1) Pentaho server and several common reports I would like to share between apps (everything identical except the data and the logo).
=> (n) apps & (n) databases.
Example:
app1.mysite.net & db1 instance at db1.mydbs.net,
app2.mysite.net & db2 instance at db2.mydbs.net,
appn.mysite.net & dbn instance at dbn.mydbs.net.
Is there a way for each app to pass its db "instance/name" through the URL in order to retrieve its own data in the report?
The goal is to reduce report maintenance by sharing the reports across multiple apps and servers, and also to reduce cost by avoiding multiple Pentaho instances.
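For illustration, one way this could work, sketched under assumptions: each shared report URL carries a tenant key (e.g. ?db=db2), and a small server-side resolver maps that key to the right JDBC URL from a whitelist before the report is executed. The class, keys, and credentials below are hypothetical, and the actual hand-off into Pentaho (one JNDI entry per tenant, or a report parameter driving the data source) depends on how your reports are wired.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Map;

// Hypothetical helper: maps the "db" key passed in the report URL to a JDBC URL.
// Only whitelisted keys are accepted, so an app cannot point the shared report
// at an arbitrary server just by crafting the URL.
public class TenantConnectionResolver {

    private static final Map<String, String> TENANT_JDBC_URLS = Map.of(
            "db1", "jdbc:sqlserver://db1.mydbs.net;databaseName=app1",
            "db2", "jdbc:sqlserver://db2.mydbs.net;databaseName=app2",
            "dbn", "jdbc:sqlserver://dbn.mydbs.net;databaseName=appn");

    public static Connection openConnection(String tenantKey, String user, String password)
            throws Exception {
        String jdbcUrl = TENANT_JDBC_URLS.get(tenantKey);
        if (jdbcUrl == null) {
            throw new IllegalArgumentException("Unknown tenant key: " + tenantKey);
        }
        // The same shared report definition is then executed against this connection.
        return DriverManager.getConnection(jdbcUrl, user, password);
    }
}
```

Keeping the mapping on the server side (rather than accepting a full connection string in the URL) is what lets one Pentaho instance serve all the apps without exposing the databases.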

Related

Advice on Setup

I started my first data analysis job a few months ago, and I am in charge of a SQL database, taking that data and creating dashboards in Power BI. Our SQL database is replicated from an online web portal we use for data entry. We do not add data to the database ourselves; instead, the data is put into tables based on what is entered into the web portal. Since this database is replicated by another company, I created our own database that is connected via a linked server. I have built many views to pull only the needed data from the initial database (I did this to limit the amount of data sent to Power BI, for performance). My view count is climbing, and I am wondering whether, in terms of performance, this is the best way forward. The highest row count of a view is 32,000 and the lowest is around 1,000 rows.
Some of the views that I am writing end up joining 5-6 tables together, due to the structure built by the data web portal company that controls the database.
My suggestion would be to create a data warehouse schema (star schema), keeping as a principle one star schema per domain: for example, one for sales, one for subscriptions, one for purchases, etc. Use the logic of data marts.
Identify your dimensions and your facts and keep evolving that schema. You will find that you end up with far fewer tables.
Your data are not that big, so you can use whatever ETL strategy you like: truncate-and-load or incremental.
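For illustration only, a minimal truncate-and-load step over JDBC; stg_sales, dim_date, dim_product and fact_sales are hypothetical names for a staging view and the star-schema tables. An incremental load would replace the TRUNCATE with a filter on a watermark column.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Minimal truncate-and-load ETL step: rebuild the fact table from a staging view.
// Table, column, and connection details are illustrative only.
public class TruncateLoadSalesFact {
    public static void main(String[] args) throws Exception {
        String jdbcUrl = "jdbc:sqlserver://myserver;databaseName=dw"; // assumption
        try (Connection conn = DriverManager.getConnection(jdbcUrl, "etl_user", "secret");
             Statement stmt = conn.createStatement()) {
            conn.setAutoCommit(false);
            stmt.executeUpdate("TRUNCATE TABLE fact_sales");
            stmt.executeUpdate(
                "INSERT INTO fact_sales (date_key, product_key, quantity, amount) " +
                "SELECT d.date_key, p.product_key, s.quantity, s.amount " +
                "FROM stg_sales s " +
                "JOIN dim_date d ON d.calendar_date = s.sale_date " +
                "JOIN dim_product p ON p.product_code = s.product_code");
            conn.commit();
        }
    }
}
```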

How to parameterise the data connection in Tableau in AWS (cloudformation or otherwise)?

I have a simple web app UI which stores certain dataset parameters (for simplicity, assume they are all data tables in a single Redshift database, but the schema/table name can vary; the Redshift cluster is in AWS). Tableau is installed on an EC2 instance in the same AWS account.
I am trying to determine an automated way of passing 'parameters' as a data source (i.e. within the connection string inside Tableau on EC2/AWS) rather than manually creating data source connections and inputting the various customer requests.
The flow would be: say 50 users select various parameters in the UI (for simplicity, suppose the parameters are stored as a JSON file in AWS) -> the parameters are sent to Tableau and data sources are created -> the connection is established within Tableau without the customer 'seeing' anything in the back end -> the customer is able to play with the data in Tableau and create tables and charts accordingly.
How could I do this, at least through a batch job or CloudFormation setup? A "hacky" solution is fine.
Bonus: if the above is doable in real-time across multiple users that would be awesome.
** I am open to using other dashboard UI tools which solve this problem e.g. QuickSight **
After installing Tableau on EC2, I have had trouble finding an article or documentation on how to pass parameters into the connection string itself, or even how to parameterise it manually.
An example: customer1 selects "public_schema.dataset_currentdata" and "public_schema.dataset_yesterday", and another customer selects "other_schema.dataset_currentdata", all of which exist in a single database.
Three data sources should be generated (one for each of the above), but only the data sources a customer selected should be open to that customer, i.e. customer2 should only see the connection for other_schema.dataset_currentdata.
One hack I was considering is to spin up a CloudFormation stack with Tableau installed for a customer when they make a request, create the connection accordingly, and delete the stack when they are done. I am mainly unsure how I would get the connection established, though, i.e. how to pass in the parameters. I am also not sure that spinning up 50 EC2 instances is wise. :D
An issue I have seen so far is that creating a manual extract limits the number of rows, so I think I need a live connection per customer request; hence I am trying to get around this.
You can do this with a combination of a basic embed and applying filters. This would load the Tableau workbook. Then you would apply a filter based on whatever values your user selects from the JSON.
The final missing part is that you would use a parameter instead of a filter and pass those values to the database via initial SQL.
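A small sketch of that hand-off, assuming the selections stored in the JSON end up as query-string values on a Tableau Server view URL (view URLs accept filter and parameter values as name=value pairs); the server path, view name and parameter name below are hypothetical, and the workbook's initial SQL would reference the same parameter so the live Redshift connection only pulls the selected table.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Builds an embed/view URL that pre-applies the values a customer picked in the UI.
// The parameter names must match parameters (or fields) defined in the workbook.
public class TableauViewUrlBuilder {

    public static String buildViewUrl(String baseViewUrl, Map<String, String> selections) {
        StringBuilder url = new StringBuilder(baseViewUrl);
        String sep = baseViewUrl.contains("?") ? "&" : "?";
        for (Map.Entry<String, String> e : selections.entrySet()) {
            url.append(sep)
               .append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
               .append("=")
               .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
            sep = "&";
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // Hypothetical workbook/view path and parameter name.
        String url = buildViewUrl(
                "https://tableau.example.com/views/CustomerReport/Overview",
                Map.of("DatasetTable", "other_schema.dataset_currentdata"));
        System.out.println(url);
    }
}
```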

Parallel processing of records from database table

I have a relational table that is being populated by an application. There is a column named o_number which can be used to group the records.
I have another application that basically runs a Spring Scheduler. This application is deployed on multiple servers. I want to understand whether there is a way to make sure that each of the scheduler instances processes a unique group of records in parallel: if a set of records is being processed by one server, it should not be picked up by another one. Also, in order to scale, we would want to increase the number of instances of the scheduler application.
Thanks
Anup
This is a general question, so here's my general 2 cents on the matter.
You create a new layer that manages the requests originating from your application instances to the database. So you would probably build a new code/project running on the same server as the database (or some other server). The application instances will talk to that managing layer instead of the database directly.
The manager will keep track of which records have been requested, and hence will hand out only records that are yet to be processed upon each new request.
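A bare-bones sketch of such a managing layer, under the assumption that the table (called records here) has the o_number column and a processed flag; the dispatcher remembers which groups are already in flight, so two scheduler instances asking at the same time get different groups. Names are illustrative.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Central dispatcher the scheduler instances call instead of querying the table directly.
// It remembers which o_number groups are already in flight and only hands out new ones.
public class GroupDispatcher {

    private final Set<String> inFlightGroups = new HashSet<>();
    private final Connection connection;

    public GroupDispatcher(Connection connection) {
        this.connection = connection;
    }

    /** Returns the next unprocessed o_number that no other instance is working on. */
    public synchronized Optional<String> claimNextGroup() throws Exception {
        String sql = "SELECT DISTINCT o_number FROM records WHERE processed = 0";
        try (PreparedStatement ps = connection.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                String group = rs.getString("o_number");
                if (inFlightGroups.add(group)) {   // add() returns false if already claimed
                    return Optional.of(group);
                }
            }
        }
        return Optional.empty();
    }

    /** Called by an instance once it has finished (and flagged) a group. */
    public synchronized void release(String group) {
        inFlightGroups.remove(group);
    }
}
```

If an extra layer is not wanted, the claiming can also be pushed into the database itself (for example with SELECT ... FOR UPDATE SKIP LOCKED, where the database supports it), but that is a different approach from the manager described above.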

Accessing a database from multiple instances of the same microservice

In my project, I have a microservice [say A] and it has a SQL database. We have a 5-node cluster and this microservice runs on each of the nodes, so we have 5 instances of service A running on the cluster. Now, suppose there is a select query in a particular function of the microservice that retrieves data from the database. Since 5 instances are running, all 5 instances will run the same query and will work on the same data. Is there any way in which we can divide the data among the 5 instances of service A?
Application clustering is different from database clustering. You cannot "divide" data among the 5 instances of the application service, since all application instances require a similar set of data to function (unless your application is designed to work on a subset of the data, i.e. each application instance serves a specific list of countries, in which case you might be able to break the data up by country).
You can look into clustering at the database level for ideas on how you can cluster at the SQL level: https://www.brentozar.com/archive/2012/02/introduction-sql-server-clusters/
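If the application really is designed around that subset idea, a minimal sketch could look like the following: each instance of service A is started with its own list of countries (from configuration) and only ever queries those rows. Table, column and class names are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Collections;
import java.util.List;

// Each instance of service A is configured with the countries it is responsible for
// (e.g. via an environment variable), so the five instances work on disjoint slices.
public class CountryScopedRepository {

    private final Connection connection;
    private final List<String> assignedCountries; // e.g. ["DE", "FR"] for this instance

    public CountryScopedRepository(Connection connection, List<String> assignedCountries) {
        this.connection = connection;
        this.assignedCountries = assignedCountries;
    }

    public void processAssignedRows() throws Exception {
        String placeholders = String.join(",",
                Collections.nCopies(assignedCountries.size(), "?"));
        String sql = "SELECT id, payload FROM orders WHERE country IN (" + placeholders + ")";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            for (int i = 0; i < assignedCountries.size(); i++) {
                ps.setString(i + 1, assignedCountries.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // handle only the rows that belong to this instance's countries
                }
            }
        }
    }
}
```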

Is Redis a good choice for performing large-scale calculations?

Currently I am pulling a large amount of data from an Oracle database and then performing calculations on the web side to generate HTML reports. I am using the Groovy & Grails framework for report generation.
Now the problem is, we have very heavy calculations and it takes a lot of time to generate the reports on the web side.
I am planning to re-architect my reports so that they are generated very quickly.
I don't have any control over the Oracle database, as it's a third-party production database.
I don't want any replication of the database, because it has millions of records: I can't schedule it, and replication would slow down production.
I finally came up with a caching architecture, which would act as a kind of calculation engine.
Can anyone help me by suggesting the best solution?
Thanks
What is the structure of your data? Do you want to query it, so that SQL can help you, or is it binary/document data?
Do you need persistence (durability) or not?
Redis is fast. But if you have a single-threaded app using MS SQL and its bulk importer, that is incredibly fast too.
Redis is a key/value store, so you need to perform a single SET for every column within your domain object, which can make it slower than an RDBMS that uses one INSERT with all the columns.
Or, if your results are in the form of JSON objects, Mongo can be very useful.
It just depends on your data and the purpose of persistence.
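If the caching/calculation-engine idea is the direction you take, here is a minimal cache-aside sketch with the Jedis client (callable from Grails on the JVM): compute the report once, keep the rendered result in Redis for a while, and serve repeat requests from the cache. The key naming, the one-hour TTL and computeReport() are assumptions.

```java
import redis.clients.jedis.Jedis;

// Cache-aside pattern: the expensive report calculation is done once and the result
// is kept in Redis for an hour, so repeated web requests do not hit Oracle again.
public class ReportCache {

    private static final int TTL_SECONDS = 3600; // assumption: hourly refresh is acceptable

    private final Jedis jedis = new Jedis("localhost", 6379);

    public String getReport(String reportId) {
        String cacheKey = "report:" + reportId;
        String cached = jedis.get(cacheKey);
        if (cached != null) {
            return cached;                            // served from Redis, no Oracle query
        }
        String rendered = computeReport(reportId);    // the slow Oracle + Groovy part
        jedis.setex(cacheKey, TTL_SECONDS, rendered); // store with an expiry
        return rendered;
    }

    private String computeReport(String reportId) {
        // placeholder for the existing heavy calculation against Oracle
        return "<html>...</html>";
    }
}
```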
