Data integration from Magento to QuickBooks - ETL

I'm new to Talend and learning through videos and documentation, so I'm just not sure how to approach/implement this with best practices.
Goal
Integrate Magento and QuickBooks using Talend.
My thoughts
My first thought was to set up a direct DB connection to Magento, pull the relevant data I need, process it, and send it to QuickBooks using the REST APIs (specifically the bulk APIs, in batches).
But then I thought it would be a little hectic for me to query the Magento database (multiple joins), so another option is to use Magento's REST API.
As I'm not very familiar with the tool, I'm struggling a little to find the most suitable approach, so any help is appreciated.
What I've done so far
I've saved my auth (QuickBooks) and DB (Magento) credentials in a file, and using tFileInputDelimited and tContextLoad I'm loading them into context variables so they're accessible globally.
I've successfully configured the database connection and DB input, but I haven't used metadata for the connection (should I, and if yes, how can I pass dynamic values there?). I've used my context variables in the DB connection settings.
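For reference, the file I feed into tFileInputDelimited and tContextLoad is just a two-column key/value list, roughly like this (the variable names and the semicolon delimiter are only examples and have to match the schema/field separator configured on the component):

qb_client_id;my-quickbooks-client-id
qb_client_secret;my-quickbooks-client-secret
magento_db_host;localhost
magento_db_user;magento
magento_db_password;secret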
I've taken the relevant fields for now, but if I want more fields a simple query isn't enough, since Magento stores data for Customer etc. across multiple tables. It's not a big deal, I know, but I think it will increase my work.
That's what I've built for now. My next step is to send the data to QuickBooks using REST, fetching an access_token and saving it to a context variable, and then storing the QuickBooks reference back into the Magento DB.
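For reference, the access_token step is just a standard OAuth 2.0 refresh-token exchange, which is the call I'd configure in tRESTClient; a rough Python sketch of the same request, where the token URL, client credentials and refresh token are placeholders I'd load from context:

# Minimal sketch of an OAuth 2.0 refresh-token exchange.
# TOKEN_URL, CLIENT_ID, CLIENT_SECRET and REFRESH_TOKEN are placeholders;
# use the values from your own QuickBooks app / context variables.
import requests

TOKEN_URL = "https://example.com/oauth2/token"   # placeholder token endpoint
CLIENT_ID = "your-client-id"
CLIENT_SECRET = "your-client-secret"
REFRESH_TOKEN = "your-refresh-token"

resp = requests.post(
    TOKEN_URL,
    auth=(CLIENT_ID, CLIENT_SECRET),   # HTTP Basic auth with the client credentials
    data={"grant_type": "refresh_token", "refresh_token": REFRESH_TOKEN},
)
resp.raise_for_status()
access_token = resp.json()["access_token"]   # this is what I'd store in a context variable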
I've also decided to use the QuickBooks bulk APIs, but I'm not sure how to process data in chunks in Talend (I checked multiple resources with no luck). For example, if Magento returns 500 rows, I want to process them in chunks of 30, since the QuickBooks batch limit is 30 per request; I'd send each chunk to QuickBooks using REST and, as I said, store the QuickBooks reference ID back in Magento so I can update it later.
Also, this will all run locally; how can I do the same in production? How can I maintain separate development and production environments?
Resources I'm referring to
For REST and Auth best practices - https://community.talend.com/t5/How-Tos-and-Best-Practices/Using-OAuth-2-0-with-Talend-to-Access-Goo...

Nice example for batch processing here:
https://community.talend.com/t5/Design-and-Development/Batch-processing-in-talend-job/td-p/51952
Redirect your input to a tFileOutputDelimited.
Enter the output filename, tick the option "Split output in several files" from the "Advanced settings" and enter the value of 1000 into the field "Rows in each output file". This will create n files based on the filename with 1000 in each.
On the next subjob, use a tFileList to iterate over this file list to get records from each file.
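Outside of Talend, the underlying batching logic is straightforward; here is a rough Python sketch of cutting the rows into chunks of 30 and posting each chunk, where the batch URL and payload shape are placeholders rather than the real QuickBooks API (inside Talend, the file-splitting approach above, or a tLoop-driven subjob, achieves the same effect):

# Rough sketch: send rows to a bulk endpoint in chunks of 30.
# BATCH_URL and the payload structure are placeholders, not QuickBooks' actual API.
import requests

BATCH_URL = "https://example.com/v3/company/123/batch"   # placeholder
CHUNK_SIZE = 30                                          # QuickBooks batch limit per the question

def chunks(rows, size):
    """Yield successive slices of `rows` with at most `size` items each."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def send_in_batches(rows, access_token):
    headers = {"Authorization": "Bearer " + access_token, "Content-Type": "application/json"}
    for chunk in chunks(rows, CHUNK_SIZE):
        resp = requests.post(BATCH_URL, json={"items": chunk}, headers=headers)
        resp.raise_for_status()
        # collect the returned QuickBooks reference IDs here so they can be written back to Magento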

Related

How to parameterise the data connection in Tableau in AWS (cloudformation or otherwise)?

I have a simple web app UI which stores certain dataset parameters (for simplicity, assume they are all data tables in a single Redshift database, but the schema/table name can vary; the Redshift cluster is in AWS). Tableau is installed on an EC2 instance in the same AWS account.
I am trying to determine an automated way of passing 'parameters' as a data source (i.e. within the connection string inside Tableau on EC2/AWS) rather than manually creating data source connections and inputting the various customer requests.
The flow would be: say 50 users select various parameters in the UI (for simplicity, suppose the parameters are stored as a JSON file in AWS) -> the parameters are sent to Tableau and data sources are created -> the connection is established within Tableau without the customer 'seeing' anything in the back end -> the customer is able to play with the data in Tableau and create tables and charts accordingly.
How may I do this at least through a batch job or cloud formation setup? A "hacky" solution is fine.
Bonus: if the above is doable in real-time across multiple users that would be awesome.
** I am open to using other dashboard UI tools which solve this problem e.g. QuickSight **
After installing Tableau on EC2, I am having trouble finding an article/documentation on how to pass parameters into the connection string itself, or even how to parameterise it manually.
An example could be: customer1 selects "public_schema.dataset_currentdata" and "public_schema.dataset_yesterday", and customer2 selects "other_schema.dataset_currentdata", all of which exist in a single database.
3 data sources should be generated (one for each above) but only the data sources selected should be open to the customer that selected it i.e. customer2 should only see the connection for other_schema.dataset_currentdata.
One hack I was thinking of is to spin up a CloudFormation stack with Tableau installed for a customer when they make a request, create the connection accordingly, and delete the stack when they are done. I am mainly unsure how I would get the connection established, i.e. how to pass in the parameters. I am also not sure spinning up 50 EC2 instances is wise. :D
An issue I have seen so far is that creating a manual extract limits the number of rows, so I think I need a live connection per customer request; hence I am trying to get around this issue.
You can do this with a combination of a basic embed and applying filters. This would load the Tableau workbook. Then you would apply a filter based on whatever values your user selects from the JSON.
The final missing piece is that you would use a parameter instead of a filter and pass those values to the database via initial SQL.
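One lightweight way to pass those values, assuming the workbook is published as a view on Tableau Server, is to put the filter/parameter values into the view URL's query string; a small Python sketch that builds such a URL (the server, workbook, sheet and field names are hypothetical):

# Build a Tableau view URL with filter/parameter values in the query string.
# The server, workbook/sheet names and the field name are hypothetical examples.
from urllib.parse import urlencode

def view_url(base, workbook, sheet, params):
    return f"{base}/views/{workbook}/{sheet}?{urlencode(params)}"

# e.g. the dataset a given customer picked in the UI
print(view_url("https://tableau.example.com", "CustomerData", "Overview",
               {"DatasetName": "public_schema.dataset_currentdata"}))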

Apache NiFi - Federated Search

My team's been thrown into the deep end: we've been asked to build a federated search of customers over a variety of large datasets, which hold varying amounts of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache NiFi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool, and then push the result into a database, which is then queried and fed into an Elasticsearch instance for the application's use.
So roughly speaking something like this:-
For example's sake, the following data then exists in the result database from the first flow:-

Then I'd run https://github.com/dedupeio/dedupe over this database table, which would add cluster IDs to aid the record linkage, e.g.:-

The second flow would then query the result database and feed the result into the Elasticsearch instance, for use by the application's API, which would use the cluster ID to link the duplicates.
A couple of questions:-
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven't considered any CDC process here: the databases will be getting constantly updated, which I'd need to handle, so I'm really interested if anybody has solved a similar problem or used a different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing writing a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.
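For the dedupe step itself, here is a rough sketch of what the script might do, whether it's run via ExecuteScript or as a standalone process via ExecuteProcess (dedupe has native dependencies, so the Jython-based ExecuteScript may not be able to load it). The table and column names are assumptions, and the exact dedupe API differs a little between library versions:

# Rough sketch: pull the merged rows, cluster them with dedupe, write cluster IDs back.
# Table/column names and the 0.5 threshold are assumptions; check the dedupe docs for
# the exact API of the version you install.
import sqlite3
import dedupe

conn = sqlite3.connect("results.db")
rows = conn.execute("SELECT id, name, address, dob FROM merged_customers")
data = {r[0]: {"name": r[1], "address": r[2], "dob": r[3]} for r in rows}

# Declare which fields dedupe should compare.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
    {"field": "dob", "type": "Exact"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)   # interactive labelling; in a real flow, load saved training data instead
deduper.train()

# Assign a cluster id to each record so Elasticsearch can link the duplicates.
for cluster_id, (record_ids, scores) in enumerate(deduper.partition(data, 0.5)):
    for record_id in record_ids:
        conn.execute("UPDATE merged_customers SET cluster_id = ? WHERE id = ?",
                     (cluster_id, record_id))
conn.commit()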

Laravel pagination with DataTables

I am using the DataTables plugin in Laravel. I have about 3000 records in one of my tables.
But when I load the page it loads all 3000 records in the browser and then creates the pagination, which slows down the page load.
How do I fix this, or what is the correct way to do it?
Use server-side processing.
Get help from some Laravel Packages. Such as Yajra's: https://yajrabox.com/docs/laravel-datatables/
Generally you can solve pagination either on the front end, the back end (server or database side), or a combination of both.
Server-side processing, without a package, would mean setting up TOP/FETCH (or LIMIT/OFFSET) in the query so that only the requested rows are returned from your server.

You could also load a small amount (say 20) and then when the user scrolls to the bottom of the list, load another 20 or so. I mention the inclusion of front end processing as well because I’m not sure what your use cases are, but I imagine it’s pretty rare any given user actually needs to see 3000 rows at a time.

Given that DataTables seems to have built-in functionality for paginating data, I think that #tersakyan is essentially correct: what you want is some form of back-end filtering or paginating of rows of data to limit what's being sent to the front end.

I don't know if that package works for you or what your setup looks like, but pagination can also be achieved directly from the database via SQL (using TOP/FETCH, for example) or could be implemented in a controller or service by tracking pages of data and "loading a page at a time", both from the server and then into the table. All you would need is a unique key to associate each "set of pages" with a specific request.
But for performance, you want to avoid both large data requests and operations on large sets of data. So the more you limit how much data is being grabbed or processed at any stage of your application using it, the more performant your application will be in principle.
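To make the server-side contract concrete: with server-side processing enabled, DataTables sends draw/start/length (plus search and ordering) parameters and expects JSON containing draw, recordsTotal, recordsFiltered and data. Here is a sketch in Python with SQLite purely to illustrate the shape (in Laravel the same logic would live in a controller, or Yajra handles it for you); the records table and its columns are made up:

# Illustration of the DataTables server-side contract (not Laravel/PHP).
# Given the plugin's draw/start/length request values, return one page of rows
# plus the counts DataTables needs to render its paginator.
import sqlite3

def datatables_page(conn, draw, start, length):
    total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    rows = conn.execute(
        "SELECT id, name, created_at FROM records ORDER BY id LIMIT ? OFFSET ?",
        (length, start),
    ).fetchall()
    return {
        "draw": draw,                # echoed back so DataTables can match the response
        "recordsTotal": total,       # row count before filtering
        "recordsFiltered": total,    # row count after filtering (no search applied here)
        "data": [list(r) for r in rows],
    }

# usage: conn = sqlite3.connect("app.db"); datatables_page(conn, draw=1, start=0, length=20)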




Realistic Data Backup method for Parse.com

We are building an iOS app with Parse.com, but still can't figure out the right way to backup data efficiently.
As a premise, we have and will have a LOT of data store rows.
Say we have a class with 1million rows, assume we have it backed up, then want to bring it back to Parse, after a hazardous situation (like data loss on production).
The few solutions we have considered are the following:
1) Use external server for backup
BackUp:
- use the REST API to constantly back up data to a remote MySQL server (we chose MySQL for customized analytics purposes, since it's way faster and easier for us to handle data with MySQL)
ImportBack:
a) - recreate JSON objects from the MySQL backup and use the REST API to send them back to Parse (roughly sketched below).
Say we use the batch operation, which permits 50 objects to be created with one request, and assume each request takes 1 second; 1 million records will take about 5.5 hours to transfer to Parse.
b) - recreate one JSON file from MySQL backup and use the Dashboard to import data manually.
We just tried this method with a 700,000-record file: it took about 2 hours for the loading indicator to stop and show the number of rows in the left pane, but the right pane never opens (it says "operation timed out") and it's been over 6 hours since the upload started.
So we can't rely on 1.b, and 1.a seems to take too long to recover from a disaster (if we have 10 million records, it'll be like 55 hours = 2.2 days).
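Roughly, the 1.a import would look like the Python below; the endpoint path and header names follow the old parse.com REST docs as far as I recall, so treat them as assumptions and adjust for a self-hosted Parse Server:

# Rough sketch of re-importing backed-up objects via the Parse REST batch API,
# 50 objects per request. BASE_URL and the header names are assumptions based on
# the old parse.com REST docs; for Parse Server the base URL/path will differ.
import requests

BASE_URL = "https://api.parse.com/1"
HEADERS = {
    "X-Parse-Application-Id": "your-app-id",
    "X-Parse-REST-API-Key": "your-rest-key",
    "Content-Type": "application/json",
}

def restore(objects, class_name, batch_size=50):
    for i in range(0, len(objects), batch_size):
        batch = [
            {"method": "POST", "path": f"/1/classes/{class_name}", "body": obj}
            for obj in objects[i:i + batch_size]
        ]
        resp = requests.post(f"{BASE_URL}/batch", json={"requests": batch}, headers=HEADERS)
        resp.raise_for_status()   # each item's result (including the new objectId) is in resp.json()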
Now we are thinking about the following:
2) Constantly replicate data to another app
Create the following in Parse:
- Production App: A
- Replication App: B
So while A is in production, every single query will be duplicated to B (using a background job that runs constantly).
The downside, of course, is that it will eat up the burst limit of A, as it simply doubles the number of queries. So it's not ideal when thinking about scaling up.
What we want is something like AWS RDS which gives an option to automatically backup daily.
I wonder how this could be difficult for Parse since it's based on AWS infra.
Please let me know if you have any ideas on this; we'll be happy to share know-how.
P.S.:
We’ve noticed an important flaw in the above 2) idea.
If we replicate using the REST API, all the objectIds of all classes will change, so every 1-to-1 or 1-to-many relation will be broken.
So we're thinking about adding a uuid to every object class.
Is there any problem with this method?
One thing we want to achieve is
query.include(“ObjectName”)
( or in Obj-C “includeKey”),
but I suppose that won’t be possible if we don’t base our app logic on objectId.
Looking for a workaround for this issue;
but will uuid-based management work with Parse's datastore logic?
Parse has never lost production data. While we don't currently offer automated backups, you can request one any time you like, and we're working on making all of this even nicer. Additionally, it's easier in most cases to import the JSON export file through the data browser rather than using the REST batch.
I can confirm that today, Parse did lose my data. Or at least it appeared to.
After several errors were detected on multiple apps (acknowledged by the Parse Status Twitter account), we could not retrieve data for an app, without any error being shown.
It was because an entire column of one of our classes (pointer type) disappeared and the data was no longer present in the dashboard.
We are using this pointer column to filter / retrieve data, so the returned queries and collections were empty.
So we decided to recreate the column manually. By chance, recreating the column, with the same name and type, solved the issue and the data was still there... I can't explain it but I really thought, and the app reacted as if, data were lost.
So an automated backup and restore option is mandatory; it is not optional.
In December 2015, parse.com released a new dashboard with an improved export feature.
Just select your app, click "App Settings" -> "General" -> "Export app data". Parse generates a JSON file for every class in your app and sends you an email when the export is done.
UPDATE:
Sad but true, parse.com is winding down: http://blog.parse.com/announcements/moving-on/
I had the same issue of backing up Parse Server data. Since Parse Server uses MongoDB, backing up the data is not an issue; I just did a simple thing: downloaded the MongoDB dump from the server and then restored it using
mongorestore /path-to-mongodump (extracted files)
Since Parse has been open-sourced, we can adopt this technique.
For accidental deletes, writing a 'beforeDelete' cloud function to back up the current row to another class would work.
For regular backups, a manual export of changed records (use a filter) will be useful. For recovery this requires you to write scripts / use the import option (not so sure) in the data browser. You could also write a cloud function to replicate data to your backup server (haven't tried this yet).
However there are some limitations to cloud code that you should consider before venturing into it:
https://parse.com/docs/cloud_code_guide#functions-resource

Rhomobile inserting into local database using CSV or XML from external web server

I am currently developing a Rhomobile application. I have a backend database which holds customer information. I get a CSV string (or XML; I am able to parse the XML using REXML) from the web server, which contains all the customers. Each time I sync the device I am going to reset the customer table on the device and re-insert all the data from the backend database. I am not using RhoSync, and the device will be using property bag.
Is it possible to use the CSV or XML data to insert into the customers table? If so, how would I go about it?
At the moment the only option I can see that would work is to loop through the CSV/XML and insert each record into the database manually; this isn't very elegant.
Any help will be much appreciated, sorry if this is a dumb question; still relatively new to this framework.
I have come to the conclusion that the only way is to loop through the CSV/XML; with the help of a database transaction this doesn't take long (a sketch of the pattern is below).
Using a fixed schema also increases performance a lot, as property bag has to do per-column inserts (so if you have lots of columns, there are lots of inserts per record).
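The pattern itself is simple: open one transaction, insert every parsed row, commit once at the end. A generic sketch using Python's csv and sqlite3 purely as an illustration (in a Rhodes app this would be Ruby against the Rhom API; the file and column names here are made up):

# Generic illustration of "loop through the CSV inside one transaction":
# a single BEGIN/COMMIT around all the inserts instead of a commit per row.
import csv
import sqlite3

conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS customers (id TEXT, name TEXT, email TEXT)")

with open("customers.csv", newline="") as f, conn:   # `with conn:` wraps everything in one transaction
    conn.execute("DELETE FROM customers")            # reset the table before re-inserting, as in the question
    for row in csv.reader(f):
        conn.execute("INSERT INTO customers (id, name, email) VALUES (?, ?, ?)", row[:3])
# leaving the `with` block commits all the inserts at once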
Also in Rhomobile garbage collection is turned off, so if you are trying to process large data sets your device will quickly run out of memory:
GC.enable
The above solves this issue
