My company is converting a very large VB6 application to WPF. Over the years, customers have saved all kinds of documents using the OLEContainer object in VB6. So that customers do not lose their documents, I need to pull all of those original documents back out and print them to PDF, and I need an automated way to do this. I have been searching for days and have gotten nowhere. Any thoughts?
Our team uses Spotfire to host online analyses and also prepare monthly reports. One pain point that we have is around validation. The reports are all prepared reports, and the process for creating them each month is as simple as 1) refresh the data (through Infolink connected to Oracle) and 2) Press button to export each report. The format of the final product is a PDF.
The issue is that there are a lot of small things that can go wrong with the reports (filter accidentally applied, wrong month selected, data didn't refresh, new department not grouped correctly, etc.) meaning that someone on our team has to manually validate each of the reports. We create almost 20 reports each month and some of them are as many as 100 pages.
We've done a great job automating the creation of the reports, but now we have this weird imbalance where it takes like 25 minutes to create all the reports but 4+ hours to validate each one.
Does anyone know of a good way to automate, or even cut down, the time we have to spend each month validating the reports? I did a brief Google search, and all I could find was about validating reports to meet government regulatory standards.
It depends on 2 factors:
Do your reports have the same template (format) each time you extract them? You said that you pull them out automatically so I guess the answer is Yes.
What exactly are you trying to check/validate? You need a clear list of what you are validating. You mentioned month, grouping, and data values (for the refresh). The clearer the picture you have of the validation, the more likely it is that the process can be fully automated.
There are so-called RPA (robotic process automation) tools that can automate complex workflows.
A "data extract" task, which is part of a workflow, can detect and collect data from documents (PDF for example).
A robot that runs on the validating machine can:
batch read all your PDF reports from specified locations on your computer (or on another computer);
based on predefined templates it can read through the documents for specific fields that you specify (through defined anchors on the templates) and collect the exact data from there;
compare the extracted data with the baseline that you set (compare the month to be correct, compare a data field to confirm proper refresh of the data, another data field to confirm grouping, etc.);
It takes a bit of time to dissect the PDF for each report template and set the anchors correctly, but after that it runs seamlessly each time.
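The anchor-and-baseline idea above can be sketched in a few lines of Python. This is only an illustration under assumptions, not how any particular RPA tool works: the page text is assumed to have already been extracted from the PDF by some other tool, and the `Check` class, the `validate_report` function, the field labels and the sample values are all made up for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Check:
    name: str      # label used in the validation result
    anchor: str    # regex with one capture group marking where the value sits
    expected: str  # baseline value the report must contain


def validate_report(text: str, checks: list[Check]) -> dict[str, str]:
    """Return failed check names mapped to a short reason; empty means pass."""
    failures = {}
    for check in checks:
        match = re.search(check.anchor, text)
        if match is None:
            failures[check.name] = "anchor not found"
            continue
        found = match.group(1).strip()
        if found != check.expected:
            failures[check.name] = f"expected {check.expected!r}, got {found!r}"
    return failures


# Hypothetical page text, standing in for the output of a PDF text extractor.
sample_page = """Monthly Staffing Report
Reporting month: March 2024
Data refreshed: 2024-04-01
Department group: Cardiology -> Medicine
"""

# Hypothetical baseline: one check per thing that tends to go wrong.
checks = [
    Check("month", r"Reporting month:\s*(.+)", "March 2024"),
    Check("refresh", r"Data refreshed:\s*(\d{4}-\d{2}-\d{2})", "2024-04-01"),
    Check("grouping", r"Department group: Cardiology -> (\w+)", "Surgery"),
]

failures = validate_report(sample_page, checks)
print(failures)
```

With the sample above, only the grouping check fails, so a person would need to look at that one report rather than all of them.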
One such tool I used is called Atomatik. It has a studio environment where you design the robot (or robots) and run the process.
We've been using Elasticsearch to deliver the 700,000 or so pieces of content to the readers of our site for a couple of years but some circumstances have changed and we need to work out whether or not the service can adapt with us... (sorry this post is so long, I tried to anticipate all questions!)
We use Elasticsearch to store "snapshots" of our content to avoid duplicating work and slowing down our apps by making them fetch data and resolve all resources from our content APIs. We also take advantage of Elasticsearch's search API to retrieve the content in all sorts of ways.
To maintain content in our cluster we run a service that receives notifications of content changes from our APIs which triggers a content "ingest" (fetching the data, doing any necessary transformation and indexing it). The same service also periodically "reingests" content over time. Typically a new piece of content will be ingested in <30 seconds of publishing and touched every 5 days or so thereafter.
The most common method our applications use to retrieve content is by "tag". We have list pages to view content by tag and our users can subscribe to content updates for a tag. Every piece of content has one or more tags.
Tags have several properties: ID, name, taxonomy, and their relationship to the content. They're indexed as nested objects so that we can aggregate on them etc.
This is where it gets interesting... tags used to be immutable but we have recently changed metadata systems and they may now change - names will be updated, IDs may flux as they move taxonomy etc.
We have around 65,000 tags in use, the vast majority of which are used only in relatively small numbers. If and when these tags change we can trigger a reingest of all the associated content without requiring any changes to our infrastructure.
However, we also have some tags which are very common, the most popular of which is used more than 180,000 times. And we've just received warning that it, and a few others with tens of thousands of documents, are due to change! So we need to be able to cope with these updates now and into the future.
Triggering a reingest of all the associated content and queuing it up is not the problem, but this could take quite some time, at least 3-5 hours in some cases, and we would like to try and avoid our list pages becoming orphaned or duplicated while this occurs.
If you've got this far, thank you! I have two questions:
Is there a more optimal mapping we could use for our documents knowing now that nested objects - often duplicated thousands of times - may change? Could a parent/child mapping work with so many relations?
Is there an efficient way to update a large number of nested objects? Hacks are fine, at least to cover us in the short term. Could the update by query API and a script handle it?
Thanks
I've already answered a similar question to your use case of Nested datatype.
Here is the link to the answer of maintaining Parent-Child relation data into ES using Nested datatype.
Try this. Do let me know if this solution helps in solving your problem.
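On the update by query idea raised in the question, a request body could look roughly like the sketch below. It assumes the nested field is called `tags` with `id` and `name` sub-fields; those names, the `build_tag_update` helper and the sample IDs are guesses for illustration, since the actual mapping is not shown.

```python
import json


def build_tag_update(old_id: str, new_id: str, new_name: str) -> dict:
    """Build an _update_by_query body that rewrites one tag wherever it appears."""
    return {
        # Select only documents that carry the tag (assumed nested path "tags").
        "query": {
            "nested": {
                "path": "tags",
                "query": {"term": {"tags.id": old_id}},
            }
        },
        # Painless script: walk the tags array and patch the matching entry.
        "script": {
            "lang": "painless",
            "source": (
                "for (def tag : ctx._source.tags) {"
                " if (tag.id == params.old_id) {"
                " tag.id = params.new_id; tag.name = params.new_name;"
                " } }"
            ),
            "params": {"old_id": old_id, "new_id": new_id, "new_name": new_name},
        },
    }


# Hypothetical tag IDs and name.
body = build_tag_update("tag-123", "tag-456", "Renamed tag")
print(json.dumps(body, indent=2))
```

The body would be sent as `POST /<index>/_update_by_query`. Because the regular reingest service may touch the same documents while this runs, adding `conflicts=proceed` to the request is worth considering. Each matching document is still reindexed in full, so a tag used 180,000 times will take a while either way.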
I have one thousand Google Form Responses spreadsheets. These are students' answer sheets. I built a spreadsheet that pulls data (timestamps and scores) for each student using Google Spreadsheet formulas (INDEX MATCH and IMPORTDATA). Each student has a separate page. But it takes too much time and sometimes causes some source sheets to become unresponsive (I think because of heavy formula usage). My questions:
Is it possible to do the same thing (pulling data if matches student's name from one thousand spreadsheets) by using Google Script?
If possible, which ones (Google Spreadsheets with formulas or Google Script) performance is better?
Based on your answers, I will decide whether or not to start learning Google Script.
Thanks in advance.
Is it possible to do the same thing (pulling data if matches student's name from one thousand spreadsheets) by using Google Script?
Yes, it's possible.
NOTE: Bear in mind that Google Sheets has a 5 million cell limit, so if your data exceeds this limit, you should consider using another data repository.
If possible, which ones (Google Spreadsheets with formulas or Google Script) performance is better?
Since most Google Sheets formulas are recalculated every time a change is made in the spreadsheet that holds them, it's very likely that Google Apps Script will perform better when using Google Sheets/Google Apps Script as a database management system, because we have more control over when the database transactions are made.
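To make the "control over when the work happens" point concrete, here is a language-neutral sketch of the logic, written in Python rather than Apps Script. The sheet names, rows and the `index_by_student` helper are invented for the example. In Apps Script, the per-sheet rows would typically come from one bulk `getValues()` call per spreadsheet instead of a formula per cell.

```python
from collections import defaultdict

# Hypothetical response sheets, each a list of (timestamp, student, score) rows.
response_sheets = {
    "quiz-1": [("2024-03-01 09:00", "Ada", 8), ("2024-03-01 09:05", "Linus", 6)],
    "quiz-2": [("2024-03-08 09:02", "Ada", 9)],
}


def index_by_student(sheets):
    """Scan every sheet once and group (sheet, timestamp, score) by student name."""
    index = defaultdict(list)
    for sheet_name, rows in sheets.items():
        for timestamp, student, score in rows:
            index[student].append((sheet_name, timestamp, score))
    return index


# One pass over all sheets; afterwards each student's page is a dictionary lookup.
index = index_by_student(response_sheets)
print(index["Ada"])
```

The source sheets are read once per run, when you choose to run it, rather than on every edit as with INDEX MATCH formulas.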
Related
Measurement of execution time of built-in functions for Spreadsheet
Why do we use SpreadsheetApp.flush();?
Both do the same thing. Both will be as intensive on your computer. My advice would be to upgrade your PC!
This is a description of the application I want to build and I'm not sure whether to use Core Data or Sqlite (or something else?):
Single user, desktop, not networked, only one frontend is accessing datastorage
User occasionally enters some data, no bulk data importing or large data inserts
Simple datamodel: entity with up to 20-30 attributes
User searches in data (about 50k datasets max.)
Search takes place mostly in attribute values, not looking for any keys here, but searching for text in values
Writing the data is not something I see as critical; it happens infrequently and with small amounts of data. The text search in the attributes has to be blazingly fast, since a user would expect almost instant results. This is absolutely critical.
I would rather go with Core Data, but is this a scenario CD can handle?
Thanks
-Fish
Core Data can handle this scenario. But because you're looking for blazingly fast full text search, you'll have to do some extra work. Session 211 of WWDC 2013 goes into depth about how to do this (slides 117-131). You'll probably want to have a separate Entity with text search tokens: all of the findable words in your dataset.
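The token idea is essentially an inverted index: each distinct word points at the records that contain it, and a search becomes a lookup on the (sorted) tokens instead of a scan over every attribute value. The Python sketch below shows only that data structure. The `TokenIndex` class and the sample records are made up, and it is not Core Data code; in Core Data the rough equivalent would be a token entity with an indexed string attribute and a relationship back to the main entity.

```python
import bisect
import re
from collections import defaultdict


class TokenIndex:
    """Maps lowercased words to the IDs of the records that contain them."""

    def __init__(self):
        self._records_by_token = defaultdict(set)
        self._sorted_tokens = []  # kept sorted so prefix search can use bisect

    def add(self, record_id, text):
        for token in re.findall(r"\w+", text.lower()):
            if token not in self._records_by_token:
                bisect.insort(self._sorted_tokens, token)
            self._records_by_token[token].add(record_id)

    def search_prefix(self, prefix):
        """Return IDs of records containing a word that starts with prefix."""
        prefix = prefix.lower()
        start = bisect.bisect_left(self._sorted_tokens, prefix)
        hits = set()
        for token in self._sorted_tokens[start:]:
            if not token.startswith(prefix):
                break  # sorted order: no later token can match
            hits |= self._records_by_token[token]
        return hits


# Hypothetical records.
index = TokenIndex()
index.add(1, "Blue whale sighting near the harbour")
index.add(2, "Harbour master report")
index.add(3, "White shark tagging")

print(index.search_prefix("harb"))
print(index.search_prefix("wh"))
```

Note that this matches words by prefix, not arbitrary substrings inside a word; that trade-off is what makes the lookup fast.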
Although one of the FTS extensions is available in Apple's deployment of SQLite, it's not exposed in Core Data.
I have a situation where users have a primary document (a purchase order) that will, throughout its life, have various other documents added to it. The documents could be email messages, word documents or anything else.
Right now the (clunky) solution is to print the document to PDF and then append the document to the Purchase order stored as a PDF.
I'm thinking of using a database (keyed by PO number) and linking the documents to it. The only issue with this is getting the documents into a standard (PDF) format and linking them to the PO in the database. Any suggestions on a user-friendly way to do this?
If your intention is to store the PDFs externally, your best bet is to store the document with a file name containing the DocumentID generated from your Documents database table, as in
475833.PDF
You will need another table to collect all of the related documents together, like a binder table.
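A minimal sketch of that layout, using SQLite from Python purely for illustration. The table names, column names, PO numbers and the two helper functions are assumptions; the answer only specifies a Documents table that generates the DocumentID and a binder-style table that groups related documents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    document_id INTEGER PRIMARY KEY AUTOINCREMENT,
    description TEXT NOT NULL
);
-- Binder table: one row per (purchase order, document) link.
CREATE TABLE binder (
    po_number   TEXT NOT NULL,
    document_id INTEGER NOT NULL REFERENCES documents(document_id),
    PRIMARY KEY (po_number, document_id)
);
""")


def attach_document(po_number, description):
    """Register a document, link it to the PO, and return its PDF file name."""
    cur = conn.execute(
        "INSERT INTO documents (description) VALUES (?)", (description,)
    )
    document_id = cur.lastrowid
    conn.execute(
        "INSERT INTO binder (po_number, document_id) VALUES (?, ?)",
        (po_number, document_id),
    )
    return f"{document_id}.PDF"


def files_for_po(po_number):
    """List the PDF file names linked to one purchase order."""
    rows = conn.execute(
        "SELECT document_id FROM binder WHERE po_number = ? ORDER BY document_id",
        (po_number,),
    )
    return [f"{doc_id}.PDF" for (doc_id,) in rows]


# Hypothetical purchase orders and documents.
attach_document("PO-1001", "Original purchase order")
attach_document("PO-1001", "Supplier email")
attach_document("PO-2002", "Original purchase order")
print(files_for_po("PO-1001"))
```

The file name returned by `attach_document` is what you would hand to the PDF printer as the output name, so the file on disk and the database row stay in step.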
Printing to PDF does have the advantage that it is not dependent on any particular application to produce the PDF; it will work in all applications. The trick is to find software that allows you to specify the file name programmatically. CutePDF does this using registry entries.