Document Conversion Realtime - Implementation Questions

Document Conversion Realtime - Implementation Questions - caching

We have a need to convert MS Office documents to PDF real time when someone provides a link to a document after checking whether user is authorized to view the document or not for an intranet portal. We also need to cache the documents based on the last modified date of the document, we should not convert the document again if another user requests the same document and the document content is not modified since it was last converted.
I have some basic questions on how we can implement this - and would like to check if anyone has previous experience or thoughts how they see this implemented?
For example, if we choose J2EE as the technology, and choose one of the open source Java libraries for PDF conversion; I have following questions.
If there is a 100 MB document - we would need to download entire document from the system where the document is hosted before we start converting the document. This approach may have major concerns on the response time given that this needs to be real time viewing. Is there an option to read first page of a document without downloading entire document so that we can convert document page by page?
How can we cache a document? I do not think we can either store the document in server or database. The reason is this could lead to anyone who is having access to either database or server - can access document content. Any thoughts?
Or do you suggest any out of the box product to do this instead of custom development?
Thanks

I work for a company that creates a product that does exactly what you are trying to do using Java / .NET Web service calls, so let me see if I can answer your questions without bias.
The whole document will need to be downloaded as it will need to be interpreted before PDF Conversion (e.g. for page numbering purposes) can take place. I am sure you are just giving an example, but 100MB is very large for an MS-Office document, although we do see it from time to time.
You can implement caching based on your exact security requirements. If you don't want to store the converted files in a (secured) DB or file system then perhaps you want to store them on a different server behind a firewall. Depending on the number of documents and size you anticipate you may want to cache them in memory. I am sure there are many J2EE caching libraries available, I know there are plenty in .NET. Just keep the most frequently requested documents in your cache.
Depending on your budget you may go for an out of the box product (hint hint :-). I know there are free libraries available for Java that leverage Open Office, but you get the same formatting limitations when opening MS-Office Files in OO. Be careful when trying to do your own MS-Office integration / automation. It is possible to make it reliable and scalable (we did), but it takes a long time and a lot of work.
I hope this helps.

Related

When using asp.net MVC core + EF core + ability to encrypt files. What will be the differences between Blob, FileStream & File System, to manage files

I am working on an asp.net core mvc web application, the web application is a document management workflow. where inside each of the workflow steps users can upload documents, as follow:-
users can upload documents with the following restriction; a file can not exceed 5 MB + all the documents inside a workflow can not exceed 50 MB, unless admin approves it. they can upload as many documents as they want.
we will have lot of views which will show the step and all its documents attached to it, and users can chose to download the documents.
we can have unlimited number of workflows. as the more users register with our application the more workflow will be created.
certain files can be marked as confidential, so they should be encrypted when storing them either inside the database or inside the file system.
we are planning to use EF core as the data access layer for our web application + SQL server 2016 or 2017.
now my question is how we should manage our files, where i found these 3 approaches.
Blob.
FileStream
File system.
now the first approach, will allow us to encrypt the files inside the database + will work with EF. but it will have a huge drawback on performance, since opening a file or querying the files from database means they will be loaded inside the hosting server memory. so since we are seeking for an extensible approach, so i think this approach will not work for us since it is less scalable.
Second approach. will have better performance compared to first approach (Blob), but FileStream are not supported with EF + does not allow encryption. so we have to exclude this also.
third approach. of storing the files inside a folder which have the workflow ID + store the link to the file/folder inside the DB. will allow us to encrypt the files + will work with EF. and have a better performance compared to Blob (not sure if this is valid for FileStream). the only drawback, is that we can not achieve Atomic-ity between the files and their related records inside the database. but with adding some code we can handle this by our-self. for example deleting a database record will delete all its documents inside the folder, and we can add some background jobs to make sure all the documents have database records, other wise to delete the documents..
so based on the above i found that the third approach is the best fit for our need? so can anyone advice on this please? are my assumption correct? and is there a fourth appraoch or a hybrid appraoch that can be a better fit for us?

Although modern RDBMS have been optimised for data storage with the perks of integrity and atomicity, databases should be considered the least most alternative (StackOverflow posts like this and this shall corroborate the above) and therefore the third option mentioned or an improvement thereof shall be the vote.
For instance, a potential improvement would be to store the files renamed to a hash of the content and database the hash which shall eliminate all OS restrictions on subdirectories/files, filenames, and paths. Moreover, with a well structured directory layout duplicates could be filtered out.
The User-defined Database Functions shall aid in achieving atomicity which will efface the need of background jobs. An excellent guide on UDFs particularly for the use of accessing filesytem and invoking an executable can be found here.

Caching Dynamic data that isn't really dynamic in an IIS7 environment

Okay, so I have an old ASP Classic website. I've determined I can reduce a huge number of DB calls by caching the data daily. Our site data is read only, and changes very slowly. I think based on our site usage, I would be able to cache pages by query string for every visit each day, without a hit to our server.
My first thought was to use Output Caching, but the problem I discovered right away was that it wasn't until the third page request was generated that I gained any performance. I verified this using SQL profiler, but I'm not sure why.
My second thought was to add this ObjPageCache include file from https://web.archive.org/web/20211020131054/https://www.4guysfromrolla.com/webtech/032002-1.shtml After some research I discovered that this could cause more issues than it may solve http://support.microsoft.com/kb/316451
I'm hoping someone on here will tell me that since 2002 the issue with Sending ServerXMLHTTP or WinHTTP Requests to the Same Server has been resolved with Microsoft.

Depending on how your data is maintained you could choose from a number of ways to cache it.
If your data is changed and saved in one single place you could choose to generate an html-file which you save to the serverdisk and refer to in your linking. This will require write access for the process running your site though (e.g. NETWORK SERVICE). This will produce fast pages as the server serves these pages without any scriptingengine getting involved.
Another option is reading the data into an DomDocument which you store in the Application object and refer to on the page that needs it (hence saving the roundtrip to the database). You could keep two timestamps together with the cached data (one for the cachingtime and one for the time of change of data in the database). Timestamps will allow for fast check for staleness of the cached data: cached timestamp <> database timestamp => refresh data; otherwise use cached data. One thing to note about this approach is that Application does not accept objects other than multithreaded object so you will have to use the MSXML2.FreeThreadedDomDocument.6.0
Personally I prefer the last one as it allows for a more dynamic usage and I don't have to worry about write access permissions for the process running my site (which would probably pose security risks anyways).

Design Question for Notification System

The original post was posted at https://stackoverflow.com/questions/6007097/design-question-for-notification-system
Here is more clarification of the problem: The notification system purpose is to get user notified (via email for now) when content of the site has changed or updated, or new posting is made. This could be treated as a notification system where people define a rule or keyword for 3rd party site and notification system goes out crawle 3rd party site and crate search inverted indexes. Then a new link or document show up for user defined keyword or rule (more explanation at bottom regarding use case),
For clarified used case: Let suppose I am craigslist user and looking for used vehicle. I define a rule “Honda accord”, “year “ 1996 and price range from “$2000 to $3000”.
For above use case to work what is best approach and how can I leverage on open source technology such as Apache Lucent, Apache Solr and Apache Nutch, and Apache Hadoop to solve this use case.
You can thing of building search engine and with rule and keyword notification system. I just need some pointers and help on how to integrate these open source package to solve use case ?
Any help and pointer will be appreciated. We need three important components are :
1) Web Crawler
2) Index Creator
3) Rule or keyword Mather
Any help will be greatly appreciated. I was referring this wiki which integrates Nutch and Solr together for above purpose http://wiki.apache.org/nutch/RunningNutchAndSolr

Your question is a big one but I'll take a stab at it as I've designed and implemented systems like this before.
Ignoring user account management, your system will need to provide the means to:
retrieve new prospect data (web spider)
identify and extract pertinent results from prospect data (filtering)
collect, maintain and organize results (storage)
select results based on various metadata (querying)
format results for delivery to users (templating)
deliver formatted results to users (delivery)
If the scope of your project is small (say less than 100 sites requiring spidering per day), you could probably get along with one of the many open-source web spiders including wget, Nutch, WebSphinx, etc. You might need to provide instrumentation (custom software) for scheduling, monitoring and control. If your project scope is larger than this, you may need to "roll your own" spidering solution (custom software). Typically this would be designed as a distributed, parallel architecture.
For simple filtering, regular expressions would suffice but for more complex tasks requiring knowledge of HTML layout (extract the textual component of the fifth list element (<LI/>) of the fourth table on the page) you'd need to use an XHTML parser. However you proceed, you'll need to provide custom software to conduct filtering based on your users' needs.
While any database technology can be used to store results extracted from retrieved documents, using an engine optimized for text like Apache SOLR will allow you to easily expand your search criteria as your needs dictate. Since SOLR supports the attachment of and search for metadata associated with each document, it would be a good choice. You'll also need to provide custom software here to automate this step.
Once you've selected a list of candidate results from SOLR, any scripting language could be used to template them into one or more emails and would also inject them into your mail transport agent (MTA). This also requires custom software to automate this process (and if required, to inject user-specific data into each message).

You should probably look at Google's Custom Search API also before diving into crawling the web yourself. This way, google can help you with returning keyword based search results, which you could later filter in your application based on your additional algorithms/rules etc, and make the whole thing work.

How to Increase page loading speed in Zend Framework Application

I have developed application using ZF.The app is little big with a lots of features.
I use Zend_Application(already using autoloader in constructor),Zend_Layout,Zend_view,Zend_form,etc. My current issue is, the page loading is very slow and that too in localhost with XAMP.
I have enabled xdebug, to investigate the issue, got a cachegrind file in "tmp" folder and tried to view it with WinCachegrind software. There i can a see a lot of processes and functions being run for each and every request or page load.
Also, i have installed YSlow add-on for firefox and observed the speed of page loads in seconds...I have compare the speed with ZF and non ZF applications. And from the comparison, the pages for non zf app takes less than 1 sec to load and for the ZF app, it takes atleast 6-7 seconds. What a huge difference.
Main Things happen in the app are :
1) Database connection happens for each request.
2) Im not adding the view to layout explicitly,ZF just appends it automatically, to layout.phtml, based on the action name.
3) Some windows have forms with few drop down boxes which fetches data from the database.
4) Have menus with ACL implimented, before it was loading the privilges from DB for each and every request, but now i have optimized it, so that it will work only duiring the login and rest of the time it will take from the Zend_Registry.
I would like to attach the cachegrind file so that some one can see whats happening in the background, but i cant see an option here for attaching.
Someone please help me to find a solution for this. Any kind of help is really appreciated. Thanks a lot

Let's try to give some hints.
First database connection should happen only once (except if you use several privileges access on the database or several databases). So check that you use Singleton patterns with you Zend_Db_Tables object
Then you do not use Zend_Cache. You should really start to use Zend_Cache and build several cache objects. Let's say for example a File cach, with long term storage, and a memcache or Apc Cache, storing objects. Then use these cache in several layers:
gives the FileCache to Zend_Db_Table (defaultMetaDataCache), this way you will avoid a loot of metadata queries, queries that ask for description of each columns of the tables you use.
Store one or more Acl object (depends on how you use Acl, if you have one big Acl with all rules or several with subsets). And store them in mid-duration caches when they are built.
Think of other usages, detect heavy loops, semi-static contents (like you select lists, how many time should they be considered static?)
Finally, get a whole mental image of how your application engine works, and how your data will grow and be used.You will need that step to use application levels caches in the very best way (for example should some elements be cached for groups of users?, should Acl objects be build for groups, for each user, for everybody, is ther some blocks in the layout that should be rendered the same for everybody?).

should images come from db or content\Images folder

I am developing a eCommerce website in ASP.NET MVC 3 in C#. Using SQL Server 2008R2. My question is if I have 5 images that I want to show in gridView with thumbnails (e.g. something like Amazon website that gives customers couple of pictures to show) would it be advisory if the images are coming from the database or should I reside in the Content\Images folder? There are quite a few sub-categories in sub-category in my db design. What is the most common suit for a professional developer to follow? Thanks. I know there are few options for third party tools like jquery & Telerik Extensions. So I will use them.
Thanks

From my experience and research it is better to put it in a folder/content structure. Yes, there are security things with opening directories to the public but if you instead upload a file via ftp dynamically the problems are solved. I have heard of horror stories about storing files in database and have seen the issues come up but have resolved them. Basically, it is easier to write to database and there are not the security issues of opening up a directory to public but just make sure to regularly check backups that the files are not corrupt or make sure the data is on a fail over cluster where that will never be a problem.
So summary: Database is fine just regularly check backups by restoring them that they are not corrupt or run as a fail over cluster. Otherwise just go with the typical folder/content structure but use ftp to upload the file so there are no open directories to the public.

For me, the best anwser to this question is this: To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
Sumary: Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation – one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of “storage age” or number of object overwrites as way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio