The project I am currently working on needs a search engine for a couple of tens of thousands of PDF files. When the user searches the website for a certain keyword, the search engine will return snippets of the PDF files matching the search criteria. The user then has the option to click a button to view the entire PDF file.
I figured the best way to do this was Elasticsearch + FSCrawler (https://fscrawler.readthedocs.io/en/fscrawler-2.7/). I ran some tests today and was able to crawl a folder on my local machine.
For serving the PDF files via the website, I figured I could store them in a Google Cloud Storage bucket and then link users to the bucket objects to view the PDFs. However, FSCrawler does not seem to be able to access the bucket. Any tips or ideas on how to solve this? Feel free to criticize the approach described above; if there are better ways to let the website's users access the PDF files, I would love to hear them.
Thanks in advance and kind regards!
You can use s3fs-fuse to mount an S3 bucket into your file system and then use the normal local FS crawler.
Alternatively, you can fork FSCrawler and implement a crawler for S3, similar to crawler-ftp.
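For the serving side of the question (letting website users open a PDF that lives in the bucket), here is a minimal sketch of handing out a time-limited signed URL with the Node.js client for Google Cloud Storage. The bucket name, object path, and credentials setup are placeholders, so treat it as an illustration of the flow rather than drop-in code.

```typescript
// Sketch: return a short-lived signed URL so the website can link users
// straight to a PDF stored in Google Cloud Storage.
// Assumes @google-cloud/storage is installed and application default
// credentials are configured; the bucket/object names are placeholders.
import { Storage } from "@google-cloud/storage";

const storage = new Storage();

async function getPdfViewUrl(objectName: string): Promise<string> {
  const [url] = await storage
    .bucket("my-pdf-bucket") // placeholder bucket name
    .file(objectName)
    .getSignedUrl({
      version: "v4",
      action: "read",
      expires: Date.now() + 15 * 60 * 1000, // valid for 15 minutes
    });
  return url;
}

// Usage: the "view the entire PDF" button in the search results points here.
getPdfViewUrl("reports/example.pdf").then((url) => console.log(url));
```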
Related
I have a Laravel web application that stores some Word documents on AWS S3, and I'd like to be able to edit these documents on the fly. Currently the only way I know how to do this is to download the file, open it, edit it, save it, and then re-upload and replace the document already on S3. This just seems a bit cumbersome, and I'm wondering if there is a better way.
I'd like to be able to open the Word document (either through an online app or via the native MS Word app), edit it, and have it automatically save back to the file on S3 when I save. Any advice?
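For reference, a minimal sketch of the download / re-upload round trip described above, using the AWS SDK v3 for JavaScript; the region, bucket, key, and local paths are placeholders, and the actual editing still happens locally in Word between the two calls.

```typescript
// Sketch of the current manual workflow: pull the document down from S3,
// edit it locally in Word, then push the edited copy back up.
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import { readFile, writeFile } from "fs/promises";

const s3 = new S3Client({ region: "us-east-1" }); // placeholder region

// Download the document so it can be opened and edited in Word.
async function downloadDoc(bucket: string, key: string, localPath: string): Promise<void> {
  const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  await writeFile(localPath, await Body!.transformToByteArray());
}

// Re-upload the edited copy, overwriting the object already on S3.
async function uploadDoc(bucket: string, key: string, localPath: string): Promise<void> {
  await s3.send(
    new PutObjectCommand({ Bucket: bucket, Key: key, Body: await readFile(localPath) })
  );
}
```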
I am able to upload the test data from the Quickstart example, but when I try to upload data from my own Google Cloud Storage bucket I get:
"Error: Cannot find the referenced file: in request."
I tried taking the data from the Quickstart that I know works, putting it in my own Google Cloud Storage bucket, and uploading it from there, and I get the same thing.
I can see all of the files just fine when I browse the objects. I even tried making the files public to all users, thinking it might be a permissions issue, but it doesn't appear to be that either.
Any advice?
I am trying to integrate Dropbox into my website, and so far I've been able to upload files, fetch file metadata and user details, and download files to my local machine (using the Dropbox API v2).
However, I would like to import the chosen file directly from Dropbox and upload it to the server to be processed further. I'm able to generate the link for the chosen file using the "Chooser".
The Dropbox API Explorer lists all the APIs Dropbox provides.
To build the website I'm using Laravel 5.6.17.
Your help would be much appreciated. Thanks in advance!
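As a rough sketch of the server-side import step, assuming the Chooser is configured with linkType: "direct" so the returned link serves the raw file bytes; the site itself is Laravel, so treat this Node/TypeScript snippet as an illustration of the flow (with placeholder paths) rather than drop-in code.

```typescript
// Sketch: the browser posts the Chooser's direct link to the server, and the
// server fetches the file content so it can be processed further.
// Uses the global fetch available in Node 18+; paths are placeholders.
import { writeFile } from "fs/promises";

async function importFromChooserLink(directLink: string, savePath: string): Promise<void> {
  const response = await fetch(directLink); // direct Chooser links serve the file bytes
  if (!response.ok) {
    throw new Error(`Download failed with status ${response.status}`);
  }
  const bytes = Buffer.from(await response.arrayBuffer());
  await writeFile(savePath, bytes); // file is now on the server for further processing
}
```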
I have developed an application which allows users to select multiple "transactions"; each of these is directly related to a PDF file.
When a user multi-selects and "prints" them, these PDF files are merged into one longer file for ease of printing.
Currently, "transaction" PDFs are generated on request, and so is the PDF merging.
As I'm trying to scale this up by relying on Amazon infrastructure, some questions have arisen:
Should I implement a queue for the PDF generation per "transaction"? If so, how can I provide the user a seamless experience? We don't want them to "wait".
Can I use EC2 to generate these PDF files for me? If so, can I provide a "public" link for the user to download the file directly from Amazon, instead of using our resources?
Thanks a lot!
EDIT ---- More details
User inputs some information through a regular form
System generates a PDF per request, using the provided information for the document
The PDF generated by the system is stored on Amazon S3.
We provide an API which allows you to "print" multiple PDFs at once; to do so, we merge the selected PDF files from S3 into one file for ease of printing.
When you multi-print documents, a new window opens directly with the merged file; the user needs to wait around 20 seconds for it to display.
We want to offload the resources used to generate the PDFs onto Amazon infrastructure, but we need to keep the same flow, meaning we should provide an instant public link for the user to download and print the files.
Based on my understanding, I think you just need the link to be created immediately, right after the user requests the file, while the PDF merge is created in parallel. I have an idea for doing that, and maybe it could work in your situation.
First, start with some logic that creates a unique PDF file name, using a random string as the name. At the same time, generate the PDF in the background, making sure it is written under the same name you created in the first step. This gives the user an instant file name and download link, even though the file creation is still in progress.
Make sure you use threads (if using PHP) or the event loop (if using Node.js) to run both steps at the same time. This will avoid a 404 file-not-found error.
Transferring files from EC2 to S3 also adds latency. But if you want to preserve the files for later or repeated use, then S3 is a good idea, as it can simply serve the PDF files for faster delivery; S3 is commonly used for static media storage. Otherwise, simply compute everything and generate the files on EC2.
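A minimal sketch of that pattern in Node/TypeScript, assuming the merged files live on S3 and a presigned URL is acceptable as the "instant" link; mergePdfsToS3 is a hypothetical helper standing in for the actual merge step, and the bucket name and region are placeholders.

```typescript
// Sketch: pick the final object key up front, start the merge in the
// background, and return a download link for that key immediately.
import { randomUUID } from "crypto";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" }); // placeholder region
const BUCKET = "my-merged-pdfs"; // placeholder bucket

// Hypothetical helper: fetch each source PDF from S3, merge them, and upload
// the result under targetKey. The real merge implementation is out of scope here.
async function mergePdfsToS3(sourceKeys: string[], targetKey: string): Promise<void> {
  // ...merge logic goes here...
}

async function startMergeAndGetLink(sourceKeys: string[]): Promise<string> {
  // 1. Decide the unique file name before the file exists.
  const targetKey = `merged/${randomUUID()}.pdf`;

  // 2. Kick off the merge without awaiting it, so the response is instant.
  mergePdfsToS3(sourceKeys, targetKey).catch((err) => console.error("merge failed", err));

  // 3. Return a presigned download link for that key; the client should retry
  //    until the object exists, to avoid the 404 mentioned above.
  return getSignedUrl(s3, new GetObjectCommand({ Bucket: BUCKET, Key: targetKey }), {
    expiresIn: 3600, // link valid for one hour
  });
}
```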
I've tried some of the services out there, including droplet, ctrlq.org/save, and some other sites that support directly fetching a file from a URL and uploading it to Dropbox, Google Drive, and the like, without the user having to store the file on a local disk.
The problem is that none of these services support multiple URLs or batch uploading, but I have quite a few URLs and I really need a service where I can paste them in, separated by newlines or semicolons, and have the files uploaded to Dropbox (or any other cloud storage).
Any help would be greatly appreciated.
The Dropbox Saver JavaScript control allows you to save up to 100 files to the user's Dropbox in one shot. You'll need to programmatically create the button using Dropbox.createSaveButton as explained in the linked page.
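A minimal sketch of the programmatic setup, assuming the dropins.js script (with your app key) is already loaded on the page and exposes the global Dropbox object; the file URLs and filenames below are placeholders.

```typescript
// Sketch: build a Saver button for a batch of files (up to 100 per button)
// and attach it to the page. The global Dropbox object comes from dropins.js.
declare const Dropbox: {
  createSaveButton(options: {
    files: { url: string; filename: string }[];
    success?: () => void;
    progress?: (progress: number) => void;
    cancel?: () => void;
    error?: (message: string) => void;
  }): HTMLElement;
};

const files = [
  { url: "https://example.com/files/report-1.pdf", filename: "report-1.pdf" },
  { url: "https://example.com/files/report-2.pdf", filename: "report-2.pdf" },
  // ...up to 100 entries per button
];

const button = Dropbox.createSaveButton({
  files,
  success: () => console.log("Saved to Dropbox"),
  error: (message) => console.error("Save failed:", message),
});

document.getElementById("saver-container")?.appendChild(button);
```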
It seems like the 100-file limit (at any one time) is universal, but you might find that it isn't the case when using the Dropbox REST API. It looks possible to do this with Node.js server side (OAuth and posts) or JavaScript client side (automating FileReader). I'll review and try to add content so these aren't just links.
If you can leave a page open for about 20 minutes to work around the "technical limitations", the Dropbox should be loadable 100 at a time like that, assuming each upload takes less than 2 seconds; it's an easy hook to add a progress indicator.
If you're preloading the Dropbox once yourself, or the initial load is compatible with manual action, perhaps mapping a drive and unzipping an archive of your links to it would work. If your list of links isn't extremely volatile, the REST API could be used to synchronize changes.
Edit: Forgot to include this page on CloudConvert, which unzips archives containing up to 100 files into Dropbox. Your use case doesn't seem to include retrieving the actual content at your servers (generating zip files), sending the automation list to the browser, and then having the browser extract to Dropbox, but it's another option.
The Dropbox API now offers the ability to save a file into Dropbox directly via a URL. There's a blog post about it here:
https://blogs.dropbox.com/developers/2015/06/programmatically-saving-a-url-to-dropbox/
The documentation can be found here:
https://www.dropbox.com/developers/core/docs#save-url
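For reference, a minimal sketch of calling the v2 successor of that endpoint, files/save_url, assuming a valid access token; the token, target path, and source URL are placeholders, and each URL in a batch needs its own call.

```typescript
// Sketch: ask Dropbox to fetch a URL directly into the user's Dropbox.
// The save is asynchronous on Dropbox's side; the response may contain an
// async_job_id that can be polled with files/save_url/check_job_status.
const ACCESS_TOKEN = "<your-dropbox-access-token>"; // placeholder

async function saveUrlToDropbox(sourceUrl: string, dropboxPath: string): Promise<unknown> {
  const response = await fetch("https://api.dropboxapi.com/2/files/save_url", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ACCESS_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ path: dropboxPath, url: sourceUrl }),
  });
  if (!response.ok) {
    throw new Error(`save_url failed with status ${response.status}`);
  }
  return response.json();
}

// Usage: loop over the list of URLs, one call per file.
saveUrlToDropbox("https://example.com/file.pdf", "/imports/file.pdf");
```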