The goal is to index uploaded files and search for text within them.
Current setup:
MediaWiki 1.27
PostgreSQL 9.4
Elasticsearch 1.7.5
MW-Extension CirrusSearch 1.27
MW-Extension Elastica (master)
Searching wiki pages and uploaded files with Elasticsearch already works. But what do I have to do to index and search the text inside the uploaded files (pdf, doc, ...)?
You need a media handler that can extract the text; see MediaHandler::getEntireText. For PDF, PdfHandler does it; I imagine extensions exist for other common formats as well.
I used this plugin. One disadvantage is that it uses too much space, so later in my project we migrated to Tika (the .NET port), which is what the mapper plugin uses.
Related
I have a problem with pdfMerger: I can't merge PDF files newer than version 1.4. I guess this is because I am using the free version of FPDI. How can I merge PDF 1.5 files without using Ghostscript? I don't have shell access on the hosting that I am using.
I tried to find a different PDF class to fix my problem, but I couldn't find one.
I installed Hue with Cloudera Manager on AWS and uploaded some directories with a few files in them. On the /user/hdfs path there are directories like project1 and project2. If I search for project, I get the project directories as results. But if I search for files inside the project directories, like file1, I get no results. I have looked at the config files for HDFS, Hue and Solr, but cannot find a configuration parameter for this. How can I fix that, so I can search deeper into HDFS with the Hue File Browser?
Hue doesn't use Solr for the File Browser; it uses WebHDFS, which can only search within the current folder.
The search within the File Browser is just a filtering of the files on the current directory, not a full text search across all the directories.
Currently the top search is only about tables and documents: http://gethue.com/realtime-catalog-search-with-hue-and-apache-atlas/
In the future, it will be extended to leverage the File search of Apache Atlas.
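Until then, a recursive name search can be done outside Hue by walking the tree through the WebHDFS REST API. The following is a minimal sketch, assuming an unsecured cluster; the host, port and the function names `list_status` and `find_by_name` are placeholders, not anything Hue provides.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the NameNode's WebHDFS endpoint; host and port are placeholders.
WEBHDFS_BASE = "http://namenode.example.com:50070/webhdfs/v1"


def list_status(path):
    """Return the FileStatus entries of one HDFS directory (op=LISTSTATUS)."""
    url = f"{WEBHDFS_BASE}{quote(path)}?op=LISTSTATUS"
    with urlopen(url) as response:
        payload = json.load(response)
    return payload["FileStatuses"]["FileStatus"]


def find_by_name(root, needle, lister=list_status):
    """Walk the tree under root and collect paths whose name contains needle."""
    matches = []
    pending = [root.rstrip("/") or "/"]
    while pending:
        current = pending.pop()
        for entry in lister(current):
            name = entry["pathSuffix"]
            full = f"{current.rstrip('/')}/{name}"
            if needle.lower() in name.lower():
                matches.append(full)
            # Descend into subdirectories, which the File Browser filter does not do.
            if entry["type"] == "DIRECTORY":
                pending.append(full)
    return sorted(matches)
```

Calling `find_by_name("/user/hdfs", "file1")` would then also look inside project1 and project2. On a secured cluster the request would additionally need authentication, which this sketch leaves out.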
I have a folder on my PC containing many epub and pdf files that I want to be able to search with full-text search.
I know Windows already has an indexing service, but I would like to apply more logic than a simple keyword search.
So I would like to import those epub and pdf files into Elasticsearch. Does anyone know a script that can do this?
Elasticsearch has a plugin for mapping attachments, so I hope this helps you:
https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html
https://github.com/elastic/elasticsearch-mapper-attachments
It works fine for me.
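With the mapper attachments plugin, the file content is sent as a base64-encoded string in a field whose mapping type is `attachment`. A minimal sketch of building such a document body follows; the field names `file` and `filename` and the function name are assumptions for illustration, and the index mapping still has to be created separately.

```python
import base64
import json
from pathlib import Path


def attachment_document(path, field="file"):
    """Build the JSON body for one file, base64-encoded for an attachment field."""
    source = Path(path)
    data = source.read_bytes()
    return json.dumps({
        "filename": source.name,  # hypothetical extra field, handy in search results
        field: base64.b64encode(data).decode("ascii"),
    })
```

The returned string can be sent as the body of an index request for each pdf or epub file in the folder.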
I have started playing with Elasticsearch. I have multiple text files in a folder, and I want to create an index on them so that I can run text searches against these files. Is there a way to do this using command line or . Please guide me with an example.
Yes, you can, by using the FS river together with the mapper attachment plugin. Here is a link to the source page.
I ran a few tests with it a little while ago, and it works fine. Be aware, though, that the file has to be local for this to work (although you can mount a remote file to a local path).
Hope this helps.
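For plain text files, an alternative to the river is to push the files yourself through the bulk API. This is only a sketch under assumptions: the index name `textfiles`, the field names, the URL `http://localhost:9200` and both function names are made up for illustration, and the action lines omit `_type`, which older Elasticsearch versions require.

```python
import json
from pathlib import Path
from urllib.request import Request, urlopen


def build_bulk_body(folder, index="textfiles"):
    """Build a newline-delimited bulk payload with one document per .txt file."""
    lines = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Action line, then the document itself, as the bulk format expects.
        lines.append(json.dumps({"index": {"_index": index, "_id": path.name}}))
        lines.append(json.dumps({
            "filename": path.name,
            "content": path.read_text(encoding="utf-8", errors="replace"),
        }))
    # The bulk payload must end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


def send_bulk(body, base_url="http://localhost:9200"):
    """POST the payload to the _bulk endpoint (assumed local, unauthenticated)."""
    request = Request(
        f"{base_url}/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urlopen(request) as response:
        return json.load(response)
```

Usage would be `send_bulk(build_bulk_body("/path/to/folder"))`, after which the `content` field can be queried like any other text field.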
I need to split a PowerPoint presentation file (pptx and, if possible, ppt) into a set of files in the original format (pptx or ppt), each containing one slide from the original. I need to do this programmatically on a Linux Ubuntu server using free tools or a free external API. When a file gets uploaded to a directory, the program will be called from my main program (written in PHP) and do the split.
I am looking for suggestions about language or set of tools to use. I looked at several options listed below. It will take some time to try all of them but if anyone could exclude or add to the list and/or provide code examples it would help.
Thanks!
(1) Apache POI project (POI-XSLF)
(2) OpenOffice unoconv command line utility
(3) C# (with compiler Mono for Linux). This may include indirect option of deleting slides with powerPoint.Slides(x).Delete
(4) JODConverter (Java OpenDocument Converter)
(5) PyODConverter (Python OpenDocument Converter)
(6) Google Documents API
(7) Aspose.Slides for .NET is out because of cost
When I had the same need, I ended up shelling out and using "UNOCONV" to convert the files to PDF, and then used "PDFTK" to split the PDF by pages. Once that is done, you should be able to take the extra step and convert the split PDF files back to PPTX with one more UNOCONV call.
While it seems rather complicated, PPTX seems to be "that one ooxml file no one wants to touch". Libraries seem to be few and incomplete mostly.
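The unoconv and pdftk pipeline described above could be scripted roughly as follows. This is a sketch, assuming both tools are installed and on the PATH; the function names and the `slide_%03d.pdf` naming pattern are made up for illustration, and whether the final PDF-to-PPTX step gives usable slides depends on the import filters of the installed office suite.

```python
import subprocess
from pathlib import Path


def conversion_commands(presentation, workdir):
    """Return the unoconv and pdftk command lines for the first two steps."""
    presentation = Path(presentation)
    workdir = Path(workdir)
    pdf = workdir / (presentation.stem + ".pdf")
    # Step 1: presentation -> single PDF.
    to_pdf = ["unoconv", "-f", "pdf", "-o", str(pdf), str(presentation)]
    # Step 2: PDF -> one PDF per page.
    burst = ["pdftk", str(pdf), "burst", "output", str(workdir / "slide_%03d.pdf")]
    return to_pdf, burst


def split_presentation(presentation, workdir):
    """Run the whole pipeline; raises CalledProcessError if a tool fails."""
    to_pdf, burst = conversion_commands(presentation, workdir)
    subprocess.run(to_pdf, check=True)
    subprocess.run(burst, check=True)
    pages = sorted(Path(workdir).glob("slide_*.pdf"))
    for page in pages:
        # Step 3: each single-page PDF -> PPTX, written next to the PDF.
        subprocess.run(["unoconv", "-f", "pptx", str(page)], check=True)
    return pages
```

From PHP, a script like this could be started with a shell call once the upload has finished.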