MediaWiki - Search for text in uploaded files

The goal is to index uploaded files and search for text within them.
Current setup:
MediaWiki 1.27
PostgreSQL 9.4
Elasticsearch 1.7.5
MW-Extension CirrusSearch 1.27
MW-Extension Elastica (master)
Searching with Elasticsearch in wiki pages and for uploaded files (by name) is working. But what do I have to do to index and search the text inside the uploaded files (pdf, doc, ...)?

You need a media handler which can extract the text; see MediaHandler::getEntireText. For PDF, PdfHandler does it; I imagine extensions exist for other common formats as well.

I used this plugin. One disadvantage is that it uses too much space, so later in my project we migrated to Tika (the .NET port), which is also what the mapper plugin uses.

Related

Laravel merge compressed PDF files without ghostscript

I have a problem with pdfMerger: I can't merge PDF files above version 1.4. I guess this is because I am using the free version of FPDI. How can I merge PDF 1.5 files without using Ghostscript? I don't have shell access on the hosting that I am using.
I tried to find other PDF classes to fix my problem, but I couldn't.

Hue Filebrowser Search is searching only in the first layer

I installed Hue with Cloudera Manager on AWS and uploaded some directories containing a few files. Under the /user/hdfs path there are directories such as project1 and project2. If I search for "project", I get the project directories as results. But if I search for files inside the project directories, such as file1, I get no results. I have looked at the config files for HDFS, Hue, and Solr, but cannot find a configuration parameter for this. How can I fix that, so I can search deeper in HDFS with the Hue file browser?

Hue doesn't use Solr for the file browser; it uses WebHDFS, which is only able to search within the current folder.

The search within the File Browser is just a filtering of the files on the current directory, not a full text search across all the directories.
Currently, the top search covers only tables and documents: http://gethue.com/realtime-catalog-search-with-hue-and-apache-atlas/
In the future, it will be extended to leverage the File search of Apache Atlas.
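
If you need a deeper search in the meantime, one workaround is to walk the tree yourself through the WebHDFS REST API. The following is only a rough sketch, not something Hue provides: `list_status` and `find_by_name` are hypothetical helper names, the base URL is an assumption about your cluster, and authentication (for example a `user.name` query parameter) is left out.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, List


def list_status(base_url: str, path: str) -> List[dict]:
    """Return the FileStatus entries of one HDFS directory via WebHDFS LISTSTATUS."""
    url = f"{base_url}/webhdfs/v1{urllib.parse.quote(path)}?op=LISTSTATUS"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["FileStatuses"]["FileStatus"]


def find_by_name(
    path: str, needle: str, lister: Callable[[str], List[dict]]
) -> Iterator[str]:
    """Yield every path below `path` whose name contains `needle`.

    Unlike the file browser filter, this descends into subdirectories.
    `lister` maps a directory path to its FileStatus entries, so the
    walk can be tried out without a running cluster.
    """
    for entry in lister(path):
        child = path.rstrip("/") + "/" + entry["pathSuffix"]
        if needle in entry["pathSuffix"]:
            yield child
        if entry["type"] == "DIRECTORY":
            yield from find_by_name(child, needle, lister)
```

Against a real cluster you would call something like `find_by_name("/user/hdfs", "file1", lambda p: list_status("http://namenode:50070", p))`, where the host and port are placeholders. Note that this matches on file names only; it is still not a full-text search.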

How to index a folder of epub, pdf documents with elasticsearch

I have a folder on my PC containing many epub and pdf files that I want to be able to search in full text.
I know Windows already has an indexing service, but I would like to apply more logic than a simple keyword search.
So I would like to import those epub and pdf files into Elasticsearch. Does anyone know a script that can do this?

Elasticsearch has a plugin for mapping attachments, so I hope this helps:
https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html
https://github.com/elastic/elasticsearch-mapper-attachments
It works fine for me.
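
To give an idea of what the import script has to do with this plugin: the file content is sent base64-encoded in a field whose mapping type is `attachment`, and the plugin extracts the text on the Elasticsearch side. A minimal sketch of preparing such documents follows; the field name `file`, the type name `doc` and the helper `build_attachment_doc` are assumptions made up for this example, and actually sending the requests is left out.

```python
import base64
from pathlib import Path

# Mapping for the mapper-attachments plugin: the "file" field is typed
# "attachment". The names "doc" and "file" are assumptions for this sketch.
ATTACHMENT_MAPPING = {"doc": {"properties": {"file": {"type": "attachment"}}}}


def build_attachment_doc(path: Path) -> dict:
    """Return a document body whose "file" field holds the file's bytes, base64-encoded."""
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return {"filename": path.name, "file": encoded}
```

You would put `ATTACHMENT_MAPPING` on the index once, then loop over the folder and index one `build_attachment_doc(...)` result per epub or pdf file, serialized as JSON.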

Elasticsearch how to index text files using the command line

I have started playing with Elasticsearch and want to create an index for text files. I have multiple text files in a folder, and I want to index them so that I can run text searches on them. Is there a way to do this from the command line? Please guide me with an example.

Yes, you can, by using the FS river together with the mapper attachment plugin. Here is a link to the source page.
I ran a few tests with it a little while ago and it works fine. Be aware, though, that the files have to be local for this to work (although you can mount a remote location to a local path).
Hope this helps.
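
If the files are plain text and you would rather not install a river, a different route (not the FS river approach above) is to build a request body for the bulk API yourself and post it from the command line. A rough sketch, assuming UTF-8 `.txt` files; the index name `textfiles`, the type name `doc` and the helper `build_bulk_body` are made up for illustration, and the `_type` field only applies to older Elasticsearch versions that still use mapping types.

```python
import json
from pathlib import Path


def build_bulk_body(folder: Path, index: str = "textfiles", doc_type: str = "doc") -> str:
    """Build a newline-delimited bulk API body with one document per .txt file."""
    lines = []
    for path in sorted(folder.glob("*.txt")):
        # Each document takes two lines: the action line, then the source line.
        action = {"index": {"_index": index, "_type": doc_type, "_id": path.name}}
        source = {
            "filename": path.name,
            "content": path.read_text(encoding="utf-8", errors="replace"),
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

Write the result to a file such as bulk.json and send it with `curl -XPOST localhost:9200/_bulk --data-binary @bulk.json` (host and port are placeholders; newer versions also want a `Content-Type: application/x-ndjson` header).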

How to split PowerPoint presentation file into files with one slide in each file

I need to split a PowerPoint presentation file (pptx and, if possible, ppt) into a set of files in the original format (pptx or ppt), each containing one slide from the original. I need to do this programmatically on a Linux Ubuntu server using free tools or a free external API. When a file is uploaded to a directory, the program will be called from my main program (written in PHP) to do the split.
I am looking for suggestions about which language or set of tools to use. I have looked at the options listed below. It will take some time to try all of them, so if anyone could exclude items, add to the list, and/or provide code examples, it would help.
Thanks!
(1) Apache POI project (POI-XSLF)
(2) OpenOffice unoconv command line utility
(3) C# (with compiler Mono for Linux). This may include indirect option of deleting slides with powerPoint.Slides(x).Delete
(4) JODConverter (Java OpenDocument Converter)
(5) PyODConverter (Python OpenDocument Converter)
(6) Google Documents API
(7) Aspose.Slides for .NET is out because of cost

When I had the same need, I ended up shelling out to "UNOCONV" to convert the files to PDF, and then used "PDFTK" to split the PDF by pages. Once that is done, you should be able to take the extra step and convert the split PDF files back to PPTX with one more UNOCONV call.
While it seems rather complicated, PPTX seems to be "that one OOXML file no one wants to touch". Libraries are few and mostly incomplete.
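
Roughly, the pipeline looks like the sketch below. It is written in Python only for illustration (from PHP you would shell out with the same commands); the function names and the `slide_%03d.pdf` naming pattern are made up, and it assumes that unoconv writes its output next to the input file and that the PDF to PPTX step gives acceptable results for your slides, which you should check.

```python
import subprocess
from pathlib import Path
from typing import List


def to_pdf_cmd(presentation: Path) -> List[str]:
    """unoconv command that converts a presentation to PDF."""
    return ["unoconv", "-f", "pdf", str(presentation)]


def burst_cmd(pdf: Path, out_dir: Path) -> List[str]:
    """pdftk command that writes one PDF per page into out_dir."""
    return ["pdftk", str(pdf), "burst", "output", str(out_dir / "slide_%03d.pdf")]


def to_pptx_cmd(page_pdf: Path) -> List[str]:
    """unoconv command that converts a single-page PDF back to PPTX."""
    return ["unoconv", "-f", "pptx", str(page_pdf)]


def split_presentation(presentation: Path, out_dir: Path) -> None:
    """Run the three steps; needs unoconv and pdftk installed on the server."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(to_pdf_cmd(presentation), check=True)
    # Assumption: unoconv put the PDF next to the input file.
    subprocess.run(burst_cmd(presentation.with_suffix(".pdf"), out_dir), check=True)
    for page in sorted(out_dir.glob("slide_*.pdf")):
        subprocess.run(to_pptx_cmd(page), check=True)
```

Keep in mind that the round trip through PDF will likely lose animations, notes and editable text boxes, so the resulting PPTX files may be closer to pictures of the slides than to the originals.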
