Run Elasticsearch on PDFs and PPTs - elasticsearch

I am new to Elasticsearch. I have read its tutorials, but I need guidance on my problem:
I have a collection of PDF documents and PowerPoint files on my system. I need to build a system using Elasticsearch where I can retrieve these files on the basis of keywords present in them. Can someone please guide me on how to proceed and index my documents? Do I need to parse my PDFs and convert them to JSON format using Tika or FSCrawler and then provide that to Elasticsearch?
Thank you.

You should set up FSCrawler; it will do the parsing and make the files' content searchable.
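For reference, a minimal FSCrawler job configuration might look like the sketch below. This is only a sketch: the settings-file location and exact field names differ between FSCrawler versions, and the job name, directory path, and node URL here are placeholder assumptions.

```yaml
# ~/.fscrawler/my_docs/_settings.yaml  (hypothetical job named "my_docs")
name: "my_docs"
fs:
  # Directory containing the PDF / PPT files to index (placeholder path)
  url: "/path/to/your/documents"
  # How often FSCrawler rescans the directory for new or changed files
  update_rate: "15m"
elasticsearch:
  nodes:
    # Address of your Elasticsearch node (placeholder)
    - url: "http://127.0.0.1:9200"
```

Starting the job (e.g. bin/fscrawler my_docs) extracts the text via Tika internally and pushes it into an index you can query with normal full-text searches, so you do not need to convert anything to JSON yourself.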

Related

Elasticsearch index a file automatically

I am new to Elasticsearch. This question might look weird, but is it possible to index a file automatically (i.e. given a file path, Elasticsearch should index its contents automatically)? I have found some open-source tools like elasticdump and tried using them for this purpose, but I would prefer an Elasticsearch plugin that supports almost all Elasticsearch versions. Can anyone suggest one?

Elasticsearch next steps

I'm new to Elasticsearch and am still trying to set it up. I have installed Elasticsearch 5.5.1 using the default values, and I have also installed Kibana 5.5.1 using the default values. I've also installed the ingest-attachment plugin along with the latest X-Pack plugin. I have Elasticsearch running as a service and I have Kibana open in my browser. On the Kibana dashboard I have an error stating that it is unable to fetch mappings. I guess this is because I haven't set up any indices or pipelines yet.
This is where I need some guidance; all the documentation I've found online so far isn't particularly clear. I have a directory with a mixture of document types, such as PDF and DOC files. My ultimate goal is to be able to search these documents with values that a user will enter via an app. I'm guessing I need to use the Dev Tools/Console window in Kibana with the 'PUT' command to create a pipeline next, but I'm unsure how I should do this so that it points to my directory of documents. Can anybody provide me with an example of this for this version, please?
If I understand you correctly, let's first establish some basic understanding of Elasticsearch:
Elasticsearch, in its simplest definition, is a "search engine": you store some data, then Elasticsearch helps you search it using search criteria and retrieves the relevant data back.
You need a "container" to save your data to, and like any database engine Elasticsearch has one, though the terms are somewhat different: what SQL-like systems call a "database" is called an "index" here, and what you know as a "table" is called a "type" in Elasticsearch.
From my understanding, you will need to create your index (with or without mappings) to have a starting point. I recommend starting without mappings just to get things working, but later on it is highly recommended to work with mappings where applicable, because Elasticsearch is smart, but it cannot know more about your data than you do.
Kibana complained because it failed to find a proper index to start with; it asks you to provide either an index-name pattern or a specific index name so it can infer the mappings and give you the nice features of querying, displaying charts, etc. of your data. So once you create your index, provide its name on Kibana's starting page and you will be ready to go.
Let me know if you need something more specific to your needs :)
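Since the ingest-attachment plugin is already installed, a rough sketch of the kind of 'PUT' commands you could run in the Kibana Dev Tools console follows. This is only a sketch: the pipeline, index, type, and field names are placeholder assumptions, and note that Elasticsearch will not read files from your directory by itself; something external has to read each file, base64-encode it, and send it in.

```
PUT _ingest/pipeline/attachments
{
  "description": "Extract text from base64-encoded documents",
  "processors": [
    { "attachment": { "field": "data" } }
  ]
}

PUT my_docs/doc/1?pipeline=attachments
{
  "filename": "example.pdf",
  "data": "<base64-encoded file contents>"
}
```

The extracted text lands in the attachment.content field, which you can then query with an ordinary match search from your app. If you would rather not write the file-reading script yourself, a tool like FSCrawler can watch the directory and do the extraction and indexing for you.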

Upload large folder of json files to elasticsearch

I have a folder that consists of 20,000 JSON files.
I want to import these into an index in Elasticsearch. I know of the bulk API, but I'm not sure how to loop over my folder to index each JSON file.
I've seen examples of using --data-binary with an @file.json argument, but this seems to only work for single files, not an entire folder. Maybe I am just not understanding correctly.
The JSON files are of type GeoJSON and I am on a Windows machine. Any help is appreciated, thanks!
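For what it's worth, here is a minimal sketch of one way to do the loop with the Python client (elasticsearch-py); the index name, folder path, and the assumption that each file holds a single GeoJSON document are placeholders to adapt:

```python
import json
from pathlib import Path

from elasticsearch import Elasticsearch, helpers  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust host/port as needed

def generate_actions(folder):
    """Yield one bulk action per .json file in the folder."""
    for path in Path(folder).glob("*.json"):
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)  # assumes each file is a single GeoJSON document
        yield {
            "_index": "geodocs",   # placeholder index name
            "_id": path.stem,      # reuse the file name as the document id
            "_source": doc,
        }

# helpers.bulk batches the documents and sends them through the bulk API.
helpers.bulk(es, generate_actions(r"C:\data\geojson"))
```

Elasticsearch versions before 7.x also expect a "_type" entry in each action. A PowerShell loop around curl --data-binary would work too; it just sends one request per file instead of batching them.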

files indexing automatically by elasticsearch

I am a newbie in Elasticsearch, so please forgive me if my question sounds weird :D
I want to index files in certain directories with Elasticsearch automatically (for example: if I add a file to a certain directory, then Elasticsearch should index that file immediately), but I don't know how to configure Elasticsearch to solve that problem.
Can anyone suggest something?
Thanks in advance.
I don't think you can have Elasticsearch watch a directory (and I wouldn't think that is a good thing to do in most cases).
Instead, have a client wrapper that implements a file watcher, and push changes to Elasticsearch via this client; see the sketch below.
You could use the PathHierarchyTokenizer to preserve the file-system hierarchy in your index, allowing you to drill down through your directory structure.
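A minimal sketch of such a watcher client follows, using Python's watchdog package and the Elasticsearch Python client purely for illustration (any file-watching API, e.g. Java's WatchService, would do); the directory, index name, and the plain-text assumption are placeholders:

```python
import time
from pathlib import Path

from elasticsearch import Elasticsearch              # pip install elasticsearch
from watchdog.events import FileSystemEventHandler   # pip install watchdog
from watchdog.observers import Observer

es = Elasticsearch("http://localhost:9200")

class IndexOnCreate(FileSystemEventHandler):
    """Pushes newly created files into Elasticsearch."""

    def on_created(self, event):
        if event.is_directory:
            return
        path = Path(event.src_path)
        # Assumes plain-text files; PDFs etc. would first need text
        # extraction (e.g. Tika or an ingest-attachment pipeline).
        body = {"path": str(path), "content": path.read_text(errors="ignore")}
        es.index(index="watched_files", document=body)  # older clients use body= instead

observer = Observer()
observer.schedule(IndexOnCreate(), "/path/to/watched/dir", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```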

Spring - Mongo full-text search

I am developing an app in Java. It has MongoDB at the back end, which stores files (in GridFS). I use the Spring Framework to interact with MongoDB. I want to search for text present in the stored documents (PDF, DOC, TXT files). I know MongoDB supports full-text search (from 2.4). My questions are:
1. Does the Spring Framework support full-text search, or should we take the help of Solr or Lucene?
2. If both of the above are possible, which is the better option?
3. What about indexing? I don't have much knowledge regarding indexing in full-text search.
4. When will 2.4 be available?
1. Spring does not support full-text search within its core features; however, within the Spring Data project there are two sub-projects that allow interaction with Solr and Elasticsearch, both of which are full-text search engines built on top of Apache Lucene. For detailed information look at these links:
https://github.com/dadoonet/spring-elasticsearch
https://github.com/SpringSource/spring-data-solr
2. It depends on your needs: Lucene is a low-level library, while Elasticsearch and Solr are out-of-the-box search engines built on top of Lucene. I think Elasticsearch provides better integration with MongoDB, through the MongoDB river, which supports indexing of GridFS attachments. Look at these links:
http://www.elasticsearch.org/
https://github.com/richardwilly98/elasticsearch-river-mongodb/
3. You need to clarify this question.
4. I don't know when MongoDB version 2.4 will be available, but don't forget that full-text search is still an experimental feature, and I also think it does not yet support GridFS.
MongoDB text search will not pull text out of PDF, DOC, or, for that matter, any files that are stored in GridFS. From the perspective of MongoDB, GridFS files are uninterpreted binary.
If you'd like to use MongoDB's new text-search capabilities to search different file types, you'll need to do the work in your application to extract the text from these files and add it to documents that you explicitly insert into MongoDB. You can use existing libraries such as Apache Tika to do the heavy lifting; note that Tika is what Solr/Lucene use to do text extraction from rich-text document types.
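For illustration only (your app is Java/Spring, but the flow is identical), here is a rough sketch using the tika and pymongo Python packages; the database, collection, and file names are placeholder assumptions:

```python
import pymongo
from tika import parser  # pip install tika (needs Java; starts a local Tika server)

client = pymongo.MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["documents"]      # placeholder database/collection names

# Extract plain text from a rich-text file on disk (or one pulled out of GridFS).
parsed = parser.from_file("report.pdf")       # placeholder file name
text = parsed.get("content") or ""

# Store the extracted text alongside a reference to the original file.
collection.insert_one({"filename": "report.pdf", "content": text})

# Create a MongoDB text index on the extracted content (MongoDB 2.4+ text search).
collection.create_index([("content", pymongo.TEXT)])
```

In the Java application the equivalent steps would be Tika's parseToString() plus an insert through Spring Data MongoDB.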
As for text search indexing in MongoDB, please refer to the release notes here
