How to use Elasticsearch to make files in a directory searchable?

I am very new to search engines and Elasticsearch, so please bear with me, and apologies if this question sounds vague. I have a large directory with lots of .csv and .hdr files, and I want to be able to search the text within these files. I've done the tutorials and read some of the documentation, but I'm still struggling to understand the concept of indexing. It seems like all the tutorials show you how to index one document at a time, but this will take a long time as I have lots of files. Is there an easier way to make Elasticsearch index all the documents in this directory and be able to search for what I want?

Elasticsearch can only search documents it has indexed. "Indexed" means Elasticsearch has consumed the documents one by one and stored them internally.
Normally the internal structure matters, and you should understand what you're doing to get the best performance.
So you need a way to get your files into Elasticsearch; I'm afraid there is no "one-click" way to achieve this...
You need:
1. A running cluster
2. An index designed for the documents
3. A way to get the documents from the filesystem into Elasticsearch
Your question is focused on 3).
For this, search for script examples or tools that can crawl your directory and feed Elasticsearch with documents.
Five seconds of Googling brought me to:
https://github.com/dadoonet/fscrawler
https://gist.github.com/stevehanson/7462063
Theoretically it could also be done with Logstash (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html), but I would give fscrawler a try first.
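If you do end up scripting it yourself, here is a minimal sketch of the bulk approach using the official Python client; the host, the index name "my-files", and the directory path are all assumptions to adapt for your cluster:

```python
# Minimal sketch: bulk-index every .csv/.hdr file in a directory,
# assuming Elasticsearch runs locally on port 9200. The index name
# "my-files" and the directory path are placeholders.
from pathlib import Path

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def generate_docs(directory):
    # One document per file: the filename plus the raw file contents.
    for path in Path(directory).rglob("*"):
        if path.suffix in {".csv", ".hdr"}:
            yield {
                "_index": "my-files",
                "_source": {
                    "filename": path.name,
                    "content": path.read_text(errors="ignore"),
                },
            }

helpers.bulk(es, generate_docs("/data/my-directory"))
```

The bulk helper batches the documents for you, which is much faster than indexing one file per request.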

Related

Implementing popular keywords in ElasticSearch

I'm using ElasticSearch on AWS EC2,
and I want to implement a "today's popular keywords" function in ES.
There are three indexes (place, genre, name), and I want to see today's popular keywords in the name index only.
I tried to use the ES slowlog with Logstash, but the slowlog saves a log entry per shard
(e.g., with 5 shards, 5 query logs are saved for a single search).
Is there any good and easy way to implement popular keywords in ES?
As far as I know, this is not supported by Elasticsearch out of the box, so you need to build your own custom solution.
The slowlog-based design you mentioned is not good: as you noted, the slowlog works on a per-shard basis, and even if you do some extra computing and manage to merge the per-shard entries back into a single search at the index level, it still would not be good, because:
You would have to change the slowlog configuration with a different threshold for every index. You could set the threshold to 0ms to make sure you capture every search query in the slowlog, but that would take a huge amount of disk space and would hurt Elasticsearch performance.
You would also have to parse the slowlog in your application, and doing that at runtime would be very costly.
Instead, I think you can maintain a distributed cache in your application where you store the top searched keywords, like the leaderboard of a multiplayer gaming app, which changes very frequently; in your case you don't even have to update this cache very often. I won't go into much implementation detail, but a simple hashmap with the search term as key and its count as value would solve the issue.
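As a minimal sketch of that hashmap idea, here a plain in-process Counter stands in for the distributed cache (e.g. Redis) you would use across nodes:

```python
# Minimal sketch of the keyword counter described above. A Counter
# stands in for a distributed cache; record_search() would be called
# wherever your application issues the query against the name index.
from collections import Counter

popular_keywords = Counter()

def record_search(term: str) -> None:
    popular_keywords[term.lower()] += 1

def top_keywords(n: int = 10):
    return popular_keywords.most_common(n)

record_search("jazz bar")
record_search("jazz bar")
record_search("ramen")
print(top_keywords(2))  # [('jazz bar', 2), ('ramen', 1)]
```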
Hope this helps. Let me know if you have questions.

Can I put the result from Kibana into Elasticsearch again?

Can I put the response result that I query in the Kibana dev tools back into Elasticsearch directly?
Or must I write a script to achieve it?
Any recommendations?
OK, so here is one basic piece of understanding after the discussion.
Please observe carefully.
If you have the head plugin installed for ES, search for the .kibana index.
Open the .kibana index and you will see all the designed dashboards listed there with the processed info.
Think of ES as just another store from which you can read data and put that data into another ES index.
Refer to this link:
https://www.elastic.co/blog/kibana-under-the-hood-object-persistence
A tool you can opt for is Logstash, for both reading and writing; learning Grok patterns can give you a good lead there.
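If you would rather script the read-and-write step than run Logstash, here is a minimal sketch using the Python client's reindex helper; both index names are placeholders:

```python
# Minimal sketch: treat ES as a store you read from and write back to,
# copying documents from one index into another. The index names are
# placeholders -- point them at your own indices.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Copy every document from "source-index" into "target-index".
helpers.reindex(es, source_index="source-index", target_index="target-index")
```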
Tell me if you need some real screenshots for the same problem.
Happy learning.
It is like cooking in the kitchen and then asking to put the cooked food back in the kitchen. If you cooked the food, better to consume it :)
The visualizations or processed data you see on the Kibana end are just for Kibana. The algorithms or processing techniques applied to the data set residing in Elasticsearch will also be applied to upcoming data sets.
So of course you can put/consume your data into Elasticsearch again.
It depends on what sort of requirement you are facing.
Note: the data in Elasticsearch (the inverted index) is not going to change its structure after Kibana processing, which is why you are able to apply other processing techniques from Kibana over the same index, assuming the data is in its earlier state.

Looking for a search tool that reads data from a DB and indexes it

Currently I have a requirement to read the new feeds in a DB and index them into a search tool. I understand that a Logstash + Elasticsearch combo will work here: we set up the DB input plugin and the data will be indexed into Elasticsearch.
But I am looking for other, better options to research, if any. Any suggestions, please?
If you are looking for a specific tool to index data from a DB into a search index, then you might be interested in Solr's DIH (DataImportHandler).
It can work with other sources (files, RSS...), but DBs are its forte. Very handy for small- to medium-size indexing. If you need to index many millions of docs it starts to get trickier (possible, but trickier, as parallelizing the work is not straightforward).
In ES land there is elasticsearch-jdbc, similar to the DIH but not built in. I have used it too and it works, though it is a bit less user-friendly for a quick setup, imho.
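If you want to see what that flow amounts to without a dedicated tool, here is a minimal hand-rolled sketch; the SQLite source and the table/column names ("feeds", "id", "title", "body") are hypothetical stand-ins for your DB:

```python
# Minimal sketch of the DB-to-index flow, done by hand in Python
# instead of with elasticsearch-jdbc or Logstash. The table and column
# names are hypothetical.
import sqlite3

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
conn = sqlite3.connect("feeds.db")

def rows_as_docs():
    # One ES document per DB row, reusing the row id as the document id
    # so re-running the script updates rather than duplicates.
    for row_id, title, body in conn.execute("SELECT id, title, body FROM feeds"):
        yield {
            "_index": "feeds",
            "_id": row_id,
            "_source": {"title": title, "body": body},
        }

helpers.bulk(es, rows_as_docs())
```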

Comparison of Handling Logs and PDFs in Solr & Elasticsearch and Data Visualization in Banana & Kibana

How do Elasticsearch and Solr compare in respect to the following:
Indexing logs.
Indexing events.
Indexing PDF documents.
Ease of creating and distributing visualizations. Kibana vs Banana.
Support and documentation for developers.
Any help is appreciated.
EDIT
More specifically, I am trying to figure out how exactly a PDF document or an event can be indexed at all. I have worked a little with Elasticsearch, and since I am a fan of JSON, I found it quite useful when I tried to index structured data.
For example, logs are mostly structured and thus, I guess, easier to index and search. But what if I want to index the whole log file itself?
Follow up
Is Kibana the only visualization tool available for Elasticsearch?
Is Banana the only visualization tool available for Solr?
Here is an answer that tries to address just the Elasticsearch aspect of the post.
Take a look at https://github.com/elastic/elasticsearch-mapper-attachments for handling PDFs.
For events/logs, you would need to transform them into structured data before indexing into Elasticsearch. You can include a field for the source (the log file the data came from, plus other information like that); that way, all the data in the whole log file gets indexed. You can then take advantage of ES aggregations to group results by log file, calculate statistics, and so on.
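As a minimal sketch of both steps with the Python client: the regex covers a simple Apache-style line, and every name here (index, field names, log path) is an assumption:

```python
# Minimal sketch: turn a raw log line into a structured document with
# a "source" field, then group indexed documents by source file with
# a terms aggregation. Real logs will need their own parsing.
import re

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
pattern = (r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
           r'"(?P<request>[^"]+)" (?P<status>\d+) (?P<bytes>\d+)')

doc = re.match(pattern, line).groupdict()
doc["source"] = "/var/log/apache/access.log"  # which log file this came from
es.index(index="logs", document=doc)

# Group the indexed documents by originating log file
# ("source.keyword" assumes the default dynamic mapping for strings).
resp = es.search(index="logs", size=0,
                 aggs={"by_source": {"terms": {"field": "source.keyword"}}})
```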
The ELK stack is definitely worth a look.
I don't know if Kibana is the only visualization tool, but it is probably the most popular and is likely to offer more than the alternatives.

How to run a script on ElasticSearch before indexing?

I have to add/update some fields in the document before indexing. How can I do this?
I ran into this problem because the documents are sent from other applications and I should not touch those. The simplest thing I can think of is writing a script to manipulate the documents before indexing.
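A minimal sketch of that script idea, with hypothetical field and index names, might look like this:

```python
# Minimal sketch: a small script that adds/updates fields on each
# incoming document before indexing it. The field names ("indexed_at",
# "title") and the index name are hypothetical.
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def enrich(doc: dict) -> dict:
    # Add a timestamp and normalize a field before the document is stored.
    doc["indexed_at"] = datetime.now(timezone.utc).isoformat()
    doc["title"] = doc.get("title", "").strip().lower()
    return doc

incoming = {"title": "  Some Document  ", "body": "..."}
es.index(index="documents", document=enrich(incoming))
```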
