Read documents with Elasticsearch

I have an information retrieval assignment where I have to use Elasticsearch to generate some indexing/ranking. I was able to download Elasticsearch and it's now running on http://localhost:9200/, but how do I read every document stored in my folder called 'data'?

Elasticsearch is just a search engine. To make your docs and files searchable, you need to load them, extract the relevant data, and load it into Elasticsearch.
Apache Tika is a solution for extracting the data out of the files. Write a file system crawler using Tika, then use the REST API to index the extracted data.
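If you do want to write the crawler yourself, a rough sketch of the Tika-plus-REST idea could look like this in Python (assuming the tika and requests packages, a local Java runtime for Tika, and a recent Elasticsearch version for the _doc endpoint; the index name simply mirrors the 'data' folder from the question):

import os
import requests
from tika import parser  # tika-python spawns a local Tika server (needs Java)

ES_URL = "http://localhost:9200"
INDEX = "data"  # arbitrary index name

def crawl_and_index(root_dir):
    # Walk the folder, extract text + metadata with Tika, index via the REST API
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            parsed = parser.from_file(path)
            doc = {
                "path": path,
                "content": (parsed.get("content") or "").strip(),
                "metadata": parsed.get("metadata", {}),
            }
            resp = requests.post(f"{ES_URL}/{INDEX}/_doc", json=doc)
            resp.raise_for_status()  # let Elasticsearch assign the document id

if __name__ == "__main__":
    crawl_and_index("data")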
If you don't want to reinvent the wheel, have a look at the FSCrawler project. Here is a blog post describing how to solve the task you are facing.
Good luck!

Related

How to know where the Elastic Search Hits are coming from

I have an Elasticsearch cluster.
I am currently designing a Python service that clients will use to read from and write to my Elasticsearch cluster. The Python service will not be maintained by me; only that service, internally, will call our Elasticsearch cluster for fetching and writing.
Is there any way to configure Elasticsearch so that we can tell the requests are coming from the Python service, or any way to pass some extra fields while querying so that, based on those fields, we can find the requests in the logs?
There is no built-in feature in Elasticsearch that does exactly what you are asking for (checking the source and adding extra fields to the query),
but there is a solution for audit logs:
https://www.elastic.co/guide/en/elasticsearch/reference/current/enable-audit-logging.html
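One complementary option (an assumption worth verifying against your version's documentation): clients can send an X-Opaque-Id HTTP header with every request, and Elasticsearch surfaces that value in audit log entries, slow logs and the tasks API, which lets you tell the Python service's traffic apart from everything else. A minimal Python sketch, with a placeholder index name:

import requests

ES_URL = "http://localhost:9200"

# Tag every request issued by this service so it can be identified later
# in the audit/slow logs and the tasks API via its opaque id.
HEADERS = {"X-Opaque-Id": "python-read-write-service"}

resp = requests.post(
    f"{ES_URL}/my-index/_search",        # 'my-index' is a placeholder
    json={"query": {"match_all": {}}},
    headers=HEADERS,
)
print(resp.json())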
What you can do is place a proxy in front of it and do the logging there; we have an Apache reverse proxy in front of our Elastic clusters to enable SSL offloading and to add logging and ACL capabilities.

Why install logstash if I can just send the data through REST to elasticsearch?

I installed elasticsearch and kibana, and I'm following the tutorial.
https://www.elastic.co/guide/en/elasticsearch/reference/current/_index_and_query_a_document.html
And I'm inserting and reading data without any problems, e.g.:
PUT /customer/external/1?pretty
{
"name": "John Doe"
}
So, that makes me wonder, what do I need logstash or filebeats for?
My plan is to log each web request on a website to elasticsearch for analytics.
Do I need to install Logstash? I don't understand what I would need it for.
(I don't plan to store the data in a file.) I will read the request info (e.g. IP address, time, user_id, etc.) from a PHP script and simply send it through an HTTP REST request, as in the example above, to the Elasticsearch server, which will save the data anyway. So I don't see any reason to store the data on the web server (that is data duplication), and even if I wanted to, why would I need Logstash? I can just read a .log file and send it to Elasticsearch, like in this example: https://www.elastic.co/guide/en/elasticsearch/reference/current/_exploring_your_data.html
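For illustration, that "one HTTP request per hit" idea is only a few lines; a rough Python equivalent of the PHP script (the weblogs index name and the fields are made up, and the _doc endpoint assumes a recent Elasticsearch version):

import datetime
import requests

ES_URL = "http://localhost:9200"

def log_request(ip, user_id, path):
    # One document per web request; Elasticsearch generates the id.
    doc = {
        "ip": ip,
        "user_id": user_id,
        "path": path,
        "time": datetime.datetime.utcnow().isoformat(),
    }
    requests.post(f"{ES_URL}/weblogs/_doc", json=doc).raise_for_status()

log_request("203.0.113.7", 42, "/checkout")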
No, you do not have to install Logstash if you plan to collect, normalize and write your application data yourself. As you correctly assumed, Logstash would be a replacement for your PHP script.
Nevertheless, you might still consider having a look at Logstash. Since it is developed and maintained by the same company that takes care of Elasticsearch, you could benefit from upcoming changes and optimizations.
As you can read in the introduction, Logstash is a tool that reads data from multiple sources, normalizes it and writes the result to multiple destinations. For more details on which sources, filters and outputs Logstash offers, you should also take a look at the pipeline documentation.

Elastic search with Google Big Query

I have event logs loaded into Elasticsearch and I visualise them using Kibana. My event logs are actually stored in a Google BigQuery table. Currently I dump the JSON files to a Google Cloud Storage bucket and download them to a local drive. Then, using Logstash, I move the JSON files from the local drive into Elasticsearch.
Now I am trying to automate the process by establishing a connection between Google BigQuery and Elasticsearch. From what I have read, I understand that there is an output connector which sends data from Elasticsearch to Google BigQuery, but not vice versa. I am wondering whether I should upload the JSON files to a Kubernetes cluster and then establish the connection between that cluster and Elasticsearch.
Any help in this regard would be appreciated.
Although this solution may be a little complex, I suggest using the Google Cloud Storage connector together with ES-Hadoop. Both are very mature and used in production by many large companies.
Logstash running over a lot of pods on Kubernetes would be very expensive and, I think, not a very nice, resilient or scalable approach.
Apache Beam has connectors for both BigQuery and Elasticsearch. I would definitely do this with Dataflow, so you don't need to implement a complex ETL and staging storage. You can read the data from BigQuery using BigQueryIO.Read.from (take a look at BigQueryIO Read vs fromQuery if performance is important) and load it into Elasticsearch using ElasticsearchIO.write().
Refer to this example of how to read data from BigQuery with Dataflow:
https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-bigquery-transpose/src/main/java/com/google/cloud/pso/pipeline/Pivot.java
And this one for Elasticsearch indexing:
https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-elasticsearch-indexer
UPDATED 2019-06-24
The BigQuery Storage API was released recently this year; it improves the parallelism of extracting data from BigQuery and is natively supported by Dataflow. Refer to https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api for more details.
From the documentation:
The BigQuery Storage API allows you to directly access tables in BigQuery storage. As a result, your pipeline can read from BigQuery storage faster than previously possible.
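A rough sketch of that idea with the Beam Python SDK (BigQueryIO/ElasticsearchIO named above are the Java equivalents; the Python SDK has no built-in Elasticsearch sink, so the write below is a plain per-row ParDo, and the project, bucket, table, host and index names are all placeholders):

import json

import apache_beam as beam
import requests
from apache_beam.options.pipeline_options import PipelineOptions


class IndexToElasticsearch(beam.DoFn):
    """Indexes each BigQuery row as one Elasticsearch document via the REST API."""

    def __init__(self, es_url, index):
        self.es_url = es_url
        self.index = index

    def process(self, row):
        # default=str keeps the sketch simple for DATE/TIMESTAMP/NUMERIC values
        resp = requests.post(
            f"{self.es_url}/{self.index}/_doc",
            data=json.dumps(row, default=str),
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()


def run():
    options = PipelineOptions(
        runner="DataflowRunner", project="my-project", region="us-central1",
        temp_location="gs://my-bucket/tmp", save_main_session=True)
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
               table="my-project:my_dataset.event_logs",
               method=beam.io.ReadFromBigQuery.Method.DIRECT_READ)  # Storage API
         | "WriteToES" >> beam.ParDo(
               IndexToElasticsearch("http://10.0.0.5:9200", "event-logs")))


if __name__ == "__main__":
    run()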
I have recently worked on a similar pipeline. The workflow I would suggest would either use the Google Cloud Storage connector mentioned above, or another method, to read your JSON files into a Spark job. You should be able to quickly and easily transform your data, and then use the elasticsearch-spark plugin to load it into your Elasticsearch cluster.
You can use Google Cloud Dataproc or Cloud Dataflow to run and schedule your job.
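For the Spark route, the write side with the elasticsearch-spark (ES-Hadoop) connector looks roughly like this in PySpark (a sketch only: the connector jar must be supplied to Spark, the gs:// path relies on the Cloud Storage connector mentioned above, and the bucket, host and index names are placeholders):

from pyspark.sql import SparkSession

# Assumes the ES-Hadoop jar is supplied, e.g.
#   spark-submit --packages org.elasticsearch:elasticsearch-spark-30_2.12:<version> ...
spark = SparkSession.builder.appName("bq-json-to-es").getOrCreate()

# JSON files previously exported from BigQuery to a Cloud Storage bucket
df = spark.read.json("gs://my-bucket/event-logs/*.json")

(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "10.0.0.5")
   .option("es.port", "9200")
   .option("es.resource", "event-logs")   # target index
   .mode("append")
   .save())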
As of 2021, there is a Dataflow template that allows a "GCP native" connection between BigQuery and Elasticsearch.
More information is available in a blog post by elastic.co.
Further documentation and a step-by-step process are provided by Google.

Ways of monitoring a filesystem for new files and shipping them to Elasticsearch

Thanks for reading! I have the following problem.
I have a filesystem into which new files are regularly pushed:
/year/month/day/xxxxxxxx.csv
I need to monitor the filesystem for new files
I need to convert them to JSON
I need to ship them to Elasticsearch.
I am wondering what is the most reliable way of doing this.
I was looking at Logstash, but I am not sure how reliable the filesystem monitoring bit is. Also, the file server is actually a Windows machine.
Also I really want a fool-proof but very simple solution with not too many moving parts.
Is there any simple library out there that specializes in file-system monitoring, with a simple way to transform a given file format into JSON and bulk import it into Elasticsearch?
Thanks for any advice or suggestions.
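As a rough sketch of the do-it-yourself route described above, assuming Python with the watchdog and elasticsearch packages (the index name and watched path are placeholders, and note that on_created can fire before a file has been fully written, which a real setup would have to handle):

import csv
import time

from elasticsearch import Elasticsearch, helpers
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

es = Elasticsearch("http://localhost:9200")

def index_csv(path, index="csv-data"):
    """Convert each CSV row to a JSON document and bulk-index it."""
    with open(path, newline="") as f:
        actions = ({"_index": index, "_source": row} for row in csv.DictReader(f))
        helpers.bulk(es, actions)

class NewCsvHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory and event.src_path.endswith(".csv"):
            index_csv(event.src_path)

observer = Observer()
observer.schedule(NewCsvHandler(), path=r"D:\data", recursive=True)  # works on Windows
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()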
Filebeat might help. Then you send the stream to Logstash and apply a CSV filter.
FSCrawler does this kind of monitoring for sure, but only for JSON files or PDF/OpenOffice/Office documents and the like.
OK, we created Ambar for the same purpose you describe. It can crawl a folder -> extract data -> submit it to Elasticsearch. Check our website for more information: https://ambar.cloud
Another great tool is FSCrawler, which was mentioned by #dadoonet.

Crawler + Elasticsearch integration

I wasn't able to find out how to crawl a website and index the data into Elasticsearch. I managed to do that with the combination Nutch + Solr, and since Nutch should be able to export data directly to Elasticsearch from version 1.8 onwards (source), I tried to use Nutch again. Nevertheless, I didn't succeed. After trying to invoke
$ bin/nutch elasticindex
I get:
Error: Could not find or load main class elasticindex
I don't insist on using Nutch. I just need the simplest way to crawl websites and index them into Elasticsearch. The problem is that I wasn't able to find any step-by-step tutorial, and I'm quite new to these technologies.
So the question is: what would be the simplest solution to integrate a crawler with Elasticsearch? If possible, I would be grateful for a step-by-step solution.
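For what it's worth, a very small crawler-plus-indexer fits in a few lines; a rough Python sketch only (requests, beautifulsoup4 and a recent elasticsearch client are assumed, the 'web' index name is made up, and there is no robots.txt or politeness handling):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def crawl(start_url, max_pages=50, index="web"):
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        # Index the page title and visible text (older clients use body= instead of document=)
        es.index(index=index, document={
            "url": url,
            "title": soup.title.string if soup.title and soup.title.string else "",
            "text": soup.get_text(" ", strip=True),
        })
        # Queue the links found on this page
        for a in soup.find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))

crawl("https://example.com")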
Did you have a look at the River Web plugin? https://github.com/codelibs/elasticsearch-river-web
It provides a good How To section, including creating the required indexes, scheduling (based on Quartz), authentication (basic and NTLM are supported), metadata extraction, ...
Might be worth having a look at the elasticsearch river plugins overview as well: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html#river
Since the River plugins have been deprecated, it may be worth having a look at ManifoldCF or Norconex Collectors.
You can evaluate indexing Common Crawl metadata into Elasticsearch using Hadoop:
When working with big volumes of data, Hadoop provides all the power to parallelize the data ingestion.
Here is an example that uses Cascading to index directly into Elasticsearch:
http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch
The process involves the use of a Hadoop cluster (EMR in this example) running the Cascading application that indexes the JSON metadata directly into Elasticsearch.
Cascading source code is also available to understand how to handle the data ingestion in Elasticsearch.
