Loading data from S3 to Elasticsearch using AWS Glue

I have multiple folders in an S3 bucket, and each folder contains one JSON Lines file.
I want to do two things with this data:
1. Apply some transformations to get tabular data and save it to a database.
2. Save these JSON objects as they are to an Elasticsearch cluster for full-text search.
I am using AWS Glue for this task and I know how to do 1, but I can't find any resources that talk about getting data from S3 and storing it in Elasticsearch using AWS Glue.
Is there a way to do this?

If anyone is looking for an answer to this: I used Logstash to load the files into Elasticsearch.

Related

How to ingest Parquet files residing on AWS S3 into Druid

I'm very new to Druid and want to know how to ingest Parquet files on S3 into Druid.
We get data in CSV format and we standardise it to Parquet format in the Data Lake. This then needs to be loaded into Druid.
Instead of trying to ingest Parquet files from S3, I streamed the data to a Kinesis stream and used that as a source for Druid.
You have to add druid-parquet-extensions to druid.extensions.loadList in the common.runtime.properties file.
After that, restart the Druid server.
However, only ingesting a Parquet file from a local source is documented. I couldn't verify loading from S3 because my files were encrypted.
Try adding the above extension and then read from S3 just like you'd ingest a regular file from S3.
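For the S3 case, newer Druid releases describe an S3 input source and a Parquet input format for native batch ingestion. The following is only a sketch and has not been verified against the setup described above: it builds such an ingestion spec as a Python dict. The datasource name, bucket prefix, timestamp column, and dimension names are hypothetical, and it assumes druid-s3-extensions is loaded alongside druid-parquet-extensions.

```python
import json


def build_parquet_s3_spec(datasource, s3_prefix, timestamp_column, dimensions):
    """Build a native batch (index_parallel) spec reading Parquet files from S3."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": timestamp_column, "format": "auto"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "day",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                # Assumption: every object under this prefix is a Parquet file.
                "inputSource": {"type": "s3", "prefixes": [s3_prefix]},
                "inputFormat": {"type": "parquet"},
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


def spec_as_json(spec):
    """Serialise the spec so it can be posted to Druid or saved to a file."""
    return json.dumps(spec, indent=2)
```

The resulting JSON would be submitted to the Overlord task endpoint (POST /druid/indexer/v1/task) in the same way as any other batch ingestion spec.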

Read documents with Elastic Search

I have an information retrieval assignment where I have to use Elasticsearch to generate some indexing/ranking. I was able to download Elasticsearch and it's now running on http://localhost:9200/, but how do I read every document stored in my folder called 'data'?
Elasticsearch is just a search engine. To make your docs and files searchable, you need to read them, extract all relevant data, and load it into Elasticsearch.
Apache Tika is a solution for extracting the data out of the files. Write a file system crawler using Tika. Then use the REST API to index the data.
If you don't want to reinvent the wheel, have a look at the FSCrawler project. Here is a blogpost describing how to solve the task you are facing.
Good luck!
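The "crawl the folder, then index over the REST API" idea can be sketched as follows. This is a minimal illustration, not a replacement for Tika or FSCrawler: it assumes the files in 'data' are plain text (no PDF or Office extraction), that the cluster at http://localhost:9200 needs no authentication, and that the index name "documents" is acceptable. The function names are made up for this example.

```python
import json
from pathlib import Path
from urllib import request


def iter_documents(folder):
    """Yield one JSON-serialisable dict per regular file under `folder`."""
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            yield {
                "path": str(path),
                "filename": path.name,
                # errors="replace" keeps the sketch from failing on odd bytes.
                "content": path.read_text(encoding="utf-8", errors="replace"),
            }


def build_index_request(base_url, index, doc):
    """Build a POST /<index>/_doc request; Elasticsearch assigns the document ID."""
    return request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_folder(folder, base_url="http://localhost:9200", index="documents"):
    """Send every file in `folder` to Elasticsearch and return how many were sent."""
    count = 0
    for doc in iter_documents(folder):
        with request.urlopen(build_index_request(base_url, index, doc)) as resp:
            resp.read()  # drain the response; a real crawler would check it
        count += 1
    return count
```

Calling index_folder("data") would then index each file as one document with its text in the "content" field, which a match query can search and rank.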

Migrate data from snowflake to elasticsearch

We are using the Snowflake data warehouse in my project, and we would like to replace Snowflake with Elasticsearch as part of a project enhancement POC.
I haven't found any solutions for moving data from Snowflake to Elasticsearch.
Can anyone help me resolve this?
Please share sufficient information, steps, etc.
Thanks in advance.
I haven't found any clues on data migration.
You can try to do it in 2 steps:
1. Export the data from Snowflake to an AWS S3 bucket.
2. Load the data from the AWS S3 bucket into Elasticsearch.
You need to implement the migration at the schema level. Moreover, if you add the specific issues you are facing to the question, it will be easier to answer and guide you.
You can use the COPY INTO command to export data from Snowflake to a file that can then be loaded into another system. However, I am curious to know why you are trying to replace Snowflake with Elasticsearch, as these are two different technologies serving very different functions.
You can export your data from Snowflake to S3 with the COPY INTO command.
Export in multiple parts so your S3 bucket holds small files.
Then you can hook a Lambda to the S3 PUT Object event, so that each file upload triggers the Lambda.
You can write code in your Lambda to make REST calls to Elasticsearch.
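A Lambda along those lines could look like the sketch below. It assumes the exported files are uncompressed newline-delimited JSON, that the Elasticsearch endpoint is reachable from the Lambda without request signing, and that ES_URL and ES_INDEX (both hypothetical values) are set appropriately. It uses the Bulk API, where each document is preceded by an action line, and it does not inspect per-item errors in the bulk response.

```python
import json
from urllib import request
from urllib.parse import unquote_plus

ES_URL = "http://localhost:9200"  # hypothetical endpoint
ES_INDEX = "snowflake_export"     # hypothetical index name


def s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    return [
        # Object keys arrive URL-encoded in S3 events, so decode them.
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]


def to_bulk_body(json_lines, index):
    """Turn newline-delimited JSON into a Bulk API body (action line + source line)."""
    out = []
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)  # fail early on malformed rows
        out.append(json.dumps({"index": {"_index": index}}))
        out.append(json.dumps(doc))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(out) + "\n" if out else ""


def handler(event, context):
    """Lambda entry point: read each uploaded file and bulk-index its rows."""
    import boto3  # available in the AWS Lambda Python runtime

    s3 = boto3.client("s3")
    sent = 0
    for bucket, key in s3_objects(event):
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        body = to_bulk_body(text, ES_INDEX)
        if not body:
            continue
        req = request.Request(
            f"{ES_URL}/_bulk",
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/x-ndjson"},
            method="POST",
        )
        with request.urlopen(req) as resp:
            resp.read()
        sent += 1
    return {"files_sent": sent}
```

Keeping the exported files small, as suggested above, also keeps each bulk request small enough to fit comfortably within the Lambda's memory and timeout.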

Backup elasticsearch index

Is it possible to get a backup of an index from Elasticsearch through the HTTP REST interface?
Can I just send an HTTP request and get a snapshot without creating a snapshot repository?
Want to store it as an Elasticsearch-restorable file?
You can store snapshots of individual indices or an entire cluster in a remote repository like a shared file system, S3, or HDFS.
Want to store it as JSON so you can use the data outside ES?
elasticdump works by sending an input to an output. Both can be either an Elasticsearch URL or a file.
CSV?
https://github.com/taraslayshchuk/es2csv
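For the snapshot option, the REST calls involved can be sketched as below. As far as I know, the snapshot API always needs a registered repository first, so a snapshot takes two HTTP requests rather than one. This sketch only builds the requests. The repository name, snapshot name, and location are hypothetical, and for an "fs" repository the location must also be listed under path.repo in elasticsearch.yml on every node.

```python
import json
from urllib import request


def _put(url, body):
    """Build a PUT request with a JSON body."""
    return request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_repository_request(base_url, repo, location):
    """PUT /_snapshot/<repo>: register a shared file system repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return _put(f"{base_url.rstrip('/')}/_snapshot/{repo}", body)


def build_snapshot_request(base_url, repo, snapshot, indices):
    """PUT /_snapshot/<repo>/<snapshot>: snapshot only the listed indices."""
    body = {"indices": ",".join(indices), "include_global_state": False}
    url = (
        f"{base_url.rstrip('/')}/_snapshot/{repo}/{snapshot}"
        "?wait_for_completion=true"
    )
    return _put(url, body)
```

Each request can be sent with urllib.request.urlopen(req). With wait_for_completion=true the second call returns only once the snapshot has finished.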

Elasticsearch with Google BigQuery

I have event logs loaded in Elasticsearch and I visualise them using Kibana. My event logs are actually stored in a Google BigQuery table. Currently I dump the JSON files to a Google Cloud Storage bucket and download them to a local drive. Then, using Logstash, I move the JSON files from the local drive to Elasticsearch.
Now I am trying to automate the process by establishing a connection between Google BigQuery and Elasticsearch. From what I have read, I understand that there is an output connector which sends data from Elasticsearch to Google BigQuery, but not vice versa. I am wondering whether I should upload the JSON files to a Kubernetes cluster and then establish the connection between the cluster and Elasticsearch.
Any help with this regard would be appreciated.
Although this solution may be a little complex, I suggest using the Google Cloud Storage connector with ES-Hadoop. Both are very mature and are used in production by many large companies.
Running Logstash over a lot of pods on Kubernetes will be very expensive and, I think, not a very nice, resilient, or scalable approach.
Apache Beam has connectors for BigQuery and Elasticsearch. I would definitely do this with Dataflow, so you don't need to implement a complex ETL and staging storage. You can read the data from BigQuery using BigQueryIO.Read.from (if performance is important, take a look at "BigQueryIO Read vs fromQuery") and load it into Elasticsearch using ElasticsearchIO.write().
For reading data from BigQuery in Dataflow, refer to:
https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-bigquery-transpose/src/main/java/com/google/cloud/pso/pipeline/Pivot.java
For Elasticsearch indexing, refer to:
https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-elasticsearch-indexer
UPDATED 2019-06-24
Earlier this year the BigQuery Storage API was released. It improves parallelism when extracting data from BigQuery and is natively supported by Dataflow. Refer to https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api for more details.
From the documentation:
The BigQuery Storage API allows you to directly access tables in BigQuery storage. As a result, your pipeline can read from BigQuery storage faster than previously possible.
I have recently worked on a similar pipeline. The workflow I would suggest is to use either the Google Cloud Storage connector mentioned above or some other method to read your JSON files into a Spark job. You should be able to transform your data quickly and easily, and then use the elasticsearch-spark plugin to load that data into your Elasticsearch cluster.
You can use Google Cloud Dataproc or Cloud Dataflow to run and schedule your job.
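The Spark workflow described above can be sketched in PySpark roughly as follows. This is an illustration under assumptions: the elasticsearch-spark (ES-Hadoop) jar and the Cloud Storage connector must already be available to the cluster, the SparkSession is created elsewhere and passed in, and the gs:// path, host, and index names are placeholders. The data source name org.elasticsearch.spark.sql and the es.nodes / es.port settings come from ES-Hadoop; the helper names are made up here.

```python
def es_write_options(nodes, port=9200):
    """Connection settings understood by the ES-Hadoop Spark SQL data source."""
    return {"es.nodes": nodes, "es.port": str(port)}


def load_json(spark, path):
    """Read newline-delimited JSON (e.g. a gs:// path) into a DataFrame."""
    return spark.read.json(path)


def write_to_elasticsearch(df, index, options):
    """Append the DataFrame's rows to an Elasticsearch index via elasticsearch-spark."""
    writer = df.write.format("org.elasticsearch.spark.sql")
    for key, value in options.items():
        writer = writer.option(key, value)
    writer.mode("append").save(index)
```

Inside a job this would be used as write_to_elasticsearch(load_json(spark, "gs://my-bucket/events/*.json"), "events", es_write_options("es-host")), with any transformations applied to the DataFrame in between.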
As of 2021, there is a Dataflow template that allows a "GCP native" connection between BigQuery and Elasticsearch.
More information here in a blog post by elastic.co
Further documentation and step by step process by google
