How to insert data into Elasticsearch from a CSV file in Go? - go

I want to build a feature in my Go microservice, where I can upload a CSV file and insert the data inside it into Elasticsearch. The columns can vary with every file. I am familiar with the file uploading part, but could not find any efficient method to insert the data. Is there any Go library to insert data into Elasticsearch from a CSV file?

You can use the official Elasticsearch Go client.
You can use the Bulk API to index multiple documents together; see the Bulk example in the client's documentation.
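For reference, here is a minimal sketch of that approach, assuming the official go-elasticsearch v8 client and its esutil bulk helper; the index name, file name, and the "first row is the header" assumption are placeholders to adapt to your upload handler:

```go
package main

import (
	"bytes"
	"context"
	"encoding/csv"
	"encoding/json"
	"log"
	"os"

	"github.com/elastic/go-elasticsearch/v8"
	"github.com/elastic/go-elasticsearch/v8/esutil"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// The bulk indexer batches documents and sends them via the Bulk API.
	bi, err := esutil.NewBulkIndexer(esutil.BulkIndexerConfig{
		Client: es,
		Index:  "csv-data", // placeholder index name
	})
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("upload.csv") // in a microservice this would be the uploaded file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	header, err := r.Read() // assume the first row holds the column names
	if err != nil {
		log.Fatal(err)
	}

	for {
		record, err := r.Read()
		if err != nil {
			break // io.EOF (or a parse error) ends the loop
		}
		// Build a document keyed by the header, so columns can vary per file.
		doc := make(map[string]string, len(header))
		for i, col := range header {
			doc[col] = record[i]
		}
		body, _ := json.Marshal(doc)
		if err := bi.Add(context.Background(), esutil.BulkIndexerItem{
			Action: "index",
			Body:   bytes.NewReader(body),
		}); err != nil {
			log.Fatal(err)
		}
	}

	if err := bi.Close(context.Background()); err != nil {
		log.Fatal(err)
	}
	log.Printf("indexed %d documents", bi.Stats().NumFlushed)
}
```

The bulk indexer flushes in batches under the hood, which is much faster than sending one index request per CSV row.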

Related

Elasticsearch index from csv file

I have to set up a PHP project that maintains an Elasticsearch server. The client provides me with a large txt/csv file containing all the data that I want to import (update) into an index in Elasticsearch. With bulk operations I have to specify a valid JSON structure, but I have a text file instead.
Is there a way of doing that using the Elasticsearch API, or is it possible at all without converting the CSV file to JSON?
I am totally new to Elasticsearch and am having difficulty finding a solution.

Using ElasticSearch Bulk to update and create documents dynamically?

I'm currently using Elasticsearch and running a cron job every 10 minutes that will find newly created/updated data in my DB and sync it with Elasticsearch. However, I want to use bulk to sync instead of making an arbitrary number of requests to update/create documents in an index. I'm using the elasticsearch.js library created by Elastic.
I face 2 challenges that I'm uncertain about how to handle:
How to use bulk to update a document if it exists and create a document if it doesn't within bulk without knowing if it exists in the index.
How to format a large amount of JSON to run through bulk to update/create the document because bulk api expects the body to be formatted a certain way.
The best option when trying to stream data in from an SQL database is to use Logstash's JDBC input to do it for you (see the documentation); it can hopefully handle it all for you.
Not all SQL schemas make this easy, so for your specific questions:
How to use bulk to update a document if it exists and create a document if it doesn't within bulk without knowing if it exists in the index.
Bulk currently accepts four different types of sub-requests, which behave differently than you probably expect coming from an SQL world:
index
create
update
delete
The first, index, is the most commonly used option. It means that you want to index (the verb) something into the Elasticsearch index (the noun). However, if a document with the same _id already exists in the index, it will be replaced. The rest are probably a bit more obvious.
Each one of the sub-requests behaves like the individual request that it's associated with (so update is an UpdateRequest under the hood, delete is a DeleteRequest, and index is an IndexRequest). In the case of create, it is a specialization of index, which effectively says "add this if it doesn't exist, but fail if it does exist".
How to format a large amount of JSON to run through bulk to update/create the document because bulk api expects the body to be formatted a certain way.
You should look into using either the Logstash approach or any of the existing client language libraries, such as the Python client, which should work well from cron. The clients will take care of the formatting for you. One for your preferred language most likely already exists.
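For reference, the body the Bulk API expects is newline-delimited JSON: each action/metadata line is followed by an optional source line. A rough sketch of the four action types (the index name, IDs, and fields are made up; depending on your Elasticsearch version the metadata may also need a _type):

```
{ "index" : { "_index" : "myindex", "_id" : "1" } }
{ "title" : "indexed, or replaced if _id 1 already exists" }
{ "create" : { "_index" : "myindex", "_id" : "2" } }
{ "title" : "fails if _id 2 already exists" }
{ "update" : { "_index" : "myindex", "_id" : "3" } }
{ "doc" : { "title" : "partial update" }, "doc_as_upsert" : true }
{ "delete" : { "_index" : "myindex", "_id" : "4" } }
```

An update action with doc_as_upsert set to true is one common way to get "update if it exists, create if it doesn't" within a single bulk request; the client libraries mentioned above will build this body for you.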

Is there any possibility to extract Google Analytics data and post that to Elastic Search?

I have been working on ways to import raw Google Analytics data without having to use a premium account. So far this is the nearest link to what I want to do:
How to extract data from Google Analytics and build a data warehouse (webhouse) from it?
I want to load that data into Elasticsearch and display it using Kibana. What is the best ETL approach for this? Has anyone tried to display GA data using the ELK stack?
You should do it in two steps.
First, get the data. A very useful reference is https://developers.google.com/webmaster-tools/v3/how-tos/search_analytics, but you first need a Google Webmaster Tools account and OAuth credentials created on https://console.developers.google.com/apis.
Then, once you have your data, find a way to import it into Elasticsearch. I'm still looking for the best way to do so; maybe transform the result table into CSV and then use https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html.
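For instance, a minimal Logstash pipeline along those lines might look like the following; the file path, column names, and index name are placeholders:

```
input {
  file {
    path => "/tmp/ga_export.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    separator => ","
    columns => ["date", "page", "sessions", "pageviews"]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "google-analytics"
  }
}
```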
Have a look at this:
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-http_poller.html
You can use this to poll an endpoint, in this case GA, and load the response data into Elasticsearch. You may want to filter the response with the Split and / or Mutate plugins as well.
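A bare-bones http_poller pipeline, just to illustrate the shape; the URL, schedule, split field, and index name are placeholders, and a real Google Analytics endpoint would also need authentication:

```
input {
  http_poller {
    urls => {
      ga_report => "https://example.com/ga/report.json"
    }
    schedule => { every => "10m" }
    codec => "json"
  }
}

filter {
  split {
    field => "rows"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "ga-data"
  }
}
```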
I have done this same setup.
Extracted data from Google Analytics with 7 dimensions and 6 metrics, of which 2 dimensions (Timestamp and ID) formed the primary key. This was done using R.
Did some transformations on the data using linux awk and sed commands.
Loaded the data into Apache Hive with row/column formatting, creating a total of 9 tables.
Joined all 9 tables in Hive using Hive join queries on the 2 primary keys.
Used elasticsearch-hadoop connector to load the final resulting table to elasticsearch. Had to do a little data transformations to match Hive and Elasticsearch data types.
Used Kibana to visualize the data in Elasticsearch.
Now I am planning to avoid all the manual steps and somehow automate all the steps above.

Elasticsearch: Indexing tweets - mapping, template or ETL

I am about to index tweets coming from Apache NiFi to Elasticsearch as POST and want to do the following:
Make the created_at field a date. Should I use a mapping or an index template for this?
Make some fields not analyzed, like hashtags, URLs, etc.
Store not the entire tweet but only some important fields: the text, a few user fields (not all the user information), and the hashtags and URLs from the entities (the URLs in the post). I don't need the quoted source, etc.
What should I use in this case? A template? Or pre-process the tweets with some ETL process in order to extract the data I need and index that in ES?
I am a bit confused and would really appreciate advice.
Thanks in advance.
I guess in your NiFi flow you have something like GetTwitter and PostHTTP configured. NiFi is already some sort of ETL, so you probably don't need another one. However, since you don't want to index the whole JSON coming out of Twitter, you clearly need another NiFi processor in between to select what you want and transform the raw JSON into a more lightweight one. Here is an example of how to do it for Solr, but I'm not sure the same processor exists for Elasticsearch.
This article about streaming Twitter data to Elasticsearch using Logstash shows a possible index template that you could use in order to mold your own (i.e. add the created_at date field if you like).
The way to go, since you don't want to index everything, is clearly to come up with your own mapping, which you can then use in an index template. Using index templates, you will be able to create daily/weekly/monthly Twitter indices as you see fit.
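As a rough sketch of what such a template might look like (the field names, not_analyzed settings, and Twitter-style date format here are assumptions to adapt to your ES version and to the fields you actually keep):

```
PUT _template/twitter
{
  "template": "twitter-*",
  "mappings": {
    "tweet": {
      "properties": {
        "created_at": { "type": "date", "format": "EEE MMM dd HH:mm:ss Z yyyy" },
        "text":       { "type": "string" },
        "hashtags":   { "type": "string", "index": "not_analyzed" },
        "urls":       { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
```

With the template matching twitter-*, every new daily/weekly/monthly index picks up the same mapping automatically.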

Get all documents from an index of an elasticsearch cluster and index it in another elasticsearch cluster

My goal here is to get all documents from an index of an ES cluster and insert them in another ES cluster keeping the same metadata.
I had a look at the mget API to retrieve data and the Bulk API to insert it, but the Bulk API needs a special structure:
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
So my idea is to retrieve my data from EScluster1 into a file, rearrange it to match the structure of the Bulk API, and index it into EScluster2.
Do you see a better and/or faster way to proceed?
elasticdump does this. If you want to do it manually, you'll want to query using scroll and then bulk index what comes out of that; not too hard to script together. With elasticdump you can pump the data around without writing to a file. However, it is somewhat limited when you have e.g. parent/child relations in your index.
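For reference, copying one index between clusters with elasticdump looks roughly like this (host and index names are placeholders); copy the mapping first, then the data:

```
elasticdump \
  --input=http://escluster1:9200/my_index \
  --output=http://escluster2:9200/my_index \
  --type=mapping

elasticdump \
  --input=http://escluster1:9200/my_index \
  --output=http://escluster2:9200/my_index \
  --type=data
```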
