Elasticsearch index from CSV file

I have to set up a PHP project that maintains an Elasticsearch server. The client provides me with a large TXT/CSV file containing all the data, which I want to import (update) into an index in Elasticsearch. Bulk operations require a valid JSON structure, but I have a text file instead.
Is there a way of doing that using the Elasticsearch API? Or is it possible at all without converting the CSV file to JSON?
I am totally new to Elasticsearch and am having difficulty finding a solution.
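For what it's worth, the _bulk endpoint expects newline-delimited JSON (an action line followed by a document line per row), so each CSV row has to become such a pair one way or another, whether through a small script, the official PHP client's bulk call, or Logstash's csv filter. A minimal payload, with the index name and fields invented purely for illustration, looks like this:

    POST /_bulk
    { "index": { "_index": "customers", "_id": "1" } }
    { "name": "Alice", "city": "Sofia" }
    { "index": { "_index": "customers", "_id": "2" } }
    { "name": "Bob", "city": "Plovdiv" }

Note that the body must be terminated by a newline, or Elasticsearch will reject the request.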

Related

How to insert data into Elasticsearch from a CSV file in Go?

I want to build a feature in my Go microservice where I can upload a CSV file and insert the data inside it into Elasticsearch. The columns can vary with every file. I am familiar with the file uploading part, but I could not find any efficient method to insert the data. Is there any Go library to insert data into Elasticsearch from a CSV file?
You can use the official Elasticsearch Go client.
You can use the bulk API to index multiple documents together. Please check the Bulk example here.
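A hedged sketch of that approach, assuming the go-elasticsearch v8 client and its esutil.BulkIndexer helper (the index name "csv-data", the file name, and the header-driven field mapping are placeholders):

    // Reads a CSV whose first row is the header and bulk-indexes every
    // remaining row as a JSON document with the header values as field names.
    package main

    import (
        "bytes"
        "context"
        "encoding/csv"
        "encoding/json"
        "log"
        "os"

        "github.com/elastic/go-elasticsearch/v8"
        "github.com/elastic/go-elasticsearch/v8/esutil"
    )

    func main() {
        f, err := os.Open("data.csv")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        r := csv.NewReader(f)
        header, err := r.Read() // the first row supplies the field names
        if err != nil {
            log.Fatal(err)
        }

        es, err := elasticsearch.NewDefaultClient()
        if err != nil {
            log.Fatal(err)
        }
        bi, err := esutil.NewBulkIndexer(esutil.BulkIndexerConfig{
            Client: es,
            Index:  "csv-data",
        })
        if err != nil {
            log.Fatal(err)
        }

        for {
            record, err := r.Read()
            if err != nil {
                break // io.EOF (or a malformed row) ends the loop in this sketch
            }
            doc := make(map[string]string, len(header))
            for i, col := range header {
                doc[col] = record[i]
            }
            body, _ := json.Marshal(doc)
            if err := bi.Add(context.Background(), esutil.BulkIndexerItem{
                Action: "index",
                Body:   bytes.NewReader(body),
            }); err != nil {
                log.Fatal(err)
            }
        }
        if err := bi.Close(context.Background()); err != nil {
            log.Fatal(err)
        }
        log.Printf("done, %d failed items", bi.Stats().NumFailed)
    }

Since the columns vary with every file, the header row drives the document shape; the BulkIndexer batches and flushes the bulk requests for you.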

What does Elasticsearch store, and how?

Elasticsearch is a search engine, according to Wikipedia. This implies it is not a database, and does not store the data it is indexing (but presumably does store its indexes).
There are presumably two ways to get data into ES: log shipping, or directly via the API.
Let's say my app wants to write an old-fashioned log file entry:
Logger.error(now() + " something bad happened in module " + module + "; " + message)
This could either write to a file or put the data directly into ES using a REST API.
If it was done via the REST API, does ES store the entire log message, in which case you don't need to waste disk space writing the logs to files for compliance etc.? Or does it only index the data, so you need to keep a separate copy? If you delete or move the original log file, how does ES know, and is what it does store still useful?
If you write to a log file and then use Logstash or similar to "put the log data in ES", does ES store the entire log file as well as any indexes?
How does ES parse or index arbitrary log files? Does it treat a log line as a single string, or does it require logs to have a specific format such as CSV or JSON?
Does anyone know of a resource with this key info?
Elasticsearch does store the data you are indexing.
When you ingest data into Elasticsearch, the data is stored in one or more indices and can then be searched. To be able to search something with Elasticsearch you need to store the data in Elasticsearch; it cannot, for example, search external files.
In your example, if you have an app sending logs to Elasticsearch, it will store the entire message you send, and once it is in Elasticsearch you don't need the original log anymore.
If you need to parse your documents into different fields, you can do it before sending the log to Elasticsearch as a JSON document, use Logstash to do this, or use an ingest pipeline in Elasticsearch (a sketch of the last option follows below).
A good starting point to learn more about how it works is the official documentation.
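As a hedged illustration of the ingest pipeline option mentioned above: a pipeline with a grok processor can split a raw log line into separate fields at index time. The pipeline name, target field names, and pattern below are examples only and have to be adapted to your actual log format:

    PUT _ingest/pipeline/parse-app-log
    {
      "processors": [
        {
          "grok": {
            "field": "message",
            "patterns": ["%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:text}"]
          }
        }
      ]
    }

Documents indexed with ?pipeline=parse-app-log then carry the extracted fields alongside the original message field.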

In Kibana, can I visualize or extract image and video data from Hadoop (Hive) using Elasticsearch and the Hadoop connector?

While working with self-driving-car data, I have a requirement that, given a range of timestamps, I need to search Elasticsearch and get the vehicle location image, video, latitude, longitude, and speed of the vehicle. The input data will be loaded into Hadoop using Kafka + Spark Streaming; this transformed data contains JSON files, images, and videos. The JSON file has attributes like course, speed, timestamp, latitude, longitude, and accuracy. The JSON file name matches the corresponding image and video names; only the extensions (.JPG, .JSON) differ. As the search should fetch results very fast on petabytes of data, we are asked to use Elasticsearch and Kibana here, integrated with Hadoop (Impala or Presto may not match the performance of Kibana).
The problem is that we can store image and video data in Hadoop using the sequence file format, but can the same data be indexed in ES with the Hadoop integration? Or can we fetch image or video data directly from Hive into Kibana while searching? Or is there an alternate way? Your help is much appreciated.

How to index PDF files from HDFS to Solr

I am new to Apache Solr.
I have a requirement in my project where I have to upload PDF documents from HDFS to Solr, and from there I want to retrieve them using Solr's REST APIs.
I have a total of 40k PDF documents in my local file system; first I will push them to HDFS, but from there to Solr I really don't have any idea.
Another thing is that while indexing into Solr, I want to read some data from the PDF document and index that data into Solr as well.
Example: I want to extract the candidate name and candidate location from the PDF document and push them into a Solr schema which looks like:
name: "candidate_name"
location: "candidate_location"
document: "pdf_document"
I searched for this over the internet, but couldn't find the right solution.
Try using https://github.com/lucidworks/hadoop-solr
You should try the DirectoryIngestMapper; it has Tika parsing, but you will have to customize it.
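If hadoop-solr is more machinery than you need, one alternative (my suggestion, not part of the answer above) is Solr's Extracting Request Handler (Solr Cell), which runs Tika on the server side: you POST the PDF and pass any fields you already know as literal.* parameters. A rough Go sketch, where the core name "resumes", the field values, and the file path are placeholders, and the /update/extract handler must be enabled in solrconfig.xml:

    // Hypothetical sketch: index one local PDF into Solr via the Extracting
    // Request Handler (Solr Cell); Tika extracts the text on the Solr side.
    package main

    import (
        "fmt"
        "log"
        "net/http"
        "net/url"
        "os"
    )

    func main() {
        f, err := os.Open("candidate.pdf") // example file
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        params := url.Values{}
        params.Set("literal.id", "candidate-001")
        params.Set("literal.name", "John Doe")      // values you already know;
        params.Set("literal.location", "Bangalore") // Tika only extracts raw text
        params.Set("commit", "true")

        endpoint := "http://localhost:8983/solr/resumes/update/extract?" + params.Encode()
        resp, err := http.Post(endpoint, "application/pdf", f)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        fmt.Println("Solr responded:", resp.Status)
    }

Pulling the candidate name and location out of the PDF text itself still needs your own extraction step (for example Tika plus some pattern matching) before you set the literal.* values.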

Does ElasticSearch store a duplicate copy of each record?

I started looking into ElasticSearch, and most examples of creating and reading involve POSTing data to the ElasticSearch server and then doing a GET to retrieve them.
Is this data that is POSTed stored separately by the ElasticSearch server? So, if I want to use ElasticSearch with MongoDB, does the raw data, not including the search indices, get stored twice (one copy for MongoDB and one for ElasticSearch)?
In conjunction with an answer to this question, a description or a link to a description of how ElasticSearch and the primary data store interact would be very helpful.
Yes, ElasticSearch can only search within its own data store, so a separate copy will be there.
You can use mongo-connector to keep the data in Elasticsearch in sync with the MongoDB database: https://github.com/mongodb-labs/mongo-connector
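For completeness, mongo-connector is run from the command line; an invocation along these lines (the hosts and the doc-manager name are assumptions, check the project's README) tails the MongoDB oplog and mirrors changes into Elasticsearch, which also means MongoDB has to run as a replica set:

    mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manager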
