Elasticsearch: Indexing tweets - mapping, template or ETL

I am about to index tweets coming from Apache NiFi into Elasticsearch via POST and want to do the following:
Make the create_at field a date. Should I use a mapping or an index template for this?
Make some fields not analyzed, like hashtags, URLs, etc.
Store not the entire tweet but only some important fields: the text, a few user fields rather than all user information, and the hashtags and URLs from entities (in post URLs). I don't need the quoted source, etc.
What should I use in this case? A template? Should I pre-process the tweets with some ETL process in order to extract the data I need and index it in ES?
I am a bit confused and will really appreciate advice.
Thanks in advance.

I guess in your NiFi flow you have something like GetTwitter and PostHTTP configured. NiFi is already some sort of ETL, so you probably don't need another one. However, since you don't want to index the whole JSON coming out of Twitter, you clearly need another NiFi processor in between to select what you want and transform the raw JSON into a more lightweight one. Here is an example of how to do it for Solr, but I'm not sure the same processor exists for Elasticsearch.
This article about streaming Twitter data to Elasticsearch using Logstash shows a possible index template that you could use to mold your own (i.e. add the create_at date field if you like).
Since you don't want to index everything, the way to go is clearly to come up with your own mapping, which you can then use in an index template. With index templates, you will be able to create daily/weekly/monthly Twitter indices as you see fit.
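To make this concrete, here is a minimal sketch of what such an index template body could look like, built in Python. It uses ES 2.x-era syntax (string type with "not_analyzed"), matching the question's terminology; the index pattern, type name, and field names are illustrative assumptions (Twitter's raw field is spelled created_at).

```python
import json

# Hypothetical index template for daily tweet indices (ES 2.x syntax).
# You would PUT this body to /_template/tweets on your cluster.
tweet_template = {
    "template": "tweets-*",  # applies to tweets-2016.01.01, tweets-2016.01.02, ...
    "mappings": {
        "tweet": {
            "properties": {
                "created_at": {
                    "type": "date",
                    # Twitter's created_at format, e.g. "Wed Aug 27 13:08:45 +0000 2008"
                    "format": "EEE MMM dd HH:mm:ss Z yyyy",
                },
                "text": {"type": "string"},
                # exact-match fields: no analysis
                "hashtags": {"type": "string", "index": "not_analyzed"},
                "urls": {"type": "string", "index": "not_analyzed"},
            }
        }
    },
}

print(json.dumps(tweet_template, indent=2))
```

With a template like this in place, every new daily index matching tweets-* picks up the mapping automatically.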

Related

Is there any tool out there for generating elasticsearch mapping

Mostly what I do is assemble the mapping by hand, choosing the correct types myself.
Is there any tool that facilitates this?
For example, one that reads a class (C#, Java, etc.) and chooses the closest ES types accordingly.
I've never seen such a tool; however, I know that Elasticsearch has a REST API over HTTP.
So you can create a simple HTTP request with a JSON body that depicts your object with your fields: field names + types (strings, numbers, booleans) - pretty much like the Java/C# class you described in the question.
Then you can ask ES to store that data in a non-existing index (to "index" your document, in ES terms). It will index the document, but it will also create the index and, most importantly for your question, will create a mapping for you "dynamically", so that later you can query the mapping structure (again via REST).
Here is the link to the relevant chapter about dynamically created mappings in the ES documentation.
And here you can find the API for querying the mapping structure.
At the end of the day you'd still want to retain some control over how your mapping is generated. I'd recommend:
indexing some sample documents without a mapping,
investigating what mapping was auto-generated, and
dropping the index and using dynamic_templates to pseudo-auto-generate/update the mapping as new documents come in.
This GUI could help too.
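As a sketch of the dynamic_templates step above, here is a hypothetical mapping body (ES 2.x syntax, illustrative names) that maps every new string field to a not_analyzed string, so the mapping grows in a controlled way as documents arrive:

```python
import json

# Hypothetical dynamic_templates section: any field whose value is
# detected as a string gets indexed as a not_analyzed string.
mapping = {
    "mappings": {
        "doc": {
            "dynamic_templates": [
                {
                    "strings_as_keywords": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "string", "index": "not_analyzed"},
                    }
                }
            ]
        }
    }
}

# Body you would send when creating the index:
print(json.dumps(mapping, indent=2))
```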
Currently, there is no such tool available to generate the mapping for Elasticsearch.
It is similar to having to design a database schema in MySQL.
If you wanted to avoid a predefined schema altogether, you would use something like MongoDB.
But Elasticsearch comes with a very dynamic feature that lets you play around: it tries to get out of your way and let you start exploring your data as quickly as possible, much like a Mongo schema, which can be manipulated dynamically.
To index a document, you don't need to first define a mapping (schema) with your fields and their data types.
You can just index a document, and the index, type, and fields will be created automatically.
For further details you can go through the documentation below:
Elastic Dynamic Mapping
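To illustrate, here is a rough approximation in Python of the types Elasticsearch's dynamic mapping would typically infer from a first document. This is a simplified sketch: the exact behaviour depends on the ES version and on settings like date_detection, and the sample document is made up.

```python
# Sample document we pretend to index without a predefined mapping.
doc = {"name": "shoe", "price": 49.99, "in_stock": True, "added": "2015-01-01"}

def guess_es_type(value):
    # Rough approximation of ES dynamic type detection.
    # bool must be checked before int (bool is a subclass of int in Python).
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"  # strings may also be detected as dates by ES

inferred = {field: guess_es_type(v) for field, v in doc.items()}
```

Querying the real mapping afterwards (GET /index/_mapping) shows what ES actually chose.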

Elastic search document storing

The basic use case we are trying to solve is for users to be able to search the contents of log files.
Let's take a simple situation where a user searches for a keyword that is present in a log file, and I want to render that log back to the user.
We plan to use Elasticsearch for handling this. The idea I have in mind is to use Elasticsearch as a mechanism to store the indexed log files.
With this concept in mind, I went through https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
A couple of questions I have:
1) I understand the input provided to Elasticsearch is a JSON doc. It is going to scan the JSON provided and create/update indexes. So I need a mechanism to convert my input log files to JSON?
2) Elasticsearch would scan this input document and create/update inverted indexes. These inverted indexes actually point to the exact document. So does that mean ES stores these documents somewhere? Would it store them as JSON docs? Is it purely in memory, or on the file system / in a database?
3) Now when a user searches for a keyword, ES returns the document that contains the searched keyword. Do I then need the ability to convert this JSON doc back into the original log document the user expects?
Clearly I'm missing something. Sorry for asking such silly questions, but I'm trying to improve my skills and it's a work in progress.
Also, I understand that there is the ELK stack out there. For some reasons we just want to use ES and not the Logstash and Kibana parts of the stack.
Thanks
Logs need to be parsed to JSON before they can be inserted into Elasticsearch.
All documents are stored on the filesystem; some data is also kept in memory, but all data is persistent.
When you search Elasticsearch, you get back matching JSON documents. If you want to display the original log message, you can store that original message in one of the JSON fields and display just that.
So if you just want to store log messages and not break them into fields or anything, you can simply take each row and send it to Elasticsearch like so:
{ "message": "This is my log message" }
To parse logs, break them into fields and add some logic, you will need some sort of app, like Logstash for example.
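If you want to go one small step beyond the single-field approach without pulling in Logstash, a minimal sketch of parsing one log line into a JSON document could look like this. The log format and field names here are assumptions for illustration:

```python
import json
import re

# A made-up log line in "timestamp level message" form.
line = "2016-03-01 12:00:01 ERROR Connection refused"

# Split into timestamp, level, and the rest of the message.
match = re.match(r"(\S+ \S+) (\w+) (.*)", line)
if match:
    doc = {
        "timestamp": match.group(1),
        "level": match.group(2),
        "message": match.group(3),
        "raw": line,  # keep the original line so it can be shown verbatim
    }
else:
    doc = {"message": line}  # fall back to indexing the whole line

# This is the body you would POST to Elasticsearch.
payload = json.dumps(doc)
```

Keeping the raw line in its own field answers question 3) above: you never need to reconstruct the original log, you just stored it.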

elasticsearch-hadoop for Spark. Send documents from an RDD to a different index (by day)

I work on a complex workflow using Spark (parsing, cleaning, machine learning...).
At the end of the workflow I want to send aggregated results to Elasticsearch so my portal can query the data.
There will be two types of processing: streaming, and the possibility to relaunch the workflow on all available data.
Right now I use elasticsearch-hadoop, and particularly its Spark part, to send documents to Elasticsearch with the saveJsonToEs(myindex/mytype) method.
The target is to have one index per day, using the proper template that we built.
AFAIK, elasticsearch-hadoop does not let you pick the target index based on a field of the document.
What is the proper way to implement this feature?
Should I have a special step using Spark and the bulk API, so that each executor sends documents to the proper index depending on the feature of each line?
Is there something I missed in elasticsearch-hadoop?
I tried sending JSON to _bulk using saveJsonToEs("_bulk"), but the pattern has to follow index/type.
Thanks to Costin Leau, I have the solution.
Simply use dynamic indexing with something like saveJsonToEs("my-index-{date}/my-type"). "date" has to be a field in the document being sent.
Discussion on the elasticsearch Google group: https://groups.google.com/forum/#!topic/elasticsearch/5-LwjQxVlhk
Documentation: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html#spark-write-dyn
You can use ("custom-index-{date}/customtype") to create a dynamic index. "date" can be any field in the given RDD.
If you want to format the date: ("custom-index-{date|YYYY.MM.dd}/customtype")
[Answering the question asked by Amit_Hora in the comments; as I don't have enough privilege to comment, I am adding it here]
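Conceptually, the dynamic pattern routes each document to an index named after one of its own fields. A small illustrative sketch of that substitution (not elasticsearch-hadoop itself, just the idea, with made-up documents):

```python
import re

# Two made-up documents, each carrying the "date" field used for routing.
docs = [
    {"date": "2015-10-01", "msg": "first"},
    {"date": "2015-10-02", "msg": "second"},
]

def resolve_index(pattern, doc):
    # Substitute every {field} placeholder with the document's own value,
    # which is what the dynamic resource pattern does per document.
    return re.sub(r"\{(\w+)\}", lambda m: str(doc[m.group(1)]), pattern)

targets = [resolve_index("custom-index-{date}", d) for d in docs]
# targets: ["custom-index-2015-10-01", "custom-index-2015-10-02"]
```

This is why no extra Spark/bulk step is needed: the connector resolves the index per document.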

Recommended way to store data in elasticsearch

I want to use Elasticsearch in my backend and I have a few questions:
My DB contains semi-structured data about products, i.e. each product may have different attributes.
I want to be able to search for text across most of the fields, and also to search for text in one specific field.
What is the recommended way to store the documents in ES? To store all text in one field (maybe using the _all feature), or to leave it in different fields?
My concern with different fields is that I might end up with a lot of indexes (because I have many different product attributes).
I'm using Couchbase as my main DB.
What is the recommended way to move the documents from it to ES, assuming I need to make some modifications to the documents?
Should I update the index explicitly from my code, or use an external tool?
Thanks,
It depends on how many docs you are indexing at a time. If the number of docs is large, say more than 2 million, then it's better to store everything in one field, which will save time while indexing.
If only a few docs are indexed, then index them field by field and search on the _all field. This gives a clearer view of the data and is really helpful for deciding what to display and what not to display.
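A minimal sketch of the field-by-field option, assuming the pre-6.x era where the _all field still exists (the type and field names are illustrative): each attribute stays searchable on its own, while _all covers the "search everything" case.

```python
import json

# Hypothetical product mapping: individual fields for targeted search,
# _all enabled for free-text search across everything.
product_mapping = {
    "mappings": {
        "product": {
            "_all": {"enabled": True},
            "properties": {
                "name": {"type": "string"},
                "description": {"type": "string"},
                "color": {"type": "string"},
            },
        }
    }
}

# Body you would send when creating the product index:
print(json.dumps(product_mapping, indent=2))
```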

ElasticSearch, get the field where a text matches

I index JSON documents whose structure is 'unknown', and I want to search for content and get back the FIELD NAME the content belongs to. Is there any way besides doing a query and then iterating through the _source document fields to find the match again in the results? I also thought about the percolate feature and generating the queries when indexing a JSON document (but this would create hundreds of queries to check while indexing...). Maybe there is a simple feature I don't know of.
E.g. I know "Main Street" will be stored in the data received (e.g. from a web crawler), but to optimize the crawler it would be helpful to get a suggestion to crawl only the field "property.address.street". The point is that a customer might know some sample data to extract from JSON coming from different sources. To apply this knowledge to already collected data, the relevant field name must be found, especially if you want to do it automatically from provided sample content.
