Add field to crawled content with StormCrawler (and Elasticsearch) - elasticsearch

I have followed the following tutorial for crawling content with stormcrawler and then store it in elasticsearch: https://www.youtube.com/watch?v=KTerugU12TY . However, I would like to add to every document the date it was crawled. Can anyone tell me how this can be done?
In general, how can I change the fields of the crawled content?
Thanks in advance

One option would be to create an ingest pipeline in Elasticsearch to populate a date field, as described here. Alternatively, you'd have to write a bespoke parse filter to put the date in the metadata and then index it using indexer.md.mapping in the configuration.
It would probably be useful to make this operation simpler, please feel free to open an issue on Github (or even better contribute some code) so that the ES indexer could check the configuration for a field name indicating where to store the current date, e.g. es.now.field.

Related

Ways to only process new(index after last run) data in Elasticsearch?

Is there a way to get the date and time that an elastic search document was written?
I am running es queries via spark and would prefer NOT to look through all documents that I have already processed. Instead I would like read the only documents that were ingested between the last time the program ran and now.
What is the best most efficient way to do this?
I have looked at;
updating to add a field with an array with booleans for if its been looked at by which analytic. The negative is waiting for the update to occur.
index per time frame method, which would be to break down the current indexes into smaller ones so by hour.The negative I see is the number of open file descriptors.
??
Elasticsearch version 5.6
I posted the question on the elasticsearch discussion board and it appears using the ingest pipeline is the best option.
I am running es queries via spark and would prefer NOT to look through
all documents that I have already processed. Instead I would like read
the only documents that were ingested between the last time the
program ran and now.
A workaround could be :
While inserting data using Logstash to Elasticsearch, Logstash appends a #timestamp key to the document which represents the time (in UTC) at which the document is created or we can use an ingest pipline
After that we can query based on the timestamp.
For more on this please have a look at :
Mapping changes
There is no way to ask ES to insert a timestamp at index time
Elasticsearch doesn't have such functionality.
You need manually save with each document date. In this case you will be able to search by date range.

Can Beats update existing documents in Elasticsearch?

Consider the following use case:
I want the information from one particular log line to be indexed into Elasticsearch, as a document X.
I want the information from some log line further down the log file to be indexed into the same document X (not overriding the original, just adding more data).
The first part, I can obviously achieve with filebeat.
For the second, does anyone have any idea about how to approach it? Could I still use filebeat + some pipeline on an ingest node for example?
Clearly, I can use the ES API to update the said document, but I was looking for some solution that doesn't require changes to my application - rather, it is all possible to achieve using the log files.
Thanks in advance!
No, this is not something that Beats were intended to accomplish. Enrichment like you describe is one of the things that Logstash can help with.
Logstash has an Elasticsearch input that would allow you to retrieve data from ES and use it in the pipeline for enrichment. And the Elasticsearch output supports upsert operations (update if exists, insert new if not). Using both those features you can enrich and update documents as new data comes in.
You might want to consider ingesting the log lines as is to Elasticearch. Then using Logstash, build a separate index that is entity specific and driven based on data from the logs.

Elasticsearch: Indexing tweets - mapping, template or ETL

I am about to index tweets coming from Apache NiFi to Elasticsearch as POST and want to do the following:
Make create_at field as date. Should I use mapping or index template for this?
make some fields not analyzed. Like hashtags, URLs, etc.
Want to store not entire tweet but some important fields. Like text, not all user information but some field, hashtags, URLs from entities (in post URLs). Don't need quoted source. Etc.
What should I use in this case? template? Pre-process tweets with some ETL process in order to extract data I need and index in ES?
I am a bit confused. Will really appreciate advise.
Thanks in advance.
I guess in your NiFi you have something like GetTwitter and PostHTTP configured. NiFi is already some sort of ETL, so you probably don't need another one. However, since you don't want to index the whole JSOn coming out of Twitter, you clearly need another NiFi process inbetween to select what you want and transform the raw JSON into another more lightweight one. Here is an example on how to do it for Solr, but I'm not sure the same processor exists for Elasticsearch.
This article about streaming Twitter data to Elasticsearch using Logstash shows a possible index template that you could use in order to mold your own (i.e. add the create_at data field if you like).
The way to go for you since you don't want to index everything, is clearly to come up with your own mapping, which you can then use in an index template. Using index templates, you will be able to create daily/weekly/monthly twitter indices as you see fit.

Kibana 4 Metric visualization show latest value

I'm new to Kibana and Elastic Search and i have run into this problem:
My ES contains (among other stuff) also data containing the current value of one custom performance counter and i would like my dashboard to show this value, e.g., as a big number - therefore i tried to use the Metric visualization, but i have no idea on how to show only the last value. Any help would be highly appreciated. Thanks.
We had a similar issue for our use case. We found two ways to handle it:
If the data is periodically generated then you can use the Kibana feature of showing data of recent n days to see the latest data.
In our case, the above option was not possible so we went with a hack where we have a property in our documents called "IsLatest" so we apply a filter "IsLatest":true in all our charts where we need latest info. We have written our code which feeds data to ElasticSearch in such a way that it updates the older data and sets it's "IsLatest" to false.
Hope it helps

elasticsearch-hadoop for Spark. Send documents from a RDD in different index (by day)

I work on a complex workflow using Spark (parsing, cleaning, Machine Learning ...).
At the end of the workflow I want to send aggregated results to elasticsearch so my portal could query data.
There will be two types of processing: streaming and the possibility to relaunch workflow on all available data.
Right now I use elasticsearch-hadoop and particularly the spark part to send document to elasticsearch with the saveJsonToEs(myindex/mytype) method.
The target is to have an index by day using the proper template that we built.
AFAIK you could not add consideration of a feature in a document to send it to the proper index in elasticsearch-hadoop.
What is the proper way to implement this feature?
Have a special step using spark and bulk so that each executor send documents to the proper index considering the feature of each line?
Is there something that I missed in elasticsearch-hadoop?
I tried to send JSON to _bulk using saveJsonToEs("_bulk") but the pattern has to follow index/type
Thanks to Costin Leau, I have the solution.
Simply use dynamic indexing with something like saveJsonToEs("my-index-{date}/my-type"). "date" have to be a feature in the document that has to be send.
Discussion on elasticsearch google group: https://groups.google.com/forum/#!topic/elasticsearch/5-LwjQxVlhk
Documentation: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html#spark-write-dyn
You can use : ("custom-index-{date}/customtype") to create dynamic index. This could be any field in given rdd.
If you want format the date : ("custom-index-{date:{YYYY.mm.dd}}/customtype")
[Answered to question ask by Amit_Hora in the comment, as I don't have enough privilege to comment, I am adding this here]

Resources