PDF ingesting through Kibana - Elasticsearch

I am new to Elasticsearch, and I have a requirement to ingest and index PDFs using Kibana. I have figured out that I need to create a pipeline for this purpose, but I do not know which processor to use or how to configure it. I discovered that my Elasticsearch node has the ingest-attachment plugin installed. I am using Elasticsearch 7.14, so any help is appreciated. Thank you.

This might be useful for you: the ingest attachment processor plugin extracts and ingests data from a PDF that has been base64-encoded. You would be required to base64-encode the file and send it through a pipeline. For example:
import base64
from elasticsearch import Elasticsearch

client = Elasticsearch()  # configure hosts/auth for your cluster

# data holds the raw bytes of the PDF you are parsing
encoded_data = base64.b64encode(data).decode('utf-8')

body = {
    'query': {
        'bool': {
            'filter': [
                {'ids': {'values': [contentDocumentId]}},
                {'term': {'contentVersionId': contentVersionId}}
            ]
        }
    },
    'script': {
        'source': 'ctx._source["file_data"] = params._file_data',
        'params': {'_file_data': encoded_data}
    }
}

# run the 'attachment' ingest pipeline on every document matched by the query
response = client.update_by_query(
    conflicts='proceed', index=_index, pipeline='attachment', body=body
)
I am using update by query for my use case; check whether the update API or update by query fits yours.
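For the snippet above to work, a pipeline named attachment has to exist first. Here is a minimal sketch of how it might be registered with the same Python client; the pipeline name is taken from the snippet, while the field name file_data and the processor options are assumptions, not part of the original answer:

from elasticsearch import Elasticsearch

client = Elasticsearch()  # configure hosts/auth for your cluster

# Sketch only: register an ingest pipeline that runs the ingest-attachment
# processor on the base64 string stored in the assumed 'file_data' field.
client.ingest.put_pipeline(
    id='attachment',
    body={
        'description': 'Extract text from base64-encoded PDFs',
        'processors': [
            {
                'attachment': {
                    'field': 'file_data',     # field holding the base64 content
                    'indexed_chars': -1,      # assumption: do not truncate extracted text
                    'ignore_missing': True    # assumption: skip docs without the field
                }
            }
        ]
    }
)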

Related

How to add custom index using ingest node pipeline?

Is it possible to do conditional indexing by using ingest node pipelines? I feel this could be done with the script processor, but can someone tell me whether it is possible?
I am in a scenario where I need to decide on the better way to do custom indexing. I can specify conditions in the metricbeat.yml/filebeat.yml files to get this done, but is this the best way to do custom indexing? There is no Logstash in my Elastic Stack.
output.elasticsearch:
  indices:
    - index: "metricbeat-dev-%{[agent.version]}-%{+yyyy.MM.dd}"
      when.equals:
        kubernetes.namespace: "dev"
This is how I have implemented custom indexing in Metricbeat/Filebeat right now. I have 20+ namespaces in my Kubernetes cluster. Please suggest whether this could be done with an ingest node pipeline or not.
Yes, you can achieve this with the ingest pipeline set processor. Ingest pipelines support access to metadata fields, so you can read and update the index name through the _index field.
Below is a sample ingest pipeline that updates the index name when the namespace is dev:
[
  {
    "set": {
      "field": "_index",
      "value": "metricbeat-dev",
      "if": "ctx.kubernetes?.namespace == 'dev'"
    }
  }
]
Update 1: append the agent version to the index name. I have assumed the agent version field is named agent.version.
[
  {
    "set": {
      "field": "_index",
      "value": "metricbeat-dev-{{agent.version}}",
      "if": "ctx.kubernetes?.namespace == 'dev'"
    }
  }
]
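To use processors like the above, they need to be registered as a named pipeline and then referenced from the Beats output (for example via the pipeline option under output.elasticsearch). A minimal sketch with the Python client; the pipeline name route-by-namespace is an assumption, not from the answer:

from elasticsearch import Elasticsearch

client = Elasticsearch()  # configure hosts/auth for your cluster

# Sketch only: register the set-processor pipeline under an assumed name.
client.ingest.put_pipeline(
    id='route-by-namespace',
    body={
        'description': 'Route dev-namespace documents to a dedicated index',
        'processors': [
            {
                'set': {
                    'field': '_index',
                    'value': 'metricbeat-dev-{{agent.version}}',
                    'if': "ctx.kubernetes?.namespace == 'dev'"
                }
            }
        ]
    }
)

Metricbeat/Filebeat would then reference it with pipeline: route-by-namespace under output.elasticsearch in their configuration.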

Elasticsearch rollover index with ingest pipeline

I have a data stream built out in Elasticsearch through Kibana. I have all the right mappings, index patterns, and settings. I created the index that matches the correct index pattern. All good so far.
I have an ingest pipeline that I created to ensure that any documents that come to ES get a @timestamp field before getting ingested into the index.
PUT _ingest/pipeline/my_timestamp_pipeline
{
  "description": "Adds a field to a document with the time of ingestion",
  "processors": [
    {
      "set": {
        "field": "@timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
I apply the above pipeline to the index as follows
PUT /<<index name>>/_settings
{
  "settings": {
    "default_pipeline": "my_timestamp_pipeline"
  }
}
Every time I do a manual rollover, the ingest pipeline changes get disabled on the index and my documents fail to get indexed due to a missing @timestamp field, which is required as part of a data stream.
Do manual rollovers NOT support ingest pipelines, so that I have to manually apply the pipeline every time I do a manual rollover?
I checked that you can pass properties during a manual rollover of an index, but not for a rollover of a data stream. Am I missing anything obvious here?
Any help is appreciated
Thanks
Nick
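One direction worth checking (this is not from the original thread, just a hedged sketch): since a rollover creates a fresh backing index, settings applied directly to the current index are not carried over, but index.default_pipeline placed in the data stream's index template is inherited by every new backing index. A rough sketch with the Python requests library, assuming a recent 7.x cluster with composable index templates, a template name my-template, and a data stream named my-data-stream:

import requests

ES = "http://localhost:9200"  # assumption: local cluster, no auth

# Sketch only: a composable index template whose settings include the default
# pipeline, so each backing index created by a rollover inherits it.
template = {
    "index_patterns": ["my-data-stream*"],   # assumed data stream name
    "data_stream": {},
    "template": {
        "settings": {
            "index.default_pipeline": "my_timestamp_pipeline"
        }
    }
}

resp = requests.put(f"{ES}/_index_template/my-template", json=template)
resp.raise_for_status()
print(resp.json())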

Update ElasticSearch document using Python requests

I am implementing Elasticsearch 7.1.1 in my application using the Python requests library. I have successfully created a document in the Elastic index using
r = requests.put(url, auth=awsauth, json=document, headers=headers)
However, while updating an existing document, the JSON body (containing the values to be updated) that I pass to the method replaces the original document. How do I overcome this? Thank you.
You could do the following:
document = {
    "doc": {
        "field_1": "value_1",
        "field_2": "value_2"
    },
    "doc_as_upsert": True
}
...
r = requests.post(url, auth=awsauth, json=document, headers=headers)
It should be POST instead of PUT.
You can update existing fields and also add new ones.
Refer to the doc in the comment posted by Nishant Saini.
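A minimal end-to-end sketch of that call with requests; the host, index name my-index, and document id 1 are assumptions. The partial update is sent to the _update endpoint rather than the plain document URL, so only the fields under "doc" are merged:

import requests

ES = "http://localhost:9200"   # assumption: local cluster, no auth
headers = {"Content-Type": "application/json"}

document = {
    "doc": {
        "field_1": "value_1",
        "field_2": "value_2"
    },
    "doc_as_upsert": True
}

# Partial update: existing fields are merged rather than replaced;
# doc_as_upsert creates the document if it does not exist yet.
r = requests.post(f"{ES}/my-index/_update/1", json=document, headers=headers)
r.raise_for_status()
print(r.json())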

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster set up with the EFK stack for logging, as described here. I have never worked with any of these components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the number of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect /\"plz\":\"[0-9]{5}\"/ to deliver the result, but I get 0 hits (the time interval is set correctly). Even if this regexp matched, I would only find the log entry that contains it and not just the specific value. How do I go on from here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It would be possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with a grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    }
  ]
}
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345"
}
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
  "index.default_pipeline": "parse-plz"
}
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  }
}
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.location = params[ctx.plz];",
        "params": {
          "12345": {"lat": 42.36, "lon": 7.33}
        }
      }
    }
  ]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345",
  "location": {
    "lat": 42.36,
    "lon": 7.33
  }
}
So to sum it all up, it all boils down to only two things:
Create an ingest pipeline to parse documents as they get indexed
Create an index template for all project* indexes whose settings include the pipeline created in step 1
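As a rough sketch of the dataset step above (everything here is an assumption: a local file plz_geo.csv with columns plz, lat, lon, the host, and the way the pipeline is created), this is one way the PLZ-to-location params could be generated and the pipeline registered programmatically:

import csv
import requests

ES = "http://localhost:9200"  # assumption: local cluster, no auth

# Build the PLZ -> {lat, lon} lookup from an assumed local CSV file.
# For a full German PLZ dataset this params map gets large; this is only a sketch.
plz_params = {}
with open("plz_geo.csv", newline="") as f:
    for row in csv.DictReader(f):  # expects columns: plz, lat, lon
        plz_params[row["plz"]] = {"lat": float(row["lat"]), "lon": float(row["lon"])}

pipeline = {
    "processors": [
        {"grok": {"field": "message", "patterns": ["%{POSINT:plz}"]}},
        {
            "script": {
                "lang": "painless",
                "source": "ctx.location = params[ctx.plz];",
                "params": plz_params,
            }
        },
    ]
}

resp = requests.put(f"{ES}/_ingest/pipeline/parse-plz", json=pipeline)
resp.raise_for_status()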

How to update multiple documents that match a query in elasticsearch

I have documents which at first contain only "url" (analyzed) and "respsize" (not_analyzed) fields. I want to update the documents that match a url and add a new field, "category".
I mean:
at first doc1:
{
  "url": "http://stackoverflow.com/users/4005632/mehmet-yener-yilmaz",
  "respsize": "500"
}
I have external data, and I know that "stackoverflow.com" belongs to category 10,
and I need to update the doc and make it like:
{
  "url": "http://stackoverflow.com/users/4005632/mehmet-yener-yilmaz",
  "respsize": "500",
  "category": "10"
}
Of course I will do this for all documents whose url field contains "stackoverflow.com",
and I need to update each doc only once, because the category data of a url does not change, so there is no need to update it again.
I think I need to use the _update API with the _version number to check it, but I can't compose the DSL query.
EDIT
I ran this and it looks like it works fine,
but the documents are not changed.
Although the query result looks correct, the new field is not added to the docs. Do I need a refresh or something?
You could use the update by query plugin in order to do just that. The idea is to select all documents without a category whose url matches a certain string, and add the category you wish.
curl -XPOST 'localhost:9200/webproxylog/_update_by_query' -H "Content-Type: application/json" -d '
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "url": "stackoverflow.com"
              }
            },
            {
              "missing": {
                "field": "category"
              }
            }
          ]
        }
      }
    }
  },
  "script": "ctx._source.category = \"10\";"
}'
After running this, all your documents with url: stackoverflow.com that don't have a category will get category: 10. You can run the same query again later to fix new stackoverflow.com documents that have been indexed in the meantime.
Also make sure to enable scripting in elasticsearch.yml and restart ES:
script.inline: on
script.indexed: on
In the script, you're free to add as many fields as you want, e.g.
...
"script" : "ctx._source.category1 = \"10\"; ctx._source.category2 = \"20\";"
UPDATE
ES 2.3 now features the update by query functionality. You can still use the above query exactly as is and it will work (except that filtered and missing are deprecated, but still working ;).
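For newer clusters, here is a hedged sketch of the equivalent request without the deprecated filtered/missing constructs, using a bool filter plus must_not/exists and the Python client; the index and field names are taken from the curl example above:

from elasticsearch import Elasticsearch

client = Elasticsearch()  # configure hosts/auth for your cluster

# Sketch only: same logic as the curl example, expressed with the
# non-deprecated bool / must_not / exists constructs.
body = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"url": "stackoverflow.com"}}
            ],
            "must_not": [
                {"exists": {"field": "category"}}
            ]
        }
    },
    "script": {
        "lang": "painless",
        "source": "ctx._source.category = params.category",
        "params": {"category": "10"}
    }
}

response = client.update_by_query(index="webproxylog", body=body, conflicts="proceed")
print(response)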
That all sounds great, but just to add to @Val's answer: Update By Query is available from Elasticsearch 2.x but not for earlier versions. In our case we're using 1.4 for legacy reasons and there is no chance of upgrading in the foreseeable future, so another solution is using the update by query plugin provided here: https://github.com/yakaz/elasticsearch-action-updatebyquery
