Import Wikipedia's indices into Elasticsearch - elasticsearch

For my research I need to import the Russian Wikipedia dump into Elasticsearch 2.2. Instead of importing the dump itself, I decided to work with the indices published by Wikimedia (http://dumps.wikimedia.org/other/cirrussearch/). I found the article https://www.elastic.co/blog/loading-wikipedia and tried to use the author's scripts for my problem (I just replaced some export statements). But there is a problem in Step 2.
This is my version of the script for Step 2:
export es=localhost:9200
export site=ru.wikipedia.org
export index=ruwiki
curl -XDELETE $es/$index?pretty
curl -s 'https://'$site'/w/api.php?action=cirrus-settings-dump&format=json&formatversion=2' |
jq '{ analysis: .content.page.index.analysis, number_of_shards: 1, number_of_replicas: 0 }' |
curl -XPUT $es/$index?pretty -d @-
curl -s 'https://'$site'/w/api.php?action=cirrus-mapping-dump&format=json&formatversion=2' |
jq .content |
sed 's/"index_analyzer"/"analyzer"/' |
sed 's/"position_offset_gap"/"position_increment_gap"/' |
curl -XPUT $es/$index/_mapping/page?pretty -d @-
And this is the result:
{
"acknowledged" : true
}
{
"acknowledged" : true
}
{
"error" : {
"root_cause" : [ {
"type" : "action_request_validation_exception",
"reason" : "Validation Failed: 1: mapping source is empty;"
} ],
"type" : "action_request_validation_exception",
"reason" : "Validation Failed: 1: mapping source is empty;"
},
"status" : 400
}
I also tried the author's script unmodified, just as a test, and got the same error. I don't know what to do. Please help me fix it.

The Wikipedia dumps are currently exported from ElasticSearch 1.7.5. Most likely (I haven't tested) the current mapping is not compatible with ES 2.2. It is likely worthwhile to try using the older version of ES.
Edit: The latest dumps are now compatible with elasticsearch 2.x
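For reference, once the index settings and mapping are accepted, the cirrussearch content dump itself can be loaded in chunks through the _bulk API, roughly following the approach from the linked blog post (a sketch; the dump filename and chunk size are assumptions, adjust them for the dump you actually downloaded):
export dump=ruwiki-20160201-cirrussearch-content.json.gz   # assumed filename of the downloaded dump
mkdir chunks && cd chunks
# split the dump into small files so each bulk request stays a manageable size
zcat ../$dump | split -a 10 -l 500 - chunk_
for file in chunk_*; do
  echo -n "$file "
  # --data-binary preserves the newlines that the _bulk API requires
  curl -s -XPOST $es/$index/_bulk --data-binary @$file > /dev/null
  echo done
done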

Related

Download iCloud file from shared link using bash

I want to download a file from iCloud. I can share a link to the file; however, the file is not directly linked in that URL, but the "real" download URL can be retrieved:
#!/bin/bash
# given "https://www.icloud.com/iclouddrive/<ID>#<Filename>
ID="...."
URL=$(curl 'https://ckdatabasews.icloud.com/database/1/com.apple.cloudkit/production/public/records/resolve' \
--data-raw '{"shortGUIDs":[{"value":"$ID"}]}' --compressed \
jq -r '.results[0].rootRecord.fields.fileContent.value.downloadURL')
curl "$URL" -o myfile.ext
Source: https://gist.github.com/jpillora/702ded79330043e38e8202b5c73835e5
"fileContent" : {
"value" : {
...
"downloadURL" : "https://cvws.icloud-content.com/B/CYo..."
},
This is, however, not working:
curl: (6) Could not resolve host: jq
curl: (3) nested brace in URL position 17:
{
"results" : [ {
"shortGUID" : {
"value" : "$ID",
"shouldFetchRootRecord" : true
},
"reason" : "shortGUID cannot be null or empty",
"serverErrorCode" : "BAD_REQUEST"
} ]
}
Any ideas, what I can do to make this work?
As @dan mentioned, jq is not a curl argument; it's a separate command. Hence you need to pipe (|) into it instead of continuing the line with \.
So the command would look something like this:
URL=$(curl 'https://ckdatabasews.icloud.com/database/1/com.apple.cloudkit/production/public/records/resolve' \
--data-raw '{"shortGUIDs":[{"value":"$ID"}]}' --compressed | jq -r '.results[0].rootRecord.fields.fileContent.value.downloadURL')
I solved it by installing jq and adding the ID directly instead of using $ID. Just installing jq was not sufficient.
brew install jq
#!/bin/bash
# given "https://www.icloud.com/iclouddrive/<ID>#<Filename>
URL=$(curl 'https://ckdatabasews.icloud.com/database/1/com.apple.cloudkit/production/public/records/resolve' \
--data-raw '{"shortGUIDs":[{"value":"ID"}]}' --compressed | jq -r '.results[0].rootRecord.fields.fileContent.value.downloadURL')
echo "$URL"
curl "$URL" -o myfile.ext

Elasticsearch error: cluster_block_exception [FORBIDDEN/12/index read-only / allow delete (api)], flood stage disk watermark exceeded

When trying to post documents to Elasticsearch as normal I'm getting this error:
cluster_block_exception [FORBIDDEN/12/index read-only / allow delete (api)];
I also see this message on the Elasticsearch logs:
flood stage disk watermark [95%] exceeded ... all indices on this node will be marked read-only
This happens when Elasticsearch thinks the disk is running low on space so it puts itself into read-only mode.
By default Elasticsearch's decision is based on the percentage of disk space that's free, so on big disks this can happen even if you have many gigabytes of free space.
The flood stage watermark is 95% by default, so on a 1TB drive you need at least 50GB of free space or Elasticsearch will put itself into read-only mode.
For docs about the flood stage watermark see https://www.elastic.co/guide/en/elasticsearch/reference/6.2/disk-allocator.html.
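To see how much disk space Elasticsearch itself thinks is left on each node, you can query the cat allocation API (a quick check; localhost:9200 assumed):
curl -s 'http://localhost:9200/_cat/allocation?v'
# the disk.percent column shows how full each node's data path is, which is what the watermarks compare against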
The right solution depends on the context - for example a production environment vs a development environment.
Solution 1: free up disk space
Freeing up enough disk space so that more than 5% of the disk is free will solve this problem. Elasticsearch won't automatically take itself out of read-only mode once enough disk is free, though; you'll have to do something like this to unlock the indices:
$ curl -XPUT -H "Content-Type: application/json" https://[YOUR_ELASTICSEARCH_ENDPOINT]:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
Solution 2: change the flood stage watermark setting
Change the "cluster.routing.allocation.disk.watermark.flood_stage" setting to something else. It can either be set to a lower percentage or to an absolute value. Here's an example of how to change the setting from the docs:
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "100gb",
"cluster.routing.allocation.disk.watermark.high": "50gb",
"cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
"cluster.info.update.interval": "1m"
}
}
Again, after doing this you'll have to use the curl command above to unlock the indices, but after that they should not go into read-only mode again.
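If you want the changed watermarks to survive a full cluster restart, the same settings can be applied as persistent instead of transient (a sketch reusing the example values above):
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "100gb",
"cluster.routing.allocation.disk.watermark.high": "50gb",
"cluster.routing.allocation.disk.watermark.flood_stage": "10gb"
}
}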
By default, Elasticsearch puts itself into read-only mode when you have less than 5% of free disk space. If you see errors similar to this:
Elasticsearch::Transport::Transport::Errors::Forbidden: [403]
{"error":{"root_cause":[{"type":"cluster_block_exception","reason":"blocked
by: [FORBIDDEN/12/index read-only / allow delete
(api)];"}],"type":"cluster_block_exception","reason":"blocked by:
[FORBIDDEN/12/index read-only / allow delete (api)];"},"status":403}
Or in /usr/local/var/log/elasticsearch.log you can see logs similar to:
flood stage disk watermark [95%] exceeded on
[nCxquc7PTxKvs6hLkfonvg][nCxquc7][/usr/local/var/lib/elasticsearch/nodes/0]
free: 15.3gb[4.1%], all indices on this node will be marked read-only
Then you can fix it by running the following commands:
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
From: https://techoverflow.net/2019/04/17/how-to-fix-elasticsearch-forbidden-12-index-read-only-allow-delete-api/
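Note that the first command above disables the disk threshold checks entirely. Once you have freed up space, it is probably wise to re-enable them (a sketch, restoring the default):
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": true } }'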
This error is usually observed when your machine is low on disk space.
Steps to be followed to avoid this error message
Resetting the read-only index block on the index:
$ curl -X PUT -H "Content-Type: application/json" http://127.0.0.1:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
Response
${"acknowledged":true}
Updating the low watermark to at least 50 gigabytes free, the high watermark to at least 20 gigabytes free, and the flood stage watermark to 10 gigabytes free, and refreshing the cluster's disk information every minute:
Request
$ curl -X PUT "http://127.0.0.1:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d' { "transient": { "cluster.routing.allocation.disk.watermark.low": "50gb", "cluster.routing.allocation.disk.watermark.high": "20gb", "cluster.routing.allocation.disk.watermark.flood_stage": "10gb", "cluster.info.update.interval": "1m"}}'
Response
{
"acknowledged" : true,
"persistent" : { },
"transient" : {
"cluster" : {
"routing" : {
"allocation" : {
"disk" : {
"watermark" : {
"low" : "50gb",
"flood_stage" : "10gb",
"high" : "20gb"
}
}
}
},
"info" : {"update" : {"interval" : "1m"}}}}}
After running these two commands, you must run the first command again so that the index does not go back into read-only mode.
Only changing the settings with the following command did not work in my environment:
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
I also had to run the Force Merge API command:
curl -X POST "localhost:9200/my-index-000001/_forcemerge?pretty"
ref: Force Merge API
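If the goal is mainly to reclaim space held by deleted documents, the force merge can be limited to expunging deletes (a sketch; my-index-000001 is the same placeholder index name as above):
curl -X POST "localhost:9200/my-index-000001/_forcemerge?only_expunge_deletes=true&pretty"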
Even if disk usage is brought back below 95%, the issue can still persist.
A short-term solution is to raise the watermark limits above 95%. The steps below use the Windows command prompt.
a. Create a JSON file with the following parameters:
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "90%",
"cluster.routing.allocation.disk.watermark.high": "95%",
"cluster.routing.allocation.disk.watermark.flood_stage": "97%"
}
}
b. Name it anything, e.g. json.txt
c. Type the following command in the command prompt:
>curl -X PUT "localhost:9200/_cluster/settings?pretty" -H "Content-Type: application/json" -d @json.txt
d. The following output is received:
{
"acknowledged" : true,
"persistent" : {
"cluster" : {
"routing" : {
"allocation" : {
"disk" : {
"watermark" : {
"low" : "90%",
"flood_stage" : "97%",
"high" : "95%"
}
}
}
}
}
},
"transient" : { }
}
e. Create another JSON file with the following parameter:
{
"index.blocks.read_only_allow_delete": null
}
f. Name it anything, e.g. json1.txt
g. Type the following command in the command prompt:
>curl -X PUT "localhost:9200/*/_settings?expand_wildcards=all" -H "Content-Type: application/json" -d @json1.txt
h. You should get the following output:
{"acknowledged":true}
i. Restart the ELK stack/Kibana and the issue should be resolved.
You can also delete the read-only setting from Postman.
A nice guide from the ELK team:
https://www.elastic.co/guide/en/elasticsearch/reference/master/disk-usage-exceeded.html
It worked for me with ELK 7.x

Create index-patterns from console with Kibana 6.0 or 7+ (v7.0.1)

I recently upgraded my ElasticStack instance from 5.5 to 6.0, and it seems that some of the breaking changes of this version have harmed my pipeline. I had a script that, depending on the indices inside ElasticSearch, created index-patterns automatically for some groups of similar indices. The problem is that with the new mapping changes of the 6.0 version, I cannot add any new index-pattern from the console. This was the request I used and worked fine in 5.5:
curl -XPOST "http://localhost:9200/.kibana/index-pattern" -H 'Content- Type: application/json' -d'
{
"title" : "index_name",
"timeFieldName" : "execution_time"
}'
This is the response I get now, in 6.0, from ElasticSearch:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Rejecting mapping update to [.kibana] as the final mapping would have more than 1 type: [index-pattern, doc]"
}
],
"type": "illegal_argument_exception",
"reason": "Rejecting mapping update to [.kibana] as the final mapping would have more than 1 type: [index-pattern, doc]"
},
"status": 400
}
How could I add index-patterns from the console avoiding this multiple mapping issue?
The URL has changed in version 6.0.0; here is the new URL:
http://localhost:9200/.kibana/doc/index-pattern:my-index-pattern-name
This CURL should work for you:
curl -XPOST "http://localhost:9200/.kibana/doc/index-pattern:my-index-pattern-name" -H 'Content-Type: application/json' -d'
{
"type" : "index-pattern",
"index-pattern" : {
"title": "my-index-pattern-name*",
"timeFieldName": "execution_time"
}
}'
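To verify the saved object was created, you can fetch it back by its document ID (a quick check using the same placeholder name):
curl -XGET "http://localhost:9200/.kibana/doc/index-pattern:my-index-pattern-name?pretty"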
If you are on Kibana 7.0.1 / 7+, you can use the saved_objects API.
Refer to: https://www.elastic.co/guide/en/kibana/master/saved-objects-api.html (look for Get, Create, Delete, etc.).
In this case, we'll use: https://www.elastic.co/guide/en/kibana/master/saved-objects-api-create.html
$ curl -X POST -u $user:$pass -H "Content-Type: application/json" -H "kbn-xsrf:true" "${KIBANA_URL}/api/saved_objects/index-pattern/dummy_index_pattern" -d '{ "attributes": { "title":"index_name*", "timeFieldName":"sprint_start_date"}}' -w "\n" | jq
and the output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 327 100 250 100 77 543 167 --:--:-- --:--:-- --:--:-- 543
{
"type": "index-pattern",
"id": "dummy_index_pattern",
"attributes": {
"title": "index_name*",
"timeFieldName": "sprint_start_date"
},
"references": [],
"migrationVersion": {
"index-pattern": "6.5.0"
},
"updated_at": "2020-02-25T22:56:44.531Z",
"version": "Wzg5NCwxNV0="
}
Where $KIBANA_URL was set to: http://my-elk-stack.devops.local:5601
If you don't have jq installed, remove | jq from the command (as listed above).
PS: When Kibana's GUI is used to create an index-pattern, Kibana stores its ID as an alphanumeric value (ex: laskl32ukdflsdjflskadf-sdf-sdfsaldkjfhsdf-dsfasdf), which is hard to find and type when doing a GET operation to look up an existing index-pattern with the curl commands below.
If you pass an index-pattern name (like we did above), Kibana/Elasticsearch will store the index-pattern's ID as the name you gave in the REST call (ex: .../api/saved_objects/index-pattern/dummy_index_pattern).
Here dummy_index_pattern becomes the ID (only visible if you hover your mouse over the index-pattern name in the Kibana GUI), and
its title will be index_name* (i.e. what's listed in the GUI when you click Kibana Home > Gear icon > Index Patterns and see the index patterns listed on the right side).
NOTE: The timeFieldName is very important. This is the field used for finding time-series events (especially for the TSVB Time Series Visual Builder visualization type). By default it is the @timestamp field, but if you recreate your index from scratch every time and push all data in one shot from a data source (ex: JIRA), instead of sending delta information, then @timestamp won't help with a visualization's time-spanning/window feature (where you change the time range from last 1 week to last 1 hour or last 6 months). In that case you can set a different field, e.g. sprint_start_date as I used; now on Kibana's Discover page, if you select this index-pattern, it will use the sprint_start_date (type: date) field for events.
To GET info about the newly created index-pattern, you can refer to https://www.elastic.co/guide/en/kibana/master/saved-objects-api-get.html or run the following (the last value in the URL path is the ID of the index pattern we created earlier):
curl -X GET "${KIBANA_URL}/api/saved_objects/index-pattern/dummy_index_pattern" | jq
Or, if you want to perform a GET on an index pattern that was created via Kibana's GUI (Index Patterns > Create Index Pattern), you'd have to enter something like this:
curl -X GET "${KIBANA_URL}/api/saved_objects/index-pattern/laskl32ukdflsdjflskadf-sdf-sdfsaldkjfhsdf-dsfasdf" | jq
For Kibana 7.7.0 with Open Distro security plugin (amazon/opendistro-for-elasticsearch-kibana:1.8.0 Docker image to be precise), this worked for me:
curl -X POST \
-u USERNAME:PASSWORD \
KIBANA_HOST/api/saved_objects/index-pattern \
-H "kbn-version: 7.7.0" \
-H "kbn-xsrf: true" \
-H "content-type: application/json; charset=utf-8" \
-d '{"attributes":{"title":"INDEX-PATTERN*","timeFieldName":"#timestamp","fields":"[]"}}'
Please note that the kbn-xsrf header is required, although it seems fairly useless from a security point of view.
Output was like:
{"type":"index-pattern","id":"UUID","attributes":{"title":"INDEX-PATTERN*","timeFieldName":"#timestamp","fields":"[]"},"references":[],"migrationVersion":{"index-pattern":"7.6.0"},"updated_at":"TIMESTAMP","version":"VERSION"}
I can't tell why migrationVersion.index-pattern is "7.6.0".
For other Kibana versions you should be able to:
Open the Kibana UI in a browser
Open the developer console and navigate to the Network tab
Create an index pattern using the UI
Open the POST request in the developer console, look at the URL and headers, then rewrite it as cURL
Indices created in Elasticsearch 6.0.0 or later may only contain a single mapping type.
Indices created in 5.x with multiple mapping types will continue to function as before in Elasticsearch 6.x.
Mapping types will be completely removed in Elasticsearch 7.0.0.
Maybe you are creating an index with more than one doc type in ES 6.0.0.
https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html
Create index-pattern in bulk with timestamp:
cat index_svc.txt
my-index1
my-index2
my-index3
my-index4
my-index5
my-index6
cat index_svc.txt | while read index; do
echo -ne "create index-pattern ${index} \t"
curl -XPOST "http://10.0.1.44:9200/.kibana/doc/index-pattern:${index}" -H 'Content-Type: application/json' -d "{\"type\":\"index-pattern\",\"index-pattern\":{\"title\":\"${index}2020*\",\"timeFieldName\":\"#timestamp\"}}"
echo
done
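To confirm the patterns were created, you can search the .kibana index for documents of type index-pattern (a quick check against the same host as above):
curl -s "http://10.0.1.44:9200/.kibana/_search?q=type:index-pattern&pretty"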

Curl get command not showing the data but showing only header information

I am working on an Elasticsearch server and using curl commands to post and get data from the Windows command line.
When I try to post data using curl -XPUT, the data appears to get inserted. But when I query it back using curl -XGET, I don't get the documents back, only header information like the index name, etc. Please see the queries and results below.
curl -XPUT "<server location>/megacorp/emp/1" -d "{""first_name"" : ""John"",""last_name"" : ""Smith"",""age"" : "25"}"
{"_index":"megacorp","_type":"emp","_id":"1","_version":1,"created":true}
curl -XPUT "<server location>/megacorp/emp/2" -d "{""first_name"" : ""Jane"",""last_name"" : ""Cooper"",""age"" : "35"}"
{"_index":"megacorp","_type":"emp","_id":"2","_version":1,"created":true}
curl -XPUT "<server location>/megacorp/emp/3" -d "{""first_name"" : ""Bradleey"",""last_name"" : ""Cooper"",""age"" : "40"}"
{"_index":"megacorp","_type":"emp","_id":"3","_version":1,"created":true}
curl -XGET "<server location>/megacorp/emp/_search?q=last_name:Cooper"
{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"megacorp","_type":"emp","_id":"2","_score":1.0},{"_index":"me
gacorp","_type":"emp","_id":"3","_score":1.0}]}}
Try this: curl -XGET "<server location>/megacorp/emp/_search?q=last_name:Cooper&_source=true"
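For comparison, here is the same search written with a JSON request body (a sketch, using the same Windows-style double-quote escaping as the question); the _source option controls whether each hit includes the document source:
curl -XGET "<server location>/megacorp/emp/_search?pretty" -d "{""query"": {""match"": {""last_name"": ""Cooper""}}, ""_source"": true}"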

Elastic search batch insert using curl is not working

I'm performing a small benchmark on an Elastic Search cluster. My benchmark script is written in bash and uses curl.
I'm writing the data to a file that I'm posting to the REST API:
curl -XPOST 'localhost:9200/benchmark/external/_bulk?pretty' \
--data @$DATAFILE
My $DATAFILE is very simple, and has all newlines in place:
{"index":{"_id": "1"}}
{"data":"xxxxxxxxxx"}
{"index":{"_id": "2"}}
{"data":"xxxxxxxxxx"}
{"index":{"_id": "3"}}
{"data":"xxxxxxxxxx"}
...
But when I try to do my post I keep receiving the following error:
{
"error" : "ActionRequestValidationException[Validation Failed: 1: no requests added;]",
"status" : 400
}
I understand that my input is not validated, but why?
curl removed the newlines before the data was sent!
The --data parameter should be replaced with --data-binary:
curl -XPOST 'localhost:9200/benchmark/external/_bulk?pretty' \
--data-binary @$DATAFILE
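Once the bulk request succeeds, a quick way to confirm that the documents were actually indexed is the count API (a sketch against the same index):
curl -XGET 'localhost:9200/benchmark/external/_count?pretty'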
