atmospherejs.com : Result search sorted by number of downloads - sorting

I am using atmospherejs.com to search for packages, but unfortunately it does not provide any criteria to sort the results: for example, if I want to sort by number of downloads.
Is there any way to get the results of a query sorted by some criterion: number of downloads, rating, new packages, etc.?
Thanks

I don't think so, but you can use the atmosphere API to get to the raw data and then filter and sort:
# get raw data
curl --header "Accept: application/json" https://atmospherejs.com/a/packages > /tmp/all.json
# process the json array in whatever way you want, for instance here
# only consider packages starting with an "f", then sort by installs per year
node -e "x = require('/tmp/all.json'); \
console.log(x.filter(function(a) { return a.name[0] == 'f'; })\
.sort(function(a,b) { return a['installs-per-year'] < b['installs-per-year']; }));"

Related

ElasticSearch painless script to remove all the keys except for a list of keys

I want to execute an atomic update operation on an Elasticsearch (6.1) document, removing everything in the document except for some keys (on the top level, not nested).
I know that to remove a specific key from a document (something in the example below) I can do the following:
curl -XPOST 'localhost:9200/index/type/id/_update' -H 'Content-Type: application/json' -d '{
  "script" : {
    "source" : "ctx._source.remove(params.field)",
    "params" : {
      "field" : "something"
    }
  }
}'
But what If I want to remove every field except for a field called a and a field called b?
I found a way to make it work. I'm posting it here since it might be useful for someone else:
POST /index/type/id/_update
{
  "script" : {
    "source" : "Object var0 = ctx._source.get(\"a\"); Object var1 = ctx._source.get(\"b\"); ctx._source = params.value; if (var0 != null) ctx._source.put(\"a\", var0); if (var1 != null) ctx._source.put(\"b\", var1);",
    "params" : {
      "value" : {
        "newKey" : "newValue"
      }
    }
  }
}
This script updates the document with the content of params.value while keeping the keys a and b from the previous version of the document. This approach is simpler for my use case, since the list of keys to keep is small compared to the number of keys present in the existing document.
If you would like only to keep the keys a and b, you would first store their values in variables, then call ctx._source.clear(), and then add the keys back.
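Following the same pattern as the script above, a rough, untested sketch of that clear()-based variant could look like this (request in the same style as the answer; only a and b survive the update):
POST /index/type/id/_update
{
  "script" : {
    "source" : "Object var0 = ctx._source.get(\"a\"); Object var1 = ctx._source.get(\"b\"); ctx._source.clear(); if (var0 != null) ctx._source.put(\"a\", var0); if (var1 != null) ctx._source.put(\"b\", var1);"
  }
}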

Elasticsearch 2.1: Result window is too large (index.max_result_window)

We retrieve information from Elasticsearch 2.1 and allow the user to page through the results. When the user requests a high page number we get the following error message:
Result window is too large, from + size must be less than or equal
to: [10000] but was [10020]. See the scroll api for a more efficient
way to request large data sets. This limit can be set by changing the
[index.max_result_window] index level parameter
The Elasticsearch docs say that this is because of high memory consumption and that one should use the scrolling API:
Values higher than that can consume significant chunks of heap memory per
search and per shard executing the search. It's safest to leave this
value as it is and use the scroll api for any deep scrolling https://www.elastic.co/guide/en/elasticsearch/reference/2.x/breaking_21_search_changes.html#_from_size_limits
The thing is that I do not want to retrieve large data sets. I only want to retrieve a slice of the data set which is very high up in the result set. Also, the scrolling docs say:
Scrolling is not intended for real time user requests https://www.elastic.co/guide/en/elasticsearch/reference/2.2/search-request-scroll.html
This leaves me with some questions:
1) Would the memory consumption really be lower (and if so, why?) if I use the scrolling API to scroll up to result 10020 (and disregard everything below 10000) instead of doing a "normal" search request for results 10000-10020?
2) It does not seem that the scrolling API is an option for me but that I have to increase "index.max_result_window". Does anyone have any experience with this?
3) Are there any other options to solve my problem?
If you need deep pagination, one possible solution is to increase the value of max_result_window. You can use curl to do this from your shell command line:
curl -XPUT "http://localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d '{ "index" : { "max_result_window" : 500000 } }'
I did not notice increased memory usage, for values of ~ 100k.
The right solution would be to use scrolling.
However, if you want to extend the number of results returned beyond 10,000, you can do it easily with Kibana:
Go to Dev Tools and just post the following to your index (your_index_name), specifying what the new max result window should be:
PUT your_index_name/_settings
{
"max_result_window" : 500000
}
If all goes well, you should see the following success response:
{
"acknowledged": true
}
The following pages in the elastic documentation talk about deep paging:
https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_fetch_phase.html
Depending on the size of your documents, the number of shards, and the
hardware you are using, paging 10,000 to 50,000 results (1,000 to
5,000 pages) deep should be perfectly doable. But with big-enough from
values, the sorting process can become very heavy indeed, using vast
amounts of CPU, memory, and bandwidth. For this reason, we strongly
advise against deep paging.
Use the Scroll API to get more than 10000 results.
Scroll example in ElasticSearch NEST API
I have used it like this:
private static Customer[] GetCustomers(IElasticClient elasticClient)
{
    var customers = new List<Customer>();
    // Open a scan/scroll context that stays alive for one minute per round trip
    var searchResult = elasticClient.Search<Customer>(s => s.Index(IndexAlias.ForCustomers())
                          .Size(10000).SearchType(SearchType.Scan).Scroll("1m"));
    do
    {
        var result = searchResult;
        // Fetch the next batch using the scroll id from the previous response
        searchResult = elasticClient.Scroll<Customer>("1m", result.ScrollId);
        customers.AddRange(searchResult.Documents);
    } while (searchResult.IsValid && searchResult.Documents.Any());
    return customers.ToArray();
}
If you want more than 10,000 results, memory usage on all the data nodes will be very high, because each query request has to return more results. If you have more data and more shards, merging those results will also be inefficient. Elasticsearch additionally caches the filter context, which costs yet more memory. You will have to find out by trial and error how much you can handle. If you receive many requests in a small window, you should issue multiple queries for anything over 10k and merge them yourself in your code, which should take less application memory than increasing the window size.
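One way to read that suggestion (this is an assumption, not something stated in the original answer) is to split one large result set into several requests that each stay under the 10k window by filtering on a range field, and concatenate the hits in your application; the field name created_at below is made up:
GET my_index/_search
{
  "size": 10000,
  "sort": [ { "created_at": "asc" } ],
  "query": { "range": { "created_at": { "lt": "2016-06-01" } } }
}
GET my_index/_search
{
  "size": 10000,
  "sort": [ { "created_at": "asc" } ],
  "query": { "range": { "created_at": { "gte": "2016-06-01" } } }
}
Each slice has to match fewer than 10,000 documents for this to work, so you may need to pick the range boundaries from the data itself.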
2) It does not seem that the scrolling API is an option for me but that I have to increase "index.max_result_window". Does anyone have any experience with this?
--> You can define this value in index templates. The template will apply to new indices only, so you either have to delete the old indices after creating the template or wait for new data to be ingested into Elasticsearch.
{
  "order": 1,
  "template": "index_template*",
  "settings": {
    "index.number_of_replicas": "0",
    "index.number_of_shards": "1",
    "index.max_result_window": 2147483647
  }
}
In my case it looks like reducing the results via the from & size parameters in the query removes the error, as we don't need all the results:
GET widgets_development/_search
{
  "from" : 0,
  "size": 5,
  "query": {
    "bool": {}
  },
  "sort": {
    "col_one": "asc"
  }
}

Is there a way to "escape" ElasticSearch stop words?

I am fairly new to ElasticSearch and have a question on stop words. I have an index that contains state names for the USA, e.g. New York/NY, California/CA, Oregon/OR. I believe Oregon's abbreviation, 'OR', is a stop word, so when I insert the state data into the index, I cannot search on 'OR'. Is there a way I can set up custom stop words for this, or am I doing something wrong?
Here is how I am building the index:
curl -XPUT http://localhost:9200/test/state/1 -d '{"stateName": ["California","CA"]}'
curl -XPUT http://localhost:9200/test/state/2 -d '{"stateName": ["New York","NY"]}'
curl -XPUT http://localhost:9200/test/state/3 -d '{"stateName": ["Oregon","OR"]}'
A search for 'NY', works fine. Ex:
curl -XGET 'http://localhost:9200/test/state/_search?pretty=1' -d '
{
    "query" : {
        "match" : {
            "stateName" : "NY"
        }
    }
}'
But a search for 'OR', returns zero hits:
curl -XGET 'http://localhost:9200/test/state/_search?pretty=1' -d '
{
    "query" : {
        "match" : {
            "stateName" : "OR"
        }
    }
}'
I believe this search returns no results because OR is a stop word, but I don't know how to work around this. Thanks for your help.
You can (and definitely should) control the way you index data by modifying your mapping according to your data and the way you want to search against it.
In your case I would disable stopwords for that specific field rather than modifying the stopword list, but you could do the latter too if you wish to. The point is that you're using the default mapping which is great to start with, but as you can see you need to tweak it depending on your needs.
For each field, you can specify what analyzer to use. An analyzer defines the way you split your text into tokens (tokenizer) that will be indexed and also additional changes you can make to each token (even remove or add new ones) using token filters.
You can specify your mapping either while creating your index or update it afterwards using the put mapping api (as long as the changes you make are backwards compatible).
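As an illustration of the first option (disabling stop words for that field), a sketch along these lines should work on Elasticsearch versions of that era; the analyzer name no_stop is made up, and on modern versions the field type would be text instead of string:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "no_stop": {
          "type": "standard",
          "stopwords": "_none_"
        }
      }
    }
  },
  "mappings": {
    "state": {
      "properties": {
        "stateName": {
          "type": "string",
          "analyzer": "no_stop"
        }
      }
    }
  }
}
With that mapping in place, re-indexing the three documents and repeating the match query for "OR" should return the Oregon document.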

Setting Elastic search limit to "unlimited"

How can I get all the results from Elasticsearch, as by default only 10 results are displayed? I have got a query like:
@data = Athlete.search :load => true do
  size 15
  query do
    boolean do
      must { string q, {:fields => ["name", "other_names", "nickname", "short_name"], :phrase_slop => 5} }
      unless conditions.blank?
        conditions.each do |condition|
          must { eval(condition) }
        end
      end
      unless excludes.blank?
        excludes.each do |exclude|
          must_not { eval(exclude) }
        end
      end
    end
  end
  sort do
    by '_score', "desc"
  end
end
I have set the limit to 15, but I want to make it unlimited so that I can get all the data.
I can't set a fixed limit, as my data keeps changing and I want to get all of it.
You can use the from and size parameters to page through all your data. This could be very slow depending on your data and how much is in the index.
http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
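For example, a plain from/size request for the third page of 15 results might look like this (index name and query are placeholders):
GET /my_index/_search
{
  "from": 30,
  "size": 15,
  "query": { "match_all": {} }
}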
Another approach is to first do a search with searchType: 'count', and then do a normal search with size set to results.count.
The advantage here is that it avoids depending on a magic number for UPPER_BOUND as suggested in this similar SO question, and avoids the extra overhead of building too large a priority queue, which Shay Banon describes here. It also lets you keep your results sorted, unlike scan.
The biggest disadvantage is that it requires two requests. Depending on your circumstance, this may be acceptable.
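A rough sketch of that two-request pattern in curl form (index name and query are placeholders; on newer Elasticsearch versions you would use the _count API or size: 0 instead of search_type=count):
# 1) ask only for the total number of matching documents
curl -XGET 'localhost:9200/my_index/_search?search_type=count' -d '
{
  "query" : { "match_all" : {} }
}'
# 2) repeat the query with size set to the hits.total value from step 1
curl -XGET 'localhost:9200/my_index/_search' -d '
{
  "size" : 12345,
  "query" : { "match_all" : {} }
}'
Here 12345 stands in for whatever hits.total came back in the first response.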
From the docs, "Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000". So my admittedly very ad-hoc solution is to just pass size: 10000 or 10,000 minus from if I use the from argument.
Note that, following Matt's comment below, the proper way to do this if you have a large number of documents is to use the scroll API. I have used this successfully, but only with the Python interface.
Use the scan search type, e.g.:
curl -XGET 'localhost:9200/_search?search_type=scan&scroll=10m&size=50' -d '
{
    "query" : {
        "match_all" : {}
    }
}'
see here
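The request above only opens the scroll (and with search_type=scan the first response contains no documents); subsequent batches are then pulled from the scroll endpoint, roughly like this, until the hits array comes back empty:
curl -XGET 'localhost:9200/_search/scroll' -d '
{
    "scroll" : "10m",
    "scroll_id" : "<_scroll_id from the previous response>"
}'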
You can use search_after to paginate, and the Point in Time API to avoid having your data change while you paginate. Example with elasticsearch-dsl for Python:
from typing import Any, List
from elasticsearch_dsl import Search
from elasticsearch_dsl.connections import connections
from elasticsearch_dsl.utils import AttrDict

# Set up paginated query with search_after and a fixed point_in_time
elasticsearch = connections.create_connection(hosts=[elastic_host])
pit = elasticsearch.open_point_in_time(index=MY_INDEX, keep_alive="3m")
pit_id = pit["id"]
query_size = 500
search_after = [0]
hits: List[AttrDict[str, Any]] = []
while query_size:
    if hits:
        search_after = hits[-1].meta.sort
    search = (
        Search()
        .extra(size=query_size)
        .extra(pit={"id": pit_id, "keep_alive": "5m"})
        .extra(search_after=search_after)
        .filter(filter_)
        .sort("url.keyword")  # Note you need a unique field to sort on or it may never advance
    )
    response = search.execute()
    hits = [hit for hit in response]
    pit_id = response.pit_id
    query_size = len(hits)
    for hit in hits:
        ...  # Do work with hits

ElasticSearch index unix timestamp

I have to index documents containing a 'time' field whose value is an integer representing the number of seconds since epoch (aka unix timestamp).
I've been reading ES docs and have found this:
http://www.elasticsearch.org/guide/reference/mapping/date-format.html
But it seems that if I want to submit unix timestamps and want them stored in a 'date' field (an integer field is not useful for me), I have only two options:
Implement my own date format
Convert to a supported format at the sender
Is there any other option I missed?
Thanks!
If you supply a mapping that tells ES the field is a date, it can use epoch millis as input. If you want ES to auto-detect the field as a date, you'll have to provide ISO 8601 or another discoverable format.
Update: I should also note that you can influence what strings ES will recognize as dates in your mapping. http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html
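For example, on Elasticsearch 2.x and later you can declare the field's format as epoch_second so that plain unix timestamps (in seconds) are accepted directly; a hedged sketch, with made-up index and type names (mapping types were removed in 7.x, so drop the my_type level there):
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "time": {
          "type": "date",
          "format": "epoch_second"
        }
      }
    }
  }
}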
In case you want to use Kibana (which I expect) and visualize according to the time of a log entry, you will need at least one field to be a date field.
Please note that you have to set the field to the date type BEFORE you put any data into the /index/type. Otherwise it will be stored as a long and cannot be changed afterwards.
Simple example that can be pasted into the marvel/sense plugin:
# Make sure the index isn't there
DELETE /logger
# Create the index
PUT /logger
# Add the mapping of properties to the document type `mem`
PUT /logger/_mapping/mem
{
  "mem": {
    "properties": {
      "timestamp": {
        "type": "date"
      },
      "free": {
        "type": "long"
      }
    }
  }
}
# Inspect the newly created mapping
GET /logger/_mapping/mem
Run each of these commands in sequence.
Generate free mem logs
Here is a simple script that echoes to your terminal and logs to your local Elasticsearch:
while (( 1==1 )); do memfree=$(free -b | tail -n 1 | tr -s ' ' ' ' | cut -d ' ' -f4); echo $memfree; curl -XPOST "localhost:9200/logger/mem" -d "{ \"timestamp\": $(date +%s%3N), \"free\": $memfree }"; sleep 1; done
Inspect the data in Elasticsearch
Paste this into your Marvel/Sense console:
GET /logger/mem/_search
Now you can move to Kibana and do some graphs. Kibana will autodetect your date field.
