elastic stack twitter sample tweets - elasticsearch

I am new to the Elastic Stack and not sure how to approach this problem. I have managed to get a live stream of tweets matching a specific keyword using the Logstash Twitter input plugin, but now I want a sample of real-time tweets with no keyword filter, i.e. just a percentage of all real-time tweets. I tried to find out how to do this but cannot find good documentation; I believe I need to use the GET statuses/sample Twitter API, but I don't see how to use it from the plugin. This is what I have for now:
input {
  twitter {
    consumer_key       => "consumer_key"
    consumer_secret    => "consumer_secret"
    oauth_token        => "token"
    oauth_token_secret => "secret"
    keywords           => ["something"]
    languages          => ["en"]
    full_tweet         => true
  }
}
output {
  elasticsearch {}
}
How would I get a sample of all tweets without specifying a keyword?
Thank you so much in advance.

Here's an example random_score query; this should solve your problem:
GET /twitter/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "random_score": {}
        }
      ]
    }
  }
}
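If you want the random sampling to be repeatable between calls, random_score also accepts a seed. This variant is just an illustration: on recent Elasticsearch versions the seed should be paired with a field such as _seq_no, while on older versions you would drop the field line and keep only the seed.
GET /twitter/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "random_score": {
            "seed": 42,
            "field": "_seq_no"
          }
        }
      ]
    }
  }
}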
Edit - adding a Logstash config that ingests a random sample of tweets as well:
input {
  twitter {
    consumer_key       => "consumer_key"
    consumer_secret    => "consumer_secret"
    oauth_token        => "token"
    oauth_token_secret => "secret"
    keywords           => ["something"]
    languages          => ["en"]
    full_tweet         => true
    use_samples        => true
  }
}
output {
  elasticsearch {}
}
From the plugin documentation for use_samples:
Returns a small random sample of all public statuses. The tweets returned by the default access level are the same, so if two different clients connect to this endpoint, they will see the same tweets. If set to true, the keywords, follows, locations, and languages options will be ignored. Default ⇒ false

Related

Does Logstash support Elasticsearch's _update_by_query?

Does the Elasticsearch output plugin support elasticsearch's _update_by_query?
https://www.elastic.co/guide/en/logstash/6.5/plugins-outputs-elasticsearch.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html
The elasticsearch output plugin can only make calls to the _bulk endpoint, i.e. using the Bulk API.
If you want to call the Update by Query API, you need to use the http output plugin and construct the query inside the event yourself. If you explain what you want to achieve, I can update my answer with some more details.
Note: There's an issue requesting this feature, but it's still open after two years.
UPDATE
So if your input event is {"cname":"wang", "cage":11} and you want to update by query all documents with "cname":"wang" to set "cage":11, your query needs to look like this:
POST your-index/_update_by_query
{
  "script": {
    "source": "ctx._source.cage = params.cage",
    "lang": "painless",
    "params": {
      "cage": 11
    }
  },
  "query": {
    "term": {
      "cname": "wang"
    }
  }
}
So your Logstash config should look like this (your input may vary but I used stdin for testing purposes):
input {
  stdin {
    codec => "json"
  }
}
filter {
  mutate {
    add_field => {
      "[script][lang]"         => "painless"
      "[script][source]"       => "ctx._source.cage = params.cage"
      "[script][params][cage]" => "%{cage}"
      "[query][term][cname]"   => "%{cname}"
    }
    remove_field => ["host", "@version", "@timestamp", "cname", "cage"]
  }
}
output {
  http {
    url         => "http://localhost:9200/index/doc/_update_by_query"
    http_method => "post"
    format      => "json"
  }
}
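With that filter in place, the JSON body that the http output should end up POSTing for the sample event looks roughly like this. Note that add_field and %{...} substitutions produce strings, so cage is sent as "11" rather than 11; add a conversion step if the numeric type matters to your script.
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.cage = params.cage",
    "params": {
      "cage": "11"
    }
  },
  "query": {
    "term": {
      "cname": "wang"
    }
  }
}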
The same result can be obtained with standard elasticsearch plugins:
input {
  elasticsearch {
    hosts    => "${ES_HOSTS}"
    user     => "${ES_USER}"
    password => "${ES_PWD}"
    index    => "<your index pattern>"
    size     => 500
    scroll   => "5m"
    docinfo  => true
  }
}
filter {
  ...
}
output {
  elasticsearch {
    hosts       => "${ES_HOSTS}"
    user        => "${ES_USER}"
    password    => "${ES_PWD}"
    action      => "update"
    document_id => "%{[@metadata][_id]}"
    index       => "%{[@metadata][_index]}"
  }
}
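The filter section above is intentionally left open; as a purely illustrative sketch for the cname/cage example, it could be as small as this (the hard-coded value is made up):
filter {
  # illustrative only: overwrite cage on every document returned by the input query
  mutate {
    replace => { "cage" => "11" }
  }
}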

CSV response from GET request in Elasticsearch

I am sending an HTTP GET request to an Elasticsearch server and I want the response to be in CSV format. In Solr we can specify wt=csv; is there any way to do this in Elasticsearch too?
My query is :
http://elasticServer/_search?q=RCE:"some date" OR VENDOR_NAME:"Anuj"&from=0&size=5&sort=@timestamp
After that, I want to force the server to return the response in CSV format.
By default, ES supports only two data formats: JSON and YAML. However, if you're open to using Logstash, you can achieve what you want very easily like this:
input {
  elasticsearch {
    hosts => ["localhost:9200"]
    query => 'RCE:"some date" OR VENDOR_NAME:"Anuj"'
    size  => 5
  }
}
filter {}
output {
  csv {
    fields => ["field1", "field2", "field3"]
    path   => "/path/to/data.csv"
  }
}
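The csv output writes one line per event containing the listed fields; if you need to change the separator or quoting, it also accepts a csv_options hash that is handed to Ruby's CSV library. The semicolon separator below is purely an example:
output {
  csv {
    fields      => ["field1", "field2", "field3"]
    path        => "/path/to/data.csv"
    csv_options => { "col_sep" => ";" }
  }
}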
Since the elasticsearch input uses scrolling, you cannot specify any sorting. So if sorting is really important to you, you can use the http_poller input instead of the elasticsearch one, like this:
input {
  http_poller {
    urls => {
      es => {
        method => get
        url => 'http://elasticServer/_search?q=RCE:"some date" OR VENDOR_NAME:"Anuj"&from=0&size=5&sort=@timestamp'
        headers => {
          Accept => "application/json"
        }
      }
    }
    codec => "json"
  }
}
filter {}
output {
  csv {
    fields => ["field1", "field2", "field3"]
    path   => "/path/to/data.csv"
  }
}
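One caveat I'd add (not part of the original answer): with http_poller the whole search response arrives as a single event, so you will typically want to split the hits into individual events before writing CSV. A minimal sketch, assuming the standard search response layout:
filter {
  # turn the single search-response event into one event per hit
  split {
    field => "[hits][hits]"
  }
}
The CSV fields would then be referenced through each hit's _source, e.g. as nested paths like [hits][hits][_source][field1], depending on your Logstash version.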
There is an Elasticsearch plugin on GitHub called Elasticsearch Data Format Plugin that should satisfy your requirements.

Logstash http_poller post giving Name may not be found error

I'm trying to use http_poller to fetch data from Elasticsearch and write it into another Elasticsearch cluster. While doing this, the ES query needs to be made as a POST request.
In the examples provided, I could not find the parameter that should be used to post the body; the documentation referred to the Manticore client from Ruby. Based on that, I have used the params parameter to post the body.
The http_poller component looks like this:
input {
  http_poller {
    urls => {
      some_other_service => {
        method => "POST"
        url => "http://localhost:9200/index-2016-03-26/_search"
        params => '"query": { "filtered": { "filter": { "bool": { "must": [ { "term": { "SERVERNAME": "SERVER1" }}, {"range": { "eventtime": { "gte": "26/Mar/2016:13:00:00" }}} ]}}} }"'
      }
    }
    # Maximum amount of time to wait for a request to complete
    request_timeout => 300
    # How far apart requests should be
    interval => 300
    # Decode the results as JSON
    codec => "json"
    # Store metadata about the request in this key
    metadata_target => "http_poller_metadata"
  }
}
output {
  stdout {
    codec => json
  }
}
When I execute this, Logstash gives an error:
Error: Name may not be null {:level=>:error}
Any help is appreciated.
My guess is that params really needs to be key-value pairs, but then the question is how to post a query body using Logstash.
I referred to this link to get the available options for the HTTP client:
https://github.com/cheald/manticore/blob/master/lib/manticore/client.rb
I found the answer by trying different options, so I thought I would share the solution as well.
Replace params with body in the above payload.
The correct configuration for doing a POST with http_poller is:
input {
  http_poller {
    urls => {
      some_other_service => {
        method => "POST"
        url => "http://localhost:9200/index-2016-03-26/_search"
        body => '{ "query": { "filtered": { "filter": { "bool": { "must": [ { "term": { "SERVERNAME": "SERVER1" }}, { "range": { "eventtime": { "gte": "26/Mar/2016:13:00:00" }}} ]}}}}}'
      }
    }
    # Maximum amount of time to wait for a request to complete
    request_timeout => 300
    # How far apart requests should be
    interval => 300
    # Decode the results as JSON
    codec => "json"
    # Store metadata about the request in this key
    metadata_target => "http_poller_metadata"
  }
}
output {
  stdout {
    codec => json
  }
}

Query Mongo Embedded Documents with a size

I have a ruby on rails app using Mongoid and MongoDB v2.4.6.
I have the following MongoDB structure, a record which embeds_many fragments:
{
  "_id" : "76561198045636214",
  "fragments" : [
    {
      "id" : 76561198045636215,
      "source_id" : "source1"
    },
    {
      "id" : 76561198045636216,
      "source_id" : "source2"
    },
    {
      "id" : 76561198045636217,
      "source_id" : "source2"
    }
  ]
}
I am trying to find all records in the database that contain fragments with duplicate source_ids.
I'm pretty sure I need to use $elemMatch as I need to query embedded documents.
I have tried
Record.elem_match(fragments: {source_id: 'source2'})
which works but doesn't restrict to duplicates.
I then tried
Record.elem_match(fragments: {source_id: 'source2', :source_id.with_size => 2})
which returns no results (but is a valid query). The query Mongoid produces is:
selector: {"fragments"=>{"$elemMatch"=>{:source_id=>"source2", "source_id"=>{"$size"=>2}}}}
Once that works, I need to change it so that $size is > 1.
Is this possible? It feels like I'm very close. This is a one-off cleanup operation so query performance isn't too much of an issue (however we do have millions of records to update!)
Any help is much appreciated!
I have been able to achieve the desired outcome, but in testing it's far too slow (it would take many weeks to run across our production system). The problem is the double query per record (we have ~30 million records in production).
Record.where('fragments.source_id' => 'source2').each do |record|
  query = record.fragments.where(source_id: 'source2')
  if query.count > 1
    # contains duplicates, delete all but latest
    query.desc(:updated_at).skip(1).delete_all
  end
  # needed to trigger after_save filters
  record.save!
end
The problem with the current approach is that the standard MongoDB query forms do not actually "filter" the nested array documents in any way, which is essentially what you need in order to "find the duplicates" within your documents.
For this, MongoDB provides the aggregation framework as probably the best approach to finding this. There is no direct "mongoid" style approach to the queries as those are geared towards the existing "rails" style of dealing with relational documents.
You can access the "moped" form, though, through the .collection accessor on your class model:
Record.collection.aggregate([
  # Find arrays of two elements or more as candidates
  { "$match" => {
    "$and" => [
      { "fragments" => { "$not" => { "$size" => 0 } } },
      { "fragments" => { "$not" => { "$size" => 1 } } }
    ]
  }},
  # Unwind the arrays to "de-normalize" as documents
  { "$unwind" => "$fragments" },
  # Group back and get counts of the "key" values
  { "$group" => {
    "_id"       => { "_id" => "$_id", "source_id" => "$fragments.source_id" },
    "fragments" => { "$push" => "$fragments.id" },
    "count"     => { "$sum" => 1 }
  }},
  # Match the keys found more than once
  { "$match" => { "count" => { "$gte" => 2 } } }
])
That would return you results like this:
{
  "_id" : { "_id": "76561198045636214", "source_id": "source2" },
  "fragments": ["76561198045636216", "76561198045636217"],
  "count": 2
}
That at least gives you something to work with on how to deal with the "duplicates" here.
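For example (this part is my sketch, not from the original answer, and it assumes fragment ids are unique within a record), you could feed that aggregation output into a $pull update that keeps the first fragment id of each duplicate group and removes the rest. The initial $match is condensed here into a "fragments.1 exists" check, which is equivalent to requiring two or more elements:
# Sketch only: delete all but the first fragment in each duplicate group.
# Depending on your Mongoid/driver version the update call may be
# `update_one` instead of `update`.
Record.collection.aggregate([
  { "$match" => { "fragments.1" => { "$exists" => true } } },   # 2+ fragments
  { "$unwind" => "$fragments" },
  { "$group" => {
    "_id"       => { "_id" => "$_id", "source_id" => "$fragments.source_id" },
    "fragments" => { "$push" => "$fragments.id" },
    "count"     => { "$sum" => 1 }
  }},
  { "$match" => { "count" => { "$gte" => 2 } } }
]).each do |group|
  _keep, *remove = group["fragments"]                           # keep the first id
  Record.collection.find("_id" => group["_id"]["_id"]).update(
    "$pull" => { "fragments" => { "id" => { "$in" => remove } } }
  )
end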

Selectively turn off stop words in Elastic Search

So I would like to turn off stop word filtering on the username, title, and tags fields but not the description field.
As you can imagine, I do not want to filter out a result called "the best", but I do want to stop "the" from affecting the score if it is in the description field (search for "the" on GitHub if you want an example).
Now @javanna says (in Is there a way to "escape" ElasticSearch stop words?):
In your case I would disable stopwords for that specific field rather than modifying the stopword list, but you could do the latter too if you wish to.
That answer does not provide an example, so I searched around and tried the common terms query: http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/ which didn't work for me either.
I then searched specifically for how to stop the filtering of stop words, but the closest I have come is disabling it index-wide (Can I customize Elastic Search to use my own Stop Word list?) by attacking the analyzer directly; failing that, the documentation hints at making my own analyzer :/.
What is the best way to selectively disable stop words on certain fields?
I think you already know what to do, which would be to customize your analyzers for certain fields. From what I understand you did not manage to create a valid syntax example for that. This is what we used in a project; I hope this example points you in the right direction:
{
  :settings => {
    :analysis => {
      :analyzer => {
        :analyzer_umlauts => {
          :tokenizer => "standard",
          :char_filter => ["filter_umlaut_mapping"],
          :filter => ["standard", "lowercase"],
        }
      },
      :char_filter => {
        :filter_umlaut_mapping => {
          :type => 'mapping',
          :mappings_path => es_config_file("char_mapping")
        }
      }
    }
  },
  :mappings => {
    :company => {
      :properties => {
        [...]
        :postal_city => { :type => "string", :analyzer => "analyzer_umlauts", :omit_norms => true, :omit_term_freq_and_positions => true, :include_in_all => false },
      }
    }
  }
}
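Applying the same pattern to the stop word case from the question, a minimal sketch could look like this. The field names are taken from the question; the analyzer names and the :item type are made up, and it is the built-in stop token filter that actually removes stop words, so the fields that omit it keep words like "the":
{
  :settings => {
    :analysis => {
      :analyzer => {
        :analyzer_keep_stopwords => {
          :tokenizer => "standard",
          :filter => ["standard", "lowercase"]
        },
        :analyzer_drop_stopwords => {
          :tokenizer => "standard",
          :filter => ["standard", "lowercase", "stop"]
        }
      }
    }
  },
  :mappings => {
    :item => {
      :properties => {
        :username    => { :type => "string", :analyzer => "analyzer_keep_stopwords" },
        :title       => { :type => "string", :analyzer => "analyzer_keep_stopwords" },
        :tags        => { :type => "string", :analyzer => "analyzer_keep_stopwords" },
        :description => { :type => "string", :analyzer => "analyzer_drop_stopwords" }
      }
    }
  }
}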
