Multi search gives different total hits in different runs - Elasticsearch

I'm using Elasticsearch 6.8 and Python 3.
I'm running on my laptop with a single node, and there are no threads/processes that insert/update/delete docs in the index while I'm running the multi search.
I'm running the following multi search command:
import json
from elasticsearch import Elasticsearch
es = Elasticsearch()
search_arr = []
# search-1
search_arr.append({'index': 'test1', 'type': 'type1'})
search_arr.append({"query": {"term": {"confidence": "1"}}})
# search-2
search_arr.append({'index': 'test1', 'type': 'type1'})
search_arr.append({"query": {"match_all": {}}, 'from': 0, 'size': 2})
request = ''
for each in search_arr:
    request += '%s \n' % json.dumps(each)
res = es.msearch(body=request)
print("First Query, num of results = ", res['responses'][0]['hits']['total'])
print("Second Query, num of results = ", res['responses'][1]['hits']['total'])
Each time I run this code I get different results (as I wrote above, there are no processes that insert/delete/update documents).
Why am I getting different results each time?
And what do I need to do to get consistent results?

I found the cause of the problem.
I needed to pass refresh=True after the first time I added new data.
(I bulk-indexed thousands of new documents and ran the multi search right afterwards.)
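For reference, this is roughly what the fix looks like with the Python bulk helper; a minimal sketch, assuming a bulk load like mine (the example docs below are just placeholders):
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch()
# Placeholder documents; in my case these came from a much larger batch.
docs = [{"confidence": "1"}, {"confidence": "0"}]
actions = ({"_index": "test1", "_type": "type1", "_source": doc} for doc in docs)
# refresh=True forces an index refresh once the bulk request finishes,
# so the newly indexed documents are visible to the msearch that follows.
helpers.bulk(es, actions, refresh=True)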

Related

ArangoDB: How to run 2 queries in parallel in community edition

Hi, I have written the two queries below and would like to run them in parallel, not sequentially. Is it possible to execute them in parallel in the community edition of ArangoDB?
FOR d IN Transaction
  FILTER d._to == "Account/123"
  COLLECT AGGREGATE length = COUNT_UNIQUE(d._id),
                    totamnt = SUM(d.Amount),
                    daysactive = COUNT_UNIQUE(DATE_TRUNC(d.Time, "day"))
  RETURN {
    "Incoming Accounts": length,
    "Days Active": LENGTH(daysactive),
    "Total Amount": totamnt
  }
FOR d IN Transaction
  FILTER d._from == "Account/123"
  COLLECT AGGREGATE length = COUNT_UNIQUE(d._id),
                    totamnt = SUM(d.Amount),
                    daysactive = COUNT_UNIQUE(DATE_TRUNC(d.Time, "day"))
  RETURN {
    "Outgoing Accounts": length,
    "Days Active": LENGTH(daysactive),
    "Total Amount": totamnt
  }
Of course it is possible to run multiple requests in parallel. Just fire 2 curl calls to _api/cursor, or use 2 different arangosh shells.
Or run 2 curl calls in the same shell and use the x-arango-async header for each request to retrieve the result asynchronously, as documented here: https://www.arangodb.com/docs/stable/http/async-results-management.html#async-execution-and-later-result-retrieval
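A rough sketch of the async variant over HTTP using Python's requests library; the server URL, credentials and database name are placeholder assumptions, and the job API behaves as described in the linked docs:
import time
import requests
BASE = "http://localhost:8529/_db/_system"   # placeholder server/database
AUTH = ("root", "password")                  # placeholder credentials
INCOMING_AQL = "..."   # paste the first AQL query from the question here
OUTGOING_AQL = "..."   # paste the second AQL query from the question here
def submit(query):
    # "x-arango-async: store" makes the server run the query in the background
    # and keep the result so it can be fetched later.
    r = requests.post(BASE + "/_api/cursor",
                      json={"query": query},
                      headers={"x-arango-async": "store"},
                      auth=AUTH)
    return r.headers["x-arango-async-id"]    # job id for later retrieval
incoming_job = submit(INCOMING_AQL)
outgoing_job = submit(OUTGOING_AQL)
def fetch(job_id):
    # PUT /_api/job/<id> returns the stored result once the job has finished;
    # a 204 status means the job is still running.
    while True:
        r = requests.put(BASE + "/_api/job/" + job_id, auth=AUTH)
        if r.status_code != 204:
            return r.json()
        time.sleep(0.1)
print(fetch(incoming_job))
print(fetch(outgoing_job))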

Can the GA4 API fetch data for a combination of minute, region, and sessions?

Problem
With UA, I was able to get the number of sessions per region per minute (a combination of minute, region, and sessions), but is this not possible with GA4?
If not, is there any plan to support this in the future?
Detail
I ran GA4 Query Explorer with date, hour, minute, region in Dimensions and sessions in Metrics.
But I got an incompatibility error.
What I tried
I checked the GA4 Dimensions & Metrics Explorer and confirmed that the combination of minute and region is not possible.
(Updated 2022/05/16 15:35) Checked by code execution
I ran it with Ruby.
require "google/analytics/data/v1beta/analytics_data"
require 'pp'
require 'json'
ENV['GOOGLE_APPLICATION_CREDENTIALS'] = '' # service acount file path
client = ::Google::Analytics::Data::V1beta::AnalyticsData::Client.new
LIMIT_SIZE = 1000
offset = 0
loop do
request = Google::Analytics::Data::V1beta::RunReportRequest.new(
property: "properties/xxxxxxxxx",
date_ranges: [
{ start_date: '2022-04-01', end_date: '2022-04-30'}
],
dimensions: %w(date hour minute region).map { |d| { name: d } },
metrics: %w(sessions).map { |m| { name: m } },
keep_empty_rows: false,
offset: offset,
limit: LIMIT_SIZE
)
ret = client.run_report(request)
dimension_headers = ret.dimension_headers.map(&:name)
metric_headers = ret.metric_headers.map(&:name)
puts (dimension_headers + metric_headers).join(',')
ret.rows.each do |row|
puts (row.dimension_values.map(&:value) + row.metric_values.map(&:value)).join(',')
end
offset += LIMIT_SIZE
break if ret.row_count <= offset
end
The result was an error.
3:The dimensions and metrics are incompatible.. debug_error_string:{"created":"#1652681913.393028000","description":"Error received from peer ipv4:172.217.175.234:443","file":"src/core/lib/surface/call.cc","file_line":953,"grpc_message":"The dimensions and metrics are incompatible.","grpc_status":3}
There is an error in your code: make sure you use the actual API dimension name and not the UI name. The correct name of that dimension is dateHourMinute, not "Date hour and minute":
dimensions: %w(dateHourMinute).map { |d| { name: d } },
The Query Explorer returns this request just fine.
Limited use of the region dimension
As for region: as the error message states, the dimensions and metrics are incompatible. The issue is that dateHourMinute cannot be used with region. Switch to date or dateHour.
At the time of writing this is a beta API. I have sent a message off to Google to find out whether this is working as intended or whether it may be changed.
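The question code is Ruby, but for what it's worth, here is a rough sketch of the compatible combination (dateHour + region + sessions, per the answer above) using the Python google-analytics-data client; the property ID and date range are placeholders:
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange, Dimension, Metric, RunReportRequest
client = BetaAnalyticsDataClient()  # uses GOOGLE_APPLICATION_CREDENTIALS
# dateHour + region is accepted; dateHourMinute + region is rejected as incompatible.
request = RunReportRequest(
    property="properties/xxxxxxxxx",  # placeholder property ID
    date_ranges=[DateRange(start_date="2022-04-01", end_date="2022-04-30")],
    dimensions=[Dimension(name="dateHour"), Dimension(name="region")],
    metrics=[Metric(name="sessions")],
)
response = client.run_report(request)
for row in response.rows:
    print([v.value for v in row.dimension_values] + [v.value for v in row.metric_values])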

"Limit of total fields [1000]" crossed - Elasticsearch exception

My Elasticsearch index has more than 1000 fields due to my SQL schema, and I get the exception below:
{'type': 'illegal_argument_exception', 'reason': 'Limit of total fields [1000] in index }
And my bulk insert looks like this:
# Setup assumed for completeness; the original snippet did not show it:
from elasticsearch import Elasticsearch, helpers
from elasticsearch.helpers import BulkIndexError
import json
es = Elasticsearch()
BATCHSIZE = 1000          # assumed batch size
batch = []
dict = {}
tempdict = {}
counter = 0
insertedrecords = 0
with open('audit1.txt') as file:
    for line in file:
        columns = line.split(r'||')
        dict['TimeStamp'] = columns[0].strip('\'')
        dict['BusinessTimeStamp'] = columns[1].strip('\'')
        dict['RuntimeMicroflowID'] = columns[2].strip('\'')
        dict['MicroflowID'] = columns[3].strip('\'')
        dict['UserId'] = columns[4].strip('\'')
        dict['ClientId'] = columns[5].strip('\'')
        dict['Userlocation'] = columns[6].strip('\'')
        dict['Transactionid'] = columns[7].strip('\'')
        dict['Catagorie'] = columns[8].strip('\'')
        dict['EventType'] = columns[9].strip('\'')
        dict['Operation'] = columns[10].strip('\'')
        dict['PrimaryData'] = columns[11].strip('\'')
        dict['SecondayData'] = columns[12].strip('\'')
        i = 13
        while i < len(columns):
            tempdict['BFOLDVALUE'] = columns[i + 1].strip('\'')
            tempdict['BFNEWVALUE'] = columns[i + 2].strip('\'')
            if columns[i].strip('\'') is not None:
                dict[columns[i].strip('\'')] = tempdict.copy()
            i += 3
            tempdict.clear()
        # print(json.dumps(dict, indent=4))
        batch.append(dict)
        if counter == BATCHSIZE:
            try:
                helpers.bulk(es, batch, index='audit-index', doc_type='audit')
                insertedrecords += counter
                counter = 0
                batch.clear()
                print(insertedrecords, " - Records Has Been inserted ")
            except BulkIndexError:
                print("Error Occured -- continuing")
                print(json.dumps(dict, indent=4))
                print(BulkIndexError)
                batch.clear()
                break
        counter += 1
        dict.clear()
So I am assuming I am indexing this the wrong way... is there a better way of indexing this kind of format in Elasticsearch? Note that I am using ELK version 7.5.
Here is a sample of the file I am parsing into Elasticsearch:
2018.07.17/15:41:53.735||2018.07.17/15:41:53.735||'0164a8424fbbp84h%2139165'||'BT_TTB_CashDep_PRC'||'eskedarz'||'UXP'||'00001039'||'0164a842e519pJpA'||'Persistence'||''||'CREATE'||'DailyTxns'||'0164a842e4eapJnu'||'CurrentThread'||'WebContainer : 15'||''||'ParentThread'||'system'||''||'TCPWorkerThreadID'||'WebContainer : 15'||''||'f_POSTINGDT'||'2018-07-17'||''||'versionNum'||'0'||''||'f_TXNAMTDR'||'0'||''||'f_ACCOUNTID'||'013XXXXXXXXX0'||''||'f_VALUEDTTM'||'2018-07-17 15:41:53.0'||''||'f_POSTINGDTTM'||'2018-07-17 15:41:53.692'||''||'f_TXNCLBAL'||'25551.610000'||''||'f_TXNREF'||'0000103917071815410685326'||''||'f_PIEVENTTYPE'||'N'||''||'f_TXNAMT'||'5000.00'||''||'f_TRANSACTIONID'||'0164a842e4e9pJng'||''||'f_TYPE'||'N'||''||'f_USERID'||'xxxarz'||''||'f_SRNO'||'1'||''||'f_TXNBASEEQ'||'5000.00'||''||'f_TXNSRCBRANCH'||'0000X039'||''||'f_TXNCODE'||'T08'||''||'f_CHANNELID'||'BranchTeller'||''||'f_TXNAMTCR'||'5000.00'||''||'f_TXNNARRATION'||'SELF '||''||'f_ISACCRUALPENDING'||'false'||''||'f_TXNDTTM'||'2018-07-17 15:41:53.689'||''
If you look carefully at this part of the error message, it becomes clear:
Limit of total fields [1000] in index
1000 is the default limit on the total number of fields in an Elasticsearch index, as shown in the source code:
public static final Setting<Long> INDEX_MAPPING_TOTAL_FIELDS_LIMIT_SETTING =
    Setting.longSetting("index.mapping.total_fields.limit", 1000L, 0, Property.Dynamic, Property.IndexScope);
Please note this is a dynamic setting, so it can be changed on an existing index by updating the index settings:
PUT test_index/_settings
{
  "index.mapping.total_fields.limit": 1500
}
(Change 1500 to whatever is suitable for your index.)
More info on this issue can be found here and here.
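Since the question uses the Python client, the same dynamic setting can also be updated from Python; a minimal sketch, assuming the audit-index from the question:
from elasticsearch import Elasticsearch
es = Elasticsearch()
# Raise the dynamic per-index limit; pick a value that fits your schema.
es.indices.put_settings(
    index='audit-index',
    body={"index.mapping.total_fields.limit": 1500},
)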
A better way to handle such an exploding index is to normalize it, RDBMS-style: store some of the key : value combinations in a nested structure.
For example, turn a record like
{"keyA": "ValueA", "keyB": "ValueB", "keyC": "ValueC", ...}
into
{"keyA": "ValueA", "Keyvalue": {"keyB": "ValueB", "keyC": "ValueC"}}
so a search would look like Keyvalue.key == keyB and Keyvalue.value == ValueB.
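One concrete way to realize this (my interpretation, not verbatim from the answer) is a single nested field of key/value pairs; a minimal sketch with the Python client, using illustrative index and field names:
from elasticsearch import Elasticsearch
es = Elasticsearch()
# One generic nested field holds the variable audit fields instead of
# thousands of distinct top-level fields.
es.indices.create(index="audit-index-nested", body={
    "mappings": {
        "properties": {
            "TimeStamp": {"type": "keyword"},
            "Keyvalue": {
                "type": "nested",
                "properties": {
                    "key": {"type": "keyword"},
                    "value": {"type": "keyword"},
                },
            },
        }
    }
})
# Example document: the variable key/value columns become an array of pairs.
es.index(index="audit-index-nested", body={
    "TimeStamp": "2018.07.17/15:41:53.735",
    "Keyvalue": [
        {"key": "f_TXNCODE", "value": "T08"},
        {"key": "f_CHANNELID", "value": "BranchTeller"},
    ],
})
# Find documents where the pair (f_TXNCODE, T08) occurs together.
es.search(index="audit-index-nested", body={
    "query": {
        "nested": {
            "path": "Keyvalue",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"Keyvalue.key": "f_TXNCODE"}},
                        {"term": {"Keyvalue.value": "T08"}},
                    ]
                }
            }
        }
    }
})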

Twitter API Hashtag Search Results don't contain images

I am not seeing image entities in my Twitter API search results when I search for a hashtag, but I do see them when I use the API endpoint for that specific tweet.
Using this tweet for an example: https://twitter.com/mrbuddylee/status/733407581788463104
results = #twitter_client.search('#BecauseSummer', { include_entities: true, count: 200 })
result = results.first # not actually the first result, but just to illustrate.
result.to_h[:entities]
=> {:hashtags=>[{:text=>"BecauseSummer", :indices=>[7, 21]}],
:symbols=>[], :user_mentions=>[],
:urls=>[{:url=>"TWITTER_SHORTENED_URL", :expanded_url=>"http://twitter.com/mrbuddylee/status/733407581788463104/photo/1", :display_url=>"pic.twitter.com/FAAY00SYQH", :indices=>[22, 45]}]}
But if I look up the tweet directly:
#twitterclient.status(733407581788463104).to_h[:entities]
=> {:hashtags=>[{:text=>"BecauseSummer", :indices=>[7, 21]}],
:symbols=>[], :user_mentions=>[], :urls=>[],
:media=>[{:id=>733407573345341441, :id_str=>"733407573345341441", :indices=>[22, 45], :media_url=>"http://pbs.twimg.com/tweet_video_thumb/Ci2WTVzUkAE_gGD.jpg", :media_url_https=>"https://pbs.twimg.com/tweet_video_thumb/Ci2WTVzUkAE_gGD.jpg", :url=>"TWITTER_SHORTENED_URL", :display_url=>"pic.twitter.com/FAAY00SYQH", :expanded_url=>"http://twitter.com/mrbuddylee/status/733407581788463104/photo/1", :type=>"photo", :sizes=>{:small=>{:w=>340, :h=>173, :resize=>"fit"}, :thumb=>{:w=>150, :h=>150, :resize=>"crop"}, :medium=>{:w=>392, :h=>200, :resize=>"fit"}, :large=>{:w=>392, :h=>200, :resize=>"fit"}}}]}
Notice the media hash on the second result.
Why is this? Is it possible to get the media url on the initial search request?
I ended up doing a double query to make sure the images are returned:
results = #twitter_client.search('#BecauseSummer', { include_entities: true, count: 100 })
ids = results.first(100).map(&:id)
results = #twitter_client.statuses(ids)

Carrot2 circle chart

Does anyone know how to create a circle chart like the one used in Carrot2?
The mbostock/d3 gallery has good visualizations for Carrot2 output.
The carrot2-rb Ruby client for Carrot2 returns an object with a clusters array. The scores and phrases attributes can be used in a simple doughnut chart.
More dynamic visualizations like expandable dendrograms are possible with tree structures like flare.json.
Here is a zoomable wheel based on Carrot2 results.
This is the CoffeeScript code I wrote to create flare.json from the documents elements.
clusters = [{"id":0,"size":3,"phrases":["Coupon"],"score":0.06441151442396735,"documents":["0","1","2"],"attributes":{"score":0.06441151442396735}},{"id":1,"size":2,"phrases":["Exclusive"],"score":0.7044284368639101,"documents":["0","1"],"attributes":{"score":0.7044284368639101}},{"id":2,"size":1,"phrases":["Other Topics"],"score":0.0,"documents":["3"],"attributes":{"other-topics":true,"score":0.0}}]
flare = get_flare clusters
get_children = (index, index2, clusters, documents) ->unless index == (clusters.length - 1) # If not last cluster
orphans = {'name': ''}
intr = _.intersection(documents, clusters[index2].documents);
if intr.length > 0 # continue drilling
if index2 < (clusters.length - 1) # Up until last element.
# Get next layer of orphans
orphan_docs = _.difference(intr, clusters[index2 + 1].documents)
if orphan_docs.length > 0
orphans = {'name': orphan_docs, 'size': orphan_docs.length}
if _.intersection(intr, clusters[index2 + 1].documents).length > 0
return [orphans, {'name': clusters[index2+1].phrases[0], 'children': get_children(index, (index2 + 1), clusters, intr)}]
else
return [orphans]
else
# At second to last cluster, so terminate here
return [{'name': inter}]
else # No intersection, so return bundle of current documents.
return [{'name': documents}]
return [{'name': _.intersection(clusters[index].documents, clusters[index2].documents)}]
get_flare = (clusters) ->
# Make root object
flare =
name: "root"
children: []
children = flare.children
_.each(clusters[0..(clusters.length - 2)], (cluster, index) -> # All clusters but the last. (It has already been compared to previous ones)
#All documents for all remaining clusters in array
remaining_documents = _.flatten(_.map clusters[(index + 1)..clusters.length], (c) ->
c.documents
)
root_child = {'name': cluster.phrases[0], 'children': []}
# Get first layer of orphans
orphan_docs = _.difference(cluster.documents, remaining_documents)
if orphan_docs.length > 0
root_child.children.push {'name': orphan_docs, size: orphan_docs.length}
for index2 in [(index + 1)..(clusters.length - 1)] by 1
if _.intersection(cluster.documents, clusters[index2].documents).length > 0
root_child.children.push {'name': clusters[index2].phrases[0], 'children': get_children(index, (index2), clusters, cluster.documents)}
children.push root_child
)
flare
You can buy their Circles JavaScript component: http://carrotsearch.com/circles-overview
