Access individual fields using Elasticsearch DSL in Python

Is the below accurate, or should it be something else?
I am getting the expected results; I am just checking whether this is the most efficient way to access individual (nested) fields.
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
import json

client = Elasticsearch('my_server')
policy_number = 'POLICY1234'
s = Search(using=client, index="my_index").query("term", policyNumber=policy_number.lower())
es_response = s.execute()
for hits in es_response:
    print(hits['policyNumber'])
    print(hits.party[0]['fullName'])
    print(hits.party[0].partyAddress[0]['address1'])
    print(hits.party[0].partyAddress[0]['city'])
    print(hits.party[0].phoneList[0]['phoneNumber'])

You don't need to call execute manually, and you don't have to use [] to access fields by name; you can just use attribute access:
for hit in s:
    print(hit.policyNumber)
    print(hit.party[0].fullName)
    print(hit.party[0].partyAddress[0].address1)
    print(hit.party[0].partyAddress[0].city)
    print(hit.party[0].phoneList[0].phoneNumber)

Related

Use an IronPython script to filter and pass filter selections between tables

I have two tables in the analysis. I am using the script below to be able to filter table A and pass those filter selections to the matching filter in table B. Tables A and B are visualized in a bar chart. I am triggering the code when the value of a document property changes, following the instructions here.
I am running into two problems.
1) After the script runs, clicking Reset All Filters results in only table A being displayed in the visualization. Clicking Reset All Filters again fixes the issue.
2) When I add a second filter (commented out in the code below), making a selection in the Type_A or Type_B filter wipes out the type B data from the visualization. I think the problem is in how IncludeAllValues is being handled, but I don't know how to fix it. Any help will be appreciated.
from Spotfire.Dxp.Application.Filters import *
from Spotfire.Dxp.Application.Visuals import VisualContent
from System import Guid
#Get the active page and filterPanel
page = Application.Document.ActivePageReference
filterPanel = page.FilterPanel
theFilterA = filterPanel.TableGroups[0].GetFilter("Type_A")
lbFilterA = theFilterA.FilterReference.As[ListBoxFilter]()
theFilter2A = filterPanel.TableGroups[1].GetFilter("Type_A")
lb2FilterA = theFilter2A.FilterReference.As[ListBoxFilter]()
lb2FilterA.IncludeAllValues = False
lb2FilterA.SetSelection(lbFilterA.SelectedValues)
#########################Type_B###########################
# theFilterB = filterPanel.TableGroups[0].GetFilter("Type_B")
# lbFilterB = theFilterB.FilterReference.As[ListBoxFilter]()
# theFilter2B = filterPanel.TableGroups[1].GetFilter("Type_B")
# lb2FilterB = theFilter2B.FilterReference.As[ListBoxFilter]()
# lb2FilterB.IncludeAllValues = False
# lb2FilterB.SetSelection(lbFilterB.SelectedValues)

Check if data already exists before inserting into BigQuery table (using Python)

I am setting up a daily cron job that appends a row to a BigQuery table (using Python); however, duplicate data is being inserted. I have searched online and I know that there is a way to manually remove duplicate data, but I wanted to see if I could avoid this duplication in the first place.
Is there a way to check a BigQuery table to see if a data record already exists first in order to avoid inserting duplicate data? Thanks.
CODE SNIPPET:
import webapp2
import logging
import uuid
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

PROJECT_ID = 'foo'
DATASET_ID = 'bar'
TABLE_ID = 'foo_bar_table'

class UpdateTableHandler(webapp2.RequestHandler):
    def get(self):
        credentials = GoogleCredentials.get_application_default()
        bigquery_service = discovery.build('bigquery', 'v2', credentials=credentials)
        try:
            the_fruits = Stuff.query(Stuff.fruitTotal >= 5).filter(Stuff.fruitColor == 'orange').fetch()
            for fruit in the_fruits:
                # some code here
                basket = dict()
                basket['id'] = fruit.fruitId
                basket['Total'] = fruit.fruitTotal
                basket['PrimaryVitamin'] = fruit.fruitVitamin
                basket['SafeRaw'] = fruit.fruitEdibleRaw
                basket['Color'] = fruit.fruitColor
                basket['Country'] = fruit.fruitCountry
                body = {
                    'rows': [
                        {
                            'json': basket,
                            'insertId': str(uuid.uuid4())
                        }
                    ]
                }
                response = bigquery_service.tabledata().insertAll(projectId=PROJECT_ID,
                                                                  datasetId=DATASET_ID,
                                                                  tableId=TABLE_ID,
                                                                  body=body).execute(num_retries=5)
                logging.info(response)
        except Exception as e:
            logging.error(e)

app = webapp2.WSGIApplication([
    ('/update_table', UpdateTableHandler),
], debug=True)
The only way to test whether the data already exists is to run a query.
If you have lots of data in the table, that query could be expensive, so in most cases we suggest you go ahead and insert the duplicate, and then merge duplicates later on.
As Zig Mandel suggests in a comment, you can query over a date partition if you know the date when you expect to see the record, but that may still be expensive compared to inserting and removing duplicates.
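For the check-first option, here is a minimal sketch using the same discovery-based client as in the question; the id field, the helper name, and the use of a COUNT query are illustrative assumptions, not part of the original code:
def record_exists(bigquery_service, fruit_id):
    # Ask BigQuery whether a row with this id is already in the table.
    # Assumes the query finishes within the default jobs.query timeout.
    request_body = {
        'query': 'SELECT COUNT(1) AS n FROM `{0}.{1}.{2}` WHERE id = @id'.format(
            PROJECT_ID, DATASET_ID, TABLE_ID),
        'useLegacySql': False,
        'parameterMode': 'NAMED',
        'queryParameters': [{
            'name': 'id',
            'parameterType': {'type': 'STRING'},
            'parameterValue': {'value': fruit_id},
        }],
    }
    result = bigquery_service.jobs().query(
        projectId=PROJECT_ID, body=request_body).execute(num_retries=5)
    return int(result['rows'][0]['f'][0]['v']) > 0
If you instead go with inserting now and merging later, a periodic query that keeps only one row per id (for example via ROW_NUMBER() OVER (PARTITION BY id)) can rewrite the table without duplicates. Also note that rows added through the streaming API may take a moment to become visible to queries, so a check like this is best-effort rather than a strict guarantee.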

Elasticsearch: Remove duplicates from index

I have an index with multiple duplicate entries. They have different ids but the other fields have identical content.
For example:
{id: 1, content: 'content1'}
{id: 2, content: 'content1'}
{id: 3, content: 'content2'}
{id: 4, content: 'content2'}
After removing the duplicates:
{id: 1, content: 'content1'}
{id: 3, content: 'content2'}
Is there a way to delete all duplicates and keep only one distinct entry without manually comparing all entries?
This can be accomplished in several ways. Below I outline two possible approaches:
1) If you don't mind generating new _id values and reindexing all of the documents into a new collection, then you can use Logstash and the fingerprint filter to generate a unique fingerprint (hash) from the fields that you are trying to de-duplicate, and use this fingerprint as the _id for documents as they are written into the new collection. Since the _id field must be unique, any documents that have the same fingerprint will be written to the same _id and therefore deduplicated. (A minimal Python sketch of this idea appears after this answer.)
2) You can write a custom script that scrolls over your index. As each document is read, you can create a hash from the fields that you consider to define a unique document (in your case, the content field). Then use this hash as the key in a dictionary (aka hash table). The value associated with this key would be a list of the _ids of all documents that generate this same hash. Once you have all of the hashes and associated lists of _ids, you can execute a delete operation on all but one of the _ids that are associated with each identical hash. Note that this second approach does not require writing documents to a new index in order to de-duplicate, as you would delete documents directly from the original index.
I have written a blog post and code that demonstrates both of these approaches at the following URL: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/
Disclaimer: I am a Consulting Engineer at Elastic.
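A minimal Python sketch of approach 1 above, for anyone who would rather stay in Python than run Logstash; the host, index names, and the assumption that the content field alone identifies a duplicate are all illustrative:
import hashlib

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['localhost:9200'])  # assumed host

def reindex_with_fingerprint_ids(source_index='my_index', dest_index='my_index_deduped'):
    """Copy documents into a new index, using a hash of the de-duplication
    field(s) as the new _id, so identical documents collapse into one."""
    def actions():
        for hit in helpers.scan(es, index=source_index):
            content = hit['_source']['content']  # field(s) that define uniqueness
            fingerprint = hashlib.sha256(content.encode('utf-8')).hexdigest()
            yield {
                '_op_type': 'index',
                '_index': dest_index,
                '_id': fingerprint,
                '_source': hit['_source'],
            }
    helpers.bulk(es, actions())
Because every document with the same content hashes to the same _id, later copies simply overwrite earlier ones in the destination index.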
I use Rails, and if necessary I will import things with the FORCE=y command, which removes and re-indexes everything for that index and type; however, I'm not sure what environment you are running ES in. The only issue I can see is if the data source you are importing from (i.e. a database) has duplicate records. I guess I would first see whether the data source could be fixed, if that is feasible, and then re-index everything; otherwise you could try to create a custom import method that only indexes one of the duplicate items for each record.
Furthermore, I know this doesn't comply with your wanting to remove duplicate entries, but you could simply customize your search so that it only returns one of the duplicate ids, either by most recent "timestamp" or by indexing deduplicated data and grouping by your content field (a minimal sketch follows below) -- see if this post helps. Even though this would still retain the duplicate records in your index, at least they won't come up in the search results.
I also found this as well: Elasticsearch delete duplicates
I tried thinking of many possible scenarios for you to see if any of those options work or at least could be a temp fix.
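A rough sketch of the search-side option mentioned above, using field collapsing so that only one document per distinct value comes back; the host, index name, and the content.keyword sub-field are assumptions:
from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])  # assumed host

response = es.search(
    index='my_index',  # assumed index name
    body={
        'query': {'match_all': {}},
        'collapse': {'field': 'content.keyword'},  # one hit per distinct content value
    },
)
for hit in response['hits']['hits']:
    print(hit['_id'], hit['_source']['content'])
The duplicates remain in the index, but each distinct content value appears only once in the results.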
Here is a script I created based on Alexander Marquardt's answer.
import hashlib
from elasticsearch import Elasticsearch, helpers

ES_HOST = 'localhost:9200'
es = Elasticsearch([ES_HOST])

def scroll_over_all_docs(index_name='squad_docs'):
    dict_of_duplicate_docs = {}
    index_docs_count = es.cat.count(index_name, params={"format": "json"})
    total_docs = int(index_docs_count[0]['count'])
    count = 0
    for hit in helpers.scan(es, index=index_name):
        count += 1
        text = hit['_source']['text']
        id = hit['_id']
        hashed_text = hashlib.md5(text.encode('utf-8')).digest()
        dict_of_duplicate_docs.setdefault(hashed_text, []).append(id)
        if count % 100 == 0:
            print(f'Progress: {count} / {total_docs}')
    return dict_of_duplicate_docs

def delete_duplicates(duplicates, index_name='squad_docs'):
    for hash, ids in duplicates.items():
        if len(ids) > 1:
            print(f'Number of docs: {len(ids)}. Number of docs to delete: {len(ids) - 1}')
            for id in ids:
                if id == ids[0]:
                    continue
                res = es.delete(index=index_name, doc_type='_doc', id=id)
                id_deleted = res['_id']
                results = res['result']
                print(f'Document id {id_deleted} status: {results}')
            remaining_doc = es.get(index=index_name, doc_type='_all', id=ids[0])
            print('Remaining document:')
            print(remaining_doc)

def main():
    dict_of_duplicate_docs = scroll_over_all_docs()
    delete_duplicates(dict_of_duplicate_docs)

if __name__ == "__main__":
    main()

Sphinx search infix and exact words in different fields

I'm using Sphinx as a search engine, and I need to be able to search across different fields, using infix matching for one field and exact word matching for another.
Simple example:
My source has the value "abcdef" for field_1 and the value "12345" for field_2. What I need to accomplish is to be able to search by infix in field_1 and by exact word in field_2, so a search like "cde 12345" would return the doc I mentioned.
Previously, when using Sphinx v2.0.4, I was able to obtain these results just by defining infix_fields/prefix_fields on my index, but now I'm using v2.2.9 with the new dict=keywords mode, and infix_fields is deprecated.
My index definition:
index my_index : my_base_index
{
source = my_src
path = /path/to/my_index
min_word_len = 1
min_infix_len = 3
}
So far I've tried to use the extended query syntax in the following way:
$cl = new SphinxClient();
$q = '(@(field_1) *cde* *12345*) | (@(field_2) cde 12345)';
$result = $cl->Query($q, 'my_index');
This doesn't work because, for each field, Sphinx does an AND search, and one of the words is not in the specified field: "12345" is not a match in field_1 and "cde" is not a match in field_2. Also, I don't want to do an OR search; I need both words to match.
Is there a way to accomplish what I need?
It's a bit tricky, but it can be done:
$q = "((@field_1 *cde*) | (@field_2 cde)) ((@field_1 *12345*) | (@field_2 12345))";
(You don't need the brackets around the field name in the @ syntax when there is just one field, so I removed them for brevity.)

How to retrieve the field name of a ShapeFile feature field?

I am using gdal-ruby to parse ESRI ShapeFiles like in this demo. I want to iterate through all features in order to push the field values into a database. However, I cannot find out how to retrieve the name of each field, which I need in order to match the database column. So far I can only work with the index of each field, such as:
dataset = Gdal::Ogr.open(filename)
number_of_layers = dataset.get_layer_count
number_of_layers.times do |layer_index|
  layer = dataset.get_layer(layer_index)
  layer.get_feature_count.times do |feature_index|
    feature = layer.get_feature(feature_index)
    feature.get_field_count.times do |field_index|
      field_value = feature.get_field(field_index)
      # How can I find out the name of the field?
      puts "Value = #{field_value} for unknown field name"
    end
  end
end
I checked the available methods with irb and looked into the API documentation. It seems as if I am searching for the wrong terms.
Looking at the OGR API itself, I think you need to go via feature.GetDefnRef to get the feature definition, then .GetFieldDefn for the relevant field, and finally .GetNameRef...?
...
feature.get_field_count.times do |field_index|
  defn_ref = feature.get_defn_ref
  field_defn = defn_ref.get_field_defn(field_index)
  field_name = field_defn.get_name
  field_value = feature.get_field(field_index)
  puts "Value = #{field_value} for field named #{field_name}"
end
...
For comparison, with the Python osgeo.ogr bindings the field names are also available through the layer schema:
from osgeo import ogr

ds = ogr.Open(filename, 1)
layer = ds.GetLayer()
for i in range(len(layer.schema)):
    print(layer.schema[i].name)
