Mindmeld Elasticsearch index and QuestionAnswerer - elasticsearch

I am using Mindmeld blueprint application (kwik_e_mart) to understand how the Question Answerer retrieves data from relevant knowledge base data file (newbie to Mindmeld, OOP and Elasticsearch).
See code snippet below:
from mindmeld.components import QuestionAnswerer
config = {"model_type": "keyword"}
qa = QuestionAnswerer(app_path='kwik_e_mart', config=config)
qa.load_kb(app_namespace='kwik_e_mart', index_name='stores',
data_file='kwik_e_mart/data/stores.json', app_path='kwik_e_mart', config=config, clean = True)
Output - Loading Elasticsearch index stores: 100%|██████████| 25/25 [00:00<00:00, 495.28it/s]
Output -Loaded 25 documents
Although Elasticsearch is able to load all 25 documents (see output above), unable to retrieve any data with index greater than 9.
stores = qa.get(index='stores')
stores[0]
Output: - {'address': '23 Elm Street, Suite 800, Springfield, OR, 97077',
'store_name': '23 Elm Street',
'open_time': '7:00',
'location': {'lon': -123.022029, 'lat': 44.046236},
'phone_number': '541-555-1100',
'id': '1',
'close_time': '19:00',
'_score': 1.0}
However, stores [10] gives an error
`stores[10]`
Output: - IndexError Traceback (most recent call last)
<ipython-input-12-08132a2cd460> in <module>
----> 1 stores[10]
IndexError: list index out of range
Not sure why documents at index higher than 9 are unreachable. My understanding is that the elasticsearch index is still pointing to remote blueprint data (http/middmeld/blueprint...) and not pointing to the folder locally.
Not sure how to resolve this. Any help is much appreciated.

By default, the get() method only returns 10 records per search - so only stores[0] through stores[9] will be valid.
You can add the size= option to your get() to increase the number of records it returns:
stores = qa.get(index='stores', size=25)
See the bottom of this section for more info.

Related

Loading multiple CSV files (silos) to compose Tensorflow Federated dataset

I am working on pre-processed data that were already siloed into separated csv files to represent separated local data for federated learning.
To correct implement the federated learning with these multiple CSVs on TensorFlow Federated, I am just trying to reproduce the same approach with a toy example in the iris dataset. However, when trying to use the method tff.simulation.datasets.TestClientData, I am getting the error:
TypeError: can't pickle _thread.RLock objects
The current code is as follows, first, load the three iris dataset CSV files (50 samples on each) into a dictionary from the filenames iris1.csv, iris2.csv, and iris3.csv:
silos = {}
for silo in silos_files:
silo_name = silo.replace(".csv", "")
silos[silo_name] = pd.read_csv(silos_path + silo)
silos[silo_name]["variety"].replace({"Setosa" : 0, "Versicolor" : 1, "Virginica" : 2}, inplace=True)
Creating a new dict with tensors:
silos_tf = collections.OrderedDict()
for key, silo in silos.items():
silos_tf[key] = tf.data.Dataset.from_tensor_slices((silo.drop(columns=["variety"]).values, silo["variety"].values))
Finally, trying to converting the Tensorflow Dataset into a Tensorflow Federated Dataset:
tff_dataset = tff.simulation.datasets.TestClientData(
silos_tf
)
That raises the error:
TypeError Traceback (most recent call last)
<ipython-input-58-a4b5686509ce> in <module>()
1 tff_dataset = tff.simulation.datasets.TestClientData(
----> 2 silos_tf
3 )
/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/simulation/datasets/from_tensor_slices_client_data.py in __init__(self, tensor_slices_dict)
59 """
60 py_typecheck.check_type(tensor_slices_dict, dict)
---> 61 tensor_slices_dict = copy.deepcopy(tensor_slices_dict)
62 structures = list(tensor_slices_dict.values())
63 example_structure = structures[0]
...
/usr/lib/python3.7/copy.py in deepcopy(x, memo, _nil)
167 reductor = getattr(x, "__reduce_ex__", None)
168 if reductor:
--> 169 rv = reductor(4)
170 else:
171 reductor = getattr(x, "__reduce__", None)
TypeError: can't pickle _thread.RLock objects
I also tried to use Python dictionary instead of OrderedDict but the error is the same. For this experiment, I am using Google Colab with this notebook as reference running with TensorFlow 2.8.0 and TensorFlow Federated version 0.20.0. I also used these previous questions as references:
Is there a reasonable way to create tff clients datat sets?
'tensorflow_federated.python.simulation' has no attribute 'FromTensorSlicesClientData' when using tff-nightly
I am not sure if this is a good way that derives for a case beyond the toy example, please, if any suggestion on how to bring already siloed data for TFF tests, I am thankful.
I did some search of public code in github using class tff.simulation.datasets.TestClientData, then I found the following implementation (source here):
def to_ClientData(clientsData: np.ndarray, clientsDataLabels: np.ndarray,
ds_info, is_train=True) -> tff.simulation.datasets.TestClientData:
"""Transform dataset to be fed to fedjax
:param clientsData: dataset for each client
:param clientsDataLabels:
:param ds_info: dataset information
:param train: True if processing train split
:return: dataset for each client cast into TestClientData
"""
num_clients = ds_info['num_clients']
client_data = collections.OrderedDict()
for i in range(num_clients if is_train else 1):
client_data[str(i)] = collections.OrderedDict(
x=clientsData[i],
y=clientsDataLabels[i])
return tff.simulation.datasets.TestClientData(client_data)
I understood from this snippet that the tff.simulation.datasets.TestClientData class requires as argument an OrderedDict composed by numpy arrays instead of a dict of tensors (as my previous implementation), now I changed the code for the following:
silos_tf = collections.OrderedDict()
for key, silo in silos.items():
silos_tf[key] = collections.OrderedDict(
x=silo.drop(columns=["variety"]).values,
y=silo["variety"].values)
Followed by:
tff_dataset = tff.simulation.datasets.TestClientData(
silos_tf
)
That correctly runs as the following output:
>>> tff_dataset.client_ids
['iris3', 'iris1', 'iris2']

Limit(n) vs Show(n) performance disparity in Pyspark

Trying to get a deeper understanding of how spark works and was playing around with the pyspark cli (2.4.0). I was looking for the difference between using limit(n).show() and show(n). I ended up getting two very different performance times for two very similar queries. Below are the commands I ran. The parquet file referenced in the code below has about 50 columns and is over 50gb in size on remote HDFS.
# Create dataframe
>>> df = sqlContext.read.parquet('hdfs://hdfs.host/path/to.parquet') ↵
# Create test1 dataframe
>>> test1 = df.select('test_col') ↵
>>> test1.schema ↵
StructType(List(StructField(test_col,ArrayType(LongType,true),true)))
>>> test1.explain() ↵
== Physical Plan ==
*(1) Project [test_col#40]
+- *(1) FileScan parquet [test_col#40]
Batched: false,
Format: Parquet,
Location: InMemoryFileIndex[hdfs://hdfs.host/path/to.parquet],
PartitionCount: 25,
PartitionFilters: [],
PushedFilters: [],
ReadSchema: struct<test_col:array<bigint>>
# Create test2 dataframe
>>> test2 = df.select('test_col').limit(5) ↵
>>> test2.schema ↵
StructType(List(StructField(test_col,ArrayType(LongType,true),true)))
>>> test2.explain() ↵
== Physical Plan ==
CollectLimit 5
+- *(1) Project [test_col#40]
+- *(1) FileScan parquet [test_col#40]
Batched: false,
Format: Parquet,
Location: InMemoryFileIndex[hdfs://hdfs.host/path/to.parquet],
PartitionCount: 25,
PartitionFilters: [],
PushedFilters: [],
ReadSchema: struct<test_col:array<bigint>>
Notice that the physical plan is almost identical for both test1 and test2. The only exception is test2's plan starts with "CollectLimit 5". After setting this up I ran test1.show(5) and test2.show(5). Test 1 returned the results instantaneously. Test 2 showed a progress bar with 2010 tasks and took about 20 minutes to complete (I only had one executor)
Question
Why did test 2 (with limit) perform so poorly compared to test 1 (without limit)? The data set and result set were identical and the physical plan was nearly identical.
Keep in mind:
show() is an alias for show(20) and relies internally on take(n: Int): Array[T]
limit(n: Int) returns another dataset and is an expensive operation that reads the whole source
Limit - result in new dataframe and taking longer time because this is because predicate pushdown is currently not supported in your input file format. Hence reading entire dataset and applying limit.

Dumping Beam-scores to an HDF File

I have a translation model (TM), which synthesizes its hypotheses using beam-search. For analysis purposes, I would like to study all hypotheses in each beam emitted by the TM’s ChoiceLayer. I’m able to fetch the hypotheses for each input sequence from the TM’s ChoiceLayer and write it to my file system, using the HDFDumpLayer:
'__SEARCH_dump_beam__': {
'class': 'hdf_dump',
'from': ['output'],
'filename': "<my-path>/beams.hdf",
'is_output_layer': True
}
But beside the hypotheses, I would also like to store the score of each hypothesis. I’m able to fetch the beam scores from the ChoiceLayer using a ChoiceGetBeamScoresLayer, but I was not able to dump the scores using an HDFDumpLayer:
'get_scores': {'class': 'choice_get_beam_scores', 'from': ['output']},
'__SEARCH_dump_scores__': {
'class': 'hdf_dump',
'from': ['get_scores'],
'filename': "<my-path>/beam_scores.hdf",
'is_output_layer': True
}
Running the config like likes this, makes RETURNN complain about the ChoiceGetBeamScoresLayer output not having a time axis:
Exception creating layer root/'__SEARCH_dump_scores__' of class HDFDumpLayer with opts:
{'filename': '<my-path>/beam_scores.hdf',
'is_output_layer': True,
'name': '__SEARCH_dump_scores__',
'network': <TFNetwork 'root' train=False search>,
'output': Data(name='__SEARCH_dump_scores___output', shape=(), time_dim_axis=None, beam=SearchBeam(name='output/output', beam_size=12, dependency=SearchBeam(name='output/prev:output', beam_size=12)), batch_shape_meta=[B]),
'sources': [<ChoiceGetBeamScoresLayer 'get_scores' out_type=Data(shape=(), time_dim_axis=None, beam=SearchBeam(name='output/output', beam_size=12, dependency=SearchBeam(name='output/prev:output', beam_size=12)), batch_shape_meta=[B])>]}
Unhandled exception <class 'AssertionError'> in thread <_MainThread(MainThread, started 139964674299648)>, proc 31228.
...
File "<...>/returnn/repository/returnn/tf/layers/basic.py", line 6226, in __init__
line: assert self.sources[0].output.have_time_axis()
locals:
self = <local> <HDFDumpLayer '__SEARCH_dump_scores__' out_type=Data(shape=(), time_dim_axis=None, beam=SearchBeam(name='output/output', beam_size=12, dependency=SearchBeam(name='output/prev:output', beam_size=12)), batch_shape_meta=[B])>
self.sources = <local> [<ChoiceGetBeamScoresLayer 'get_scores' out_type=Data(shape=(), time_dim_axis=None, beam=SearchBeam(name='output/output', beam_size=12, dependency=SearchBeam(name='output/prev:output', beam_size=12)), batch_shape_meta=[B])>]
output = <not found>
output.have_time_axis = <not found>
AssertionError
I tried to alter the shape of the score data using ExpandDimsLayer and EvalLayer, with several different configurations, but those all lead to different errors.
I’m sure I am not the first person trying to dump beam scores. Can anybody tell me, how to do that properly?
To answer the question:
The HDFDumpLayer assert you are hitting is simply due to the fact that this is not implemented yet (support of dumping data without time axis).
You can create a pull request and add this support for HDFDumpLayer.
If you just want to dump this information in any way, not necessarily via HDFDumpLayer, there are a couple of other options:
With task="search", as the search_output_layer, just select the layer which still includes the beam information, i.e. not after a DecideLayer.
That will simply dump all hypotheses including their beam scores.
Use a custom EvalLayer and dump it in whatever way you want.

Get metadata for a column with google Sheets API

I have a Google spreadsheet that I am connecting to and interacting with using the google-python-api-client package. Following this description on metadata search, and the links in it for the request body, I have written a function to get metadata for a range:
def get_metadata_by_range(range_: Union[dict, str]) -> dict:
if isinstance(range_, str):
print("String range: ", range_)
request_body = {"dataFilters": \
{"a1Range": range_}}
elif isinstance(range_, dict):
print("Dict range: ", range_)
request_body = {"dataFilters": \
[{"gridRange": range_}]}
else:
return None
request = service.spreadsheets().developerMetadata().\
search(spreadsheetId=SPREADSHEET_ID, body=request_body)
return request.execute()
Calling this with a range, either A1 notation or a gridRange will cause an error to occur though. For example, calling it with this line get_metadata_by_range("Metadata!A:A") will cause the following traceback.
String range: Metadata!A:A
Traceback (most recent call last):
File "oqc_server/fab/gapc.py", line 82, in <module>
get_metadata_by_range("Metadata!A:A")
File "oqc_server/fab/gapc.py", line 69, in get_metadata_by_range
return request.execute()
File "/media/kajsa/Storage/Projects/oqc_server/venv/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File "/media/kajsa/Storage/Projects/oqc_server/venv/lib/python3.7/site-packages/googleapiclient/http.py", line 856, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 500 when requesting https://sheets.googleapis.com/v4/spreadsheets/1RhheCsI3kHrm8yK2Yio2kAOU4VOzYdz-eK0vjiMY7co/developerMetadata:search?alt=json returned "Internal error encountered."
Any ideas on what is causing this and how to solve it?
You want to search and retrieve the developer metadata from the range using the method of Method: spreadsheets.developerMetadata.search of Sheets API.
You want to achieve this using google-api-python-client with python.
You have already been able to get and put values for Spreadsheet with Sheets API.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
Modification points:
When you want to search the developer metadata with the range, please set the gridrange to dataFilters[].developerMetadataLookup.metadataLocation.dimensionRange.
When the range is set to dataFilters[].a1Range and dataFilters[].gridRange, I could confirm that the same error occurs.
Sample script:
The sample script for retrieving the developer metadata from the range is as follows. Before you use this, please set the variables of spreadsheet_id and sheet_id.
service = build('sheets', 'v4', credentials=creds)
spreadsheet_id = '###' # Please set the Spreadsheet ID.
sheet_id = ### # Please set the sheet ID.
search_developer_metadata_request_body = {
"dataFilters": [
{
"developerMetadataLookup": {
"metadataLocation": {
"dimensionRange": {
"sheetId": sheet_id,
"dimension": "COLUMNS",
"startIndex": 0,
"endIndex": 1
}
}
}
}
]
}
request = service.spreadsheets().developerMetadata().search(
spreadsheetId=spreadsheet_id, body=search_developer_metadata_request_body)
response = request.execute()
print(response)
Above script retrieves the developer metadata from the column "A" of sheet_id.
Note:
Please modify above script for your actual script.
In the current stage, the Developer Metadata can be added to the Spreadsheet, each sheet in the Spreadsheet and row and column. Please be careful this. Ref
References:
Method: spreadsheets.developerMetadata.search
Adding Developer Metadata- DeveloperMetadataLookup
If I misunderstood your question and this was not the direction you want, I apologize.
There is a bug with developerMetadata related to a1Range objects being passed as filters.
Edit
I've checked the bug again and a fix has been implemented.

Individually update a large amount of documents with the Python DSL Elasticsearch UpdateByQuery

I'm trying to use the UpdateByQuery to update a property of a large amount of documents. But as each document will have a different value, I need to execute ir one by one. I'm traversing a big amount of documents, and for each document I call this funcion:
def update_references(self, query, script_source):
try:
ubq = UpdateByQuery(using=self.client, index=self.index).update_from_dict(query).script(source=script_source)
ubq.execute()
except Exception as err:
return False
return True
Some example values are:
query = {'query': {'match': {'_id': 'VpKI1msBNuDimFsyxxm4'}}}
script_source = 'ctx._source.refs = [\'python\', \'java\']'
The problem is that when I do that, I got an error: "Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.max_compilations_rate] setting".
If I change the max_compilations_rate using Kibana, it has no effect:
PUT _cluster/settings
{
"transient": {
"script.max_compilations_rate": "1500/1m"
}
}
Anyway, it would be better to use a parametrized script. I tried:
def update_references(self, query, script_source, script_params):
try:
ubq = UpdateByQuery(using=self.client, index=self.index).update_from_dict(query).script(source=script_source, params=script_params)
ubq.execute()
except Exception as err:
return False
return True
So, this time:
script_source = 'ctx._source.refs = params.value'
script_params = {'value': [\'python\', \'java\']}
But as I have to update the query and the parameters each time, I need to create a new instance of the UpdateByQuery for each document in the large collection, and the result is the same error.
I also tried to traverse and update the large collection with:
es.update(
index=kwargs["index"],
doc_type="paper",
id=paper["_id"],
body={"doc": {
"refs": paper["refs"] # e.g. [\\'python\\', \\'java\\']
}}
)
But I'm getting the following error: "Failed to establish a new connection: [Errno 99] Cannot assign requested address juil. 10 18:07:14 bib gunicorn[20891]: POST http://localhost:9200/papers/paper/OZKI1msBNuDimFsy0SM9/_update [status:N/A request:0.005s"
So, please, if you have any idea on how to solve this it will be really appreciated.
Best,
You can try it like this.
PUT _cluster/settings
{
"persistent" : {
"script.max_compilations_rate" : "1500/1m"
}
}
The version update is causing these errors.

Resources