Setting codec/searching Elasticsearch for unicode values from Python - elasticsearch

This issue is probably due to my noobishness to ELK, Python, and Unicode.
I have an index containing logstash-digested logs, including a field 'host_req', which contains a host name. Using Elasticsearch-py, I'm pulling that host name out of the record, and using it to search in another index.
However, if the hostname contains multibyte characters, it fails with a UnicodeDecodeError. Exactly the same query works fine when I enter it from the command line with 'curl -XGET'. The unicode character is a lowercase 'a' with a diaeresis (two dots). The UTF-8 value is C3 A4, and the unicode code point seems to be 00E4 (the language is Swedish).
These curl commands work just fine from the command line:
curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d ' { "query" : {"match" :{"req_host" : "www.utkl\u00E4dningskl\u00E4derna.se" }}}'
curl -XGET 'http://localhost:9200/logstash-2015.01.30/logs/_search?pretty=1' -d ' { "query" : {"match" :{"req_host" : "www.utklädningskläderna.se" }}}'
They find and return the record
(the second line shows how the hostname appears in the log I pull it from, with the lowercase 'a' with a diaeresis in two places)
I've written a very short Python script to show the problem: it uses hardwired queries, printing them and their type, then trying to use them in a search.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import elasticsearch
es = elasticsearch.Elasticsearch()
if __name__=="__main__":
#uq = u'{ "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}' # raw utf-8 characters. does not work
#uq = u'{ "query": { "match": { "req_host": "www.utkl\u00E4dningskl\u00E4derna.se" }}}' # quoted unicode characters. does not work
#uq = u'{ "query": { "match": { "req_host": "www.utkl\uC3A4dningskl\uC3A4derna.se" }}}' # quoted utf-8 characters. does not work
uq = u'{ "query": { "match": { "req_host": "www.facebook.com" }}}' # non-unicode. works fine
print "uq", type(uq), uq
result = es.search(index="logstash-2015.01.30",doc_type="logs",timeout=1000,body=uq);
if result["hits"]["total"] == 0:
print "nothing found"
else:
print "found some"
If I run it as shown, with the 'facebook' query, it's fine - the output is:
$python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.facebook.com" }}}
found some
Note that the query string 'uq' is unicode.
But if I use the other three strings, which include the Unicode characters, it blows up. For example, with the second line, I get:
$python testutf8b.py
uq <type 'unicode'> { "query": { "match": { "req_host": "www.utklädningskläderna.se" }}}
Traceback (most recent call last):
File "testutf8b.py", line 15, in <module>
result = es.search(index="logstash-2015.01.30",doc_type="logs",timeout=1000,body=uq);
File "build/bdist.linux-x86_64/egg/elasticsearch/client/utils.py", line 68, in _wrapped
File "build/bdist.linux-x86_64/egg/elasticsearch/client/__init__.py", line 497, in search
File "build/bdist.linux-x86_64/egg/elasticsearch/transport.py", line 307, in perform_request
File "build/bdist.linux-x86_64/egg/elasticsearch/connection/http_urllib3.py", line 82, in perform_request
elasticsearch.exceptions.ConnectionError: ConnectionError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128)) caused by: UnicodeDecodeError('ascii' codec can't decode byte 0xc3 in position 45: ordinal not in range(128))
$
Again, note that the query string is a unicode string (yes, the source code line is the one with the \u00E4 characters).
I'd really like to resolve this. I've tried various combinations of uq = uq.encode("utf-8") and uq = uq.decode("utf-8"), but they don't seem to help. I'm starting to wonder if there's an issue in the elasticsearch-py library.
thanks!
pt
PS: This is under CentOS 7, using ES 1.5.0. The logs were digested into ES under a slightly older version, using logstash-1.4.2.

Basically, you don't need to pass the body as a string. Use native Python data structures, or transform them on the fly. Give this a try, please:
>>> import elasticsearch
>>> es = elasticsearch.Elasticsearch()
>>> es.index(index='unicode-index', body={'host': u'www.utklädningskläderna.se'}, doc_type='log')
{u'_id': u'AUyGJuFMy0qdfghJ6KwJ',
u'_index': u'unicode-index',
u'_type': u'log',
u'_version': 1,
u'created': True}
>>> es.search(index='unicode-index', body={}, doc_type='log')
{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
u'_index': u'unicode-index',
u'_score': 1.0,
u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
u'_type': u'log'}],
u'max_score': 1.0,
u'total': 1},
u'timed_out': False,
u'took': 5}
>>> es.search(index='unicode-index', body={'query': {'match': {'host': u'www.utklädningskläderna.se'}}}, doc_type='log')
{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
u'_index': u'unicode-index',
u'_score': 0.30685282,
u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
u'_type': u'log'}],
u'max_score': 0.30685282,
u'total': 1},
u'timed_out': False,
u'took': 122}
>>> import json
>>> body={'query': {'match': {'host': u'www.utklädningskläderna.se'}}}
>>> es.search(index='unicode-index', body=body, doc_type='log')
{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
u'_index': u'unicode-index',
u'_score': 0.30685282,
u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
u'_type': u'log'}],
u'max_score': 0.30685282,
u'total': 1},
u'timed_out': False,
u'took': 4}
>>> es.search(index='unicode-index', body=json.dumps(body), doc_type='log')
{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
u'hits': {u'hits': [{u'_id': u'AUyBTz5CsiBSSvubLioQ',
u'_index': u'unicode-index',
u'_score': 0.30685282,
u'_source': {u'host': u'www.utkl\xe4dningskl\xe4derna.se'},
u'_type': u'log'}],
u'max_score': 0.30685282,
u'total': 1},
u'timed_out': False,
u'took': 5}
>>> json.dumps(body)
'{"query": {"match": {"host": "www.utkl\\u00e4dningskl\\u00e4derna.se"}}}'

Related

Boto3 Amplify list apps

I have a lot of Amplify apps which I want to manage via Lambdas. What is the equivalent of the CLI command aws amplify list-apps in boto3? I made multiple attempts, but none worked out for me.
My bit of code that was using nextToken looked like this:
amplify = boto3.client('amplify')
apps = amplify.list_apps()
print(apps)
print('First token is: ', apps['nextToken'])
while 'nextToken' in apps:
    apps = amplify.list_apps(nextToken=apps['nextToken'])
    print('=====NEW APP=====')
    print(apps)
    print('=================')
Then I tried to use paginators like:
paginator = amplify.get_paginator('list_apps')
response_iterator = paginator.paginate(
    PaginationConfig={
        'MaxItems': 100,
        'PageSize': 100
    }
)
for i in response_iterator:
    print(i)
Both attempts produced inconsistent output. The first one printed the first token and the second entry, but nothing more. The second one gives only the first entry.
Edit with more attempt info + output. Below is the piece of code:
apps = amplify.list_apps()
print(apps)
print('---------------')
new_app = amplify.list_apps(nextToken=apps['nextToken'], maxResults=100)
print(new_app)
print('---------------')
Returns (some sensitive output bits were removed):
EVG_long_token_x4gbDGaAWGPGOASRtJPSI='}
---------------
{'ResponseMetadata': {'RequestId': 'f6...e9eb', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/json', 'content-length': ...}, 'RetryAttempts': 0}, 'apps': [{'appId': 'dym7444jed2kq', 'appArn': 'arn:aws:amplify:us-east-2:763175725735:apps/dym7444jed2kq', 'name': 'vesting-interface', 'tags': {}, 'repository': 'https://github.com/...interface', 'platform': 'WEB', 'createTime': datetime.datetime(2021, 5, 4, 3, 41, 34, 717000, tzinfo=tzlocal()), 'updateTime': datetime.datetime(2021, 5, 4, 3, 41, 34, 717000, tzinfo=tzlocal()), 'environmentVariables': {}, 'defaultDomain': 'dym7444jed2kq.amplifyapp.com', 'customRules': _rules_, 'productionBranch': {'lastDeployTime': datetime.datetime(2021, 5, 26, 15, 10, 7, 694000, tzinfo=tzlocal()), 'status': 'SUCCEED', 'thumbnailUrl': 'https://aws-amplify-', 'branchName': 'main'}, - yarn install\n build:\n commands:\n - yarn run build\n artifacts:\n baseDirectory: build\n files:\n - '**/*'\n cache:\n paths:\n - node_modules/**/*\n", 'customHeaders': '', 'enableAutoBranchCreation': False}]}
---------------
I am very confused about why the next iteration doesn't have a nextToken, and how I can get to the next appId.
import boto3
import json
session=boto3.session.Session(profile_name='<Profile_Name>')
amplify_client=session.client('amplify',region_name='ap-south-1')
output=amplify_client.list_apps()
print(output['apps'])
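Note that list_apps only returns one page of results per call. If there are more apps than fit in a single response, a manual nextToken loop can collect them all; a minimal sketch (assuming the same profile and region as above):
import boto3

session = boto3.session.Session(profile_name='<Profile_Name>')
amplify_client = session.client('amplify', region_name='ap-south-1')

all_apps = []
kwargs = {'maxResults': 100}
while True:
    response = amplify_client.list_apps(**kwargs)
    all_apps.extend(response['apps'])
    # 'nextToken' is only present while there are more pages left to fetch
    if 'nextToken' not in response:
        break
    kwargs['nextToken'] = response['nextToken']

for app in all_apps:
    print(app['appId'], app['name'])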

TextMate Grammar - Problem with `end` expression

I'm having huge problems with the end portion of a regex in TextMate:
It looks like end becomes part of the pattern that's returned between begin and end.
Trying to apply multiple endings with one negative lookbehind proves unsuccessful
Here is an example code:
property_name: {
  test1: [1, 50, 5000]
  test2: something ;;
  test3: [
    1,
    50,
    5000
  ]
  test4: "string"
  test5: [
    "text",
    "text2"
  ]
  test6: something2
  test7: something3
}
I'm using the following code:
"begin": "\\b([a-z_]+):",
"beginCaptures": {
"1": {
"name" : "parameter.name"
}
}
"end": "(?<!,)\\n(?!\\])",
"patterns": [
{
"name": "parameter.value",
"match": "(.+)"
}
]
My logic for the end regular expression is to consider it ended if there's a new line but only if it's not preceded by a comma (list of values in an array) or followed by a closing square bracket (last value in an array).
Unfortunately it's not working as expected.
What I would like to achieve is that property_name and all the test# keys are matched as parameter.name, and the values are matched as parameter.value, apart from ;;

How to pass arguments inside select and test function?

I have this JSON data, extracted from qBittorrent:
[
{
"hash": "333333333333333333333333333",
"name": "testtosearchcaseinsensitive",
"magnet_uri": "magnet:somedata",
"size": 1243989552,
"progress": 1.0,
"dlspeed": 0,
"upspeed": 0,
"priority": 0,
"num_seeds": 0,
"num_complete": 2,
"num_leechs": 0,
"num_incomplete": 32,
"ratio": 0.0,
"eta": "1.01:11:52",
"state": "stalledUP",
"seq_dl": false,
"f_l_piece_prio": false,
"category": "category",
"tags": "",
"super_seeding": false,
"force_start": false,
"save_path": "/data/path/",
"added_on": 1567358333,
"completion_on": 1567366287,
"tracker": "somedata",
"dl_limit": null,
"up_limit": null,
"downloaded": 1244073666,
"uploaded": 0,
"downloaded_session": 0,
"uploaded_session": 0,
"amount_left": 0,
"completed": 1243989552,
"ratio_limit": 1.0,
"seen_complete": 1567408837,
"last_activity": 1567366979,
"time_active": "1.01:00:41",
"auto_tmm": true,
"total_size": 1243989552,
"max_ratio": 1,
"max_seeding_time": 2880,
"seeding_time_limit": 2880
},
{
"hash": "44444444444444",
"name": "dontmatch",
"magnet_uri": "magnet:somedata",
"size": 2996838603,
"progress": 1.0,
"dlspeed": 0,
"upspeed": 0,
"priority": 0,
"num_seeds": 0,
"num_complete": 12,
"num_leechs": 0,
"num_incomplete": 0,
"ratio": 0.06452786606740063,
"eta": "100.00:00:00",
"state": "stalledUP",
"seq_dl": false,
"f_l_piece_prio": false,
"category": "category",
"tags": "",
"super_seeding": false,
"force_start": false,
"save_path": "/data/path/",
"added_on": 1566420155,
"completion_on": 1566424710,
"tracker": "some data",
"dl_limit": null,
"up_limit": null,
"downloaded": 0,
"uploaded": 193379600,
"downloaded_session": 0,
"uploaded_session": 0,
"amount_left": 0,
"completed": 2996838603,
"ratio_limit": -2.0,
"seen_complete": 4294967295,
"last_activity": 1566811636,
"time_active": "10.23:07:42",
"auto_tmm": true,
"total_size": 2996838603,
"max_ratio": -1,
"max_seeding_time": -1,
"seeding_time_limit": -2
}
]
So I want to match all entries whose name contains some text; in Bash I wrote this, but I can't make it work.
Some declarations to start; actually I pass the data via arguments, so I use $1:
TXTIWANT="test"
MYJSONDATA= Here I put my JSON data
The jq expression that doesn't work for me is this:
RESULTS=$(echo "$MYJSONDATA" | jq --raw-output --arg TOSEARCH "$TXTIWANT" '.[] | select(.name|test("$TOSEARCH.";"i")) .name')
But I always get either an error or all the data, I think because $TOSEARCH is not expanded.
Maybe there's a better way to search for a string inside a value?
What am I doing wrong?
The right syntax for variable (or filter) interpolation with jq looks like this:
"foo \(filter_or_var) bar"
In your case:
jq --raw-output --arg TOSEARCH "$TXTIWANT" '.[] | select(.name|test("\($TOSEARCH).";"i")) | .name'
side-note: By convention, environment variables (PAGER, EDITOR, ...) and internal shell variables (SHELL, BASH_VERSION, ...) are capitalized. All other variable names should be lower case.
If (as suggested by the name TXTIWANT and by the example, as well as by the wording of the question) the value of "$TXTIWANT" is supposed to be literal text, then using test is problematic, as test will search for a regular expression.
Since it is not clear from the question why you are adding a period to TOSEARCH, in the remainder of this first section, I will ignore whatever requirement you have in mind regarding that.
So if you simply want to find the .name values that contain $TXTIWANT literally (ignoring case), then you could convert both .name and the value of $TXTIWANT to the same case, and then check for containment.
In jq, ignoring the mysterious ".", this could be done like so:
jq --raw-output --arg TOSEARCH "$TXTIWANT" '
($TOSEARCH|ascii_upcase) as $up
| .[]
| .name
| select(ascii_upcase|index($up))'
Search for non-terminal occurrence of $TXTIWANT ignoring case
If the "." signifies there must be an additional character after $TXTIWANT, then you could just add another select as follows:
($TOSEARCH|length) as $len
| ($TOSEARCH|ascii_upcase) as $up
| .[]
| .name
| (ascii_upcase|index($up)) as $ix
| select($ix)
| select($ix + $len < length)

What is the best practice for skipping non-ASCII characters in mixed-encoded text in Python 3?

I was able to import a text file into an Elasticsearch index on my local machine.
Despite using a virtual environment, the production machine is a nightmare, because I keep getting errors like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 79: ordinal not in range(128)
I am using Python 3, and I personally had fewer issues in Python 2; maybe it is just the frustration of a couple of wasted hours.
I can't understand why I am not able to strip or handle non-ASCII chars:
I tried to import:
from unidecode import unidecode
def remove_non_ascii(text):
    return unidecode(unicode(text, encoding = "utf-8"))
using python2, no success.
back on python3:
import string
printable = set(string.printable)
''.join(filter(lambda x: x in printable, 'mixed non ascii string'))
no success
import codecs
with codecs.open(path, encoding='utf8') as f:
....
no success
tried:
# -*- coding: utf-8 -*-
no success
https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize
no success ...
None of the above seems to strip or handle the non-ASCII characters; it is very cumbersome, and I keep getting the following errors:
with open(path) as f:
    for line in f:
        line = line.replace('\n','')
        el = line.split('\t')
        print (el)
        _id = el[0]
        _source = el[1]
        _name = el[2]
        # _description = ''.join( filter(lambda x: x in printable, el[-1]) )
        #
        _description = remove_non_ascii( el[-1] )
        print (_id, _source, _name, _description, setTipe( _source ) )
        action = {
            "_index": _indexName,
            "_type": setTipe( _source ),
            "_id": _source,
            "_source": {
                "name": _name,
                "description" : _description
            }
        }
        helpers.bulk(es, [action])
File "<stdin>", line 22, in <module>
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 194, in bulk
for ok, item in streaming_bulk(client, actions, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 162, in streaming_bulk
for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 87, in _process_bulk_chunk
resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 79: ordinal not in range(128)
I would like to have a "definite" practice for handling encoding problems in Python 3 - I am using the same scripts on different machines and getting different results...
ASCII characters are in the range 0-127, so you can keep only those code points:
def remove_non_ascii(text):
    ascii_characters = ""
    for character in text:
        if ord(character) <= 127:
            ascii_characters += character
    return ascii_characters
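If the goal is simply to drop everything outside the ASCII range on Python 3, the encode/decode round trip with errors="ignore" does the same thing without an explicit loop; a minimal sketch:
def remove_non_ascii(text):
    # errors="ignore" silently drops every character that cannot be
    # encoded as ASCII (i.e. any code point above 127)
    return text.encode("ascii", errors="ignore").decode("ascii")

print(remove_non_ascii("www.utklädningskläderna.se"))  # -> www.utkldningsklderna.se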

Highlight part of code block

I have a very large code block in my .rst file, of which I would like to highlight just a small portion and make it bold. Consider the following rst:
wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.
wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.
**Example 1: Explain showing a table scan operation**::

    EXPLAIN FORMAT=JSON
    SELECT * FROM Country WHERE continent='Asia' and population > 5000000;
    {
      "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "53.80" # This query costs 53.80 cost units
        },
        "table": {
          "table_name": "Country",
          "access_type": "ALL", # ALL is a table scan
          "rows_examined_per_scan": 239, # Accessing all 239 rows in the table
          "rows_produced_per_join": 11,
          "filtered": "4.76",
          "cost_info": {
            "read_cost": "51.52",
            "eval_cost": "2.28",
            "prefix_cost": "53.80",
            "data_read_per_join": "2K"
          },
          "used_columns": [
            "Code",
            "Name",
            "Continent",
            "Region",
            "SurfaceArea",
            "IndepYear",
            "Population",
            "LifeExpectancy",
            "GNP",
            "GNPOld",
            "LocalName",
            "GovernmentForm",
            "HeadOfState",
            "Capital",
            "Code2"
          ],
          "attached_condition": "((`world`.`Country`.`Continent` = 'Asia') and (`world`.`Country`.`Population` > 5000000))"
        }
      }
    }
When it converts to HTML, it syntax-highlights by default (good), but I also want to specify a few lines that should be bold (the ones with comments on them, but possibly others too).
I was thinking of adding a trailing character sequence on the line (e.g. ###) and then writing a post-parser script to modify the generated HTML files. Is there a better way?
The code-block directive has an emphasize-lines option. The following highlights the lines with comments in your code.
**Example 1: Explain showing a table scan operation**

.. code-block:: python
   :emphasize-lines: 7, 11, 12

   EXPLAIN FORMAT=JSON
   SELECT * FROM Country WHERE continent='Asia' and population > 5000000;
   {
     "query_block": {
       "select_id": 1,
       "cost_info": {
         "query_cost": "53.80" # This query costs 53.80 cost units
       },
       "table": {
         "table_name": "Country",
         "access_type": "ALL", # ALL is a table scan
         "rows_examined_per_scan": 239, # Accessing all 239 rows in the table
         "rows_produced_per_join": 11,
         "filtered": "4.76",
         "cost_info": {
           "read_cost": "51.52",
           "eval_cost": "2.28",
           "prefix_cost": "53.80",
           "data_read_per_join": "2K"
         },
         "used_columns": [
           "Code",
           "Name",
           "Continent",
           "Region",
           "SurfaceArea",
           "IndepYear",
           "Population",
           "LifeExpectancy",
           "GNP",
           "GNPOld",
           "LocalName",
           "GovernmentForm",
           "HeadOfState",
           "Capital",
           "Code2"
         ],
         "attached_condition": "((`world`.`Country`.`Continent` = 'Asia') and (`world`.`Country`.`Population` > 5000000))"
       }
     }
   }
