Suppress LightGBM warnings in Optuna - lightgbm

I am getting below warnings while I am using Optuna to tune my model.
Please tell me how to suppress these warnings?
[LightGBM] [Warning] feature_fraction is set=0.2, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.2
[LightGBM] [Warning] min_data_in_leaf is set=5400, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=5400
[LightGBM] [Warning] min_gain_to_split is set=13.203719815769512, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=13.203719815769512

I am not familiar with Optuna but I ran into this issue using Python/lightgbm.
As of v3.3.2, the parameters tuning page included parameters that seem to be renamed, deprecated, or duplicative. If, however, you stick to setting/tuning the parameters specified in the model object you can avoid this warning.
from lightgbm import LGBMRegressor
params = LGBMRegressor().get_params()
print(params)
These are the only parameters you want to set. If you want to be able to include all the parameters, you could do something like below.
from lightgbm import LGBMRegressor
lgr = LGBMRegressor()
params = lgr.get_params()
aliases = [
{'min_child_weight', 'min_sum_hessian_in_leaf'},
{'min_child_samples', 'min_data_in_leaf'},
{'colsample_bytree', 'feature_fraction'},
{'subsample', 'bagging_fraction'}
]
for alias in aliases:
if len(alias & set(params)) == 2:
arg = random.choice(sorted(alias))
params[arg] = None
lgr = LGBMRegressor(**params)
The code sets one or the other in each parameter pair that seems to be duplicative. Now, when you call lgr.fit(X, y) you should not get the warning.

Related

Django Rest Framework: Creating a serializer with a ListField causes circular dependency error

I am new to Django and Rest Framework. I'm following the documentation on serializers and am trying to create a ListField (https://www.django-rest-framework.org/api-guide/fields/#listfield)
and when I do I get a nasty circular import error
django.core.exceptions.ImproperlyConfigured: The included URLconf 'api.urls' does not appear to have any patterns in it. If you see valid patterns in the file then the issue is probably caused by a circular import.
My serializers file appears as:
class CapacitySerializer(serializers.Serializer):
planeIds = serializers.ListField(
planeId = serializers.IntegerField(min_value=0, max_value=10)
)
passangerNums = serializers.ListField(
passangerNum = serializers.IntegerField(min_value=0)
)
litersPerMinute = serializers.FloatField(required=False)
minutesOfFlight = serializers.FloatField(required=False)
The code would work if I simply left the code as:
class CapacitySerializer(serializers.Serializer):
planeId = serializers.IntegerField(min_value=0, max_value=10)
passangerNum = serializers.IntegerField(min_value=0)
litersPerMinute = serializers.FloatField(required=False)
minutesOfFlight = serializers.FloatField(required=False)
Any idea why this error is being thrown?
Additionally If I expect my data to be lists of planeIds and passengerNums is this not a good way to go about it?
versions:
Django==3.0.4
djangorestframework==3.11.0
The documentation linked required the use of the child parameter. Child is required and not a placeholder name

Snakemake - parameter file treated as a wildcard

I have written a pipeline in Snakemake. It's an ATAC-seq pipeline (bioinformatics pipeline to analyze genomics data from a specific experiment). Basically, until merging alignment step I use {sample_id} wildcard, to later switch to {sample} wildcard (merging two or more sample_ids into one sample).
working DAG here (for simplicity only one sample shown; orange and blue {sample_id}s are merged into one green {sample}
Tha all rule looks as follows:
configfile: "config.yaml"
SAMPLES_DICT = dict()
with open(config['SAMPLE_SHEET'], "r+") as fil:
next(fil)
for lin in fil.readlines():
row = lin.strip("\n").split("\t")
sample_id = row[0]
sample_name = row[1]
if sample_name in SAMPLES_DICT.keys():
SAMPLES_DICT[sample_name].append(sample_id)
else:
SAMPLES_DICT[sample_name] = [sample_id]
SAMPLES = list(SAMPLES_DICT.keys())
SAMPLE_IDS = [sample_id for sample in SAMPLES_DICT.values() for sample_id in sample]
rule all:
input:
# FASTQC output for RAW reads
expand(os.path.join(config['FASTQC'], '{sample_id}_R{read}_fastqc.zip'),
sample_id = SAMPLE_IDS,
read = ['1', '2']),
# Trimming
expand(os.path.join(config['TRIMMED'],
'{sample_id}_R{read}_val_{read}.fq.gz'),
sample_id = SAMPLE_IDS,
read = ['1', '2']),
# Alignment
expand(os.path.join(config['ALIGNMENT'], '{sample_id}_sorted.bam'),
sample_id = SAMPLE_IDS),
# Merging
expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_merged.bam'),
sample = SAMPLES),
# Marking Duplicates
expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_md.bam'),
sample = SAMPLES),
# Filtering
expand(os.path.join(config['FILTERED'],
'{sample}.bam'),
sample = SAMPLES),
expand(os.path.join(config['FILTERED'],
'{sample}.bam.bai'),
sample = SAMPLES),
# multiqc report
"multiqc_report.html"
message:
'\n#################### ATAC-seq pipeline #####################\n'
'Running all necessary rules to produce complete output.\n'
'############################################################'
I know it's too messy, I should only leave the necessary bits, but here my understanding of snakemake fails cause I don't know what I have to keep and what I should delete.
This is working, to my knowledge exactly as I want.
However, I added a rule:
rule hmmratac:
input:
bam = os.path.join(config['FILTERED'], '{sample}.bam'),
index = os.path.join(config['FILTERED'], '{sample}.bam.bai')
output:
model = os.path.join(config['HMMRATAC'], '{sample}.model'),
gappedPeak = os.path.join(config['HMMRATAC'], '{sample}_peaks.gappedPeak'),
summits = os.path.join(config['HMMRATAC'], '{sample}_summits.bed'),
states = os.path.join(config['HMMRATAC'], '{sample}.bedgraph'),
logs = os.path.join(config['HMMRATAC'], '{sample}.log'),
sample_name = '{sample}'
log:
os.path.join(config['LOGS'], 'hmmratac', '{sample}.log')
params:
genomes = config['GENOMES'],
blacklisted = config['BLACKLIST']
resources:
mem_mb = 32000
message:
'\n######################### Peak calling ########################\n'
'Peak calling for {output.sample_name}\n.'
'############################################################'
shell:
'HMMRATAC -Xms2g -Xmx{resources.mem_mb}m '
'--bam {input.bam} --index {input.index} '
'--genome {params.genome} --blacklist {params.blacklisted} '
'--output {output.sample_name} --bedgraph true &> {log}'
And into the rule all, after filtering, before multiqc, I added:
# Peak calling
expand(os.path.join(config['HMMRATAC'], '{sample}.model'),
sample = SAMPLES),
Relevant config.yaml fragments:
# Path to blacklisted regions
BLACKLIST: "/mnt/data/.../hg38.blacklist.bed"
# Path to chromosome sizes
GENOMES: "/mnt/data/.../hg38_sizes.genome"
# Path to filtered alignment
FILTERED: "alignment/filtered"
# Path to peaks
HMMRATAC: "peaks/hmmratac"
This is the error* I get (It goes on for every input and output of the rule). *Technically it's a warning but it halts execution of snakemake so I am calling it an error.
File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
WARNING:snakemake.logging:File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
It isn't actually ... - I just didn't feel safe providing an absolute path here.
For a couple of days, I have struggled with this error. Looked through the documentation, listened to the introduction. I understand that the above description is far from perfect (it is huge bc I don't even know how to work it down to provide minimal reproducible example...) but I am desperate and hope you can be patient with me.
Any suggestions as to how to google it, where to look for an error would be much appreciated.
Technically it's a warning but it halts execution of snakemake so I am calling it an error.
It would be useful to post the logs from snakemake to see if snakemake terminated with an error and if so what error.
However, in addition to Eric C.'s suggestion to use wildcards.sample instead of {sample} as file name, I think that this is quite suspicious:
alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam
/mnt/ is usually at the root of the file system and you are prepending to it a relative path (alignment/filtered). Are you sure it is correct?

pyyaml parse data with tag

I have yaml data like the input below and i need output as key value pairs
Input
a="""
--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
code:
- '716'
- '718'
id:
- 488
- 499
"""
ouput needed
{'code': ['716', '718'], 'id': [488, 499]}
The default constructor was giving me an error. I tried adding new constructor and now its not giving me error but i am not able to get key value pairs.
FYI, If i remove the !ruby/hash:ActiveSupport::HashWithIndifferentAccess line from my yaml then it gives me desired output.
def new_constructor(loader, tag_suffix, node):
if type(node.value)=='list':
val=''.join(node.value)
else:
val=node.value
val=node.value
ret_val="""
{0}
""".format(val)
return ret_val
yaml.add_multi_constructor('', new_constructor)
yaml.load(a)
output
"\n [(ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'code'), SequenceNode(tag=u'tag:yaml.org,2002:seq', value=[ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'716'), ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'718')])), (ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'id'), SequenceNode(tag=u'tag:yaml.org,2002:seq', value=[ScalarNode(tag=u'tag:yaml.org,2002:int', value=u'488'), ScalarNode(tag=u'tag:yaml.org,2002:int', value=u'499')]))]\n "
Please suggest.
This is not a solution using PyYAML, but I recommend using ruamel.yaml instead. If for no other reason, it's more actively maintained than PyYAML. A quote from the overview
Many of the bugs filed against PyYAML, but that were never acted upon, have been fixed in ruamel.yaml
To load that string, you can do
import ruamel.yaml
parser = ruamel.yaml.YAML()
obj = parser.load(a) # as defined above.
I strongly recommend following #Andrew F answer, but in case you
wonder why your code did not get the proper result, that is because
you don't correctly process the node under the tag in your tag
handling.
Although the node's value is a list (of tuples with key value pairs),
you should test for the type of the node itself (using isinstance)
and then hand it over to the "normal" mapping processing routine as
the tag is on a mapping:
import yaml
from yaml.loader import SafeLoader
a = """\
--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
code:
- '716'
- '718'
id:
- 488
- 499
"""
def new_constructor(loader, tag_suffix, node):
if isinstance(node, yaml.nodes.MappingNode):
return loader.construct_mapping(node, deep=True)
raise NotImplementedError
yaml.add_multi_constructor('', new_constructor, Loader=SafeLoader)
data = yaml.load(a, Loader=SafeLoader)
print(data)
which gives:
{'code': ['716', '718'], 'id': [488, 499]}
You should not use PyYAML's yaml.load(), it is documented to be potentially unsafe
and above all it is not necessary. Just add the new constructor to the SafeLoader.

Vision API: How to get JSON-output

I'm having trouble saving the output given by the Google Vision API. I'm using Python and testing with a demo image. I get the following error:
TypeError: [mid:...] + is not JSON serializable
Code that I executed:
import io
import os
import json
# Imports the Google Cloud client library
from google.cloud import vision
from google.cloud.vision import types
# Instantiates a client
vision_client = vision.ImageAnnotatorClient()
# The name of the image file to annotate
file_name = os.path.join(
os.path.dirname(__file__),
'demo-image.jpg') # Your image path from current directory
# Loads the image into memory
with io.open(file_name, 'rb') as image_file:
content = image_file.read()
image = types.Image(content=content)
# Performs label detection on the image file
response = vision_client.label_detection(image=image)
labels = response.label_annotations
print('Labels:')
for label in labels:
print(label.description, label.score, label.mid)
with open('labels.json', 'w') as fp:
json.dump(labels, fp)
the output appears on the screen, however I do not know exactly how I can save it. Anyone have any suggestions?
FYI to anyone seeing this in the future, google-cloud-vision 2.0.0 has switched to using proto-plus which uses different serialization/deserialization code. A possible error you can get if upgrading to 2.0.0 without changing the code is:
object has no attribute 'DESCRIPTOR'
Using google-cloud-vision 2.0.0, protobuf 3.13.0, here is an example of how to serialize and de-serialize (example includes json and protobuf)
import io, json
from google.cloud import vision_v1
from google.cloud.vision_v1 import AnnotateImageResponse
with io.open('000048.jpg', 'rb') as image_file:
content = image_file.read()
image = vision_v1.Image(content=content)
client = vision_v1.ImageAnnotatorClient()
response = client.document_text_detection(image=image)
# serialize / deserialize proto (binary)
serialized_proto_plus = AnnotateImageResponse.serialize(response)
response = AnnotateImageResponse.deserialize(serialized_proto_plus)
print(response.full_text_annotation.text)
# serialize / deserialize json
response_json = AnnotateImageResponse.to_json(response)
response = json.loads(response_json)
print(response['fullTextAnnotation']['text'])
Note 1: proto-plus doesn't support converting to snake_case names, which is supported in protobuf with preserving_proto_field_name=True. So currently there is no way around the field names being converted from response['full_text_annotation'] to response['fullTextAnnotation']
There is an open closed feature request for this: googleapis/proto-plus-python#109
Note 2: The google vision api doesn't return an x coordinate if x=0. If x doesn't exist, the protobuf will default x=0. In python vision 1.0.0 using MessageToJson(), these x values weren't included in the json, but now with python vision 2.0.0 and .To_Json() these values are included as x:0
Maybe you were already able to find a solution to your issue (if that is the case, I invite you to share it as an answer to your own post too), but in any case, let me share some notes that may be useful for other users with a similar issue:
As you can check using the the type() function in Python, response is an object of google.cloud.vision_v1.types.AnnotateImageResponse type, while labels[i] is an object of google.cloud.vision_v1.types.EntityAnnotation type. None of them seem to have any out-of-the-box implementation to transform them to JSON, as you are trying to do, so I believe the easiest way to transform each of the EntityAnnotation in labels would be to turn them into Python dictionaries, then group them all into an array, and transform this into a JSON.
To do so, I have added some simple lines of code to your snippet:
[...]
label_dicts = [] # Array that will contain all the EntityAnnotation dictionaries
print('Labels:')
for label in labels:
# Write each label (EntityAnnotation) into a dictionary
dict = {'description': label.description, 'score': label.score, 'mid': label.mid}
# Populate the array
label_dicts.append(dict)
with open('labels.json', 'w') as fp:
json.dump(label_dicts, fp)
There is a library released by Google
from google.protobuf.json_format import MessageToJson
webdetect = vision_client.web_detection(blob_source)
jsonObj = MessageToJson(webdetect)
I was able to save the output with the following function:
# Save output as JSON
def store_json(json_input):
with open(json_file_name, 'a') as f:
f.write(json_input + '\n')
And as #dsesto mentioned, I had to define a dictionary. In this dictionary I have defined what types of information I would like to save in my output.
with open(photo_file, 'rb') as image:
image_content = base64.b64encode(image.read())
service_request = service.images().annotate(
body={
'requests': [{
'image': {
'content': image_content
},
'features': [{
'type': 'LABEL_DETECTION',
'maxResults': 20,
},
{
'type': 'TEXT_DETECTION',
'maxResults': 20,
},
{
'type': 'WEB_DETECTION',
'maxResults': 20,
}]
}]
})
The objects in the current Vision library lack serialization functions (although this is a good idea).
It is worth noting that they are about to release a substantially different library for Vision (it is on master of vision's repo now, although not released to PyPI yet) where this will be possible. Note that it is a backwards-incompatible upgrade, so there will be some (hopefully not too much) conversion effort.
That library returns plain protobuf objects, which can be serialized to JSON using:
from google.protobuf.json_format import MessageToJson
serialized = MessageToJson(original)
You can also use something like protobuf3-to-dict

Is there a way to dump YAML in same version it was loaded in with ruamel.yaml?

Is there a good way with ruamel.yaml to dump a YAML file back out in the same version as it is loaded in? If I have a %YAML 1.1 directive in the file I would like to be able to dump the file back out in YAML 1.1 without having to hard-code version='1.1'.
So given some data like,
%YAML 1.1
---
is_string: 'on'
is_boolean: on
I would like to avoid hard-coding version='1.1' on the round_trip_dump(),
x = f.read()
d = round_trip_load(x)
round_trip_dump(d, f, explicit_start=True)
The version of the YAML file is a fleeting value, which gets reset after loading. It was (is) my plan to make the version of the latest document loaded available somehow, but with multiple documents in a stream this needs some more thought.
For single document streams you can do the following to capture the version from the directive. This is all done with the new API. With the old API that you are using in the example the same is possible, but more difficult because there is no YAML() instance to attach attributes to:
import sys
from ruamel.yaml import YAML
from ruamel.yaml.parser import Parser
yaml_str = """\
%YAML 1.1
---
is_string: 'on'
is_boolean: on
"""
class MyParser(Parser):
def dispose(self):
self.loader.last_yaml_version = self.yaml_version
Parser.dispose(self)
yaml = YAML()
yaml.Parser = MyParser
data = yaml.load(yaml_str)
yaml2 = YAML()
yaml2.version = yaml.last_yaml_version
yaml2.dump(data, sys.stdout)
which gives:
%YAML 1.1
---
is_string: 'on'
is_boolean: true
Please note that it is necessary to create a clean, new object for output, as the "unversioned" reading doesn't fully reset the yaml instance, when encountering the %YAML 1.1 directive.
It is also possible to dump the value associated with is_boolean as on, but that would affect all booleans in the stream.

Resources