PRAW - appending output to a list - praw

I'm trying to edit some PRAW code so that, instead of printing out the comments of a subreddit post, it appends them to a dataframe to be used for further analysis.
The original bit of code I am trying to edit is:
from praw.models import MoreComments

for i in dat_comments_id:
    submission = reddit.submission(id=i)
    for comment in submission.comments:
        if isinstance(comment, MoreComments):
            continue
        print(comment.body)
What I originally had in mind, which doesn't work, was:
subreddit_comments = []
for i in dat_comments_id:
    submission = reddit.submission(id=i)
    for comment in submission.comments:
        if isinstance(comment, MoreComments):
            subreddit_comments.append(comment)
Rather than giving me a list of all the comments, it gives me a list of MoreComments objects like this:
[<MoreComments count=591, children=['dmh6ry8', 'dmgt4w1', 'dmgsdrf', '...']>,
<MoreComments count=747, children=['dimdq81', 'dimps03', 'dime3no', '...']>,
<MoreComments count=818, children=['do2y328', 'do2m468', 'do2o35v', '...']>,
<MoreComments count=21, children=['di8cx4x', 'di8380y', 'di826lg', '...']>,
<MoreComments count=370, children=['djk11xc', 'djkfd37', 'djkd0qs', '...']>]
How can I edit the code so that it appends the original output rather than printing it?

I think I stumbled on an answer. I'm still open to better options from anyone more familiar with PRAW.
comment_list = []
for i in dat_comments_id:
    submission = reddit.submission(id=i)
    submission.comments.replace_more(limit=None)
    for c in submission.comments.list():
        individual_comment = c.body  # This is the line that made the difference
        comment_list.append(individual_comment)
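Since the end goal was a dataframe, here is a minimal sketch of the next step (assuming pandas, which the question doesn't name explicitly):

import pandas as pd

# Turn the collected comment bodies into a one-column DataFrame
# for further analysis (comment_list comes from the snippet above).
df = pd.DataFrame({'comment': comment_list})
print(df.head())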

Related

Snakemake - parameter file treated as a wildcard

I have written a pipeline in Snakemake. It's an ATAC-seq pipeline (a bioinformatics pipeline to analyze genomics data from a specific experiment). Up to the alignment-merging step I use the {sample_id} wildcard, then switch to the {sample} wildcard (merging two or more sample_ids into one sample).
[Working DAG: for simplicity only one sample is shown; the orange and blue {sample_id}s are merged into one green {sample}.]
The all rule looks as follows:
configfile: "config.yaml"

SAMPLES_DICT = dict()
with open(config['SAMPLE_SHEET'], "r+") as fil:
    next(fil)
    for lin in fil.readlines():
        row = lin.strip("\n").split("\t")
        sample_id = row[0]
        sample_name = row[1]
        if sample_name in SAMPLES_DICT.keys():
            SAMPLES_DICT[sample_name].append(sample_id)
        else:
            SAMPLES_DICT[sample_name] = [sample_id]

SAMPLES = list(SAMPLES_DICT.keys())
SAMPLE_IDS = [sample_id for sample in SAMPLES_DICT.values() for sample_id in sample]

rule all:
    input:
        # FASTQC output for RAW reads
        expand(os.path.join(config['FASTQC'], '{sample_id}_R{read}_fastqc.zip'),
               sample_id = SAMPLE_IDS,
               read = ['1', '2']),
        # Trimming
        expand(os.path.join(config['TRIMMED'],
                            '{sample_id}_R{read}_val_{read}.fq.gz'),
               sample_id = SAMPLE_IDS,
               read = ['1', '2']),
        # Alignment
        expand(os.path.join(config['ALIGNMENT'], '{sample_id}_sorted.bam'),
               sample_id = SAMPLE_IDS),
        # Merging
        expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_merged.bam'),
               sample = SAMPLES),
        # Marking Duplicates
        expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_md.bam'),
               sample = SAMPLES),
        # Filtering
        expand(os.path.join(config['FILTERED'],
                            '{sample}.bam'),
               sample = SAMPLES),
        expand(os.path.join(config['FILTERED'],
                            '{sample}.bam.bai'),
               sample = SAMPLES),
        # multiqc report
        "multiqc_report.html"
    message:
        '\n#################### ATAC-seq pipeline #####################\n'
        'Running all necessary rules to produce complete output.\n'
        '############################################################'
I know it's messy and I should keep only the necessary bits, but this is where my understanding of Snakemake fails: I don't know what I have to keep and what I should delete.
This works, to my knowledge, exactly as I want.
However, I added a rule:
rule hmmratac:
    input:
        bam = os.path.join(config['FILTERED'], '{sample}.bam'),
        index = os.path.join(config['FILTERED'], '{sample}.bam.bai')
    output:
        model = os.path.join(config['HMMRATAC'], '{sample}.model'),
        gappedPeak = os.path.join(config['HMMRATAC'], '{sample}_peaks.gappedPeak'),
        summits = os.path.join(config['HMMRATAC'], '{sample}_summits.bed'),
        states = os.path.join(config['HMMRATAC'], '{sample}.bedgraph'),
        logs = os.path.join(config['HMMRATAC'], '{sample}.log'),
        sample_name = '{sample}'
    log:
        os.path.join(config['LOGS'], 'hmmratac', '{sample}.log')
    params:
        genomes = config['GENOMES'],
        blacklisted = config['BLACKLIST']
    resources:
        mem_mb = 32000
    message:
        '\n######################### Peak calling ########################\n'
        'Peak calling for {output.sample_name}\n.'
        '############################################################'
    shell:
        'HMMRATAC -Xms2g -Xmx{resources.mem_mb}m '
        '--bam {input.bam} --index {input.index} '
        '--genome {params.genome} --blacklist {params.blacklisted} '
        '--output {output.sample_name} --bedgraph true &> {log}'
And into rule all, after filtering and before multiqc, I added:
        # Peak calling
        expand(os.path.join(config['HMMRATAC'], '{sample}.model'),
               sample = SAMPLES),
Relevant config.yaml fragments:
# Path to blacklisted regions
BLACKLIST: "/mnt/data/.../hg38.blacklist.bed"
# Path to chromosome sizes
GENOMES: "/mnt/data/.../hg38_sizes.genome"
# Path to filtered alignment
FILTERED: "alignment/filtered"
# Path to peaks
HMMRATAC: "peaks/hmmratac"
This is the error* I get (it repeats for every input and output of the rule). *Technically it's a warning, but it halts execution of Snakemake, so I am calling it an error.
File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
WARNING:snakemake.logging:File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
The ... isn't actually in the path - I just didn't feel safe providing the absolute path here.
For a couple of days I have struggled with this error. I've looked through the documentation and listened to the introduction. I understand that the above description is far from perfect (it is huge because I don't even know how to pare it down to a minimal reproducible example...), but I am desperate and hope you can be patient with me.
Any suggestions as to how to google it, or where to look for the error, would be much appreciated.
"Technically it's a warning but it halts execution of snakemake so I am calling it an error."
It would be useful to post the logs from snakemake to see if snakemake terminated with an error and if so what error.
However, in addition to Eric C.'s suggestion to use wildcards.sample instead of {sample} as file name, I think that this is quite suspicious:
alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam
/mnt/ is usually at the root of the file system, and you are prepending a relative path (alignment/filtered) to it. Are you sure that is correct?
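Building on that, here is a hedged sketch of how the rule could look with sample_name dropped from output: (so Snakemake stops treating it as an output file path to match) and wildcards.sample used instead. The sketch also makes params.genomes and the shell's {params.genome} consistent, since the original defines one name but references the other:

rule hmmratac:
    input:
        bam = os.path.join(config['FILTERED'], '{sample}.bam'),
        index = os.path.join(config['FILTERED'], '{sample}.bam.bai')
    output:
        model = os.path.join(config['HMMRATAC'], '{sample}.model'),
        gappedPeak = os.path.join(config['HMMRATAC'], '{sample}_peaks.gappedPeak'),
        summits = os.path.join(config['HMMRATAC'], '{sample}_summits.bed'),
        states = os.path.join(config['HMMRATAC'], '{sample}.bedgraph'),
        logs = os.path.join(config['HMMRATAC'], '{sample}.log')
    log:
        os.path.join(config['LOGS'], 'hmmratac', '{sample}.log')
    params:
        genomes = config['GENOMES'],
        blacklisted = config['BLACKLIST']
    resources:
        mem_mb = 32000
    message:
        # wildcards.sample is available in message and shell without
        # declaring it anywhere else.
        'Peak calling for {wildcards.sample}'
    shell:
        'HMMRATAC -Xms2g -Xmx{resources.mem_mb}m '
        '--bam {input.bam} --index {input.index} '
        '--genome {params.genomes} --blacklist {params.blacklisted} '
        '--output {wildcards.sample} --bedgraph true &> {log}'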

How do I find a line of code with pyelftools/libdwarf

I have a function name and an offset from the top of that function. I know I can find the line of code by looking at the assembly listing file, computing the offset of the line of code, and getting the line number that way.
What I'm trying to do is use the .o file to get that same information. I can see the DWARF information for the ELF file and can find the DIE for the function in the DWARF data, but how do I actually see the info for the instructions of that function and map them to a line of code? I've been using pyelftools, so I'd like to stick with it, but I am open to other options if pyelftools can't do this.
There's a sample in pyelftools that does that: https://github.com/eliben/pyelftools/blob/master/examples/dwarf_decode_address.py
Specifically, finding the line for the address goes like this:
def decode_file_line(dwarfinfo, address):
    # Go over all the line programs in the DWARF information, looking for
    # one that describes the given address.
    for CU in dwarfinfo.iter_CUs():
        # First, look at line programs to find the file/line for the address
        lineprog = dwarfinfo.line_program_for_CU(CU)
        prevstate = None
        for entry in lineprog.get_entries():
            # We're interested in those entries where a new state is assigned
            if entry.state is None:
                continue
            if entry.state.end_sequence:
                # if the line number sequence ends, clear prevstate.
                prevstate = None
                continue
            # Looking for a range of addresses in two consecutive states that
            # contain the required address.
            if prevstate and prevstate.address <= address < entry.state.address:
                filename = lineprog['file_entry'][prevstate.file - 1].name
                line = prevstate.line
                return filename, line
            prevstate = entry.state
    return None, None
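To connect this back to the original question (a function name plus an offset into it), a rough sketch: find the function's DIE by DW_AT_name, take its DW_AT_low_pc, add the offset, and pass the resulting address to decode_file_line. Here example.o, my_func, and the 0x2a offset are hypothetical, and in a relocatable .o the low_pc may be section-relative rather than an absolute address:

from elftools.elf.elffile import ELFFile

def find_function_low_pc(dwarfinfo, func_name):
    # Scan the DIEs of every CU for a subprogram with a matching name.
    for CU in dwarfinfo.iter_CUs():
        for DIE in CU.iter_DIEs():
            if DIE.tag != 'DW_TAG_subprogram':
                continue
            name = DIE.attributes.get('DW_AT_name')
            low_pc = DIE.attributes.get('DW_AT_low_pc')
            if name and low_pc and name.value.decode() == func_name:
                return low_pc.value
    return None

with open('example.o', 'rb') as f:  # hypothetical object file
    elffile = ELFFile(f)
    dwarfinfo = elffile.get_dwarf_info()
    low_pc = find_function_low_pc(dwarfinfo, 'my_func')  # hypothetical name
    if low_pc is not None:
        filename, line = decode_file_line(dwarfinfo, low_pc + 0x2a)
        print(filename, line)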

How do I properly format code for desired appending output?

I'm writing new code and having problems getting the desired output. The code reads an HTML file and finds <a> tags; it outputs only the URL. I insert additional code to complete the link. I'm trying to insert the URL two times within the string.
import re
import time
from bs4 import BeautifulSoup

####### Parse for <a> tags and save ############
with open("page1.html", 'r') as htmlb:
    soup2 = BeautifulSoup(htmlb, 'lxml')
links = []
for link in soup2.findAll('a', attrs={'href': re.compile("^https://")}):
    links.append('<a href='"{link}"'><br>')
    time.sleep(.1)
with open("page-2.html", 'w') as html:
    html.write('{links}\n'.format(links=links))
This should give you the desired html output file:
import re
from bs4 import BeautifulSoup
import html

with open("page1.html", 'r') as htmlb:
    soup2 = BeautifulSoup(htmlb, 'lxml')
with open("page2.html", 'w') as h:
    for link in soup2.find_all('a'):
        h.write('<a href="{}">{}</a><br>'.format(link.get('href'), link.get('href')))
This gives me what I want, I guess, but not exactly. I would rather see it written out as "https://whatever.com/text/text/" than as "whatever.com/text/text".
####### Parse for <a> tags and save ############
with open("page1.html", 'r') as htmlb:
    soup2 = BeautifulSoup(htmlb, 'lxml')
links = []
for link in soup2.findAll('a', attrs={'href': re.compile("^https://")}):
    links.append('<a href="{0}">{0}</a><br>'.format(link, link))
with open("page-2.html", 'w') as html:
    html.write('{links}\n'.format(links=links))
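If the goal is the full https:// URL as both the link target and the visible text, a minimal sketch (the f-string is my choice, not from the original; file names are from the question):

import re
from bs4 import BeautifulSoup

with open("page1.html") as htmlb:
    soup2 = BeautifulSoup(htmlb, 'lxml')

with open("page-2.html", 'w') as out:
    for link in soup2.findAll('a', attrs={'href': re.compile("^https://")}):
        url = link.get('href')
        # Write the absolute URL as both the href and the link text.
        out.write(f'<a href="{url}">{url}</a><br>\n')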

Prettify YAML with comments

1. Summary
I can't find how I can automatically prettify my YAML files.
2. Data
Example:
I have a SashaPrettifyYAML.yaml file:
sasha_commands:
  # Sasha comment
  sasha_command_help: {call: sublime.command_help, caption: 'Sasha Command: Command Help'}
3. Expected behavior
I want to delete {braces}:
sasha_commands:
  # Sasha comment
  sasha_command_help:
    call: sublime.command_help
    caption: 'Sasha Command: Command Help'
4. Not helped
Pretty YAML (based on PyYAML) and online formatters such as YAML Formatter and OnlineYAMLTools delete comments;
I can't find the required option in ruamel.yaml.cmd;
align-yaml aligns the YAML file but does not prettify it.
There is no option to do this in ruamel.yaml.cmd, but it is fairly straightforward to do this with a small python program and using ruamel.yaml, by loading and dumping in round-trip mode (the default).
The only thing you need to do is make sure the flow-style on the data-structure that is the value for the key sasha_command_help is set to block-style (which is how I interpret your definition of "prettifying YAML"):
import sys
import ruamel.yaml

yaml_str = """\
sasha_commands:
  # Sasha comment
  sasha_command_help: {call: sublime.command_help, caption: 'Sasha Command: Command Help'}
"""

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load(yaml_str)
data['sasha_commands']['sasha_command_help'].fa.set_block_style()
yaml.dump(data, sys.stdout)
This will give exactly the output you expect.
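For reference, the dump is the block-style form from the question:

sasha_commands:
  # Sasha comment
  sasha_command_help:
    call: sublime.command_help
    caption: 'Sasha Command: Command Help'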
A recursive data structure walker can be found in scalarstring.py in the ruamel.yaml source, and it can be adapted into a generic "make-everything-block-style" routine:
import sys
import ruamel.yaml

def block_style(base):
    """
    This routine walks over a simple, i.e. consisting of dicts, lists and
    primitives, tree loaded from YAML. It recurses into dict values and list
    items, and sets block-style on these.
    """
    if isinstance(base, dict):
        for k in base:
            try:
                base.fa.set_block_style()
            except AttributeError:
                pass
            block_style(base[k])
    elif isinstance(base, list):
        for elem in base:
            try:
                base.fa.set_block_style()
            except AttributeError:
                pass
            block_style(elem)

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
file_in = sys.argv[1]
file_out = sys.argv[2]
with open(file_in) as fp:
    data = yaml.load(fp)
block_style(data)
with open(file_out, 'w') as fp:
    yaml.dump(data, fp)
If you store the above in prettifyyaml.py you can call it with:
python prettifyyaml.py SashaPrettifyYAML.yaml Prettified.yaml
Since you are already using single quotes around the scalar that has embedded spaces, you won't see a change if you leave out yaml.preserve_quotes = True. But if you had used a double quoted scalar then that line makes sure the double quotes are preserved.
I had the same problem. I wrote my own YAML beautifier https://github.com/wangkuiyi/yamlfmt. I hope it helps.
I tried the top results from Google, but none of them addressed the requirements of https://sqlflow.org/sqlflow, which I am leading:
https://pypi.org/project/yamlfmt cannot handle a file of multiple YAML documents separated by ---
https://github.com/devopyio/yamlfmt cannot handle multiple files.
https://github.com/miekg/yamlfmt/blob/master/fmt.go cannot replace (inline edit) the input files.
You can use the yq tool - it's easy to install and use, and it's well maintained.
Supposing you have an example.yml file to format, it can be processed in the following ways:
from file: yq r --unwrapScalar -p pv -P example.yml '*'
from stdin: cat example.yml | yq r --unwrapScalar -p pv -P - '*'

Stanford NLP Coref Resolution for Conversational Data

I want to run some experiments with the Stanford dcoref package on our conversational data. Our data contains usernames (speakers) and their utterances. Is it possible to give structured data as input (instead of raw text) to the Stanford dcoref annotator? If yes, what should the format of the conversational input data be?
Thank you,
-berfin
I was able to get this basic example to work:
<doc id="speaker-example-1">
  <post author="Joe Smith" datetime="2018-02-28T20:10:00" id="p1">
    I am hungry!
  </post>
  <post author="Jane Smith" datetime="2018-02-28T20:10:05" id="p2">
    Joe Smith is hungry.
  </post>
</doc>
I used these properties:
annotators = tokenize,cleanxml,ssplit,pos,lemma,ner,parse,coref
coref.conll = true
coref.algorithm = clustering
# Clean XML tags for SGM (move to sgm specific conf file?)
clean.xmltags = headline|dateline|text|post
clean.singlesentencetags = HEADLINE|DATELINE|SPEAKER|POSTER|POSTDATE
clean.sentenceendingtags = P|POST|QUOTE
clean.turntags = TURN|POST|QUOTE
clean.speakertags = SPEAKER|POSTER
clean.docIdtags = DOCID
clean.datetags = DATETIME|DATE|DATELINE
clean.doctypetags = DOCTYPE
clean.docAnnotations = docID=doc[id],doctype=doc[type],docsourcetype=doctype[source]
clean.sectiontags = HEADLINE|DATELINE|POST
clean.sectionAnnotations = sectionID=post[id],sectionDate=post[date|datetime],sectionDate=postdate,author=post[author],author=poster
clean.quotetags = quote
clean.quoteauthorattributes = orig_author
clean.tokenAnnotations = link=a[href],speaker=post[author],speaker=quote[orig_author]
clean.ssplitDiscardTokens = \\n|\\*NL\\*
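Assuming the properties above are saved as speaker.properties and the example document as speaker-example-1.xml (both file names are mine), a run could look like:
java -Xmx5g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props speaker.properties -file speaker-example-1.xml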
Also this document has great info on the coref system:
https://stanfordnlp.github.io/CoreNLP/coref.html
I am looking into using the neural option on my example .xml document, but you might have to put your data into the CoNLL format to run our neural coref with the CoNLL settings. The CoNLL data includes conversational data with speaker info, among other document formats.
This document contains info on the CoNLL format you'd have to use for the neural algorithm to work.
CoNLL 2012 format: http://conll.cemantix.org/2012/data.html
You need to create a folder with a similar directory structure (but you can put your own files in instead). Example:
/Path/to/conll_2012_dir/v9/data/test/data/english/annotations/wb/eng/00/eng_0009.v9_auto_conll
If you run this command:
java -Xmx20g edu.stanford.nlp.coref.CorefSystem -props speaker.properties
with these properties:
coref.algorithm = clustering
coref.conll = true
coref.conllOutputPath = /Path/to/output_dir
coref.data = /Path/to/conll_2012_dir
it will write conll output files to /Path/to/output_dir
That command should read in all files ending with _auto_conll
