I'm trying to do a very simple thing: read a diff from a git repo via the Ruby gem Grit. I'm creating a file and adding the line "This is me changing the first file". Now I do this to get the diff:
r = Grit::Repo.new("myrepo")
c = r.commits.first
d = r.commit_diff(c.id).first
puts d.diff
The output of this is:
--- a/First-File.asciidoc
+++ b/First-File.asciidoc
@@ -1,2 +1 @@
-This is me changing the first file
See that minus in front of the added line? Why would a commit_diff show in reverse? I know git reverses the diff if I swap the commit SHAs, but this is a Grit library call that only takes a single commit.
Any clues?
Let me answer that question myself: the diff shows up in the correct form if you do this instead:
r = Grit::Repo.new("myrepo")
c = r.commits.first
d = c.diffs.first
puts d.diff
Not sure what the difference would be between Commit.diffs and Repo.commit_diff.
I am getting the diff between two commits using gitpython in the following way:
from git import Repo

def get_inbetween_commit_diff(repo_path, commit_a, commit_b):
    repo = Repo(repo_path)
    uni_diff_text = repo.git.diff(
        "{}".format(commit_a), "{}".format(commit_b),
        ignore_blank_lines=True, ignore_space_at_eol=True
    )
    return uni_diff_text
However, the default repo.git.diff shows the diff with double dot. Is there a way to achieve triple dot diff using gitpython?
Reference on double dot and triple dot diff: https://matthew-brett.github.io/pydagogue/git_diff_dots.html
repo.git.diff calls git directly, so I think you can just do this:
repo.git.diff(
    "{}...{}".format(commit_a, commit_b),
    ignore_blank_lines=True, ignore_space_at_eol=True
)
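If you want to be explicit about what the triple-dot form does, you can also compute the merge base yourself. A minimal sketch (the function name is illustrative; it assumes commit_a and commit_b are valid revisions in the repository):

from git import Repo

def get_three_dot_diff(repo_path, commit_a, commit_b):
    repo = Repo(repo_path)
    # A...B diffs B against the merge base of A and B
    merge_base = repo.merge_base(commit_a, commit_b)[0]
    return repo.git.diff(
        merge_base.hexsha, commit_b,
        ignore_blank_lines=True, ignore_space_at_eol=True
    )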
I've got an input YAML file (test.yml) as follows:
# sample set of lines
foo:
  x: 12
  y: hello world
  ip_range['initial']: 1.2.3.4
  ip_range[]: tba
  array['first']: Cluster1
array2[]: bar
The source contains square brackets for some keys (possibly empty).
I'm trying to get a line-by-line list of all the paths in the file, ideally like:
foo.x: 12
foo.y: hello world
foo.ip_range['initial']: 1.2.3.4
foo.ip_range[]: tba
foo.array['first']: Cluster1
array2[]: bar
I've used the yamlpath library and its yaml-paths CLI, but can't get the desired output. Trying this:
yaml-paths -m -s =foo -K test.yml
outputs:
foo.x
foo.y
foo.ip_range\[\'initial\'\]
foo.ip_range\[\]
foo.array\[\'first\'\]
Each path is on its own line, but the output includes escape characters (\). Removing the -m option ("expand matching parent nodes") fixes that problem, but then the output is no longer one path per line:
yaml-paths -s =foo -K test.yml
gives:
foo: {"x": 12, "y": "hello world", "ip_range['initial']": "1.2.3.4", "ip_range[]": "tba", "array['first']": "Cluster1"}
Any ideas how I can get one line per path but without the escape characters? Is there anything for path querying in the ruamel modules?
Your "paths" are nothing more than the joined string representation of the keys (and probably indices) of the
mappings (and potentially sequences) in your YAML document.
That can be trivially generated from data loaded from YAML with a recursive function:
import ruamel.yaml

yaml_str = """\
# sample set of lines
foo:
  x: 12
  y: hello world
  ip_range['initial']: 1.2.3.4
  ip_range[]: tba
  array['first']: Cluster1
array2[]: bar
"""

def pathify(d, p=None, paths=None, joinchar='.'):
    if p is None:
        paths = {}
        pathify(d, "", paths, joinchar=joinchar)
        return paths
    pn = p
    if p != "":
        pn += joinchar
    if isinstance(d, dict):
        for k in d:
            v = d[k]
            pathify(v, pn + k, paths, joinchar=joinchar)
    elif isinstance(d, list):
        for idx, e in enumerate(d):
            pathify(e, pn + str(idx), paths, joinchar=joinchar)
    else:
        paths[p] = d

yaml = ruamel.yaml.YAML(typ='safe')
paths = pathify(yaml.load(yaml_str))
for p, v in paths.items():
    print(f'{p} -> {v}')
which gives:
foo.x -> 12
foo.y -> hello world
foo.ip_range['initial'] -> 1.2.3.4
foo.ip_range[] -> tba
foo.array['first'] -> Cluster1
array2[] -> bar
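As a small illustration that sequences get integer indices in the generated paths, here is the pathify function above applied to ad-hoc data:

data = {'foo': {'bar': ['a', 'b']}}
print(pathify(data))
# {'foo.bar.0': 'a', 'foo.bar.1': 'b'}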
While Anthon's answer certainly produces the output you were after, I think your question was specifically about how to get the yaml-paths command to produce the desired output. I'll address that original question.
As of version 3.5.0, the yamlpath project's yaml-paths command supports a --noescape option which removes the escape symbols from output. Using your input file and the new option, you may find this output more to your liking:
$ yaml-paths --nofile --expand --keynames --noescape --values --search='=~/.*/' test.yml
foo.x: 12
foo.y: hello world
foo.ip_range['initial']: 1.2.3.4
foo.ip_range[]: tba
foo.array['first']: Cluster1
array2[]: bar
Note:
Using the --values option includes the value with each YAML Path.
For interest, I changed the --search expression to match every node in the input file rather than only the "foo" data.
The default output (without setting --noescape) produces YAML Paths which can be used as direct input into other YAML Path parsers and processors; setting --noescape changes this to render human-friendly paths which may not work as downstream YAML Path input.
Disclaimer: I am the author of the yamlpath project. Should you ever run into issues or have questions about it, please visit the project's GitHub project site and engage me via Issues (bugs and feature requests) or Discussions (questions). Thank you!
I think I have a simple problem, but I don't know how to solve it.
My input folder contains files like this:
AAAAA_S1_R1_001.fastq
AAAAA_S1_R2_001.fastq
BBBBB_S2_R1_001.fastq
BBBBB_S2_R2_001.fastq
My Snakemake code:
import os
import glob

samples = [os.path.basename(x) for x in sorted(glob.glob("input/*.fastq"))]

name = []
for x in samples:
    if "_R1_" in x:
        name.append(x.split("_R1_")[0])
NAME = name

rule all:
    input:
        expand("output/{sp}_mapped.bam", sp=NAME),

rule bwa:
    input:
        R1 = "input/{sample}_R1_001.fastq",
        R2 = "input/{sample}_R2_001.fastq"
    output:
        mapped = "output/{sample}_mapped.bam"
    params:
        ref = "refs/AF086833.fa"
    run:
        shell("bwa mem {params.ref} {input.R1} {input.R2} | samtools sort > {output.mapped}")
The output file names are:
AAAAA_S1_mapped.bam
BBBBB_S2_mapped.bam
I want the output file to be:
AAAAA_mapped.bam
BBBBB_mapped.bam
How can I either change the output name or rename the files before or after the bwa rule?
Try this:
import pathlib

indir = pathlib.Path("input")
paths = indir.glob("*_S?_R?_001.fastq")
samples = set([x.stem.split("_")[0] for x in paths])

rule all:
    input:
        expand("output/{sample}_mapped.bam", sample=samples)

def find_fastqs(wildcards):
    fastqs = [str(x) for x in indir.glob(f"{wildcards.sample}_*.fastq")]
    return sorted(fastqs)

rule bwa:
    input:
        fastqs = find_fastqs
    output:
        mapped = "output/{sample}_mapped.bam"
    params:
        ref = "refs/AF086833.fa"
    shell:
        "bwa mem {params.ref} {input.fastqs} | samtools sort > {output.mapped}"
This uses an input function to find the correct FASTQ files for rule bwa. There might be a more elegant solution, but I can't see it right now. I think this should work, though.
(Edited to reflect OP's edit.)
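As a quick sanity check of the glob logic outside Snakemake (assuming the four example files from the question sit in input/):

import pathlib

indir = pathlib.Path("input")
samples = set(x.stem.split("_")[0] for x in indir.glob("*_S?_R?_001.fastq"))
print(samples)
# {'AAAAA', 'BBBBB'} (order may vary)
print(sorted(str(x) for x in indir.glob("AAAAA_*.fastq")))
# ['input/AAAAA_S1_R1_001.fastq', 'input/AAAAA_S1_R2_001.fastq']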
Unfortunately, I've also had this problem, with filenames following this logic: {batch}/{seq_run}_{index}_{flowcell}_{lane}_{read_orientation}.fastq.gz.
I think the core problem is that none of the individual wildcards is unique. Also, not all values of all wildcards can be combined; seq_run1 was run on lane1, not lane2. Therefore, expand() does not work.
After multiple attempts in Snakemake (see below), my solution was to standardize the input filenames with mv / sed / rename (see the sketch after the list below). Removing {batch}, {flowcell} and {lane} made it possible to use {sample}, a unique combination of {seq_run} and {index}.
What did not work (though it may be worth trying for others in the same situation):
Adding the zip argument to expand()
Renaming the output using the following syntax:
output: "_".join(re.split("[/_]", "{full_filename}")[1:3]) + ".fastq.gz"
Similar to this question, but instead of creating a new file, I'm trying to merge from origin. After creating a new index using Rugged::Repository's merge_commits, and a new merge commit, git reports the new file (coming from origin) as deleted.
Create a merge index,
> origin_target = repo.references['refs/remotes/origin/master'].target
> merge_index = repo.merge_commits(repo.head.target, origin_target)
and a new merge commit,
> options = {
    update_ref: 'refs/heads/master',
    committer: {name: 'user', email: 'user@foo.com', time: Time.now},
    author: {name: 'user', email: 'user@foo.com', time: Time.now},
    parents: [repo.head.target, origin_target],
    message: "merge `origin/master` into `master`"}
and make sure to use the tree from the merge index.
> options[:tree] = merge_index.write_tree(repo)
Create the commit
> merge_commit = Rugged::Commit.create(repo, options)
Check that our HEAD has been updated:
> repo.head.target.tree
=> #<Rugged::Tree:16816500 {oid: 16c147f358a095bdca52a462376d7b5730e1978e}>
<"first_file.txt" 9d096847743f97ba44edf00a910f24bac13f36e2>
<"second_file.txt" 8178c76d627cade75005b40711b92f4177bc6cfc>
<"newfile.txt" e69de29bb2d1d6434b8b29ae775ad8c2e48c5391>
Looks good. I see the new file in the index. Write it to disk.
> repo.index.write
=> nil
...but git reports the new file as deleted:
$ git st
## master...origin/master [ahead 2]
D newfile.txt
How can I properly update my index and working tree?
There is an important distinction between the Git repository and the working directory. While most common command-line git commands operate on the working directory as well as the repository, the lower-level commands of libgit2 / Rugged mostly operate only on the repository. This includes writing the index, as in your example.
To update the working directory to match the index, the following command should work (after writing the index):
options = { strategy: :force }
repo.checkout_head(options)
Docs for checkout_head: http://www.rubydoc.info/github/libgit2/rugged/Rugged/Repository#checkout_head-instance_method
Note: I tested with update_ref: 'HEAD' for the commit. I'm not sure if update_ref: 'refs/heads/master' will have the same effect.
I'm working on a program that will be adding and updating files in a git repo. Since I can't be sure if a file that I am working with is currently in the repo, I need to check its existence - an action that seems to be harder than I thought it would be.
The 'in' comparison doesn't seem to work on non-root levels of trees in gitpython. For example:
>>> repo = Repo(path)
>>> hct = repo.head.commit.tree
>>> 'A' in hct['documents']
False
>>> hct['documents']['A']
<git.Tree "8c74cba527a814a3700a96d8b168715684013857">
So I'm left to wonder, how do people check that a given file is in a git tree before trying to work on it? Trying to access an object for a file that is not in the tree will throw a KeyError, so I can do try-catches. But that feels like a poor use of exception handling for a routine existence check.
Have I missed something really obvious? How does one check for the existence of a file in a commit tree using gitpython (or really any library/method in Python)?
Self Answer
OK, I dug around in the Tree class to see what __contains__ does. It turns out that, when searching in subfolders, one has to check for the existence of a file using the full relative path from the repo's root. So a working version of the check I did above is:
>>> 'documents/A' in hct['documents']
True
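For the record, __contains__ compares a string against the repo-relative paths of a tree's immediate entries only, which explains both results. A sketch of the three cases (assuming the documents/A layout from the question):

hct = repo.head.commit.tree
print('A' in hct['documents'])            # False: entries are keyed by repo-relative path
print('documents/A' in hct['documents'])  # True: full path, checked against the direct parent tree
print('documents/A' in hct)               # False: the root tree only lists its immediate entries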
EricP's answer has a bug. Here's a fixed version:
import os

def fileInRepo(repo, filePath):
    '''
    repo is a gitPython Repo object
    filePath is the full path to the file from the repository root
    returns True if the file is found in the repo at the specified path, False otherwise
    '''
    pathdir = os.path.dirname(filePath)
    # Build up a reference to the desired repo path
    rsub = repo.head.commit.tree
    if pathdir:  # guard for files at the repo root, where dirname() is empty
        for path_element in pathdir.split(os.path.sep):
            # If a directory on the file path is not in the repo, neither is the file.
            try:
                rsub = rsub[path_element]
            except KeyError:
                return False
    return filePath in rsub
Usage:
file_found = fileInRepo(repo, 'documents/A')
This is very similar to EricP's code, but handles the case where the folder containing the file is not in the repo. EricP's function raises a KeyError in that case. This function returns False.
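To illustrate the difference with a hypothetical path whose directory is not in the repo:

# This version returns False when a directory on the path is missing...
file_found = fileInRepo(repo, 'no_such_dir/A')  # False
# ...while EricP's version below raises KeyError('no_such_dir') for the same input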
(I offered to edit EricP's code but was rejected.)
Expanding on Bill's solution, here is a function that determines whether a file is in a repo:
import os

def fileInRepo(repo, path_to_file):
    '''
    repo is a gitPython Repo object
    path_to_file is the full path to the file from the repository root
    returns True if the file is found in the repo at the specified path, False otherwise
    '''
    pathdir = os.path.dirname(path_to_file)
    # Build up a reference to the desired repo path
    rsub = repo.head.commit.tree
    for path_element in pathdir.split(os.path.sep):
        rsub = rsub[path_element]
    return path_to_file in rsub
Example usage:
file_found = fileInRepo(repo, 'documents/A')
If you want to avoid try/except, you can check whether an object is in the repo like this:
import os

def fileInRepo(repo, path_to_file):
    dir_path = os.path.dirname(path_to_file)
    rsub = repo.head.commit.tree
    path_elements = dir_path.split(os.path.sep)
    for el_id, element in enumerate(path_elements):
        sub_path = os.path.join(*path_elements[:el_id + 1])
        if sub_path in rsub:
            rsub = rsub[element]
        else:
            return False
    return path_to_file in rsub
or you can iterate through all items in the repo, though that will certainly be slower:
def isFileInRepo(repo, path_to_file):
    rsub = repo.head.commit.tree
    for element in rsub.traverse():
        if element.path == path_to_file:
            return True
    return False
There is already a method of Tree that does what the fileInRepo in Lucidity's answer re-implements.
The method is Tree.join:
https://gitpython.readthedocs.io/en/3.1.29/reference.html#git.objects.tree.Tree.join
A less redundant implementation of fileInRepo is:
def fileInRepo(repo, filePath):
    try:
        repo.head.commit.tree.join(filePath)
        return True
    except KeyError:
        return False
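Usage is the same as for the earlier variants:

file_found = fileInRepo(repo, 'documents/A')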