What is the format for specifying multiple tags to exclude from cvs2git - cvs2svn

I understand I can perform a conversion of a CVS repository to Git, excluding all NIGHTLY_XXXX build tags, by doing the following:
cvs2git --exclude='NIGHTLY_.*' --blobfile=git-blob.dat
--dumpfile=git-dump.dat --username=dev /opt/mycvsrepository/mymodule
But what is the format of the command-line arguments if I also want to exclude more than one pattern? e.g.,
"NIGHTLY_", "BETA_RELEASE_", and "RC_*"
Many thanks in advance

You can construct a more complicated regular expression that matches all of the tag names that should be excluded, like
cvs2git --exclude='(NIGHTLY|BETA_RELEASE|RC)_.*' \
--blobfile=git-blob.dat --dumpfile=git-dump.dat \
--username=dev /opt/mycvsrepository/mymodule
But it is probably easier to use the --exclude option multiple times; e.g.,
cvs2git --exclude='NIGHTLY_.*' \
--exclude='BETA_RELEASE_.*' \
--exclude='RC_.*' \
--blobfile=git-blob.dat --dumpfile=git-dump.dat \
--username=dev /opt/mycvsrepository/mymodule

Sorry to answer my own question, but thanks to this amazing https://regex101.com/ site I realized I need to give it as a regular expression,
so the format would be
--exclude='(NIGHTLY_.*)|(BETA_RELEASE_.*)|(RC_.*)'


Bash script execute wget with an input inside url

Complete newbie here. I know you probably can't use the variables like that in there, but I have 20 minutes to deliver this, so HELP:
read -r -p "Month?: " month
read -r -p "Year?: " year
URL= "https://gz.blockchair.com/ethereum/blocks/"
wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_$year$month*.tsv.gz"
exit
There are two issues with your code.
First, remove the whitespace that follows the equals sign when you declare your URL variable, so the line becomes
URL="https://gz.blockchair.com/ethereum/blocks/"
Second, you are building your URL using a shell-style wildcard, which wget does not expand for HTTP URLs (wget only supports globbing for FTP). So you cannot do something like month*.tsv.gz as you are doing right now. If you need to request several URLs, you have to run wget once for each of them.
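For example, a minimal sketch of that approach (the two file names here are hypothetical placeholders, not real dump names):
# Run wget once per file instead of relying on a wildcard
for f in blockchair_ethereum_blocks_20220101.tsv.gz blockchair_ethereum_blocks_20220102.tsv.gz; do
    wget -w 2 --limit-rate=20k "${URL}${f}"
done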
It's possible to do what you're trying to do with wget; however, this particular site's robots.txt has a rule disallowing crawling of all files (https://gz.blockchair.com/robots.txt):
User-agent: *
Disallow: /
That means the site's admins don't want you to do this. wget respects robots.txt by default, but it's possible to turn that off with -e robots=off.
For this reason, I won't post a specific, copy/pasteable solution.
Here is a generic example for selecting (and downloading) files using a glob pattern, from a typical html index page:
url=https://www.example.com/path/to/index
wget \
--wait 2 --random-wait --limit-rate=20k \
--recursive --no-parent --level 1 \
--no-directories \
-A "file[0-9][0-9]" \
"$url"
This would download all files named file with a two-digit suffix (file52 etc.) that are linked on the page at $url and whose parent path is also $url (--no-parent).
This is a recursive download, recursing one level of links (--level 1). wget lets us use patterns to accept or reject filenames when recursing (-A and -R for globs; also --accept-regex and --reject-regex).
Certain sites may block the wget user-agent string; it can be spoofed with --user-agent.
Note that certain sites may ban your IP (and/or add it to a blacklist) for scraping, especially if you do it repeatedly or don't respect robots.txt.
If you want to download the blocks for every day in a month, you can change the * in your original script to an argument, say day, after first assigning a list of days to a variable.
Then iterate with for day in … and run your wget command inside the loop, as sketched below.
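A minimal sketch of that loop (the zero-padded 01..31 range is an assumption; trim it to the days that actually exist):
days=$(seq -w 1 31)   # zero-padded day numbers 01..31
for day in $days; do
    wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_${year}${month}${day}.tsv.gz"
done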

BWA-mem and sambamba read group line error

This is a two-part question:
help interpreting an error;
help with coding.
I'm trying to run bwa-mem and sambamba to align raw reads to a reference genome and to sort by position. These are the commands I'm using:
bwa mem \
-K 100000000 -v 3 -t 6 -Y \
-R '\#S200031047L1C001R002\S*[1-2]' \
/path/to/reference/GCF_009858895.2_ASM985889v3_genomic.fna \
/path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_1.fq.gz \
/path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_2.fq.gz | \
/path/to/genomics/sambamba-0.8.2 view -S -f bam \
/dev/stdin | \
/path/to/genomics/sambamba-0.8.2 sort \
/dev/stdin \
--out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam
This is the error message I'm getting: [E::bwa_set_rg] the read group line is not started with @RG.
My sequences were generated with an MGI sequencer and the readgroups are identified like this: @S200031047L1C001R0020000243/1, i.e., they don't begin with @RG. How can I specify to sambamba that my readgroups start with @S and not @RG?
The commands written above are from a published pipeline I'm modifying for my own research. However, among several changes, I'm not confident about how to define the sample ID used in the last line of the code: --out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam (I'm referring to ${SAMPLE}). Any insights?
Thank you very much!
1. Specifying read groups
Your read group string is not correctly formatted. It should look like
'@RG\tID:$ID\tSM:$SM\tLB:$LB\tPU:$PU\tPL:$PL', where the parts beginning with a $ sign should be replaced with the information specific to your sequencing run and sample. Not all of them are required for all purposes. See this read group documentation by the GATK team for an example.
A read group specification always begins with @RG; that's part of the SAM format. Sequencers do not produce read groups. I think you may be confusing them with fastq header lines. Entries in the read group string are separated by tabs, denoted by \t. Tags and their values are separated by :.
The difference between $ID (read group ID) and $SM (sample ID) is that the sample is the individual or biological sample, which may have been sequenced several times in different libraries ($LB). In the GATK documentation they combine flowcell and library into the read group ID. Sample and library could make an intuitive read group ID in small projects. If you are working on your own project that is not part of a larger sequencing effort, you can define the read groups as you like. If several people work on the same project, you should be consistent to avoid problems later.
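For example, a sketch only (the ID, SM, LB, and PU values below are placeholders, not derived from your data; DNBSEQ is the SAM-spec platform value covering MGI instruments):
bwa mem \
    -K 100000000 -v 3 -t 6 -Y \
    -R '@RG\tID:S200031047L1\tSM:mysample\tLB:lib1\tPU:S200031047L1C001\tPL:DNBSEQ' \
    /path/to/reference/GCF_009858895.2_ASM985889v3_genomic.fna \
    reads_1.fq.gz reads_2.fq.gz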
2. Variable substitution
I'm not sure if I understood you correctly, but if you are wondering what ${SAMPLE} means in the command, it's a variable called SAMPLE that will be replaced by its value when the command is run. The curly brackets protect the name so that the shell does not confuse the variable name with characters coming after it. See here for examples.
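For instance, a hypothetical wrapper script might set the variable and create the output directory up front (most tools won't create it for you):
SAMPLE="sample01"                  # hypothetical sample ID; use your own naming scheme
mkdir -p "host_removal/${SAMPLE}"
# ...then run the pipeline; the output lands in:
# host_removal/sample01/sample01.hybrid.sorted.bam
To process several samples, wrap the whole pipeline in a loop such as for SAMPLE in sample01 sample02; do … done, or pass the name in as $1 from the command line.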

Folder listing with gsutil with condition

I have this: gsutil ls -d gs://mystorage/*123*,
which gives me all files matching the pattern "123".
I wonder if I could do this with a condition like >123 and <127, to grab all files whose names contain 124, 125, and 126.
In addition to *, gsutil supports other wildcards: ? matches a single character, and bracket expressions like [0-9] match a set or range of characters.
You can use these wildcards to match the names of your files, but keep in mind that you are working with strings and characters rather than numbers, so the solution is not very straightforward. Here is a guide to regular expressions that better explains how to work with digits in a general way.
For your specific question, you would end up with something like:
gsutil ls -d gs://mystorage/*12[456]*
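If the condition ever gets too complex for a single wildcard, one workaround (a sketch; the pattern here is just illustrative) is to list the objects and filter the names with grep:
# List everything, then keep only names containing 124, 125 or 126
gsutil ls -d gs://mystorage/* | grep -E '12[4-6]'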

Trick to use file paths with spaces in Mallet (Terminal, OS X)?

Is there a trick to be able to use file paths with spaces in Mallet through the Terminal on a Mac?
For example, all of the following give me errors:
escaping the space
./bin/mallet import-dir --input /Volumes/Macintosh\ HD/Users/MY_NAME/Desktop/en --output /Users/MY_NAME/Desktop/en.mallet --remove-stopwords TRUE --keep-sequence TRUE
double quotes, no escapes
./bin/mallet import-dir --input "/Volumes/Macintosh HD/Users/MY_NAME/Desktop/en" --output /Users/MY_NAME/Desktop/en.mallet --remove-stopwords TRUE --keep-sequence TRUE
and, with double quotes
./bin/mallet import-dir --input "/Volumes/Macintosh\ HD/Users/MY_NAME/Desktop/en" --output /Users/MY_NAME/Desktop/en.mallet --remove-stopwords TRUE --keep-sequence TRUE
and finally with single quotes
./bin/mallet import-dir --input '/Volumes/Macintosh\ HD/Users/MY_NAME/Desktop/en' --output /Users/MY_NAME/Desktop/en.mallet --remove-stopwords TRUE --keep-sequence TRUE
They all want to treat the folder as multiple folders, split on the space:
Labels =
/Volumes/Macintosh\
HD/Users/MY_NAME/Desktop/en
Exception in thread "main" java.lang.IllegalArgumentException: /Volumes/Macintosh\ is not a directory.
at cc.mallet.pipe.iterator.FileIterator.<init>(FileIterator.java:108)
at cc.mallet.pipe.iterator.FileIterator.<init>(FileIterator.java:145)
at cc.mallet.classify.tui.Text2Vectors.main(Text2Vectors.java:322)
Is there any way around this, other than renaming all of my files with spaces to use underscores? (I understand that I don't need to type /Volumes/Macintosh\ HD/... but can just start at /Users. This was just an example.)
The issue is that import-dir is designed to take multiple directories as input. The argument parser would need a way to distinguish this use case from the "escaped space" use case, keeping in mind that Windows paths can end in \.
The best way to support both cases might be to add a --single-input option that would take its argument as a single string.
I also find that the spreadsheet-style import-file command is almost always preferable to working with directories.
As a work around you could:
(1) write some code to read the directory contents and generate a single examples file for use with:
bin/mallet import-file
Here's the Mallet quick-start page for importing, which describes the import-file version: http://mallet.cs.umass.edu/import.php
(2) Create a symbolic link to the folder in a location without any spaces in it, as sketched below.
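A minimal sketch of the symlink workaround (the link name ~/mallet_en is arbitrary):
# Create a space-free alias for the directory, then import through it
ln -s "/Volumes/Macintosh HD/Users/MY_NAME/Desktop/en" ~/mallet_en
./bin/mallet import-dir --input ~/mallet_en --output /Users/MY_NAME/Desktop/en.mallet --remove-stopwords TRUE --keep-sequence TRUE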

Find and replace in file with script

I want to find and replace the VALUE in an XML file:
<test name="NAME" value="VALUE"/>
I have to filter by name (because there are lots of lines like that).
Is it possible?
Thanks for your help.
Since you tagged the question "bash", I assume that you're not trying to use an XML library (although an XML expert might be able to give you something like an XSLT processor command that solves this question very robustly), but that you're simply interested in doing search & replace from the command line.
I am using perl for this:
perl -pi -e 's#VALUE#replacement#g' *.xml
See the perlrun man page: very briefly put, the -p switch puts perl into text-processing mode, -i stands for "in-place", and -e lets you specify an expression to apply to all lines of input.
Also note (if you are not already familiar with this) that you may use delimiter characters other than # (common ones are %, a comma, etc.) that don't clash with your search and replacement strings.
There is one small caveat: perl will read & write all files given on the command line, even those that did not change, so the files' modification times will be updated regardless. (I usually work around that with some more shell magic, e.g. using grep -l or grin -l to select files for perl to work on.)
EDIT: If I understand your comments correctly, you also need help with the regular expression to apply. Let me briefly suggest something like this then:
perl -pi -e 's,(name="NAME" value=)"[^"]*",$1"NEWVALUE",g' *.xml
Related: bash XHTML parsing using xpath
You can use sed:
sed 's/\(<test name="NAME"\) value="VALUE"/\1 value="YourValue"/' test.xml
where test.xml is the XML document containing the given node. This is very fragile, and you can work to make it more flexible if you need to do this substitution multiple times. For instance, the current command is case-sensitive, so it won't substitute the value on a node with name="name", but you can add a case-insensitivity flag (a GNU sed extension) to the end of the s command, like so:
sed 's/\(<test name="NAME"\) value="VALUE"/\1 value="YourValue"/I' test.xml
Another option would be to use XSLT, but it would require an external tool. It's pretty versatile, and could be a viable option for more complex modifications to an XML document.
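If installing an external tool is acceptable, one possible choice (not mentioned above, so treat this as a sketch) is xmlstarlet, which updates the attribute via an XPath match on the name:
# Update the value attribute only on <test> elements whose name is "NAME"
xmlstarlet ed --inplace -u '//test[@name="NAME"]/@value' -v 'NEWVALUE' file.xml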
