How may I rename variables in a df using gsub()?

I am trying to rename taxa annotations in an abundance matrix for bubble plot creation (Original data 16S MiSeq). My data frame "data_melt" is shown below:
And I am looking to rename the taxa IDs in the "variable" column to simply the last name (class level). For example: "D_0__Archaea.D_1__Altiarchaeota.D_2__Altiarchaeia" to "Altiarchaeia".
I have attempted
data_melt$variable <- gsub("D_0__[A-z].D_1__[A-z].D_2__", "", data_melt$variable)
to no avail. I have used this line of code on other datasets successfully, but there is no change to "data_melt". There aren't even any warning/error messages. Any ideas?
Thank you in advance,
J

You might fix your approach by replacing [A-z]. with [A-Za-z]+\\.:
data_melt$variable <- sub("D_0__[A-Za-z]+\\.D_1__[A-Za-z]+\\.D_2__", "", data_melt$variable)
The [A-z] matches more than just letters and . matches any char, while you wanted to match a literal dot. When the dot is escaped, it only matches a literal dot.
However, you may solve the problem by removing all up to and including the last underscore:
sub(".*_", "", data_melt$variable)
Note you may use sub as you expect one replacement to be made.
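The regex behaviour described above is not specific to R, so it can be checked in another language. Here is a minimal sketch in Ruby (not R), using the sample label from the question. The variable names are made up for illustration.

```ruby
# Sample label taken from the question.
label = "D_0__Archaea.D_1__Altiarchaeota.D_2__Altiarchaeia"

# [A-z] spans ASCII 65..122, so it also covers the six symbols between Z and a.
extra = (65..122).map(&:chr).reject { |c| c.match?(/[A-Za-z]/) }

# Explicit pattern: letters followed by an escaped (literal) dot for each rank.
explicit = label.sub(/D_0__[A-Za-z]+\.D_1__[A-Za-z]+\.D_2__/, "")

# Greedy shortcut: .* runs up to the last underscore, leaving only the class name.
greedy = label.sub(/.*_/, "")

puts extra.join(" ")
puts explicit
puts greedy
```

Both substitutions leave only the class-level name for this label. The greedy version would also cut a class name that itself contains an underscore.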


InDesign Grep: Changing sentence beginnings to Uppercase

I am relatively new to scripting, and within an InDesign script I am trying to change the first letter of every sentence to uppercase (many of them are lowercase, since I randomly generated the sentences from different text sources).
I am so far able to find the text parts with this Grep expression:
\.(\s)+\l
I also found this script by Peter Kahrel, that he shares on InDesign Secrets:
app.findGrepPreferences.findWhat = "^.";
found = app.activeDocument.findGrep();
for (i = 0; i < found.length; i++)
found[i].characters[0].changecase (ChangecaseMode.lowercase);
However, when I now replace the ^. with my own expression, and change lowercase to uppercase, the script does not work, which makes sense, since I do not want to change the first character of my findGrep results, but the last one. But how can I find the last character? The breaks between the sentences have different lengths, so I cannot simply type 2 instead of 0.
Any help would be very appreciated! Thank you!
Edit: I'm working on CS6.
Your GREP returns matches that start with a period, then have any number of spaces (including hard returns, probably), and always end with one lowercase character. So far, so good. You can access the last character (and in fact any last item in any InDesign object collection) in this way:
found[i].characters[-1].changecase (ChangecaseMode.lowercase);
which 'indexes' from the end, rather than from the start.
However! The only character in your matches, other than the period and spaces, is always going to be a lowercase letter. So you can skip the entire "how to find the correct index" thing, and probably slightly speed up the script as well, by simply applying lowercase (or, as you are using it, uppercase) to the entire match:
found[i].changecase (ChangecaseMode.lowercase);
because nothing will happen to not-lowercaseable characters (a word I declare to signify "having the Unicode-defined property of being lowercase and having an uppercase equivalent"). (Or the other way around, if I understand your purpose correctly.)

Using sed to modify line not containing string

I am trying to write a bash script that uses sed to modify lines in a config file not containing a specific string. To illustrate by example, I could have ...
/some/file/path1 ipAddress1/subnetMask(rw,sync,no_root_squash)
/some/file/path2 ipAddress1/subnetMask(rw,sync,no_root_squash,anonuid=-1)
/some/file/path3 ipAddress2/subnetMask(rw,sync,no_root_squash,anonuid=0)
/some/file/path4 ipAddress2/subnetMask(rw,sync,no_root_squash,anongid=-1)
/some/file/path5 ipAddress2/subnetMask(rw,sync,no_root_squash,anonuid=-1,anongid=-1)
And I want every line's parenthetical list to be changed such that it contains strings anonuid=-1 and anongid=-1 within its parentheses ...
/some/file/path1 ipAddress1/subnetMask(rw,sync,no_root_squash,anonuid=-1,anongid=-1)
/some/file/path2 ipAddress1/subnetMask(rw,sync,no_root_squash,anonuid=-1,anongid=-1)
/some/file/path3 ipAddress2/subnetMask(rw,sync,no_root_squash,anonuid=-1,anongid=-1)
/some/file/path4 ipAddress2/subnetMask(rw,sync,no_root_squash,anongid=-1,anonuid=-1)
/some/file/path5 ipAddress2/subnetMask(rw,sync,no_root_squash,anonuid=-1,anongid=-1)
As can be seen from the example, both anonuid and anongid may already exist within the parentheses, but it is possible that the original parenthetical list has one string but not the other (lines 2, 3, and 4), the list has neither (line 1), the list has both already set properly (line 5), or even one or both of them are set incorrectly (line 3). When either anonuid or anongid is set to a value other than -1, it must be changed to the proper value of -1 (line 3).
What would be the best way to edit my config file using sed such that anonuid=-1 and anongid=-1 is contained in each line's parenthetical list, separated by a comma delimiter of course?
I think this does what you want:
sed -e '/anonuid/{s/anonuid=[-0-9]*/anonuid=-1/;b gid;};s/)$/,anonuid=-1)/;:gid;/anongid/{s/anongid=[-0-9]*/anongid=-1/;b;};s/)$/,anongid=-1)/'
Basically, it has two nearly identical parts with the first dealing with anonuid and the second anongid, each with a bit of logic to decide if it needs to replace or add the appropriate values. (It doesn't bother to check if the value is already correct, that would just complicate things while not changing the results.)
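The replace-or-append logic of that sed command can be sketched in Ruby to make the two branches easier to see. This is only an illustration of the idea, not a replacement for the sed one-liner. The helper name force_anon_options is hypothetical.

```ruby
# Hypothetical helper mirroring the sed logic: for each option, replace an
# existing value, or append the option just before the closing parenthesis.
def force_anon_options(line)
  %w[anonuid anongid].reduce(line) do |current, key|
    if current.include?(key)
      current.sub(/#{key}=[-0-9]*/, "#{key}=-1")
    else
      current.sub(/\)$/, ",#{key}=-1)")
    end
  end
end

# Sample lines taken from the question.
lines = [
  "/some/file/path1 ipAddress1/subnetMask(rw,sync,no_root_squash)",
  "/some/file/path3 ipAddress2/subnetMask(rw,sync,no_root_squash,anonuid=0)",
  "/some/file/path4 ipAddress2/subnetMask(rw,sync,no_root_squash,anongid=-1)"
]
fixed = lines.map { |l| force_anon_options(l) }
puts fixed
```

As with the sed version, an option that is missing is appended at the end, so the order of anonuid and anongid can differ between lines.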
You can use sed to specify the lines you are interested in:
$ sed -n '/anonuid=..*,anongid=..*)$/!p' $file
The above will print (p) all lines that don't match the regular expression between the two slashes; the -n option suppresses sed's default output so that only those lines appear. I negated the expression by using the !. This way, you're not matching lines with both anonuid and anongid in them.
Now, you can work on the non-matching lines and editing those with the sed s command:
$ sed '/anonuid=..*,anongid=..*)$/!s/from/to/' $file
The manipulation might be fairly complex, and you might be passing multiple sed commands to get everything just right.
However, if the string no_root_squash appears in each line you want to change, why not take the simple way out:
$ sed 's/no_root_squash.*$/no_root_squash,anonuid=-1,anongid=-1)/' $file
This is looking for that no_root_squash string, and replacing everything from that string to the end of the line with the text you want. Are there lines you are touching that don't need to be edited? Yes, but you're not really changing those lines. You're basically substituting no_root_squash,anonuid=-1,anongid=-1) with the same no_root_squash,anonuid=-1,anongid=-1).
This may be faster even though it's replacing text that doesn't need replacing because there's less processing going on. Plus, it's easier to understand and support in the future.
Response
Thanks David! Yeah I was considering going that route, but I didn't want to rely 100% on every line containing no_root_squash. My current config file only ends in that string, but I'm just not 100% sure that won't potentially be different in the field. Do you think there would be a way to change that so it just overwrites from the end of the last string not containing anonuid=-1 or anongid=-1 onward?
What can you guarantee will be in each line?
You might be able to do a capture group:
sed 's/\(sync,[^,)]*\).*/\1,anonuid=-1,anongid=-1)/' $file
The \(..\) is a capture group. It basically captures that portion of the matching regular expression, and then allows you to reuse it via the \1. I'm capturing from the word sync to a group of characters not including a comma or a closing parenthesis. Then, I'm appending the capture group, a comma, and your anon uid and gid.
Will that work?
Maybe I am oversimplifying:
sed 's/,anonuid=[-0-9]*//g;s/,anongid=[-0-9]*//g;s/)$/,anonuid=-1,anongid=-1)/' test.txt > test3.txt
This just drops any current instance of anonuid or anongid (together with its leading comma, assuming neither option is ever first in the list) and adds the string ",anonuid=-1,anongid=-1" before the closing parenthesis.

Split string suppressing all null fields

I want to split a string suppressing all null fields
Command:
",1,2,,3,4,,".split(',')
Result:
["", "1", "2", "", "3", "4"]
Expected:
["1", "2", "3", "4"]
How to do this?
Edit
Ok. Just to sum up all the good answers posted.
What I wanted is for the split method (or some other method) not to generate empty strings. It looks like that isn't possible.
So the solution is a two-step process: split the string as usual, and then delete the empty strings from the resulting array.
The second part is exactly this question
(and its duplicate)
So I would use
",1,2,,3,4,,".split(',').delete_if(&:empty?)
The solution proposed by Nikita Rybak and by user229426 is to use the reject method. According to the docs, reject returns a new array, while delete_if modifies the array in place, which is more efficient since I don't need a copy. Using select, as proposed by Mark Byers, is even less efficient.
steenslag proposed to replace commas with space and then use split by space:
",1,2,,3,4,,".gsub(',', ' ').split(' ')
Actually, the documentation says that a space here stands for any whitespace. But results of "split(/\s/)" and "split(' ')" are not the same. Why's that?
Mark Byers proposed another solution: just using regular expressions. This seems to be what I need, although it implies that you have to be a master of regexps. But it is a great solution! For example, if I need spaces to be separators as well as any non-alphanumeric symbol, I can rewrite this to
",1,2, ,3 3,4 4 4,,".scan(/\w+[\s*\w*]*/)
the result is:
["1", "2", "3 3", "4 4 4"]
But again, regexps are very unintuitive and take experience.
Summary
I expect that split to work with whitespaces as if whitespaces were a comma or even regexp. I expect it to do not produce empty strings. I think this is a bug in ruby or my misunderstanding.
Made it a community question.
There's a reject method in Array:
",1,2,,3,4,,".split(',').reject { |s| s.empty? }
Or if you prefer Symbol#to_proc:
",1,2,,3,4,,".split(',').reject(&:empty?)
Hoping to illuminate a bit here:
But results of "split(/\s/)" and "split(' ')" are not the same. Why's that?
If you look at the docs for String#split you'll see that split with ' ' is a special case:
If pattern is a single space, str is split on whitespace,
with leading whitespace and runs of contiguous whitespace characters ignored.
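A quick way to see that special case is to split the same string both ways. The sample string below is made up for illustration.

```ruby
s = " 1  2 3"

# A single-space pattern is the documented special case: leading whitespace
# and runs of whitespace are ignored.
by_space = s.split(' ')

# A regexp splits at every single whitespace character, so empty fields appear.
by_regex = s.split(/\s/)

p by_space
p by_regex
```

The regexp version keeps an empty string for the leading space and another for the gap between the two consecutive spaces.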
You also mention:
I expect it to do not produce empty strings. I think this is a bug in ruby or my misunderstanding.
The problem probably lies between the keyboard and the chair. ;-)
split will happily produce empty strings as it should, because there are times when you would definitely want this ability, and there are plenty of easy ways to work around it. Consider if you were splitting a csv from an Excel file. Anywhere you see ',,' would be an empty column, not a column you should just get rid of.
Regardless, you've seen a bunch of solutions - and here's another one that might show you the things you can do with ruby and split!
It seems you want to split up data between multiple commas, so why not try that and see what happens?
a = ",1,2,,3,4,,5,,,,6,,,".split(/,+/)
It's a simple enough regular expression: /,+/ means one or more commas, so we'll split on that.
This almost gives you what you want, except that you also want to ignore the leading empty field. You'll note that split ignores the empty field on the end because (from the String#split docs):
If the limit parameter is omitted, trailing null fields are suppressed.
So that means we can either use something that will remove that empty string at the front of the array or just remove the initial commas. We can use gsub for that:
a = ",1,2,,3,4,,5,,,,6,,,".gsub(/^,+/,'')
If you print that out you'll see that our leading empty "field" is now gone. So we can combine them all in one line:
a = ",1,2,,3,4,,5,,,,6,,,".gsub(/^,+/,'').split(/,+/)
And you have another solution!
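To see the steps side by side, here is a small sketch that runs the same string through each variant. Passing a negative limit to split is the documented way to keep trailing empty fields, which makes the default suppression visible. The variable names are only for illustration.

```ruby
s = ",1,2,,3,4,,5,,,,6,,,"

# Default: the trailing empty field is dropped, the leading one is kept.
default_split = s.split(/,+/)

# A negative limit keeps the trailing empty field as well.
keep_all = s.split(/,+/, -1)

# Strip the leading commas first, then split.
cleaned = s.gsub(/^,+/, "").split(/,+/)

p default_split
p keep_all
p cleaned
```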
And incidentally, this points out another possibility, that we can just cleanup our string entirely before sending it to split if we want a simple split. I'll leave it to you to figure out what this one is doing:
a = ",1,2,,3,4,,5,,,,6,,,".gsub(/,+/,',').gsub(/^,/,'').split(',')
There are lots of ways to do things in ruby. If it seems that ruby isn't doing what you want, then take a look at the docs and realize that it probably works the way that it does for a reason (there are plenty of people who would be upset if split wasn't able to spit out empty fields :)
Hope that helps!
You could use split followed by select:
",1,2,,3,4,,".split(',').select{|x|!x.empty?}
Or you could use a regular expression to match what you want to keep instead of splitting on the delimiter:
",1,2,,3,4,,".scan(/[^,]+/)
",1,2,,3,4,,".split(/,/).reject(&:empty?)
",1,2,,3,,,4,,".squeeze(",").sub(/^,*|,*$/,"").split(",")
String#split(pattern) behaves as desired when pattern is a single space (ruby-doc).
",1,2,,3,4,,".gsub(',', ' ').split(' ')

Inserting characters before whatever is on a line, for many lines

I have been looking at regular expressions to try and do this, but the most I can do is find the start of a line with ^, but not replace it.
I can then find the first characters on a line to replace, but can not do it in such a way with keeping it intact.
Unfortunately I don't have access to a tool like cut since I am on a Windows machine... so is there any way to do what I want with just regexp?
Use Notepad++. It offers a way to record a sequence of actions which can then be repeated for all lines in the file.
Did you try replacing the regular expression ^ with the text you want to put at the start of each line? Also you should use the multiline option (also called m in some regex dialects) if you want ^ to match the start of every line in your input rather than just the first.
string s = "test test\ntest2 test2";
s = Regex.Replace(s, "^", "foo", RegexOptions.Multiline);
Console.WriteLine(s);
Result:
footest test
footest2 test2
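For comparison, the same replacement can be sketched in Ruby, where ^ already anchors at the start of every line, so no multiline option is needed for this purpose. The sample string is the one from the example above.

```ruby
s = "test test\ntest2 test2"

# In Ruby regexps, ^ matches at the start of each line, not only at the
# start of the string, so gsub inserts the prefix on every line.
prefixed = s.gsub(/^/, "foo")

puts prefixed
```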
I used to program on the mainframe and got used to SPF panels. I was thrilled to find a Windows version of the same editor at Command Technology. Makes problems like this drop-dead simple. You can use expressions to exclude or include lines, then apply transforms on just the excluded or included lines and do so inside of column boundaries. You can even take the contents of one set of lines and overlay the contents of another set of lines entirely or within column boundaries which makes it very easy to generate mass assignments of values to variables and similar tasks. I use Notepad++ for most stuff but keep a copy of SPFSE around for special-purpose editing like this. It's not cheap but once you figure out how to use it, it pays for itself in time saved.

Least used delimiter character in normal text < ASCII 128

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
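A minimal sketch of that approach in Ruby, assuming the text items themselves never contain the 0x1F character. The constant name and the sample fields are made up for illustration.

```ruby
# ASCII 31, the Unit Separator control character.
US = "\x1F"

fields = ["hello, world", "a|b", "tab\there"]

# Join the items into one string, then split them back out.
# The -1 limit keeps empty trailing items instead of dropping them.
packed   = fields.join(US)
unpacked = packed.split(US, -1)

p unpacked
```

Commas, pipes and tabs in the items survive untouched because none of them is the delimiter.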
Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.
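The character count suggested here takes only a few lines. Below is a rough sketch in Ruby. The sample string is invented and should be replaced with a representative slice of the real data.

```ruby
# Hypothetical sample; substitute real data here.
sample = "Name|Qty\tPrice: $4.50 ~ approx. (see #12) a^2 + b^2"

# Count how often each ASCII byte value (0-127) occurs.
counts = Hash.new(0)
sample.each_byte { |b| counts[b] += 1 if b < 128 }

# Printable ASCII characters (33..126) that never occur are delimiter candidates.
unused = (33..126).reject { |b| counts.key?(b) }.map(&:chr)

puts unused.join(" ")
```

With this particular sample, the pipe, caret and tilde are all ruled out, which illustrates why checking the actual data is safer than guessing.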
The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
When using different languages, this symbol: ¬
proved to be the best. However I'm still testing.
Probably | or ^ or ~ you could also combine two characters
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV style format? Characters can be escaped in a standard CSV format, and there's already a lot of parsers already written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this:
say you want to concatenate str1, str2 and str3
what I do is:
delimitedStr=str1.Replace("#","#a").Replace("|","#p")+"|"+str2.Replace("#","#a").Replace("|","#p")+"|"+str3.Replace("#","#a").Replace("|","#p");
then to retrieve original use:
splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("#p","|").Replace("#a","#");
str2=splitStr[1].Replace("#p","|").Replace("#a","#");
str3=splitStr[2].Replace("#p","|").Replace("#a","#");
note: the order of the replace is important
it's unbreakable and easy to implement
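The same escaping scheme can be sketched in Ruby for any number of items. The helper names below are hypothetical. The replacement order follows the snippet above: escape "#" before "|", and unescape "#p" before "#a".

```ruby
# Escape the escape character first, then the delimiter.
def escape_field(s)
  s.gsub("#", "#a").gsub("|", "#p")
end

# Reverse order on the way back: delimiter first, then the escape character.
def unescape_field(s)
  s.gsub("#p", "|").gsub("#a", "#")
end

def pack_fields(fields)
  fields.map { |f| escape_field(f) }.join("|")
end

def unpack_fields(packed)
  # The -1 limit keeps empty trailing fields.
  packed.split("|", -1).map { |f| unescape_field(f) }
end

fields   = ["a|b", "c#d", "#p", ""]
packed   = pack_fields(fields)
restored = unpack_fields(packed)

puts packed
p restored
```

After escaping, every "#" is followed by either "a" or "p", which is why the two unescape steps cannot interfere with each other.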
Pipe for the win! |
We use ascii 0x7f which is pseudo-printable and hardly ever comes up in regular usage.
Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.
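A rough sketch of the Base64 idea in Ruby, assuming UTF-8 text items. It uses the core pack/unpack1 "m0" directive (strict Base64 without line breaks) so that no extra library is needed. The comma is used as the delimiter because it is not part of the Base64 alphabet.

```ruby
items = ["plain text", "with, comma", "pipe | and # hash"]

# Encode each item as strict Base64, then join with a comma.
packed = items.map { |s| [s].pack("m0") }.join(",")

# Split on the comma and decode. unpack1 returns a binary string, so the
# encoding is set back to UTF-8 (an assumption about the input).
decoded = packed.split(",", -1).map { |e| e.unpack1("m0").force_encoding("UTF-8") }

puts packed
p decoded
```

The cost is that the stored string is no longer human-readable and grows by roughly a third.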
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non-printable char works if you're not hand-creating or modifying the file. For quick random-access file storage and retrieval, field width is used. You don't even have to read the file; you're literally pulling from the file by reference. This is how databases do some storage, but they also manage the spaces between records and such. And it introduced the problem of max data element width. (In the original old days, indexes attached a header that defined the width of each element and its data type. Later they introduced compression with remapping chars, which allows a text file to get to about 1/8 the size in transmission.) Variable-length char encoding for the win.
make it dynamic : )
announce your control characters in the file header
for example
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
i have implemented something similar:
a plaintar text container format,
to escape and wrap utf16 text in ascii,
as an alternative to mime multipart messages.
see https://github.com/milahu/live-diff-html-editor
