Sscanf forgets to forget the comma in expression - format

I am using sscanf, trying to read in an expression up to a comma.
sscanf(some_string,
"%s %[ .0-9a-zA-Z!#:/|-_^,],read_other_stuff:%s....]",
&string1, &string2... etc);
sscanf correctly reads everything up to %[ .0-9a-zA-Z!#:/|-_^,]. That piece of the format eats the rest of the string instead of stopping at a comma, as I expected it to.
How would one make it stop at a comma while still reading everything else (including spaces, punctuation other than a comma, etc.)?

To parse a string up to a comma, code could use strchr(some_string, ',').
To use sscanf(), use 2 calls:
int n = 0;
sscanf(some_string, "%*[^,]%n", &n);
// some_string[n] is either \0 or ,
char string1[100];
string1[0] = '\0';
sscanf(&some_string[n], ",%99s", string1);
puts(string1);
I recommend #1, the strchr() approach.

how to document a single space character within a string in reST/Sphinx?

I've gotten lost in an edge case of sorts. I'm working on a conversion of some old plaintext documentation to reST/Sphinx format, with the intent of outputting to a few formats (including HTML and text) from there. Some of the documented functions are for dealing with bitstrings, and a common case within these is a sentence like the following: Starting character is the blank " " which has the value 0.
I tried writing this as an inline literal the following ways: Starting character is the blank `` `` which has the value 0. or Starting character is the blank :literal:` ` which has the value 0. but there are a few problems with how these end up working:
1. reST syntax objects to whitespace immediately inside the literal, and the literal doesn't get recognized.
2. The above can be "fixed"--it looks correct in the HTML () and plaintext (" ") output--with a non-breaking space character inside the literal, but technically this is a lie in our case, and if a user copied this character, they wouldn't be copying what they expect.
3. The space can be wrapped in regular quotes, which allows the literal to be properly recognized, and while the output in HTML is probably fine (" "), in plaintext it ends up double-quoted as "" "".
4. In both 2/3 above, if the literal falls on the wrap boundary, the plaintext writer (which uses textwrap) will gladly wrap inside the literal and trim the space because it's at the start/end of the line.
I feel like I'm missing something; is there a good way to handle this?
Try using the unicode character codes. If I understand your question, this should work.
Here is a "|space|" and a non-breaking space (|nbspc|)
.. |space| unicode:: U+0020 .. space
.. |nbspc| unicode:: U+00A0 .. non-breaking space
You should see:
Here is a “ ” and a non-breaking space ( )
I was hoping to get out of this without needing custom code to handle it, but, alas, I haven't found a way to do so. I'll wait a few more days before I accept this answer in case someone has a better idea. The code below isn't complete, nor am I sure it's "done" (will sort out exactly what it should look like during our review process) but the basics are intact.
There are two main components to the approach:
introduce a char role which expects the unicode name of a character as its argument, and which produces an inline description of the character while wrapping the character itself in an inline literal node.
modify the text-wrapper Sphinx uses so that it won't break at the space.
Here's the code:
import re
import unicodedata
from docutils import nodes
# Assumption: TextWrapper is the wrapper class used by Sphinx's text writer.
from sphinx.writers.text import TextWrapper
class TextWrapperDeux(TextWrapper):
    _wordsep_re = re.compile(
        r'((?<!`)\s+(?!`)|'                      # whitespace not between backticks
        r'(?<=\s)(?::[a-z-]+:)`\S+|'             # interpreted text start
        r'[^\s\w]*\w+[a-zA-Z]-(?=\w+[a-zA-Z])|'  # hyphenated words
        r'(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))')  # em-dash
    @property
    def wordsep_re(self):
        return self._wordsep_re
def char_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
    """Describe a character given by unicode name.
    e.g., :char:`SPACE` -> "char:` `(U+00020 SPACE)"
    """
    try:
        character = unicodedata.lookup(text)
    except KeyError:
        msg = inliner.reporter.error(
            ':char: argument %s must be valid unicode name at line %d' % (text, lineno))
        prb = inliner.problematic(rawtext, rawtext, msg)
        return [prb], [msg]
    app = inliner.document.settings.env.app
    describe_char = "(U+%05X %s)" % (ord(character), text)
    char = nodes.inline("char:", "char:", nodes.literal(character, character))
    char += nodes.inline(describe_char, describe_char)
    return [char], []
def setup(app):
    app.add_role('char', char_role)
The code above still lacks the glue that actually forces the use of the new TextWrapper. When a full version settles out I may try to find a meaningful way to republish it; if so I'll link it here.
Markup: Starting character is the :char:`SPACE` which has the value 0.
It'll produce plaintext output like this: Starting character is the char:` `(U+00020 SPACE) which has the value 0.
And HTML output like: Starting character is the <span>char:<code class="docutils literal"> </code><span>(U+00020 SPACE)</span></span> which has the value 0.
The HTML output ends up looking roughly like: Starting character is the char:(U+00020 SPACE) which has the value 0.
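The lookup-and-describe step of the role can be tried on its own with just the standard library. This is a minimal sketch, not part of the Sphinx extension itself; the helper name describe_char is hypothetical and only mirrors the formatting used in the role above.

```python
import unicodedata


def describe_char(name):
    """Return (character, description) for a Unicode character name.

    describe_char is a hypothetical helper used only for illustration.
    """
    character = unicodedata.lookup(name)  # raises KeyError for unknown names
    description = "(U+%05X %s)" % (ord(character), name)
    return character, description


print(describe_char("SPACE"))
```

An unknown name raises KeyError, which is the case the role turns into a reporter error.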

removing all spaces within a specific string (email address) using ruby

The user is able to input text, but the way I ingest the data it often contains unnecessary carriage returns and spaces.
To remove those to make the input look more like a real sentence, I use the following:
string.delete!("\n")
string = string.squeeze(" ").gsub(/([.?!]) */,'\1 ')
But in the case of the following, I get an unintended space in the email:
string = "Hey what is \n\n\n up joeblow@dude.com \n okay"
I get the following:
"Hey what is up joeblow@dude. com okay"
How can I make an exception for the email part of the string so I get the following:
"Hey what is up joeblow@dude.com okay"
Edited
Your method does the following:
string.squeeze(" ") # replaces each sequence of " " with one space
gsub(/([.?!]) */, '\1 ') # matches each char between the brackets [.?!]
# followed by zero or more spaces, and replaces
# the match with that char plus one space; this is
# why the email address gets split
I guess what you really want is for punctuation marks to be followed by exactly one space, without touching punctuation inside a word such as an email address. You can do this instead:
string.gsub(/([.?!])\W/, '\1 ') # if there is a non word char after
# those punctuation chars, just add a space
Then you just need to replace every sequence of whitespace chars with one space, so the final solution is:
string.gsub(/([.?!])(?=\W)/, '\1 ').gsub(/\s+/, ' ')
# ([.?!]) => matches ., ? or ! and captures it
# (?=\W) => lookahead: requires a non-word char next but does not consume it
# so /([.?!])(?=\W)/ finds the punctuation in the parentheses when it is
# followed by a non-word char (a space, a newline, or even
# punctuation, for example).
# '\1 ' => \1 is the captured group (i.e. the string that matched
# ([.?!]), which is a single char in this case), so it adds
# a space after the matched group.
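The same two-step idea carries over to other regex engines. Here is a hedged Python sketch of it, assuming the goal is only to normalise whitespace while leaving a dot inside an address alone; the function name tidy is made up for the example.

```python
import re


def tidy(text):
    """Collapse whitespace; keep punctuation inside words (e.g. emails) intact.

    tidy is a hypothetical name used only for this illustration.
    """
    # Add a space after ., ? or ! only when the next char is a non-word char.
    spaced = re.sub(r'([.?!])(?=\W)', r'\1 ', text)
    # Collapse every run of whitespace (spaces, newlines) into one space.
    return re.sub(r'\s+', ' ', spaced)


print(tidy("Hey what is \n\n\n up joeblow@dude.com \n okay"))
```

Because the dot in the address is followed by a word character, the lookahead fails there and no space is inserted.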
If you are okay with getting rid of the squeeze statement, then using Nafaa's answer is the simplest way to do it, but I've listed an alternate method in case it's helpful:
string = string.split(" ").join(" ")
However, if you want to keep that squeeze statement you can amend Nafaa's method and use it after the squeeze statement:
string.gsub(/\s+/, ' ').gsub('. com', '.com')
or just directly change the string:
string.gsub('. com', '.com')

Does csplit on OS X not recognise '$' as end-of-line character?

(I'm using Mac OS X, and this question might be specific to that variant of Unix)
I'm trying to split a file using csplit with a regular expression. It consists of various articles merged into one single long text file. Each article ends with "All Rights Reserved". This is at the end of the line: grep Reserved$ finds them all. Only, csplit claims there is no match.
csplit filename /Reserved$/
yields
csplit: Reserved$: no match
which is a clear and obvious lie. If I leave out the $, it works; but I want to be sure that I don't get any stray occurrences of 'Reserved' in the middle of the text. I tried a different word with the beginning-of-line character ^, and that seems to work. Other words (which do occur at the end of a line in the data) also do not match when used with $ (e.g. and$).
Is this a known bug with OS X?
[Update: I made sure it's not a DOS/Unix line end character issue by removing all carriage return characters]
I have downloaded the source code of csplit from http://www.opensource.apple.com/source/text_cmds/text_cmds-84/csplit/csplit.c and tested this in the debugger.
The pattern is compiled with
if (regcomp(&cre, re, REG_BASIC|REG_NOSUB) != 0)
    errx(1, "%s: bad regular expression", re);
and the lines are matched with
/* Read and output lines until we get a match. */
first = 1;
while ((p = csplit_getline()) != NULL) {
    if (fputs(p, ofp) == EOF)
        break;
    if (!first && regexec(&cre, p, 0, NULL, 0) == 0)
        break;
    first = 0;
}
The problem is that the lines returned by csplit_getline() still have a trailing newline character \n. Therefore "Reserved" is not at the very end of the string, and the pattern "Reserved$" does not match.
After a quick-and-dirty insertion of
p[strlen(p)-1] = 0;
to remove the trailing newline from the input string the "Reserved$" pattern worked as expected.
There seem to be more problems with csplit in Mac OS X; see the comments on the answer to "Looking for correct Regular Expression for csplit" (the repetition count {*} does not work either).
Remark: You can match "Reserved" at the end of the line with the following trick:
csplit filename /Reserved<Ctrl-V><Ctrl-J>/
where you actually use the Control keys to enter a newline character on the command line.
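The effect of the leftover newline can be reproduced outside csplit. The following Python sketch is only an analogy: Python's \Z anchor (match only at the very end of the string) stands in for the way $ behaves in the regexec() call above, and the sample line is made up.

```python
import re

# A line as a line reader would return it, newline still attached.
line = "All Rights Reserved\n"

# \Z matches only at the very end of the string, so the newline blocks it.
with_newline = re.search(r"Reserved\Z", line)

# Stripping the trailing newline first lets the same pattern match.
without_newline = re.search(r"Reserved\Z", line.rstrip("\n"))

print(with_newline is None, without_newline is not None)
```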

Split a string by 2 various number of chars, skipping non alphanumerics

I have a string like:
hn$8m3kj4.23hs#8;
I need to split it as follows: the first entry should be one char long, the second two chars, the third one char, the fourth two chars, and so on.
Then join each one-char entry with the following two-char entry using a colon (:).
If some chars at the end remain unpaired, they should be displayed as well.
It is important to skip all non-alphanumeric chars.
So the final string should be:
h:n8 m:3k j:42 3:hs 8:
See, 8 has no two-char partner but it is displayed anyway.
I have tried a loop, but the code gets huge.
I also tried regexes, but they split by the wrong number of chars.
you can try this:
s = "hn$8m3kj4.23hs#8;"
s.gsub(/\W/, '').scan(/(.)(..)?/).map { |i| i.join ':' }.join ' '
=> "h:n8 m:3k j:42 3:hs 8:"
this will not skip underscores though.
if you need to skip them as well, use this one:
s = "hn$8m3k_j4.23hs#8;_"
s.gsub(/\W|_/, '').scan(/(.)(..)?/).map { |i| i.join ':' }.join ' '
=> "h:n8 m:3k j:42 3:hs 8:"
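For comparison, the same strip-then-scan approach can be sketched in Python. This is an illustrative port under the assumption that "alphanumeric" means ASCII letters and digits; the function name pair_up is hypothetical.

```python
import re


def pair_up(text):
    """Group alphanumerics as 1 char + up to 2 chars, joined by ':'.

    pair_up is a hypothetical name; ASCII-only alphanumerics are assumed.
    """
    # Drop everything that is not an ASCII letter or digit (underscores too).
    cleaned = re.sub(r'[^A-Za-z0-9]', '', text)
    # Each match is one char plus an optional pair; a missing pair comes
    # back from findall as an empty string.
    groups = re.findall(r'(.)(..)?', cleaned)
    return ' '.join(first + ':' + rest for first, rest in groups)


print(pair_up("hn$8m3kj4.23hs#8;"))
```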

How to remove a ^M character java

Problem:
If String ends with \r, remove \r
I started with something like this
if (masterValue.endsWith(CARRIAGE_RETURN_STR)) {
    masterValue = masterValue.replace(CARRIAGE_RETURN_STR, "");
}
where
public static final String CARRIAGE_RETURN_STR = (Character.toString(Constants.CARRIAGE_RETURN));
public static final char CARRIAGE_RETURN = '\r';
This seems awkward to me.
Is there an easy way to just remove \r character?
I then moved on to this:
if (value.contains(CARRIAGE_RETURN_STR)) {
    value = value.substring(0, value.length()-3);
    //-3 because we start with 0 (1), line ends with \n (2) and we need to remove 1 char (3)
}
But this too seems awkward.
Can you suggest an easier, more elegant solution?
Regexes can support end-of-string anchoring, you know. (See this Javadoc page for more information)
myString = myString.replaceAll("\\r$", ""); // Strings are immutable, so keep the result
This also takes care of fixing \r\n --> \n, I believe.
I'd write it like this:
if (masterValue.endsWith("\r")) {
    masterValue = masterValue.substring(0, masterValue.length() - 1);
}
I see no point in creating a named constant for the String "\r".
By the way, your second attempt is incorrect because:
String.contains("\r") tells you if the String contains a carriage return, not if it ends with a carriage return,
the second argument of String.substring(int, int) is the index of the end character, i.e. the position of the first character that should NOT be in the substring, and
the length of "\r" is one.
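The difference between "ends with" and "contains" is easy to check with a few inputs. The sketch below uses Python rather than Java purely for illustration, and the function name strip_trailing_cr is hypothetical; it mirrors the endsWith/substring logic above.

```python
def strip_trailing_cr(value):
    """Remove one trailing carriage return, if present.

    strip_trailing_cr is a hypothetical name used only for illustration.
    """
    # Only the final character is dropped; earlier \r characters are kept.
    return value[:-1] if value.endswith("\r") else value


print(repr(strip_trailing_cr("a\rb\r")))
# Replacing every \r, as in the first attempt, also removes the inner one.
print(repr("a\rb\r".replace("\r", "")))
```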
