UTF-8 in subdomain?

Is it possible to use UTF-8 in a subdomain? If so, which characters are allowed and how does the can't-mix-encodings thing work?
I've tried to RTFM, but Google wasn't of much help.

There aren't many things special about subdomains. A given domain name foo.example.com is an ordered list of labels (foo, example, com). So you might want to know if you can use UTF-8 in a given label.
The low level answer is that a label is defined as:
<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
<let-dig> ::= <letter> | <digit>
<letter> ::= any one of the 52 alphabetic characters A through Z in upper case and a through z in lower case
<digit> ::= any one of the ten digits 0 through 9
<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
<let-dig-hyp> ::= <let-dig> | "-"
which means that you can only find [-a-zA-Z0-9] in a label.
However, IDNA can be used to encode Unicode characters. In short, a label containing other characters is encoded with: "xn--" + punycode(nameprep(label)).
As for limitations, at least four characters cannot appear in an IDN label, because IDNA treats them as label separators (dots): U+002E, U+3002, U+FF0E, U+FF61.
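As a rough illustration of both points, here is a minimal Python sketch. It assumes Python's built-in "idna" codec, which implements the older IDNA 2003 rules (nameprep plus punycode), so results for some characters may differ from newer IDNA 2008 tooling. The helper names is_ldh_label and to_ascii_label are made up for this example.

```python
import re

# LDH grammar quoted above: a letter first, a letter or digit last,
# and only letters, digits, or hyphens in between.
LDH_LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if the label fits the <label> grammar."""
    return LDH_LABEL.fullmatch(label) is not None


def to_ascii_label(label: str) -> str:
    """Encode one label with the stdlib IDNA 2003 codec (an assumption for this demo)."""
    return label.encode("idna").decode("ascii")


if __name__ == "__main__":
    for label in ("foo", "bücher", "-foo"):
        print(label, is_ldh_label(label))
    print(to_ascii_label("bücher"))  # xn--bcher-kva
```

The encoded form is itself a plain letters-digits-hyphen label, which is why it can be stored in DNS unchanged.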

Related

Bash ranges of non digit or letters

I played around with bash {..} constructs today. I knew
{a..z}
would generate all letters,
{0..9}
digits, etc. (numbers in general, obviously), but by mistake I got
{Z..a}
yielding:
Z [ ] ^ _ ` a
The characters in between "Z" (90) and "a" (97) are ASCII 91-96. The astute reader will notice there is a character missing: "\", 92. I'm guessing that is because of its special nature. Is this expected behavior as output? Specifically, I'm guessing the \ is being used to escape the space in front of it after substitution, but @John1024 notes that:
echo {Z..a}a
will complain about a missing backtick, while the previous version (no a) does not. How exactly is substitution working? Is there a bug?
Second, I guessed the range operator is cooler than I thought and can do any range of ASCII characters I choose, but {[.._} for example fails. Am I missing something to make this work or is this just a curiosity? Are there any more ranges besides letters/digits I can use? and if not, why not do nothing (fail, echo as is) for 'jumping' from caps to lower?
The \ is being generated; however, it subsequently appears to be treated as escaping the following space. Compare:
$ printf '%s\n' 'Z' '[' ']' '^' '_' '`' 'a'
Z
[
]
^
_
`
a
$ printf '%s\n' {Z..a}
Z
[
 
]
^
_
`
a
The extra blank line following the [ is the space escaped by the backslash generated by {Z..a}.
The bc special variable obase can be used to print almost any character range:
for n in {91..95}; do printf "\x$(echo "obase=16; $n" | bc)"; done
Result:
[\]^_
See https://www.gnu.org/software/bc/manual/html_mono/bc.html#TOC6
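For comparison, the same range can be listed outside the shell, where no character is treated as an escape. A small Python sketch (the function name ascii_range is hypothetical) that walks the code points from Z (90) to a (97):

```python
def ascii_range(first: str, last: str) -> list[str]:
    """Return every character from first to last, inclusive, by code point."""
    return [chr(code) for code in range(ord(first), ord(last) + 1)]


if __name__ == "__main__":
    # Includes the backslash (92) that appears to be missing from the bash output.
    print(ascii_range("Z", "a"))
```

The list contains eight characters, which matches the eight words brace expansion generates before the shell interprets the backslash.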

grok not reading a word with hyphen

This is my log line:
2017-09-25 08:58:17,861 p=14774 u=ec2-user | 14774 1506329897.86160: checking for any_errors_fatal
I'm trying to read the user, but grok returns only ec2 instead of the full word ec2-user.
Sorry, I'm new to the grok filter.
My current pattern:
%{TIMESTAMP_ISO8601:timestamp} p=%{WORD:process_id} u=%{WORD:user_id}
Current output :
...
...
...
"process_id": [
[
"14774"
]
],
"user_id": [
[
"ec2"
]
]
}
WORD is defined as "\b\w+\b"
See https://github.com/logstash-plugins/logstash-patterns-core/blob/master/patterns/grok-patterns
\b is a word boundary
\w matches a single alphanumeric character (an alphabetic character or a decimal digit) or "_"
+ means one or more of the preceding element, so \w+ means one or more word characters
Note that \w does NOT match -
So to make it work, use the following instead of WORD:
(?<user_id>\b[\w\-]+\b)
This does not use the predefined grok patterns but a "raw" regexp
(?<name>...) is used instead of %{...} because it is a "raw" regexp with a named capture
\- means a literal - sign
[ ] means a character class, so [\w\-] will match everything \w does and - as well
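The difference can be checked with any regex engine. Below is a small Python sketch using the log line from the question. Note that Python spells named groups (?P<name>...) rather than the (?<name>...) form grok uses, so this only illustrates the character-class behaviour, not grok itself.

```python
import re

LINE = ("2017-09-25 08:58:17,861 p=14774 u=ec2-user | "
        "14774 1506329897.86160: checking for any_errors_fatal")

# Equivalent of WORD: \w+ stops at the hyphen.
word_only = re.search(r"u=(?P<user_id>\b\w+\b)", LINE)

# Character class that also accepts a literal hyphen.
with_hyphen = re.search(r"u=(?P<user_id>\b[\w\-]+\b)", LINE)

if __name__ == "__main__":
    print(word_only.group("user_id"))    # ec2
    print(with_hyphen.group("user_id"))  # ec2-user
```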
Input: Allow1-2 : Success
Grok filter: (?:%{GREEDYDATA:Output}?|-)
Result:
{"Output":[["Allow1-2 : Success"]]}

Shell Bash Replace or remove part of a number or string

Good day.
Every day I receive a list of numbers like the example below:
11986542586
34988745236
2274563215
4532146587
11987455478
3652147859
As you can see, some of them have a 9 as the third digit (11 digits total) and some don't (10 digits total). That's because the ones with the extra 9 are in the new Brazilian mobile number format and the ones without it are in the old format.
The thing is that I have to use the numbers in both formats as a parameter for another script, and I usually have to do this by hand.
I am trying to create a script that reads the length of a mobile number, adds or removes the "9" if the digit-count condition is met, and saves the result in a separate file.
So far I am only able to check the length, but I don't know how to add or remove the "9" in the third position.
#!/bin/bash
Numbers_file="/FILES/dir/dir2/Numbers_File.txt"
# Read one number per line; -r keeps backslashes literal
while IFS= read -r Numbers
do
    LEN=${#Numbers}
    if [ "$LEN" -eq 11 ]; then
        echo "length = $LEN"
    elif [ "$LEN" -eq 10 ]; then
        echo "length = $LEN"
    else
        echo "error"
    fi
done < "$Numbers_file"
You can delete the third character of any string with sed as follows:
sed 's/.//3'
Example:
echo "11986542586" | sed 's/.//3'
1186542586
To add a 9 as the third character (that is, right after the second character):
echo "2274563215" | sed 's/./&9/2'
22974563215
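Putting the two operations together with the length check from the question, a minimal Python sketch could look like this. It assumes the 9 belongs immediately after the two-digit area code (as in the 11-digit numbers in the question); the function name toggle_ninth_digit is made up for illustration.

```python
def toggle_ninth_digit(number: str) -> str:
    """Convert between the 10-digit (old) and 11-digit (new) formats."""
    number = number.strip()
    if len(number) == 11 and number[2] == "9":
        # New format: drop the 9 that follows the two-digit area code.
        return number[:2] + number[3:]
    if len(number) == 10:
        # Old format: insert a 9 after the two-digit area code.
        return number[:2] + "9" + number[2:]
    raise ValueError(f"unexpected number format: {number!r}")


if __name__ == "__main__":
    for n in ("11986542586", "2274563215"):
        print(n, "->", toggle_ninth_digit(n))
```

Raising an error for any other length mirrors the "error" branch of the original script, so malformed lines are not silently rewritten.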
If you are absolutely sure that the 9 only ever occurs at the third position, you can use an awk statement as below:
awk 'substr($0,3,1)=="9"{$0=substr($0,1,2)substr($0,4,length($0))}1' file
1186542586
3488745236
2274563215
4532146587
1187455478
3652147859
Using the POSIX-compliant substr() function, this processes only the lines that have a 9 at the third position and rebuilds the record without that digit.
substr(s, m[, n ])
Return the at most n-character substring of s that begins at position m, numbering from 1. If n is omitted, or if n specifies more characters than are left in the string, the length of the substring shall be limited by the length of the string s.
There are lots of text manipulation tools that will do this, but the lightest-weight is probably cut, because selecting characters by position is all it does.
cut -c4 would give you just the 4th character; GNU cut also has an invert option, so adding --complement gives you everything but character 4.
echo 1234567890 | cut -c4 --complement
123567890

Cannot insert white spaces in string in the examples table

Feature:player
#all
Scenario Outline:Where is the player
Given I navigate to Google
When I enter < player> in the search field
Then the text < keyword1> should be present
#current #football
Examples:
| player | keyword1 |
| Rooney | Manchester |
| Gerrard | Liverpool |
| Terry | Chelsea |
#old #football
Examples:
| player | keyword1 |
| Eric Cantona | Manchester |
If I write Cantona instead of Eric Cantona, it works, but as soon as I run the program with white space in a string, it gives an error.
Try putting quotes around the Scenario Outline placeholders (and removing the leading space from the placeholder). For example:
Scenario Outline: Where is the player
Given I navigate to Google
When I enter "<player>" in the search field
Then the text "<keyword1>" should be present
The problem is that your step definition is only looking for a single word:
When /^I enter (\w+) in the search field$/ do | player |
Cucumber uses regular expressions to match steps to their definitions, and to capture the variables. Your step definition is looking for "I enter ", followed by a single word, followed by " in the search field".
You could change the "player" regex from (\w+) to something like ([\w\s]+). That would match words and white space and should match the multi word example.
When /^I enter ([\w\s]+) in the search field$/ do | player |
Alternatively if you surround your variables with quotes (as suggested by orde) then Cucumber should generate step definitions which match anything inside the quotes using a (.*) group.
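To see why the single-word pattern fails, the three capture groups can be compared against the expanded step text. This is a plain Python regex sketch, not Cucumber code; the step strings are what the Scenario Outline would produce for the Eric Cantona row.

```python
import re

single_word = re.compile(r"^I enter (\w+) in the search field$")
words_and_spaces = re.compile(r"^I enter ([\w\s]+) in the search field$")
quoted = re.compile(r'^I enter "(.*)" in the search field$')

step = "I enter Eric Cantona in the search field"
quoted_step = 'I enter "Eric Cantona" in the search field'

if __name__ == "__main__":
    print(single_word.match(step))                # None: \w+ cannot cross the space
    print(words_and_spaces.match(step).group(1))  # Eric Cantona
    print(quoted.match(quoted_step).group(1))     # Eric Cantona
```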

Regex - matching leading and trailing spaces, spaces between opening and closing brackets and words, but not between words

I apologize if this question has already been answered, but I have searched and cannot find the answer. I am trying to write a regex that will match all leading and trailing space, the spaces between the opening and closing bracket and the word, but will not match the spaces between words. The following are string format examples of the data I'm parsing:
[Header]
[ SomeSpace]
[ Some1 More Space 15 ]
no leading and trailing space, no space between brackets and only one word.
some leading and trailing space, space between the opening bracket and trailing space.
some leading space, space between word and digits, space between the opening and closing bracket, and trailing space.
The closest single regex I've come up with is:
/[^\[\]a-zA-Z\d]/
But I cannot seem to unmatch only the spaces between the words and digits...
The ruby code I currently am using as a workaround is:
line.gsub!(/^\s*/, "")
line.gsub!(/\[/, "")
line.gsub!(/\]/, "")
s = line.gsub!(/^\s*|\s*$/, "")
s = "[" + s + "]\n"
Obviously, not very pretty...
Any help to streamline this into an elegant gsub line is greatly appreciated.
Thanks!
Lee
If I understand your question correctly, you are trying to turn this text
[Header]
[ SomeSpace]
[ Some1 More Space 15 ]
into this:
[Header]
[SomeSpace]
[Some1 More Space 15]
This regex will do the job. The key addition here is the non-greedy ? quantifier on the inner character class. This makes the character class match as little as possible and leaves the trailing space within the brackets (if there is any) for the following greedy \s*.
s/^\s*\[\s*([\w\s]*?)\s*\]\s*$/[$1]/g
Ruby:
line.gsub! /^\s*\[\s*([\w\s]*?)\s*\]\s*$/, '[\\1]'
sed (ugly and most likely non-performant.. I'm no sed master!)
sed -Ee "s/^ *\[([a-zA-Z0-9 ]+)\] *$/\\1/g" -e "s/^ */[/g" -e "s/ *$/]/g" infile
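The same pattern carries over to other engines. As a sketch, here it is in Python with a few of the sample lines; the helper name tidy_header is hypothetical and the inputs use made-up amounts of padding.

```python
import re

# Lazy inner group leaves any padding before "]" to the greedy \s* that follows it.
HEADER = re.compile(r"^\s*\[\s*([\w\s]*?)\s*\]\s*$")


def tidy_header(line: str) -> str:
    """Strip padding around and just inside the brackets, keeping inner spaces."""
    return HEADER.sub(r"[\1]", line)


if __name__ == "__main__":
    for raw in ("[Header]", "   [   SomeSpace]  ", "  [  Some1 More Space 15   ]  "):
        print(repr(tidy_header(raw)))
```

Lines that do not look like a bracketed header are returned unchanged, since the pattern is anchored at both ends.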
Regex to match all extra spaces for replacement:
/(?<=^|\[)\s+|\s+(?=$|\])|(?<=\s)\s+/
The first part matches all leading spaces at the start of the line and just inside the opening bracket.
The second part matches all trailing spaces at the end of the line and just before the closing bracket.
The last part detects a sequence of 2 or more spaces and matches the extra ones.
Just replace the matches with an empty string.
Test data
[Header]
[ SomeSpace]
[ Some1 More Space 15 ]
[ Super Space ]
[ ]
[ ]
[]
[a]
[a ]
[ a]
[ a ]
[a a]
[a a a a a b] [ dasdasd dsd ]
I don't know about elegant, but the simplest is probably:
line.gsub /^\s*(\[)\s*|\s*(\])\s*$/, '\\1\\2'

Resources