how to write multiline string with same indentation in yaml - yaml

I want to write multiline string, with two lines, which will be equally indentated by 8 spaces. Ideally without need to have those 8 spaces in input, but I'm willing to do that. Anything so that I can have the result. I think i tried whole documentation. ', | > >-, >+, >8 ..., ... Adding 8char indentation which isn't in source is optional extra, but so far it seems, that in yaml you can have anything but what you actually typed.
• What is the combination for actual as-is multiline string without any yaml transformations (or say impossible if it's impossible)?
• What is the combination for actual as-is multiline string uniformly shifted to the right by N spaces (or say impossible if it's impossible)?
EDIT: specific example:
...
something: |
#blahblah
#blahblah
...
I want field somehting to contain 2 line string, each containing #blahblah, each prefixed by 8 spaces. In JSON you would write(I know I can do that in yaml, but I'd like to use yaml way to write yaml file):
{"something": " #blahblah\n #blahblah"}
EDIT 2: providing minimal working example.
Java code:
import java.io.InputStream;
import java.util.Map;
import org.yaml.snakeyaml.Yaml;
public class Test {
public static void main(String[] args) {
Yaml yaml = new Yaml();
InputStream inputStream = Test.class.getClassLoader().getResourceAsStream("test.yaml");
Map<String, Object> obj = yaml.load(inputStream);
String s = (String) obj.get("a");
System.out.println(s);
}
}
for input:
a: |
#abc
#def
I will get:
#abc
#def
For input:
a: |2
#abc
#def
I will get:
#abc
#def
How can this be explained??? If positive number after | should be for removing extra indentation, I understand it, that without number it will remove all indentation. So if someone want preserved indentation of 4 spaces, it seems he needs to do some math, put indentation of 5 and request to remove 1 using |1. No? This is how it should work? What am I missing?

What is the combination for actual as-is multiline string without any yaml transformations (or say impossible if it's impossible)?
--- |2
droggel
jug
You'll get the string as-is, without the 2 spaces indentation specified by the indentation indicator. Since the indicator must be at least 1 (you can't give 0), there is no way to do that without indentation. Still, this seems to do what you want.
What is the combination for actual as-is multiline string uniformly shifted to the right by N spaces (or say impossible if it's impossible)?
YAML is not a programming language and can't do any kind of string transformation, so this is impossible.

Related

Need explanation of the short Kotlin solution for strings in Codewars

I got a task on code wars.
The task is
In this simple Kata your task is to create a function that turns a string into a Mexican Wave. You will be passed a string and you must return that string in an array where an uppercase letter is a person standing up.
Rules are
The input string will always be lower case but maybe empty.
If the character in the string is whitespace then pass over it as if it was an empty seat
Example
wave("hello") => []string{"Hello", "hEllo", "heLlo", "helLo", "hellO"}
So I have found the solution but I want to understand the logic of it. Since its so minimalistic and looks cool but I don't understand what happens there. So the solution is
fun wave(str: String) = str.indices.map { str.take(it) + str.drop(it).capitalize() }.filter { it != str }
Could you please explain?
str.indices just returns the valid indices of the string. This means the numbers from 0 to and including str.length - 1 - a total of str.length numbers.
Then, these numbers are mapped (in other words, transformed) into strings. We will now refer to each of these numbers as "it", as that is what it refers to in the map lambda.
Here's how we do the transformation: we first take the first it characters of str, then combine that with the last str.length - it characters of str, but with the first of those characters capitalized. How do we get the last str.length - it characters? We drop the first it characters.
Here's an example for when str is "hello", illustrated in a table:
it
str.take(it)
str.drop(it)
str.drop(it).capitalize()
Combined
0
hello
Hello
Hello
1
h
ello
Ello
hEllo
2
he
llo
Llo
heLLo
3
hel
lo
Lo
helLo
4
hell
o
O
hellO
Lastly, the solution also filters out transformed strings that are the same as str. This is to handle Rule #2. Transformed strings can only be the same as str if the capitalised character is a whitespace (because capitalising a whitespace character doesn't change it).
Side note: capitalize is deprecated. For other ways to capitalise the first character, see Is there a shorter replacement for Kotlin's deprecated String.capitalize() function?
Here's another way you could do it:
fun wave2(str: String) = str.mapIndexed { i, c -> str.replaceRange(i, i + 1, c.uppercase()) }
.filter { it.any(Char::isUpperCase) }
The filter on the original is way more elegant IMO, this is just as an example of how else you might check for a condition. replaceRange is a way to make a copy of a string with some of the characters changed, in this case we're just replacing the one at the current index by uppercasing what's already there. Not as clever as the original, but good to know!

Is there a way to remove ALL special characters using Lucene filters?

Standard Analyzer removes special characters, but not all of them (eg: '-'). I want to index my string with only alphanumeric characters but referring to the original document.
Example: 'doc-size type' should be indexed as 'docsize' and 'type' and both should point to the original document: 'doc-size type'
It depends what you mean by "special characters", and what other requirements you may have. But the following may give you what you need, or point you in the right direction.
The following examples all assume Lucene version 8.4.1.
Basic Example
Starting with the very specific example you gave, where doc-size type should be indexed as docsize and type, here is a custom analyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;
import java.util.regex.Pattern;
public class MyAnalyzer extends Analyzer {
#Override
protected TokenStreamComponents createComponents(String fieldName) {
final Tokenizer source = new WhitespaceTokenizer();
TokenStream tokenStream = source;
Pattern p = Pattern.compile("\\-");
boolean replaceAll = Boolean.TRUE;
tokenStream = new PatternReplaceFilter(tokenStream, p, "", replaceAll);
return new TokenStreamComponents(source, tokenStream);
}
}
This splits on whitespace, and then removes hyphens, using a PatternReplaceFilter. It works as shown below (I use 「 and 」 as delimiters to show where whitespaces may be part of the inputs/outputs):
Input text:
「doc-size type」
Output tokens:
「docsize」
「type」
NOTE - this will remove all hyphens which are standard keyboard hyphens - but not things such as em-dashes, en-dashes, and so on. It will remove these standard hyphens regardless of where they appear in the text (word starts, word ends, on their own, etc).
A Set of Punctuation Marks
You can change the pattern to cover more punctuation, as needed - for example:
Pattern p = Pattern.compile("[$^-]");
This does the following:
Input text:
「doc-size type $foo^bar」
Output tokens:
「docsize」
「type」
「foobar」
Everything Which is Not a Character or Digit
You can use the following to remove everything which is not a character or digit:
Pattern p = Pattern.compile("[^A-Za-z0-9]");
This does the following:
Input text:
「doc-size 123 %^&*{} type $foo^bar」
Output tokens:
「docsize」
「123」
「」
「type」
「foobar」
Note that this has one empty string in the resulting tags.
WARNING: Whether the above will work for you depends very much on your specific, detailed requirements. For example, you may need to perform extra transformations to handle upper/lowercase differences - i.e. the usual things which typically need to be considered when indexing text.
Note on the Standard Analyzer
The StandardAnalyzer actually does remove hyphens in words (with some obscure exceptions). In your question you mentioned that it does not remove them. The standard analyzer uses the standard tokenizer. And the standard tokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified here. There's a section discussing how hyphens in words are handled.
So, the Standard analyzer will do this:
Input text:
「doc-size type」
Output tokens:
「doc」
「size」
「type」
That should work with searches for doc as well as doctype - it's just a question of whether it works well enough for your needs.
I understand that may not be what you want. But if you can avoid needing to build a custom analyzer, life will probably be much simpler.

How to read csv using LINQ ,some columns contain ,

i have a CSV in the below way. "India,Inc" is a company name which is single value which contains , in it
How to Get the Values in LINQ
12321,32432,423423,Kevin O'Brien,"India,Inc",234235,23523452,235235
Assuming that you will always have the columns that you specify and that the only variable is that company name can have commas inside, this UGLY code can help you achieve your goal.
var file = File.ReadLines("test.csv");
var value = from p in file
select new string[]
{ p.Split(',')[0],
p.Split(',')[1],
p.Split(',')[2],
p.Split(',')[3],
p.Split(',').Count() == 7 ? p.Split(',')[4] :
(p.Split(',').Count() > 7 ? String.Join(",",p.Split(',').Skip(4).Take(p.Split(',').Count() - 7).ToArray() ) : ""),
p.Split(',')[p.Split(',').Count() - 3],
p.Split(',')[p.Split(',').Count() - 2],
p.Split(',')[p.Split(',').Count() - 1]
};
A regular expression would work, bit nasty due to the recursive nature but it does achieve your goal.
List<string> matches = new List<string>();
string subjectString = "12321,32432,423423,Kevin O'Brien,\"India,Inc\",234235,23523452,235235";
Regex regexObj = new Regex(#"(?<="")\b[123456789a-z,']+\b(?="")|[123456789a-z']+", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success)
{
matches.Add(matchResults.Value);
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
This should suffice in most cases. It handles quoted strings, strings with double quotes within them, and embedded commas.
var subjectString = "12321,32432,423423,Kevin O'Brien,\"India,Inc\",234235,\"Test End\"\"\",\"\"\"Test Start\",\"Test\"\"Middle\",23523452,235235";
var result=Regex.Split(subjectString,#",(?=(?:[^""]*""[^""]*"")*[^""]*$)")
.Select(x=>x.StartsWith("\"") && x.EndsWith("\"")?x.Substring(1,x.Length-2):x)
.Select(x=>x.Replace("\"\"","\""));
It does however break, if you have a field with a single double quote inside it, and the string itself is not enclosed in double quotes -- this is invalid in most definitions of a CSV file, where any field that contains CR, LF, Comma, or Double quote must be enclosed in double quotes.
You should be able to reuse the same Regex expression to break on lines as well for small CSV files. Larger ones you would want a better implementation. Replace the double quotes with LF, and remove the matching ones (unquoted LF's). Then use the regular expression again replacing the quotes with CR, and split on matching.
Another option is to use CSVHelper and not traying to reinvent the wheel
var csv = new CsvHelper.CsvReader(new StreamReader("test.csv"));
while (csv.Read())
{
Console.WriteLine(csv.GetField<int>(0));
Console.WriteLine(csv.GetField<string>(1));
Console.WriteLine(csv.GetField<string>(2));
Console.WriteLine(csv.GetField<string>(3));
Console.WriteLine(csv.GetField<string>(4));
}
Guide
I would recommend LINQ to CSV, because it is powerful enough to handle special characters including commas, quotes, and decimals. They have really worked a lot of these issues out for you.
It only takes a few minutes to set up and it is really worth the time because you won't run into these types of issues down the road like you would with custom code. Here are the basic steps, but definitely follow the instructions in the link above.
Install the Nuget package
Create a class to represent a line item (name the fields the way they're named in the csv)
Use CsvContext.Read() to read into an IEnumerable which you can easily manipulate with LINQ
Use CsvContext.Write() to write a List or IEnumerable to a CSV
This is very easy to setup, has very little code, and is much more scalable than doing it yourself.
becuase you're only reading values delminated bycommas, the spaces shouldn't cause an issue if you just treat them like any other character.
var values = File.ReadLines(path)
SelectMany(line => line.Split(','));

ruby extract string between two string

I am having a string as below:
str1='"{\"#Network\":{\"command\":\"Connect\",\"data\":
{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'
I wanted to extract the somename string from the above string. Values of xx:xx:xx:xx:xx:xx, somename and 123456789 can change but the syntax will remain same as above.
I saw similar posts on this site but don't know how to use regex in the above case.
Any ideas how to extract the above string.
Parse the string to JSON and get the values that way.
require 'json'
str = "{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
json = JSON.parse(str.strip)
name = json["#Network"]["data"]["Name"]
pwd = json["#Network"]["data"]["Pwd"]
Since you don't know regex, let's leave them out for now and try manual parsing which is a bit easier to understand.
Your original input, without the outer apostrophes and name of variable is:
"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
You say that you need to get the 'somename' value and that the 'grammar will not change'. Cool!.
First, look at what delimits that value: it has quotes, then there's a colon to the left and comma to the right. However, looking at other parts, such layout is also used near the command and near the pwd. So, colon-quote-data-quote-comma is not enough. Looking further to the sides, there's a \"Name\". It never occurs anywhere in the input data except this place. This is just great! That means, that we can quickly find the whereabouts of the data just by searching for the \"Name\" text:
inputdata = .....
estposition = inputdata.index('\"Name\"')
raise "well-known marker wa not found in the input" unless estposition
now, we know:
where the part starts
and that after the "Name" text there's always a colon, a quote, and then the-interesting-data
and that there's always a quote after the interesting-data
let's find all of them:
colonquote = inputdata.index(':\"', estposition)
datastart = colonquote+3
lastquote = inputdata.index('\"', datastart)
dataend = lastquote-1
The index returns the start position of the match, so it would return the position of : and position of \. Since we want to get the text between them, we must add/subtract a few positions to move past the :\" at begining or move back from \" at end.
Then, fetch the data from between them:
value = inputdata[datastart..dataend]
And that's it.
Now, step back and look at the input data once again. You say that grammar is always the same. The various bits are obviously separated by colons and commas. Let's try using it directly:
parts = inputdata.split(/[:,]/)
=> ["\"{\\\"#Network\\\"",
"{\\\"command\\\"",
"\\\"Connect\\\"",
"\\\"data\\\"",
"\n{\\\"Id\\\"",
"\\\"xx",
"xx",
"xx",
"xx",
"xx",
"xx\\\"",
"\\\"Name\\\"",
"\\\"somename\\\"",
"\\\"Pwd\\\"",
"\\\"123456789\\\"}}}\\0\""]
Please ignore the regex for now. Just assume it says a colon or comma. Now, in parts you will get all the, well, parts, that were detected by cutting the inputdata to pieces at every colon or comma.
If the layout never changes and is always the same, then your interesting-data will be always at place 13th:
almostvalue = parts[12]
=> "\\\"somename\\\""
Now, just strip the spurious characters. Since the grammar is constant, there's 2 chars to be cut from both sides:
value = almostvalue[2..-3]
Ok, another way. Since regex already showed up, let's try with them. We know:
data is prefixed with \"Name\" then colon and slash-quote
data consists of some text without quotes inside (well, at least I guess so)
data ends with a slash-quote
the parts in regex syntax would be, respectively:
\"Name\":\"
[^\"]*
\"
together:
inputdata =~ /\\"Name\\":\\"([^\"]*)\\"/
value = $1
Note that I surrounded the interesting part with (), hence after sucessful match that part is available in the $1 special variable.
Yet another way:
If you look at the grammar carefully, it really resembles a set of embedded hashes:
\"
{ \"#Network\" :
{ \"command\" : \"Connect\",
\"data\" :
{ \"Id\" : \"xx:xx:xx:xx:xx:xx\",
\"Name\" : \"somename\",
\"Pwd\" : \"123456789\"
}
}
}
\0\"
If we'd write something similar as Ruby hashes:
{ "#Network" =>
{ "command" => "Connect",
"data" =>
{ "Id" => "xx:xx:xx:xx:xx:xx",
"Name" => "somename",
"Pwd" => "123456789"
}
}
}
What's the difference? the colon was replaced with =>, and the slashes-before-quotes are gone. Oh, and also opening/closing \" is gone and that \0 at the end is gone too. Let's play:
tmp = inputdata[2..-4] # remove opening \" and closing \0\"
tmp.gsub!('\"', '"') # replace every \" with just "
Now, what about colons.. We cannot just replace : with =>, because it would damage the internal colons of the xx:xx:xx:xx:xx:xx part.. But, look: all the other colons have always a quote before them!
tmp.gsub!('":', '"=>') # replace every quote-colon with quote-arrow
Now our tmp is:
{"#Network"=>{"command"=>"Connect","data"=>{"Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789"}}}
formatted a little:
{ "#Network"=>
{ "command"=>"Connect",
"data"=>
{ "Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789" }
}
}
So, it looks just like a Ruby hash. Let's try 'destringizing' it:
packeddata = eval(tmp)
value = packeddata['#Network']['data']['Name']
Done.
Well, this has grown a bit and Jonas was obviously faster, so I'll leave the JSON part to him since he wrote it already ;) The data was so similar to Ruby hash because it was obviously formatted as JSON which is a hash-like structure too. Using the proper format-reading tools is usually the best idea, but mind that the JSON library when asked to read the data - will read all of the data and then you can ask them "what was inside at the key xx/yy/zz", just like I showed you with the read-it-as-a-Hash attempt. Sometimes when your program is very short on the deadline, you cannot afford to read-it-all. Then, scanning with regex or scanning manually for "known markers" may (not must) be much faster and thus prefereable. But, still, much less convenient. Have fun.

Ruby MatchData class is repeating captures, instead of including additional captures as it "should"

Ruby 1.9.1, OSX 10.5.8
I'm trying to write a simple app that parses through of bunch of java based html template files to replace a period (.) with an underscore if it's contained within a specific tag. I use ruby all the time for these types of utility apps, and thought it would be no problem to whip up something using ruby's regex support. So, I create a Regexp.new... object, open a file, read it in line by line, then match each line against the pattern, if I get a match, I create a new string using replaceString = currentMatch.gsub(/./, '_'), then create another replacement as whole string by newReplaceRegex = Regexp.escape(currentMatch) and finally replace back into the current line with line.gsub(newReplaceRegex, replaceString) Code below, of course, but first...
The problem I'm having is that when accessing the indexes within the returned MatchData object, I'm getting the first result twice, and it's missing the second sub string it should otherwise be finding. More strange, is that when testing this same pattern and same test text using rubular.com, it works as expected. See results here
My pattern:
(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+.)+(?:[a-zA-Z0-9]+)(?:>))
Text text:
<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText
Here's the relevant code:
tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each{|htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
tagMatch = tagRegex.match(htmlLine)
if(tagMatch)
matchesArray = tagMatch.to_a
firstMatch = matchesArray[0]
secondMatch = matchesArray[1]
puts "First match: #{firstMatch} and second match #{secondMatch}"
tagMatch.captures.each {|lineMatchCapture|
puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
#create a new regex using the match results; make sure to use auto escape method
originalPatternString = Regexp.escape(lineMatchCapture)
replacementRegex = Regexp.new(originalPatternString)
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
#replace original match with underscore replaced copy within line
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
}
end
}
I would think that I should get the first tag in matchData[0] then the second tag in matchData1, or, what I'm really doing because I don't know how many matches I'll get within any given line is matchData.to_a.each. And in this case, matchData has two captures, but they're both the first tag match
which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>
So, what the heck am I doing wrong, why does rubular test give me the expected results?
You want to use the on String#scan instead of the Regexp#match:
tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/
lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"
lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
A few recommendations for next ruby questions:
newlines and spaces are your friends, you don't loose points for using more lines on your code ;-)
use do-end on blocks instead of {}, improves readability a lot
declare variables in snake case (hello_world) instead of camel case (helloWorld)
Hope this helps
I ended up using the String.scan approach, the only tricky point there was figuring out that this returns an array of arrays, not a MatchData object, so there was some initial confusion on my part, mostly due to my ruby green-ness, but it's working as expected now. Also, I trimmed the regex per Trevoke's suggestion. But snake case? Never...;-) Anyway, here goes:
tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each do |htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."
if(oldMatches.size > 0)
oldMatches.each_index do |index|
arrayMatch = oldMatches[index]
aMatch = arrayMatch[0]
#create a new regex using the match results; make sure to use auto escape method
replacementRegex = Regexp.new(Regexp.escape(aMatch))
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')
#replace original match with underscore replaced copy within line, matching against the new escaped literal regex
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
end # I kind of still prefer the brackets...;-)
end
end
Now, why does MatchData work the way it does? It seems like it's behavior is a bug really, and certainly not very useful in general if you can't get it provide a simple means of accessing all the matches. Just my $.02
Small bits:
This regexp helps you get "normalMode" .. But not "secondLineMatch":
<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")
This regexp helps you get "secondLineMatch" ... But not "normalMode":
<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").
I'm not really good at regexpt but I'll keep toiling at it.. :)
And I don't know if this helps you at all, but here's a way to get both:
<webobject name=admin.(\w+) (with option 'i').

Resources