I'm using YamlDotNet with an Azure Function v2 to deserialize YAML from a markdown file (hosted on GitHub) into a .NET object. I'm struggling with this error when attempting to deserialize the YAML string:
Expected 'StreamEnd', got 'DocumentStart'
I'm getting the markdown file content using HttpClient with a GET request to https://github.com/martinkearn/Content/raw/fd83bf8218b7c5e01f8b498e8a831bcd3fc3c961/Blogs/Test.md which returns a raw markdown file in the response body.
My Model is
public class Article
{
    public string Title { get; set; }
    public string Author { get; set; }
    public List<string> Categories { get; set; }
}
My YAML is
---
title: Test File
author: Martin Kearn
categories:
- Test
- GitHubCMS
- One More Tag
- another tag
---
Here is my code
// URL of the raw markdown file (hard-coded here)
var url = "https://raw.githubusercontent.com/martinkearn/Content/fd83bf8218b7c5e01f8b498e8a831bcd3fc3c961/Blogs/Test.md";
// get raw file and extract YAML
using (var client = new HttpClient())
{
    // set up HttpClient
    client.BaseAddress = new Uri(url);
    client.DefaultRequestHeaders.Add("User-Agent", "ExtractYAML Function");
    // fetch the raw file content
    var response = await client.GetAsync(url);
    string rawFile = await response.Content.ReadAsStringAsync();
    using (var r = new StringReader(rawFile))
    {
        var deserializer = new DeserializerBuilder()
            .WithNamingConvention(new CamelCaseNamingConvention())
            .Build();
        // This line is causing Expected 'StreamEnd', got 'DocumentStart'
        var article = deserializer.Deserialize<Article>(r);
    }
}
Your actual downloaded file contains:
---
title: Test File
author: Martin Kearn
categories:
- Test
- GitHubCMS
- One More Tag
- another tag
---
# Test file
The --- is the end-of-directives marker, which is optional if you don't have any directives (%YAML 1.2, %TAG ...).
Since you have an empty line after the second end-of-directives marker, this is counted as if your second document contained
---
null
# Test file
You should at least get rid of that empty line and possibly remove the second end-of-directives marker, putting the comment at the end of the first document.
The end-of-document indicator in YAML is ... at the beginning of a line.
Make your file read:
title: Test File
author: Martin Kearn
categories:
- Test
- GitHubCMS
- One More Tag
- another tag
# Test file
or at most:
---
title: Test File
author: Martin Kearn
categories:
- Test
- GitHubCMS
- One More Tag
- another tag
# Test file
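If the file has to stay a Markdown file with front matter, an alternative to editing it is to cut out the text between the first two --- lines and hand only that part to the YAML parser. This is not what the answer above proposes, just a minimal sketch of the idea in Java (the same string handling carries over to C#). FrontMatterSketch and extractFrontMatter are hypothetical names, and the sketch assumes the file starts with --- on its first line.

```java
class FrontMatterSketch {
    /** Returns the YAML between the first two "---" lines, or null if there is no front matter. */
    static String extractFrontMatter(String markdown) {
        String[] lines = markdown.split("\\R"); // split on any line break
        if (lines.length == 0 || !lines[0].trim().equals("---")) {
            return null; // file does not start with a front matter marker
        }
        StringBuilder yaml = new StringBuilder();
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].trim().equals("---")) {
                return yaml.toString(); // closing marker found
            }
            yaml.append(lines[i]).append('\n');
        }
        return null; // opening marker without a closing one
    }

    public static void main(String[] args) {
        String markdown = "---\ntitle: Test File\nauthor: Martin Kearn\n---\n\n# Test file\n";
        System.out.print(extractFrontMatter(markdown));
    }
}
```

The returned string is a single YAML document, so a deserializer that expects exactly one document should no longer run into a second DocumentStart.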
I am using the following code snippet to serialise a dynamic model of a project to a string (which is eventually exported to a YAML file).
dynamic exportModel = exportModelConvertor.ToDynamicModel(project);
var serializerBuilder = new SerializerBuilder();
var serializer = serializerBuilder.EmitDefaults().DisableAliases().Build();
using (var sw = new StringWriter())
{
    serializer.Serialize(sw, exportModel);
    string result = sw.ToString();
}
Any multi-line strings such as the following:
propertyName = "One line of text
followed by another line
and another line"
are exported in the following format:
propertyName: >
  One line of text

  followed by another line

  and another line
Note the extra (unwanted) line breaks.
According to this YAML Multiline guide, the format used here is the folded block scalar style. Is there a way using YamlDotNet to change the style of this output for all multi-line string properties to literal block scalar style or one of the flow scalar styles?
The YamlDotNet documentation shows how to apply ScalarStyle.DoubleQuoted to a particular property using WithAttributeOverride but this requires a class name and the model to be serialised is dynamic. This also requires listing every property to change (of which there are many). I would like to change the style for all multi-line string properties at once.
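To answer my own question, I've now worked out how to do this by deriving from the ChainedEventEmitter class and overriding void Emit(ScalarEventInfo eventInfo, IEmitter emitter). The emitter is then registered on the SerializerBuilder with WithEventEmitter(next => new MultilineScalarFlowStyleEmitter(next)). See the code sample below.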
public class MultilineScalarFlowStyleEmitter : ChainedEventEmitter
{
    public MultilineScalarFlowStyleEmitter(IEventEmitter nextEmitter)
        : base(nextEmitter) { }

    public override void Emit(ScalarEventInfo eventInfo, IEmitter emitter)
    {
        if (typeof(string).IsAssignableFrom(eventInfo.Source.Type))
        {
            string value = eventInfo.Source.Value as string;
            if (!string.IsNullOrEmpty(value))
            {
                // Any line-break character makes the string multi-line
                bool isMultiLine = value.IndexOfAny(new char[] { '\r', '\n', '\x85', '\x2028', '\x2029' }) >= 0;
                if (isMultiLine)
                    eventInfo = new ScalarEventInfo(eventInfo.Source)
                    {
                        Style = ScalarStyle.Literal
                    };
            }
        }
        nextEmitter.Emit(eventInfo, emitter);
    }
}
I am parsing a PDF document using iText 7. I have fetched all the form fields from the document using AcroForm, but I am unable to get the font associated with a field using the GetFont method. I also tried to read the /DA entry, but it is returned as a PdfString. Is there any way to get the font information, or do I have to parse the /DA string myself?
Actually iText 7 does have a method to determine form field font information, it's needed for generating form field appearances after all: PdfFormField.getFontAndSize(PdfDictionary).
Unfortunately this method is protected, so one has to cheat a bit to access it, e.g. one can derive one's own form field class from it and make the method public therein:
class PdfFormFieldExt extends PdfFormField {
    public PdfFormFieldExt(PdfDictionary pdfObject) {
        super(pdfObject);
    }

    public Object[] getFontAndSize(PdfDictionary asNormal) throws IOException {
        return super.getFontAndSize(asNormal);
    }
}
(from test class DetermineFormFieldFonts)
Using this class we can extract font information like this:
try (   PdfReader pdfReader = new PdfReader(PDF_SOURCE);
        PdfDocument pdfDocument = new PdfDocument(pdfReader)    ) {
    PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDocument, false);
    for (Entry<String, PdfFormField> entry : form.getFormFields().entrySet()) {
        String fieldName = entry.getKey();
        PdfFormField field = entry.getValue();
        System.out.printf("%s - %s\n", fieldName, field.getFont());

        PdfFormFieldExt extField = new PdfFormFieldExt(field.getPdfObject());
        Object[] fontAndSize = extField.getFontAndSize(field.getWidgets().get(0).getNormalAppearanceObject());
        PdfFont font = (PdfFont) fontAndSize[0];
        Float size = (Float) fontAndSize[1];
        PdfName resourceName = (PdfName) fontAndSize[2];
        System.out.printf("%s - %s - %s - %s\n", Strings.repeat(" ", fieldName.length()),
                font.getFontProgram().getFontNames(), size, resourceName);
    }
}
(DetermineFormFieldFonts test test)
Applied to this sample document with some text fields, one gets:
TextAdobeThai - null
              - AdobeThai-Regular - 12.0 - /AdobeThai-Regular
TextArial - null
          - Arial - 12.0 - /Arial
TextHelvetica - null
              - Helvetica - 12.0 - /Helv
TextWingdings - null
              - Wingdings - 12.0 - /Wingdings
As you can see, while PdfFormField.getFont() always returns null, PdfFormField.getFontAndSize(PdfDictionary) returns sensible information.
Tested using the current iText for Java development branch, 7.1.5-SNAPSHOT
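For the other route mentioned in the question, parsing the /DA string by hand, the font selection in a default appearance string normally has the form "/<resource name> <size> Tf", for example "/Helv 12 Tf 0 g". A minimal sketch in plain Java, assuming the /DA value has already been read as a string; DefaultAppearanceSketch and parseFont are hypothetical names, and names containing #xx escapes are not handled:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefaultAppearanceSketch {
    // Matches "/<resource name> <size> Tf", e.g. "/Helv 12 Tf"
    private static final Pattern FONT_OPERATOR =
            Pattern.compile("/(\\S+)\\s+(\\d*\\.?\\d+)\\s+Tf");

    /** Returns {resourceName, size} from a /DA string, or null if no Tf operator is found. */
    static String[] parseFont(String da) {
        Matcher m = FONT_OPERATOR.matcher(da);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] font = parseFont("/Helv 12 Tf 0 g");
        System.out.println(font[0] + " at size " + font[1]);
    }
}
```

Note that this only yields the resource name (such as Helv in the output above), which still has to be looked up in the font resources to find the actual font, and a size of 0 means the size is chosen automatically. That lookup is the part getFontAndSize already does for you.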
When I have an app.net url like https://photos.app.net/5269262/1 - how can I retrieve the image thumbnail of the post?
Running a curl on above url shows a redirect
bash-3.2$ curl -i https://photos.app.net/5269262/1
HTTP/1.1 301 MOVED PERMANENTLY
Location: https://alpha.app.net/pfleidi/post/5269262/photo/1
Following this gives an HTML page that contains the image in the form of
img src='https://files.app.net/1/60621/aWBTKTYxzYZTqnkESkwx475u_ShTwEOiezzBjM3-ZzVBjq_6rzno42oMw9LxS5VH0WQEgoxWegIDKJo0eRDAc-uwTcOTaGYobfqx19vMOOMiyh2M3IMe6sDNkcQWPZPeE0PjIve4Vy0YFCM8MsHWbYYA2DFNKMdyNUnwmB2KuECjHqe0-Y9_ODD1pnFSOsOjH' data-full-width='2048' data-full-height='1536'
inside a larger block of <div> tags.
The files API in app.net allows retrieving thumbnails, but I somehow don't see the link between those endpoints and the above URLs.
photos.app.net is just a simple redirector. It is not part of the API proper. In order to get the thumbnail, you will need to fetch the file directly using the file fetch endpoint and the file id (http://developers.app.net/docs/resources/file/lookup/#retrieve-a-file) or fetch the post that the file is included in and examine the oembed annotation.
In this case, you are talking about post id 5269262 and the URL to fetch that post with the annotation is https://alpha-api.app.net/stream/0/posts/5269262?include_annotations=1 and if you examine the resulting json document you will see the thumbnail_url.
For completeness' sake I want to post my final solution here (in Java). It builds on the good and accepted answer of Jonathon Duerig:
private static String getAppNetPreviewUrl(String url) {
    // Dots are escaped so that they only match a literal "."
    Pattern photosPattern = Pattern.compile(".*photos\\.app\\.net/([0-9]+)/.*");
    Matcher m = photosPattern.matcher(url);
    if (!m.matches()) {
        return null;
    }
    String id = m.group(1);
    String streamUrl = "https://alpha-api.app.net/stream/0/posts/"
            + id + "?include_annotations=1";
    // Now that we have the posting url, we can get it and parse
    // for the thumbnail
    BufferedReader br = null;
    HttpURLConnection urlConnection = null;
    try {
        urlConnection = (HttpURLConnection) new URL(streamUrl).openConnection();
        urlConnection.setDoInput(true);
        urlConnection.setDoOutput(false);
        urlConnection.setRequestProperty("Accept", "application/json");
        urlConnection.connect();
        StringBuilder builder = new StringBuilder();
        br = new BufferedReader(
                new InputStreamReader(urlConnection.getInputStream()));
        String line;
        while ((line = br.readLine()) != null) {
            builder.append(line);
        }
        urlConnection.disconnect();
        // Parse the obtained json
        JSONObject post = new JSONObject(builder.toString());
        JSONObject data = post.getJSONObject("data");
        JSONArray annotations = data.getJSONArray("annotations");
        // Assumes the oembed annotation is the first one in the array
        JSONObject annotationValue = annotations.getJSONObject(0);
        JSONObject value = annotationValue.getJSONObject("value");
        String finalUrl = value.getString("thumbnail_large_url");
        return finalUrl;
    } .......
I'm using the pagedown editor. The code I'm using for generating the preview is as follows:
$(document).ready(function () {
    var previewConverter = Markdown.getSanitizingConverter();
    var editor = new Markdown.Editor(previewConverter);
    editor.run();
});
When I enter some text into the input, the dynamically generated preview is rendered as expected.
The content (the raw entered text shown below) is then saved to the database:
"http://www.google.com\n\n<script>alert('hi');</script>\n\n[google][4]\n\n\n [1]: http://www.google.com"
On the server side, before the page is rendered, I convert the text fetched from the database using the markdownsharp library v1.13.0.0. After conversion, I sanitize the HTML using Jeff Atwood's code, which I found here:
private static Regex _tags = new Regex("<[^>]*(>|$)",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);
private static Regex _whitelist = new Regex(@"
    ^</?(b(lockquote)?|code|d(d|t|l|el)|em|h(1|2|3)|i|kbd|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)>$|
    ^<(b|h)r\s?/?>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);
private static Regex _whitelist_a = new Regex(@"
    ^<a\s
    href=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)""
    (\stitle=""[^""<>]+"")?\s?>$|
    ^</a>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);
private static Regex _whitelist_img = new Regex(@"
    ^<img\s
    src=""https?://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+""
    (\swidth=""\d{1,3}"")?
    (\sheight=""\d{1,3}"")?
    (\salt=""[^""<>]*"")?
    (\stitle=""[^""<>]*"")?
    \s?/?>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

/// <summary>
/// sanitize any potentially dangerous tags from the provided raw HTML input using
/// a whitelist based approach, leaving the "safe" HTML tags
/// CODESNIPPET:4100A61A-1711-4366-B0B0-144D1179A937
/// </summary>
public static string Sanitize(string html)
{
    if (String.IsNullOrEmpty(html)) return html;
    string tagname;
    Match tag;
    // match every HTML tag in the input
    MatchCollection tags = _tags.Matches(html);
    for (int i = tags.Count - 1; i > -1; i--)
    {
        tag = tags[i];
        tagname = tag.Value.ToLowerInvariant();
        if (!(_whitelist.IsMatch(tagname) || _whitelist_a.IsMatch(tagname) || _whitelist_img.IsMatch(tagname)))
        {
            html = html.Remove(tag.Index, tag.Length);
            System.Diagnostics.Debug.WriteLine("tag sanitized: " + tagname);
        }
    }
    return html;
}
The conversion and sanitization process is as follows:
var md = new MarkdownSharp.Markdown();
var unsafeHtml = md.Transform(content);
var safeHtml = Sanitize(unsafeHtml);
return new HtmlString(safeHtml);
unsafeHtml contains
"<p>http://www.google.com</p>\n\n<script>alert('hi');</script>\n\n<p>google</p>\n"
safeHtml contains
"<p>http://www.google.com</p>\n\nalert('hi');\n\n<p>google</p>\n"
This renders to:
So the sanitization worked and the second link was converted as expected. Unfortunately, the first link is not a link anymore, just text. How can I fix this?
Maybe a better approach is not to convert on the server side at all, but to render the markdown text on the page with JavaScript?
In Markdown.Converter.js we can find the _DoAutoLinks(text) function. It has a section which automatically adds < and > around unadorned raw hyperlinks, and then autolinks anything like <http://example.com>. This is why
http://www.google.com
will be first converted to:
<http://www.google.com>
and then to:
<a href="http://www.google.com">http://www.google.com</a>
My temporary workaround is doing something similar on the C# side:
var unsafeHtml = DoAutolinks(md.Transform(content));

private static string DoAutolinks(string content)
{
    /* url pattern - from msdn.microsoft.com/en-us/library/ff650303.aspx */
    const string url = @"(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?";
    const string pattern = @"<p>(?<url>" + url + ")</p>";
    // wrap a paragraph that consists only of a URL in an anchor
    var result = Regex.Replace(content, pattern, "<p><a href=\"${url}\">${url}</a></p>");
    return result;
}
Should this functionality, the conversion of unadorned links, be included in markdownsharp?
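To make the two-step autolinking described above concrete, here is a minimal sketch in Java (the regexes carry over to .NET with little change). The patterns are deliberately simplified and are an assumption for illustration, not the ones pagedown actually uses; AutoLinkSketch and autoLink are hypothetical names.

```java
import java.util.regex.Pattern;

class AutoLinkSketch {
    // Step 1: a bare URL at the start of the text or after whitespace
    private static final Pattern BARE_URL =
            Pattern.compile("(^|\\s)((?:https?|ftp)://[^\\s<>\"']+)");
    // Step 2: a URL wrapped in angle brackets, e.g. <http://example.com>
    private static final Pattern BRACKETED_URL =
            Pattern.compile("<((?:https?|ftp)://[^\\s<>\"']+)>");

    static String autoLink(String text) {
        // Wrap bare URLs in < and >, keeping the leading whitespace ($1)
        String bracketed = BARE_URL.matcher(text).replaceAll("$1<$2>");
        // Turn every <url> into an anchor element
        return BRACKETED_URL.matcher(bracketed).replaceAll("<a href=\"$1\">$1</a>");
    }

    public static void main(String[] args) {
        System.out.println(autoLink("http://www.google.com"));
    }
}
```

Because step 1 only matches a URL preceded by whitespace or the start of the text, a URL that is already wrapped in angle brackets is not wrapped twice.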
I am trying to read the following text file:
Author
{
Name xyz
blog www.test.com
rating 123
}
Author
{
Name xyz
blog www.test.com
rating 123
}
Author
{
Name xyz
blog www.test.com
rating 123
}
Author
{
Name xyz
blog www.test.com
rating 123
}
I am using the following snippet to fetch my author record:
public static IEnumerable<string> GetAuthors(string path, string startfrom, string endto)
{
    return File.ReadLines(path)
        .SkipWhile(line => line != startfrom)
        .TakeWhile(line => line != endto);
}

public static void DoSomethingWithAuthors(string fileName)
{
    var result = GetAuthors(fileName, "AUTHOR", "}").ToList();
}
The above only returns the details of one author. Could someone kindly show me how to fetch all authors in one go so I can populate them into objects? Thank you so much!
I rarely suggest that, but if the file structure is that predictable you might even use a regex to get your author details. As the objects you want to initialize are not complex, you can match the Author bit and take the values from the regex match groups.
The regex to match the authors would be something like this:
Author\s*{\s*Name\s+(.*?)\s+blog\s+(.*?)\s+rating\s+(.*?)\s*}
Your values would be in groups 1, 2 and 3.
EDIT:
If it doesn't make a difference for you, you can use the ReadToEnd() method and then you can parse the whole file content as a string:
http://msdn.microsoft.com/en-us/library/system.io.streamreader.readtoend(v=vs.100).aspx
As for the regex solution - check this out:
http://msdn.microsoft.com/en-us/library/twcw2f1c.aspx
An adapted version - it might need tweaking but in general it should work:
// read the whole input file as one string
string text = File.ReadAllText(fileName);
string pat = @"Author\s*{\s*Name\s+(.*?)\s+blog\s+(.*?)\s+rating\s+(.*?)\s*}";
Regex r = new Regex(pat, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Match m = r.Match(text);
var authors = new List<Author>();
while (m.Success)
{
    // groups 1 to 3 hold name, blog and rating
    var name = m.Groups[1].Value;
    var blog = m.Groups[2].Value;
    var rating = m.Groups[3].Value;
    var author = new Author(name, blog, rating);
    authors.Add(author);
    m = m.NextMatch();
}
It is going to stop at the first } it runs into.
Remove the .TakeWhile(line => line != endto) bit and it should work for you.
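If you would rather stay with a line-by-line approach than use a regex, the lines can be grouped into one record per { ... } block while reading. A minimal sketch of that idea in Java (the loop translates directly to C# over File.ReadLines); AuthorBlocksSketch and parseAuthors are hypothetical names, and each record is kept as a simple key/value map rather than a typed Author object:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AuthorBlocksSketch {
    /** Groups the lines between "{" and "}" into one key/value map per Author block. */
    static List<Map<String, String>> parseAuthors(List<String> lines) {
        List<Map<String, String>> authors = new ArrayList<>();
        Map<String, String> current = null; // non-null while inside a block
        for (String raw : lines) {
            String line = raw.trim();
            if (line.equals("{")) {
                current = new LinkedHashMap<>();
            } else if (line.equals("}")) {
                if (current != null) {
                    authors.add(current);
                }
                current = null;
            } else if (current != null && !line.isEmpty()) {
                // Split "Name xyz" into key "Name" and value "xyz"
                String[] parts = line.split("\\s+", 2);
                current.put(parts[0], parts.length > 1 ? parts[1] : "");
            }
        }
        return authors;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Author", "{", "Name xyz", "blog www.test.com", "rating 123", "}",
                "Author", "{", "Name abc", "blog www.example.com", "rating 7", "}");
        System.out.println(parseAuthors(lines));
    }
}
```

Lines outside a block, such as the "Author" header itself, are ignored, so the sketch does not depend on the header's spelling or case.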