I'm brand new to Ruby and programming. I'd like to create a little program to automate one of my more tedious work tasks, which I'm currently doing by hand, but I'm not sure where to start.
People register to take courses through an online form, and I receive their registration information at the end of each day as a CSV document. I go line by line through that document and generate a confirmation email to send to them based on their input on the online form: the course they'd like to take, their room preference, how much they chose to pay for the course (sliding scale), etc. The email ends up looking something like this:
Dear So and so, Thank you for signing up for "Such-and-such An Awesome Course," with Professor Superdude. The course starts on Monday, September 1, 2030 at 4pm and ends on Thursday at 1pm. You paid such-and-such an amount...
et cetera. So ideally the program would take in the CSV document with information like "student name," "course title," "fee paid," and generate emails based on blocks of text ("Dear , Thank you for signing up for _,") and variables (the dates of the course) that are stored externally so they are easy to edit without going into the source code (maybe as CSV and plain text files).
Additionally, I need the output to be in rich text, so I can bold and underline certain things. I'm familiar with Markdown so I could use that in the source code but it would be ideal if the output could be rich text.
I'm not expecting anyone to write a program for me, but if you could let me know what I should look into or even what I should Google, that would be very helpful.
I assume you're trying to put together an email. If so, I'd probably start with a simple ERB template. If you want to generate HTML, you can write one HTML template and one plain-text template; variable substitution works the same way for both, with the exception that you'll need to HTML-escape anything that contains characters HTML considers special (ampersands, greater than, less than, for example). See the ERB documentation.
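For example, a minimal ERB sketch might look like the following; the variable names (student_name, course_title, fee_paid) are just stand-ins for whatever your CSV columns actually contain.

require 'erb'

# Minimal ERB sketch; the local variable names are assumptions, not your real columns.
template = ERB.new(<<TEMPLATE)
Dear <%= student_name %>,

Thank you for signing up for "<%= course_title %>". You paid <%= fee_paid %>.
TEMPLATE

student_name = "So and so"
course_title = "Such-and-such An Awesome Course"
fee_paid     = "$50"

puts template.result(binding)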
If you're trying to parse CSV, use FasterCSV or a similar library; see the FasterCSV documentation.
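On Ruby 1.9 and later, FasterCSV actually became the standard library's CSV class, so a sketch of reading the daily registration file could be as simple as this (the header names are assumptions):

require 'csv'  # on Ruby 1.9+ this is FasterCSV under its standard-library name

# Sketch only; assumes the export has a header row with these column names.
CSV.foreach("registrations.csv", :headers => true) do |row|
  puts "#{row['student name']} registered for #{row['course title']} and paid #{row['fee paid']}"
end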
If you want to send an email, you can use ActionMailer, the mail gem, or the pony gem. ActionMailer is part of Rails but can be used independently. Pony is a good facade for creating email as well; both ActionMailer and Pony depend on the "mail" gem, so unless you want to spend more time thinking about how email formats work, use one of those.
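As a rough sketch of the mail gem route (the addresses and SMTP settings here are placeholders, not anything to copy verbatim):

require 'mail'  # gem install mail

# Placeholder SMTP settings; replace with your real server.
Mail.defaults do
  delivery_method :smtp, :address => "smtp.example.com", :port => 587
end

Mail.deliver do
  from    'registrar@example.com'
  to      'student@example.com'
  subject 'Course registration confirmation'

  text_part do
    body 'Dear So and so, thank you for signing up...'
  end

  html_part do
    content_type 'text/html; charset=UTF-8'
    body '<p>Dear So and so, thank you for signing up for <b>the course</b>...</p>'
  end
end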
If you're not trying to send an email, and instead are trying to create a formatted document, you can still use ERB, but use it to generate output in TeX or, if you're more adventurous than I am, a Word-compatible XML document. Alternatively, if you're wedded to Microsoft Word or RTF, you might try either http://ruby-rtf.rubyforge.org/ (Ruby RTF) or use COM/OLE interop to talk to Word, but I would only do that if I really had to; if I had to go that route, I'd probably suck it up and just use the built-in mail merge feature in Word, perhaps with a little VBA code.
I like reading the PoC||GTFO issues, and one thing I found remarkable when I first discovered them was the "polyglot" nature of their PDF files.
Let me explain: consider their 8th issue, for example. You can unzip files from it, and you can run the encryption routine they talk about by executing the file as a script; even better (worse?), their 9th issue can even be played as a music file!
I'm currently writing small scripts every week, along with a little one-page PDF in LaTeX each time to explain those scripts, so I would really enjoy being able to create the same kind of PDF files. Sadly, their first issue only partly explains how to include zip files, and it does so through three small sketches of command lines without actual explanations.
So my question is basically:
how can one create such a polyglot PDF file containing stuff like a zip as well as being a shell script which may be run using arguments just like normal scripts?
I'm asking here about the process of creation, not just an explanation of how this is possible. Ideally, there would already be scripts or programs that make it easy to create such PDF files.
I've tried searching the net for the keywords "polyglot files" and other terms of that kind, but I wasn't able to find any useful matches. Maybe this process has another name?
I've already read the presentation by Julia Wolf which explains how things work, but sadly I haven't had time to apply that knowledge to the real world, because I'm not used to playing with file headers and the way a PDF is constructed.
EDIT:
Okay, I've read more and found the 7th edition of PoC||GTFO to be really informative on this subject. I may end up being able to create my own scripts to build such polyglot PDF files if I have some more time to work on it.
I played around with polyglots myself after attending Ange's talks and also talking to him in person. You really need to understand the file formats to be able to nest them into each other.
However, long story short, here are some links I found extremely useful for creating polyglots:
Some older Google Code Trunk
PoC of the polyglot stuff
The second link (to GitHub) in particular will help you create polyglots, and also understand how they work and how they are implemented. Since it is mostly Python and very cleanly written, it is easy to follow.
I feel dissecting some file formats would be a good place to start. You can find many file format specifications for different file types through Google, but they can be a tough read and will likely take you some time to translate into whatever language you are using.
PDF: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
ELF: https://www.cs.cmu.edu/afs/cs/academic/class/15213-s00/doc/elf.pdf
ZIP: http://kat.sdf.org/zip_file_format.txt
The language(s) you select will need a way to read and write raw bytes (not just ASCII alphanumerics), so perhaps C would be good for more direct access to memory. Some Python tricks could help with open-sourcing the scripts easily.
To dissect the files, you may want to build a tool kind of like https://github.com/kvesel/zipbrk/ to take them apart, then put them all back together in a polyglot format. For example, zip does not require the section headers to be at the start of the file (or even contiguous, for that matter), and the PDF magic number can appear in multiple places within the file as well. I also believe I recall a polyglot tool being included in one of the PoC||GTFO publications (maybe issue 8 or 2?) as a polyglot in the PDF file itself.
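To make that concrete, the crudest possible starting point is plain concatenation, since PDF readers parse from the %PDF header near the start of the file while zip readers locate the end-of-central-directory record near the end. A sketch in Ruby (file names are placeholders; a real polyglot builder would also fix up the zip offsets and tidy the PDF objects):

# Naive sketch, not a real polyglot builder: append a zip archive to a PDF.
# Many unzip tools will still extract it despite the leading PDF bytes,
# and PDF viewers ignore the trailing zip data.
pdf = File.binread("paper.pdf")
zip = File.binread("scripts.zip")
File.binwrite("paper_polyglot.pdf", pdf + zip)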
Don't forget the hacker's bible! :)
https://nostarch.com/gtfo
I am new to Rails and am trying to put together a little app that will read Cucumber results from a Mongo database. The results stored in the document are already parsed into HTML. In the Rails app I take those results and display them using the raw() method. The string I get back is fairly large and, as it turns out, the raw() method appears to be truncating the text I pass into it. When I output the text without raw() I get the entire string as expected (except that it has been escaped and doesn't render as HTML).
My question is: is there any way to get around this? I really don't want to have to do the HTML conversion in the Rails app or on the client; both seem too costly, especially when I can do it elsewhere and just store it in MongoDB as an HTML string. Anyone have any ideas?
Thanks,
Jake
It turns out that there was a part of the string that was causing the HTML rendering to choke. Because Cucumber syntax passes variables to Scenario steps using < >, there were places where <style> appeared in the text. Because <style> is a valid opening HTML tag, the browser stopped rendering the HTML at that point. I found this out by looking at the page source (where previously I had only been using Inspect Element in the developer tools) and saw that the whole HTML I was expecting was in the source. I parsed through the text and used gsub to replace the <style> tag, and all is working now.
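For anyone hitting the same problem, a minimal sketch of that clean-up step (the placeholder names here are just examples; use whatever appears in your Scenario Outlines):

# Hypothetical helper: escape Cucumber placeholders such as <style> so the
# browser doesn't treat them as opening HTML tags. Placeholder names are examples.
PLACEHOLDERS = %w[style name date]

def escape_placeholders(html)
  PLACEHOLDERS.reduce(html) do |out, placeholder|
    out.gsub("<#{placeholder}>", "&lt;#{placeholder}&gt;")
  end
end

# safe_html = escape_placeholders(results_html)  # then hand safe_html to raw()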
I've got an idea, but its implications scare me. Perhaps you, dear reader, can help. :)
The Setup
I've created a Ruby-based CLI app that allows user configuration via a YAML file. In that file, there is a scenario where the user can define pre and post "actions" that display a message (with some arbitrary, non-relevant code in between). For example:
actions:
  - action
    # ...other keys...
    pre:
      message: 'This is the pre message'
      action: puts 'PRE COMMAND'
    post:
      message: 'This is the post message'
      action: puts 'POST COMMAND'
In this case, my app would output This is the pre message, evaluate the "pre" action (thus outputting PRE COMMAND), do some irrelevant stuff, output This is the post message, and finally evaluate the "post" action (thus outputting POST COMMAND).
The Problem
You can already guess the problem; it appeared when I used the word "evaluate". That's a scary thing. Even though this is a locally-run, client-centric app, the idea of eval'ing random Ruby is terrifying.
Solution Idea #1
The first idea was just that: eval the actions. I quickly dismissed it (unless one of you knows-more-Ruby-than-me types can convince me otherwise).
Solution Idea #2
Do some "checking" (via Regexp, perhaps) to validate that the command is somehow "valid". That seems wildly large and difficult to contain.
Solution Idea #3
Another idea was to wrap acceptable commands in data structures of my own (thus limiting the possibilities that a user could define). For instance, I might create an open_url action that safely validates and opens a URL in the default browser.
I like this idea, but it seems rather limiting; I'd have to define a zillion wrappers over time, it seems like. But perhaps that's the price you pay for safety?
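To make the idea concrete, here's the kind of whitelist I'm imagining (all names invented for illustration):

# Sketch of a whitelist of named actions; the YAML would reference these by
# name instead of containing raw Ruby. All action names here are invented.
SAFE_ACTIONS = {
  'open_url' => lambda { |url|
    raise ArgumentError, "unsupported URL: #{url}" unless url =~ %r{\Ahttps?://}
    system('open', url)  # or 'xdg-open' / 'start' depending on the platform
  },
  'print' => lambda { |text| puts text }
}

def run_action(name, *args)
  handler = SAFE_ACTIONS.fetch(name) { raise ArgumentError, "unknown action: #{name}" }
  handler.call(*args)
end

run_action('print', 'PRE COMMAND')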
Your Turn
I appreciate any additional thoughts you have!
You'd probably be a lot better off writing a simple framework that allows for Ruby plugins than gluing together something out of YAML and snippets of code.
You're right that "eval" is terrifying, and it should be, but sometimes it's the most elegant solution out of all possible inelegant solutions. I'd argue that this time is not one of those cases.
It's not at all hard to write a very simple DSL in Ruby where you can express your configuration in code:
action.pre.message = 'This is the pre message'
action.pre.command do
  puts "PRE COMMAND"
end
All this depends on having a number of pre-defined structures with methods like message=, taking a string as an argument, or command, taking a block. If you want to get fancy, you can write some method_missing handlers and make things up as you go along, allowing for maximum flexibility.
You can see many examples of this, from your Rakefile to Capistrano, and it usually works out a lot better than having a non-Ruby configuration file format with Ruby embedded in it.
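A bare-bones sketch of the kind of structure I mean (illustration only, not a full framework):

# Minimal sketch of the configuration objects backing a DSL like the one above.
class Hook
  attr_accessor :message

  def command(&block)
    @command = block if block_given?
    @command
  end
end

class Action
  def pre
    @pre ||= Hook.new
  end

  def post
    @post ||= Hook.new
  end
end

action = Action.new

# The user's configuration is plain Ruby evaluated against this object:
action.pre.message = 'This is the pre message'
action.pre.command do
  puts 'PRE COMMAND'
end

puts action.pre.message
action.pre.command.call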
Assuming that I know nothing about anything and that I'm starting to program TODAY, what would you say is necessary for me to learn in order to start working with Natural Language Processing?
I've been struggling with some string parsing methods, but so far they are just annoying me and making me write ugly code. I'm looking for some fresh ideas on how to create a Remember The Milk-like API to parse a user's input, so that I can provide an input form for fast data entry based not on fields but on simple one-line phrases instead.
EDIT: RTM is a to-do list system, so to enter a task you don't need to type a value into each field (task name, due date, location, etc.). You can simply type a phrase like "Dentist appointment monday at 2PM in WhateverPlace" and it will parse it and fill all the fields for you.
I don't have any kind of technical constraints since it's going to be a personal project but I'm more familiar with .NET world. Actually, I'm not sure this is a matter of language but if it's necessary I'm more than willing to learn a new language to do it.
My project is related to personal finances so the phrases are more like "Spent 10USD on Coffee last night with my girlfriend" and it would fill location, amount of $$$, tags and other stuff.
Thanks a lot for any kind of directions that you might give me!
This does not appear to require full NLP. Simple pattern-based information extraction will probably suffice. The basic idea is to tokenize the text, then recognize/classify certain keywords, and finally recognize patterns/phrases.
In your example, tokenizing gives you "Dentist", "appointment", "monday", "at", "2PM", "in", "WhateverPlace". Your tool will recognize that "monday" is a day of the week, "2PM" is a time, etc. Finally, you can find patterns like [at] [TIME] and [in] [Place] and use those to fill in the fields.
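A toy sketch of that idea in Ruby (the keyword lists and patterns are examples only; a real system would be far more forgiving):

DAYS = %w[monday tuesday wednesday thursday friday saturday sunday]

# Toy pattern-based extractor; keyword lists and patterns are examples only.
def extract(phrase)
  tokens = phrase.split
  fields = {}

  tokens.each_with_index do |token, i|
    word = token.downcase
    fields[:day]   = token if DAYS.include?(word)
    fields[:time]  = tokens[i + 1] if word == 'at' && tokens[i + 1] =~ /\A\d{1,2}(:\d{2})?(am|pm)?\z/i
    fields[:place] = tokens[i + 1] if word == 'in' && tokens[i + 1]
  end

  fields[:task] = tokens.take_while { |t| !DAYS.include?(t.downcase) }.join(' ')
  fields
end

p extract("Dentist appointment monday at 2PM in WhateverPlace")
# => {:day=>"monday", :time=>"2PM", :place=>"WhateverPlace", :task=>"Dentist appointment"}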
A framework like GATE may help, but even that may be a larger hammer than you really need.
Have a look at NLTK; it's a good resource for beginner programmers interested in NLP.
http://www.nltk.org/
It is written in Python, which is one of the easier programming languages.
Now that I understand your problem, here is my solution:
You could develop a kind of restricted vocabulary in which all amounts must end with a $ sign and any time must be in the form 00:00 and/or end with AM/PM. As for detecting items, you can use a list of objects from an ontology such as OpenCyc, which can provide you with a list of objects such as beer, coffee, bread, milk, etc. This will help you detect objects in the short phrase. Still, it would be a very fuzzy approach.
EDIT: I would really like to see some general discussion about the formats and their pros and cons!
EDIT 2: The bounty didn't really help to create the needed discussion; there are a few interesting answers, but comprehensive coverage of the topic is still missing. Six people marked the question as a favourite, which shows me that there is interest in this discussion.
When deciding about internationalization the toughest part IMO is the choice of storage format.
For example the Zend PHP Framework offers the following adapters which cover pretty much all my options:
Array : no, hard to maintain
CSV : don't know, possible problems with encoding
Gettext : frequently used, poEdit for all platforms available BUT complicated
INI : don't know, possible problems with encoding
TBX : no clue
TMX : too much of a big thing? no editors freely available.
QT : not very widespread, no free tools
XLIFF : the coming standard? BUT no free tools available.
XMLTM : no, not what I need
Basically I'm stuck with the 4 'bold' choices. I would like to use INI files, but I'm reading about the encoding problems... is it really a problem if I use strict UTF-8 everywhere (files, connections, DB, etc.)?
I'm on Windows and I tried to figure out how poEdit works but just didn't manage, and there are no tutorials on the web either. Is gettext still a viable choice or an endangered species anyway?
What about XLIFF, has anybody worked with it? Any tips on what tools to use?
Any ideas for Eclipse integration of any of these technologies?
POEdit isn't really hard to get the hang of. Just create a new .po file, then tell it to import strings from the source files. The program scans your PHP files for any function calls matching _("Text"), gettext("Text"), etc. You can even specify your own functions to look for.
You then enter a translation in the appropriate box. When you save your .po file, a .mo file is automatically generated. That's just a binary version of the translations that gettext can easily parse.
In your PHP script make a call to bindtextdomain() telling it where your .mo file is located. Now any strings passed to gettext (or the underscore function) will be translated.
It makes it really easy to keep your translation files up to date. POEdit also has some neat features like allowing comments, showing changed and dropped strings and allowing fuzzy matches, which means you don't have to re-translate strings that have been slightly modified.
There is always the Translate Toolkit, which allows converting between (I think) all the formats mentioned, with gettext (PO) and XLIFF as the preferred ones.
You can use INI if you want; it's just that INI doesn't have a way to declare that it is in UTF-8, so if someone opens your INI file with an editor, they might corrupt your file.
So it only works if you can trust the users to edit it with UTF-8 encoding.
You can add a BOM at the start of the file; some editors know about it.
What do you want it to store? User-generated content or your application resources?
I worked with two of these formats on the localization side: TMX and XLIFF. They are pretty similar. TMX is more popular nowadays, but XLIFF is gaining support quickly. There was at least one free XLIFF editor when I last looked into it, Transolution, but it is no longer being developed.
I do the data storage myself using a custom design: all displayed text is stored in the DB.
I have two tables.
The first table has an identity value, a 32-character varchar field (indexed on this field), and a 200-character English description of the phrase.
My second table has the identity value from the first table, a language code (EN_UK, EN_US, etc.) and an NVARCHAR column for the text.
I use an nvarchar for the text because it supports other character sets which I don't yet use.
The 32 character varchar in the first table stores something like 'pleaselogin' while the second table actually stores the full "Please enter your login and password below".
I have created a huge list of dynamic values which I replace at runtime. An example would be "You have {[dynamic:passworddaysremain]} days to change your password." - this allows me to work around the word ordering in different languages.
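The substitution itself is straightforward; sketched here in Ruby purely for illustration (the key name is just an example):

# Illustrative sketch of the {[dynamic:key]} replacement; key names are examples.
phrase = "You have {[dynamic:passworddaysremain]} days to change your password."
values = { "passworddaysremain" => 14 }

rendered = phrase.gsub(/\{\[dynamic:(\w+)\]\}/) { values.fetch($1, "").to_s }
puts rendered  # => "You have 14 days to change your password."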
I have only had to deal with Arabic numerals so far, but I will have to work something out for the first user who requires non-Arabic numerals.
I actually pull this information out of the database on a two-hourly interval and cache it to disk in an XML file for each language. CDATA is used extensively.
There are many options available; for performance you could use HTML templates for each language. My method works well, but it does use the XML DOM a lot at runtime to create the pages.
One rather simple approach is to just use a resource file and resource script. Programs like MSVC have no problem editing them. They're also reasonably friendly to other systems (and to text editors) as well. You can just create separate string tables (and bitmap tables) for each language, and mark each such table with what language it is in.
None of those choices looks very appetizing to me.
If you're sending files out for translation in multiple languages, then you want to be able to trust that the encodings are correct, especially if no one on your team speaks those languages. Sometimes it's difficult to spot an encoding problem in a foreign language, and it is just too easy to inadvertently corrupt file encodings if you let your OS 'guess'.
You really want a format that declares its encoding. Otherwise, translators or their translation tools might select something other than UTF-8. For my money, any kind of simple XML format is best, but it looks like you'd need to roll your own in Zend. XLIFF and TMX are certainly overkill.
A format like Java's XML resources would be ideal.
This might be a little different from what's been posted so far and may not be exactly what you're looking for, but I thought I would add it, if for nothing else than a different approach. I went with an object-oriented approach: I created a system that encapsulates language files into a class by storing them as an array of string => translation pairs. Access to a translation is through a method called translate, with the key string as a parameter. Extending classes inherit the parent's language array and can add to it or overwrite it. Because the classes are extensible, you can change a base class and have the changes propagate through the children, making it more maintainable than an array by itself. Plus, you only call the classes you need.
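The same idea sketched in Ruby (class names and keys are invented for illustration):

# Sketch of the class-based approach; class names and keys are invented.
class BaseLanguage
  def strings
    { 'greeting' => 'Hello', 'farewell' => 'Goodbye' }
  end

  def translate(key)
    strings.fetch(key, key)  # fall back to the key itself if a translation is missing
  end
end

class German < BaseLanguage
  def strings
    super.merge('greeting' => 'Hallo')  # override or extend the parent's table
  end
end

puts German.new.translate('greeting')  # => "Hallo"
puts German.new.translate('farewell')  # => "Goodbye" (inherited from the parent)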
We just store the strings in the DB and have a translator mode built into the application to handle actually adding strings for different languages.
In the application we use various tricks to create text ids, like
£("btn_save")
£(Order.class,"amt")
The translations are loaded from the DB when the system boots, or when a reload is manually triggered. The £ method takes care of looking up the translated string according to the language specified in the user session.
You can check out my l10n tool called iL10Nz at http://www.myl10n.net
You can upload PO/POT files, XLIFF, and INI files, then translate and download.
You can also check out this video on YouTube:
http://www.youtube.com/watch?v=LJLmxMFxaxA
Thanks
Olivier