How to exchange correctly using the Money gem - ruby

I use the Money gem to work with different currencies. I'm seeing a strange behavior with the "JPY" currency.
I have the following rates:
config.add_rate('USD', 'EUR', 0.92)
config.add_rate('USD', 'JPY', 123.0)
Trying to exchange currencies, I get strange results:
10.to_money.exchange_to('EUR')
=> #<Money fractional:920 currency:EUR>
10.to_money.exchange_to('JPY')
=> #<Money fractional:1230 currency:JPY>
The "JPY" conversion should be #<Money fractional:123000 currency:JPY>. Any ideas on what's going on?

It really depends on the definition of the currency. The code below shows that 10 USD is indeed equal to 1230 yen:
require "rails"
require "money-rails"
Money.add_rate('USD', 'EUR', 0.92)
Money.add_rate('USD', 'JPY', 123.0)
p 10.to_money.exchange_to('JPY') == Money.new(1230,"JPY")
#=> true
Your expectation of seeing 123000 may not be correct. Inspect the JPY currency:
p Money.new(1230,"JPY").currency
#<Money::Currency id: jpy, priority: 6, symbol_first: true, thousands_separator: ,, html_entity: ¥, decimal_mark: ., name: Japanese Yen, symbol: ¥, subunit_to_unit: 1, exponent: 0.0, iso_code: JPY, iso_numeric: 392, subunit: , smallest_denomination: 1>
The important field to note in the currency definition is subunit_to_unit: 1. Per the documentation:
:subunit_to_unit the proportion between the unit and the subunit
This means that for yen the fractional value is already expressed in whole yen; unlike USD or EUR, there is no factor of 100 between the fractional value and one unit of the currency.
p 10.to_money.exchange_to('EUR')
#=> #<Money fractional:920 currency:EUR>
p 10.to_money.exchange_to('JPY')
#=> #<Money fractional:1230 currency:JPY>
Below is the currency definition for EUR:
#<Money::Currency id: eur, priority: 2, symbol_first: true, thousands_separator: ., html_entity: €, decimal_mark: ,, name: Euro, symbol: €, subunit_to_unit: 100, exponent: 2.0, iso_code: EUR, iso_numeric: 978, subunit: Cent, smallest_denomination: 1>
In the case of EUR, subunit_to_unit: 100 indicates that the fractional value is in cents (or an equivalent subunit).
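If you want to see the conversion expressed in whole currency units rather than in subunits, Money#amount (and format) already divides by subunit_to_unit. A minimal sketch, assuming the plain money gem (6.x) with its default VariableExchange bank and the rates from above:
require "money"
Money.add_rate('USD', 'EUR', 0.92)
Money.add_rate('USD', 'JPY', 123.0)
usd = Money.new(1000, 'USD')   # 1000 cents == 10.00 USD
jpy = usd.exchange_to('JPY')
eur = usd.exchange_to('EUR')
p jpy.fractional   #=> 1230   (whole yen, since subunit_to_unit is 1)
p jpy.amount.to_f  #=> 1230.0
p eur.fractional   #=> 920    (cents, since subunit_to_unit is 100)
p eur.amount.to_f  #=> 9.2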

Removing categories with patsy and statsmodels

I am using statsmodels and patsy to build a logistic regression model. I'll use pseudocode here. Let's assume I have a dataframe containing a categorical variable, say Country, with 200 levels. I have reason to believe some of them are predictive, so I build a model as in
formula = 'outcome ~ C(Country)'
patsy splits Country into its levels and the model is built using all countries. I then see that the coefficient for GB is high, so I want to remove only GB. Can I do something like this in patsy:
formula = 'outcome ~ C(country) - C(country)[GB]'
I tried and it did not change anything.
I don't know if there is a way to subset a category with a patsy formula, but you can do it in the DataFrame.
For example
import numpy as np
import pandas as pd
import statsmodels.api as sm
# sample data
size = 100
np.random.seed(1)
countries = ['IT', 'UK', 'US', 'FR', 'ES']
df = pd.DataFrame({
    'outcome': np.random.random(size),
    'Country': np.random.choice(countries, size)
})
df['Country'] = df.Country.astype('category')
print(df.Country)
0 ES
1 IT
2 UK
3 US
4 UK
..
95 FR
96 UK
97 ES
98 UK
99 US
Name: Country, Length: 100, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Let us suppose we want to remove the category "US":
# create a deep copy excluding 'US'
_df = df[df.Country!='US'].copy(deep=True)
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Even if there are no more elements with the category "US" in the DataFrame, the category itself is still there. If we used this DataFrame in a statsmodels model, we'd get a singular matrix error, so we need to remove the unused category:
# remove unused category 'US'
_df['Country'] = _df.Country.cat.remove_unused_categories()
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (4, object): ['ES', 'FR', 'IT', 'UK']
and now we can fit a model
mod = sm.Logit.from_formula('outcome ~ Country', data=_df)
fit = mod.fit()
print(fit.summary())
Optimization terminated successfully.
Current function value: 0.684054
Iterations 4
Logit Regression Results
==============================================================================
Dep. Variable: outcome No. Observations: 83
Model: Logit Df Residuals: 79
Method: MLE Df Model: 3
Date: Sun, 16 May 2021 Pseudo R-squ.: 0.01179
Time: 22:43:37 Log-Likelihood: -56.776
converged: True LL-Null: -57.454
Covariance Type: nonrobust LLR p-value: 0.7160
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -0.1493 0.438 -0.341 0.733 -1.007 0.708
Country[T.FR] 0.4129 0.614 0.673 0.501 -0.790 1.616
Country[T.IT] -0.1223 0.607 -0.201 0.840 -1.312 1.068
Country[T.UK] 0.1027 0.653 0.157 0.875 -1.178 1.383
=================================================================================
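If you need to drop several levels this way, the filter-and-refit steps can be wrapped in a small helper; fit_without_level below is just an illustrative name, not a statsmodels API:
import statsmodels.api as sm

def fit_without_level(df, column, level, formula):
    # Drop one level of a categorical column, remove it from the
    # category index, and fit a logit model on the remaining rows.
    subset = df[df[column] != level].copy(deep=True)
    subset[column] = subset[column].cat.remove_unused_categories()
    return sm.Logit.from_formula(formula, data=subset).fit()

# using the sample data from above:
# fit = fit_without_level(df, 'Country', 'US', 'outcome ~ Country')
# print(fit.summary())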

Teacher-Student System: Training Student with Top-k Hypotheses List

I want to configure a teacher-student system, where a teacher seq2seq model generates a top-k list of hypotheses, which are used to train a student seq2seq model.
My plan for implementing this is to batch the teacher hypotheses, meaning that the teacher outputs a tensor with a batch-axis length of k * B, where B is the input batch-axis length. The output batch tensor then contains k hypotheses for each sequence in the input batch tensor, sorted by the position of the associated input sequence in the input batch.
This tensor is set as the student's training target. However, the student's batch tensor still has a batch-axis length of B, so I use tf.repeat to repeat the sequences in the output tensor of the student's encoder k times before feeding that tensor into the student's decoder.
For debugging purposes I made the simplification of repeating only the teacher's single best hypothesis for now, before implementing the top-k list selection.
Here is a summary of my config file:
[...]
# Variables:
student_target = "teacher_hypotheses_stack"
[...]
# Custom repeat function:
def repeat(source, src_name="source", **kwargs):
    import tensorflow as tf
    input = source(0)
    input = tf.Print(input, [src_name, "in", input, tf.shape(input)])
    output = tf.repeat(input, repeats=3, axis=1)
    output = tf.Print(output, [src_name, "out", output, tf.shape(output)])
    return output

def repeat_t(source, **kwargs):
    return repeat(source, "teacher")

def repeat_s(source, **kwargs):
    return repeat(source, "student")
[...]
# Configuration of the teacher + repeating of its output
**teacher_network(), # The teacher_network is an encoder-decoder seq2seq model. The teacher performs search during training and is not trainable.
"teacher_stack": {
"class": "eval", "from": ["teacher_decision"], "eval": repeat_t,
"trainable": False
# "register_as_extern_data": "teacher_hypotheses_stack"
},
"teacher_stack_reinterpreter": { # This is an attempt to explicitly (re-)select the batch axis. It is probably unecessary...
"class": "reinterpret_data",
"set_axes": {"B": 1, "T": 0},
"enforce_time_major": True,
"from": ["teacher_stack"],
"trainable": False,
"register_as_extern_data": "teacher_hypotheses_stack"
}
[...]
# Repeating of the student's encoder output + configuration of its decoder
"student_encoder": {"class": "copy", "from": ["student_lstm6_fw", "student_lstm6_bw"]}, # dim: EncValueTotalDim
"student_encoder_repeater": {"class": "eval", "from": ["student_encoder"], "eval": repeat},
"student_encoder_stack": { # This is an attempt to explicitly (re-)select the batch axis. It is probably unecessary...
"class": "reinterpret_data",
"set_axes": {"B": 1, "T": 0},
"enforce_time_major": True,
"from": ["student_encoder_repeater"]
},
"student_enc_ctx": {"class": "linear", "activation": None, "with_bias": True, "from": ["student_encoder_stack"], "n_out": EncKeyTotalDim}, # preprocessed_attended in Blocks
"student_inv_fertility": {"class": "linear", "activation": "sigmoid", "with_bias": False, "from": ["student_encoder_stack"], "n_out": AttNumHeads},
"student_enc_value": {"class": "split_dims", "axis": "F", "dims": (AttNumHeads, EncValuePerHeadDim), "from": ["student_encoder_stack"]}, # (B, enc-T, H, D'/H)
"model1_output": {"class": "rec", "from": [], 'cheating': config.bool("cheating", False), "unit": {
'output': {'class': 'choice', 'target': student_target, 'beam_size': beam_size, 'cheating': config.bool("cheating", False), 'from': ["model1_output_prob"], "initial_output": 0},
"end": {"class": "compare", "from": ["output"], "value": 0},
'model1_target_embed': {'class': 'linear', 'activation': None, "with_bias": False, 'from': ['output'], "n_out": target_embed_size, "initial_output": 0}, # feedback_input
"model1_weight_feedback": {"class": "linear", "activation": None, "with_bias": False, "from": ["prev:model1_accum_att_weights"], "n_out": EncKeyTotalDim, "dropout": 0.3},
"model1_s_transformed": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_s"], "n_out": EncKeyTotalDim, "dropout": 0.3},
"model1_energy_in": {"class": "combine", "kind": "add", "from": ["base:student_enc_ctx", "model1_weight_feedback", "model1_s_transformed"], "n_out": EncKeyTotalDim},
"model1_energy_tanh": {"class": "activation", "activation": "tanh", "from": ["model1_energy_in"]},
"model1_energy": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_energy_tanh"], "n_out": AttNumHeads}, # (B, enc-T, H)
"model1_att_weights": {"class": "softmax_over_spatial", "from": ["model1_energy"]}, # (B, enc-T, H)
"model1_accum_att_weights": {"class": "eval", "from": ["prev:model1_accum_att_weights", "model1_att_weights", "base:student_inv_fertility"],
"eval": "source(0) + source(1) * source(2) * 0.5", "out_type": {"dim": AttNumHeads, "shape": (None, AttNumHeads)}},
"model1_att0": {"class": "generic_attention", "weights": "model1_att_weights", "base": "base:student_enc_value"}, # (B, H, V)
"model1_att": {"class": "merge_dims", "axes": "except_batch", "from": ["model1_att0"]}, # (B, H*V)
"model1_s": {"class": "rnn_cell", "unit": "LSTMBlock", "from": ["prev:model1_target_embed", "prev:model1_att"], "n_out": 1000, "dropout": 0.3}, # transform
"model1_readout_in": {"class": "linear", "from": ["model1_s", "prev:model1_target_embed", "model1_att"], "activation": None, "n_out": 1000, "dropout": 0.3}, # merge + post_merge bias
"model1_readout": {"class": "reduce_out", "mode": "max", "num_pieces": 2, "from": ["model1_readout_in"]},
"model1_output_prob": {
"class": "softmax", "from": ["model1_readout"], "dropout": 0.3,
"target": student_target,
"loss": "ce", "loss_opts": {"label_smoothing": 0.1}
}
}, "target": student_target},
[...]
Running this config will print the following error message to the console:
[...]
Create Adam optimizer.
Initialize optimizer (default) with slots ['m', 'v'].
These additional variable were created by the optimizer: [<tf.Variable 'optimize/beta1_power:0' shape=() dtype=float32_ref>, <tf.Variable 'optimize/beta2_power:0' shape=() dtype=float32_ref>].
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
TensorFlow exception: assertion failed: [x.shape[0] != y.shape[0]] [69 17] [23]
[[node objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[...]
Execute again to debug the op inputs...
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1_1:0' shape=(1,) dtype=int32> = shape (1,), dtype int32, min/max 23/23, ([23])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0_1:0' shape=() dtype=string> = bytes(b'x.shape[0] != y.shape[0]')
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_2:0' shape=(2,) dtype=int32> = shape (2,), dtype int32, min/max 17/69, ([69 17])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All_1:0' shape=() dtype=bool> = bool_(False)
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
Op inputs:
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All:0' shape=() dtype=bool>: bool_(False)
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0:0' shape=() dtype=string>: bytes(b'x.shape[0] != y.shape[0]')
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape:0' shape=(2,) dtype=int32>: shape (2,), dtype int32, min/max 17/69, ([69 17])
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1:0' shape=(1,) dtype=int32>: shape (1,), dtype int32, min/max 23/23, ([23])
Step meta information:
{'seq_idx': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
'seq_tag': ['seq-0','seq-1','seq-2','seq-3','seq-4','seq-5','seq-6','seq-7','seq-8','seq-9','seq-10','seq-11','seq-12','seq-13','seq-14','seq-15','seq-16','seq-17','seq-18','seq-19','seq-20','seq-21','seq-22']}
Feed dict:
<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 80) dtype=float32>: shape (23, 42, 80), dtype float32, min/max -0.5/0.4, mean/stddev -0.050000004/0.28722814, Data(name='data', shape=(None, 80), batch_shape_meta=[B,T|'time:var:extern_data:data',F|80])
<tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 42/42, ([42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42])
<tf.Tensor 'extern_data/placeholders/source_text/source_text:0' shape=(?, ?, 512) dtype=float32>: shape (23, 13, 512), dtype float32, min/max -0.5/0.4, mean/stddev -0.050011758/0.28722063, Data(name='source_text', shape=(None, 512), available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:source_text',F|512])
<tf.Tensor 'extern_data/placeholders/source_text/source_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 13/13, ([13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13])
<tf.Tensor 'extern_data/placeholders/target_text/target_text:0' shape=(?, ?) dtype=int32>: shape (23, 17), dtype int32, min/max 6656/6694, Data(name='target_text', shape=(None,), dtype='int32', sparse=True, dim=35209, available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:target_text'])
<tf.Tensor 'extern_data/placeholders/target_text/target_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 17/17, ([17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17])
<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>: bool(True)
EXCEPTION
[...]
File "home/philipp/Documents/bachelor-thesis/returnn/repository/TFUtil.py", line 4374, in sparse_labels_with_seq_lens
x = check_dim_equal(x, 0, seq_lens, 0)
[...]
So, the network is built without errors, but on the first training step it crashes with an assertion error. To me it looks like RETURNN or TensorFlow somehow validates the batch length against its original value, but I don't know where or why, so I have no clue what to do about it.
What am I doing wrong? Is my idea even implementable with RETURNN this way?
EDIT (10th June 2020): For clarification: My ultimate goal is to let the teacher generate a top-k list of hypotheses for each input sequence, which are then used to train the student. So, for each input sequence of the student, there are k solutions/target sequences.
To train the student, it must predict the probability of each hypothesis, and then the cross-entropy loss is calculated to determine the update gradients. But if there are k target sequences for each input sequence, the student must decode the encoder states k times, each time targeting a different target sequence.
This is why I want to repeat the encoder states k times, to make the student decoder's data parallel, and then use the default cross-entropy loss implementation of RETURNN (a plain-TensorFlow sketch of this layout follows the list below):
input-seq-1 --- teacher-hyp-1-1;
input-seq-1 --- teacher-hyp-1-2;
...;
input-seq-1 --- teacher-hyp-1-k;
input-seq-2 --- teacher-hyp-2-1;
...
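For illustration only, here is a plain-TensorFlow sketch (outside of RETURNN, with made-up toy sizes) of the layout I have in mind, where the encoder states are repeated k times along the batch axis so that row i of the repeated tensor lines up with hypothesis i of the stacked teacher output:
import tensorflow as tf

B, k, T_enc, D, T_dec = 2, 3, 5, 4, 7  # made-up toy sizes
encoder_states = tf.random.normal([B, T_enc, D])  # student encoder output
hypotheses = tf.random.uniform([B * k, T_dec], maxval=10, dtype=tf.int32)  # stacked teacher targets

# Repeat each encoder state k times along the batch axis, so that
# repeated[i] is decoded against hypotheses[i].
repeated = tf.repeat(encoder_states, repeats=k, axis=0)  # shape [B * k, T_enc, D]
print(repeated.shape, hypotheses.shape)  # (6, 5, 4) (6, 7)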
Is there a more proper way to achieve my goal?
EDIT (12th June 2020 #1): Yes, I know that the teacher's DecisionLayer already selects the best hypothesis and that, this way, I'm only repeating that best hypothesis k times. I'm doing this as an intermediate step towards my ultimate goal. Later, I want to fetch the top-k list from the teacher's ChoiceLayer somehow, but that felt like a separate task.
But Albert, you say RETURNN would somehow extend the data on the batch dimension automatically? How can I imagine that?
EDIT (12th June 2020 #2): Okay, now I select the top-k (this time k=4) hypotheses list from the teacher’s choice layer (or output layer) by:
"teacher_hypotheses": {
"class": "copy", "from": ["extra.search:teacherMT_output"],
"register_as_extern_data": "teacher_hypotheses_stack"
}
But using this Data as training target of the student leads to the error:
TensorFlow exception: assertion failed: [shape[0]:] [92] [!=] [dim:] [23]
[[node studentMT_output/rec/subnet_base/check_seq_len_batch_size/check_input_dim/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
This is, I assume, because the target data of the student (the hypotheses list) has a batch axis k=4 times longer than that of the student's input data/encoder state data.
Doesn’t the student encoder state data need to be extended/repeated here, to match the target data?
EDIT (12th June 2020 #3): I consider the initial issue solved. The overall issue is continued here: Teacher-Student System: Training Student With k Target Sequences for Each Input Sequence
It does not only validate the batch length. It will collapse the batch and time axes (it uses flatten_with_seq_len_mask; see the code of Loss.init and that function) and then calculate the loss on that flattened tensor. So the sequence lengths also need to match. This might be a problem, but I'm not sure. As you have the same target for the rec layer itself as well, it should have the same sequence length in training.
You can debug this by carefully checking the output of debug_print_layer_output_template, i.e. check the Data (batch_shape_meta) output and verify that the axes are all as you expect them to be.
(debug_print_layer_output_template can and should always be enabled. It will not make it slower.)
You can also temporarily enable debug_print_layer_output_shape, which will really print the shape of all tensors. That way you can verify what they look like.
Your usage of ReinterpretDataLayer looks very wrong. You should never ever explicitly set the axes by integer (like "set_axes": {"B": 1, "T": 0}). Why are you doing this at all? This could be the reason why it is messed up in the end.
Your repeat function is not very generic. You are using hard-coded axis integers there as well. You should never do that. Instead, you would write something like:
input_data = source(0, as_data=True)
input = input_data.placeholder
...
output = tf.repeat(input, repeats=3, axis=input_data.batch_dim_axis)
Did I understand this correct, that this is what you want to do? Repeat in the batch axis?
In that case, you also need to adapt the seq length information of the output of that layer. You cannot simply use that function as-is in an EvalLayer. You would also need to define out_type to a function which correctly returns the correct Data template. E.g. like this:
def repeat_out(out):
    out = out.copy()
    out.size_placeholder[0] = tf.repeat(out.size_placeholder[0], axis=0, repeats=3)
    return out
...
"student_encoder_repeater": {
"class": "eval", "from": ["student_encoder"], "eval": repeat,
"out_type": lambda sources, **kwargs: repeat_out(sources[0].output)
}
Now you have the additional problem that every time you call this repeat_out, you will get another seq length info. RETURNN will not be able to tell whether these seq lengths are all the same or different (at compile time). And that will cause errors or strange effects. To solve this, you should reuse the same seq length. E.g. like this:
"teacher_stack_": {
"class": "eval", "from": "teacher_decision", "eval": repeat
},
"teacher_stack": {
"class": "reinterpret_data", "from": "teacher_stack_", "size_base": "student_encoder_repeater"
}
Btw, why do you want to do this repetition at all? What's the idea behind it? You repeat both the student and the teacher 3 times? So just increasing your learning rate by a factor of 3 would do the same?
Edit: It seems this is done to match the top-k list. In that case, this is all wrong, as RETURNN should already do such repetition automatically. You should not do this manually.
Edit: To understand how the repetition (and beam-search resolving in general) works, the first thing you should do is look at the log output (you must have debug_print_layer_output_template enabled, but you should have that anyway all the time). You will see the output of each layer, especially its Data output object. This is already useful to check whether the shapes are all as you expect (check batch_shape_meta in the log). However, this is only the static shape at compile time, so the batch dim is just a marker there. You will also see the search beam information. This keeps track of whether the batch originates from some beam search (any ChoiceLayer, basically), has a beam, and what the beam size is. Now, in the code, check SearchChoices.translate_to_common_search_beam and its usages. When you follow the code, you will see SelectSearchSourcesLayer, and effectively your case will end up with output.copy_extend_with_beam(search_choices.get_beam_info()).
Edit: To repeat, this is done automatically. You do not need to call copy_extend_with_beam manually.
If you expect to get the top-k list from the teacher, you are also likely doing it wrong, as I see that you used "teacher_decision" as input. I guess this comes from a DecisionLayer? In that case, it already took only the first best from the top-k beam.
Edit: Now I understand that you are ignoring this and instead want to take only the first best and then repeat it. I would recommend not doing that, as you are making it unnecessarily complicated, and you are kind of fighting RETURNN, which knows what the batch dim should be and will get confused. (You can make it work with what I wrote above, but really, it is just unnecessarily complicated.)
Btw, there is no point in setting "trainable": False on an EvalLayer. That has no effect; the eval layer has no parameters anyway.

How to find an expression in a text file and process all lines until the next occurrence of the expression and repeat until end of the file

I have a text file:
Some comment on the 1st line of the file.
processing date: 31.8.2016
amount: -1.23
currency: EUR
balance: 1234.56
payer reference: /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 1
additional info: Amount: 1.23 EUR 29.08.2016 Place: 123456789XY
processing date: 30.8.2016
amount: -2.23
currency: EUR
balance: 12345.56
payer reference: /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 2
additional info: Amount: 2.23 EUR 28.08.2016 Place: 123456789XY
processing date: 29.8.2016
amount: -3.23
currency: EUR
balance: 123456.56
payer reference: /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 2
additional info: Amount: 2.23 EUR 27.08.2016 Place: 123456789XY
I need to process the file so that the values on the right-hand side (31.8.2016, -1.23, EUR, 1234.56, etc.) end up stored in a MySQL database.
So far I have only managed to return either one occurrence of a line containing a particular string, or all such lines, using find or find_all. This is not sufficient: I need to identify each block starting with "processing date:" and ending with "additional info:", process the values in it, then move on to the next block, and so on until the end of the file.
Any hints on how to achieve this?
I'd start with this:
File.foreach('data.txt', "\n\n") do |li|
  next unless li[/^processing/]
  puts "'#{li.strip}'"
end
If "data.txt" contains your content, foreach will read the file and return paragraphs, not lines, of text in li. Once you have those you can manipulate them as you need. This is very fast and efficient and doesn't have the scalability problems readlines or any read-based I/O could have.
This is the output:
'processing date: 31.8.2016
amount: -1.23
currency: EUR
balance: 1234.56
payer reference: /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 1
additional info: Amount: 1.23 EUR 29.08.2016 Place: 123456789XY'
'processing date: 30.8.2016
amount: -2.23
currency: EUR
balance: 12345.56
payer reference: /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 2
additional info: Amount: 2.23 EUR 28.08.2016 Place: 123456789XY'
'processing date: 29.8.2016
amount: -3.23
currency: EUR
balance: 123456.56
payer reference: /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 2
additional info: Amount: 2.23 EUR 27.08.2016 Place: 123456789XY'
You can see from the wrapping ' that the file is being read in chunks, or paragraphs, delimited by "\n\n", and each chunk is then stripped of surrounding whitespace.
See the foreach documentation for more information.
split(':', 2) is your friend:
'processing date: 31.8.2016'.split(':', 2) # => ["processing date", " 31.8.2016"]
'amount: -1.23'.split(':', 2) # => ["amount", " -1.23"]
'currency: EUR'.split(':', 2) # => ["currency", " EUR"]
'balance: 1234.56'.split(':', 2) # => ["balance", " 1234.56"]
'payer reference: /VS123456/SS0011223344/KS1212'.split(':', 2) # => ["payer reference", " /VS123456/SS0011223344/KS1212"]
'type of the transaction: Some type of the transaction 1'.split(':', 2) # => ["type of the transaction", " Some type of the transaction 1"]
'additional info: Amount: 1.23 EUR 29.08.2016 Place: 123456789XY'.split(':', 2) # => ["additional info", " Amount: 1.23 EUR 29.08.2016 Place: 123456789XY"]
From that you can do:
text = 'processing date: 31.8.2016
amount: -1.23
currency: EUR
balance: 1234.56
payer reference: /VS123456/SS0011223344/KS1212
type of the transaction: Some type of the transaction 1
additional info: Amount: 1.23 EUR 29.08.2016 Place: 123456789XY'
text.lines.map{ |li| li.split(':', 2).map(&:strip) }.to_h
# => {"processing date"=>"31.8.2016", "amount"=>"-1.23", "currency"=>"EUR", "balance"=>"1234.56", "payer reference"=>"/VS123456/SS0011223344/KS1212", "type of the transaction"=>"Some type of the transaction 1", "additional info"=>"Amount: 1.23 EUR 29.08.2016 Place: 123456789XY"}
There are a number of ways to continue parsing the information into more usable data but that's for you to figure out.
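Putting the two answers together, a rough sketch of the whole pipeline could look like this. It assumes the records really are separated by blank lines, that the mysql2 gem is used for the database part, and a hypothetical transactions table; adjust the names and columns to your schema:
require 'date'
require 'mysql2'

client = Mysql2::Client.new(host: 'localhost', username: 'user', password: 'secret', database: 'bank')
insert = client.prepare(
  'INSERT INTO transactions (processed_on, amount, currency, balance) VALUES (?, ?, ?, ?)'
)

File.foreach('data.txt', "\n\n") do |paragraph|
  next unless paragraph[/^processing date:/]

  # turn the block into {"processing date" => "31.8.2016", "amount" => "-1.23", ...}
  record = paragraph.lines.grep(/:/).map { |li| li.split(':', 2).map(&:strip) }.to_h

  insert.execute(
    Date.strptime(record['processing date'], '%d.%m.%Y').to_s,  # "31.8.2016" -> "2016-08-31"
    record['amount'].to_f,
    record['currency'],
    record['balance'].to_f
  )
end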

Unidad de Fomento (CLF/UF) and money gem

I've been struggling to understand how the money gem formats the Unidad de Fomento (CLF). I've tested versions 6.5 and 6.7, and both seem to produce odd formats:
# Money 6.5
usd = Money.new(243, 'USD')
usd.to_f #=> 2.43
usd.format #=> "$2.43"
clf = Money.new(243, 'CLF')
clf.to_f #=> 243
clf.format #=> "CLF243"
# Money 6.7
usd = Money.new(243, 'USD')
usd.to_f #=> 2.43
usd.format #=> "$2.43"
clf = Money.new(243, 'CLF')
clf.to_f #=> 0.0243
clf.format #=> "CLF0.0243"
Is it meant to be this way, or is it a bug?
It was an intentional change introduced with version 6.6.
See the changelog and the commit on GitHub. Unfortunately there is no hint as to why it was done.
OK, I think I've got it. I had been assuming we all live in a world of cents like USD or EUR (base 10, exponent 2: 10^2 cents equal 1 unit of the currency). There are many currencies that have no minor unit at all, like the Japanese Yen (JPY), and there are also currencies that are not base 10 at all. This Wikipedia article explains it very well: https://en.wikipedia.org/wiki/ISO_4217
So, in my examples: a long time ago CLF was defined with exponent 0, so it had no minor unit of any kind; a fractional amount like 2.43 was not representable, so Money.new(243, 'CLF') simply meant 243 whole units. The ISO definition then changed, CLF became a currency with exponent 4, and the same 243 subunits are now read as 0.0243.
This comment on the money issue tracker solved my issue: https://github.com/RubyMoney/money/issues/614#issuecomment-194813943
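A small sketch to make this concrete: inspect the CLF definition your installed version ships with, and build values from the major-unit amount so the result doesn't depend on the exponent (Money.from_amount is available from 6.6 onwards):
require 'money'

clf = Money::Currency.new('CLF')
p clf.exponent         #=> 0 on Money 6.5, 4 from 6.6 onwards
p clf.subunit_to_unit  #=> 1 on Money 6.5, 10000 from 6.6 onwards

# If you mean "243 UF", construct the value from the major-unit amount
# rather than from subunits:
uf = Money.from_amount(243, 'CLF')
p uf.to_f              #=> 243.0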

How to compare two Time objects only down to the hour

I want to compare two Time objects only down to the hour, while ignoring the difference of minutes and seconds.
I'm currently using t0.strftime("%Y%m%d%H") == t1.strftime("%Y%m%d%H"), but I don't think this is a good solution.
Is there better way to do this?
You can use this trick in pure Ruby
t0.to_a[2..9] == t1.to_a[2..9]
where Time#to_a returns:
$> Time.now.to_a
# => [7, 44, 2, 8, 3, 2014, 6, 67, false, "GMT"]
# [ sec, min, hour, day, month, year, wday, yday, isdst, zone ]
So you can check whether the times are equal down to the level you want, without missing important components of the object such as the zone.
If you have ActiveSupport (either through Rails, or just installed as a gem), it includes an extension to the Time class that adds a change method which will truncate times:
$> require "active_support/core_ext/time"
# => true
$> t = Time.now
# => 2014-03-07 21:30:01 -0500
$> t.change(hour: 0)
# => 2014-03-07 00:00:00 -0500
This won't modify the original time value either. So you can do this:
t0.change(min: 0) == t1.change(min: 0)
It'll zero out everything at a lower granularity (seconds, etc.).
require 'time'
t1 = Time.new ; sleep 30 ; t2 = Time.new
t1.hour == t2.hour
This returns a boolean, but note that it only compares the hour of day: two times on different days with the same hour would still compare as equal.
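If you prefer not to pull in ActiveSupport, a small pure-Ruby helper that mirrors the strftime comparison could be used (same_hour? is just an illustrative name):
def same_hour?(a, b)
  # compare year, month, day and hour; ignore minutes and seconds
  [a.year, a.month, a.day, a.hour] == [b.year, b.month, b.day, b.hour]
end

t0 = Time.new(2014, 3, 7, 21, 30, 1)
t1 = Time.new(2014, 3, 7, 21, 59, 59)
same_hour?(t0, t1)         #=> true
same_hour?(t0, t1 + 3600)  #=> false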
