Group and Count an Array of Structs - ruby

Ruby noob here!
I have an array of structs, where the struct is defined like this:
Token = Struct.new(:token, :ordinal)
So an array of these would look like this, in tabular form:
Token | Ordinal
---------------
C     | 2
CC    | 3
C     | 5
And I want to group by the "token" (i.e. the left-hand column) of the struct and get a count, but also preserve the "ordinal" element. So the above would look like this:
Token | Merged Ordinal | Count
------------------------------
C     | 2, 5           | 2
CC    | 3              | 1
Notice that the last column is a count of the grouped tokens and the middle column merges the "ordinal". The first column ("Token") can contain a variable number of characters, and I want to group on these.
I have tried various methods: group_by (I can get the count, but not the middle column), inject, and plain iteration (which does not seem very functional), but I just can't get it right, partly because I don't have a good grasp of Ruby and the operations/functions available.
I have also had a good look around SO, but I am not getting very far.
Any help or pointers would be much appreciated!

Use Enumerable#group_by to do the grouping for you and use the resulting hash to get what you want with map or similar.
structs.group_by(&:token).map do |token, with_same_token|
  [token, with_same_token.map(&:ordinal), with_same_token.size]
end
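For example, with the sample data from the question, this returns one [token, ordinals, count] triple per distinct token:

# Struct and data from the question
Token = Struct.new(:token, :ordinal)
structs = [Token.new("C", 2), Token.new("CC", 3), Token.new("C", 5)]

structs.group_by(&:token).map do |token, with_same_token|
  [token, with_same_token.map(&:ordinal), with_same_token.size]
end
#=> [["C", [2, 5], 2], ["CC", [3], 1]]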

Related

Google Spreadsheet API returning grid limits error

I am trying to update a Google Sheet using the Ruby API (which is just a wrapper around the SheetsV4 API).
I am running into the following error:
Google::Apis::ClientError: badRequest: Range ('MySheet'!AA1) exceeds grid limits. Max rows: 1000, max columns: 26
I have found references to this problem on the Google forums; however, there did not seem to be a solution other than to use a different method to write to the spreadsheet.
The thing is, I need to copy an existing spreadsheet template, and enter my raw data in various sheets. So far I have been using this code (where service is a client of the Ruby SheetsV4 API):
def write_table(values, sheet: 'Sheet1', column: 1, row: 1, range: nil, value_input_option: 'RAW')
  google_range = begin
    if range
      "#{sheet}!#{range}"
    elsif column && row
      "#{sheet}!#{integer_to_A1_notation(column)}#{row}"
    end
  end
  value_range_object = ::Google::Apis::SheetsV4::ValueRange.new(
    range: google_range, values: values
  )
  service.update_spreadsheet_value(spreadsheet_id,
                                   google_range,
                                   value_range_object,
                                   value_input_option: value_input_option)
end
It was working quite well so far, but after adding more data to my extracts, I went past the 26th column (columns AA onwards) and now I am getting the error.
Is there some option to pass to update_spreadsheet_value so we can raise this limit?
Otherwise, what is the other way to write to the spreadsheet using append?
EDIT - A clear description of my scenario
I have a template Google spreadsheet with 8 sheets (tabs), 4 of which are titled RAW-XX, and these are where I try to update my data.
At the beginning, those raw tabs only have headers on 30 columns (A1 --> AD1)
My code needs to be able to fill all the cells A2 --> AD42
(1) for the first time
(2) and my code needs to be able to re-run again to replace those values with fresh ones, without appending
So basically I was thinking of using update_spreadsheet_value rather than append_xx because of requirement (2). But because of this bug/limitation (unclear) in the API, this does not work. Also important to note: I am not actually updating all those 30 columns in one go, but in several calls to the update method (with up to 10 columns each time).
I've thought that
- Maybe I am missing an option to send to the Google API to allow more than 26 columns in one go?
- Maybe this is actually an undocumented hard limitation of the update API
- Maybe I can resort to deleting existing data + using append
EDIT 2
Suppose I have a template at version 1 with multiple sheets. (Note that I am using =xx to indicate a formula, [empty] to indicate there is nothing in the cell, and 1 to indicate that the raw value "1" was supplied.)
Sheet1 - RAW
RAW Number of foos | RAW Number of Bars |
[empty] | [empty] |
Sheet2 - FORMATTED
Number of foos | Number of Bars
='Sheet1 - RAW'!A2 | ='Sheet1 - RAW'!B2
Now I call my app "for the first time": this copies the existing template to a new file, "generated_spreadsheet", and injects data into the RAW sheet. At this moment, my app says there is 1 foo and 0 bars:
Sheet1 - RAW
RAW Number of foos | RAW Number of Bars |
1 | 0 |
Sheet2 - FORMATTED
Number of foos | Number of Bars
='Sheet1 - RAW'!A2 | ='Sheet1 - RAW'!B2
If I call my app again later, the template AND the data may both have changed in the meantime, so I want to REPLACE everything in my "generated_spreadsheet".
Suppose the template has meanwhile become:
Sheet1 - RAW
RAW Number of foos | RAW Number of Bars |
[empty] | [empty] |
Sheet2 - FORMATTED
Number of foos | Number of Bars | All items
='Sheet1 - RAW'!A2 | ='Sheet1 - RAW'!B2 | =A2 + B2
Suppose now my app says there is still 1 foo and the number of bars went from 0 to 3; I want to update the "generated_spreadsheet" so it looks like:
Sheet1 - RAW
RAW Number of foos | RAW Number of Bars |
1 | 3 |
Sheet2 - FORMATTED
Number of foos | Number of Bars | All items
='Sheet1 - RAW'!A2 | ='Sheet1 - RAW'!B2 | =A2 + B2
How about using values.append? In my environment, I experienced the same situation as you. In order to avoid this issue, I used values.append.
Please modify as follows and try it again.
From:
service.update_spreadsheet_value(
To:
service.append_spreadsheet_value(
Reference:
Method: spreadsheets.values.append
If this was not the result you want, I'm sorry.
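As a minimal sketch (assuming service, spreadsheet_id, google_range, values and value_input_option are exactly as in the question's write_table), the write would become:

# Same flow as write_table, but using values.append instead of values.update
value_range_object = ::Google::Apis::SheetsV4::ValueRange.new(
  range: google_range, values: values
)
service.append_spreadsheet_value(spreadsheet_id,
                                 google_range,
                                 value_range_object,
                                 value_input_option: value_input_option)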
This is because the range is out of bounds.
AA1 means the column is AA, i.e. column 27, so the starting point AA1 does not exist in a 26-column sheet; that's why you get this error.
You can try Z1 instead, which should work.
worksheet.resize(2000)
will resize your sheet to 2000 rows
It happened to me as well when I didn't have the empty columns (I had removed all the empty columns from the spreadsheet). I simply added an empty one next to my last column and it worked.

Why do tabulate or summarize not take into account missing values when implemented inside a program?

As an illustrative example, suppose this is your dataset:
cat sex age
1 1 13
1 0 14
1 1 .
2 1 23
2 1 45
2 1 15
If you want to create a table of frequencies between cat and sex, you tabulate these two variables and you get the following result:
tab cat sex
| sex
cat | 0 1 | Total
-----------+----------------------+----------
1 | 1 2 | 3
2 | 0 3 | 3
-----------+----------------------+----------
Total | 1 5 | 6
I am writing a Stata program where the three variables are involved, i.e. cat, sex and age. Getting the matrix of frequencies for the first two variables is just an intermediate step that I need for further computation.
cap program drop myexample
program def myexample, rclass byable(recall) sortpreserve
    version 14
    syntax varlist [aweight iweight fweight] [if] [in] [, AGgregate]
    args var1 var2 var3
    tempname F
    marksample touse
    set more off
    if "`aggregate'" == "" {
        local var1: word 1 of `varlist'
        local var2: word 2 of `varlist'
        local var3: word 3 of `varlist'
        qui: tab `var1' `var2' [`weight' `exp'] if `touse', matcell(`F') label matcol(`var2')
        mat list `F'
    }
end
However, when I run:
myexample cat sex age
I get this result which is not what I expected:
__000001[2,2]
c1 c2
r1 1 1
r2 0 3
That is, given that age contains a missing value, even if it is not directly involved in the tabulation, the program ignores the missing value and does not take into account that observation. I need to get the result of the first tabulation. I have tried using summarize instead, but the same problem arises. When implemented inside the program, missing values are not counted.
You are complaining about behaviour which you built into your own program. The responsibility and the explanation are in your hands.
The effect of
marksample touse
followed by calling up a command with the qualifier
if `touse'
is to ignore missing values. marksample by default marks as "to use" those observations in which all variables specified have non-missing values; the other observations are marked as to be ignored. It also takes account of any if or in qualifiers and any zero weights.
It's also true, as @Noobie explains, that omitting missing values from a tabulation is the default for tabulate in any case.
So, to get the result you want you'd need to modify your marksample call to
marksample touse, novarlist
and to call up tabulate with the missing option (if it's compulsory) or to allow users to specify a missing option which you then pass to tabulate.
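As a minimal sketch (assuming you always want missing values counted, rather than exposing a user-level missing option), the two relevant lines inside myexample would become:

marksample touse, novarlist
qui: tab `var1' `var2' [`weight' `exp'] if `touse', matcell(`F') label matcol(`var2') missing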
You also ask about summarize. By design that command ignores missing values. I don't know what you would expect summarize to do about them. It could report a count of missing values. If you want that, several other commands will oblige, such as codebook or missings (Stata Journal). You can always include a report on missings in your program, such as using count to count the missings and display the result.
I understand your program to be very much work in progress, so won't comment on details you don't ask about.
This is caused by marksample. Rule 5 in help mark states
The marker variable is set to 0 in observations for which any of the
numeric variables in varlist contain a numeric missing value.
You should use the novarlist option. According to the help file,
novarlist is for use with marksample. It specifies that missing values
among variables in varlist not cause the marker variable to be set to 0.
If I understand correctly, you want tab to include missing values? If so, you just have to ask for it:
tab myvar1 myvar2, mi
from the documentation
missing : treat missing values like other values

Compact data structure for sorted array of pairs (integer, byte)?

I have quite a specific data set that I need to store in the most compact way as a byte array. It is a live stream of integers that are constantly increasing, often by one, but not always. Each integer value has a tag that is a byte value. There may be values with the same value and tag, but I need to store only distinct ones. The only supported operations are adding new elements, removing elements, and checking whether an element exists - I keep this data set to check if some pair has been 'seen' recently.
Some sample data:
# | value | tag |
1 | 1000 | 0 |
2 | 1000 | 1 |
3 | 1000 | 2 |
4 | 1001 | 0 |
5 | 1002 | 2 |
6 | 1004 | 1 |
7 | 1004 | 2 |
8 | 1005 | 0 |
As I said, this is a live stream, but I can tolerate storing only the last few thousand entries. The goal is to make the storage (and RAM footprint) as memory-efficient as possible; operations can be expensive.
If I had no tags, I could store ranges of values, (1000-1002), (1002-1005) etc; there are usually about 5-6 values in a row without gaps. But the tags mess all this up.
My current approach is to encode each value + tag pair in a few bytes - one byte for the tag and 1 or more bytes for the 'delta' from the previous value.
This way I need to store the first value, 1000 in the above case, and then I store deltas - 0 for #2 and #3, 1 for #4, 1 for #5, 2 for #6, etc.
Most deltas are small (1-10), so I can store them in one byte only - the first bit is a flag indicating whether the value is small enough to fit in 7 bits; if not, the next 7 bits store how many bytes the delta occupies.
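To make the current scheme concrete, here is a minimal sketch of that delta encoding (my own illustration, assuming unsigned 32-bit deltas stored little-endian; the original description gives no code):

// A delta < 128 is stored in a single byte with the high bit clear;
// otherwise the first byte has the high bit set and its low 7 bits
// give the number of following bytes that hold the delta.
#include <cstdint>
#include <vector>

void encode_delta(std::uint32_t delta, std::vector<std::uint8_t>& out) {
    if (delta < 0x80) {
        out.push_back(static_cast<std::uint8_t>(delta));               // flag bit = 0
    } else {
        std::vector<std::uint8_t> bytes;
        while (delta > 0) {
            bytes.push_back(static_cast<std::uint8_t>(delta & 0xFF));  // low byte first
            delta >>= 8;
        }
        out.push_back(static_cast<std::uint8_t>(0x80 | bytes.size())); // flag bit = 1, plus byte count
        out.insert(out.end(), bytes.begin(), bytes.end());
    }
}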
Maybe there is a better, more compact, approach?
Since you have only 127 different tag values, you could maintain 127 different tables, one for each tag, thus saving yourself from having to store the tags. In each table you could still use your nifty trick with deltas.
Let the pair (value, tag) where value is a uint32 and tag is a uint8 be a typical item stored in your data structure.
Use an associative array data structure that maps uint32 to an array list of uint16. In C++ terms, the data structure is the following.
std::map<std::uint32_t, std::vector<std::uint16_t>>
Each array list stays sorted with distinct values and never exceeds a size of 2^16.
Let D be an instance of this data structure. We store (value, tag) in the array list D[value >> 8] as (static_cast<std::uint16_t>(value) << 8) + tag.
The idea is basically that the data is paged. The most-significant 3 bytes of value determine the page, and then the least-significant byte of value and the single byte of tag are stored in the page.
This should exploit the structure of your data very efficiently because, assuming each page is holding many values, you're using 2 bytes per item.
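A minimal sketch of that paged layout (my own illustration of the idea above, not tested production code):

// Page key = top 3 bytes of value; each page stores
// (low byte of value << 8) | tag as a sorted, distinct uint16_t.
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

class SeenSet {
public:
    void insert(std::uint32_t value, std::uint8_t tag) {
        auto& page = pages_[value >> 8];
        const std::uint16_t packed = pack(value, tag);
        auto it = std::lower_bound(page.begin(), page.end(), packed);
        if (it == page.end() || *it != packed) page.insert(it, packed);  // keep sorted, distinct
    }

    bool contains(std::uint32_t value, std::uint8_t tag) const {
        auto it = pages_.find(value >> 8);
        if (it == pages_.end()) return false;
        return std::binary_search(it->second.begin(), it->second.end(), pack(value, tag));
    }

    void erase(std::uint32_t value, std::uint8_t tag) {
        auto it = pages_.find(value >> 8);
        if (it == pages_.end()) return;
        auto& page = it->second;
        const std::uint16_t packed = pack(value, tag);
        auto pos = std::lower_bound(page.begin(), page.end(), packed);
        if (pos != page.end() && *pos == packed) page.erase(pos);
        if (page.empty()) pages_.erase(it);  // drop empty pages to save space
    }

private:
    static std::uint16_t pack(std::uint32_t value, std::uint8_t tag) {
        return static_cast<std::uint16_t>(((value & 0xFFu) << 8) | tag);
    }
    std::map<std::uint32_t, std::vector<std::uint16_t>> pages_;
};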

Encode a string variable in non-alphanumeric order

I want to encode a string variable in such a way that assigned numerical codes respect the original order of string values (as shown when using browse). Why? I need encoded variable labels to get the correct variable names when using reshape wide.
Suppose var is a string variable with no labels:
var label(var)
"zoo" none
"abc" none
If you start with:
encode var, gen(var2)
the labels are 1="abc" 2="zoo" as can be seen with
label li
But I want the labels assigned in the order the values appear (as seen with browse), so that the order of the variables is unchanged later.
I didn't find an encode option in which the labels are added in the order I see when using browse.
My best idea is to do it by hand:
ssc install labutil
labvalch var, f(1 2) t(2 1)
This is nice, but I have >50 list entries.
Another approach: have reshape use a different order, but I don't think that works.
reshape wide x, i(id) j(var)
I only found
ssc install labutil
labmask regioncode, values(region)
as an alternative to encode, but I'm not able to get labmask to work with strings.
First off, it's a rule in Stata that string variables can't have value labels. Only numeric variables can have value labels. In essence, what you want as value labels are already in your string variable as string values. So, the nub of the problem is that you need to create a numeric variable with values in the right order.
Let's solve the problem in its easiest form: string values occur once and once only. So
gen long order = _n
labmask order, values(var)
then solves the problem, as the numeric values 1, 2, ... are linked with the string values zoo, abc, whatever, which become value labels. Incidentally, a better reference for labmask, one of mine, is
http://www.stata-journal.com/sjpdf.html?articlenum=gr0034
Now let's make it more complicated. String values might occur once or more times, but we want the numeric variable to respect first occurrence in the data.
gen long order1 = _n
egen order2 = min(order1), by(var)
egen order = group(order2)
labmask order, values(var)
Here's how that works.
gen long order1 = _n
puts the observation numbers 1, 2, whatever in a new variable.
egen order2 = min(order1), by(var)
finds the first occurrence of each distinct value of var.
egen order = group(order2)
maps those numbers to 1, 2, whatever.
labmask order, values(var)
links the numeric values of order and the string values of var, which become its value labels.
Here is an example of how that works in practice.
. l, sep(0)
+---------------------------------+
| var order1 order2 order |
|---------------------------------|
1. | zoo 1 1 zoo |
2. | abc 2 2 abc |
3. | zoo 3 1 zoo |
4. | abc 4 2 abc |
5. | new 5 5 new |
6. | newer 6 6 newer |
+---------------------------------+
. l, nola sep(0)
+---------------------------------+
| var order1 order2 order |
|---------------------------------|
1. | zoo 1 1 1 |
2. | abc 2 2 2 |
3. | zoo 3 1 1 |
4. | abc 4 2 2 |
5. | new 5 5 3 |
6. | newer 6 6 4 |
+---------------------------------+
You would drop order1 order2 once you have got the right answer.
See also sencode for another solution. (search sencode to find references and download locations.)
The user-written command sencode (super encode) by Roger Newson, available by running ssc describe sencode, can be used for what you want. Instead of assigning numerical codes based on the alphanumeric order of the string variable, they can be assigned in the order in which the values appear in the original dataset.
clear all
set more off
*------- example data ---------
input str10 var
abc
zoo
zoo
zoo
elephant
elephant
abc
abc
elephant
zoo
end
*------- encode ---------------
encode var, generate(var2)
sencode var, generate(var3)
list, separator(0)
list, separator(0) nolabel
The variable var3 is in the desired form. Contrast that with var2.
I'm not sure if there's an elegant solution, because I think that levelsof orders strings alphabetically.
As long as your list is unique this should work.
clear
input str3 myVar
"zoo"
"abc"
"def"
end
* for reshape
generate iVar = 1
generate jVar = _n
* reshape to wide
reshape wide myVar, i(iVar) j(jVar)
list
* create label
local i = 0
foreach v of varlist myVar* {
    local ++i
    local myVarName = `v'
    label define myLabel `i' "`myVarName'", add
}
* reshape back to long
reshape long myVar, i(iVar) j(myVarEncoded)
* assign label
label value myVarEncoded myLabel

hadoop stream, how to set partition?

I'm very new to Hadoop Streaming and am having some difficulties with partitioning.
Depending on what is found in a line, my mapper function either returns
key1, 0, somegeneralvalues # some kind of "header" line where linetype = 0
or
key1, 1, value1, value2, othervalues... # "data" line, different values, linetype =1
To properly reduce I need to group all lines having the same key1, and to sort them by value1, value2, and the linetype ( 0 or 1), something like:
1 0 foo bar... # header first
1 1 888 999.... # data line, with lower value1
1 1 999 111.... # a few datalines may follow. Sort by value1,value2 should be performed
------------ #possible partition here, and only here in this example
2 0 baz foobar....
2 1 123 888...
2 1 123 999...
2 1 456 111...
Is there a way to ensure such partitioning? So far I've tried to play with options such as
-partitioner,'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner'
-D stream.num.map.output.key.fields=4 # please use 4 fields to sort data
-D mapred.text.key.partitioner.options=-k1,1 # please make partitions based on first key
or alternatively
-D num.key.fields.for.partition=1 # Seriously, please group by key1 !
which so far have only brought rage and despair.
If it's worth mentioning, my scripts work properly if I use cat data | mapper | sort | reduce
and I'm using the Amazon Elastic MapReduce Ruby client, so I'm passing the options with
--arg '-D','options' for the ruby script.
Any help would be highly appreciated! Thanks in advance.
Thanks to ryanbwork I've been able to solve this problem. Yay!
The right idea was indeed to create a key consisting of a concatenation of the values. To go a little further, it is also possible to create a key that looks like
<'1.0.foo.bar', {'0','foo','bar'}>
<'1.1.888.999', {'1','888','999'}>
Options can then be passed to hadoop so that it partitions by the first "part" of the key. If I'm not mistaken in my interpretation, it looks like:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-D stream.map.output.field.separator=. # I added some "." in the key
-D stream.num.map.output.key.fields=4 # 4 "sub-fields" are used to sort
-D num.key.fields.for.partition=1 # only one field is used to partition
This solution, based on what ryanbwork said, makes it possible to use more than one reducer while ensuring the data is properly split and sorted.
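Put together (and assuming the EMR Ruby client passes the arguments through unchanged; mapper.rb, reducer.rb and the input/output paths below are placeholders, not from the original post), the streaming invocation would look roughly like:

hadoop jar /path/to/hadoop-streaming.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D num.key.fields.for.partition=1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -mapper mapper.rb \
    -reducer reducer.rb \
    -input input/ \
    -output output/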
After reading this post I'd propose modifying your mapper such that it returns pairs whose 'keys' include your key value, your linetype value, and the value1/value2 values all concatenated together. You'd keep the 'value' part of the pair the same. So for example, you'd return the following pairs to represent your first two examples:
<'10foobar',{'0','foo','bar'}>
<'11888999',{'1','888','999'}>
Now if you were to utilize a single reducer, all of your records would get sent to the same reduce task and sorted in alphabetical order based on their 'key'. This would fulfill your requirement that pairs get sorted by key, then by linetype, then by value1 and finally value2, and you could access these values individually in the 'value' portion of the pair. I'm not very familiar with the different built-in partitioner/sort classes, but I'd assume you could just use the defaults and get this to work.
