Keep labels of merged variables only - Stata

I have a database A. I want to merge it with a few variables from database B (which has hundreds of variables). All variables in B have labels. So, if I do:
use A.dta
merge 1:1 id using B.dta, keepusing(var1 var2)
I get all value labels from B copied into A.
If I do instead:
merge 1:1 id using B.dta, keepusing(var1 var2) nolabel
var1 and var2 have no labels in A.
There seems to be no option in merge that allows for a solution in between (i.e. to copy only the value labels of the merged variables).
A workaround would be to run:
labelbook, problems
label drop `r(notused)'
after the first method. Yet, this needs to be run every time a merge is done (and I am merging many, many times). It can also be quite slow (dataset B has many, many variables).
Another option would be to create a temporary dataset "B-minus" containing only the variables and value labels I want, and merge from it. But this also entails running the same time-consuming code above, so it's no different.
Is there a "better" way to achieve this?
MCVE:
webuse voter, clear
label list // two variables have value labels: candidat and inc (label names candidat and inc2)
drop candidat inc
label drop candidat inc2 // we drop all value labels
merge 1:1 pop frac using http://www.stata-press.com/data/r14/voter, nogen keepusing(candidat)
label list // instead of having only the candidat label, we also have inc2

There is no such option in merge, but you can simply use macro list manipulation:
webuse voter, clear
label list // two variables have value labels: candidat and inc (label names candidat and inc2)
drop candidat inc
label drop candidat inc2 // we drop all value labels
local labkeep candidat // define which labels you want to keep
merge 1:1 pop frac using http://www.stata-press.com/data/r14/voter, nogen keepusing(candidat)
quietly label dir
local secondary "`r(names)'"
display "`secondary'"
local newlabels : list secondary - labkeep
display "`newlabels'"
label drop `newlabels'
label list
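The key line is local newlabels : list secondary - labkeep, which uses Stata's extended macro function for list subtraction: it returns the elements of secondary (all value labels in memory after the merge) that are not in labkeep, i.e. exactly the labels to drop.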

Update: Not using preserve/restore (thanks @Pearly Spencer for highlighting this) further improves the speed of the method. For the old code with preserve/restore, see earlier versions of this answer.
I think I found a faster method to solve the problem (at least judging by results using timer on, timer off).
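The comparison can be timed with Stata's timer commands, roughly like this (the starred comments stand in for each method):
timer clear
timer on 1
* method 1: merge, then labelbook, problems + label drop
timer off 1
timer on 2
* method 2: reversed merge from the smaller dataset
timer off 2
timer list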
So, to recap, the current slow approach is to merge the datasets and then drop all unused labels using
labelbook, problems
label drop `r(notused)'
An alternative, faster method is to load a smaller dataset containing only the needed variables; this dataset carries only the value labels of those variables. Then merge the original dataset into this smaller one. Importantly, the merge direction is reversed! This eliminates the need for preserve/restore, which, as @Pearly Spencer suggested, can slow things down a bit, particularly with larger datasets.
In terms of my original example, the code would be:
*** Open and work with dataset A ***
use A.dta // load original dataset
... // do stuff with it (added just for generality)
save A_final.dta // name of final dataset
*** Load dataset B with subset of needed variables only ***
use id var1 var2 using B.dta, clear // this loads id (needed for merging), var1 and var2 and their labels only!
*** Merge modified A dataset into smaller B dataset ***
merge 1:1 id using A_final.dta, keep(using match) // no keepusing() needed: all variables in A_final.dta are wanted
// IMPORTANT: to keep all observations of the original dataset (A, which is merged into B here),
// use "using" rather than "master" in the keep() option.
save A_final.dta, replace // Create final version of A. Done!
That's it! I'm not sure this is the optimal solution, but in my case, where I am merging many datasets which have hundreds of variables, it is way faster.
The code in terms of the MCVE would be:
*** Open original dataset and work with it ***
webuse voter, clear
label list // two variables have value labels: candidat and inc (label names candidat and inc2)
drop candidat inc
label drop candidat inc2 // we drop all value labels
save final.dta
*** Create temporary dataset ***
use pop frac candidat using http://www.stata-press.com/data/r14/voter, clear // this is key. Only load needed variables!
*** Merge temporary dataset with original one ***
merge 1:1 pop frac using final.dta, nogen
label list // we only have the "candidat" label! Success!
save final.dta, replace

Related

custom array printing in gdb

I know gdb has several means of exploring data, some of them quite convenient. However, I cannot combine them to get what I need/want. I would like to display a custom string based on the first n values of a big array starting at <PT_arr>, and the last m values of the same array, at an offset of (in this case) 4096. Looking something like this:
table beginning:
0x804cfe0 <PT_arr>: 0x00100300 0x00200300 0x00300300 0x00400300
table end:
0x804dfc0 <PT_arr+4064>: 0x00500300 0x00600300 0x00700300 0x00800300
printf lets me add custom text (like "table beginning:").
examine (x) gives me that nice alignment, lets me read many elements and group them by bytes, words, etc., and shows addresses on the left (which is ideal for my case).
x aligns the contents of regions of memory in an easy-to-read manner with its size and unit parameters. (what I want)
display prints continuously. (what I want)
The issue with display (manual) is that, unlike examine x (manual), it doesn't have a size or unit parameter.
Is there a way to accomplish that?
Thanks.
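One possible approach (a sketch, using the names from the example above): gdb's user-defined hook-stop runs arbitrary commands every time the program stops, so printf and x can be combined there and reprinted automatically, much like display:
define hook-stop
  printf "table beginning:\n"
  x/4xw &PT_arr
  printf "table end:\n"
  x/4xw (char *)&PT_arr + 4064
end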

Conditional Formatting based on Another Range

I want to set conditional formatting on a sheet with range A2:D15 using a custom formula that changes the cell background color. I have column F, which includes a list of names (F2:F13), and column G, which includes what class each name is (G2:G13). I want to compare each row: if the class in G2 = "Paladin" and F2 is not blank, then apply the conditional formatting. I want this to span all 12 rows in F and G, but I cannot pass an array using an IF function.
Example sheet: https://docs.google.com/spreadsheets/d/1a32ItT0HpRsov_oG5-CVHVe3HZV9WP-LypkxugsoK0g/edit?usp=sharing
I tried using this formula:
=if(and(not(isblank(F2)),G2="Paladin"),1)
It successfully changes the first result in my range because it happens to be true, but I need it to include the entire array, so I tried using this:
=if(and(not(isblank(F2:F13)),G2:G13="Paladin"),1)
I also played around with =if(and(F2=A2,G2="Paladin"),1) - same problem, I reckon, but more accurate if I could find a way to use arrays.
However, the IF function, as I understand it, cannot evaluate arrays. I tried using $ signs to play around with it, similar to this example I found: https://www.benlcollins.com/formula-examples/array-formula-intro/ - but that uses numerical data, and when I use $ it applies the conditional formatting to the entire row, the entire column, or the entire range of A3:D16.
You will need 4 rules, one for each column:
=FILTER(A2, COUNTIF(FILTER(F$2:F,G$2:G="Paladin"), A2))
=FILTER(B2, COUNTIF(FILTER(F$2:F,G$2:G="Paladin"), B2))
=FILTER(C2, COUNTIF(FILTER(F$2:F,G$2:G="Paladin"), C2))
=FILTER(D2, COUNTIF(FILTER(F$2:F,G$2:G="Paladin"), D2))
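Each rule filters the names in column F down to those whose class in column G is "Paladin", then uses COUNTIF to check whether the value of the rule's top-left cell appears in that list. Apply each formula as a "Custom formula is" rule on its own column (the first on A2:A15, the second on B2:B15, and so on); the relative references then adjust down the column automatically.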

(Using Julia) How can I reduce my data matrix by averaging values from the same hour?

I am trying to reduce the size of my data and I cannot make it work. I have data points taken every minute over one month, and I want to reduce this to one sample per hour. The problem is that some of my rows have "NA" values, so I delete them; as a result, there are not exactly 60 points in every hour - it varies.
I have a 'Timestamp' column. I have used this to make a 'datehour' column which has the same value if the data set has the same date and hour. I want to average all the values with the same 'datehour' value.
How can I do this? I have tried the if/for loop below, but it takes too long to run.
Thanks for all your help! I am new to Julia and come from a Matlab background.
======= CODE ==========
uniquedatehour = unique(datehour, 1)
index = []
avedata = reshape([], 0, length(alldata[1, :]))
for j in uniquedatehour
    for i in 1:length(datehour)
        if datehour[i] == j
            index = vcat(index, i)
        else
            rows = alldata[index, :]
            rows = convert(Array{Float64,2}, rows)
            avehour = mean(rows, 1)
            avedata = vcat(avedata, avehour)
            index = []
            continue
        end
    end
end
There are several layers to optimizing this code. I am assuming that your data is sorted on datehour (your code assumes this).
Layer one: general recommendation
Wrap your code in a function. Executing code in global scope in Julia is much slower than within a function. When wrapping it, either pass the data to your function as arguments or, if it stays in global scope, qualify it with const;
Layer two: recommendations to your algorithm
A statement like index = [] creates an array of type Any, which is slow; use a typed constructor like index = Int[] to make it fast;
Using vcat like index = vcat(index, i) is inefficient; it is better to do push!(index, i) in place;
It is better to preallocate avedata, e.g. with fill(NA, length(uniquedatehour), size(alldata, 2)), and assign values into the existing matrix than to vcat onto it;
If I am not mistaken, your code will produce incorrect results, as it never processes the last entry of the uniquedatehour vector (assume it has only one element and check what happens: avedata will have zero rows);
The line rows = convert(Array{Float64,2}, rows) is probably not needed at all; if alldata is not a Matrix{Float64}, it is better to convert it once at the beginning with Matrix{Float64}(alldata);
You can change the line rows = alldata[index,:] to a view, view(alldata, index, :), to avoid an allocation;
In general, you can avoid creating the index vector altogether: it is enough to remember the start position s and end position e of a run of equal values and then use the range s:e to select the rows you want, as in the sketch below.
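Putting those points together, a rough sketch of that approach (assuming datehour is sorted and alldata is already a Matrix{Float64}; NaN is used as the fill value to stay in Base):
function hourly_means(alldata::Matrix{Float64}, datehour::AbstractVector)
    groups = unique(datehour)
    avedata = fill(NaN, length(groups), size(alldata, 2))
    s = 1                                    # start of the current run
    for (k, g) in enumerate(groups)
        e = s                                # extend e to the end of the run of g
        while e < length(datehour) && datehour[e + 1] == g
            e += 1
        end
        avedata[k, :] = mean(view(alldata, s:e, :), 1)  # hourly means for this group
        s = e + 1
    end
    return avedata
end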
If you correct those things, please post your updated code and maybe I can help further, as there is still room for improvement, but it requires a somewhat different algorithmic approach (then again, you may prefer the option below for simplicity).
Layer three: how I would do it
I would use DataFrames package to handle this problem like this:
using DataFrames
df = DataFrame(alldata) # assuming alldata is Matrix{Float64}, otherwise convert it here
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean) # maybe this is all you need, if a DataFrame is OK for you
Matrix(agg[2:end]) # here is how you can convert DataFrame back to a matrix
This is not the fastest solution (as it converts to a DataFrame and back), but it is much simpler for me.

Hide Labels with No Data in SPSS

I just started using SPSS. I was trying the Select Cases option, and later on finding frequencies based on that filter.
For Eg:
Suppose Q1 has 12 parts, Q1_1 Q1_2 Q1_3 Q1_4 Q1_5 Q1_6 Q1_7 Q1_8 Q1_9 Q1_10 Q1_11 Q1_12
I want to see the data in these variables based on a condition I used in Select Cases. Now, when I look at the frequencies of these variables under the filter, only 4 out of 12 have data.
My question: can I hide the remaining 8 and show only the 4 with data in my output window?
It's not entirely clear what you are trying to describe; however, reading between the lines, I'm guessing you want to delete tables generated by FREQUENCIES which happen to be empty (likely due to a filter being applied, but perhaps not necessarily).
You could do this with SPSS scripting, but avoiding that, you may want to explore CTABLES: though its output may not be in exactly the same format as a FREQUENCIES table, it will nonetheless retrieve the same information.
Solution below. It assumes the Python integration plug-in, the SPSSINC SELECT VARIABLES extension, and of course the CTABLES add-on module.
/****** Simulate example data ******/.
input program.
loop #j = 1 to 100.
  compute ID = #j.
  vector Q(12).
  loop #i = 1 to 12.
    do if #j < 51 and #i < 9.
      compute Q(#i) = $sysmis.
    else.
      compute Q(#i) = trunc(rv.uniform(1,5)).
    end if.
  end loop.
  end case.
end loop.
end file.
end input program.
execute.
/************************************/.
/* frequencies without filtering applied */.
freq q1 to q12.
/* frequencies WITH filtering applied */.
/* Empty tables here should be removed */.
temp.
select if (ID<51).
freq q1 to q12.
spssinc select variables macroname="!Qp" /properties pattern = "^Q\d+$"/options separator="+" order=file.
spssinc select variables macroname="!Qs" /properties pattern = "^Q\d+$"/options separator=" " order=file.
temp.
select if (ID<51).
ctables /table (!Qp)[c][count colpct]
/categories variables=!Qs empty=exclude.
Note: if you had to assess empty variables at a total level, there is a function in spssaux2 (spssaux2.FindEmptyVars) which could help you find the empty variables; you could then build syntax that excludes them, so as to retain only the variables with valid responses, and run FREQUENCIES. But I don't think spssaux2.FindEmptyVars will honor any filtering.

Hashing table design in C

I have a design issue regarding HASH function.
In my program I am using a hash table of size 2^13, where the slot is calculated from the value of the node (the hash key) that I want to insert.
Now, say each of my nodes has two values, |A|B|; however, I insert into the hash table using A.
Later on, I want to search for a particular node by B, not A.
Is it possible to do it that way? If yes, could you highlight some design approaches?
The constraint is that I have to use A as the hash key.
Sorry, I can't share the code. Small example:
Value[] = {Part1, Part2, Part3};
insert(value)
check_for_index(value.part1)
value.part1 is used to calculate the index of the slot.
Once the slot is found, the "value" is inserted.
Later on,
search_in_hash(part2)
check_for_index("But here I need the value.part1 to check for slot index")
So, how can I relate part1, part2 and part3 such that later on I can find the slot by either part2 or part3?
If the problem statement is vague kindly let me know.
Unless you intend to search element by element (in which case you don't need a hash, just a plain list), what you are basically asking is: can I have a hash such that hash(X) == hash(Y) but X != Y, so that you could map to a location using part1 and then map to the same one using part2 or part3? That goes completely against what hashing stands for.
What you should do instead (as viraptor also suggested) is create 3 structures, each hashed on a different part of the value, and push the full value into all 3. Then, when you need to search, use the hash that matches the part you are searching by.
for e.g.:
value[] = {part1, part2, part3};
hash1.insert(part1, value)
hash2.insert(part2, value)
hash3.insert(part3, value)
then
hash2.search_in_hash(part2)
or
hash3.search_in_hash(part3)
The above two should return the exact same value, provided part2 and part3 belong to the same record.
Also make sure that all data manipulation (removing values, changing them) is done on all 3 structures simultaneously. For example:
value = hash2.search_in_hash(part2)
hash1.remove(value.part1)
hash2.remove(part2) // you can assert that part2 == value.part2
hash3.remove(value.part3)
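In C this could look something like the following minimal sketch (hypothetical names; integer parts, one chain per table via three next pointers, no removal or error handling shown):
#include <stddef.h>

#define TABLE_SIZE (1 << 13)              /* 2^13 slots, as in the question */

typedef struct value {
    int part1, part2, part3;
    struct value *next1, *next2, *next3;  /* one chain link per table */
} value_t;

static value_t *by_part1[TABLE_SIZE];     /* table hashed on part1 */
static value_t *by_part2[TABLE_SIZE];     /* table hashed on part2 */
static value_t *by_part3[TABLE_SIZE];     /* table hashed on part3 */

static size_t slot(int key) { return (size_t)key & (TABLE_SIZE - 1); }

/* Push the same node onto all three tables; the data is stored once. */
void insert(value_t *v) {
    size_t i1 = slot(v->part1), i2 = slot(v->part2), i3 = slot(v->part3);
    v->next1 = by_part1[i1]; by_part1[i1] = v;
    v->next2 = by_part2[i2]; by_part2[i2] = v;
    v->next3 = by_part3[i3]; by_part3[i3] = v;
}

/* Walk the part2 chain only; search_by_part3 would be symmetric. */
value_t *search_by_part2(int part2) {
    for (value_t *v = by_part2[slot(part2)]; v != NULL; v = v->next2)
        if (v->part2 == part2) return v;
    return NULL;
}
Because every table stores a pointer to the same node, a removal must unlink the node from all three chains before freeing it, which is exactly the simultaneous-update point above.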
