confirm conditional statement applies to >0 observations in Stata - validation

This is something that has puzzled me for some time and I have yet to find an answer.
I am in a situation where I am applying a standardized data cleaning process to (supposedly) similarly structured files, one file for each year. I have a statement such as the following:
replace field = "Plant" if field == "Plant & Machinery"
This line was written based on the data file for year 1. I then generalize the code to loop through the years of data. The problem arises if, in year 3, the analogous value in that variable was coded as "Plant and MachInery " (different wording, case, and a trailing space), so the line above would not make the intended change, yet would not raise any error alerting me that the change was not made.
What I am after is some sort of confirmation that more than zero observations actually satisfied the condition each time the line is executed in the loop, and an error otherwise. No combination of trimming, removing spaces, and standardizing the text case is an acceptable workaround here. At the same time, I don't want to add a count if plus an assert statement before every conditional replace, as that becomes quite bulky.
Aside from going to the raw files to ensure the variable values are standardized, is there any way to do this validation "on the fly" as I have tried to describe? Maybe just write a custom program that combines a count if, an assert, and a replace?

The idea has surfaced occasionally that replace should return the number of observations changed, but there are good reasons why not: notably, it is not an r-class or e-class command anyway, and it's quite important not to change the way it works, because that could break innumerable programs and do-files.
So, I think the essence of any answer is that you have to set up your own monitoring process counting how many values have been (or would be) changed.
One pattern, when working on a current variable, is:
gen was = ""
foreach ... {
    ...
    replace was = current
    replace current = ...
    quietly count if was != current
    <use the result, left in r(N)>
}
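Alternatively, following the asker's own suggestion, the check can be packaged in a wrapper program. A minimal sketch; the name replaceassert and its call syntax are my own illustration, not a built-in command:

capture program drop replaceassert
program define replaceassert
    // usage: replaceassert <varname> = <expression> if <condition>
    // count how many observations satisfy the condition; fail loudly if none
    syntax anything(equalok) [if]
    quietly count `if'
    if r(N) == 0 {
        display as error `"replaceassert: no observations satisfy `if'"'
        exit 459
    }
    replace `anything' `if'
end

* example call
replaceassert field = "Plant" if field == "Plant & Machinery"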


How to load value from dynamically specified parameter in NiFi

I have several processes with almost the same flow, like "get some parameters, extract data from a database according to them, and upload the results to the target". The parameters and targets vary slightly across processes, but only a bit; most of the flow is the same. I would like to extract those differences into a parameter context and load them dynamically. My idea is to have parameters defined in the following way and then use them.
So the core of the question is:
How to dynamically choose which parameter group to load and use?
Having several parameter contexts with same-named but different-valued parameters and switching between them dynamically would probably be best, but as far as I know that is not possible.
Also, duplicating flows is off the table: any error correction would be spread out over several places, and maintenance would be a nightmare.
Moreover, I know I can do it like "in GenerateFlowFile for process A set value1=#{A_value1}, and in GenerateFlowFile for process B set value1=#{B_value1}". But this is tedious, error-prone, and scales rather badly, not to speak of situations with dozens of parameters and several processes. It is also a kind of hardcoding, not configuring...
I was hoping for something like defining group=A and then using it as value1=#{ ${ group:append('_value1') } }, but this does not work - it is evaluated as a parameter literally named ${ group:append('_value1') }.
TL;DR: Use evaluateELString().
The actual solution is to set group=A in the GenerateFlowFile processor and, in the next UpdateAttribute processor, to set the following:
value1=${ group:prepend('hash{ '):append('_value1 }'):replace('hash', '#'):evaluateELString() }
The magic being done here is: "take the value of group, wrap it in #{ and _value1 } to make a valid NiFi Expression Language statement, and then evaluate it". Since prepend runs first, for group=A the value becomes hash{ A, then hash{ A_value1 }, then #{ A_value1 } after the replace, which evaluateELString() finally evaluates. (The word hash and the replace function are there because I didn't manage to escape the # character right before {.)
If you would like to have value1 at the beginning of the statement, you can use the following code. The result is the same; it is easier to use (the often-changed value1 sits at the beginning of the statement) but less readable in the "what is really going on?" sense.
value1=${ literal('value1'):prepend('_'):prepend(${ group }):prepend('hash{ '):append(' }'):replace('hash', '#'):evaluateELString() }

Report Builder Expressions

I'm new to Report Builder and am having issues with some expressions I'm trying to implement in a report. I got the standard ones to work, but as soon as I try anything more involved, I get error messages. Over the last couple of weeks I've tried many combinations, read the expression help, googled, and looked at other questions on internet sites. To reduce my frustration, I would even jump to other expressions and walk away, hoping to come back with different insight.
It's probably something simple, or something I don't know about writing expressions.
I'm hoping that someone can help with these expressions; these are the versions that give me the fewest errors (usually just "expression expected") and they show what I'm trying to accomplish.
=IIF((Fields!RECORDFLAG.Value)='D',COUNTDISTINCT(Fields!TICKETNUM.Value),0)
=IIF((Fields!TRANSTYPE.Value)='1' and (Fields!RECORDFLAG.VALUE)='A' or
'B',SUM(Fields!DOLLARS.Value),0)
=IIF((Fields!TRANSTYPE.Value)='1' and
(Fields!RECORDFLAG.VALUE)='P',SUM(Fields!DOLLARS.Value),0)
=Sum([DOLLARS] case when [RECORDFLAG]='P' then -1*[DOLLARS])
Thank You.
=IIF((Fields!RECORDFLAG.Value)="D",COUNTDISTINCT(Fields!TICKETNUM.Value))
The error message gives you the answer here - no false part of the iif() has been specified. Use =IIF((Fields!RECORDFLAG.Value)="D",COUNTDISTINCT(Fields!TICKETNUM.Value),0)
=IIF((Fields!TRANSTYPE.Value)="1" and (Fields!RECORDFLAG.VALUE)="A" or "B",SUM(Fields!DOLLARS.Value),0)
This is not how an OR works in SSRS. Use:
=IIF((Fields!TRANSTYPE.Value)="1" and (Fields!RECORDFLAG.VALUE="A" or Fields!RECORDFLAG.Value = "B"),SUM(Fields!DOLLARS.Value),0)
The 0s are returned due to your report design. countdistinct() is an aggregate function - it's meant to be used on a set of data. However, your iif() is only testing on a per-row basis - you're basically saying "if the current row is this thing, count all the distinct values", which doesn't make sense. There are a couple of ways forward:
You can count the number of times a certain value occurs in a given condition using a sum(). This is not the same as the countdistinct(), but if you use =sum(iif(Fields!RECORDFLAG.Value = "D", 1, 0)) then you will get the number of times RECORDFLAG is D in that set. Note: this requires the data to be aggregated (so in SSRS, grouped in a tablix).
You can use custom code to count distinct values in a set. See https://itsalocke.com/aggregate-on-a-lookup-in-ssrs/. You can apply this even if you have only one dataset - just reference the same one twice.
You can change the way your report works. You can group on Fields!RECORDFLAG.Value and filter the group to where Fields!RECORDFLAG.Value = "D". Then in your textbox, use =countdistinct(Fields!TICKETNUM.Value) to get the distinct values for TICKETNUM when RECORDFLAG is D.
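As for the last expression in the question, which mixes SQL CASE syntax into an SSRS expression: assuming the intent is to sum DOLLARS while negating it when RECORDFLAG is "P" (my reading of the snippet, not confirmed by the asker), an SSRS equivalent would be along these lines:
=SUM(IIF(Fields!RECORDFLAG.Value = "P", -1 * Fields!DOLLARS.Value, Fields!DOLLARS.Value))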

Hide Labels with No Data in SPSS

I just started using SPSS. There is an option called Select Cases that I was trying out, and later on I was finding frequencies based on that filter.
For example:
Suppose Q1 has 12 parts, Q1_1 Q1_2 Q1_3 Q1_4 Q1_5 Q1_6 Q1_7 Q1_8 Q1_9 Q1_10 Q1_11 Q1_12
I want to see the data in these variables based on a condition that I used in Select Cases. Now, when I try to see frequencies of these variables based on the filter, only 4 out of 12 have data.
My question is: can I hide the other 8 and show only the 4 with data in my output window?
It's not entirely clear what you are trying to describe; however, reading between the lines, I'm guessing you are trying to delete tables generated by FREQUENCIES which happen to be empty (likely due to a filter being applied, but perhaps not necessarily).
You could do this with SPSS scripting, but avoiding that, you may want to explore CTABLES: though its output may not be in exactly the same format as a FREQUENCIES table, it will nonetheless retrieve the same information.
Solution below. It assumes the Python integration and the SPSSINC SELECT VARIABLES extension are installed, and of course the CTABLES add-on module.
/****** Simulate example data ******/.
input program.
loop #j = 1 to 100.
compute ID=#j.
vector Q(12).
loop #i = 1 to 12.
do if #j<51 and #i<9.
compute Q(#i) = $sysmis.
else.
compute Q(#i) = trunc(rv.uniform(1,5)).
end if.
end loop.
end case.
end loop.
end file.
end input program.
execute.
/************************************/.
/* frequencies without filtering applied */.
freq q1 to q12.
/* frequencies WITH filtering applied */.
/* The empty tables here should be removed */.
temp.
select if (ID<51).
freq q1 to q12.
spssinc select variables macroname="!Qp" /properties pattern = "^Q\d+$"/options separator="+" order=file.
spssinc select variables macroname="!Qs" /properties pattern = "^Q\d+$"/options separator=" " order=file.
temp.
select if (ID<51).
ctables /table (!Qp)[c][count colpct]
/categories variables=!Qs empty=exclude.
Note: if you had to assess empty variables at a total level, then there is a function in spssaux2 (spssaux2.FindEmptyVars) which could help you find the empty variables; you could then build the syntax to exclude these, retaining only the variables with valid responses, and run FREQUENCIES. But I don't think spssaux2.FindEmptyVars will honor any filtering.
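For completeness, a minimal sketch of that spssaux2 route. It assumes FindEmptyVars() can be called with no arguments and returns the names of variables with no valid values; check the module's documentation for the exact signature:

begin program.
import spss, spssaux2
# Assumption: FindEmptyVars() returns the names of empty variables.
empty = set(spssaux2.FindEmptyVars())
# Keep only the variables in the active dataset that have valid values.
allvars = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
keep = [v for v in allvars if v not in empty]
if keep:
    spss.Submit("FREQUENCIES " + " ".join(keep) + ".")
end program.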

Format statement with unknown columns

I am attempting to use Fortran to write out a comma-delimited file for import into another commercial package. The issue is that I have an unknown number of data columns. My output needs to look like this:
a_string,a_float,a_different_float,float_array_elem1,float_array_elem2,...,float_array_elemn
which would result in something that might look like this:
L1080,546876.23,4325678.21,300.2,150.125,...,0.125
L1090,563245.1,2356345.21,27.1245,...,0.00983
I have three issues. One, I would prefer the elements to be tightly grouped (variable column width); two, I do not know how to define a variable number of array elements in the format statement; and three, the array elements can span a large range - maybe 12 orders of magnitude. The following code conceptually does what I want, but the variable 'n' and the lack of column-width definitions throw errors (of course):
WRITE(50,900) linenames(ii),loc(ii,1:2),recon(ii,1:n)
900 FORMAT(A,',',F,',',F,n(',',F))
(I should note that n is fixed at run-time.) The write statement does what I want when I use WRITE(50,*), except that the output is width-delimited.
I think this thread almost answered my question, but I got quite confused: SO. Right now I have a shell script with awk fixing the issue, but that solution is...inelegant. I could do some manipulation to make the output a string, and then just write it, but I would rather like to avoid that option if at all possible.
I'm doing this in Fortran 90 but I like to try to keep my code as backwards-compatible as possible.
The format closest to what you want is F0.3, which gives no leading spaces and a fixed number of decimal places. If you also want to lop off trailing zeros, I think you'll need to do a good bit of work.
The repeat count in your format can be larger than the number of data values, so one (old-school) approach is to put a big number there, e.g. 100000. Modern Fortran does have syntax to specify indefinite repeat; I'm sure someone will offer that up.
----edit
The unlimited repeat count is, as you might guess, an asterisk, and is evidently brand new in Fortran 2008.
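Putting those two remarks together, here is a minimal sketch built on the question's own write statement (my elaboration, using the question's variable names). The colon edit descriptor stops format processing once the I/O list is exhausted, which avoids printing a trailing comma:

WRITE(50,900) linenames(ii), loc(ii,1:2), recon(ii,1:n)
900 FORMAT(A,2(',',F0.2),100000(:,',',F0.3))
! Fortran 2008 alternative with an unlimited repeat count:
! 900 FORMAT(A,2(',',F0.2),*(:,',',F0.3))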
In order to make sure that no space occurs between the entries in your line, you can write them separately into character variables and then print those using the adjustl() function in Fortran:
program csv
  implicit none

  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nn = 3
  real(dp), parameter :: floatarray(nn) = [ -1.0_dp, -2.0_dp, -3.0_dp ]
  integer :: ii
  character(30) :: buffer(nn+2), myformat

  ! Create format string with appropriate number of fields.
  write(myformat, "(A,I0,A)") "(A,", nn + 2, "(',',A))"

  ! You should execute the following lines in a loop for every line you want to output
  write(buffer(1), "(F20.2)") 1.0_dp ! a_float
  write(buffer(2), "(F20.2)") 2.0_dp ! a_different_float
  do ii = 1, nn
    write(buffer(2+ii), "(F20.3)") floatarray(ii)
  end do
  write(*, myformat) "a_string", (trim(adjustl(buffer(ii))), ii = 1, nn + 2)

end program csv
The demonstration above is only for one output line, but you can easily write a loop around the appropriate block to execute it for all your output lines. Also, you can choose different numerical format for the different entries, if you wish.

sed optimization (large file modification based on smaller dataset)

I have to deal with very large plain-text files (over 10 gigabytes - yeah, I know it depends on what we should call large), with very long lines.
My most recent task involves some line editing based on data from another file.
The data file (which should be modified) contains 1,500,000 lines, each of them e.g. 800 characters long. Each line is unique and contains exactly one identity number; each identity number is unique.
The modifier file is e.g. 1800 lines long; each line contains an identity number plus an amount and a date which should be modified in the data file.
I just transformed the modifier file (with Vim regex) into a sed script, but it's very inefficient.
Let's say I have a line like this in the data file:
(some 500 character)id_number(some 300 character)
And I need to modify data in the 300 char part.
Based on the modifier file, I come up with sed lines like this:
/id_number/ s/^\(.\{650\}\).\{20\}/\1CHANGED_AMOUNT_AND_DATA/
So I have 1800 lines like this.
But I know that, even on a very fast server, if I do
sed -i.bak -f modifier.sed data.file
it's very slow, because sed has to try every pattern against every line.
Isn't there a better way?
Note: I'm not a programmer, and have never learnt (in school) about algorithms.
I can use awk, sed, an outdated version of perl on the server.
My suggested approaches (in order of desirability) would be to process this data as:
A database (even a simple SQLite-based DB with an index will perform much better than sed/awk on a 10GB file)
A flat file containing fixed record lengths
A flat file containing variable record lengths
Using a database takes care of all those little details that slow down text-file processing (finding the record you care about, modifying the data, storing it back to the DB). Take a look at DBD::SQLite in the case of Perl; a sketch follows.
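A minimal sketch of that route (the table layout, the id offset/width, and the modifier-file format below are illustrative assumptions, since the question doesn't specify them):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# connect to (or create) a SQLite file database
my $dbh = DBI->connect("dbi:SQLite:dbname=records.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, line TEXT)");

# one-time load of the big data file, keyed by the embedded id
my $ins = $dbh->prepare("INSERT INTO records (id, line) VALUES (?, ?)");
open my $fh, '<', 'data.file' or die $!;
while (my $line = <$fh>) {
    chomp $line;
    $ins->execute(substr($line, 500, 12), $line);   # id offset/width assumed
}
close $fh;

# apply each of the ~1800 modifications through the primary-key index
my $upd = $dbh->prepare(
    "UPDATE records SET line = substr(line, 1, 650) || ? WHERE id = ?");
open my $mf, '<', 'modifier.file' or die $!;
while (<$mf>) {
    chomp;
    my ($id, $new_tail) = split /\t/;               # modifier layout assumed
    $upd->execute($new_tail, $id);
}
close $mf;
$dbh->commit;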
If you want to stick with flat files, you'll want to maintain an index manually alongside the big file so you can more easily look up the record numbers you'll need to manipulate. Or, better yet, perhaps your ID numbers are your record numbers?
If you have variable record lengths, I'd suggest converting to fixed record lengths (since it appears only your ID is variable-length). If you can't do that, perhaps the existing data will never move around in the file? Then you can maintain that previously mentioned index and add new entries as necessary, with the difference that instead of pointing to a record number, the index now points to the absolute position in the file.
I suggest a program written in Perl (I am not a sed/awk guru and don't know exactly what they are capable of).
Your "algorithm" is simple: first of all, you need to construct a hashmap which gives you the new data string to apply for each ID. This is achieved by reading the modifier file, of course.
Once this hashmap is populated, you can browse each line of your data file, read the ID in the middle of the line, and generate the new line as you've described above.
I am not a Perl guru either, but I think the program will be quite simple. If you need help writing it, ask for it :-)
With perl you should use substr to get id_number, especially if id_number has constant width.
my $id_number = substr($str, 500, $id_number_length);
After that, if $id_number is among the IDs to be modified, you should use substr to replace the remaining text.
substr($str, -300,300, $new_text);
Perl's regular expressions are very fast, but not in this case.
My suggestion is: don't use a database. A well-written Perl script will outperform a database by an order of magnitude in this sort of task. Trust me, I have a lot of practical experience with this. You will not even have finished importing the data into a database by the time the Perl script is done.
When you have 1,500,000 lines of 800 chars, that comes to about 1.2GB. On a very slow disk (30MB/s) you will read it in 40 seconds; at 50MB/s it's 24s, at 100MB/s 12s, and so on. Perl hash lookup (like a db join) speed on a 2GHz CPU is above 5M lookups/s, so the CPU-bound work takes seconds while the IO-bound work takes tens of seconds. If it is really 10GB the numbers change, but the proportion stays the same.
You have not specified whether the modification changes the data size (i.e. whether it can be done in place), so we will not assume it and will work as a filter. You have also not specified the format of your "modifier file" or what sort of modification to make. Assume that it is tab-separated, something like:
<id><tab><position_after_id><tab><amount><tab><data>
We will read data from stdin, write to stdout, and the script can be something like this:
my $modifier_filename = 'modifier_file.txt';

open my $mf, '<', $modifier_filename or die "Can't open '$modifier_filename': $!";
my %modifications;
while (<$mf>) {
    chomp;
    my ($id, $position, $amount, $data) = split /\t/;
    $modifications{$id} = [$position, $amount, $data];
}
close $mf;

# make matching regexp (use quotemeta to prevent regexp meaningful characters)
my $id_regexp = join '|', map quotemeta, keys %modifications;
$id_regexp = qr/($id_regexp)/;    # compile regexp

while (<>) {
    next unless m/$id_regexp/;
    next unless $modifications{$1};
    my ($position, $amount, $data) = @{$modifications{$1}};
    # replace $amount characters, starting $position after the matched id,
    # with the new $data; $+[1] is the offset just past the matched id
    substr $_, $+[1] + $position, $amount, $data;
}
continue { print }
On my laptop it takes about half a minute for 1.5 million rows, 1800 lookup ids, and 1.2GB of data. For 10GB it should not take over 5 minutes. Is that reasonably quick for you?
If you start to think you are not IO bound (for example if you use some NAS) but CPU bound, you can sacrifice some readability and change to this:
my $mod;
while (<>) {
    next unless m/$id_regexp/;
    $mod = $modifications{$1};
    next unless $mod;
    substr $_, $+[1] + $mod->[0], $mod->[1], $mod->[2];
}
continue { print }
You should almost certainly use a database, as MikeyB suggested.
If you don't want to use a database for some reason, and if the list of modifications fits in memory (as it currently will at 1800 lines), the most efficient method is a hash table populated with the modifications, as suggested by yves Baumes.
If you get to the point where even the list of modifications becomes huge, you need to sort both files by their IDs and then perform a list merge (see the sketch after this answer) - basically:
1. Compare the ID at the "top" of the input file with the ID at the "top" of the modifications file.
2. Adjust the record accordingly if they match.
3. Write it out.
4. Discard the "top" line from whichever file had the (alphabetically or numerically) lowest ID and read another line from that file.
5. Go to step 1.
Behind the scenes, a database will almost certainly use a list merge if you perform this alteration using a single SQL UPDATE command.
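A hedged Perl sketch of that merge. It assumes both files are pre-sorted by id, a fixed id position in the data line, and a tab-separated id<TAB>replacement modifier file; all of these are illustrative assumptions, not the asker's actual layout:

#!/usr/bin/perl
use strict;
use warnings;

# id position/width within the data line are assumed, as in the question
sub data_id { substr $_[0], 500, 12 }

open my $data, '<', 'data.sorted'     or die $!;
open my $mods, '<', 'modifier.sorted' or die $!;

my $mod = <$mods>;
while (my $line = <$data>) {
    chomp $line;
    my $id = data_id($line);
    # discard modifier lines whose id sorts below the current record
    $mod = <$mods> while defined $mod && (split /\t/, $mod)[0] lt $id;
    if (defined $mod) {
        chomp(my ($mid, $replacement) = split /\t/, $mod);
        # rewrite the trailing 300-character field when the ids match
        substr($line, -300, 300, sprintf '%-300s', $replacement)
            if $mid eq $id;
    }
    print "$line\n";
}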
