How to group observations in Stata - syntax

I'm a beginner with Stata, so this question might be easy for some of you.
I have a dataset with firm-specific data. One variable is Branche, which contains the following lines of business: Consumer, Utilities, Food/Beverage, Technology, Logistics/Transportation, Retail, Insurance, etc.
Now I want to form groups; for example, the group Consumer should contain Retail, Food/Beverage, and Consumer, but the command generate Consumer = Consumer Retail Food/Beverages doesn't work. Does anyone know what the right command would be? Thanks!

You can use the user-written string recode command strrec:
ssc install strrec
strrec Branche ("Consumer" "Retail" "Food/Beverage" = 1 "Consumer"), gen(trunk)
You will need to add additional categories as you see fit. This creates a new variable, trunk, that holds labeled integers.
You can refer to particular trunks like this:
list if trunk == 1
list if trunk == "Consumer":trunk
The reason I used an integer with value labels rather than a string is that some of the panel data commands do not like string IDs. I am guessing you are headed down that route.
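If you prefer to stay with built-in commands, a minimal sketch using inlist() and value labels does the same recode by hand (this assumes Branche is a string variable; the names trunk and trunklbl are just examples):
generate trunk = .
replace trunk = 1 if inlist(Branche, "Consumer", "Retail", "Food/Beverage")
label define trunklbl 1 "Consumer"
label values trunk trunklbl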

My flow fails for no reason: Invalid Template Language? This is what I do

Team,
Occasionally my flow fails, and running it again manually is enough to fix it. However, I want to prevent this error from occurring again so I can rest easy.
The error that appears is this:
Unable to process template language expressions in action 'Periodo' inputs at line '0' and column '0': 'The template language function 'split' expects its first parameter to be of type string. The provided value is of type 'Null'. Please see https://aka.ms/logicexpressions#split for usage details.'.
It appears in 2 of the 4 variables that I create: Client and Periodo. Both variables are built in the same way.
The formula for Client:
trim(first(split(first(skip(split(outputs('Compos'),'client = '),1)),'indicator')))
The formula for Periodo:
trim(first(split(first(skip(split(outputs('Compos'),'period = '),1)),'DATA_REPORT_DELIVERY')))
It's the same pattern for all 4 variables; all of them are strings (numbers).
I've also attached an example of the email I extract the info from:
CO NIV ICE REFRESCOS DE SOYA has finished successfully. CO NIV ICE REFRESCOS DE SOYA
User
binary.struggle#mail.com
Parameters
output = 7
country = 170
period = 202204012
DATA_REPORT_DELIVERY = NO
read_persistance = YES
write_persistance = YES
client = 18277
indicator_group = SALES
Could you give me some help? Some of my attempts succeeded, but it fails for no apparent reason.
Thank you.
I'm not sure if you're interested, but I'd do it a slightly different way. It's a little more verbose, but it will work, and it makes your expressions a lot simpler.
I've just taken two of your desired outputs and provided a solution for those, one being client and the other being country. You can apply the same pattern to the other two as need be.
If I take client for example, this is the concept.
Initialize Data
This is your string that you provided in your question.
Initialize Split Lines
This will split up your string for each new line. The expression for this step is ...
split(variables('Data'), '\n')
However, you can't just enter that expression into the editor; you need to enter it, then edit it in code view and change it from \\n to \n.
Filter For 'client'
This will filter the array created from the split line step and find the item that contains the word client.
contains(item(), 'client')
On the other parallel branches, you'd change out the word to whatever you're searching for, e.g. country.
This should give us a single item array with a string.
Initialize 'client'
Finally, we want to extract the value on the right hand side of the equals sign. The expression for this is ...
trim(split(body('Filter_For_''client''')[0], '=')[1])
Again, just change out the body name for the other action in each case.
I need to use body('Filter_For_''client''')[0] to specify the first item because the filter step returns an array. We're going to assume the length is always 1.
Result
You can see from all of that that you have the value as needed. Like I said, it's a little more verbose but (I think) easier to follow and troubleshoot if something goes wrong.
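If it helps to see the logic outside of Power Automate, here is the same split/filter/trim sequence as a plain Python sketch (the sample text comes from the question; everything else is illustrative):
# The raw parameter block from the sample email.
data = """output = 7
country = 170
period = 202204012
DATA_REPORT_DELIVERY = NO
read_persistance = YES
write_persistance = YES
client = 18277
indicator_group = SALES"""

lines = data.split("\n")                        # the "Split Lines" step
matches = [l for l in lines if "client" in l]   # the "Filter For 'client'" step
client = matches[0].split("=")[1].strip()       # first item, right of "=", trimmed
print(client)  # -> 18277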

rasa_nlu: how to train an entity to capture dynamic values

I am new to rasa_nlu. I am building a bot that takes dynamic values, and I want those dynamic values to be captured by an entity.
Examples:
with notes please { bring your college marksheet }
with notes please { come prepared }
The strings inside the braces are dynamic values; I need to capture them and make use of them. Please suggest a way.
Thank you in advance.
The Rasa component CRFEntityExtractor is able to generalize to different entity values. Just make sure you add enough training data so that the conditional random field can pick up the pattern.
The following example learns to pick up person names (you have to add more than these three examples, but this should give you an idea):
## intent: greet_with_name
- Hi, I'm [Donald](name)
- Hello, it's [Steve](name)
- Hi, Im [Sam](name)
- ...
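For completeness, the extractor also has to be enabled in the NLU pipeline. A minimal config.yml sketch, assuming Rasa 1.x component names (older rasa_nlu releases called this component ner_crf instead):
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor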

Summing XML Data in MS Word Mail Merge

I have a report card written in Word that uses an XML file for its input. In the XML file, if a student remains in the same section all three trimesters there will be one node for that class; if they change sections at the trimester they'll have one node for each section. The nodes look something like this (greatly simplified):
<ReportCardSectionFB Abs1="2" Abs2="11" CourseID="ELMATH1" CourseTitle="Math" PeriodStart="3" TeacherName="Jones, Jennifer" TermCode="Year" SectionID="ELMATH1-4" />
<ReportCardSectionFB Abs1="1.50" Abs2="6" CourseID="ELMATH1" CourseTitle="Math" PeriodStart="3" TeacherName="Smith, Tina" TermCode="Year" SectionID="ELMATH1-3" />
There is no indicator within the XML as to which trimester the node belongs to.
In the Word document, we're pulling the absence data with the following mail merge command:
{MERGEFIELD "ReportCardSectionFB[#PeriodStart='3']/ #Abs1" \# 0.# \* MERGEFORMAT }
That's not working in this situation: it only gets the absence data from the first node it comes across, i.e.: 2.0. Is there a way to get the sum of #Abs1 for all period 3 classes, i.e.: 3.5? If not, is there a way to only get the last #Abs1 for period 3, i.e.: 1.5?
I recommend this third-party product, which can use XML as input and is capable of merging it with an MS Word template. It is also much more powerful than Word's built-in mail merge. You can see some examples here.
You could also try summing the absences in Synergy - there's a new checkbox under AttDef1, 2, etc. that adds up all the absences for the date range - Include all day data for the entire date range regardless of section enrollment or section timeframe. That way the absences should be the same for each section, if that works for your district.
You can also try the SET field in Word to store the MERGEFIELDs in bookmarks and then use Word's formula fields to add the bookmarks together.
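If pre-processing the XML outside Word is an option, summing Abs1 for all period-3 nodes is a few lines of script. A sketch using Python's standard library with the two sample nodes from the question (the wrapping root element is assumed):
import xml.etree.ElementTree as ET

xml = """<ReportCard>
  <ReportCardSectionFB Abs1="2" Abs2="11" PeriodStart="3" SectionID="ELMATH1-4" />
  <ReportCardSectionFB Abs1="1.50" Abs2="6" PeriodStart="3" SectionID="ELMATH1-3" />
</ReportCard>"""

root = ET.fromstring(xml)
total = sum(float(node.get("Abs1"))
            for node in root.iter("ReportCardSectionFB")
            if node.get("PeriodStart") == "3")
print(total)  # -> 3.5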

iPhone's phone number splitting algorithm?

iPhone has a pretty good telephone number splitting function, for example:
Singapore mobile: +65 9852 4135
Singapore resident line: +65 6325 6524
China mobile: +86 135-6952-3685
China resident line: +86 10-65236528
Taiwan: +886 956-238-82
USA: +1 (732) 865-3286
Notice the nice features here:
- the splitting of country code, area code, and the rest is automatic;
- the delimiter is also nicely adopted to different countries, e.g. "()", "-" and space.
The parsing logic is doable for me; however, I don't know where to get the knowledge of most countries' telephone number formats.
Where could I find such knowledge, or open-source code that implements it?
You can get similar functionality with the libphonenumber code library.
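For example, with the Python port (the phonenumbers package on PyPI), parsing and re-formatting takes a few lines; this is a sketch, not the iPhone's actual implementation:
import phonenumbers

# The leading +65 lets the library infer the region on its own.
number = phonenumbers.parse("+65 9852 4135", None)
print(phonenumbers.format_number(number, phonenumbers.PhoneNumberFormat.INTERNATIONAL))
# -> +65 9852 4135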
Interestingly enough, you cannot use an NSNumberFormatter for this, but you can write your own custom class for it. Just create a new class, set properties such as countryCode, areaCode and number, and then create a method that formats the number based on the countryCode.
Here's a great example: http://the-lost-beauty.blogspot.com/2010/01/locale-sensitive-phone-number.html
As an aside: a friend told me about a gigantic regular expression he had to maintain that could pick telephone numbers out of intercepted communications from hundreds of countries around the world. It was very non-trivial.
Thankfully your problem is easier, as you can just have a table with the per-country formats:
format[usa] = "+d (ddd) ddd-dddd";
format[hk] = "+ddd ddd-ddd-dd";
format[china_mobile] = "+dd ddd-dddd-dddd";
...
Then when you're printing, you simply output one digit from the phone number string in each d spot as needed. This assumes you know the country, which is a safe enough assumption for telephone devices -- pick "default" formats for the few surrounding countries.
Since some countries have different formats with different lengths you might need to store your table with additional information:
format[germany][10] = "..."
format[germany][11] = "....."
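A minimal sketch of the table-driven idea: walk the format string and substitute one digit per d. The formats are illustrative, and the digit count is assumed to match the format:
def format_number(digits, fmt):
    it = iter(digits)
    # Consume one digit for every 'd'; copy every other character through.
    return "".join(next(it) if c == "d" else c for c in fmt)

print(format_number("17328653286", "+d (ddd) ddd-dddd"))  # -> +1 (732) 865-3286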

Algorithms or Patterns for reading text

My company has a client that tracks prices for products from different companies at different locations. This information goes into a database.
These companies email the prices to our client each day, and of course the emails are all formatted differently. It is impossible to have any of the companies change their format - they will not do it.
Some look sort of like this:
This is example text that could be many lines long...
Location 1
Product 1 Product 2 Product 3
$20.99 $21.99 $33.79
Location 2
Product 1 Product 2 Product 3
$24.99 $22.88 $35.59
Others look sort of like this:
PRODUCT       PRICE   + / -
------------  -------- -------
Location 1
1             2007.30 +048.20
2             2022.50 +048.20
Maybe some multiline text here about a holiday or something...
Location 2
1             2017.30 +048.20
2             2032.50 +048.20
Currently we have individual parsers written for each company's email format. But these formats change slightly, and fairly frequently. We can't count on the prices being in the same row or column each time.
It's trivial for us to look at the emails and determine which price goes with which product at which location. But not so much for our code. So I'm trying to find a more flexible solution and would like your suggestions about what approaches to take. I'm open to anything from regex to neural networks - I'll learn what I need to to make this work, I just don't know what I need to learn. Is this a lex/parsing problem? More similar to OCR?
The code doesn't have to figure out the formats all on its own. The emails fall into a few main 'styles' like the ones above. We really need the code to just be flexible enough that a new product line or whitespace or something doesn't make the file unparsable.
Thanks for any suggestions about where to start.
I think this problem would be suitable for a proper parser generator. Regular expressions are too difficult to test and debug when they go wrong. However, I would go for a parser generator that is as simple to use as if it were part of the language.
For this type of task I would go with pyparsing, as it has the power of a full recursive-descent parser but without a difficult grammar to define, and it comes with very good helper functions. The code is easy to read, too.
from pyparsing import *

aaa = """ This is example text that could be many lines long...
another line
Location 1
Product 1 Product 2 Product 3
$20.99 $21.99 $33.79
stuff in here you want to ignore
Location 2
Product 1 Product 2 Product 3
$24.99 $22.88 $35.59 """

# In place of "Location", the anchor could be any kind of match, e.g. a regex.
result = SkipTo("Location").suppress() \
    + OneOrMore(Word(alphas) + Word(nums)) \
    + OneOrMore(Word(nums + "$."))
all_results = OneOrMore(Group(result))
parsed = all_results.parseString(aaa)
for block in parsed:
    print(block)
This returns a list of lists.
['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3', '$20.99', '$21.99', '$33.79']
['Location', '2', 'Product', '1', 'Product', '2', 'Product', '3', '$24.99', '$22.88', '$35.59']
You can group things however you want, but for simplicity I have just returned lists. Whitespace is ignored by default, which makes things a lot simpler.
I do not know if there are equivalents in other languages.
You have given two pattern samples of the text files. I think these can be handled with scripting, something like AWK, sed, and grep with bash scripting.
The pattern in the first sample:
- A section starts with the keyword Location [Number].
- The second line of a section has columns describing the product names.
- The third line of a section has columns with prices for the products.
- There can be a variable number of products per section.
- There can be a variable number of sections per file.
- Products and prices are always on their designated lines of a section.
- Whitespace separation identifies the (product, price) column association.
- The number of products in a section matches the number of prices in that section.
The collected data would probably be assimilated into a database.
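A plain-Python sketch of those rules for the first sample format (AWK or sed would follow the same line-oriented logic; all names here are mine):
text = """Location 1
Product 1 Product 2 Product 3
$20.99 $21.99 $33.79
Location 2
Product 1 Product 2 Product 3
$24.99 $22.88 $35.59"""

records = []
lines = text.splitlines()
for i, line in enumerate(lines):
    if line.startswith("Location"):
        # The two lines after a section header hold names, then prices.
        tokens = lines[i + 1].split()
        names = [" ".join(tokens[j:j + 2]) for j in range(0, len(tokens), 2)]
        prices = lines[i + 2].split()
        records.extend((line, n, p) for n, p in zip(names, prices))

for r in records:
    print(r)  # ('Location 1', 'Product 1', '$20.99') ...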
The one thing I know I would use here is regular expressions. Three or four expressions could drive the parse logic for each e-mail format.
Trying to write the parse engine more generally than that would, I think, be skirting the edge of over-engineering it.
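For instance, a single expression already isolates the prices in the first sample format (illustrative only):
import re

price_re = re.compile(r"\$\d+\.\d{2}")
print(price_re.findall("$20.99 $21.99 $33.79"))  # -> ['$20.99', '$21.99', '$33.79']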
