I have the following XML group...
<BOXES>
<BOX_CODE>01</BOX_CODE>
<BOX_CODE>12</BOX_CODE>
<BOX_CODE>15</BOX_CODE>
<BOX_CODE>45</BOX_CODE>
<BOX_CODE>46</BOX_CODE>
<BOX_CODE>70</BOX_CODE>
<BOX_CODE>80</BOX_CODE>
<BOX_CODE>98</BOX_CODE>
<BOX_CODE>SA</BOX_CODE>
</BOXES>
... and in the RTF template I would like to display each of those values in a separate column, like this:
01 | 12 | 15 | 45 | 46 | 70 | 80 | 98 | SA
I am trying to use a for-each-group function but not getting the results I want.
Keep in mind that the number of BOX_CODE values is dynamic. In my example there are 9, but there could be less or more at any given time.
I tried using for-each-group#column but did not get the results I wanted. Any help would be greatly appreciated.
Well, I went about it a different way. I created the group like this instead...
<BOX_GROUP>
<BOXES><BOX_CODE>01</BOX_CODE></BOXES>
<BOXES><BOX_CODE>12</BOX_CODE></BOXES>
<BOXES><BOX_CODE>15</BOX_CODE></BOXES>
<BOXES><BOX_CODE>45</BOX_CODE></BOXES>
<BOXES><BOX_CODE>46</BOX_CODE></BOXES>
<BOXES><BOX_CODE>70</BOX_CODE></BOXES>
<BOXES><BOX_CODE>80</BOX_CODE></BOXES>
<BOXES><BOX_CODE>98</BOX_CODE></BOXES>
<BOXES><BOX_CODE>SA</BOX_CODE></BOXES>
</BOX_GROUP>
And I was able to get the desired results using...
<?for-each-group#column: BOXES; BOX_CODE?>
<?BOX_CODE?>
<?end for-each-group?>
Related
I'm using a transaction to see how long a device is RFM mode and the duration field increases with each table row. How I think it should work is that while the field is 'yes' it would calculate the duration that all events equal 'yes', but I have a lot of superfluous data that shouldn't be there IMO.
I only want to keep the largest duration event so I want to compare the current events duration to the next events duration and if its smaller than the current event, keep the current event.
index=crowdstrike sourcetype=crowdstrike:device:json
| transaction falcon_device.hostname startswith="falcon_device.reduced_functionality_mode=yes" endswith="falcon_device.reduced_functionality_mode=no"
| table _time duration
_time
duration
2022-10-28 06:07:45
888198
2022-10-28 05:33:44
892400
2022-10-28 04:57:44
896360
2022-08-22 18:25:53
3862
2022-08-22 18:01:53
7703
2022-08-22 17:35:53
11543
In the data above the duration goes from 896360 to 3862, and can happen on any date, and the duration runs in cycles like that where it increases until it starts over. So in the comparison I would keep the event at the 10-28 inflection point and so on at all other inflection points throughout the dataset.
How would I construct that multi event comparison?
By definition, the transaction command bundles together all events with the same hostname value, starting with the first "yes" and ending with the first "no". There is no option to include events by size, but there are options that govern the maximum time span of a transaction (maxspan), how many events can be in a transaction (maxevents), and how long the time between events can be (maxpause). That the duration value you want to keep (896360) is 10 days even though the previous transaction was only 36 minutes before it makes me wonder about the logic being used in this query. Consider using some of the options available to better define a "transaction".
What problem are you trying to solve with this query? It's possible there's another solution that doesn't use transaction (which is very non-performant).
Sans sample data, something like the following will probably work:
index=crowdstrike sourcetype=crowdstrike:device:json falcon_device.hostname=* falcon_device.reduced_functionality_mode=yes
| stats max(_time) as yestime by falcon_device.hostname
| append
[| search index=crowdstrike sourcetype=crowdstrike:device:json falcon_device.hostname=* falcon_device.reduced_functionality_mode=no
| stats max(_time) as notime by falcon_device.hostname ]
| stats values(*) as * by falcon_device.hostname
| eval elapsed_seconds=yestime-notime
Thanks for your answers but it wasn't working out. I ended up talking to some professional splunkers and got the below as a solution.
index=crowdstrike sourcetype=crowdstrike:device:json
| addinfo ```adds info_max_time```
| fields + _time, falcon_device.reduced_functionality_mode falcon_device.hostname info_max_time
| rename falcon_device.reduced_functionality_mode AS mode, falcon_device.hostname AS Hostname
| sort 0 + Hostname, -_time ``` events are not always returned in descending order per hostname, which would break streamstats```
| streamstats current=f last(mode) as new_mode last(_time) as time_change by Hostname ```compute potential time of state change```
| eval new_mode=coalesce(new_mode,mode."+++"), time_change=coalesce(time_change,info_max_time) ```take care of boundaries of search```
| table _time, Hostname, mode, new_mode, time_change
| where mode!=new_mode ```keep only state change events```
| streamstats current=f last(time_change) AS change_end by Hostname ```add start time of the next state as change_end time for the current state```
| fieldformat time_change=strftime(time_change, "%Y-%m-%d %T")
| fieldformat change_end=strftime(change_end, "%Y-%m-%d %T")
``` uncomment the following to sort by duration```
```| search change_end=* AND new_mode="yes"
| eval duration = round( (change_end-time_change)/(3600),1)
| table time_change, Hostname, new_mode, duration
| sort -duration```
I am looking to merge data in way described below:
I have a table below:
table: PTLANALYSIS
RENTALDATE
OUTBOUND,
INBOUND,
VEHICLE_SIZE,
COMPETITOR,
RATE;
The data I am trying to load into the tabs:
RENTALDATE,
OUTBOUND,
INBOUND,
VEHICLE_SIZE,
LOLY,
KAY,
RATE;
Now LOLY and KAY are suppose to be in column "Competitor" in table PTLANALYSIS. Can someone help me merge my data in an appropriate manner, the output should look something like this...
Rental Date | OUTBOUND | INBOUND | VEHICLE_SIZE | COMPETITOR | RATE
12/28/2019 223 333 small loly 33.5
12/28/2019 223 333 small kay 33.5
Currently it looks like this in my csv..
Rental Date | OUTBOUND | INBOUND | VEHICLE_SIZE | lolyRATE | KAYRATE
12/28/2019 223 333 small 33.5 NULL
12/28/2019 223 333 small NULL 33.5
Thanks in advance!
Most of the columns in the CSV file have fixed targets. You need to evaluate the LOLYRATE and KAYRATE to conditionally populate COMPETITOR and RATE. Something like this:
insert into PTLANALYSIS (
RENTALDATE
OUTBOUND,
INBOUND,
VEHICLE_SIZE,
COMPETITOR,
RATE
)
select
RENTALDATE,
OUTBOUND,
INBOUND,
VEHICLE_SIZE,
case when LOLYRATE is not null then 'loly' else 'kay' end as competitor,
coalesce(LOLYRATE, KAYRATE) as rate
from ext_table
;
You haven't said how you intend to load the data but I have assumed an external table, because it allows you to use SQL, and everything is easier with SQL. Find out more.
I am dealing with panel data with a time gap. but not the same time gap.
Year variable has 1980, 1990, 2000, 2010, 2015, and 2020.
As you can see it has a 10 year time gap up to 2010, but five-years between 2010 and 2020.
After setting up for panel data structure in Stata (using xtset command), I wanted to use the time (lag) operator for my main variable interest and outcome variable. However, when I use L. in front of the variable name, Stata tells me no observations.
Isn't it automatically taking the previous time period?
Or do I create manually the lag variables?
What we need to know, but can't see, is exactly what code you used, specifically xtset. But it's possible to guess. Here I fabricate one panel; a structure with more panels doesn't show different problems.
clear
input Y Year
1 1980
2 1990
3 2000
4 2010
5 2015
6 2020
end
gen ID = 42
If you just specify panel and year variables, Stata expects unit spacing, so lag 1 with yearly data means "the previous year". Asking for a lag 1 variable is legal, but all values are missing.
xtset ID Year
gen lag1 = L1.Y
If you specify delta(5) then a lag 1 variable is missing in all but two observations.
xtset ID Year, delta(5)
gen lag5 = L1.Y
If you try delta(10) that won't work (unless you drop 2015).
xtset ID Year, delta(10)
You can also do this:
bysort ID (Year) : gen prev = Y[_n-1]
Bringing your results together
list , sep(0)
+------------------------------------+
| Y Year ID lag1 lag5 prev |
|------------------------------------|
1. | 1 1980 42 . . . |
2. | 2 1990 42 . . 1 |
3. | 3 2000 42 . . 2 |
4. | 4 2010 42 . . 3 |
5. | 5 2015 42 . 4 4 |
6. | 6 2020 42 . 5 5 |
+------------------------------------+
The no observations error message presumably comes from some other command.
I have implemented an Apache Pig script. When I execute the script it results in many mappers for a specific step, but has only one reducer for that step. Because of this condition (many mappers, one reducer) the Hadoop cluster is almost idle while the single reducer executes. In order to better use the resources of the cluster I would like to also have many reducers running in parallel.
Even if I set the parallelism in the Pig script using the SET DEFAULT_PARALLEL command I still result in having only 1 reducer.
The code part issuing the problem is the following:
SET DEFAULT_PARALLEL 5;
inputData = LOAD 'input_data.txt' AS (group_name:chararray, item:int);
inputDataGrouped = GROUP inputData BY (group_name);
-- The GeneratePairsUDF generates a bag containing pairs of integers, e.g. {(1, 5), (1, 8), ..., (8, 5)}
pairs = FOREACH inputDataGrouped GENERATE GeneratePairsUDF(inputData.item) AS pairs_bag;
pairsFlat = FOREACH pairs GENERATE FLATTEN(pairs_bag) AS (item1:int, item2:int);
The 'inputData' and 'inputDataGrouped' aliases are computed in the mapper.
The 'pairs' and 'pairsFlat' in the reducer.
If I change the script by removing the line with the FLATTEN command (pairsFlat = FOREACH pairs GENERATE FLATTEN(pairs_bag) AS (item1:int, item2:int);) then the execution results in 5 reducers (and thus in a parallel execution).
It seems that the FLATTEN command is the problem and avoids that many reducers are created.
How could I reach the same result of FLATTEN but having the script being executed in parallel (with many reducers)?
Edit:
EXPLAIN plan when having two FOREACH (as above):
Map Plan
inputDataGrouped: Local Rearrange[tuple]{chararray}(false) - scope-32
| |
| Project[chararray][0] - scope-33
|
|---inputData: New For Each(false,false)[bag] - scope-29
| |
| Cast[chararray] - scope-24
| |
| |---Project[bytearray][0] - scope-23
| |
| Cast[int] - scope-27
| |
| |---Project[bytearray][1] - scope-26
|
|---inputData: Load(file:///input_data.txt:org.apache.pig.builtin.PigStorage) - scope-22--------
Reduce Plan
pairsFlat: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-42
|
|---pairsFlat: New For Each(true)[bag] - scope-41
| |
| Project[bag][0] - scope-39
|
|---pairs: New For Each(false)[bag] - scope-38
| |
| POUserFunc(GeneratePairsUDF)[bag] - scope-36
| |
| |---Project[bag][1] - scope-35
| |
| |---Project[bag][1] - scope-34
|
|---inputDataGrouped: Package[tuple]{chararray} - scope-31--------
Global sort: false
EXPLAIN plan when having only one FOREACH with FLATTEN wrapping the UDF:
Map Plan
inputDataGrouped: Local Rearrange[tuple]{chararray}(false) - scope-29
| |
| Project[chararray][0] - scope-30
|
|---inputData: New For Each(false,false)[bag] - scope-26
| |
| Cast[chararray] - scope-21
| |
| |---Project[bytearray][0] - scope-20
| |
| Cast[int] - scope-24
| |
| |---Project[bytearray][1] - scope-23
|
|---inputData: Load(file:///input_data.txt:org.apache.pig.builtin.PigStorage) - scope-19--------
Reduce Plan
pairs: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---pairs: New For Each(true)[bag] - scope-35
| |
| POUserFunc(GeneratePairsUDF)[bag] - scope-33
| |
| |---Project[bag][1] - scope-32
| |
| |---Project[bag][1] - scope-31
|
|---inputDataGrouped: Package[tuple]{chararray} - scope-28--------
Global sort: false
There is no surety if pig uses the configuration DEFAULT_PARALLEL value for every steps in the pig script. Try PARALLEL along with your specific join/group step which you feel taking time (In your case GROUP step).
inputDataGrouped = GROUP inputData BY (group_name) PARALLEL 67;
If still it is not working then you might have to see your data for skewness issue.
I think there is a skewness in the data. Only a small number of mappers are producing exponentially large output. Look at the distribution of keys in your data. Like data contains few Groups with large number of records.
I tried "set default parallel" and "PARALLEL 100" but no luck. Pig still uses 1 reducer.
It turned out I have to generate a random number from 1 to 100 for each record and group these records by that random number.
We are wasting time on grouping, but it is much faster for me because now I can use more reducers.
Here is the code (SUBMITTER is my own UDF):
tmpRecord = FOREACH record GENERATE (int)(RANDOM()*100.0) as rnd, data;
groupTmpRecord = GROUP tmpRecord BY rnd;
result = FOREACH groupTmpRecord GENERATE FLATTEN(SUBMITTER(tmpRecord));
To answer your question we must first know how many reducers pig enforces to accomplish the - Global Rearrange process. Because as per my understanding the Generate / Projection should not require a single reducer. I cannot say the same thing about Flatten. However we know from common-sense that during flatten the aim is to de-nestify the tuples from bags and vice versa. And to do that all the tuples belonging to a bag should definitely be available in the same reducer. I might be wrong. But can anyone add something here to get this user an answer please ?
I created a toy example of my code below.
In this toy example I would like to create a measure of all higher prices minus lower prices within a self-created reference group. So within each reference group, I would like to take each individual and subtract its price value from all higher price values from other individuals in the same group. I do not want to have negative differences. Then I would like to sum all these differences. In creating this code I found some help here:
http://www.stata.com/support/faqs/data-management/try-all-values-with-foreach/
However, the code didn't work perfectly for me, because my dataset is quite large (several 100K obs) and the examples on the website and my code only work until the numlist maximum of 1600 in Stata. (I am using version 12). The toy example with the auto dataset works, due to small size of the dataset.
I would like to ask if someone has an idea how to code this more efficiently, so that I can get around the numlist restriction. I thought about summing the differences directly without saving them in intermediate variables, but that also blow up the numlist restriction.
clear all
sysuse auto
ren headroom refgroup
bysort refgroup : egen pricerank = rank(price)
qui: su pricerank, meanonly
gen test = `r(max)'
su test
foreach i of num 1/`r(max)' {
qui: bys refgroup: gen intermediate`i' = price[_n+`i'] -price if price[_n+`i'] > price
}
egen price_diff = rowmax(intermediate*)
drop intermediate*
If I understand this correctly, this isn't even a problem that requires explicit loops. The sum of all higher prices is just the difference between two cumulative sums. You might need to think through what you want to do if prices are tied.
. clear
. set obs 10
obs was 0, now 10
. gen group = _n > 5
. set seed 2803
. gen price = ceil(1000 * runiform())
. bysort group (price) : gen sumhigherprices = sum(price)
. by group : replace sumhigherprices = sumhigherprices[_N] - sumhigherprices
(10 real changes made)
. list
+--------------------------+
| group price sumhig~s |
|--------------------------|
1. | 0 218 1448 |
2. | 0 264 1184 |
3. | 0 301 883 |
4. | 0 335 548 |
5. | 0 548 0 |
|--------------------------|
6. | 1 125 3027 |
7. | 1 213 2814 |
8. | 1 828 1986 |
9. | 1 988 998 |
10. | 1 998 0 |
+--------------------------+
Edit: For what the OP needs, there is an extra line
. by group : replace sumhigherprices = sumhigherprices - (_N - _n) * price
If I understand the wording of the problem correctly, maybe this can help. It uses joinby (new observations are created and depending on the size of the original database, you may or not hit the Stata hard-limit on number of observations). The code reproduces the results that would follow from the code of the original post. This is a second attempt. The code before this final edit did not provide the sought-after results. The wording of the problem was somewhat difficult for me to understand.
clear all
set more off
* Load data
sysuse auto
* Delete unnecessary vars
ren headroom refgroup
keep refgroup price
* Generate id´s based on rankings (sort)
bysort refgroup (price): gen id = _n
* Pretty list
order refgroup id
sort refgroup id price
list, sepby(refgroup)
* joinby procedure
tempfile main
save "`main'"
rename (price id) =0
joinby refgroup using "`main'"
list, sepby(refgroup)
* Do not compare with itself and drop duplicates
drop if id0 >= id
* Compute differences and max
gen dif = abs(price0 - price)
collapse (max) dif, by(refgroup id0)
list, sepby(refgroup)