I have a Stata dataset with these variables:
clear
input str11 datel str15 timel int a double b int c double time float(hours minutes event)
"23-FEB-2006" "10:14:57.837759" . 45.04 2 36897837 10 14 .
"23-FEB-2006" "10:14:57.990093" 100 . . 36897990 10 14 .
"23-FEB-2006" "10:14:57.993023" 100 . . 36897993 10 14 .
"23-FEB-2006" "10:14:57.993023" 1800 . . 36897993 10 14 .
"23-FEB-2006" "10:14:58.133639" . 45.04 1 36898133 10 14 .
"23-FEB-2006" "10:15:01.773054" . 45.04 1 36901773 10 15 .
"23-FEB-2006" "10:15:01.776960" . 45.04 1 36901776 10 15 .
"23-FEB-2006" "10:15:02.776896" . 45.04 3 36902776 10 15 .
"23-FEB-2006" "10:15:07.482650" . 45.04 5 36907482 10 15 .
"23-FEB-2006" "10:15:07.885944" . 45.04 3 36907885 10 15 .
"23-FEB-2006" "10:15:09.550877" . 45.04 7 36909550 10 15 .
"23-FEB-2006" "10:15:22.151906" 100 . . 36922151 10 15 1
"23-FEB-2006" "10:15:22.155812" 100 . . 36922155 10 15 1
"23-FEB-2006" "10:15:22.155812" 1200 . . 36922155 10 15 1
"23-FEB-2006" "10:15:22.155812" 300 . . 36922155 10 15 1
"23-FEB-2006" "10:15:22.155812" 100 . . 36922155 10 15 1
"23-FEB-2006" "10:15:22.642109" 200 . . 36922642 10 15 .
"23-FEB-2006" "10:15:22.832527" 100 . . 36922832 10 15 .
"23-FEB-2006" "10:15:22.990720" . 45.04 3 36922990 10 15 .
"23-FEB-2006" "10:15:23.311988" . 45.04 1 36923311 10 15 .
"23-FEB-2006" "10:15:23.319800" . 45.05 3 36923319 10 15 .
"23-FEB-2006" "10:15:23.331518" . 45.1 1 36923331 10 15 .
"23-FEB-2006" "10:15:23.335424" . 45.11 1 36923335 10 15 .
"23-FEB-2006" "10:15:23.335424" . 45.11 2 36923335 10 15 .
"23-FEB-2006" "10:15:23.336401" . 45.1 1 36923336 10 15 .
"23-FEB-2006" "10:15:23.336401" . 45.1 1 36923336 10 15 .
"23-FEB-2006" "10:15:23.336401" . 45.1 1 36923336 10 15 .
"23-FEB-2006" "10:15:23.336401" . 45.1 1 36923336 10 15 .
"23-FEB-2006" "10:15:23.336401" . 45.1 1 36923336 10 15 .
end
The variable event assumes value = 1 for the entire duration of the event or value = . otherwise.
I want to keep the event and a temporal window of 10 minutes before and after the event.
For example, in my full dataset I want to keep the event (i.e. the observations where the variable event = 1), the 10 minutes before line 17467 and the 10 minutes after line 17471, while dropping the other observations.
The dataset could have more than one event.
One can do this with the community-contributed command rangestat. Stata clock values are in milliseconds, so the interval(datetime -1000 1000) below is a ±1 second window, chosen to suit this small sample; for the 10-minute window you describe, use interval(datetime -600000 600000):
generate double datetime = clock(datel + substr(timel, 1, 12), "DMYhms")
rangestat (count) event, interval(datetime -1000 1000)
drop if event_count == 0
format datetime %tcDDmonCCYY_HH:MM:SS.sss
list datetime event event_count, sepby(event_count) abbreviate(15)
+----------------------------------------------+
| datetime event event_count |
|----------------------------------------------|
1. | 23feb2006 10:15:22.151 1 5 |
2. | 23feb2006 10:15:22.155 1 5 |
3. | 23feb2006 10:15:22.155 1 5 |
4. | 23feb2006 10:15:22.155 1 5 |
5. | 23feb2006 10:15:22.155 1 5 |
6. | 23feb2006 10:15:22.642 . 5 |
7. | 23feb2006 10:15:22.832 . 5 |
8. | 23feb2006 10:15:22.990 . 5 |
+----------------------------------------------+
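For readers outside Stata, the same two-pass idea is easy to sketch in awk. This assumes a hypothetical whitespace-separated export of the data (file name `data` is made up), with the millisecond-of-day time in column 6 and event in column 9, and the same ±1000 ms window:

```shell
# First pass (NR==FNR): record the time of every event observation.
# Second pass: keep a row if it lies within 1000 ms of any event time.
awk 'NR == FNR { if ($9 == 1) ev[++n] = $6; next }
     { for (i = 1; i <= n; i++)
           if ($6 >= ev[i] - 1000 && $6 <= ev[i] + 1000) { print; next } }' data data
```

For the real 10-minute window, replace 1000 with 600000.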
I am trying to generate a counter variable that describes the duration of a temporal episode in panel data.
I am using long format data that looks something like this:
clear
input byte id int time byte var1 int aim1
1 1 0 .
1 2 0 .
1 3 1 1
1 4 1 2
1 5 0 .
1 6 0 .
1 7 0 .
2 1 0 .
2 2 1 1
2 3 1 2
2 4 1 3
2 5 0 .
2 6 1 1
2 7 1 2
end
I want to generate a variable like aim1 that starts with a value of 1 when var1==1, and counts up one unit with each subsequent observation per ID where var1 is still equal to 1. For each observation where var1!=1, aim1 should contain missing values.
I already tried using rangestat (count) to solve the problem; however, the created variable does not restart the count with each episode:
ssc install rangestat
gen var2=1 if var1==1
rangestat (count) aim2=var2, interval(time -7 0) by(id)
Here are two ways to do it: (1) from first principles (but see this paper for more) and (2) using tsspell from SSC.
clear
input byte id int time byte var1 int aim1
1 1 0 .
1 2 0 .
1 3 1 1
1 4 1 2
1 5 0 .
1 6 0 .
1 7 0 .
2 1 0 .
2 2 1 1
2 3 1 2
2 4 1 3
2 5 0 .
2 6 1 1
2 7 1 2
end
bysort id (time) : gen wanted = 1 if var1 == 1 & var1[_n-1] != 1
by id: replace wanted = wanted[_n-1] + 1 if var1 == 1 & missing(wanted)
tsset id time
ssc inst tsspell
tsspell, cond(var1 == 1)
list, sepby(id _spell)
+---------------------------------------------------------+
| id time var1 aim1 wanted _seq _spell _end |
|---------------------------------------------------------|
1. | 1 1 0 . . 0 0 0 |
2. | 1 2 0 . . 0 0 0 |
|---------------------------------------------------------|
3. | 1 3 1 1 1 1 1 0 |
4. | 1 4 1 2 2 2 1 1 |
|---------------------------------------------------------|
5. | 1 5 0 . . 0 0 0 |
6. | 1 6 0 . . 0 0 0 |
7. | 1 7 0 . . 0 0 0 |
|---------------------------------------------------------|
8. | 2 1 0 . . 0 0 0 |
|---------------------------------------------------------|
9. | 2 2 1 1 1 1 1 0 |
10. | 2 3 1 2 2 2 1 0 |
11. | 2 4 1 3 3 3 1 1 |
|---------------------------------------------------------|
12. | 2 5 0 . . 0 0 0 |
|---------------------------------------------------------|
13. | 2 6 1 1 1 1 2 0 |
14. | 2 7 1 2 2 2 2 1 |
+---------------------------------------------------------+
The approach of tsspell is very close to what you ask for, except that (a) its counter _seq is by default 0 when out of spell, but replace _seq = . if _seq == 0 gets what you ask for, and (b) its auxiliary variables (by default _spell and _end) are useful in many problems. You must install tsspell before you can use it, with ssc install tsspell.
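For comparison, the run counter itself (ignoring the auxiliary spell variables) can be sketched outside Stata in a few lines of awk, assuming a hypothetical whitespace-separated file panel.txt with columns id, time, var1, already sorted by id then time:

```shell
# Reset the run length whenever the id changes; increment it while var1 == 1,
# otherwise print "." (mirroring Stata's missing value).
awk '{ if ($1 != id) run = 0; id = $1
       run = ($3 == 1 ? run + 1 : 0)
       print $0, (run ? run : ".") }' panel.txt
```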
I have a text file:
10 1 15
10 12 30
10 9 45
10 9 40
10 15 55
12 9 0
12 7 18
12 10 1
9 1 1
9 2 1
9 0 1
14 5 5
And I would like to get this file as an output of my MapReduce job:
9 0 1
9 1 1
9 2 1
10 1 15
10 9 40
10 9 45
10 12 30
10 15 55
12 7 18
12 9 0
12 10 1
14 5 5
That is, it has to be sorted numerically by the 1st, 2nd and 3rd columns.
I use this script:
#!/bin/bash
IN_DIR="/user/cloudera/temp"
OUT_DIR="/user/cloudera/temp_out"
NUM_REDUCERS=1
hdfs dfs -rmr ${OUT_DIR} > /dev/null
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.job.name="Parsing mista pages job 1 (parsing)" \
-D stream.num.map.output.key.fields=3 \
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
-D mapreduce.partition.keycomparator.options='-k1,1n -k2,2n -k3,3n' \
-D mapreduce.job.reduces=${NUM_REDUCERS} \
-mapper 'cat' \
-reducer 'cat' \
-input ${IN_DIR} \
-output ${OUT_DIR}
hdfs dfs -cat ${OUT_DIR}/* | head -100
And I get exactly what I want. BUT when I set NUM_REDUCERS=2 I get this output:
[cloudera#quickstart ~]$ hdfs dfs -cat /user/cloudera/temp_out/part-00000 | head -100
9 1 1
10 9 45
10 12 30
10 15 55
12 7 18
12 10 1
14 5 5
[cloudera#quickstart ~]$ hdfs dfs -cat /user/cloudera/temp_out/part-00001 | head -100
9 0 1
9 2 1
10 1 15
10 9 40
12 9 0
Why does the partitioner send rows with the same first key (for example '9') to different reducers?
How can I force the partitioner to split the mapper output by the first key and sort it by the remaining values? For example, if I have 4 reducers the reducer inputs should be:
reducer 1
9 0 1
9 1 1
9 2 1
reducer 2
10 1 15
10 9 40
10 9 45
10 12 30
10 15 55
reducer 3
12 7 18
12 9 0
12 10 1
reducer 4:
14 5 5
You can override the default Partitioner to send each key to a different reducer. Set the number of reducers equal to the number of keys, so that each reducer deals with only one key. For example:
groupMap.put("9", 0);
groupMap.put("10", 1);
groupMap.put("12", 2);
groupMap.put("14", 3);
Add a -partitioner argument to use your own partitioner in the job. I think that might work for you.
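The reason the default setup scatters '9' is that with stream.num.map.output.key.fields=3 the whole three-field line is the key, and the default HashPartitioner hashes that entire key, so records sharing only the first field can land on different reducers. Hadoop Streaming also ships a stock KeyFieldBasedPartitioner that can partition on the first field alone while the comparator still sorts on all three. A sketch reusing the paths from the question (note it still hashes the first field, so with 4 reducers there is no guarantee of exactly one key per reducer):

```shell
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=3 \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options='-k1,1n -k2,2n -k3,3n' \
    -D mapreduce.partition.keypartitioner.options='-k1,1' \
    -D mapreduce.job.reduces=4 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -mapper 'cat' \
    -reducer 'cat' \
    -input ${IN_DIR} \
    -output ${OUT_DIR}
```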
I have 4-column data files of approximately 100 lines each. I'd like to subtract every nth line's value from the (n+3)th line's value and print the results in a new column ($5). The column data does not follow a regular pattern.
My sample file:
cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
Output should be:
1 2 3 20 0 #(20-20)
1 2 3 10 20 #(30-10)
1 2 3 5 35 #(40-5)
1 2 3 20 ? #(. - 20)
1 2 3 30 ? #(. - 30)
1 2 3 40 ? #(. - 40)
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
How can I do this in awk?
Thank you
For this I think the easiest approach is to read through the file twice. The first time (the NR==FNR block) we save all the 4th-column values in an array indexed by line number. The next block runs only on the second pass and creates a 5th column with the desired calculation (checking first that we don't go past the end of the file).
$ cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
$ awk 'NR==FNR{a[NR]=$4; last=NR; next} {$5 = (FNR+3 <= last ? a[FNR+3] - $4 : "")}1' input input
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20
1 2 3 30
1 2 3 40
You can do this using tac + awk + tac:
tac input |
awk '{a[NR]=$4} NR>3 { $5 = (a[NR-3] ~ /^[0-9]+$/ ? a[NR-3] - $4 : "?") } 1' |
tac | column -t
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20 ?
1 2 3 30 ?
1 2 3 40 ?
1 2 3 .
1 2 3 .
1 2 3 .
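A single pass also works if you buffer the lines and emit each one three lines late, avoiding both the double read and the tac round trip (a sketch over the same input file):

```shell
# Buffer every line; once three lines ahead, emit line n with a 5th column
# equal to val[n+3] - val[n], or "?" when the later value is not numeric.
awk '{ line[NR] = $0; val[NR] = $4 }
     NR > 3 { n = NR - 3
              print line[n], (val[NR] ~ /^[0-9]+$/ ? val[NR] - val[n] : "?") }
     END { for (i = NR - 2; i <= NR; i++) if (i in line) print line[i] }' input
```

The END block flushes the last three lines unchanged, matching the desired output.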
I'm currently working on my karaoke files and I see a lot of non-capitalized words.
The .txt files are structured as key-value pairs and I was wondering how to capitalize the first letter of every word in the value.
Example txt:
#TITLE:fire and Water
#ARTIST:Some band
#CREATOR:yunho
#LANGUAGE:Korean
#EDITION:UAS
#MP3:2NE1 - Fire.mp3
#COVER:2NE1 - Fire.jpg
#VIDEO:2NE1 - Fire.avi
#VIDEOGAP:11.6
#BPM:595
#GAP:3860
F -4 4 16 I
F 2 4 16 go
F 8 6 16 by
F 16 4 16 the
F 22 6 16 name
F 30 4 16 of
F 36 10 16 C
F 46 10 16 L
F 58 6 16 of
F 66 5 16 2
F 71 3 16 N
F 74 4 16 E
F 78 18 16 1
I'd like to capitalize the words after the keys TITLE, ARTIST, LANGUAGE and EDITION,
so for the example txt:
#TITLE:**F**ire **A**nd **W**ater
#ARTIST:**S**ome **B**and
#CREATOR:yunho
#LANGUAGE:**K**orean
#EDITION:**U**AS
#MP3:2NE1 - Fire.mp3
#COVER:2NE1 - Fire.jpg
#VIDEO:2NE1 - Fire.avi
#VIDEOGAP:11.6
#BPM:595
#GAP:3860
F -4 4 16 I
F 2 4 16 go
F 8 6 16 by
F 16 4 16 the
F 22 6 16 name
F 30 4 16 of
F 36 10 16 C
F 46 10 16 L
F 58 6 16 of
F 66 5 16 2
F 71 3 16 N
F 74 4 16 E
F 78 18 16 1
Another thing is that I have loads of these .txt files, each in its own directory. I want to run the program recursively from the parent directory for all *.txt files.
Example directories:
Library/Some Band/Some Band - Some Song/some txt file.txt
Library/Some Band2/Some Band2 - Some Song/sometxtfile.txt
Library/Some Band3/Some Band3 - Some Song/some3333 txt file.txt
I've tried to do so with find . -name '*.txt' -exec sed -i command {} +
but I got stuck on the search and replace with sed... anyone care to help me out?
You can use this GNU sed command to uppercase the starting letter of each word on matching lines:
sed -E '/^#(TITLE|ARTIST|LANGUAGE|EDITION):/s/\b([a-z])/\u\1/g' file
#TITLE:Fire And Water
#ARTIST:Some Band
#CREATOR:yunho
#LANGUAGE:Korean
#EDITION:UAS
#MP3:2NE1 - Fire.mp3
#COVER:2NE1 - Fire.jpg
#VIDEO:2NE1 - Fire.avi
#VIDEOGAP:11.6
#BPM:595
#GAP:3860
F -4 4 16 I
F 2 4 16 go
F 8 6 16 by
F 16 4 16 the
F 22 6 16 name
F 30 4 16 of
F 36 10 16 C
F 46 10 16 L
F 58 6 16 of
F 66 5 16 2
F 71 3 16 N
F 74 4 16 E
F 78 18 16 1
For find + sed command use:
find . -name '*.txt' -exec \
sed -E -i '/^#(TITLE|ARTIST|LANGUAGE|EDITION):/s/\b([a-z])/\u\1/g' {} +
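The \u replacement escape is a GNU sed extension (BSD/macOS sed lacks it). A hedged fallback is perl, which supports the same substitution:

```shell
# perl's \u uppercases the next character; -i edits in place like sed -i.
find . -name '*.txt' -exec \
    perl -i -pe 's/\b([a-z])/\u$1/g if /^#(TITLE|ARTIST|LANGUAGE|EDITION):/' {} +
```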
Title sums it up.
$ echo `seq 0 10` `seq 5 15` | sort -n
0 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15
Why doesn't this work?
Even if I don't use seq:
echo '0 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15' | sort -n
0 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15
And even ditching echo directly:
$ echo '0 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15' > numbers
$ sort -n numbers
0 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15
sort(1) sorts lines. You have to split whitespace-delimited data into lines yourself:
echo `seq 0 10` `seq 5 15` | tr " " "\n" | sort -n
Because you need newlines for sort:
$ echo `seq 0 10` `seq 5 15` | tr " " "\\n" | sort -n | tr "\\n" " "; echo ""
0 1 2 3 4 5 5 6 6 7 7 8 8 9 9 10 10 11 12 13 14 15
$
You have a single line of input. There is nothing to sort.
The command as you typed it results in the sequence of numbers being all passed to sort in one line. That's not what you want. Just pass the output of seq directly to sort:
(seq 0 10; seq 5 15) | sort -n
By the way, as you just found out, the construct
echo `command`
doesn't usually do what you expect and is redundant: it tells the shell to capture the output of command and pass it to echo, which prints it again, collapsed onto one line by word splitting. Just let the command's output through directly (unless you really mean to have it processed by echo, for example to expand escape sequences).
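One more way to see the fix: printf '%s\n' repeats its format once per argument, so an unquoted expansion already reaches sort as one number per line:

```shell
printf '%s\n' $(seq 0 10) $(seq 5 15) | sort -n | tr '\n' ' '
# -> 0 1 2 3 4 5 5 6 6 7 7 8 8 9 9 10 10 11 12 13 14 15
```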