I have a large dataset for analysis, and I am looking for a shell script to filter the rows down to only what I require, so that I can load the dataset for further analysis in R.
The structure of the data is as follows:
Size,ModifiedTime,AccessTime,contentid
4886,"Jun 11, 2009 06:51:08 PM","Mar 15, 2013 09:24:53 AM",000000285b7925f511b3159a72f80a4a
4096,"Aug 21, 2008 03:54:28 PM","May 12, 2009 04:45:41 PM",0000011afae4d1227c4df57b410ea52c
84848,"Feb 12, 2007 12:40:00 PM","Apr 07, 2014 09:39:03 AM",000001cec02017ca3eb81ddc4cd1c9ff
518,"Aug 22, 2006 02:12:03 PM","Dec 25, 2007 06:48:18 AM",00000233565d1c17c3135a9504c455ca
264158,"Dec 08, 2009 03:28:14 PM","Apr 08, 2013 11:52:15 AM",000003020ba74b9d1b6075d3c1b8fcb3
725963,"Sep 29, 2008 03:45:21 PM","May 17, 2011 08:48:40 AM",0000034b98d29d84ce7b61ee68be7658
1340,"Sep 07, 2011 03:36:54 AM","Mar 12, 2013 02:55:01 AM",000004ed899e26ae1c9b1ece35a98af1
75264,"Jul 28, 2011 05:09:58 PM","Jun 07, 2014 04:21:28 PM",000005a09fd2eb706c5800eb06084160
198724,"Jul 23, 2012 02:25:58 PM","Jan 21, 2013 12:58:07 PM",0000060b9d552c35f281b5033dcfa1b4
It is essentially a large CSV file.
Now I want to filter out the rows whose AccessTime is more than 10 years old and write them to a separate CSV file; with the sample above, that should print the 2nd and 4th data rows (excluding the header).
I tried the following: create a temp time variable, compare it with the AccessTime, and print the row if the AccessTime is earlier.
BEGIN{
FPAT = "([^,]+)|(\"[^\"]+\")"; #this to read csv as some column value contains ,
OFS=",";
date=$(date -d "-3650 days" +"%s"); #temp time variable in epoch format
}
{
command="date -d" $6 " +%s"; #$6 refers to AccessTime column
( command | getline temp ); #converts Accesstime value to epoch format
close(command);
if(temp<date) print $6
}
But when I run this command, it doesn't print anything.
Any help is much appreciated.
Desired output:
Size,ModifiedTime,AccessTime,contentid
4096,"Aug 21, 2008 03:54:28 PM","May 12, 2009 04:45:41 PM",0000011afae4d1227c4df57b410ea52c
518,"Aug 22, 2006 02:12:03 PM","Dec 25, 2007 06:48:18 AM",00000233565d1c17c3135a9504c455ca
$ awk '
BEGIN {
m["Jan"]="01" # lookups for months
m["Feb"]="02" # Feb -> 02
m["Mar"]="03" # Mar -> 03
m["Apr"]="04" # etc.
m["May"]="05"
m["Jun"]="06"
m["Jul"]="07"
m["Aug"]="08"
m["Sep"]="09"
m["Oct"]="10"
m["Nov"]="11" # below we get todays date
m["Dec"]="12" # 10 years ago
dcmd="date +\"%Y%m%d,\" --date=\"10 years ago\"" # returns 20101204,
if((dcmd | getline d)<=0) # if getline fails
exit 1 # exit
# d=strftime("%Y%m%d")-10^5 "," # use this for GNU awk
}
$9 m[$7] $8>=d' file # explained below
d gets the value 20101204, (notice the trailing comma) from date +"%Y%m%d," --date="10 years ago". Reading the AccessTime from the file and rearranging its components with $9 m[$7] $8 yields, for example, 20130315, for Mar 15, 2013 (trailing comma again). The condition is then a plain string comparison of those two dates. Two things to note: this answer works on the whitespace-separated form of the data shown in its output below (with the quoted CSV above, the field numbers would differ), and >=d keeps the rows accessed within the last ten years; to keep the rows last accessed more than ten years ago, as in the desired output, change >= to <.
Output:
4886 Jun 11, 2009 06:51:08 PM Mar 15, 2013 09:24:53 AM 000000285b7925f511b3159a72f80a4a
84848 Feb 12, 2007 12:40:00 PM Apr 07, 2014 09:39:03 AM 000001cec02017ca3eb81ddc4cd1c9ff
264158 Dec 08, 2009 03:28:14 PM Apr 08, 2013 11:52:15 AM 000003020ba74b9d1b6075d3c1b8fcb3
725963 Sep 29, 2008 03:45:21 PM May 17, 2011 08:48:40 AM 0000034b98d29d84ce7b61ee68be7658
1340 Sep 07, 2011 03:36:54 AM Mar 12, 2013 02:55:01 AM 000004ed899e26ae1c9b1ece35a98af1
75264 Jul 28, 2011 05:09:58 PM Jun 07, 2014 04:21:28 PM 000005a09fd2eb706c5800eb06084160
198724 Jul 23, 2012 02:25:58 PM Jan 21, 2013 12:58:07 PM 0000060b9d552c35f281b5033dcfa1b4
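As the commented-out line above hints, GNU awk can compute the threshold itself, without spawning date. A minimal sketch of that variant (same whitespace-separated input and same comparison as above):
$ awk 'BEGIN {
m["Jan"]="01"; m["Feb"]="02"; m["Mar"]="03"; m["Apr"]="04"
m["May"]="05"; m["Jun"]="06"; m["Jul"]="07"; m["Aug"]="08"
m["Sep"]="09"; m["Oct"]="10"; m["Nov"]="11"; m["Dec"]="12"
d = (strftime("%Y") - 10) strftime("%m%d,")  # e.g. 20101204, = ten years before today
}
$9 m[$7] $8 >= d' file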
With GNU awk for time functions, FPAT, and gensub():
$ cat tst.awk
BEGIN {
OFS = ","
FPAT = "([^" OFS "]*)|(\"[^\"]+\")"
now = strftime("%Y %m %d %H %M %S")
year = gensub(/ .*/,"",1,now)
rest = gensub(/[^ ]+/,"",1,now)
secs = mktime((year-10) rest)
thresh = strftime("%Y%m%d%H%M%S",secs)
}
NR > 1 {
dt = $3                       # work on a copy so $0 keeps its quotes
gsub(/"/,"",dt)               # strip the quotes around AccessTime
split(dt,t,/[ ,:]+/)          # t[1]=Mon, t[2]=DD, t[3]=YYYY, t[4..6]=HH MM SS, t[7]=AM/PM
mthNr = (index("JanFebMarAprMayJunJulAugSepOctNovDec",t[1])+2)/3
hour = (t[4] % 12) + (t[7] == "PM" ? 12 : 0)   # 12 AM -> 00, 12 PM -> 12
curr = sprintf("%04d%02d%02d%02d%02d%02d", t[3], mthNr, t[2], hour, t[5], t[6])
}
(NR == 1) || (curr < thresh)
$ awk -f tst.awk file
Size,ModifiedTime,AccessTime,contentid
4096,"Aug 21, 2008 03:54:28 PM","May 12, 2009 04:45:41 PM",0000011afae4d1227c4df57b410ea52c
518,"Aug 22, 2006 02:12:03 PM","Dec 25, 2007 06:48:18 AM",00000233565d1c17c3135a9504c455ca
Based on your shown samples, written and tested with them in GNU awk. This compares the 2nd date of each line (the AccessTime) with the current date. Note that this solution does not deal with leap seconds (I am trying to figure out another way here too).
awk '
BEGIN{
num=split("jan,feb,mar,apr,may,jun,jul,aug,sept,oct,nov,dec",arr1,",")
for(i=1;i<=num;i++){
month[arr1[i]]=sprintf("%02d",i)
}
}
match($0,/[AP]M".*[AP]M"/){
val=substr($0,RSTART,RLENGTH)
sub(/^[AP]M","/,"",val)  #drop the tail of ModifiedTime and the field separator
sub(/ [AP]M"$/,"",val)   #drop the trailing AM/PM marker and closing quote
split(val,array,"[ ,]")  #array[1]=month name, array[2]=day, array[4]=year, array[5]=hh:mm:ss
dat=array[4] OFS month[tolower(array[1])] OFS array[2] OFS array[5]
timE=(systime()-mktime(gensub(/[ ":-]/," ","g",dat)))/(365*24*60*60)
if(timE>10){ print }
}
' Input_file
This will not print the header; if you need it, add FNR==1{print;next} before the match block.
Another shorter awk solution.
$ awk -F, -v ct=$(date "+%s") ' NR>1 { dc="date -d"$4 $5 " \"+%s\""; dc|getline t; yrs=(ct - t)/(24*60*60*365) } yrs>10 || NR==1 ' monte.txt
Size,ModifiedTime,AccessTime,contentid
4096,"Aug 21, 2008 03:54:28 PM","May 12, 2009 04:45:41 PM",0000011afae4d1227c4df57b410ea52c
518,"Aug 22, 2006 02:12:03 PM","Dec 25, 2007 06:48:18 AM",00000233565d1c17c3135a9504c455ca
$
Explanation:
The date command works if we just pass the string representation of access time.
$ date -d"Jun 11, 2009 06:51:08 PM"
Thu Jun 11 18:51:08 IST 2009
It works even without the comma
$ date -d"Jun 11 2009 06:51:08 PM"
Thu Jun 11 18:51:08 IST 2009
So there is no need to clean the data. Just passing $4 and $5 from the input file with comma as delimiter would work.
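Since the awk one-liner compares epoch seconds, the command it actually builds asks date for the +%s format (the exact value printed depends on your timezone; the one below is for IST, as in the transcripts above):
$ date -d"Jun 11, 2009 06:51:08 PM" +%s
1244726468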
For the comparison, I have used epoch seconds:
awk -F, -v ct=$(date "+%s") ' #get the current epoch seconds via ct
NR>1 {
dc="date -d"$4 $5 " \"+%s\""; #build the date command using access time $4 and %5
dc|getline t; #execute the command and get the output in temp t
yrs=(ct - t)/(24*60*60*365) #calculate the number of years between ct and t
}
yrs>10 || NR==1 #print if diff yrs > 10 or NR==1 for header
'
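One caveat with this approach: every data line spawns a date process and the pipe is never closed, so on a very large file awk can run out of file descriptors. A lightly hardened sketch of the same one-liner adds close(dc):
awk -F, -v ct=$(date "+%s") '
NR>1 {
dc="date -d"$4 $5 " \"+%s\""
dc|getline t     # epoch seconds of the access time
close(dc)        # close the pipe so descriptors are not leaked
yrs=(ct - t)/(24*60*60*365)
}
yrs>10 || NR==1
' monte.txt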
Another solution:
If you want to do the 10-year arithmetic inside the date command itself, we just need to drop the closing double quote from $5 so that the text + 10 years can be appended within the quoted date string.
$ awk -F, -v ct=$(date "+%s") ' NR>1 { c5=substr($5,1,length($5)-1);dc="date -d"$4 c5 " + 10 years \" \"+%s\""; dc|getline t } t<ct ' monte.txt
Size,ModifiedTime,AccessTime,contentid
4096,"Aug 21, 2008 03:54:28 PM","May 12, 2009 04:45:41 PM",0000011afae4d1227c4df57b410ea52c
518,"Aug 22, 2006 02:12:03 PM","Dec 25, 2007 06:48:18 AM",00000233565d1c17c3135a9504c455ca
$
I'm very new to bash, and I have a log like this:
10, "a#gmail.com"
2, "b#gmail.com"
3333, "c#hotmail.com", "d#gmail.com"
4, "e#hotmail.com", "f#hotmail.com", "g#gmail.com"
55, "h#gmail.com"
I would like it to be:
10, "a#gmail.com"
2, "b#gmail.com"
3333, "c#hotmail.com"
3333, "d#gmail.com"
4, "e#hotmail.com"
4, "f#hotmail.com"
4, "g#gmail.com"
55, "h#gmail.com"
How do I do it in bash?
The standard UNIX tool for manipulating text is awk:
$ awk 'BEGIN{FS=OFS=", "} {for (i=2;i<=NF;i++) print $1, $i}' file
10, "a#gmail.com"
2, "b#gmail.com"
3333, "c#hotmail.com"
3333, "d#gmail.com"
4, "e#hotmail.com"
4, "f#hotmail.com"
4, "g#gmail.com"
55, "h#gmail.com"
As the first argument, pass the path to the file.
If no argument is passed, the script prints an error and exits with status 1.
It iterates through the file with commas changed to spaces.
On each iteration it takes the next space-separated word: if the word is a number, it is stored and we move on to the next word; if it is non-numeric, the stored number and the current word are printed, separated by a comma. Before the loop, number is initialized to 0, just in case ;)
#!/bin/bash
if [ -z "${1}" ]; then
echo "No file specified"
exit -1
else
file=$1
echo "Parsing file \"$file\":"
fi
number="0"
for word in $(sed "s#,# #g" $file); do
if [[ $line =~ ^[0-9]+ ]] ; then
number=${word};
continue;
else
echo "$number, ${word}"
fi
done
exit 0
Run:
test@LAPTOP-EQKIVD8A:~$ cat new.txt
10, "a@gmail.com"
2, "b@gmail.com"
3333, "c@hotmail.com", "d@gmail.com"
4, "e@hotmail.com", "f@hotmail.com", "g@gmail.com"
55, "h@gmail.com"
test@LAPTOP-EQKIVD8A:~$ ./script.sh new.txt
Parsing file "new.txt":
10, "a@gmail.com"
2, "b@gmail.com"
3333, "c@hotmail.com"
3333, "d@gmail.com"
4, "e@hotmail.com"
4, "f@hotmail.com"
4, "g@gmail.com"
55, "h@gmail.com"
I have a file which contains a line as below:
[root@test-server ~]# cat /tmp/output.log
Done
[root@test-server ~]#
But it has some hidden characters,
[root@test-server ~]# cat -e /tmp/output.log
^[[?25l$
$
^[[?12l^[[?25hDone$
$
[root@test-server ~]#
Here are the byte codes of these characters:
[root@test-server ~]# irb
irb(main):001:0> File.open("/tmp/output.log").each_byte.to_a
=> [27, 91, 63, 50, 53, 108, 10, 10, 27, 91, 63, 49, 50, 108, 27, 91, 63, 50, 53, 104, 68, 111, 110, 101, 10, 10]
irb(main):002:0>
I am trying to remove these using the gsub call below, which is not working.
filename="/tmp/stop_output.log"
content = File.read(filename)
content.gsub!(/\e\[?(\d+)(;(\d+))?l|\e\[?(\d+)(;(\d+))?h/, '')
Not sure if I am missing something with gsub.
Got it fixed myself. I found that I was not escaping the ?. In the original pattern, \[? makes the [ optional (the ? quantifies \[) instead of matching the literal ? that follows in the escape sequence.
irb(main):021:0> filename="/tmp/output.log"
=> "/tmp/output.log"
irb(main):022:0> content = File.read(filename)
=> "\e[?25l\n\n\e[?12l\e[?25hDone\n\n"
irb(main):023:0> content.gsub!(/\e\[\?(\d+)(\d+)l|\e\[\?(\d+)(\d+)h/, '')
=> "\n\nDone\n\n"
irb(main):024:0> puts content
Done
=> nil
irb(main):025:0>
Thanks!
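If you ever need to clean such a file straight from the shell instead, a sed sketch with a broader pattern handles the CSI-style sequences shown here (this assumes GNU sed, which accepts the \x1b hex escape; a complete ANSI stripper would need more cases):
$ sed 's/\x1b\[[0-9;?]*[a-zA-Z]//g' /tmp/output.log
Done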
I have a bash variable number which contains the values 01, 02, 03, 04, 05, 06, 07, 08, 09, 010, 011, 012.
I would like to remove the leading zeros from 010, 011 and 012. I only want to remove the leading zero when the remaining number has two digits.
How can I achieve this?
Thanks in advance!
Try:
a="012"
printf '%02d\n' "$((10#${a}))"
12
Another way:
a="014"
printf "%02d\n" $(echo "obase=10;$a" |bc)
14
another one:
[[ $a =~ ^0+[1-9][0-9]+$ ]] && a="$((10#${a}))"
echo $a
This one removes the leading zeros only when the remaining number has two or more digits.
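Applied to the whole list from the question, a small loop sketch (assuming the values sit space-separated in number):
number="01 02 03 04 05 06 07 08 09 010 011 012"
for n in $number; do
    printf '%02d ' "$((10#$n))"   # base 10 strips the leading zeros; %02d re-pads single digits
done; echo
01 02 03 04 05 06 07 08 09 10 11 12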
The 2nd column in my csv file has duplicates. I want to sum the column-1 values associated with each duplicated column-2 key.
Example csv :
56, cc=DK
49, cc=US
34, cc=GB
32, cc=DE
32, cc=NZ
31, cc=DK
31, cc=GB
31, cc=GB
Example result :
96, cc=GB # where 96 = 34+31+31
87, cc=DK # where 87 = 56+31
32, cc=DE
32, cc=NZ
You can use associative arrays in awk:
awk '{s[$2]+=$1}END{for(k in s)print s[k]", "k}' inFile
Expanding on that for readability, and using sum/key rather than s/k:
{ # Do for each line.
sum[$2] += $1 # Add first field to accumulator,
# indexed by second field.
# initial value is zero.
}
END { # Do this bit when whole file processed.
for (key in sum) # For each key like cc=US:
print sum[key] ", " key # Output the sum and key.
}
Here's a sample run on my box:
pax$ echo;echo '56, cc=DK
49, cc=US
34, cc=GB
32, cc=DE
32, cc=NZ
31, cc=DK
31, cc=GB
31, cc=GB' | awk '{s[$2]+=$1}END{for(k in s)print s[k]", "k}'
32, cc=DE
96, cc=GB
32, cc=NZ
49, cc=US
87, cc=DK
This works despite the fact that the first column is of the form 999, (note the comma at the end), simply because awk, when evaluating strings in a numeric context, uses only the prefix that is valid in that context. Hence 45xyzzy would become 45 and, more importantly, 49, becomes 49.
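You can see that coercion in isolation (a quick sanity check, not part of the solution):
$ echo '49, cc=US' | awk '{ print $1 + 0 }'
49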
Perl solution:
perl -ane '$h{ $F[1] } += $F[0] }{ print "$h{$_}\t$_\n" for keys %h' input.csv
Explanation:
-n processes the input line by line
-a splits the input line on whitespace into fields in the @F array
the hash table %h records the sum for each key (2nd column). It just adds the value of the first column to it.
}{ (called "Eskimo greeting") separates what's executed for each line (-n) from the code to be run after the whole input was processed
It's OK to use awk for such a simple task, but if you have a bunch of similar tasks that you may need to change in the future, it's easy to mess something up.
Since this is a typical database problem, consider using SQLite.
You can:
add column names and remove extra whitespace:
$ cat <(echo "num, name") originalInput.txt | tr -d ' ' > input.csv
import data to temporary sqlite db:
$ sqlite3 --batch temp.db <<EOF!
.mode csv
.import input.csv input
EOF!
select from db:
$ sqlite3 temp.db 'SELECT sum(num), name FROM input GROUP BY name'
32|cc=DE
87|cc=DK
96|cc=GB
32|cc=NZ
49|cc=US
It is a bit more code and uses the external sqlite3 command, but it is significantly less error prone and more flexible. You can easily join several CSV files, use fancy sorting, and more.
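For example, sorting the totals only takes an ORDER BY (a quick sketch against the temp.db built above):
$ sqlite3 temp.db 'SELECT sum(num), name FROM input GROUP BY name ORDER BY sum(num) DESC, name'
96|cc=GB
87|cc=DK
49|cc=US
32|cc=DE
32|cc=NZ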
Also, imagine yourself looking at the code six months later, trying to quickly understand what it does.