How to format number in PL/SQL? - oracle

I need to convert some numbers to chars according to the following logic :
Input => Expected Output | Current Output
0 => 0 | 0.00 << Wrong
.1111 => 0.11 | 0.11
.1 => 0.1 | 0.10 << Wrong
1.111 => 1.11 | 1.11
Basically, I want the minimum number of characters: only the user-friendly characters needed to describe the number.
Here is my current function
to_char(Value,'9999999999999990D99');
As you can see, for 0 it returns 0.00, for example.
Does anyone know how to solve this?
Thanks.

It looks like you want this:
rtrim(to_char(Value,'fm99999999999990D99'),'.')
I.e., add 'fm' to the format mask and then strip the trailing '.' (this assumes the decimal character that D prints in your session is '.'):
Example:
select
to_char(Value,'9999999999999990D99') xx
,to_char(Value,'fm9999999999999990D99') x_fm -- just FM
,rtrim(to_char(Value,'fm99999999999990D99'),'.') x_fm_trim -- FM + rtrim
from xmltable('0, 0.1111, 0.1, 1.111' columns value number path '.');
XX X_FM X_FM_TRIM
-------------------- -------------------- ------------------
0.00 0. 0
0.11 0.11 0.11
0.10 0.1 0.1
1.11 1.11 1.11
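For anyone who wants to check the trimming logic outside Oracle, here is a minimal Python sketch of the same idea: round to two decimals, then drop trailing zeros and a dangling decimal point, which is roughly what FM plus rtrim do above. The name format_min is made up for this illustration, and the sketch assumes half-up rounding; it does not try to reproduce every Oracle edge case.

```python
from decimal import Decimal, ROUND_HALF_UP

def format_min(value):
    """Round to 2 decimals, then drop trailing zeros and a trailing '.'."""
    # Assumption: half-up rounding on the decimal text of the value.
    rounded = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    text = format(rounded, "f")          # fixed-point text, e.g. '0.10'
    return text.rstrip("0").rstrip(".")  # strip zeros first, then a leftover '.'

for v in (0, 0.1111, 0.1, 1.111):
    print(v, "=>", format_min(v))
```

Stripping the zeros before the dot matters: the dot stops the first rstrip, so whole numbers such as 100 keep their integer digits.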


Invalid syntax loop in Stata

I'm trying to run a for loop to make a balance table in Stata (comparing the demographics of my dataset with national-level statistics)
For this, I'm prepping my dataset and attempting to calculate the percentages/averages for some key demographics.
preserve
rename unearnedinc_wins95 unearninc_wins95
foreach var of varlist fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019 { //continuous or binary; to put categorical vars use kwallis test
dis "for variable `var':"
tabstat `var'
summ `var'
local `var'_samplemean=r(mean)
}
clear
set obs 11
gen var=""
gen sample=.
gen F=.
gen pvalue=.
replace var="% Female" if _n==1
replace var="Age" if _n==2
replace var="% Non-white" if _n==3
replace var="HH size" if _n==4
replace var="% Parent" if _n==5
replace var="% Employed" if _n==6
replace var="Savings stock ($)" if _n==7
replace var="Debt stock ($)" if _n==8
replace var="Earned income last mo. ($)" if _n==9
replace var="Unearned income last mo. ($)" if _n==10
replace var="% Under FPL 2019" if _n==11
foreach col of varlist sample {
replace `col'=100*round(`fem_`col'mean', 0.01) if _n==1
replace `col'=round(`age_`col'mean') if _n==2
replace `col'=100*round(`nonwhite_`col'mean', 0.01) if _n==3
replace `col'=round(`hhsize_`col'mean', 0.1) if _n==4
replace `col'=100*round(`parent_`col'mean', 0.01) if _n==5
replace `col'=100*round(`employed_`col'mean', 0.01) if _n==6
replace `col'=round(`savings_wins95_`col'mean') if _n==7
replace `col'=round(`debt_wins95_`col'mean') if _n==8
replace `col'=round(`earnedinc_wins95_`col'mean') if _n==9
replace `col'=round(`unearninc_wins95_`col'mean') if _n==10
replace `col'=100*round(`underfpl2019_`col'mean', 0.01) if _n==11
}
I'm trying to run the following loop, but in the second half of the loop, I keep getting an 'invalid syntax' error. For context, in the first half of the loop (before clearing the dataset), the code stores the average values of the variables as a macro (`var'_samplemean). Can someone help me out and mend this loop?
My sample data:
clear
input byte fem float(age nonwhite) byte(hhsize parent) float employed double(savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95) float underfpl2019
1 35 1 6 1 1 0 2500 0 0 0
0 40 0 4 1 1 0 10000 1043 0 0
0 40 0 4 1 1 0 20000 2400 0 0
0 40 0 4 1 1 .24 20000 2000 0 0
0 40 0 4 1 1 10 . 2600 0 0
Thanks!
Thanks for sharing the snippet of data. Apart from the fact the variable unearninc_wins95 has already been renamed in your sample data, the code runs fine for me without returning an error.
That being said, the columns for your F-statistics and p-values are empty once the loop at the bottom of your code completes. As far as I can see there is no local/varlist called sample which you're attempting to call with the line foreach col of varlist sample{. This could be because you haven't included it in your code, in which case please do, or it could be because you haven't created the local/varlist sample, in which case this could well be the source of your error message.
Taking a step back, there are more efficient ways of achieving what I think you're after. For example, you can get (part of) what you want using the package stat2data (if you don't have it installed already, run ssc install stat2data from the command prompt). You can then run the following code:
stat2data fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019, saving("~/yourstats.dta") stat(count mean)
*which returns:
preserve
use "~/yourstats.dta", clear
. list, sep(11)
+----------------------------+
| _name sN smean |
|----------------------------|
1. | fem 5 .2 |
2. | age 5 39 |
3. | nonwhite 5 .2 |
4. | hhsize 5 4.4 |
5. | parent 5 1 |
6. | employed 5 1 |
7. | savings_wins 5 2.048 |
8. | debt_wins95 4 13125 |
9. | earnedinc_wi 5 1608.6 |
10. | unearninc_wi 5 0 |
11. | underfpl2019 5 0 |
+----------------------------+
restore
This is missing the empty F-statistic and p-value variables you created in your code above, but you can always add them in the same way you have with gen F=. and gen pvalue=.. The presence of these variables though indicates you want to run some tests at some point and then fill the cells with values from them. I'd offer advice on how to do this but it's not obvious to me from your code what you want to test. If you can clarify this I will try and edit this answer to include that.
This doesn't answer your question directly; as others gently point out the question is hard to answer without a reproducible example. But I have several small comments on your code which are better presented in this form.
Assuming that all the variables needed are indeed present in the dataset, I would recommend something more like this:
local myvarlist fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019
local desc `" "% Female" "Age" "% Non-white" "HH size" "% Parent" "% Employed" "Savings stock ($)" "Debt stock ($)" "Earned income last mo. ($)" "Unearned income last mo. ($)" "% Under FPL 2019" "'
gen variable = ""
gen mean = ""
* i counts the variables and indexes both the description list and the output row
local i = 1
foreach var of local myvarlist {
summ `var', meanonly
local this : word `i' of `desc'
replace variable = "`this'" in `i'
if inlist(`i', 1, 3, 5, 6, 11) {
replace mean = strofreal(100 * r(mean), "%2.0f") in `i'
}
else if `i' == 4 {
replace mean = strofreal(r(mean), "%2.1f") in `i'
}
else replace mean = strofreal(r(mean), "%2.0f") in `i'
local ++i
}
This has not been tested.
Points arising include:
Using in is preferable for what you want over testing the observation number with if.
round() is treacherous for rounding to so many decimal places. Most of the time you will get what you want, but occasionally you will get bizarre results arising from the fact that Stata works in binary, like any equivalent program. It is safer to treat rounding as a problem in string manipulation and use display formats as offering precisely what you want.
If the text you want to show is just the variable label for each variable, this code could be simplified further.
The code hints at intent to show other stuff, which is easily done compatibly with this design.
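The rounding point is not specific to Stata. As a small illustration in Python (a sketch only; the value 0.29 is a made-up proportion, not taken from the data above), rounding numerically and then scaling can leave binary noise, whereas formatting the final number as text gives exactly the digits you asked for:

```python
share = 0.29  # hypothetical proportion, e.g. a sample mean of a 0/1 variable

# Numeric route: round to 2 decimals, then scale to a percentage.
scaled = 100 * round(share, 2)

# String route: scale first, then format the result with a display format.
as_text = format(100 * share, ".0f")

print(repr(scaled))  # a float that is close to, but not exactly, 29
print(as_text)       # the text '29'
```

This mirrors the difference between 100*round(x, 0.01) and strofreal(100*x, "%2.0f") in the Stata code.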

Sorting data with gnuplot

Sometimes it might be required to sort data. Unfortunately, gnuplot (as far as I know) doesn't offer this possibility. Of course, you can use external tools like awk, Perl, Python, etc. However, for maximum platform independence and avoiding the installation of additional programs and related complications, and also for curiosity, I was interested whether gnuplot can sort somehow nevertheless.
I will be grateful for comments on improvements, limitations.
Does anybody have ideas how to sort alphanumerical data with gnuplot only?
### Sorting with gnuplot
reset session
# generate some random example data
N = 10
set samples N
RandomNo(n) = sprintf("%.02f",rand(0)*n)
set table $Data
plot '+' u (RandomNo(10)):(RandomNo(10)):(RandomNo(10)) w table
unset table
print $Data
# Settings for sorting
ColNo = 2 # ColumnNo for sorting
stats $Data nooutput # get the number of rows if data is from file
RowCount = STATS_records # with the example data above, of course RowCount=N
# create the sortkey and put it into an array
array SortKey[RowCount]
set table $Dummy
plot $Data u (SortKey[$0+1] = sprintf("%.06f%02d",column(ColNo),$0+1)) w table
unset table
# print $Dummy
# get lines as whole into array
set datafile separator "\n"
array DataSeq[RowCount]
set table $Dummy2
plot $Data u (SortKey[$0+1]):(DataSeq[$0+1] = stringcolumn(1)) with table
unset table
print $Dummy2
set datafile separator whitespace
# do the actual sorting with 'smooth unique'
set table $Dummy3
plot $Dummy2 u 1:0 smooth unique
unset table
# print $Dummy3
# extract the sorted sortkeys
set table $Dummy4
plot $Dummy3 u (SortKey[$0+1]=$2) with table
unset table
# print $Dummy4
# create the table with sorted lines
set table $DataSorted
plot $Data u (DataSeq[SortKey[$0+1]+1]) with table
unset table
print $DataSorted
### end of code
First datablock unsorted data
second datablock intermediate with sortkeys
third datablock sorted data by the second column
Output:
5.24 6.68 3.09
1.64 1.27 9.82
6.44 9.23 7.03
8.14 8.87 3.82
4.27 5.98 0.93
7.96 3.64 6.15
6.21 6.28 6.17
1.52 3.17 3.58
4.24 2.16 8.99
8.73 6.54 1.13
6.68000001 5.24 6.68 3.09
1.27000002 1.64 1.27 9.82
9.23000003 6.44 9.23 7.03
8.87000004 8.14 8.87 3.82
5.98000005 4.27 5.98 0.93
3.64000006 7.96 3.64 6.15
6.28000007 6.21 6.28 6.17
3.17000008 1.52 3.17 3.58
2.16000009 4.24 2.16 8.99
6.54000010 8.73 6.54 1.13
1.64 1.27 9.82
4.24 2.16 8.99
1.52 3.17 3.58
7.96 3.64 6.15
4.27 5.98 0.93
6.21 6.28 6.17
8.73 6.54 1.13
5.24 6.68 3.09
8.14 8.87 3.82
6.44 9.23 7.03
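To make the sort-key trick easier to follow, here is a rough Python sketch of the same idea, outside gnuplot and only for illustration. Each row is paired with a key made of the value in the sort column plus the row number, the keys are sorted, and the rows are emitted in key order. The function name sort_rows_by_column is hypothetical, and the tuple (value, row number) stands in for the string key built with sprintf("%.06f%02d", ...) above.

```python
def sort_rows_by_column(rows, col_no):
    """Sort whitespace-separated text rows by a numeric column (1-based)."""
    keyed = []
    for index, row in enumerate(rows):
        value = float(row.split()[col_no - 1])
        # Value first, row number second: the row number breaks ties and
        # remembers which original line the key belongs to.
        keyed.append((value, index, row))
    keyed.sort()
    return [row for _, _, row in keyed]

rows = ["5.24 6.68 3.09", "1.64 1.27 9.82", "6.44 9.23 7.03"]
for line in sort_rows_by_column(rows, 2):
    print(line)
```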
For curiosity, I wanted to know whether an alphanumerical sort could be implemented with gnuplot code only.
This avoids the need for external tools and ensures maximum platform compatibility.
I haven't heard yet about an external tool which could assist gnuplot and which works under Windows and Linux and MacOS.
I am happy to take comments and suggestions about bugs, simplifications, improvements, performance comparisons, and limits.
For alphanumerical sort, the first stage is alphanumerical string comparison, which to my knowledge does not exist in gnuplot directly. So, the first part Compare.plt is about comparison of strings.
### compare function for strings
# Compare.plt
# function cmp(a,b,cs) returns a<b:-1, a==b:0, a>b:+1
# cs=0: case-insensitive, cs=1: case-sensitive
reset session
ASCII = ' !"' . "#$%&'()*+,-./0123456789:;<=>?@".\
"ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\`".\
"abcdefghijklmnopqrstuvwxyz{|}~"
ord(c) = strstrt(ASCII,c)>0 ? strstrt(ASCII,c)+31 : 0
# comparing char: case-sensitive
cmpcharcs(c1,c2) = sgn(ord(c1)-ord(c2))
# comparing char: case-insensitive
cmpcharci(c1,c2) = sgn(( cmpcharci_o1=ord(c1), ((cmpcharci_o1>96) && (cmpcharci_o1<123)) ?\
cmpcharci_o1-32 : cmpcharci_o1) - \
( cmpcharci_o2=ord(c2), ((cmpcharci_o2>96) && (cmpcharci_o2<123)) ?\
cmpcharci_o2-32 : cmpcharci_o2) )
# function cmp returns a<b:-1, a==b:0, a>b:+1
# cs=0: case-insensitive, cs=1: case-sensitive
cmp(a,b,cs) = ((cmp_r=0, cmp_flag=0, cmp_maxlen=strlen(a)>strlen(b) ? strlen(a) : strlen(b)),\
(sum[cmp_i=1:cmp_maxlen] \
((cmp_flag==0 && (cmp_c1 = substr(a,cmp_i,cmp_i), cmp_c2 = substr(b,cmp_i,cmp_i), \
(cmp_r = (cs==0 ? cmpcharci(cmp_c1,cmp_c2) : cmpcharcs(cmp_c1,cmp_c2) ) )!=0 ? \
(cmp_flag=1, cmp_r) : 0)), 1 )), cmp_r)
cmpsymb(a,b,cs) = (cmpsymb_r = cmp(a,b,cs))<0 ? "<" : cmpsymb_r>0 ? ">" : "="
### end of code
Example:
### example compare strings
load "Compare.plt"
a="Alligator"
b="Tiger"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
a="Tiger"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
a="Zebra"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
### end of code
Result:
-1: Alligator < Tiger
0: Tiger = Tiger
1: Zebra > Tiger
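Since the one-line gnuplot definition of cmp() is rather dense, here is a hedged Python sketch of the logic it is meant to implement: walk both strings character by character and stop at the first difference. The names char_code and cmp_strings are invented for this sketch, and treating a position past the end of the shorter string as code 0 is my assumption about the intended behavior.

```python
def char_code(c, case_sensitive):
    """Code of a printable ASCII character, 0 for anything else (assumption)."""
    code = ord(c) if len(c) == 1 and 32 <= ord(c) <= 126 else 0
    if not case_sensitive and 97 <= code <= 122:
        code -= 32  # fold a-z onto A-Z for case-insensitive comparison
    return code

def cmp_strings(a, b, case_sensitive=False):
    """Return -1 if a<b, 0 if a==b, +1 if a>b."""
    for i in range(max(len(a), len(b))):
        c1 = char_code(a[i:i + 1], case_sensitive)  # '' past the end gives 0
        c2 = char_code(b[i:i + 1], case_sensitive)
        if c1 != c2:
            return -1 if c1 < c2 else 1
    return 0

for a in ("Alligator", "Tiger", "Zebra"):
    print(cmp_strings(a, "Tiger"), a, "vs Tiger")
```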
The second part makes use of the comparison for sorting.
### alpha-numerical sort with gnuplot
reset session
load "Compare.plt"
$Data <<EOD
1 0.123 Orange
2 0.456 Apple
3 0.789 Peach
4 0.987 Pineapple
5 0.654 Banana
6 0.321 Raspberry
7 0.111 Lemon
EOD
stats $Data u 0 nooutput
RowCount = STATS_records
ColSort = 3
array Key[RowCount]
array Index[RowCount]
set table $Dummy
plot $Data u (Key[$0+1]=stringcolumn(ColSort),Index[$0+1]=$0+1) w table
unset table
# Bubblesort
do for [n=RowCount:2:-1] {
do for [i=1:n-1] {
if ( cmp(Key[i],Key[i+1],0) > 0) {
tmp=Key[i]; Key[i]=Key[i+1]; Key[i+1]=tmp
tmp2=Index[i]; Index[i]=Index[i+1]; Index[i+1]=tmp2
}
}
}
set datafile separator "\n"
set table $Dummy # and reuse Key-array
plot $Data u (Key[$0+1]=stringcolumn(1)) with table
unset table
set datafile separator whitespace
set table $DataSorted
plot $Data u (Key[Index[$0+1]]) with table
unset table
print $DataSorted
set grid xtics,ytics
plot [-0.5:RowCount-0.5][0:1.1] $DataSorted u 0:2:xtic(3) w lp lt 7 lc rgb "red"
### end of code
Input:
1 0.123 Orange
2 0.456 Apple
3 0.789 Peach
4 0.987 Pineapple
5 0.654 Banana
6 0.321 Raspberry
7 0.111 Lemon
Output:
2 0.456 Apple
5 0.654 Banana
7 0.111 Lemon
1 0.123 Orange
3 0.789 Peach
4 0.987 Pineapple
6 0.321 Raspberry
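and the output graph (figure not included): column 2 plotted against row order, with the sorted names from column 3 as x tick labels.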

How can I extract parts of one column and append them to other columns?

I have a large .csv file that I need to extract information from and add this information to another column. My csv looks something like this:
file_name,#,Date,Time,Temp (°C) ,Intensity
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,
I want to create two new columns that contain data from the "file_name" column. I want to extract the one or two digits after the text "trap", and the c or u that follows them, and put each into its own new column. After processing, the data should look something like this:
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
I suspect the way to do this is with awk and a regular expression, but I'm not sure how to implement the regular expression. How can I extract parts of one column and append them to other columns?
Using sed you can do this:
sed -E '1s/.*/&,can_und,trap_no/; 2,$s/trap([0-9]+)([a-z]).*/&\2,\1/' file.csv
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
gawk approach:
awk -F, 'NR==1{ print $0,"can_und,trap_no" }
NR>1{ match($1,/^trap([0-9]+)([a-z])/,a); print $0 a[2],a[1] }' OFS="," file
The output:
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
NR==1{ print $0,"can_und,trap_no" } - print the header line
match($1,/^trap([0-9]+)([a-z])/,a) - matches the number following trap word and the next following suffix letter
With use of sed, this will be like:
sed 's/trap\([[:digit:]]\+\)\(.\)\(.*\)$/trap\1\2\3\2,\1/' file
Use sed -i ... to replace it in file.
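If you want to try the regular expression from the sed and gawk answers without touching the file, a small Python sketch like the one below may help. It works on a list of lines and appends the captured letter and number, relying on the fact that the data lines in the question already end with ",,". The function name add_trap_columns is hypothetical.

```python
import re

# Same pattern as in the gawk answer: digits after 'trap', then one lowercase letter.
TRAP_RE = re.compile(r"^trap([0-9]+)([a-z])")

def add_trap_columns(lines):
    """Append can_und and trap_no to each line; the first line is the header."""
    out = []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i == 0:
            out.append(line + ",can_und,trap_no")
            continue
        m = TRAP_RE.match(line)
        if m:
            # Data lines already end in ',,', so only 'letter,number' is appended.
            out.append(f"{line}{m.group(2)},{m.group(1)}")
        else:
            out.append(line)  # leave lines that do not match untouched
    return out

sample = [
    "file_name,#,Date,Time,Temp (°C) ,Intensity",
    "trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,",
    "trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,",
]
print("\n".join(add_trap_columns(sample)))
```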
Using python pandas reader because python is awesome for numerical analysis:
First: I had to modify the data header row so that the columns were consistent by appending 3 commas:
file_name,#,Date,Time,Temp (°C) ,Intensity,,,
There is probably a way to tell pandas to ignore the column differences - but I am yet a noob.
Python code to read your data into columns and create 2 new columns named 'cu_int' and 'cu_char' which contain the parsed elements of the filenames:
import pandas
def main():
    df = pandas.read_csv("file.csv")
    df['cu_int'] = 0     # Add the new columns to the data frame.
    df['cu_char'] = ' '
    for index, df_row in df.iterrows():
        file_name = df_row['file_name'].strip()
        trap_string = file_name.split("_")[0]      # Get the file_name string prior to the underscore
        numeric_offset_beg = len("trap")           # Parse the number following the 'trap' string.
        numeric_offset_end = len(trap_string) - 1  # Leave off the 'c' or 'u' char.
        numeric_value = trap_string[numeric_offset_beg : numeric_offset_end]
        cu_value = trap_string[len(trap_string) - 1]
        # Assign to the current row only; df['cu_int'] = ... would overwrite the whole column.
        df.at[index, 'cu_int'] = int(numeric_value)
        df.at[index, 'cu_char'] = cu_value
    # The pandas dataframe is ready for number crunching.
    # For now just print it out:
    print(df)
if __name__ == "__main__":
    main()
The printed output (note there are inconsistencies in the data set posted - see row 1 as an example):
$ python read_csv.py
file_name # Date Time Temp (°C) Intensity Unnamed: 6 Unnamed: 7 Unnamed: 8 cu_int cu_char
0 trap12u_10733862_150809.txt 1 05/28/15 06:00:00.0 20.424 215.3 NaN NaN NaN 12 u
1 trap12u_10733862_150809.txt 2 05/28/15 07:00:00.0 21.091 1.0 130.2 NaN NaN 12 u
2 trap12u_10733862_150809.txt 3 05/28/15 08:00:00.0 26.195 3.0 100.0 NaN NaN 12 u
3 trap11u_10733862_150809.txt 4 05/28/15 09:00:00.0 25.222 3.0 444.5 NaN NaN 11 u
4 trap11u_10733862_150809.txt 5 05/28/15 10:00:00.0 26.195 3.0 100.0 NaN NaN 11 u
5 trap11u_10733862_150809.txt 6 05/28/15 11:00:00.0 25.902 2.0 927.8 NaN NaN 11 u
6 trap11u_10733862_150809.txt 7 05/28/15 12:00:00.0 25.708 2.0 325.0 NaN NaN 11 u
7 trap12c_10733862_150809.txt 8 05/28/15 13:00:00.0 26.292 3.0 100.0 NaN NaN 12 c
8 trap12c_10733862_150809.txt 9 05/28/15 14:00:00.0 26.390 2.0 66.7 NaN NaN 12 c
9 trap12c_10733862_150809.txt 10 05/28/15 15:00:00.0 26.097 1.0 463.9 NaN NaN 12 c

Try to find better way to manipulate large, multiple txt files through multiple directory using Ruby script

I'm working on collecting test measurement data from products in a manufacturing environment.
The test measurement results of the units under test are generated by the test system. Each result is a 2 MB txt file, and the files are kept in shared folders separated by product.
The folder structure looks like...
LOGS
|-Product1
| |-log_p1_1.txt
| |-log_p1_2.txt
| |..
|-Product2
| |-log_p2_1.txt
| |-log_p2_2.txt
| |..
|-...
My Ruby script iterates through each Product directory under LOGS, reads each log_px_n.txt file, parses the data I need from it and updates the database.
The thing is that all log_px_n.txt files, old and new, must be kept in their current directory, while I need to update my database as soon as a new log_px_n.txt file is generated.
What I do today is iterate through each Product directory, read each individual .txt file, and insert its data into the database if it does not already exist there.
My script looks like this:
Dir['*'].each do |product|
product_dir = File.join(BASE_DIR, product)
Dir.chdir(product_dir)
Dir['*.txt'].each do |log|
if (Time.now - File.mtime(log) < SIX_HOURS_AGO) # take only new files in last six hours
# Here we do..
# - read each 2Mb .txt file
# - extract information from txt file
# - update into database
end
end
end
There are up to 30 different product directories, each containing around 1000 .txt files (2 MB each), and they are growing!
Disk space for storing the .txt files is not an issue; the problem is the time it takes to complete this operation.
Each run of the code block above takes more than 45 minutes.
Is there a better way to deal with this situation?
Update:
As Iced suggested, I tried using a profiler, so I ran the code below and got the following result...
require 'profiler'
class MyCollector
  def initialize(dir, period, *filetypes)
    @dir = dir
    @filetypes = filetypes.join(',')
    @period = period
  end
  def collect
    Dir.chdir(@dir)
    Dir.glob('*').each do |product|
      products_dir = File.join(@dir, product)
      Dir.chdir(products_dir)
      puts "at product #{product}"
      # Print a timestamp for every file modified within the last @period seconds
      Dir.glob("**/*.{#{@filetypes}}").each do |log|
        if Time.now - File.mtime(log) < @period
          puts Time.new
        end
      end
    end
  end
end
path = '//10.1.2.54/Shares/Talend/PRODFILES/LOGS'
SIX_HOURS_AGO = 21600
Profiler__::start_profile
collector = MyCollector.new(path, SIX_HOURS_AGO, "LOG")
collector.collect
Profiler__::stop_profile
Profiler__::print_profile(STDOUT)
The result shows...
at product ABU43E
..
..
..
at product AXF40J
at product ACZ16C
2014-04-21 17:32:07 +0700
at product ABZ14C
at product AXF90E
at product ABZ14B
at product ABK43E
at product ABK01A
2014-04-21 17:32:24 +0700
2014-04-21 17:32:24 +0700
at product ABU05G
at product ABZABF
2014-04-21 17:32:28 +0700
2014-04-21 17:32:28 +0700
2014-04-21 17:32:28 +0700
2014-04-21 17:32:28 +0700
2014-04-21 17:32:28 +0700
2014-04-21 17:32:28 +0700
% cumulative self self total
time seconds seconds calls ms/call ms/call name
32.54 1.99 1.99 43 46.40 265.60 Array#each
24.17 3.48 1.48 41075 0.04 0.04 File#mtime
13.72 4.32 0.84 43 19.56 19.56 Dir#glob
9.13 4.88 0.56 41075 0.01 0.03 Time#-
8.14 5.38 0.50 41075 0.01 0.01 Float#quo
6.65 5.79 0.41 41075 0.01 0.01 Time#now
2.06 5.91 0.13 41084 0.00 0.00 Time#initialize
1.79 6.02 0.11 41075 0.00 0.00 Float#<
1.79 6.13 0.11 41075 0.00 0.00 Float#/
0.00 6.13 0.00 1 0.00 0.00 Array#join
0.00 6.13 0.00 51 0.00 0.00 Kernel.puts
0.00 6.13 0.00 51 0.00 0.00 IO#puts
0.00 6.13 0.00 102 0.00 0.00 IO#write
0.00 6.13 0.00 42 0.00 0.00 File#join
0.00 6.13 0.00 43 0.00 0.00 Dir#chdir
0.00 6.13 0.00 10 0.00 0.00 Class#new
0.00 6.13 0.00 1 0.00 0.00 MyCollector#initialize
0.00 6.13 0.00 9 0.00 0.00 Integer#round
0.00 6.13 0.00 9 0.00 0.00 Time#to_s
0.00 6.13 0.00 1 0.00 6131.00 MyCollector#collect
0.00 6.13 0.00 1 0.00 6131.00 #toplevel
[Finished in 477.5s]
It turns out that it takes nearly 8 minutes (477.5 s) just to walk over the files in each directory and call mtime on them.
Even though each .txt file is 2 MB, it shouldn't take that long, should it?
Any suggestions, please?
Relying on mtime is not robust. In fact, Rails switched from using mtime to hash in naming the versions of asset files.
You should keep a list of file-hash pairs. It can be obtained like this:
require "digest"
file_hash_pair =
Dir.glob("LOGS/**/*")
.select{|f| File.file?(f)}
.map{|f| [f, Digest::SHA1.hexdigest(File.read(f))]}
and perhaps you can keep the content of this in a file as YAML. You can run the code above each time, and whenever file_hash_pair is different from the previous value, you can tell that there was a change. If file_hash_pair.transpose[0] changed, then files were added, removed or renamed. If for a particular [file, hash] pair the hash changed, then the content of that file changed.
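The comparison step described above can be sketched as follows. This is written in Python purely for illustration (the answer itself is Ruby), and the names snapshot and diff_snapshots are made up here. It builds a path-to-SHA1 mapping and compares two such mappings to find added, removed and changed files.

```python
import hashlib
from pathlib import Path

def snapshot(root):
    """Map every file under root to the SHA1 hex digest of its content."""
    return {
        str(p): hashlib.sha1(p.read_bytes()).hexdigest()
        for p in sorted(Path(root).rglob("*"))
        if p.is_file()
    }

def diff_snapshots(old, new):
    """Return (added, removed, changed) path lists between two snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    return added, removed, changed

# Usage sketch with hand-written snapshots instead of a real directory:
previous = {"LOGS/Product1/log_p1_1.txt": "aaa", "LOGS/Product1/log_p1_2.txt": "bbb"}
current = {"LOGS/Product1/log_p1_1.txt": "aaa", "LOGS/Product1/log_p1_2.txt": "ccc",
           "LOGS/Product2/log_p2_1.txt": "ddd"}
print(diff_snapshots(previous, current))
```

Only the files in the added and changed lists would then need to be parsed and written to the database. Note that hashing still reads every file, so this is about robustness rather than a guaranteed speed-up.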

Importing CSV into Postgresql with duplicate values that are not duplicate rows

I am using Rails 4 with a PostgreSQL database, and I have a question about importing a CSV dataset into the database.
Date      Advertiser Name  Impressions  Clicks  CPM   CPA    CPC   CTR
10/21/13  Advertiser 1     77           0       4.05  0.00   0.00  0.00
10/21/13  Advertiser 2     10732        23      5.18  0.00   2.42  0.21
10/21/13  Advertiser 3     16941        14      4.64  11.23  5.62  0.08
10/22/13  Advertiser 1     59           0       3.67  0.00   0.00  0.00
10/22/13  Advertiser 2     10130        15      5.24  53.05  3.54  0.15
10/22/13  Advertiser 3     18400        22      4.59  10.55  3.84  0.12
10/23/13  Advertiser 1     77           0       4.06  0.00   0.00  0.00
10/23/13  Advertiser 2     9520         22      5.58  26.58  2.42  0.23
Using the data above, I need to create a show page for each advertiser.
Ultimately I need a list of advertisers where I can click on any one of them, go to its show page, and see the information relevant to that advertiser (impressions, clicks, CPM, etc.).
Where I am confused is how to import the CSV data when there are rows with duplicate advertisers but the other columns contain relevant, non-duplicate information. How can I set up my database tables so that I do not have duplicate advertisers and can still import and then display the correct information?
You will want to create two models: Advertiser and Site (or maybe Date).
Advertiser "has many" Sites, and Site "belongs to" an Advertiser. This association will allow you to import your data correctly.
See: http://api.rubyonrails.org/classes/ActiveRecord/Associations/ClassMethods.html
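To show what the two-model split does to the data, here is a language-neutral sketch in Python (not Rails code; the function name normalize and the key advertiser_id are assumptions made for this illustration). Each advertiser name is stored once and every CSV row becomes a record that points back to its advertiser.

```python
def normalize(rows):
    """Split flat CSV rows into unique advertisers and per-row records."""
    advertisers = {}  # advertiser name -> generated id
    records = []      # one entry per CSV row, linked by advertiser_id
    for row in rows:
        name = row["Advertiser Name"]
        # Reuse the id if the advertiser was seen before, else assign the next one.
        adv_id = advertisers.setdefault(name, len(advertisers) + 1)
        record = {k: v for k, v in row.items() if k != "Advertiser Name"}
        record["advertiser_id"] = adv_id
        records.append(record)
    return advertisers, records

rows = [
    {"Date": "10/21/13", "Advertiser Name": "Advertiser 1", "Impressions": 77},
    {"Date": "10/21/13", "Advertiser Name": "Advertiser 2", "Impressions": 10732},
    {"Date": "10/22/13", "Advertiser Name": "Advertiser 1", "Impressions": 59},
]
advertisers, records = normalize(rows)
print(advertisers)
print(records)
```

In Rails terms, the advertisers mapping corresponds to the parent table and the records list to the child table holding the foreign key.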
Instead of creating two different models I just created 1 advertiser model and inputted the complete dataset into that model.
require 'csv'
desc "Import advertisers from csv file"
task :import => [:environment] do
CSV.foreach('db/MediaMathPerformanceReport.csv', :headers => true) do |row|
Advertiser.create!(row.to_hash)
end
end
After the data was imported by the above rake task, I simply set up the show action as follows:
def show
@advertiser = Advertiser.where(advertiser_name: advertiser_name)
end
