Read date stamps from .csv into MATLAB - need faster code (performance)

I have data in a CSV file that looks like this: a timestamp and a data value, approximately 200,000 rows per file.
2015-10-19T22:15:30.12202 +02:00,62.7
2015-10-19T22:15:30.12696 +02:00,58.5
etc
I want to import it into MATLAB and convert the timestamps to a numeric format. The data is stored in an N x 2 matrix, which should look like this:
736154.4935, 62.7
736154.4955, 58.5
etc
Looking only at the date conversion, below is the code I use. rawdata{2} is the cell array of time strings ("2015-10-19T22:15:30.12696 +02:00"). I have to do an extra calculation because datenum only supports millisecond precision and nothing finer.
for i = 1:length(rawdata{2})
    currDate = strsplit(rawdata{2}{i}, ' ');   % drop the '+02:00' timezone offset
    currDate = currDate{1};
    add_days = str2double(currDate(24:25))/(100000*3600*24);   % digits beyond millisecond precision, as a fraction of a day
    timestamp = datenum(currDate, 'yyyy-mm-ddTHH:MM:SS.FFF') + add_days;
    data(i,1) = timestamp;
end
I have 1000+ files that each have 200,000+ rows like this. My code works, but it's too slow to be practical. Is there any way I can speed this up?
EDIT: After profiling as suggested in the comments, I found that strsplit took the most time. Since strsplit isn't really necessary in this case, I was able to cut off a significant amount of time!
Now datenum takes up the majority of the time, but I'm not sure I can get around it. Any suggestions are welcome!
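For reference, here is the sub-millisecond correction spelled out; this is only a minimal sketch in Python rather than MATLAB, purely to make the arithmetic explicit (the character positions and constants mirror the loop above):
# One timestamp from the file; everything after the space is the timezone offset.
stamp = "2015-10-19T22:15:30.12696 +02:00"
date_part = stamp.split(" ")[0]            # "2015-10-19T22:15:30.12696"
# datenum's FFF field only covers the first three fractional digits (.126), so the
# remaining two digits are 96/100000 seconds; MATLAB's currDate(24:25) is zero-based here.
extra_digits = date_part[23:25]                          # "96"
add_days = int(extra_digits) / (100000.0 * 3600 * 24)    # 0.00096 s expressed as a fraction of a day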

Related

Can't Group DateTime by Hour or Dump Result in Apache Pig

I'm working on a project that requires me to find the temporal average (e.g. hour, day, month) for multiple datasets and then do calculations on those averages. The issue I am running into is that Apache Pig will not group by the time nor dump the DateTime values. I've tried several solutions posted here on Stack Overflow and elsewhere, to no avail. I've also read over the documentation and am unable to find a solution.
Here is my code so far:
data = LOAD 'TestData' USING PigStorage(',');
t_data = foreach data generate (chararray)$0 as date, (double)$305 as w_top, (double)$306 as t_top, (double)$310 as w_mid, (double)$311 as t_mid, (double)$315 as w_bot, (double)$316 as t_bot, (double)$319 as pressure;
times = FOREACH t_data GENERATE ToDate(date,'YYYY-MM-ddThh:mm:ss.s') as (date), w_top, t_top, w_mid, t_mid, w_bot, t_bot, pressure;
grp_hourly = GROUP times by GetHour(date);
average = foreach grp_hourly generate flatten(group), times.date, AVG(times.w_top), AVG(times.t_top), AVG(times.w_mid), AVG(times.t_mid), AVG(times.w_bot), AVG(times.t_bot);
And some sample lines from the data:
2011-01-06 15:00:00.0 ,0.07225,-11.36384,-0.045,-11.24599,0.036,-12.44104,1021.707
2011-01-06 15:00:00.1 ,0.09975,-11.34448,-0.0325,-11.26053,0.041,-12.45392,1021.694
2011-01-06 15:00:00.2 ,0.15375,-11.35576,-0.02975,-11.26536,0.01025,-12.44748,1021.407
2011-01-06 15:00:00.3 ,-0.00225,-11.42034,-0.03775,-11.28477,-0.013,-12.44429,1021.764
2011-01-06 15:00:00.4 ,0.01625,-11.33965,-0.0395,-11.27989,-0.0395,-12.42172,1021.484
What I Currently Get as Output:
I get a file with one average of every variable I feed into Apache Pig, without a date and time (most likely the average of each variable over the entire data set). I need the averages for each hour, with the date and time printed in the output. Any tips would be appreciated. Sorry if my post is messy; I don't post to Stack Overflow often.
The date and time pattern string in ToDate doesn't exactly match your data. You have YYYY-MM-ddThh:mm:ss.s, but your data looks like 2011-01-06 15:00:00.0. You need to match the spaces in your data, and since your hours are on a 24-hour clock, you need to use HH instead of hh. Check out the documentation for Java's SimpleDateFormat class. Try this pattern string instead:
times = FOREACH t_data GENERATE ToDate(date,'yyyy-MM-dd HH:mm:ss.s ') as date;
To debug your code, try dumping right after creating the relation times instead of at the end since it seems like the problem is with ToDate().
Savage's answer was correct. The issue I had in my code was that the closing quotation mark came immediately after the pattern, with no trailing space to match the space in my data. So instead of writing mine like this:
(date,'YYYY-MM-ddThh:mm:ss.s')
It should be written like this:
(date,'YYYY-MM-ddThh:mm:ss.s ')

Is it faster to sort dates or sort strings in SPSS? If so, by how much?

I have a dataset of around 5 million records. The dates are read in as strings in the form MM/DD/YYYY HH:MM:SS. I am only interested in the date part, so I read them in with the (A10) format, which effectively trims off the time.
I then do ALTER TYPE DateVar (SDATE10). I do this because I thought sorting dates would be quicker than sorting strings, but I can't find confirmation of this.
Is there a way to time SPSS commands to work out questions like this?
The quickest way I can think of is to use Python for the timestamps and normal SPSS syntax for the sorting - just to replicate real-life conditions.
***Start timer, in python.
begin program.
import time
start = time.time()
end program.
***go out of python, into normal SPSS syntax, and do your stuff.
/*Put the syntax you want to test here
***get back to python, stop timer, and calculate time difference.
begin program.
end = time.time()
print("It took ",end - start, " seconds")
end program.
Check the output log, and it will show you the time.
Not very scientific, but quick and easy.
I recommend re-starting SPSS between tests - just to be sure one test is not affecting the other.
In my experience, ALTER TYPE does something that affects execution times. I'm not sure what, but everything seems slower after an ALTER TYPE, so you might also consider saving and re-opening the file after using it.
You should keep the Date format, because:
Dates in SPSS are actually numbers (formatted in the display as dates, but just numbers all the same), and sorting numbers is faster than sorting strings.
In any case, sorting dates stored as strings will not order the file chronologically (e.g. "12-OCT-2017" sorts after "11-NOV-2017" as a string, even though it is the earlier date).
See another good reason in #horace_vr's comment below.

Is it easy to modify this Python code to use pandas, and would it help if I did?

I have written a Python 2.7 script that reads a CSV file and then does some standard deviation calculations. It works absolutely fine, but it is very, very slow: a CSV I tried with 100 million lines took around 28 hours to complete. I did some googling, and it appears that using the pandas module might make this quicker.
I have posted part of the code below. Since I am pretty much a novice when it comes to Python, I am unsure whether using pandas would actually help at all and, if it did, whether the function would need to be completely rewritten.
Just some context for the CSV file: it has 3 columns; the first is an IP address, the second is a URL and the third is a timestamp.
def parseCsvToDict(filepath):
    with open(filepath) as f:
        ip_dict = dict()
        csv_data = csv.reader(f)
        f.next()  # skip header line
        for row in csv_data:
            if len(row) == 3:  # some lines have more/fewer than the 3 expected fields, so this is a cheat to skip malformed rows
                current_ip, URI, current_timestamp = row
                epoch_time = convert_time(current_timestamp)  # convert each time to epoch
                if current_ip not in ip_dict.keys():
                    ip_dict[current_ip] = dict()
                if URI not in ip_dict[current_ip].keys():
                    ip_dict[current_ip][URI] = list()
                ip_dict[current_ip][URI].append(epoch_time)
    return ip_dict
Once the above function has finished, the data is passed to another function that calculates the standard deviation for each IP/URL pair (using numpy.std).
Do you think that using pandas would increase the speed, and would it require a complete rewrite, or is it easy to modify the above code?
The following should work:
import pandas as pd
colnames = ["current_IP", "URI", "current_timestamp", "dummy"]  # "dummy" catches rows with extra fields
df = pd.read_csv(filepath, names=colnames)
# Keep only complete rows: a timestamp must be present and nothing may spill into "dummy"
df = df[~df.current_timestamp.isnull() & df.dummy.isnull()]
Notice this assumes you have enough RAM. In your code you are already assuming you have enough memory for the dictionary, but the latter may be significantly smaller than the memory used by the above, for two reasons.
If it is because most lines are dropped, then just parse the CSV by chunks: the skiprows and nrows arguments are your friends, followed by pd.concat.
If it is because IPs/URLs are repeated, then you will want to transform IPs and URLs from normal columns to indices: parse by chunks as above, and on each chunk do
indexed = df.set_index(["current_IP", "URI"]).sort_index()
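Put together, a rough sketch of the chunked variant could look like the following (a sketch only: chunksize is used as a convenient stand-in for manual skiprows/nrows bookkeeping, and the file name is an assumption):
import pandas as pd

colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
chunks = []
# header=0 skips the file's header row while keeping our own column names
for chunk in pd.read_csv("access_log.csv", names=colnames, header=0, chunksize=10**6):
    # drop incomplete rows and rows that spilled into the extra "dummy" column
    chunk = chunk[~chunk.current_timestamp.isnull() & chunk.dummy.isnull()]
    chunks.append(chunk.set_index(["current_IP", "URI"]).sort_index())
df = pd.concat(chunks)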
I expect this will indeed give you a performance boost.
EDIT: ... including a performance boost to the calculation of the standard deviation (hint: df.groupby())
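A minimal, hypothetical sketch of that hint, assuming a frame df with the question's columns (the column names and the use of pd.to_datetime to parse the timestamps are assumptions, not the poster's code):
# Epoch seconds from the timestamp strings (format left for pandas to infer)
df["epoch"] = pd.to_datetime(df["current_timestamp"]).astype("int64") // 10**9

# One standard deviation per IP/URI pair in a single vectorised pass,
# replacing the per-pair numpy.std loop; ddof=0 matches numpy.std's default
std_per_pair = df.groupby(["current_IP", "URI"])["epoch"].std(ddof=0)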
I will not be able to give you an exact solution, but here are a couple of ideas.
Based on your numbers, you read 100000000 / 28 / 60 / 60, approximately 1000 lines per second. That is not really slow, but I believe that just reading such a big file can be part of the problem.
So take a look at this performance comparison of ways to read a huge file. Basically, the author suggests that doing this:
file = open("sample.txt")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass  # do something
can give you about a 3x read boost. I also suggest you try defaultdict instead of your "if key not in dict, create a list, else append" pattern.
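For the defaultdict idea, a minimal sketch that reuses csv_data and convert_time from the question (untested, but it shows the shape of the change):
from collections import defaultdict

# ip -> url -> list of epoch times; missing keys are created on first access,
# which removes the two membership checks from the inner loop
ip_dict = defaultdict(lambda: defaultdict(list))
for row in csv_data:
    if len(row) == 3:
        current_ip, URI, current_timestamp = row
        ip_dict[current_ip][URI].append(convert_time(current_timestamp))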
And last, not related to Python itself: working in data analysis, I have found an amazing tool for working with CSV/JSON files. It is csvkit, which lets you manipulate CSV data with ease.
In addition to what Salvador Dali said in his answer: if you want to keep as much of your script's current code as possible, you may find that PyPy can speed up your program:
“If you want your code to run faster, you should probably just use PyPy.” — Guido van Rossum (creator of Python)

Sampling 1000 lines from a bunch of gzipped files with PIG

I'm very new to Pig, so I may be going about this the wrong way. I have a bunch of gzipped files in a directory in Hadoop, and I'm trying to sample around 1000 lines from all of these files put together. It doesn't have to be exact, so I wanted to use SAMPLE. SAMPLE needs a probability of sampling a line rather than the number of lines I need, so I thought I should count up the number of lines in all these files, then simply divide 1000 by that count and use it as the probability. This will work, since I don't need exactly 1000 lines at the end. Here is what I have so far:
raw = LOAD '/data_dir';
cnt = FOREACH (GROUP raw ALL) GENERATE COUNT_STAR(raw);
cntdiv = FOREACH cnt GENERATE (float)100/ct.$0;
Now I'm not sure how to use the value in cntdiv in SAMPLE. I tried SAMPLE raw cntdiv and SAMPLE raw cntdiv.$0, but they don't work. Can I even use that value in the call to SAMPLE? Maybe there is a much better way of accomplishing what I'm trying to do?
Check out the description in the ticket originally requesting this feature: https://issues.apache.org/jira/browse/PIG-1926
I haven't tested this, but it looks like this should work:
raw = LOAD '/data_dir';
samplerate = FOREACH (GROUP raw ALL) GENERATE 1000.0/COUNT_STAR(raw) AS rate;
thousand = SAMPLE raw samplerate.rate;
The important thing is to refer to your scalar by name (rate), not by position ($0).

How can I combine comma format with scientific format in SAS?

I have data that I would like to represent as comma10.2 when less than 1,000,000 and as e10. when greater than or equal to 1,000,000. It seems like there might be a way to do this using a picture format, so I thought I might also make missing values show up as --. This is what I've got so far:
proc format;
    picture DashMiss
        . = '--' (noedit)
        low - <1000000 = "000,009.99"
        1000000 - high = ????;
run;
I'm not sure how to represent scientific notation using picture (hence the question marks). I don't have to just use picture if there's an easier way to do it.
I figured out how to use brackets to add the conditional format:
proc format;
    picture DashMiss
        . = '--' (noedit)
        low - <1000000 = "000,009.99"
        1000000 - high = [e10.];
run;
I believe you could have simply used the best6. format or bestd6.2 to achieve the same results. BEST naturally switches to scientific notation whenever the value does not fit in the specified width (the first of the two numbers).

Resources