Can't Group DateTime by Hour or Dump Result Apache Pig - hadoop

I'm working on a project that requires me to find the temporal average (e.g. per hour, day, or month) for multiple datasets and then do calculations on those averages. The issue I am running into is that Apache Pig will not group by the time nor dump the DateTime values. I've tried several solutions posted here on Stack Overflow and elsewhere to no avail. I've also read over the documentation and am unable to find a solution.
Here is my code so far:
data = LOAD 'TestData' USING PigStorage(',');
t_data = foreach data generate (chararray)$0 as date, (double)$305 as w_top, (double)$306 as t_top, (double)$310 as w_mid, (double)$311 as t_mid, (double)$315 as w_bot, (double)$316 as t_bot, (double)$319 as pressure;
times = FOREACH t_data GENERATE ToDate(date,'YYYY-MM-ddThh:mm:ss.s') as (date), w_top, t_top, w_mid, t_mid, w_bot, t_bot, pressure;
grp_hourly = GROUP times by GetHour(date);
average = foreach grp_hourly generate flatten(group), times.date, AVG(times.w_top), AVG(times.t_top), AVG(times.w_mid), AVG(times.t_mid), AVG(times.w_bot), AVG(times.t_bot);
And some sample lines from the data:
2011-01-06 15:00:00.0 ,0.07225,-11.36384,-0.045,-11.24599,0.036,-12.44104,1021.707
2011-01-06 15:00:00.1 ,0.09975,-11.34448,-0.0325,-11.26053,0.041,-12.45392,1021.694
2011-01-06 15:00:00.2 ,0.15375,-11.35576,-0.02975,-11.26536,0.01025,-12.44748,1021.407
2011-01-06 15:00:00.3 ,-0.00225,-11.42034,-0.03775,-11.28477,-0.013,-12.44429,1021.764
2011-01-06 15:00:00.4 ,0.01625,-11.33965,-0.0395,-11.27989,-0.0395,-12.42172,1021.484
What I Currently Get as Output:
I get a file with one average of every variable I feed into Apache Pig, without a date or time (most likely the average of each variable over the entire dataset). I need the averages for each hour, printed together with the hour in the output. Any tips would be appreciated. Sorry if my post is messy; I don't post to Stack Overflow often.

The date and time pattern string in ToDate doesn't exactly match your data. You have YYYY-MM-ddThh:mm:ss.s but your data looks like 2011-01-06 15:00:00.0. You need to match the spaces in your data, and since your hours are on a 24-hour clock, you need HH instead of hh. Check out the documentation for the pattern letters (Pig's ToDate uses Joda-Time's DateTimeFormat, which is very close to Java's SimpleDateFormat). Try this pattern string instead:
times = FOREACH t_data GENERATE ToDate(date,'yyyy-MM-dd HH:mm:ss.s ') as date;
To debug your code, try dumping right after creating the relation times instead of at the end since it seems like the problem is with ToDate().

Savage's answer was correct. The remaining issue in my code was that the closing quotation mark sat right against the date and time pattern, i.e. the pattern was missing the trailing space that matches the space before the comma in my data. So instead of writing the pattern like this:
(date,'yyyy-MM-dd HH:mm:ss.s')
It should be written like this:
(date,'yyyy-MM-dd HH:mm:ss.s ')

Related

Read date stamp from .csv into matlab - need faster code

I have data in a CSV file that looks like this: a timestamp and a data value, approximately 200 000 rows per file.
2015-10-19T22:15:30.12202 +02:00,62.7
2015-10-19T22:15:30.12696 +02:00,58.5
etc
I want to import it into MATLAB and convert the timestamp to a numeric format, so that the data ends up in an N x 2 matrix that looks like this:
736154.4935, 62.7
736154.4955, 58.5
etc
Looking only at the date conversion, below is the code I use. rawdata{2} is the cell array of time strings ("2015-10-19T22:15:30.12696 +02:00"). I have to do an extra calculation since datenum only supports milliseconds and nothing finer.
for i = 1:length(rawdata{2})
    currDate = strsplit(rawdata{2}{i}, ' ');
    currDate = currDate{1};
    add_days = str2double(currDate(24:25))/(100000*3600*24);
    timestamp = datenum(currDate, 'yyyy-mm-ddTHH:MM:SS.FFF')+add_days;
    data(i,1) = timestamp;
end
I have 1000+ files that each have 200 000+ rows like this. My code works, but it's too slow to be practical. Is there any way I can speed this up?
EDIT: After profiling as suggested in the comments, I found that strsplit took the most time. Since strsplit isn't really necessary in this case, I was able to cut off significant time!
Now datenum is what takes up the majority of the time, but I'm not sure I can get around it. Any suggestions welcome!

Query Pandas DataFrame distinctly with multiple conditions by unique row values

I have a DataFrame with event logs:
eventtime, eventname, user, execution_in_s, delta_event_time
The eventname can be, for example, "new_order", "login" or "update_order".
My problem is that I want to know if there is eventname == "error" in the periods between login and update_order by distinct user. A period for me has a start time and an end time.
That all sounded easy until I tried it this morning.
For the time frame of the 24h logs I might not have a pair, because the login might have happened yesterday. I am not sure how to deal with something like that.
delta_event_time is a computed column: the eventtime minus the execution_in_s. I am considering these the real timestamps. I computed them like this:
event_frame["delta_event_time"] = event_frame["eventtime"] - pandas.to_timedelta(event_frame["execution_in_s"], unit='s')
I tried something like this:
events_keys = numpy.array(["login", "new_order"])
users = numpy.unique(event_frame["user"])
for user in users:
    event_name = event_frame[event_frame["eventname"].isin(events_keys) & (event_frame["user"] == user)]["eventname"]
But this is not using the time periods.
I know that Pandas has between_time() but I don't know how to query a DataFrame with periods, by user.
Do I need to iterate over the DataFrame with .iterrows() to calculate the start and end time tuples? In my tries it takes a lot of time to do that, even for basic things, and I somehow think that this would make Pandas useless for the task.
I tried event_frame.sort(["user", "eventname"]), which works nicely in that I can already see the relevant lines. I did not have any luck with .groupby("user"), because it mixed users even though they are unique row values.
Maybe a better workflow would be to dump the DataFrame into MongoDB instead of pursuing a Pandas solution for this analysis. I am not sure, because I am new to the framework.
Here is pseudocode for what I think will solve your problem. I will update it if you share a sample of your data.
grouped = event_frame.groupby('user') # This should work.
# I cannot believe that it didn't work for you! I won't buy it till you show us proof!
for name, group in grouped:
    group = group.set_index('eventtime') # This will make it easier to work with time series.
    # I am changing the index here because different users may have similar or
    # overlapping times, and it is a pain in the neck to resolve indexing conflicts.
    login_ind = group[group['eventname'] == 'login'].index
    error_ind = group[group['eventname'] == 'error'].index
    update_ind = group[group['eventname'] == 'update_order'].index
    # Here you can compare the lists login_ind, error_ind and update_ind however you wish.
    # Note that each list can even have a length of 0.
    # The user name is stored in the variable name, so you can get it from there.
The best approach might be to write a function that does the comparing. Then you can collect the results in a dict declared as error_user = {}, and inside for name, group in grouped: call your function like so: error_user[name] = function_which_checks_when_user_saw_error(login_ind, error_ind, update_ind).
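Untested, but a minimal sketch of what that comparing function could look like, assuming the three .index objects come from the grouped loop above and that each login should be paired with the first update_order after it (that pairing rule is my assumption; adjust it to whatever defines a period for you):
def function_which_checks_when_user_saw_error(login_ind, error_ind, update_ind):
    periods = []
    for login_time in login_ind:
        # Pair each login with the first update_order that comes after it.
        later_updates = update_ind[update_ind > login_time]
        if len(later_updates) == 0:
            continue  # e.g. the matching update_order falls outside the 24h log window
        end_time = later_updates[0]
        # Was there any error event strictly inside this period?
        had_error = bool(((error_ind > login_time) & (error_ind < end_time)).any())
        periods.append((login_time, end_time, had_error))
    return periods
error_user[name] would then hold, per user, a list of (start, end, had_error) tuples that you can filter for had_error == True.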

is it easy to modify this python code to use pandas and would it help if i did?

I have written a Python 2.7 script that reads a CSV file and then does some standard deviation calculations. It works absolutely fine, but it is very, very slow: a CSV I tried with 100 million lines took around 28 hours to complete. I did some googling and it appears that using the pandas module might make this quicker.
I have posted part of the code below. Since I am pretty much a novice when it comes to Python, I am unsure whether using pandas would actually help at all, and if it did, whether the function would need to be completely rewritten.
Just some context for the CSV file: it has 3 columns, the first column is an IP address, the second is a URL and the third is a timestamp.
import csv

def parseCsvToDict(filepath):
    with open(filepath) as f:
        ip_dict = dict()
        csv_data = csv.reader(f)
        f.next() # skip header line
        for row in csv_data:
            # Some lines in the csv have more/fewer than the 3 fields they should have,
            # so this is a cheat to keep the script working by ignoring any wrong data.
            if len(row) == 3:
                current_ip, URI, current_timestamp = row
                epoch_time = convert_time(current_timestamp) # convert each time to epoch
                if current_ip not in ip_dict.keys():
                    ip_dict[current_ip] = dict()
                if URI not in ip_dict[current_ip].keys():
                    ip_dict[current_ip][URI] = list()
                ip_dict[current_ip][URI].append(epoch_time)
    return ip_dict
Once the above function has finished, the data is passed to another function that calculates the standard deviation for each IP/URL pair (using numpy.std).
Do you think that using pandas would increase the speed, and would it require a complete rewrite, or is it easy to modify the above code?
The following should work:
import pandas as pd
colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
df = pd.read_csv(filepath, names=colnames)
# Remove incomplete and redundant rows:
df = df[~df.current_timestamp.isnull() & df.dummy.isnull()]
Notice this assumes you have enough RAM. In your code you are already assuming you have enough memory for the dictionary, but the latter may be significantly smaller than the memory used by the DataFrame above, for two reasons.
If it is because most lines are dropped, then just parse the CSV in chunks: the arguments skiprows and nrows are your friends, followed by pd.concat (see the sketch after the next snippet).
If it is because IPs/URLs are repeated, then you will want to turn the IPs and URLs from normal columns into indices: parse in chunks as above, and on each chunk do
indexed = df.set_index(["current_IP", "URI"]).sort_index()
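Untested, but putting both suggestions together, a rough sketch of the chunked variant could look like this (I am using read_csv's chunksize argument here, which amounts to the same thing as skiprows/nrows; the chunk size of one million rows is arbitrary):
chunks = []
for chunk in pd.read_csv(filepath, names=colnames, chunksize=1000000):
    # Same row filter as above, applied chunk by chunk.
    chunk = chunk[~chunk.current_timestamp.isnull() & chunk.dummy.isnull()]
    chunks.append(chunk.set_index(["current_IP", "URI"]).sort_index())
df = pd.concat(chunks)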
I expect this will indeed give you a performance boost.
EDIT: ... including a performance boost to the calculation of the standard deviation (hint: df.groupby())
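For illustration, a minimal sketch of that groupby-based standard deviation, working from the df built at the top of this answer (so current_IP and URI are still ordinary columns) and assuming pd.to_datetime can parse your timestamp format:
# Convert the timestamp strings to epoch seconds without a Python-level loop.
ts = pd.to_datetime(df["current_timestamp"])
df["epoch_time"] = (ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

# Standard deviation per (IP, URI) pair; ddof=0 matches numpy.std's default.
std_per_pair = df.groupby(["current_IP", "URI"])["epoch_time"].std(ddof=0)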
I will not be able to give you an exact solution, but here are a couple of ideas.
Based on your numbers, you read 100000000 / 28 / 60 / 60, i.e. approximately 1000 lines per second. That is not really slow, but I believe that just reading such a big file can be a problem.
So take a look at this performance comparison of ways to read a huge file. Basically, the suggestion there is that doing this:
file = open("sample.txt")
while 1:
lines = file.readlines(100000)
if not lines:
break
for line in lines:
pass # do something
can give you roughly a 3x read boost. I also suggest you try defaultdict instead of your "if key not in dict, create [], otherwise append" pattern.
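For illustration only, the parsing loop rewritten with defaultdict might look roughly like this (it reuses your own convert_time function and keeps the same {ip: {uri: [times]}} layout):
import csv
from collections import defaultdict

def parseCsvToDict(filepath):
    ip_dict = defaultdict(lambda: defaultdict(list))
    with open(filepath) as f:
        csv_data = csv.reader(f)
        next(csv_data)  # skip the header line
        for row in csv_data:
            if len(row) == 3:  # ignore malformed rows, as in your version
                current_ip, URI, current_timestamp = row
                ip_dict[current_ip][URI].append(convert_time(current_timestamp))
    return ip_dict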
And last, not related to Python: working in data analysis, I have found an amazing tool for working with CSV/JSON. It is csvkit, which lets you manipulate CSV data with ease.
In addition to what Salvador Dali said in his answer: if you want to keep as much of your script's current code as possible, you may find that PyPy can speed up your program:
“If you want your code to run faster, you should probably just use PyPy.” — Guido van Rossum (creator of Python)

Sampling 1000 lines from a bunch of gzipped files with PIG

I'm very new to Pig, so I may be going about this the wrong way. I have a bunch of gzipped files in a directory in Hadoop, and I'm trying to sample around 1000 lines from all of these files put together. It doesn't have to be exact, so I wanted to use SAMPLE. SAMPLE needs the probability of sampling a line rather than the number of lines I need, so I thought I should count up the number of lines across all these files, then simply divide 1000 by that count and use it as the probability. This will work, since I don't need exactly 1000 lines at the end. Here is what I have so far:
raw = LOAD '/data_dir';
cnt = FOREACH (GROUP raw ALL) GENERATE COUNT_STAR(raw);
cntdiv = FOREACH cnt GENERATE (float)1000/$0;
Now I'm not sure how to use the value in cntdiv in SAMPLE. I tried SAMPLE raw cntdiv and SAMPLE raw cntdiv.$0, but they don't work. Can I even use that value in the call to SAMPLE? Maybe there is a much better way of accomplishing what I'm trying to do?
Check out the description in the ticket originally requesting this feature: https://issues.apache.org/jira/browse/PIG-1926
I haven't tested this, but it looks like this should work:
raw = LOAD '/data_dir';
samplerate = FOREACH (GROUP raw ALL) GENERATE 1000.0/COUNT_STAR(raw) AS rate;
thousand = SAMPLE raw samplerate.rate;
The important thing is to refer to your scalar by name (rate), not by position ($0).

How can I QUICKLY get a string from one of the first couple lines of a long CSV at a remote URL?

I'm working on an assignment where I retrieve several stock prices online, using Yahoo's stock price system. Unfortunately, the Yahoo API I'm required to use returns a .csv file that apparently contains a line for every single day the stock has been traded, which is at least 5 thousand lines for the stocks I'm working with and over 10 thousand lines for some of them (example).
I only care about the current price, though, which is in the second line.
I'm currently doing this:
require 'open-uri'

def get_ticker_price(stock)
  open("http://ichart.finance.yahoo.com/table.csv?s=#{stock}") do |io|
    io.read.split(',')[10].to_f
  end
end
…but it's really slow.
Is all the delay coming from getting the file, or is there some from the way I'm handling it? Is io.read reading the entire file?
Is there a way to download only the first couple lines from the Yahoo CSV file?
If the answers to questions 1 & 2 don't render this one irrelevant, is there a better way to process it that doesn't require looking at the entire file (assuming that's what io.read is doing)?
You can use query string date-range parameters to reduce the data to just the current date.
Example for MO on 7/13/2012 (the start/end month parameters are zero-indexed, 00-11):
http://ichart.finance.yahoo.com/table.csv?s=MO&a=06&b=13&c=2012&d=6&e=13&f=2012&g=d
API description here:
http://etraderzone.com/free-scripts/47-historical-quotes-yahoo.html
