How to write for loop to run regression on multiple files contained in same folder? - for-loop

I need to develop a script to run a simple OLS on multiple csv files stored in the same folder.
All have the same column names, and the regression will always be based on the same columns ("x_var" and "y_var").
The below code is used to read in the csvs and rename them.
## Read in files from folder
file.List <- list.files(pattern = "\\.csv$")
for (i in 1:length(file.List)) {
  assign(gsub("\\.csv$", "", file.List[i]), read.csv(file.List[i]))
}
However, after this (very initial!) stage I've got a bit lost.
Each dataframe has the same 7 columns: a, b, c, d, x_var, e, y_var.
I need to run a simple OLS using lm(x_var ~ y_var, data = ...) on each dataframe and plot the result, and I assumed a 'for loop' would be the best option, but am not too sure how to do so.
After each regression is run I want to extract the coefficients/R2 etc. into a csv and save the plot separately.
I tried the below, but it has gone very wrong (and is not working at all):
list <- list(gsub(" finalSIRTAnalysis.csv","", file.List))
for(i in length(file.List))
{
lm(x_var ~ y_var, data = [i])
}
I can't even make a start on this and need some advice, if anyone has any good ideas (such as creating an external function first).

I am not sure whether the lm function can compute the results using multiple variable sources. Try merging the data. I have a similar issue: I have 5k files and it is computationally impossible to merge them all. But maybe this answer can help you:
https://stackoverflow.com/a/63770065/14744492

Related

Perl efficiency in writing in file

I am creating a database with some information about files.
e.g: file_name | size | modify_date ...
I was wondering which is more efficient in this situation:
1) For each file, get the info and print it to my output file
foreach my $file (@listOfFiles) {
    my %temporary_hash = get_info_for_file($file);  # store the information for the current file in a temporary hash
    print_info(%temporary_hash, $output_file);      # print the information to my output file
}
2) Store the info for every file in a hash and print the whole hash at once
foreach my $file (@listOfFiles) {
    store_info_in_hash(get_info_for_file($file), %hash);  # for each file, store the information in a global hash
}
print_all_info(%hash, $output_file);  # after I have the information for each file, print the whole hash to my output file
You are wrong to consider efficiency before you have even got your program working.
You should write your code as clearly as possible and debug it. Only then, if it is not running fast enough for your purpose, should you put it through a profiler to discover the bottlenecks that take the most time.
The two options you show will probably not be very different unless your files are enormous.
Doing a benchmark test on the two options, I got these results (which, if I increase the amount of information per file, show even bigger differences between the two).

is it easy to modify this python code to use pandas and would it help if i did?

I have written a Python 2.7 script that reads a CSV file and then does some standard deviation calculations. It works absolutely fine, however it is very, very slow. A CSV I tried with 100 million lines took around 28 hours to complete. I did some googling and it appears that using the pandas module might make this quicker.
I have posted part of the code below. Since I am pretty much a novice when it comes to Python, I am unsure whether using pandas would actually help at all and, if it did, whether the function would need to be completely rewritten.
Just some context for the CSV file: it has 3 columns, the first column is an IP address, the second is a url and the third is a timestamp.
import csv

def parseCsvToDict(filepath):
    with open(filepath) as f:
        ip_dict = dict()
        csv_data = csv.reader(f)
        f.next()  # skip header line
        for row in csv_data:
            if len(row) == 3:  # some lines have more/less than the 3 fields they should have, so this is a cheat to ignore any wrong data
                current_ip, URI, current_timestamp = row
                epoch_time = convert_time(current_timestamp)  # convert each time to epoch
                if current_ip not in ip_dict.keys():
                    ip_dict[current_ip] = dict()
                if URI not in ip_dict[current_ip].keys():
                    ip_dict[current_ip][URI] = list()
                ip_dict[current_ip][URI].append(epoch_time)
    return(ip_dict)
Once the above function has finished, the data is passed to another function that calculates the standard deviation for each IP/URL pair (using numpy.std).
Do you think that using pandas may increase the speed and would it require a complete rewrite or is it easy to modify the above code?
The following should work:
import pandas as pd
colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
df = pd.read_csv(filepath, names=colnames)
# Remove incomplete and redundant rows:
df = df[~df.current_timestamp.isnull() & df.dummy.isnull()]
Notice this assumes you have enough RAM. In your code, you are already assuming you have enough memory for the dictionary, but the latter may be significantly smaller than the memory used by the above, for two reasons.
If it is because most lines are dropped, then just parse the csv by chunks: the arguments skiprows and nrows are your friends, and then pd.concat the pieces.
If it is because IPs/URLs are repeated, then you will want to transform IPs and URLs from normal columns to indices: parse by chunks as above, and on each chunk do
indexed = df.set_index(["current_IP", "URI"]).sort_index()
I expect this will indeed give you a performance boost.
EDIT: ... including a performance boost to the calculation of the standard deviation (hint: df.groupby())
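For what it's worth, here is a rough, untested sketch of how the skiprows/nrows chunking, pd.concat, set_index and the df.groupby() hint might fit together. The chunk size of 1000000 rows is an arbitrary assumption, pandas.errors.EmptyDataError assumes a reasonably recent pandas, and the final line assumes current_timestamp has already been converted to a numeric epoch value:

import pandas as pd
from pandas.errors import EmptyDataError

colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
chunk_rows = 1000000   # arbitrary chunk size
skip = 1               # skip the header line
chunks = []

while True:
    try:
        chunk = pd.read_csv(filepath, names=colnames,
                            skiprows=skip, nrows=chunk_rows)
    except EmptyDataError:
        break  # nothing left to read
    # keep only rows that had exactly 3 fields
    chunk = chunk[~chunk.current_timestamp.isnull() & chunk.dummy.isnull()]
    chunks.append(chunk.set_index(["current_IP", "URI"]).sort_index())
    skip += chunk_rows

df = pd.concat(chunks)

# standard deviation per IP/URL pair (current_timestamp assumed numeric)
std_per_pair = df.groupby(level=["current_IP", "URI"])["current_timestamp"].std()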
I will not be able to give you an exact solution, but here are a couple of ideas.
Based on your numbers, you read 100000000 / 28 / 60 / 60, i.e. approximately 1000 lines per second. Not really slow, but I believe that just reading such a big file can cause a problem.
So take a look at this performance comparison of how to read a huge file. Basically, the suggestion there is that doing this:
file = open("sample.txt")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass  # do something
can give you something like a 3x read boost. I also suggest trying a defaultdict instead of your "check if the key is in the dict, otherwise create [], then append" pattern.
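A minimal sketch of that defaultdict idea, reusing csv_data and convert_time from the question, might look like this:

from collections import defaultdict

# a missing IP maps to a fresh dict, a missing URI to a fresh list,
# so the explicit membership checks and initialisation are no longer needed
ip_dict = defaultdict(lambda: defaultdict(list))

for row in csv_data:
    if len(row) == 3:
        current_ip, URI, current_timestamp = row
        ip_dict[current_ip][URI].append(convert_time(current_timestamp))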
And lastly, not related to Python: working in data analysis, I have found an amazing tool for working with csv/json data. It is csvkit, which allows you to manipulate csv data with ease.
In addition to what Salvador Dali said in his answer: If you want to keep as much of the current code of your script, you may find that PyPy can speed up your program:
“If you want your code to run faster, you should probably just use PyPy.” — Guido van Rossum (creator of Python)

Logic to compare rows in pig

I need logic for the below scenario, which needs to be implemented using Pig scripts. Can anyone please help by providing some ideas on how to do this?
The input contains a column groupName with some values such as 'others' and 'unknown'. These values need to be replaced by the groupName of the previous record.
Input:
id,groupName
123,casc0001
124,casc0002
125,sale0001
126,unknown
127,nave9876
128,casc0001
129,sale0002
130,others
131,casc0004
132,unknown
133,unknown
134,others
135,nave1234
output:
123,casc0001
124,casc0002
125,sale0001
126,sale0001
127,nave9876
128,casc0001
129,sale0002
130,sale0002
131,casc0004
132,casc0004
133,casc0004
134,casc0004
135,nave1234
In the above input, 126,unknown is to be replaced using 125,sale0001; 130,others needs to be replaced using 129,sale0002; and 132,unknown, 133,unknown and 134,others are to be replaced using 131,casc0004.
--Edit--
I tried the LEAD function in Pig, but it only compares n rows at a time, which cannot solve this completely.
Another approach that is working, but for which I am looking for a more optimized version:
- Cogroup the same data set with itself (like Dataset and Dataset_self)
- Filter Dataset.id = Dataset_self.id or Dataset_self.groupname = 'others' or Dataset_self.groupname = 'unknown'
- Generate IdDiff like (Dataset_self.id - Dataset.id), CASE when id = id then (id, group) else (id_self, group)
- Foreach (group id) {
      ordered = order by id, diff, group;
      limited = ordered limit 1;
      generate limited;
  }
This is going to be a complicated problem on a distributed system like Hadoop, especially since your file is going to be split between nodes. In your case, what if 126 happens to be the first record in a new split? Then you will need to trace the previous file split, which is most likely on a different node. Let's say you come up with a MapReduce program to do this; in all likelihood it would be an extremely slow and inefficient way to do it. The solution might be simpler if you are on a single-node system where the splittable property of your input format is false and the number of reducers is set to 1.
In that case you could almost make the argument that a traditional database like Oracle or Teradata might be a better fit for your problem, as you have LEAD or LAG functions readily available which could be used to do exactly what you need.

Sampling 1000 lines from a bunch of gzipped files with PIG

I'm very new to Pig, so I may be going about this the wrong way. I have a bunch of gzipped files in a directory in Hadoop. I'm trying to sample around 1000 lines from all of these files put together. It doesn't have to be exact, so I wanted to use SAMPLE. SAMPLE needs a probability of sampling a line, rather than the number of lines that I need, so I thought I should count up the number of lines among all these files and then simply divide 1000 by that count and use it as the probability. This will work, since I don't need to have exactly 1000 lines at the end. Here is what I got so far:
raw = LOAD '/data_dir';
cnt = FOREACH (GROUP raw ALL) GENERATE COUNT_STAR(raw);
cntdiv = FOREACH cnt GENERATE (float)1000/cnt.$0;
Now I'm not sure how to use the value in cntdiv in SAMPLE. I tried SAMPLE raw cntdiv and SAMPLE raw cntdiv.$0, but they don't work. Can I even use that value in the call to SAMPLE? Maybe there is a much better way of accomplishing what I'm trying to do?
Check out the description in the ticket originally requesting this feature: https://issues.apache.org/jira/browse/PIG-1926
I haven't tested this, but it looks like this should work:
raw = LOAD '/data_dir';
samplerate = FOREACH (GROUP raw ALL) GENERATE 1000.0/COUNT_STAR(raw) AS rate;
thousand = SAMPLE raw samplerate.rate;
The important thing is to refer to your scalar by name (rate), not by position ($0).

Pass data from workspace to a function

I created a GUI and used uiimport to import a dataset into the MATLAB workspace. I would like to pass this imported data to another function in MATLAB. How do I pass this imported dataset into another function? I tried doing diz, but it couldn't pick diz up; it doesn't pick up the data in the MATLAB workspace. Any ideas?
[file_input, pathname] = uigetfile( ...
{'*.txt', 'Text (*.txt)'; ...
'*.xls', 'Excel (*.xls)'; ...
'*.*', 'All Files (*.*)'}, ...
'Select files');
uiimport(file_input);
M = dlmread(file_input);
X = freed(M);
I think that you need to assign the result of this statement:
uiimport(file_input);
to a variable, like this
dataset = uiimport(file_input);
and then pass that to your next function:
M = dlmread(dataset);
This is a very basic feature of Matlab, which suggests to me that you would find it valuable to read some of the on-line help and some of the documentation for Matlab. When you've done that you'll probably find neater and quicker ways of doing this.
EDIT: Well, @Tim, if all else fails, RTFM. So I did, and my previous answer is incorrect. What you need to pass to dlmread is the name of the file to read. So you either use uiimport or dlmread to read the file, but not both. Which one you use depends on what you are trying to do and on the format of the input file. So, go RTFM and I'll do the same. If you are still having trouble, update your question and provide details of the contents of the file.
In your script you have three ways to read the file. Choose one of them depending on your file format. But first I would combine the file name with the path:
file_input = fullfile(pathname,file_input);
I wouldn't use UIIMPORT in a script, since the user can change the way the data is read, and the variable name depends on the file name and the user.
With DLMREAD you can only read numerical data from the file. You can also skip some number of rows or columns with
M = dlmread(file_input,'\t',1,1);
skipping the first row and one column on the left.
Or you can define a range in an Excel-like style. See the DLMREAD documentation for more details.
The filename you pass to DLMREAD must be a string. Don't pass a file handle or any data. You will get "Filename must be a string" if it's not. Easy.
FREAD reads data from a binary file. See the documentation if you really have to do it.
There are many other functions to read the data from file. If you still have problems, show us an example of your file format, so we can suggest the best way to read it.
