How can I create text files after applying a function with lapply for multiple files?

I am a new R user. I want to perform the same set of tasks on many files and write the results to text files. The code looks like this:
ps_files <- dir(pattern = ".ps")
lapply(ps_files, function(x) {
  # read files
  read <- data.table::fread(x, data.table = F, stringsAsFactors = F)
  # merge with bim file
  merge_bim <- dplyr::inner_join(read, bim_df[, c(1:2, 4:6)], by = c("V1" = "V2"))
  # paste Chr for rows in column 1
  merge_bim$V1.y <- paste0("Chr", merge_bim$V1.y)
  # filter significant snps
  sig_snps <- filter(merge_bim, V4.x <= 0.00001)
  # get columns needed by annovar
  output <- sig_snps[, c("V1.y", "V4.y", "V4.y", "V5", "V6")]
  # create text files of results
  write.table(output, sep = "\t", col.names = F, row.names = F, append = F, quote = F)
})
The directory has five files in it. When I run everything up to the output variable, the expected results come out. But when I run through the write.table call, I get NULL for each file (i.e. [[1]] NULL ... [[5]] NULL) and no text files are produced. I have tried suggestions from multiple websites, but the problem persists. I'm also not sure whether lapply is the most appropriate function to use.
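For what it's worth, write.table() returns NULL invisibly, and without a file argument it writes to the console, which would explain both the NULL results and the missing files. A minimal sketch of one possible fix follows; deriving each output name from the input name (the "_annovar.txt" suffix) is an assumption, not something from the original post:

ps_files <- dir(pattern = "\\.ps$")  # anchored pattern, so "." is a literal dot
lapply(ps_files, function(x) {
  read <- data.table::fread(x, data.table = FALSE, stringsAsFactors = FALSE)
  merge_bim <- dplyr::inner_join(read, bim_df[, c(1:2, 4:6)], by = c("V1" = "V2"))
  merge_bim$V1.y <- paste0("Chr", merge_bim$V1.y)
  sig_snps <- dplyr::filter(merge_bim, V4.x <= 0.00001)
  output <- sig_snps[, c("V1.y", "V4.y", "V4.y", "V5", "V6")]
  # the file= argument is what was missing; without it, output goes to the console
  write.table(output, file = paste0(tools::file_path_sans_ext(x), "_annovar.txt"),
              sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE)
})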

Related

"Different row counts implied by arguments" in attempt to plot BAM file data

I'm attempting to use this tutorial to manipulate and plot ATAC-sequencing data. I have all the libraries listed in that tutorial installed and loaded, except that where they use biocLite(BSgenome.Hsapiens.UCSC.hg19) for the human genome, I'm using biocLite(TxDb.Mmusculus.UCSC.mm10.knownGene) for the mouse genome.
Here I have loaded in my BAM file
sorted_AL1.1BAM <-"Sorted_1_S1_L001_R1_001.fastq.gz.subread.bam"
And created an object called TSS, which is transcription start site regions from the mouse genome. I want to ultimately plot the average signal in my read data across mouse transcription start sites.
TSSs <- resize(genes(TxDb.Mmusculus.UCSC.mm10.knownGene), fix = "start", 1)
The problem occurs with the following code:
nucFree <- regionPlot(bamFile = sorted_AL1.1BAM, testRanges = TSSs, style = "point",
                      format = "bam", paired = TRUE, minFragmentLength = 0,
                      maxFragmentLength = 100, forceFragment = 50)
The error is as follows:
Reading Bam header information.....Done
Filtering regions which extend outside of genome boundaries.....Done
Filtered 24528 of 24528 regions
Splitting regions by Watson and Crick strand..Error in DataFrame(..., check.names = FALSE) :
different row counts implied by arguments
I assume my BAM file contains empty values that need to be changed to NAs. My issue is that I'm not sure how to visualize and manipulate BAM files in R in order to do this. Any help would be appreciated.
I tried the following:
data.frame(sorted_AL1.1BAM)
sorted_AL1.1BAM[sorted_AL1.1BAM == ''] <- NA
I expected this to resolve the issue of different row counts, but I get the same error message.
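One thing worth checking, inferred from the log rather than confirmed: "Filtered 24528 of 24528 regions" means every TSS region was discarded before the strand-splitting step, so regionPlot is likely operating on empty input, which would produce the row-count error regardless of any empty values in the BAM. A common cause is mismatched chromosome naming ("chr1" vs "1") or a BAM aligned to a different build than mm10. A small diagnostic sketch:

library(Rsamtools)
# chromosome names as the BAM header reports them
bamChroms <- names(scanBamHeader(sorted_AL1.1BAM)[[1]]$targets)
head(bamChroms)
# chromosome names used by the mm10 TSS regions
head(seqlevels(TSSs))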

Merging of two part files with header as only first line Hadoop

How can I merge two or more part files in Hadoop into a single file, such that the merged output has all the data but only one header, on the first line of the merged output?
File 1
column1|column2|column3
20000|newyork|john
30000|sydney|joseph
File n
column1|column2|column3
60000|delhi|mike
30000|sydney|joseph
Merged output should be
column1|column2|column3
20000|newyork|john
30000|sydney|joseph
60000|delhi|mike
30000|sydney|joseph
Is there any easy way to do this using the hadoop fs -cat command, or by any other method?
Method 1:
Leaving the headers on is fairly complicated without creating an index or rank, since in Pig a collection of tuples is unsorted. Here's what a Pig job looks like, using rank and order by to place the header on top.
header_ranked.pig
HEADER = LOAD 'header.txt' USING PigStorage('|') AS (b0:int,b1:chararray,b2:chararray,b3:chararray);
H1 = LOAD 'header_test' USING PigStorage('|') AS (c1:chararray,c2:chararray,c3:chararray);
F_H1 = FILTER H1 BY NOT (c1 MATCHES 'column1' AND c2 MATCHES 'column2' AND c3 MATCHES 'column3');
R_H1 = RANK F_H1 by c1 DESC DENSE;
U = UNION R_H1, HEADER;
O = ORDER U by rank_F_H1;
F = FOREACH O GENERATE c1,c2,c3;
dump F;
The two sample files, each containing 2 records and a header line, were placed in a directory called header_test. Additionally, in order for this program to work, I had to create a header file in the following format:
header.txt
0|column1|column2|column3
Walking through the code, the file containing the headers (slightly modified to include an additional column, which is the rank value of 0) is loaded into the HEADER alias.
Next the actual data is loaded into the H1 alias, as it grabs all files under the header_test directory.
F_H1 filters out all headers from the data. If you had 20 files that were loaded into H1 from the header_test directory, those 20 headers would now be filtered out of the data.
R_H1 creates a rank on the filtered data, in descending order and without skipping any numbers.
U effectively concatenates the ranked filtered data with the 0|column1|column2|column3 header line.
O orders the data by the rank, so that the header (which has a rank of 0) appears on top.
And finally, F gets rid of the ranking, leaving the clean tuples.
Results
(column1,column2,column3)
(60000,delhi,mike)
(30000,sydney,joseph)
(30000,sydney,joseph)
(20000,newyork,john)
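To run the script, an invocation along these lines should work (local mode shown for testing; the execution environment is an assumption, not part of the original answer):

pig -x local header_ranked.pig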
Method 2:
Basically, leave the headers on one file, strip them from the rest, and then mash them together. Not sure it'll stay sorted, though, haven't tested it thoroughly.
H1 = LOAD 'header_test/header1.txt' USING PigStorage('|') AS (c1:chararray,c2:chararray,c3:chararray);
H2 = LOAD 'header_test/header2.txt' USING PigStorage('|') AS (d1:chararray,d2:chararray,d3:chararray);
F_H2 = FILTER H2 BY NOT (d1 MATCHES 'column1' AND d2 MATCHES 'column2' AND d3 MATCHES 'column3');
U = UNION H1, F_H2;
dump U;
Results
(column1,column2,column3)
(20000,newyork,john)
(30000,sydney,joseph)
(60000,delhi,mike)
(30000,sydney,joseph)
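As for the hadoop fs -cat route asked about in the question, a minimal shell sketch is below; the paths are hypothetical, and it assumes the part files are plain text with identical headers:

hadoop fs -cat /data/part1.txt > merged.txt
hadoop fs -cat /data/part2.txt | tail -n +2 >> merged.txt   # drop the duplicate header
hadoop fs -put merged.txt /data/merged.txt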

generate a different number of columns based on input number

Suppose I have some XML data that has an unknown number of sub-nodes. Is there a method that allows me to pass the number of sub-nodes into the program as a parameter and have it process them? The current code is something like this:
SourceXML = LOAD '$input' using org.apache.pig.piggybank.storage.XMLLoader('$TopNode') as test:chararray;
test2 = LIMIT SourceXML 3;
test3 = FOREACH test2 GENERATE REGEX_EXTRACT(test,'<$tag1>(.*)</$tag1>',1),
REGEX_EXTRACT(test,'<$tag2>(.*)</$tag2>',1);
dump test3;
However, I may not know in advance how many simple elements there are in the target data (i.e. how many $tag# parameters there are). I am hoping to use a .txt parameter file that looks something like this:
input=/inputpath/lowerlevelsofpath
numberSimpleElements=3
tag1=tag1name
tag2=tag2name
tag3=tag3name
with a REGEX_EXTRACT being done for each tag listed in the parameter file. Any ideas on how to accomplish this?
You could do the following (a rough sketch follows the list):
1. Split the text by some regex, so that each row holds one element.
2. Generate a (tag, value) pair for each row.
3. Join the (tag, value) pairs against the list of tags you care about.
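Here is one way that idea might look in Pig. It assumes each simple element sits on its own line with no nesting, and the aliases and the $taglist parameter file (one tag name per line) are hypothetical:

raw = LOAD '$input' USING TextLoader() AS (line:chararray);
-- pull the tag name and the value out of each <tag>value</tag> line
pairs = FOREACH raw GENERATE
        REGEX_EXTRACT(line, '<(\\w+)>.*</\\w+>', 1) AS tag,
        REGEX_EXTRACT(line, '<\\w+>(.*)</\\w+>', 1) AS value;
pairs_f = FILTER pairs BY tag IS NOT NULL;
-- keep only the tags named in the parameter file
wanted = LOAD '$taglist' AS (tag:chararray);
joined = JOIN pairs_f BY tag, wanted BY tag;
dump joined;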

Matlab: reading images from folder does not return filenames in order

I am reading jpg files from a folder. My code is as follows:
inputDir = 'C:\Documents and Settings\Administrator\Desktop\TestImages\';
inputImg = dir([inputDir '*.jpg']);
inputN = {inputImg.name};
for i = 1:numel(inputN)
    fileName = inputN{i};
    fullName = strcat(inputDir, fileName);
    image = imread(fullName);
    %% do some work here
end
All the jpg images in my folder are named in order, in the manner "01.jpg, 02.jpg, ..., 200.jpg". But I found that the files are not read in that order. I tried printing fileName, and it gives the filenames in a seemingly random order, like:
01.jpg, 02.jpg, 03.jpg, 04.jpg, 05.jpg, 06.jpg, 07.jpg, 08.jpg, 09.jpg, 10.jpg, 100.jpg, 101.jpg, 11.jpg, ... 199.jpg, 200.jpg, 24.jpg, 25.jpg, ...
How could I solve this? Thanks.
The file list is in the correct alphabetical order!
Consider using zero-padding when saving.
I.e. save 10.jpg as 0010.jpg.
If you can't change the file names, you have to write your own ordering function, e.g. the sketch below.
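A minimal sketch of such an ordering function, assuming the names are purely numeric before the .jpg extension (inputDir as in the question):

files = dir(fullfile(inputDir, '*.jpg'));
names = {files.name};
nums  = cellfun(@(n) sscanf(n, '%d.jpg'), names);  % numeric part of each name
[~, order] = sort(nums);
names = names(order);  % now 1.jpg, 2.jpg, ..., 10.jpg, ..., 200.jpg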

Load all the images from a directory

I have certain images in a directory and I want to load all those images to do some processing. I tried using the load function.
imagefiles = dir('F:\SIFT_Yantao\demo-data\*.jpg');
nfiles = length(imagefiles); % Number of files found
for i = 1:nfiles
    currentfilename = imagefiles(i).name;
    I2 = imread(currentfilename);
    [pathstr, name, ext] = fileparts(currentfilename);
    textfilename = [name '.mat'];
    fulltxtfilename = [pathstr textfilename];
    load(fulltxtfilename);
    descr2 = des2;
    frames2 = loc2;
    do_match(I1, descr1, frames1, I2, descr2, frames2);
end
I am getting an error, "unable to read xyz.jpg: no such file or directory", where xyz.jpg is the first image in that directory.
I also want to load all image formats from the directory, not just jpg. How can I do that?
You can easily load multiple images of the same type as follows:
function Seq = loadImages(imgPath, imgType)
%imgPath = 'path/to/images/folder/';
%imgType = '*.png'; % change based on image type
images = dir([imgPath imgType]);
N = length(images);
% check that the folder and matching images exist
if( ~exist(imgPath, 'dir') || N<1 )
    display('Directory not found or no matching images found.');
end
% preallocate cell array
Seq{N,1} = [];
for idx = 1:N
    Seq{idx} = imread([imgPath images(idx).name]);
end
end
I believe you want the imread function, not load. See the documentation.
The full path (including the directory) is not held in imagefiles(i).name, just the file name, so imread can't find the file because you haven't told it where to look. If you don't want to change directories, use fullfile to build the full path when reading the file.
You're also using the wrong function for reading the images - try imread.
Other notes: it's best not to use i as a variable name, and your loop is overwriting I2 at every step, so you will end up with only one image, not four.
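A minimal sketch of that path fix, reusing the folder from the question:

imgDir = 'F:\SIFT_Yantao\demo-data\';
imagefiles = dir(fullfile(imgDir, '*.jpg'));
for k = 1:numel(imagefiles)
    I2 = imread(fullfile(imgDir, imagefiles(k).name));  % full path, not just the name
    % ... process or store I2 here before the next iteration replaces it
end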
You can use the imageSet object in the Computer Vision System Toolbox. It loads image file names from a given directory, and gives you the ability to read the images sequentially. It also gives you the option to recurse into subdirectories.
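A short sketch of that approach, assuming the toolbox is installed (the folder is the one from the question):

imgSet = imageSet('F:\SIFT_Yantao\demo-data');  % picks up all supported image formats
for k = 1:imgSet.Count
    I = read(imgSet, k);  % read the k-th image
    % ... process I here
end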
