How to split a file into chunks with 1000 lines in each chunk in Bash? [duplicate] - bash

This question already has answers here:
How can I split a large text file into smaller files with an equal number of lines?
(12 answers)
Closed 7 years ago.
I have a file that is 6200 lines long that looks like:
chrom chromStart chromEnd score a a.1
1 chr1 834359 867552 4 0.020979021 0.0000000000
2 chr1 1880283 1940830 9 0.075757576 0.0000000000
3 chr1 1960387 2064958 13 0.115093240 0.0006596306
4 chr1 2206040 2249092 5 0.019230769 0.0000000000
5 chr1 2325759 2408930 11 0.021296885 0.0080355001
I need to break the file into files that are 1000 lines long. How can this be done?

This sounds like a case for the POSIX split command:
split -l 1000 file-to-be-split prefix.
This splits file-to-be-split into pieces of 1000 lines each (the last piece holds whatever lines remain). The output names start with prefix. and end with aa, ab, ac, ...
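A minimal sketch of the command in action, assuming a stand-in file generated with seq. The names input.txt and chunk. are placeholders, not taken from the question:

```shell
# Work in a scratch directory so the generated pieces are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for the real data: 6200 numbered lines.
seq 1 6200 > input.txt

# Split into pieces of at most 1000 lines named chunk.aa, chunk.ab, ...
split -l 1000 input.txt chunk.

# Report the line count of every piece.
wc -l chunk.*
```

With 6200 input lines this should leave six full pieces and a seventh holding the remaining 200 lines. Note that a header line, if the file has one, ends up only in the first piece.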

Related

How to replace the value of multiple columns in a file based on two columns in another file with bash?

I'm trying to replace the values of multiple columns in a file using awk. I want to use awk because the file is very large and cannot be loaded into memory; I have already tried doing it with pandas (Python).
I have a large database as a text file. Here is an example of the data in the file (tab-delimited):
CHROM POS REF ALT GT_00 d_GT_00 c_GT_00 de_GT_00 can_GT_00 epi_GT_00
chr1 10 T A 7 1 1 2 5 7
chr1 10 T A 7 1 1 3 0 1
chr1 10 T G 7 2 1 1 8 2
chr1 11 None None 2 0 0 0 5 4
chr1 11 G T 2 1 0 0 2 3
If the first two columns (CHROM,POS) are the same in the rows, I have to sum the values of the columns that contain '_00' in the header.
So, the expected output, is:
CHROM POS REF ALT GT_00 d_GT_00 c_GT_00 de_GT_00 can_GT_00 epi_GT_00
chr1 10 T A 21 4 3 6 13 10
chr1 10 T A 21 4 3 6 13 10
chr1 10 T G 21 4 3 6 13 10
chr1 11 None None 4 1 0 0 7 7
chr1 11 G T 4 1 0 0 7 7
I don't know how to do this directly, because I'm very new to programming, so I first compute the sums with this awk code:
awk -F'\t' 'FNR==1{next};
{keys[$1"\t"$2]
for (i=5;i<=10;i++)
{sum[$1"\t"$2, i] += $i}
}END {for (key in keys) { printf "%s", key
for (i=5;i<=10;i++) {printf "%s%s", "\t", sum[key,i]} printf "\n"}}' OFS='\t' out.txt
Running this code on the first text file (named out.txt here), I get:
chr1 10 21 4 3 6 13 10
chr1 11 4 1 0 0 7 7
Now I'm trying to replace the six values in every chr1 10 row with the values from the first line of this output, and the six values in every chr1 11 row with the values from the second line.
I have managed to change the value in one column with this code:
awk -F"\t" 'NR==FNR{h[$1"\t"$2]=$3;next}
{
printf $1"\t"$2"\t"$3"\t"$4"\t"h[$1"\t"$2]"\t";
for (i=6;i<=NF;i++)
{printf "%s",$i "\t"};
printf "\n"
}' OFS="\t" file1 file2
but need to do the same for all the columns.
How can I do it using a similar code?
Note: I have more columns that don't have '_00' in the header name.
Here you go with a memory-efficient Perl one-liner which should solve your problem. You may need to add the correct input field separator, e.g. -F'\t', and a regex to skip comment lines.
perl -lane 'if(!$prev || $prev eq "$F[0]:$F[1]"){push @r,[@F[4..$#F]]; push @snp,join"\t",@F[0..3]}else{for $r (@r){$o[$_]+=$$r[$_] for 0..scalar(@$r)-1}; print join"\t",($_,@o) for @snp; @snp=(join"\t",@F[0..3]); @o=(); @r=([@F[4..$#F]])} $prev="$F[0]:$F[1]"; END{for $r (@r){$o[$_]+=$$r[$_] for 0..scalar(@$r)-1}; print join"\t",($_,@o) for @snp;}' < \
<(echo -e "chr1 10 A T 1 2 3\nchr1 10 A G 1 2 3\nchr1 11 A T 4 5 6\nchr2 12 G C 7 8 9")
formatted version with comments for you :)
if(!$prev || $prev eq "$F[0]:$F[1]"){ # CHROM:POS compare to previous line
    push @r,[@F[4..$#F]]; # store values in array of array reference
    push @snp,join"\t",@F[0..3] # store CHROM,POS,REF,ALT
}else{
    for $r (@r){ # CHROM:POS is new
        $o[$_]+=$$r[$_] for 0..scalar(@$r)-1 # sum up values in array references
    };
    print join"\t",($_,@o) for @snp; # join CHROM,POS,REF,ALT with summed values
    @snp=(join"\t",@F[0..3]); # re-initialize
    @o=();
    @r=([@F[4..$#F]])
}
$prev="$F[0]:$F[1]"; # store CHROM:POS info
END{ # print final lines
    for $r (@r){
        $o[$_]+=$$r[$_] for 0..scalar(@$r)-1
    };
    print join"\t",($_,@o) for @snp;
}
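For comparison with the awk attempt in the question, here is a two-pass awk sketch. It is not part of the answer above and has not been tried on the real data. It assumes the columns to sum are fields 5 to 10, as hard-coded in the question, and it reads the file twice so that only the per-key sums are held in memory. The file name db.tsv is a placeholder:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question, tab-delimited (db.tsv is a placeholder name).
{
    printf 'CHROM\tPOS\tREF\tALT\tGT_00\td_GT_00\tc_GT_00\tde_GT_00\tcan_GT_00\tepi_GT_00\n'
    printf 'chr1\t10\tT\tA\t7\t1\t1\t2\t5\t7\n'
    printf 'chr1\t10\tT\tA\t7\t1\t1\t3\t0\t1\n'
    printf 'chr1\t10\tT\tG\t7\t2\t1\t1\t8\t2\n'
    printf 'chr1\t11\tNone\tNone\t2\t0\t0\t0\t5\t4\n'
    printf 'chr1\t11\tG\tT\t2\t1\t0\t0\t2\t3\n'
} > db.tsv

# The same file is passed twice: NR == FNR is true only during the first read.
result=$(awk -F'\t' -v OFS='\t' '
    NR == FNR {                     # first read: build the per-key sums
        if (FNR > 1)
            for (i = 5; i <= 10; i++) sum[$1, $2, i] += $i
        next
    }
    FNR > 1 {                       # second read: overwrite columns 5-10
        for (i = 5; i <= 10; i++) $i = sum[$1, $2, i]
    }
    { print }                       # the header passes through unchanged
' db.tsv db.tsv)

printf '%s\n' "$result"
```

Unlike the Perl version, this does not require rows with the same CHROM and POS to be adjacent, at the cost of a second read of the file. Columns outside the 5 to 10 range are left as they are.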

Compare three files using two columns and print unique entries in each file using awk/sed [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I have three files with following format:
$ cat a.bed
chr1 6 6 aa
chr1 8 8 bb
chr2 22 22 aa
chr3 24 24 bb
$ cat b.bed
chr1 12 12 cc
chr1 6 6 dd
chr5 14 14 cc
$ cat c.bed
chr1 8 8 ss
chr4 11 11 dd
chr1 6 6 aa
I want to compare these files using first two columns and print information for each row whether it is present in one file or multiple files, like:
chr1 6 6 aa 3 a.bed,b.bed,c.bed
chr1 8 8 bb 2 a.bed,c.bed
chr2 22 22 aa 1 a.bed
chr3 24 24 bb 1 a.bed
chr1 12 12 cc 1 b.bed
chr5 14 14 cc 1 b.bed
chr4 11 11 dd 1 c.bed
where the 5th column gives the number of files the row is present in and the 6th column gives the names of those files.
awk to the rescue!
$ awk '{a[$1,$2]=(($1,$2) in a?a[$1,$2]",":$0 OFS)FILENAME}
END{for(k in a) print a[k]}' {a,b,c}.bed
The results won't be in the same order, though, and this prints the file list without the count column.
Explanation
x=c?a:b is the ternary operator: it sets x to a or b depending on the value of c (similar to if-then-else). Here the map entry for key ($1,$2) is either extended with a comma and FILENAME (if the key already exists) or initialized to the current line followed by FILENAME. The END block just iterates over this map and prints the values.
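A variant sketch that keeps the first-seen order and also prints the count column from the question. It assumes each (column 1, column 2) pair occurs at most once per file, so that counting rows equals counting files. The array names line, order, count and files are illustrative:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample files from the question.
printf '%s\n' 'chr1 6 6 aa' 'chr1 8 8 bb' 'chr2 22 22 aa' 'chr3 24 24 bb' > a.bed
printf '%s\n' 'chr1 12 12 cc' 'chr1 6 6 dd' 'chr5 14 14 cc' > b.bed
printf '%s\n' 'chr1 8 8 ss' 'chr4 11 11 dd' 'chr1 6 6 aa' > c.bed

result=$(awk '
    {
        key = $1 SUBSEP $2
        if (!(key in line)) {       # first sighting: keep the row and its position
            line[key] = $0
            order[++n] = key
            files[key] = FILENAME
        } else {
            files[key] = files[key] "," FILENAME
        }
        count[key]++
    }
    END {
        # Walk the keys in the order they were first seen.
        for (i = 1; i <= n; i++) {
            k = order[i]
            print line[k], count[k], files[k]
        }
    }
' a.bed b.bed c.bed)

printf '%s\n' "$result"
```

The row text printed for a key is the one from the file where the key first appeared, which is what the expected output in the question shows.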
Try these four lines of gawk (doesn't appear to work in awk):
gawk '{print $0, FILENAME}' a.bed > abc.bed
gawk '{print $0, FILENAME}' b.bed >> abc.bed
gawk '{print $0, FILENAME}' c.bed >> abc.bed
gawk '{f = $5;k=$1 " " $2 " " $3 " " $4;if(k in a){a[k] = a[k] "," f}else{a[k] = f};c[k]++};END{for(k in a){print k, c[k], a[k]}}' abc.bed
Single char variables for brevity:
f - file name,
k - key, i.e. the data,
a - an array of keys,
c - an array of key counts.
Er, if I am reading it right, your input and output data samples don't match, e.g. there are only 2 'chr1 6 6 aa' not 3.

How to combine column from multiple text files? [duplicate]

This question already has answers here:
How can I sum values in column based on the value in another column?
(5 answers)
Combine text from two files, output to another [duplicate]
(2 answers)
Closed 6 years ago.
I want to extract and combine a certain column from a bunch of text files into a single file as shown.
File1_example.txt
A 123 1
B 234 2
C 345 3
D 456 4
File2_example.txt
A 123 5
B 234 6
C 345 7
D 456 8
File3_example.txt
A 123 9
B 234 10
C 345 11
D 456 12
...
..
.
File100_example.txt
A 123 55
B 234 66
C 345 77
D 456 88
How can I loop through my files of interest and paste these columns together so that the final result is like below without having to type out 1000 unique file names?
1 5 9 ... 55
2 6 10 ... 66
3 7 11 ... 77
4 8 12 ... 88
Try this:
paste File[0-9]*_example.txt | awk '{i=3;while($i){printf("%s ",$i);i+=3}printf("\n")}'
Example:
File1_example.txt:
A 123 1
B 234 2
C 345 3
D 456 4
File2_example.txt:
A 123 5
B 234 6
C 345 7
D 456 8
Run command as:
$ paste File[0-9]*_example.txt | awk '{i=3;while($i){printf("%s ",$i);i+=3}printf("\n")}'
Output:
1 5
2 6
3 7
4 8
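Two caveats apply to the command above: while($i) stops at the first value that is 0 or empty, and a glob expands in collation order rather than numeric order, so File10 does not come after File9. A sketch that sidesteps both, assuming every file has exactly three columns and the files are numbered consecutively (three files here, for brevity):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# First three sample files from the question.
printf '%s\n' 'A 123 1' 'B 234 2' 'C 345 3' 'D 456 4' > File1_example.txt
printf '%s\n' 'A 123 5' 'B 234 6' 'C 345 7' 'D 456 8' > File2_example.txt
printf '%s\n' 'A 123 9' 'B 234 10' 'C 345 11' 'D 456 12' > File3_example.txt

# Build the file list in numeric order; printf repeats its format per argument.
files=$(printf 'File%d_example.txt ' $(seq 1 3))

# $files is deliberately unquoted so that it splits into separate file names.
result=$(paste $files | awk '{
    out = $3                                    # third column of the first file
    for (i = 6; i <= NF; i += 3)                # third column of each later file
        out = out " " $i
    print out
}')

printf '%s\n' "$result"
```

Looping up to NF instead of testing the field value means a legitimate 0 in the data does not cut the row short.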
I tested the code below with the first 3 files:
cat File*_example.txt | awk '{a[$1$2]= a[$1$2] $3 " "} END{for(x in a){print a[x]}}' | sort
1 5 9
2 6 10
3 7 11
4 8 12
1) Use an awk array: in a[$1$2]= a[$1$2] $3 " " the index is column 1 joined with column 2, and the value accumulates every column 3.
2) END{for(x in a){print a[x]}} traverses array a and prints all values.
3) Use sort to sort the output.
When using cat you need to ensure the file order is preserved; one way is to specify the files explicitly:
cat File{1..100}_example.txt | awk '{print $NF}' | pr -4ts' '
This extracts the last column with awk and arranges it with pr; the -4 asks pr for four output columns.

Updating n-th column in csv using awk [duplicate]

This question already has answers here:
awk doesn't print separator
(2 answers)
Closed 7 years ago.
Input file
1,2,3,4,5,6,7,8,9,10
11,22,33,44,55,66,77,88,99,100
111,222,333,444,555,666,777,888,999,1000
Expected Output
1,2,3,4,5,6,7,8MNINS,9,10
11,22,33,44,55,66,77,88MNINS,99,100
111,222,333,444,555,666,777,888MNINS,999,1000
I tried the following command
awk -F "," '{$8=$8"MNINS"}1' 1.csv > 2.csv
output:
1 2 3 4 5 6 7 8MNINS 9 10
11 22 33 44 55 66 77 88MNINS 99 100
111 222 333 444 555 666 777 888MNINS 999 1000
It removed all the commas, so my CSV file is turning into a space-separated file.
Please help.
You need to specify a comma as the output field separator (OFS). Assigning to a field makes awk rebuild the line using OFS, which defaults to a single space.
awk -F "," -v OFS="," '{$8=$8"MNINS"}1' 1.csv > 2.csv
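A small side-by-side sketch of the behavior, using the first input line from the question. The variable names are illustrative, and the BEGIN form is simply an equivalent way to set both separators:

```shell
line='1,2,3,4,5,6,7,8,9,10'

# Without OFS: assigning to $8 rebuilds the record with single spaces.
without_ofs=$(printf '%s\n' "$line" | awk -F, '{ $8 = $8 "MNINS" } 1')

# With FS and OFS both set to a comma, the commas survive the rebuild.
with_ofs=$(printf '%s\n' "$line" | awk 'BEGIN { FS = OFS = "," } { $8 = $8 "MNINS" } 1')

printf '%s\n%s\n' "$without_ofs" "$with_ofs"
```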

Write the number of elements per line of a file and its repetitions with awk

I have a file of distinct integers in which the lines may have different lengths, like this:
1 2 3 4 5
16 7 8
9 10 101 102 13 14
15 6 17
24 28 31 30 18
I would like to print each distinct number of elements per line together with how many lines have that number of elements; for this example the output should be:
3 2
5 2
6 1
The first column is the number of elements per line; the second is the number of lines that have that many elements.
For example, the first line in the file has 5 elements, and so does the 5th.
Print the count for the number of fields:
$ awk '{a[NF]++}END{for(k in a)print k,a[k]}' file
5 2
6 1
3 2
Pipe to sort -n for numerically ordered output (plain sort would place a count of 10 before 3):
$ awk '{a[NF]++}END{for(k in a)print k,a[k]}' file | sort -n
3 2
5 2
6 1
