I have a large text file and I want to split it into several smaller text files. Does anyone have code for that?
Original file:
111
222
333
444
555
666
Then split it into 3 txt files:
File 1
111
222
File 2
333
444
File 3
555
666
If you want to split your original file into 3 files, without splitting lines, with the pieces going into file_01, file_02 and file_03, try this:
split --numeric-suffixes=1 -n l/3 original_file file_
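For the six-line example above (assuming GNU coreutils split), this produces:
$ split --numeric-suffixes=1 -n l/3 original_file file_
$ head file_*
==> file_01 <==
111
222

==> file_02 <==
333
444

==> file_03 <==
555
666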
With GNU awk:
awk 'NR%2!=0{print >"File " ++c}; NR%2==0{print >"File " c}' original_file
or shorter:
awk 'NR%2!=0{++c} {print >"File " c}' file
% is the modulo operator
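If you want a different number of lines per output file, the same idea generalizes by changing the modulus; a sketch for 4-line chunks (the parentheses around the file name expression keep the redirection portable across awk implementations):
awk 'NR%4==1{++c} {print > ("File " c)}' original_file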
edit: Question originally asked for a pythonic solution.
There are similar questions throughout the site, but here's a solution to your example:
# read ('r') the file ('text.txt') and split it into a list of lines
textFile = open('text.txt', 'r').read().splitlines()
# temporary list used as a placeholder for the files (stored as strings) to write,
# and a counter (i) as a pointer
temp = ['']
i = 0
# for each index and element in textFile
for ind, element in enumerate(textFile):
    # add the element to the placeholder
    temp[i] += element + '\n'
    # if the index is odd, and we are not at the end of the text file,
    # start a new string for the next file
    if ind % 2 and ind < len(textFile) - 1:
        temp.append('')
        i += 1
# go through each index and string of the temporary list
for ind, string in enumerate(temp):
    # write as a .txt file named 'output' + the index of the list (output0, output1, etc.)
    with open('output' + str(ind) + '.txt', 'w') as output:
        output.write(string)
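Assuming the snippet is saved as split_text.py (a file name chosen here only for illustration) next to text.txt, a quick check of the result:
$ python3 split_text.py
$ cat output1.txt
333
444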
I have a file with 3 columns like this:
NC_0001 10 x
NC_0001 11 x
NC_0002 90 y
I want to change the names in the first column using another .txt file that contains the conversion; it looks like this:
NC_0001 1
NC_0001 1
NC_0002 2
...
So finally I should have:
1 10 x
1 11 x
2 90 y
How can I do that?
P.S. The first file is huge (50 GB), so I must use a unix command like awk.
Save the following as script.awk:

NR == FNR {        # for the first file (map_file)
    tab[$1] = $2   # build a k/v map of the old name and its replacement
}
NR != FNR {        # for the second file (data_file)
    $1 = tab[$1]   # set the first column to the mapped value
    print          # print the modified line
}

and run it as:

awk -f script.awk map_file data_file
As a one-liner
awk 'NR==FNR{t[$1]=$2} NR!=FNR{$1=t[$1];print}' map_file data_file
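With the sample files from the question (the conversion file as map_file, the 3-column file as data_file), this gives:
$ awk 'NR==FNR{t[$1]=$2} NR!=FNR{$1=t[$1];print}' map_file data_file
1 10 x
1 11 x
2 90 y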
If possible, you can split the large data file, run this command on each piece in parallel, and then concatenate the results.
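A hedged sketch of that parallel approach, assuming GNU split is available and that concatenating the pieces in order is acceptable:
split -n l/8 -d data_file part_
for p in part_??; do
    awk 'NR==FNR{t[$1]=$2} NR!=FNR{$1=t[$1];print}' map_file "$p" > "$p.out" &
done
wait
cat part_??.out > converted_file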
I have 3 files:
File 1:
1111111
2222222
3333333
4444444
5555555
File 2:
6666666
7777777
8888888
9999999
File 3
8888888 7777777
9999999 6666666
4444444 8888888
I want to search file 3 for lines that contain a string from both file 1 and file 2, so the result of this example would be:
4444444 8888888
because 4444444 is in file 1 and 8888888 is in file 2.
I currently have a solution, however my files contain 500+ lines and it can take a very long time to run my script:
#!/bin/sh
cat file1 | while read line
do
    cat file2 | while read line2
    do
        grep -w -m 1 "$line" file3 | grep -w -m 1 "$line2" >> results
    done
done
How can I improve this script to run faster?
The current process is slow due to the repeated scans of file2 (once for each row in file1) and file3 (once for each row in the cartesian product of file1 and file2). The additional invocation of sub-processes (as a result of the pipes |) also slows things down.
So, to speed this up we want to look at reducing the number of times each file is scanned and limiting the number of sub-processes we spawn.
Assumptions:
there are only 2 fields (when using white space as the delimiter) in each row of file3 (e.g., we won't see a row like "field1 has several strings" "and field2 does, too"); otherwise we will need to come back and revisit the parsing of file3
First our data files (I've added a couple extra lines):
$ cat file1
1111111
2222222
3333333
4444444
5555555
5555555 # duplicate entry
$ cat file2
6666666
7777777
8888888
9999999
$ cat file3
8888888 7777777
9999999 6666666
4444444 8888888
8888888 4444444 # switch position of values
8888888XX 4444444XX # larger values; we want to validate that we're matching on exact values and not sub-strings
5555555 7777777 # want to make sure we get a single hit even though 5555555 is duplicated in `file1`
One solution using awk:
$ awk '
BEGIN { filenum=0 }
FNR==1 { filenum++ }
filenum==1 { array1[$1]++ ; next }
filenum==2 { array2[$1]++ ; next }
filenum==3 { if ( array1[$1]+array2[$2] >= 2 || array1[$2]+array2[$1] >= 2) print $0 }
' file1 file2 file3
Explanation:
this single awk script will process our 3 files in the order in which they're listed (on the last line)
in order to apply different logic for each file we need to know which file we're processing; we'll use the variable filenum to keep track of which file we're currently processing
BEGIN { filenum=0 } - initialize our filenum variable; while the variable should automatically be set to zero the first time it's referenced, it doesn't hurt to be explicit
FNR maintains a running count of the records processed for the current file; each time a new file is opened FNR is reset to 1
when FNR==1 we know we just started processing a new file, so increment our variable { filenum++ }
as we read values from file1 and file2 we're going to use said values as the indexes for the associative arrays array1[] and array2[], respectively
filenum==1 { array1[$1]++ ; next } - create an entry in our first associative array (array1[]) with the index equal to field1 (from file1); the value of the array element is a number > 0 (1 == the field exists once in the file, 2 == it exists twice); next says to skip the rest of the processing and go to the next row in the current file
filenum==2 { array2[$1]++ ; next } - same as previous command except in this case we're saving fields from file2 in our second associative array (array2[])
filenum==3 - optional because if we get this far in this script we have to be on our third file (file3); again, doesn't hurt to be explicit (and makes this easier to read/understand)
{ if ( ... ) } - test if the fields from file3 exist in both file1 and file2
array1[$1]+array2[$2] >= 2 - if (file3) field1 is in file1 and field2 is in file2 then we should find matches in both arrays and the sum of the array element values should be >= 2
array1[$2]+array2[$1] >= 2 - same as the previous test except we check for our 2 fields (from file3) being in the opposite source files/arrays
print $0 - if our test returns true (ie, the current fields from file3 exist in both file1 and file2) then print the current line (to stdout)
Running this awk script against my 3 files generates the following output:
4444444 8888888 # same as the desired output listed in the question
8888888 4444444 # verifies we still match if we swap positions; also verifies
# we're matching on actual values and not a sub-string (ie, no
# sign of the row `8888888XX 4444444XX`)
5555555 7777777 # only shows up in output once even though 5555555 shows up
# twice in `file1`
At this point we've a) limited ourselves to a single scan of each file and b) eliminated all sub-process calls, so this should run rather quickly.
NOTE: One trade-off of this awk solution is the requirement for memory to store the contents of file1 and file2 in the arrays; which shouldn't be an issue for the relatively small data sets referenced in the question.
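As a small aside, GNU awk also exposes the index of the current input file as the built-in variable ARGIND, so the manual filenum counter could be dropped; a sketch of the same script, GNU awk only:
$ gawk '
ARGIND==1 { array1[$1]++ ; next }
ARGIND==2 { array2[$1]++ ; next }
ARGIND==3 { if ( array1[$1]+array2[$2] >= 2 || array1[$2]+array2[$1] >= 2 ) print $0 }
' file1 file2 file3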
You can do it faster if you load all the data first and then process it:
f1=$(cat file1)
f2=$(cat file2)
IFSOLD=$IFS; IFS=$'\n'
f3=( $(cat file3) )
IFS=$IFSOLD
for item in "${f3[@]}"; {
sub=( $item )
test1=${sub[0]}; test1=${f1//[!$test1]/}
test2=${sub[1]}; test2=${f2//[!$test2]/}
[[ "$test1 $test2" == "$item" ]] && result+="$item\n"
}
echo -e "$result" > result
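For comparison, the same filtering can also be expressed as two fixed-string grep passes (a sketch: -f reads the patterns from a file, -F treats them as literal strings, and -w keeps whole-word matches so 8888888XX is not mistaken for 8888888):
grep -wFf file1 file3 | grep -wFf file2 > results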
I have two txt files: File1 is a tsv with 9 columns. Following is its first row (SRR6691737.359236/0_14228//11999_12313 is the first column and after Repeat is the 9th column):
SRR6691737.359236/0_14228//11999_12313 Censor repeat 5 264 1169 + . Repeat BOVA2 SINE 1 260 9
File2 is a tsv with 9 columns. Following is its first row (after Read is the 9th column):
CM011822.1 reefer discordance 63738705 63738727 . + . Read SRR6691737.359236 11999 12313; Dup 277
File1 contains the read name (SRR6691737.359236), read length (0_14228) and coordinates (11999_12313), while file2 contains only the read name and coordinates. All read names and coordinates in file1 are present in file2, but file2 may also contain the same read names with different coordinates. File2 also contains read names which are not present in file1.
I want to write a script which finds read names and coordinates in file2 that match those in file1 and adds the read length from file1 to file2. i.e. changes the last column of file2:
Read SRR6691737.359236 11999 12313; Dup 277
to:
Read SRR6691737.359236/0_14228//11999_12313; Dup 277
any help?
It is unclear how your input files look.
You write:
I have two txt files: File1 is a tsv with 9 columns. Following is
its first row (SRR6691737.359236/0_14228//11999_12313 is the first
column and after Repeat is the 9th column):
SRR6691737.359236/0_14228//11999_12313 Censor repeat 5 264 1169 + . Repeat BOVA2 SINE 1 260 9
If I try to check the columns (and put them in a 'Column,Value' pair):
Column,Value
1,SRR6691737.359236/0_14228//11999_12313
2,Censor
3,repeat
4,5
5,264
6,1169
7,+
8,.
9,Repeat
10,BOVA2
11,SINE
12,1
13,260
14,9
That seems to have 14 columns, but you specified 9 columns...
Can you edit your question, and be clear about this?
i.e. specify as csv
SRR6691737.359236/0_14228//11999_12313,Censor,repeat,5,.....
Added info, after feedback:
file1 contains the following fields (tab-separated):
SRR6691737.359236/0_14228//11999_12313
Censor
repeat
5
264
1169
+
.
Repeat BOVA2 SINE 1 260 9
You want to convert this (using a script) to a tab-separated file:
CM011822.1
reefer
discordance
63738705
63738727
+
.
Read SRR6691737.359236 11999 12313
Dup 277
More info is needed to solve this!
field 1: how/where does the value 'CM011822.1' come from?
fields 2 and 3: 'reefer'/'discordance' - is this fixed text, i.e. should these fields always contain these values, or are there exceptions?
fields 4 and 5: where do these values (63738705; 63738727) come from?
OK, it's clear that there are more questions to be asked than can be answered here …
Second edit:
Create a file, name it 'mani.awk':
FILENAME=="file1"{
split($1,a,"/");
x=a[1] " " a[4];
y=x; gsub(/_/," ",y);
r[y]=$1;
c=1; for (i in r) { print c++,i,"....",r[i]; }
}
FILENAME=="file2"{
print "<--", $0, "--> " ;
for (i in r) {
if ($9 ~ i) {
print "B:" r[i];
split(r[i],b,"/");
$9="Read " r[i];
print "OK";
}
};
print "<--", $0, "--> " ;
}
After this, gawk -f mani.awk file1 file2 should produce the correct result (along with the debug lines).
If not, then I suggest you learn AWK 😉 and change the script as needed.
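For reference, a more compact sketch of the same lookup-and-splice idea, without the debug printing; it assumes both files are strictly tab-separated and that the read name and coordinates appear space-separated inside file2's 9th column, as in the example rows:
gawk 'BEGIN { FS=OFS="\t" }                  # both files are tab-separated
FNR==NR {                                    # first file on the command line (file1)
    split($1,a,"/")                          # a[1]=read name, a[4]=coordinates
    key = a[1] " " a[4]
    gsub(/_/," ",key)                        # "11999_12313" -> "11999 12313"
    r[key] = $1                              # "name coord1 coord2" -> full file1 field 1
    next
}
{                                            # second file (file2)
    for (k in r) {
        i = index($9, k)                     # literal substring search in column 9
        if (i) $9 = substr($9,1,i-1) r[k] substr($9, i+length(k))
    }
    print
}' file1 file2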
I have a text file 1 that has 3 columns. The first column contains a number, the second a word (which can be either a string like dog or a number like 1050), and the third column a TAG in capital letters.
I have another text file 2 that has 2 columns. The first column has a number, the second one has a TAG in capital letters.
I want to compare every row in text file 1 with every row in text file 2. If the TAG in column 3 of text file 1 is the same as the TAG in column 2 of text file 2, then I want to print the number from text file 1, the number from text file 2 and the word from text file 1 on one line. There are no duplicate TAGs in text file 2 and there are no duplicate words in text file 1.
Illustration:
Text file 1
2 2737 HPL
32 hello PLS
3 world PLS
323 . OPS
Text file 2
342 HPL
56 PLS
342 DCC
4 OPS
I want:
2 342 2737
32 56 hello
3 56 world
323 4 .
You can do this in awk like this:
awk 'FNR==NR { h[$2] = $1; next } $3 in h { print $1, h[$3], $2 }' file2 file1
The first part saves the key and column from file 2 in an associative array (h), the second part compares column 3 from file 1 to this array and prints the relevant parts.
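For the sample files above, this produces exactly the desired output:
$ awk 'FNR==NR { h[$2] = $1; next } $3 in h { print $1, h[$3], $2 }' file2 file1
2 342 2737
32 56 hello
3 56 world
323 4 .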
I have a fasta file (imagine a txt file in which even lines are sequences of characters and odd lines are sequence ids). I would like to search for a string in the sequences and get the positions of the matching substrings as well as their ids. Example:
Input:
>111
AACCTTGG
>222
CTTCCAACC
>333
AATCG
search for "CC" . output:
3 111
4 8 222
$ awk -F'CC' 'NR%2==1{id=substr($0,2);next} NF>1{x=1+length($1); b=x; for (i=2;i<NF;i++){x+=length(FS $i); b=b " " x}; print b,id}' file
3 111
4 8 222
Explanation:
-F'CC'
awk breaks input lines into fields. We instruct it to use the sequence of interest, CC in this example, as the field separator.
NR%2==1{id=substr($0,2);next}
On odd number lines, we save the id to variable id. The assumption is that the first character is > and the id is whatever comes after. Having captured the id, we instruct awk to skip the remaining commands and start over with the next line.
NF>1{x=1+length($1); b=x; for (i=2;i<NF;i++){x+=length(FS $i); b=b " " x}; print b,id}
If awk finds only one field on an input line, NF==1, that means that there were no field separators found and we ignore those lines.
For the rest of the lines, we calculate the positions of each match in x and then save each value of x found in the string b.
Finally, we print the match locations, b, and the id.
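For example, on the line CTTCCAACC the fields (with FS set to CC) are "CTT", "AA" and an empty string, so the first match starts at 1+length("CTT") = 4 and the second at 4+length("CCAA") = 8, which gives the "4 8 222" line of the output.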
Another approach, which prints the start position of each match followed by the sequence id:
awk 'NR%2==1{t=substr($0,2)}{z=a="";while(y=match($0,"CC")){a=a?a" "z+y:z+y;$0=substr($0,z=(y+RLENGTH));z-=1}}a{print a,t }' file
Neater
awk '
NR%2==1 { t=substr($0,2) }
{
    z=a=""
    while ( y = match($0,"CC") ) {
        a = a ? a" "z+y : z+y
        $0 = substr($0, z=(y+RLENGTH))
        z -= 1
    }
}
a { print a,t }' file
Output:
3 111
4 8 222