Splitting non-equally file in bash - bash

I have a file in csv format. I know positions where I want to chip off a chunk from the file and write it as a new csv file.
split command splits a file into equal-sized chunks. I wonder if there exists an effective (the file is huge) way to split file into chunks of different sizes?

I assume you want to split the file at a newline character. If this is the case you can use the head and tail commands to grab a number of lines from the beginning and from the end of your file, respectively.
If you want to copy a new of lines from within the file you can use sed, e.g.
sed -e 1,Nd -e Mq file
where N should be replaced with the line number of the line preceding the first line to display and M should be the line number of the last line to display.

Related

Mac OS X split csv not working

I am attempting to split a large 200,000 line csv file into smaller pieces using the terminal command:
split -l 20000 users.csv
From what I have read online this should chop up the 200,000 line csv into ten 20,000 line files but this doesn't happen. All I get is a text file called 'xaa' that is just the original csv, all 200,000 lines.
Like I said in the title I am running on Mac OS High Sierra v.10.13.5
What exactly am I missing here?
As Ken Thomases points out in the comments, the most likely culprit is that the file is using non-newline line separators, and the most likely culprit is CR (carriage return).
You can tell if this is the case using the file utility. A file with such line separators looks like this:
$ file foo
foo: ASCII text, with CR line terminators
The reason split would behave this way with those line separators is that the file would appear to be only one line long (no newline characters). So split would write that one (very long) line, then exit.
you should use the split command with the -b option. This will split your files based on kilobytes or megabytes. This may break up a line at the end but can manually be accounted for.

Replace a part of a file by a part of another file

I have two files containing a lot of floating numbers. I would like to replace one of the floating numbers from file 1 by a floating number from File 2, using lines and characters to find the numbers (and not their values).
A lot of topics on the subject, but I couldn't find anything that uses a second file to copy the values from.
Here are examples of my two files:
File1:
14 4
2.64895E-01 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
File2:
Some text on the first line
1
Some text on the third line
0
AND01 0.53758275 0.65728944
AND02 0.64889566 0.53386002
AND03 0.65729386 0.64628194
AND04 0.26586960 0.46582925
AND05 0.46480534 0.57415869
In this particular example, I would like to replace the first number of the second line of File1 (2.64895E-01) by the second floating number written on line 5 of File2 (0.65728944).
Note: the value of the numbers will change according to which file I consider, so I have to identify the numbers by their positions inside the files.
I am very new to using bash scripts and have only use "sed" command till now to modify my files.
Any help is welcome :)
Thanks a lot for your inputs!
It's not hard to do it in bash, but if that's not a strict requirement, an easier and more concise solution is possible with an actual text-processing tool like awk:
awk 'NR==5 {val=$2} NR>FNR {FNR==2 && $1=val; print}' file2 file1
Explanation: read file2 first, and store the second field of the 5th record in variable val (the first part: NR==5 {val=$2}). Then, read file1, print every line, but replace the first field of the second record (FNR is current-file record number, and NR is total number of records in all files so far) with value stored in val.
In general, an awk program consists of pattern { actions } sequences. pattern is a condition under which a series of actions will get executed. $1..$NF are variables with field values, and each line (record) is split into fields on the field separator (FS variable, or -F'..' option), which defaults to a space.
The result (output):
14 4
0.53758275 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01

split the multiple lines files into small files based on the content

I want to create a script to split large files into multiple files with respect to line numbers. Mainly if a file is getting split there should be a complete line at the end / beginning.
No partial line should present in any of the split files.
split is what you might be looking for.
split --lines <linenumber> <file>
and you will get a bunch of splitted files named like: PREFIXaa, PREFIXab...
For further info see man split.

Split text file into multiple files

I am having large text file having 1000 abstracts with empty line in between each abstract . I want to split this file into 1000 text files.
My file looks like
16503654 Three-dimensional structure of neuropeptide k bound to dodecylphosphocholine micelles. Neuropeptide K (NPK), an N-terminally extended form of neurokinin A (NKA), represents the most potent and longest lasting vasodepressor and cardiomodulatory tachykinin reported thus far.
16504520 Computer-aided analysis of the interactions of glutamine synthetase with its inhibitors. Mechanism of inhibition of glutamine synthetase (EC 6.3.1.2; GS) by phosphinothricin and its analogues was studied in some detail using molecular modeling methods.
You can use split and set "NUMBER lines per output file" to 2. Each file would have one text line and one empty line.
split -l 2 file
Something like this:
awk 'NF{print > $1;close($1);}' file
This will create 1000 files with filename being the abstract number. This awk code writes the records to a file whose name is retrieved from the 1st field($1). This is only done only if the number of fields is more than 0(NF)
You could always use the csplit command. This is a file splitter but based on a regex.
something along the lines of :
csplit -ks -f /tmp/files INPUTFILENAMEGOESHERE '/^$/'
It is untested and may need a little tweaking though.
CSPLIT

Is there a better split function for terminal?

I'm trying to split a very big CSV file into smaller more manageable ones. I've tried split but it seems that it tops out at 676 files.
The CSV file I have is in excess of 80mb and I'd like to split it into 50 line files.
Note by better I mean one that uses a numbering structure instead of split's a-z sequencing.
split is the right tool, the problem is that the suffix is only 2 long 26^2 = 676, if you make it longer you should be fine:
split -a LEN file
Use 'cat' to number each line and pipe the output to 'grep' with params to only print n lines

Resources