Splitting a file in bash

I have a .TXT file containing account numbers. Sample:
TRV001 TRV002 TRV003 TRV004... The values are separated by space.
I want to split this file using bash so that the first 1000 account numbers go into one file, the next 1000 into the next file, and so on. The account numbers come from a report, so we don't know in advance how many of them the file will contain.

Assuming the source file is called acc, you can use awk piped through to split
awk '{ for (i=1;i<=NF;i++) { print $i } }' acc | split -l 1000
For each field in each line, awk prints the field on its own line, and split then writes that output into files of 1000 lines each (default prefix x).

Thanks all for the help, I was able to work it out. I changed the format of the file to have only one account number per line and then used split -l 1000 to split it.
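For reference, the same one-account-per-line conversion can be done on the fly with tr; a minimal sketch, assuming the source file is still called acc, GNU split for the -d (numeric suffixes) flag, and accounts_ as an arbitrary output prefix:
tr -s ' ' '\n' < acc | split -l 1000 -d - accounts_
This produces accounts_00, accounts_01, and so on, with 1000 account numbers per file.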


The for loop overwrites or duplicates entries

Say, I have 250 files, and from which I need to extract certain information and store them in a text file.
I have tried for loop in the shell as following,
text= 'home/path/tothe/textfiles'
for sam in $(find ${text} -name \*_PG.tsv);do
#echo ${sam}
awk '{if($2=="ID") print FILENAME"\t""yes""\t""SAP""\t""LUFTA"}' ${sam}
done >> ${text}/metadata.txt
With the > operator the output text file is overwritten, and with >> the output text file ends up with duplicate entries.
I would like to know what I should change to get rid of these issues. Thanks for the suggestions!
I think you can do this with a single invocation of awk:
path=home/path/tothe/textfiles
awk -v OFS='\t' '$2 == "ID" {
print FILENAME, "yes", "SAP", "LUFTA"
}' "$path"/*_PG.tsv > "$path"/metadata.txt
careful with your variable assignments, there should be no spaces around the =
use the shell to expand the list of files, without find
pass the full list of files as arguments to awk, instead of looping one by one
set the Output Field Separator OFS instead of writing \t to separate your fields
redirect the output to the metadata file
I assume that your awk script is behaving as you expect - I removed the redundant if, since awk scripts are written as condition { action }. I guess you only want one line of output per file, so you can skip the rest of each file once it has matched (see the sketch below); note that a plain exit would stop awk entirely, including any remaining files.
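A minimal sketch of that variant, reusing $path from above and assuming GNU awk for nextfile (not every older awk has it):
awk -v OFS='\t' '$2 == "ID" {
print FILENAME, "yes", "SAP", "LUFTA"
nextfile
}' "$path"/*_PG.tsv > "$path"/metadata.txt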

Fastest way -- Appending a line to a file only if it does not already exist

given this question Appending a line to a file only if it does not already exist
is there a faster way than the solution provided by @drAlberT?
grep -q -F 'string' foo.bar || echo 'string' >> foo.bar
I have implemented the above solution and I have to iterate it over a 500k-line file (i.e. check whether a line is already in a 500k-line set). Moreover, I have to run this process a lot of times, maybe 10-50 million times. Needless to say it's kind of slow, as it takes 25-30 ms to run on my server (so 3-10+ days of runtime in total).
EDIT: the flow is the following: I have a file with 500k lines; each time I run, I get maybe 10-30 new lines and check whether they are already there. If not, I add them, then I repeat, many times. The order of my 500k-line file is important, as I'm going through it with another process.
EDIT2: the 500k-line file always contains unique lines, and I only care about "full lines", no substrings.
Thanks a lot!
A few suggested improvements:
Try using awk instead of grep so that you can both detect the string and write it in one action;
If you do use grep, don't use a Bash loop to feed each potential match to grep and then append that one word to the file. Instead, read all the potential lines into grep as matches (using -f file_name) and print the matches. Then invert the matches and append the inverted matches. See the last pipeline below;
Exit as soon as you see the string (for a single string) rather than continuing to loop over a big file;
Don't call the script millions of times with one or just a few lines -- organize the glue script (in Bash I suppose) so that the core script is called once or a few times with all the lines instead;
Perhaps use multicores since the files are not dependent on each other. Maybe with GNU Parallel (or you could use Python or Ruby or Perl that has support for threads).
Consider this awk for a single line to add:
$ awk -v line=line_to_append 'FNR==NR && line==$0{f=1; exit}
END{if (!f) print line >> FILENAME}' file
Or for multiple lines:
$ awk 'FNR==NR {lines[$0]; next}
$0 in lines{delete lines[$0]}
END{for (e in lines) print e >> FILENAME}' lines file
Some timings, using a copy of the Unix words file (235,886 lines) and a five-line lines file that has two overlaps with words:
$ echo "frob
knob
kabbob
stew
big slob" > lines
$ time awk 'FNR==NR {lines[$0]; next}
$0 in lines{delete lines[$0]}
END{for (e in lines) print e >> FILENAME}' lines words
real 0m0.056s
user 0m0.051s
sys 0m0.003s
$ tail words
zythum
Zyzomys
Zyzzogeton
frob
kabbob
big slob
Edit 2
Try this as being the best of both:
$ time grep -x -f lines words |
awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines >> words
real 0m0.012s
user 0m0.010s
sys 0m0.003s
Explanation:
grep -x -f lines words finds the lines that ARE in words
awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines inverts that, keeping only the lines from lines that are NOT in words
>> words appends those to the file
Turning the millions of passes over the file into a script with millions of actions will save you a lot of overhead. Searching for a single label at each pass over the file is incredibly inefficient; you can search for as many labels as you can comfortably fit into memory in a single pass over the file.
Something along the following lines, perhaps.
awk 'NR==FNR { a[$0]++; next }
$0 in a { delete a[$0] }
1
END { for (k in a) print k }' strings bigfile >bigfile.new
If you can't fit strings in memory all at once, splitting that into suitable chunks will obviously allow you to finish this in as many passes as you have chunks.
On the other hand, if you have already (effectively) divided the input set into sets of 10-30 labels, you can obviously only search for those 10-30 in one pass. Still, this should provide you with a speed improvement on the order of 10-30 times.
This assumes that a "line" is always a full line. If the label can be a substring of a line in the input file, or vice versa, this will need some refactoring.
If duplicates are not valid in the file, just append them all and filter out the duplicates:
cat myfile mynewlines | awk '!n[$0]++' > mynewfile
This will allow appending millions of lines in seconds.
If order additionally doesn't matter and your files are more than a few gigabytes, you can use sort -u instead.
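A minimal sketch of that sort -u variant; it merges and de-duplicates in one pass, but does not preserve the original line order:
sort -u myfile mynewlines > mynewfile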
Have the script read new lines from stdin after consuming the original file. All lines are stored in an associative array (without any compression such as md5sum).
The 'x' prefix on the array keys (and printf rather than echo for output) is meant to guard against awkward inputs such as '-e'; better ways probably exist.
#!/bin/bash
declare -A aa
# Load the existing file into an associative array, prefixing each key with 'x'.
while IFS= read -r line; do
  aa["x$line"]=1
done < file.txt
# Read candidate lines from stdin; append only those not already present.
while IFS= read -r line; do
  if [ -z "${aa[x$line]}" ]; then
    aa["x$line"]=1
    printf '%s\n' "$line" >> file.txt
  fi
done
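A minimal usage sketch, assuming the script above is saved under the hypothetical name append_unique.sh and the candidate lines arrive on stdin:
printf '%s\n' 'first candidate line' 'second candidate line' | bash append_unique.sh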

Split large text file using AWK, given specific parameters

Hi, I'm trying to divide an XML file which contains item tags. As I have 250 items in a single file, I would like to divide the whole file into 5 smaller files containing 50 items (and their content) each.
What I got from this link (Linux script: how to split a text into different files with match pattern) is:
awk '{if ($0 ~ /<item>/) a++} { print > ("NewDirectory"a".xml") }'
However, this divided the whole file into 1 file per item. So I need help modifying this statement to split the file into 1 file per 50 items.
Assuming your original command does what you say it does and you fully understand the issues around trying to parse XML with awk:
awk '/<item>/ && (++a%50 == 1) { ++c } { print > ("NewDirectory"c".xml") }'
You might need to add a close() in there if you have a lot of files open simultaneously and aren't using GNU awk. Just get gawk.
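A minimal sketch of that close() variant for a non-GNU awk, using items.xml as a placeholder input name; the out pattern also skips any lines before the first <item>, which the original command would have sent to NewDirectory.xml:
awk '/<item>/ && (++a % 50 == 1) { if (out) close(out); out = "NewDirectory" (++c) ".xml" }
out { print > out }' items.xml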
Also, to learn awk read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
Try:
awk '$0~/<item>/' | split -l50 -d - NewDirectory.
Explanations:
awk will extract only those lines that contain <item>
split will split stdin into files with 50 lines, named NewDirectory.00, NewDirectory.01, etc. See man split for more info.
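A minimal sketch of the same approach that also keeps a .xml extension on the pieces, assuming GNU split (for --additional-suffix) and items.xml as a placeholder input name; like the command above, it keeps only the lines that contain <item>:
awk '/<item>/' items.xml | split -l 50 -d --additional-suffix=.xml - NewDirectory.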

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
One file per distinct value in column 7; the rows in each new file are grouped based on their values in column 7 of the original file. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called array into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[#]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
header=$0
next
}
!($7 in files) {
files[$7]=sprintf("sport-%s.csv", $7)
print header > files[$7]
}
{
print > files[$7]
}
END {
printf("declare -a sportlist=( ")
for (sport in files) {
printf("\"%s\"", sport)
}
printf(" )\n");
}
The idea here is that we key the array files[] by sport name and store each generated filename in it. (You can format the filename inside sprintf() as you see fit.) We step through the file, writing a header line whenever we see a new sport with no recorded filename. Then every data row is printed to the file for its sport.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can eval the output of this awk script via command substitution, and the declare command will effectively be interpreted by your shell:
eval $(/path/to/awkscript inputfile.csv)
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
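A minimal sketch of that, reusing the /path/to/awkscript placeholder from above:
tmp=$(mktemp)
/path/to/awkscript inputfile.csv > "$tmp"
. "$tmp"
rm -f "$tmp"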
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the files might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file on a separate line and use readarray:
readarray -t filesArray < <( awk ... )
If the files might have newlines in them, too, then things get tricky; one option is sketched below.
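A minimal sketch of one way around that, assuming bash 4.4+ (for readarray -d '') and GNU awk, and assuming a plain comma-separated file with the sport in column 7 as in the question; each filename is emitted NUL-terminated so embedded newlines survive:
readarray -d '' -t filesArray < <(
awk -F',' 'NR > 1 && !seen[$7]++ { printf "%s.csv%c", tolower($7), 0 }' input_file.csv
)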
If your file is not large, you can run another script to get the unique $7 elements, for example
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values, you can change it to your file name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
This can then be piped to your other process, for example to wc:
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question then adjust the awk script to suit (You might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.
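A minimal sketch of that extra close() handling for a non-GNU awk; note that it appends with >>, so rerunning it with old *.csv output files still present will add to them rather than start fresh:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print >> out; close(out)} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"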

How to use the awk command in a loop to produce several thinned data files

I have several large data files with 8 columns and 120,000 rows. Now I want to keep 1 line every 200 lines starting from the 100th line. I have the script file thin.sh as:
awk '(NR%200==100)' original_file > thinned_file
However, I have 30 original files, which means I would have to tweak the command 30 times, and the original files share similar names:
data.0000.dat, data.0001.dat data.0002.dat, ..., data.0029.dat
I suppose there must be some way to embed the awk command into a loop to accomplish my goal, maybe something like:
for(i=0;i<30;i++);
do
awk '(NR%200==100)' data.$i.dat > data.$i_thinned.dat
done
But I realize there are two leading zeros in front of $i in the file names. Can I use sprintf("%s") or something? If so, how do I arrange the order of awk and sprintf?
I use ubuntu and bash.
With seq:
for i in $(seq -f %04g 0 29); do
awk 'NR % 200 == 100' "data.${i}.dat" > "data.${i}_thinned.dat"
done
Alternatively with bash:
for i in {0000..0029}; do
The quotes are not strictly necessary in the first snippet because we know $i does not contain anything nefarious, but it's better to be paranoid about expansion in shell scripts. The braces in "data.${i}_thinned.dat" are necessary so the shell doesn't look for a variable $i_thinned to use. They are not strictly necessary in "data.${i}.dat" because shell variable names cannot have . in them, but consistency is nice.
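To answer the sprintf part of the question directly: in bash you can build the zero-padded number with printf -v instead; a minimal sketch:
for ((i=0; i<30; i++)); do
printf -v n '%04d' "$i"
awk 'NR % 200 == 100' "data.${n}.dat" > "data.${n}_thinned.dat"
done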
All you need is:
awk 'FNR==1{close(out); out=FILENAME; sub(/\.dat/,"_thinned&",out)} (FNR%200==100){print > out}' data.[0-9][0-9][0-9][0-9].dat
I used data.[0-9][0-9][0-9][0-9].dat as the file name globbing pattern instead of data.*.dat in case you rerun the script in the same dir where you previously generated all of the "_thinned" files.
Ingredients (GAWK)
1 FNR - The record number in the current file
1 match - Matches a regex against a string and can capture groups into an array.
1 print - Prints the following data (if none is provided it defaults to the current record)
1 *.dat - All files ending with .dat in the current directory.
Instructions
In the condition block check that the current record number in the current file when divided by 200 leaves a remainder of 100.
If it does then run the next block {..}
Take the current file name and match up to the last dot, capture everything before this with (.*) into array a.
Print into a file named using the captured base name a[1] with the suffix _thinned.dat
Finally add *.dat to the end to read all .dat files in the current directory
Resulting code
gawk '(FNR%200==100){match(FILENAME,/(.*)\./,a);print >(a[1]"_thinned.dat")}' *.dat
