Split large text file using AWK, given specific parameters - bash

Hi, I'm trying to split an XML file that contains item tags. The file holds 250 items, and I would like to divide it into 5 smaller files of 50 items (and their content) each.
Here is what I got from this link: Linux script: how to split a text into different files with match pattern
awk '{if ($0 ~ /<item>/) a++} { print > ("NewDirectory"a".xml") }'
However, this split the whole file into one file per item. I need help modifying this statement to split the file into one file per 50 items.

Assuming your original command does what you say it does and you fully understand the issues around trying to parse XML with awk:
awk '/<item>/ && (++a%50 == 1) { ++c } { print > ("NewDirectory"c".xml") }' file
You might need to add a close() in there if you have a lot of files open simultaneously and aren't using GNU awk. Just get gawk.
Also, to learn awk read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
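For what it's worth, here is a hedged sketch of that close() idea. The file names and the n parameter are assumptions (the group size is made a variable so the sketch is easy to try on a tiny input), and lines before the first <item> are skipped here, unlike in the original command:

```shell
# Toy input: four <item> lines
printf '<item>a</item>\n<item>b</item>\n<item>c</item>\n<item>d</item>\n' > input.xml

# Group <item> lines into chunks of n, closing each chunk before opening
# the next so non-GNU awk doesn't run out of file descriptors.
# (The modulo test assumes n > 1, just like the original with 50.)
awk -v n=2 '
    /<item>/ && (++a % n == 1) {
        if (out) close(out)                 # close the previous chunk
        out = "NewDirectory" (++c) ".xml"   # start the next one
    }
    out { print > out }
' input.xml
```

With n=2 and the four-item toy input this produces NewDirectory1.xml (items a, b) and NewDirectory2.xml (items c, d).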

Try:
awk '$0 ~ /<item>/' input.xml | split -l50 -d - NewDirectory.
Explanations:
awk will extract only those lines that contain <item>
split will split stdin into files of 50 lines each, named NewDirectory.00, NewDirectory.01, etc. (the -d numeric-suffix option requires GNU split). See man split for more info.
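A toy run of the same pipeline, with a chunk size of 2 instead of 50 (the input here is made up; note this approach keeps only the matching lines, not the rest of each item's content):

```shell
# Four matching lines split into 2-line chunks with numeric suffixes
printf '<item>a\n<item>b\n<item>c\n<item>d\n' | split -l 2 -d - NewDirectory.
cat NewDirectory.00   # first two lines
cat NewDirectory.01   # last two lines
```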

How to make a table using bash shell?

I have multiple text files, each containing a single column, and I want to combine them into one text file laid out like a table rather than one long column.
I tried 'paste' and 'column', but neither produced the shape I wanted.
When I used the paste with two text files, it made a nice table.
paste height_1.txt height_2.txt > test.txt
The trouble starts from three or more text files.
paste height_1.txt height_2.txt height_3.txt > test.txt
At a glance, it seems fine. But when I plot each column of test.txt in gnuplot (p "test.txt"), I get an unexpected graph that differs from the original data, especially in its last part.
The shape of the table is ruined in a strange way in test.txt, which makes the graph look wrong.
How can I make a well-structured table in a text file with the bash shell?
Or is the bash shell not suited to this task?
If so, I will try it with Python.
Height files are extracted from other *.csv files using awk.
Thank you so much for reading this question.
awk with simple concatenation can take the records from as many files as you have and join them into a single output file for further processing. You simply pass the multiple input files to awk, concatenate each record using FNR (the per-file record number) as an index, and then use the END rule to print the combined records from all files.
For example, given 3 data files, e.g. data1.txt - data3.txt each with an integer in each row, e.g.
$ cat data1.txt
1
2
3
$ cat data2.txt
4
5
6
(7-9 in data3.txt, and presuming you have an equal number of records in each input file)
You could do:
awk '{a[FNR]=(FNR in a) ? a[FNR] "\t" $1 : $1} END {for (i in a) print a[i]}' data1.txt data2.txt data3.txt
(using a tab above with "\t" for the separator between columns of the output file -- you can change to suit your needs)
The result of the command above would be:
1 4 7
2 5 8
3 6 9
(note: this is what you would get with paste data1.txt data2.txt data3.txt, but presuming you have input that is giving paste problems, awk may be a bit more flexible)
Or, using a "," as the separator, you would get:
1,4,7
2,5,8
3,6,9
If your data file has more fields than a single integer and you want to compile all fields in each file, you can assign $0 to the array instead of the first field $1.
Spaced and formatted in multi-line format (for easier reading), the same awk script would be
awk '
    {
        a[FNR] = (FNR in a) ? a[FNR] "\t" $1 : $1
    }
    END {
        # note: "for (i in a)" iterates in an unspecified order;
        # use "for (i = 1; i <= FNR; i++)" to guarantee input order
        for (i in a)
            print a[i]
    }
' data1.txt data2.txt data3.txt
Look things over and let me know if I misunderstood your question, or if you have further questions about this approach.

Splitting file in bash

I have a .TXT file containing account numbers. Sample:
TRV001 TRV002 TRV003 TRV004... The values are separated by spaces.
I want to split this file so that the first 1000 account numbers go into one file and the next 1000 into the next file, using bash. These account numbers come from a report, so we don't know in advance how many of them the file will contain.
Assuming the source file is called acc, you can use awk piped through to split
awk '{ for (i=1;i<=NF;i++) { print $i } }' acc | split -l 1000
For each field on each line, print the field on a separate line using awk, then let split write the output to separate files (default prefix x).
Thanks all for help, I was able to work it out. I changed the format of the file to have only one account number per line of the file and then used split -l 1000 to split the files.
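The whole pipeline on toy data, with a chunk size of 2 instead of 1000 (the acc file contents and the acc. prefix are made up for the example; split's default alphabetic suffixes give acc.aa, acc.ab, ...):

```shell
# Five space-separated account numbers across two lines
printf 'TRV001 TRV002 TRV003\nTRV004 TRV005\n' > acc

# One account number per line, then 2-line chunks
awk '{ for (i = 1; i <= NF; i++) print $i }' acc | split -l 2 - acc.
```

This yields acc.aa (TRV001, TRV002), acc.ab (TRV003, TRV004), and acc.ac (TRV005).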

Fastest way -- Appending a line to a file only if it does not already exist

Given this question, Appending a line to a file only if it does not already exist:
is there a faster way than the solution provided by @drAlberT?
grep -q -F 'string' foo.bar || echo 'string' >> foo.bar
I have implemented the above solution and I have to iterate it over a 500k-line file (i.e. check whether a line is already in a 500k-line set). Moreover, I have to run this process many times, maybe 10-50 million times. Needless to say, it's rather slow: it takes 25-30 ms per run on my server (so 3-10+ days of runtime in total).
EDIT: the flow is the following: I have a file with 500k lines; each time I run, I get maybe 10-30 new lines and check whether they are already there or not. If not, I add them, then I repeat many times. The order of my 500k-line file is important, as I'm going through it with another process.
EDIT2: the 500k-line file always contains unique lines, and I only care about "full lines", no substrings.
Thanks a lot!
A few suggested improvements:
Try using awk instead of grep so that you can both detect the string and write it in one action;
If you do use grep don't use a Bash loop to feed each potential match to grep and then append that one word to the file. Instead, read all the potential lines into grep as matches (using -f file_name) and print the matches. Then invert the matches and append the inverted match. See last pipeline here;
Exit as soon as you see the string (for a single string) rather than continuing to loop over a big file;
Don't call the script millions of times with one or just a few lines -- organize the glue script (in Bash I suppose) so that the core script is called once or a few times with all the lines instead;
Perhaps use multicores since the files are not dependent on each other. Maybe with GNU Parallel (or you could use Python or Ruby or Perl that has support for threads).
Consider this awk for a single line to add:
$ awk -v line=line_to_append 'FNR==NR && line==$0{f=1; exit}
END{if (!f) print line >> FILENAME}' file
Or for multiple lines:
$ awk 'FNR==NR {lines[$0]; next}
$0 in lines{delete lines[$0]}
END{for (e in lines) print e >> FILENAME}' lines file
Some timings using a copy of the Unix words file (235,886 lines) with a five line lines file that has two overlaps:
$ echo "frob
knob
kabbob
stew
big slob" > lines
$ time awk 'FNR==NR {lines[$0]; next}
$0 in lines{delete lines[$0]}
END{for (e in lines) print e >> FILENAME}' lines words
real 0m0.056s
user 0m0.051s
sys 0m0.003s
$ tail words
zythum
Zyzomys
Zyzzogeton
frob
kabbob
big slob
Edit 2
Try this as being the best of both:
$ time grep -x -f lines words |
awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines >> words
real 0m0.012s
user 0m0.010s
sys 0m0.003s
Explanation:
grep -x -f lines words find the lines that ARE in words
awk 'FNR==NR{a[$0]; next} !($0 in a)' - lines invert those into lines that are NOT in words
>> words append those to the file
Turning the millions of passes over the file into a script with millions of actions will save you a lot of overhead. Searching for a single label at each pass over the file is incredibly inefficient; you can search for as many labels as you can comfortably fit into memory in a single pass over the file.
Something along the following lines, perhaps.
awk 'NR==FNR { a[$0]++; next }
$0 in a { delete a[$0] }
1
END { for (k in a) print k }' strings bigfile >bigfile.new
If you can't fit strings in memory all at once, splitting that into suitable chunks will obviously allow you to finish this in as many passes as you have chunks.
On the other hand, if you have already (effectively) divided the input set into sets of 10-30 labels, you can obviously only search for those 10-30 in one pass. Still, this should provide you with a speed improvement on the order of 10-30 times.
This assumes that a "line" is always a full line. If the label can be a substring of a line in the input file, or vice versa, this will need some refactoring.
If duplicates are not valid in the file, just append them all and filter out the duplicates:
cat myfile mynewlines | awk '!n[$0]++' > mynewfile
This will allow appending millions of lines in seconds.
If order additionally doesn't matter and your files are more than a few gigabytes, you can use sort -u instead.
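A quick sketch of that sort -u route, using the same made-up file names (note the output is sorted, so the original line order is lost):

```shell
# Toy inputs with one duplicate across them
printf 'b\na\nb\n' > myfile
printf 'c\na\n' > mynewlines

# Merge both files and drop duplicates; LC_ALL=C pins the sort order
LC_ALL=C sort -u myfile mynewlines > mynewfile
```

The resulting mynewfile contains a, b, c, one per line.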
Have the script read new lines from stdin after consuming the original file. All lines are stored in an associative array (without any compression such as md5sum).
Prepending the prefix 'x' is meant to handle inputs such as '-e'; better ways probably exist.
#!/bin/bash
declare -A aa
while IFS= read -r line; do
    aa["x$line"]=1
done < file.txt
while IFS= read -r line; do
    if [[ -z ${aa["x$line"]} ]]; then
        aa["x$line"]=1
        echo "$line" >> file.txt
    fi
done

How to remove lines that appear only once in a file using bash

How can I remove lines that appear only once in a file, in bash?
For example, file foo.txt has:
1
2
3
3
4
5
after processing the file, only
3
3
will remain.
Note the file is sorted already.
If your duplicated lines are consecutive, you can use uniq:
uniq -D file
from the man pages:
-D print all duplicate lines
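For example, on the sample data from the question (note that -D is a GNU uniq extension; POSIX uniq only has -d, which prints each duplicate once):

```shell
# Recreate the sample file and print all duplicated lines
printf '1\n2\n3\n4\n5\n' > foo.txt   # placeholder; real file has the duplicate
printf '1\n2\n3\n3\n4\n5\n' > foo.txt
uniq -D foo.txt   # prints the two 3s
```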
Just loop the file twice:
$ awk 'FNR==NR {seen[$0]++; next} seen[$0]>1' file file
3
3
The first pass counts how many times each line occurs: seen[$0] keeps track of it in an array.
The second pass prints those that appear more than once.
Using single pass awk:
awk '{freq[$0]++} END{for(i in freq) for (j=1; freq[i]>1 && j<=freq[i]; j++) print i}' file
3
3
Using freq[$0]++ we count and store frequency of each line.
In the END block if frequency is greater than 1 then we print those lines as many times as the frequency.
Using awk, single pass:
$ awk 'a[$0]++ && a[$0]==2 {print} a[$0]>1' foo.txt
3
3
If the file is unordered, the output will happen in the order duplicates are found in the file due to the solution not buffering values.
Here's a POSIX-compliant awk alternative to the GNU-specific uniq -D:
awk '++seen[$0] == 2; seen[$0] >= 2' file
This turned out to be just a shorter reformulation of James Brown's helpful answer.
Unlike uniq, this command doesn't strictly require the duplicates to be grouped, but the output order will only be predictable if they are.
That is, if the duplicates aren't grouped, the output order is determined by the relative ordering of the 2nd instances in each set of duplicates, and in each set the 1st and the 2nd instances will be printed together.
For unsorted (ungrouped) data (and if preserving the input order is also important), consider:
fedorqui's helpful answer (elegant, but requires reading the file twice)
anubhava's helpful answer (single-pass solution, but a little more cumbersome).

Compare execution logs, ignoring the execution times

I'm new to the Linux OS and bash commands, and I think someone with more experience could help me. I want to compare 2 different text files containing execution logs, but some lines (not all of them) begin with a time token like this:
12345 ps line 1 content
23456 ps line 2 content
line 3 content
345 ps line 4 content
Those tokens have different values in each log, but for this comparison I don't care about them; I just want to compare the line contents and ignore the tokens. I could use 'sed' to generate new files without those tokens and then compare them, but I intend to do this repeatedly, and it would save me some time to use just one command or one sh file. I've tried combining 'sed' and 'diff', but without success. Would anyone please be able to help me?
You can use the following sed one-liner to remove the leading "NNN ps" tokens from each line:
sed 's/^[0-9]* ps//g' file1
To diff two such files (minus the timestamps), you can use process substitution:
diff <(sed 's/^[0-9]* ps//g' file1) <(sed 's/^[0-9]* ps//g' file2)
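A toy run under assumed file names, showing that diff reports no differences once the tokens are stripped (process substitution requires bash, not plain sh):

```shell
# Two logs identical except for the leading time tokens
printf '12345 ps line 1 content\nline 3 content\n' > file1
printf '999 ps line 1 content\nline 3 content\n' > file2

# diff sees identical streams after stripping the "NNN ps" prefixes
diff <(sed 's/^[0-9]* ps//g' file1) <(sed 's/^[0-9]* ps//g' file2) \
    && echo "logs match"
```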
Untested since you didn't show 2 input files and the expected output but from your description I THINK this would do what you want:
awk '
{ sub(/^[[:digit:]]+[[:space:]]*/,"") }
NR==FNR { file1[FNR] = $0; next }
{ print ($0 == file1[FNR] ? "==" : "!="), $0 }
' file1 file2
If that doesn't do it, post some small sample input and expected output.
