How to sort a space-separated flat file in bash?

If I have a flat-file database in which fields are separated by spaces, like this:
Name Salary_cost function
Luc 50000 Engineer in mechanics
Gerard 35000 Bad in all, good at nothing
Martijn 150000 Director
Robert 45000 Java Specialist
(...)
I would like to sort this by Salary_cost. I can do it with something like this:
cat file.txt | sed -e 's/ \+/\t/g' | sort -k 2
But this is no good, because:
The first line is a header, not data to be sorted (only sheer chance puts it at the top or the bottom, or sometimes God knows where...).
If the order of the fields changes or I add some fields, then I have to rewrite the command...
It is complicated: I have to use numbers to designate fields whose names are strings (not numbers).
It is not elegant.
The data are modified (the runs of spaces are replaced by tabs).
...
I have thought of something like Recutils, but I cannot grasp how to use it for this purpose.
How can I sort this file by the "Salary_cost" field, treating the following lines as records and the first line as a header, using a command-line interface (bash, sh, ksh, ...)?
There are a lot of tools which produce such output, for example df, transmission-remote, ps, ... Even comma-separated files are close to this structure.

You can use a head/tail combination piped to sort:
fld="Salary_cost"
n=$(awk -v q="$fld" 'NR==1{for (i=1; i<=NF; i++) if ($i==q) {print i; exit}}' file)
head -1 file && tail -n +2 file | sort -nk$n
Name Salary_cost function
Gerard 35000 Bad in all, good at nothing
Robert 45000 Java Specialist
Luc 50000 Engineer in mechanics
Martijn 150000 Director
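If you need this for several files or fields, the same idea can be wrapped in a small bash function; a sketch (the name sortby is made up, and it assumes a numeric sort on the chosen column):
sortby() {   # usage: sortby FIELD_NAME FILE
    local fld=$1 file=$2 n
    # find the column number of the named field in the header line
    n=$(awk -v q="$fld" 'NR==1{for (i=1; i<=NF; i++) if ($i==q) {print i; exit}}' "$file")
    # print the header, then sort the remaining lines numerically on that column only
    head -n 1 "$file" && tail -n +2 "$file" | sort -n -k"$n","$n"
}
sortby Salary_cost file.txt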

Related

grep on one file and sort matches to several output files

My question concerns the following:
I have the file:
FileA:
Peter Programmer
Frank Chemist
Charles Physicist
John Programmer
Alex Programmer
Harold Chemist
George Chemist
I have now extracted all the job information from FileA and saved it as a unique list (FileB).
FileB:
Programmer
Chemist
Physcist
(Assume the FileA goes on and on with many more people and redundant information)
What I want to do now is get all the job classes from FileA and create a new file for each Job-Class so that in the end I have:
FileProgrammer
Peter Programmer
John Programmer
Alex Programmer
FileChemist
Frank Chemist
Harold Chemist
George Chemist
FilePhysicist
Charles Physicist
I want to grep the pattern of the job name from the list in the Jobs file and create a new file for every job name which exists in the original file.
So in reality, I have 56 unique elements in my list and the original file has several columns (tab-delimited).
What I did so far was this:
cut -f2 "${FileA}" | sort | uniq > Jobs
grep -f <(tr '\t' '\n' < "${Jobs}") "${FileA}" > FileA+"${Jobs}"
I assumed that a new file would be created on each new pattern match, but I realized that this just copies the whole file, because there is no increment or iterative file creation.
Since my experience with bash is not yet very deep, I hope you guys can help me. Thanks in advance.
Update:
The input file looks like this:
4 23454 22110 Direct + 3245 Corrected
3 21254 12110 Indirect + 2319 Paused-#2
11 45233 54103 Direct - 1134 Not-Corrected
Essentially, I want every line whose status in column 7 is Corrected to go into a file named corrected, and likewise for every unique value of column 7.
This calls for awk; here is how you can do it:
awk '{ unique[$2] = (unique[$2] FS $1) }
     END {
         for (i in unique) {
             len = split(unique[i], temp)
             for (j = 1; j <= len; j++)
                 print temp[j], i > ("File" i ".txt")
         }
     }' file
The idea is to build a hash map with unique[$2]=(unique[$2] FS $1), which literally means: treat $2 as the index into the array unique and append $1 to its value. After every line of the input file has been processed, the array looks like this:
# <key> <value(s)>
Chemist Frank Harold George
Physicist Charles
Programmer Peter John Alex
The END block is executed after all the lines have been processed. For each key of the array we use the split() function, which by default splits on whitespace, to store the accumulated names in the array temp; len contains the number of elements resulting from the split.
The loop then walks over each hash key and, for each of the split elements, prints the name and the key into the corresponding file.
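For the updated input, where the grouping key is the status in column 7, the same idea collapses to a one-line sketch (the input file name and the "FileX.txt" naming convention are assumptions here):
awk '{ print > ("File" $7 ".txt") }' inputfile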
You can do it with grep inside a loop:
for i in $(cat FileB); do grep "$i$" FileA >> "File$i"; done
Note that in FileA of your question you wrote "Physicist" and in FileB you wrote "Physcist", so they won't match. Anyway, if you spell both of them consistently, the above command will work.

Searching the first field and getting the output of all records where the first field is the same.

I have a pretty big text file. This file contains words and a number of definitions given for the words. There are 60 words which are repeated 17 times. The words are always in the first field and the definitions in the following fields adjacent to the words.
Example:
hand;extremity of the body;that which is commonly used to write with
paper;thin sheet made of wood pulp;material used to write things on;some other def's
book;collection of pages on a topic;publication of knowledge;concatenated paper with text
ham;that which comes from pork;a tasty meat;a type of food
anotherword;defs;defs;defs;defs
It continues until it reaches the 60th word, then restarts with the same 60 words and different definitions. The order isn't always the same, so the next 60 might be:
book;defs;defs;defs
television;defs;defs;defs
ham;defs;defs;defs;defs;defs
paper;defs;defs
The field separator for this file is ";" and there is an empty line between records, as shown in the examples.
What I want to do is look at the first field and output together all records that share the same first field.
Example:
ham;defs;defs;defs;defs;defs
ham;defs;defs;defs
ham;defs;defs;defs;defs
ham;defs;defs;
ham;defs;defs;defs
ham;defs;
ham;defs;defs
ham;defs;defs;defs;defs
paper;defs;defs;defs;defs
paper;defs;defs;defs
paper;defs;defs;
and so on.
I apologize if this isn't clear. Please help!
A simple grep and sort pipeline can do that for you... try as below.
Explanation:
# ^$ will search for blank lines and -v will reverse that search ... so you get all lines which have data
# passing that data to the sort command will sort it...
# the -t option of sort sets the delimiter and the -k option selects which column to sort on
grep -v ^$ yourfile.txt | sort -t";" -k1
# And if you expect duplicate lines, meaning the same line appears multiple times but you need it only once... then pipe to the uniq command as below
grep -v ^$ yourfile.txt | sort -t";" -k1 | uniq
For your sample data I get the output below:
$ grep -v ^$ mysamplefile.txt | sort -t";" -k1 | uniq
anotherword;defs;defs;defs;defs
book;collection of pages on a topic;publication of knowledge;concatenated paper with text
book;defs;defs;defs
ham;defs;defs;defs;defs;defs
ham;that which comes from pork;a tasty meat;a type of food
hand;extremity of the body;that which is commonly used to write with
paper;defs;defs
paper;thin sheet made of wood pulp;material used to write things on;some other def's
television;defs;defs;defs
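If you also want each group of records to keep its original relative order (instead of being sorted on the whole line), a minimal variant is to restrict the key to the first field and request a stable sort; a sketch, assuming a sort that supports -s (GNU or BSD):
grep -v ^$ yourfile.txt | sort -t";" -s -k1,1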

Grepping progressively through large file

I have several large data files (~100MB-1GB of text) and a sorted list of tens of thousands of timestamps that index data points of interest. The timestamp file looks like:
12345
15467
67256
182387
199364
...
And the data file looks like:
Line of text
12345 0.234 0.123 2.321
More text
Some unimportant data
14509 0.987 0.543 3.600
More text
15467 0.678 0.345 4.431
The data in the second file is all in order of timestamp. I want to grep through the second file using the time stamps of the first, printing the timestamp and fourth data item in an output file. I've been using this:
grep -wf time.stamps data.file | awk '{print $1 "\t" $4 }' >> output.file
This is taking on the order of a day to complete for each data file. The problem is that this command searches through the entire data file for every line in time.stamps, but I only need the search to pick up from the last data point. Is there any way to speed up this process?
You can do this entirely in awk …
awk 'NR==FNR{a[$1]++;next}($1 in a){print $1,$4}' timestampfile datafile
JS웃's awk solution is probably the way to go. If join is available and the first field of the irrelevant "data" lines is not numeric, you could exploit the fact that the files are in the same order and avoid a sorting step. This example uses bash process substitution on Linux:
join -o2.1,2.4 -1 1 -2 1 key.txt <(awk '$1 ~ /^[[:digit:]]+$/' data.txt)
grep has a little-used option, -f filename, which gets the patterns from filename and does the matching. It is likely to beat the awk solution, and your timestamps would not have to be sorted.
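Since both files are already sorted, another option is a single-pass merge in awk that never re-scans the data file; a sketch, assuming the timestamps and the first field of the relevant data lines are plain integers:
awk 'NR==FNR { ts[n++] = $1 + 0; next }      # load the sorted timestamps
     $1 ~ /^[0-9]+$/ {                        # only consider the numeric data lines
         while (i < n && ts[i] < $1 + 0) i++  # skip timestamps we have already passed
         if (i < n && ts[i] == $1 + 0) print $1 "\t" $4
     }' time.stamps data.file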

Reading millions of files (in a certain order) and putting them into one big file --- fast

In my bash script I have the following (for concreteness I preserve the original names;
sometimes people ask about the background etc., and then the original names make more sense):
tail -n +2 Data | while read count phi npa; do
cat Instances/$phi >> $nF
done
That is, the first line of the file Data is skipped, and then all remaining lines, which are of the form "r c p n", are read, and the content of the file Instances/p is appended to the file $nF (in the order given by Data).
In typical examples, Data has millions of lines. So perhaps I should write a
C++ application for that. However I wondered whether somebody knew a faster
solution just using bash?
Here I use cut instead of your while loop, but you could re-introduce that if it provides some utility to you. The loop would have to output the phi variable once per iteration.
tail -n +2 Data | cut -d' ' -f 2 | xargs -I{} cat Instances/{} >> $nF
This replaces the shell read loop with a single stream, which should improve efficiency, and I believe that using cut here helps further. Note that with -I{}, xargs still runs one cat per file name; see the batched variant below.
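If the number of cat invocations is the real bottleneck, xargs can also pack many file names into each call; a sketch, assuming the names listed in Data contain no whitespace or quote characters:
tail -n +2 Data | awk '{ print "Instances/" $2 }' | xargs cat >> "$nF"
Without -I, xargs passes as many arguments to each cat as the system allows, so millions of lines turn into only a handful of cat processes.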

Split input into multiple outputs based on content?

Let's assume there is a file which looks like this:
xxxx aa whatever
yyyy bb whatever
zzzz aa whatever
I'd like to split it into 2 files, containing:
first:
xxxx aa whatever
zzzz aa whatever
second:
yyyy bb whatever
I.e. I want to group the rows based on some value in the lines (the rule can be: the 2nd space-separated word), but not reorder the lines within groups.
Of course I can write a program to do it, but I'm wondering if there is any ready-made tool that can do something like this?
Sorry, I didn't mention it, as I assumed it was pretty obvious: the number of different "words" is huge. We are talking about at least 10000 of them, i.e. any solution based on enumerating the words beforehand will not work.
Also, I would rather avoid a multi-pass split - the files in question are usually pretty big.
This will create files named output.aa, output.bb, etc.:
awk '{print >> "output." $2}' input.file
Well, you could do a grep to get the lines that match, and a grep -v to get the lines that don't match.
Hm, you could do sort -t" " -s -k 2,2, but that's O(n log n).
