Sort/Edit a File with Unix Commands?

I have a 300MB file that looks like this:
Item Item
Item Item
Item2 Something
... ...
It's basically two columns going all the way down, so each row has two entries. The columns are separated by this character (Alt+0009), which I believe is the tab character. The rows are sorted alphanumerically by the first column entry.
Basically, what I need to do is produce a new file from it like so:
First, sort the rows alphanumerically by the second column entry.
Second, remove all rows whose second column entry never appears anywhere in the file as a first column entry.
So for example:
A B
A C
A E
C A
E F
Goes to
C A
A B
A C
A E
E F
then finally to
C A
A C
A E
(Note that in this example I used a space character instead of a tab character to separate the columns; in the file I'm trying to sort, the columns are separated by the tab character (Alt+0009).)
So how would I go about doing this using Unix commands?

The first operation can be handled with the sort utility (with the -k flag set appropriately). The second operation is more complex and will need a short custom script, for example in awk.
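A minimal sketch of both steps combined (file names are placeholders; the two steps commute, so the filter runs first here): awk reads the file twice, remembers every first-column value on the first pass, keeps only rows whose second column was seen, and sort then orders the result by the second field:

awk -F'\t' 'NR==FNR { seen[$1]; next } $2 in seen' input.txt input.txt |
  sort -t $'\t' -k2,2 > output.txt

The NR==FNR condition is only true while awk is reading the first copy of the file, which is what makes the two-pass trick work; -k2,2 restricts the sort key to the second column alone. On the example above this keeps A C, A E and C A, and sorting on the second column yields exactly the final output shown.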

Related

How to take one column from each of two TXT files and create a new TXT with the two columns?

I have two text files with only one column each.
I need to take the column from each of the text files and create a new text file with the two columns with tabs.
These columns have no relating key (ID), but they are in the same order as each other.
I could do that in Excel, but there are more than 200 thousand lines, which Excel won't accept.
How can I do it in Pentaho?
Take two Text file input steps and read both files. After that, add an Add constants step to each stream, creating the same column with some value; make sure both constant values are identical.
Use a Stream lookup or Merge join step and merge the streams on the constant column.
Generate the file.
You can read both files with Text file input and add a "row number" field in each stream, which gives you two streams of two fields each. Then you can Merge join both streams on the row number, and finally use a Select fields step to clean up the output so that only the two relevant fields are kept. Then Text file output to write it.
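As an aside, outside Pentaho the Unix paste utility does exactly this join-by-line-number, and its default delimiter is already a tab (file names here are placeholders):

paste column1.txt column2.txt > merged.txt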

Parsing semicolon separated key value pairs into CSV file

I have a piece of data composed of semicolon-separated key-value pairs (around 50 pairs) on the same line. Not every pair is present on every line.
Below is a sample of the data:
A=0.1; BB=2; CD=hi there; XZV=what's up; ...
A=-2; CD=hello; XZV=no; ...
I want to get a CSV file of this data, where the key becomes the field (column) name and the value becomes the row value of that particular line. Missing pairs should be replaced by default value or left blank.
In other words, I want my CSV to look like this:
A,BB,CD,XZV,....
0.1,2,"hi there","what's up",...
-2,0,"hello","no";...
The volume of my data is extremely large, so what is the most efficient way to do this? A Bash solution would be highly appreciated.
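No answer is recorded here, but one plausible sketch (mine, not from the thread) is a two-pass awk script: the first pass collects every key that occurs anywhere in the data, the second prints one CSV row per line, leaving missing keys blank. It assumes well-formed key=value pairs with no commas inside values, and it omits the quoting shown in the desired output:

awk -F'; *' '
NR==FNR {                        # pass 1: record every key in order of first appearance
    for (i = 1; i <= NF; i++) {
        k = substr($i, 1, index($i, "=") - 1)
        if (!(k in idx)) { idx[k]; keys[++n] = k }
    }
    next
}
FNR==1 {                         # pass 2 starts: print the header line once
    for (j = 1; j <= n; j++) printf "%s%s", keys[j], (j < n ? "," : "\n")
}
{
    split("", val)               # clear the per-line value table
    for (i = 1; i <= NF; i++) {
        eq = index($i, "=")
        val[substr($i, 1, eq - 1)] = substr($i, eq + 1)
    }
    for (j = 1; j <= n; j++)
        printf "%s%s", val[keys[j]], (j < n ? "," : "\n")
}' data.txt data.txt

Reading the file twice keeps the per-line memory constant, which matters at this volume; only the key table is held in memory.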

Advanced concatenation of lines based on a specific number of compared columns in CSV

This question is based on a previously solved problem.
I have the following type of .csv files (they aren't all sorted!, but the column structure is the same):
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1
name3,address3,town3,zip3,,,,,,category3_2
name3,address3,town3,zip3,,,,,,category3_3
name4,address4,town4,zip4,,,,,,category4_1
name4,address4,town4,zip4,email4,,,,,category4_2
name4,address4,town4,zip4,email4,,,,,category4_3
name4,address4,town4,zip4,,,,,,category4_4
name5,address5,town5,zip5,,,,,,category5_1
name5,address5,town5,zip5,,web5,,,,category5_2
name6,address6,town6,zip6,,,,,,category6
The first 4 columns are always populated; the others are not always, except the last one, category.
Empty space between the "," delimiters means there is no data for that column on that particular line.
If nameX comes with addressY instead of addressX, it is a different record (not the same line) and should not be concatenated.
I need a script in sed or awk, or maybe bash (though that solution is a little slower on bigger files [hundreds of MB+]), that takes the first 4 columns (in this case), compares them, and if they match, merges every category with the ";" delimiter while keeping the structure and as much data as possible in the other columns of the matched lines of the .csv file:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,email4,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,web5,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
if that is not possible, solution could be to retain data from the first line of the duped data(the one with categoryX_1). example:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
Does the .csv have to be sorted before using the script?
Thank you again!
sed -n 's/.*/²&³/;H
$ { g
:cat
s/\(²\([^,]*,\)\{4\}\)\(\([^,]*,\)\{5\}\)\([^³]*\)³\(.*\)\n\1\(\([^,]*,\)\{5\}\)\([^³]*\)³/\1~\3~ ~\7~\5;\9³\6/
t fields
b clean
:fields
s/~\([^,]*\),\([^~]*~\) ~\1,\([^~]*~\)/\1,~\2 ~\3/
t fields
s/~\([^,]*\),\([^~]*~\) ~\([^,]*,\)\([^~]*~\)/\1\3~\2 ~\4/
t fields
s/~~ ~~//g
b cat
:clean
s/.//;s/[²³]//g
p
}' YourFile
POSIX version (so --posix with GNU sed), and it works without sorting your file first.
Two recursive loops after loading the full file into the buffer, adding markers for easier manipulation, and a lot of fun with sed group substitution (it probably uses close to the maximum number of groups available):
a loop that appends each category (one line after the other, as needed by the next loop on each field) per line, building a big temporarily structured sub-field (2 groups of fields from the 2 concatenated lines; fields 5 to 9 form one group)
ungroup the sub-fields back to their original places
finally, remove the markers and the first newline
This assumes the characters ², ³ and ~ never occur in the data, because they are used as markers (you can choose other markers and adapt the script).
Note:
For performance on a file of hundreds of MB, I guess awk will be a lot more efficient.
Sorting the data beforehand would certainly also help performance by reducing the amount of data to manipulate after each category loop.
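Along those lines, here is a rough awk sketch of the same merge (mine, not from the answer above): group rows on the first 4 fields, join the categories with ";", and keep the first non-empty value seen for each middle column. It assumes exactly 10 comma-separated columns with no quoted commas, as in the samples:

awk -F',' -v OFS=',' '
{
    key = $1 OFS $2 OFS $3 OFS $4
    if (!(key in cat)) order[++n] = key        # remember first-seen order
    for (i = 5; i <= 9; i++)                   # keep the most data per column
        if (col[key, i] == "" && $i != "") col[key, i] = $i
    cat[key] = (key in cat) ? cat[key] ";" $10 : $10
}
END {
    for (j = 1; j <= n; j++) {
        key = order[j]
        line = key
        for (i = 5; i <= 9; i++) line = line OFS col[key, i]
        print line OFS cat[key]
    }
}' input.csv

Because everything is accumulated in arrays, the input does not need to be sorted first, and duplicate groups may appear anywhere in the file.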
I found that this particular problem is faster to process through a database:
SQL - GROUP BY to combine/concat a column
db: MySQL through WAMP

How to sort file rows with vi?

I have to edit multiple files with multiple rows, and also everything is in three columns, like this:
#file
save get go
go save get
rest place reset
Columns are separated with tabs. Is there any way to sort the rows based on the second or third column using vi?
sort by the 2nd col:
:sor /\t/
sort by the 3rd col:
:sor /\t[^\t]*\t/
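With a /pattern/ argument, :sort skips the text matched by the pattern and compares what comes after it, so on the sample file above :sor /\t/ orders the rows by their second column:
save get go
rest place reset
go save get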
Second column:
:sort /\%9c/
Third column:
:sort /\%16c/
\%16c means "column 16". Note that \%c counts byte columns, so this approach only works when the columns start at fixed byte positions on every line.
Highlight the rows you want to sort with the V command.
Then filter the selection through a shell command with !, like:
!sort -k2
where the number picks the field to sort on (the second whitespace-separated field here; use -k1.10 to sort starting at character position 10 of the line instead).
vi will replace the selection with the output of the sort command, which is given the original selection as its input.
You can specify a pattern for :sort. For example:
:sort /^\w*\s*/
will sort on the second column (sorting starts right after the text matched by the pattern).
Likewise,
:sort /^\w*\s*\w*\s*/
should sort on the third column.
Delimit the columns with some character; here I use the | symbol as the delimiter. Once that is done, you can use the command below to sort on the text after the last | (add the n flag if you want a numeric sort). It works in some versions of vi and not in others, e.g. not in Ubuntu's plain vi:
:sort /|.*|/

Using awk to grab random lines and append them as a new column?

So I have a document "1", which is one column. I have 3 more files with one column each, and I want to append a randomly selected line from each of those files onto document 1's lines.
So like
awk 'NR==10' moves.txt 'NR==1' propp_tasks.txt
prints out
10.Qg3 Bb4+
First function of the donor
when I want it to be:
10 Qg3 Bb4+ First function of the donor
Is there a good way to do this with awk? I had been trying to set up a bash script with a for loop, but I didn't know how to cycle the indices so that on line n of document 1, columns 2, 3 and 4 would be appended there. I feel like this should be really, really simple...
paste 1 <(cat 2 3 4 | sort -R)
If the length of the first file and the length of the combination of the other 3 files are different, then some more work is required.
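Two caveats worth knowing: sort -R is a GNU extension, and because it sorts by a random hash of the key it keeps identical lines grouped together; shuf gives a true shuffle. Also, if what you want is one randomly ordered column drawn from each file rather than one column drawn from the pool of all three, shuffle each file separately:

paste 1 <(shuf 2) <(shuf 3) <(shuf 4)

Either way, if the files differ in length, paste pads the missing fields with blanks.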
