hierarchical sort of a tree csv file

I have a csv file of a folder/file tree that needs to be sorted hierarchically and alphabetically. The leading commas can be thought of as indents (like tabs or spaces). Strictly speaking, since the file is comma-separated, the commas are field delimiters and consecutive commas bound blank fields; in terms of field position, the csv records the hierarchy by putting the children of a parent one field further to the right (field n+1). The number of leading commas is therefore significant: as in an outline, it represents the depth in the folder hierarchy.
In sample_csv, "foldertop" contains 2 sub-folders, and each sub-folder contains 2 files. The files need to be sorted within their respective parent folders, and the sub-folders need to be sorted within their parent folder. The parent-child sets must be preserved in the end result.
$cat sample_csv
,foldertop,,,,,,,,,,,,,,libname,
,,folderB,,,,,,,,,,,,,libname,
,,,filethis,,,,,,,,,,,,libname,
,,,filethat,,,,,,,,,,,,libname,
,,folderA,,,,,,,,,,,,,libname,
,,,filetwo,,,,,,,,,,,,libname,
,,,fileone,,,,,,,,,,,,libname,
The sort command orders the lines alphabetically, but destroys the hierarchy.
$sort sample_csv
,,,fileone,,,,,,,,,,,,libname,
,,,filethat,,,,,,,,,,,,libname,
,,,filethis,,,,,,,,,,,,libname,
,,,filetwo,,,,,,,,,,,,libname,
,,folderA,,,,,,,,,,,,,libname,
,,folderB,,,,,,,,,,,,,libname,
,foldertop,,,,,,,,,,,,,,libname,
The hierarchical sort I need would sort siblings alphabetically while preserving the hierarchy.
$hier_sort sample_csv
,foldertop,,,,,,,,,,,,,,libname,
,,folderA,,,,,,,,,,,,,libname,
,,,fileone,,,,,,,,,,,,libname,
,,,filetwo,,,,,,,,,,,,libname,
,,folderB,,,,,,,,,,,,,libname,
,,,filethat,,,,,,,,,,,,libname,
,,,filethis,,,,,,,,,,,,libname,
Is there any elegant way to do this?
BTW, the "libname" is another non-blank field on each line that exists in the csv, but is not relevant to the sorting or hierarchy, and can be ignored. I could have left that detail out for simplicity.

Related

Deleting entire row if text is found at any column of the sequential file

Using SORT, is it possible to delete a record if a supplied text is in the row? For instance, in the following records any record that contains the text "record" would not be copied.
Suppose:
123456abcdrecord123
111recordaaaaaaaaaa
recordjjjjjj1111111
11111111111abcccccc
So my output should be:
11111111111abcccccc
Can anyone suggest the right control cards for SORT?
Try
OMIT COND=(1,19,SS,EQ,C'record')
Substring search for INCLUDE and OMIT
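That statement does a substring scan (SS) of positions 1 through 19 for C'record' and omits any record where it is found. Outside DFSORT the same filter is easy to express; here is a small Python sketch of the equivalent logic, purely as an illustration:
# Illustration of OMIT COND=(1,19,SS,EQ,C'record'): drop any record whose
# first 19 bytes contain the substring "record"; copy everything else.
import sys

for line in sys.stdin:
    if "record" not in line[:19]:
        sys.stdout.write(line)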

Parsing semicolon separated key value pairs into CSV file

I have data composed of semicolon-separated key=value pairs (around 50 pairs) on the same line. Not every pair is present on every line.
Below is a sample of the data:
A=0.1; BB=2; CD=hi there; XZV=what's up; ...
A=-2; CD=hello; XZV=no; ...
I want to get a CSV file of this data, where the key becomes the field (column) name and the value becomes the row value for that particular line. Missing pairs should be replaced by a default value or left blank.
In other words, I want my CSV to look like this:
A,BB,CD,XZV,....
0.1,2,"hi there","what's up",...
-2,0,"hello","no";...
The volume of my data is extremely large. What is the most efficient way to do this? A Bash solution would be highly appreciated.
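A plain Bash read loop will be slow at this volume; below is a minimal Python sketch of one possible two-pass approach (an assumption of mine, not the only way): the first pass collects every key that appears anywhere so the header is complete, the second pass writes one CSV row per input line, leaving missing keys blank. The csv module only quotes fields that actually need it, so the quoting will differ slightly from the sample output. The input path is taken as the first argument.
#!/usr/bin/env python3
# Sketch: convert lines of "KEY=value; KEY=value; ..." into CSV.
import csv
import sys

path = sys.argv[1]  # file of semicolon-separated key=value pairs

def pairs(line):
    """Yield (key, value) tuples from one input line."""
    for chunk in line.strip().split(";"):
        chunk = chunk.strip()
        if "=" in chunk:
            key, value = chunk.split("=", 1)
            yield key.strip(), value.strip()

# Pass 1: discover every key so the header covers all columns.
fieldnames, seen = [], set()
with open(path) as f:
    for line in f:
        for key, _ in pairs(line):
            if key not in seen:
                seen.add(key)
                fieldnames.append(key)

# Pass 2: emit one CSV row per line; missing keys become blank fields.
writer = csv.DictWriter(sys.stdout, fieldnames=fieldnames,
                        restval="", lineterminator="\n")
writer.writeheader()
with open(path) as f:
    for line in f:
        writer.writerow(dict(pairs(line)))
If the set of keys is fixed and known up front, the first pass can be dropped and the file streamed once.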

advanced concatenation of lines based on the specific number of compared columns in csv

This is a question based on a previously solved problem.
I have the following type of .csv files (they aren't all sorted, but the column structure is the same):
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1
name3,address3,town3,zip3,,,,,,category3_2
name3,address3,town3,zip3,,,,,,category3_3
name4,address4,town4,zip4,,,,,,category4_1
name4,address4,town4,zip4,email4,,,,,category4_2
name4,address4,town4,zip4,email4,,,,,category4_3
name4,address4,town4,zip4,,,,,,category4_4
name5,address5,town5,zip5,,,,,,category5_1
name5,address5,town5,zip5,,web5,,,,category5_2
name6,address6,town6,zip6,,,,,,category6
The first 4 columns are always populated; the other columns are not always, except the last one, category.
An empty field between "," delimiters means there is no data for that field on that line.
If nameX has addressY instead of addressX, it is a different record (not the same line) and should not be concatenated.
I need a script in sed or awk, or maybe bash (though that is a little slower on bigger files [hundreds of MB+]), that takes the first 4 columns (in this case), compares them, and if they match, merges every category with a ";" delimiter while keeping the structure and as much data as possible in the other columns of those matched lines of the .csv file:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,email4,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,web5,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
If that is not possible, a solution could be to retain the data from the first line of the duplicated group (the one with categoryX_1). Example:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
Does the .csv have to be sorted before using the script?
Thank you again!
sed -n 's/.*/²&³/;H
$ { g
:cat
s/\(²\([^,]*,\)\{4\}\)\(\([^,]*,\)\{5\}\)\([^³]*\)³\(.*\)\n\1\(\([^,]*,\)\{5\}\)\([^³]*\)³/\1~\3~ ~\7~\5;\9³\6/
t fields
b clean
:fields
s/~\([^,]*\),\([^~]*~\) ~\1,\([^~]*~\)/\1,~\2 ~\3/
t fields
s/~\([^,]*\),\([^~]*~\) ~\([^,]*,\)\([^~]*~\)/\1\3~\2 ~\4/
t fields
s/~~ ~~//g
b cat
:clean
s/.//;s/[²³]//g
p
}' YourFile
POSIX version (so use --posix with GNU sed), and it works without sorting your file first.
Two recursive loops after loading the full file into the buffer, adding markers for easier manipulation, and a lot of fun with sed group substitution (it just about reaches the maximum number of back-reference groups available).
One loop appends the categories (one line after the other, as needed by the next loop on each field) and builds a big temporary sub-field structure (two groups of fields from the two concatenated lines; fields 5 to 9 form one group).
The other loop moves the sub-fields back to their original places.
Finally, remove the markers and the first newline.
This assumes the characters ², ³ and ~ do not occur in the data, because they are used as markers (you can use other markers and adapt the script accordingly).
Note:
For performance on files of hundreds of MB, awk will likely be a lot more efficient.
Sorting the data beforehand would certainly help performance by reducing the amount of data to manipulate in each category loop.
I found that this particular problem is processed faster through a db...
SQL - GROUP BY to combine/concat a column
db: mysql through wamp
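For comparison with the sed and database approaches above, here is a single-pass grouping sketch in Python (not the sed or awk asked for, just an illustration of the idea the awk note hints at): group on the first 4 columns, join the last column with ";", and keep the first non-empty value seen in each of the other columns. The input does not need to be sorted, but the grouped rows are held in memory.
#!/usr/bin/env python3
# Sketch: merge csv lines that share their first 4 columns.
import csv
import sys

groups = {}  # (name, address, town, zip) -> merged row
order = []   # keys in first-seen order, so output keeps the input order

with open(sys.argv[1], newline="") as f:
    for row in csv.reader(f):
        if not row:
            continue
        key = tuple(row[:4])
        if key not in groups:
            groups[key] = list(row)
            order.append(key)
        else:
            merged = groups[key]
            # keep as much data as possible in the middle columns
            for i in range(4, len(row) - 1):
                if not merged[i] and row[i]:
                    merged[i] = row[i]
            # concatenate the category column with ";"
            merged[-1] += ";" + row[-1]

out = csv.writer(sys.stdout, lineterminator="\n")
for key in order:
    out.writerow(groups[key])
On the sample above this gives the first desired output, including email4 and web5 picked up from the later duplicate lines.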

extract text from links and insert in a new sorted html file

I have a single html file that's updated and generated frequently. It's a list sorted by names, linked with html links that contain numbers at the end, like ...#835C or #717.
I would like to extract just the number (no letter) and insert it into a new html file sorted by the number but still linked with the original.
Can you help me make this chore an easy one, thanks?
Here is the file I shortened from the current count of 850. The upper half is original and the lower is what I'd like extracted as an example.
http://selvan777.tripod.com/test/rin.htm
Thanks,
Selvan
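Without seeing the actual markup at rin.htm, any script has to guess at the structure, so the sketch below is built on assumptions: each entry is a plain <a href="...#NNN"> link whose fragment ends in digits plus an optional letter (like #835C or #717). It strips the letter, sorts the entries numerically, and writes a new list that still links to the original targets.
#!/usr/bin/env python3
# Sketch: extract trailing link numbers and emit a numerically sorted list.
import re
import sys

html = open(sys.argv[1], encoding="utf-8", errors="replace").read()

entries = []
for href, text in re.findall(r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                             html, flags=re.I | re.S):
    # keep only the digits of a trailing fragment like #835C or #717
    match = re.search(r'#(\d+)[A-Za-z]?\s*$', href)
    if match:
        entries.append((int(match.group(1)), href, text.strip()))

entries.sort()  # numeric sort on the extracted number

print("<html><body><ul>")
for number, href, text in entries:
    print(f'<li><a href="{href}">{number}</a> {text}</li>')
print("</ul></body></html>")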

List of names and their numbers needed to be sorted .TXT file

I have a list of names (never over 100 names) with a value for each of them, either 3 or 4 digits.
john2E=1023
mary2E=1045
fred2E=968
And so on... They're formatted exactly like that in the .txt file. I have Python and Excel, and I'm willing to download whatever else I need.
What I want to do is sort all the names according to their values in descending order, so the highest is on top. I've tried to use Excel by replacing the '2E=' with ',' so I can have name,value, then importing the data so each is in a separate column, but I still couldn't sort them any way other than A to Z.
Help is much appreciated, I did take my time to look around before posting this.
Replace the "2E=" with a tab character so that the data is displayed in excel in two columns. Then sort on the value column.
