How to Add 4 Blank Columns to a Pipe-Delimited CSV via Command Line - Windows

I am on a Windows machine.
I have a CSV file like the one below, which uses pipe as the delimiter:
Column 1 | Column 2 | Column 3
1 | 2 | 3
1 | 2 | 3
And I need to add 4 blank columns to make it look like:
Column 1 | Column 2 | Column 3 ||||
1 | 2 | 3 ||||
1 | 2 | 3 ||||
This works fine when the delimiter is a comma, but I can't figure out what to do for the pipe:
@echo off
for /f "delims=" %%a in ('type "Test.csv"') do (
    >>"fileout.csv" echo.%%a,,,,
)

The escape character for batch scripts is the caret (^). You can keep your existing code and just add a caret before each literal pipe; the pipes arriving inside %%a come from the data and are not reparsed, so they need no escaping:
@echo off
for /f "delims=" %%a in ('type "Test.csv"') do (
    >>"fileout.csv" echo.%%a^|^|^|^|
)
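If escaping each pipe gets unwieldy, an alternative (my sketch, not part of the original answer) is to keep the pipes inside a variable and emit them with delayed expansion, so they are never parsed as operators. Caveat: delayed expansion will mangle any ! characters in the data.
@echo off
setlocal EnableDelayedExpansion
set "suffix=||||"
(for /f "delims=" %%a in ('type "Test.csv"') do (
    rem !suffix! expands after the line is parsed, so its pipes stay literal
    echo(%%a!suffix!
))>"fileout.csv"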

Related

Count matches across files

I am stuck on an awk-related problem of counting matching occurrences. I have a file containing a list of IDs (numbers and/or characters), and another file whose first column contains another ID and whose second column contains a collection of the first IDs:
File 1:
| ID1 |
| --- |
| 1 |
| 2 |
| 5 |
| 7 |
File 2:
| ID2 | ID1_collection |
| -------- | -------- |
| 1 | 1,2,3 |
| 2 | 1 |
| 3 | 4 |
| 4 | |
| 5 | 5 |
| 6 | |
The collection column may be empty, and its entries don't have to match any ID present in the first file. The goal is a file that looks like this:
| ID2 | ID1_collection | count |
| -------- | -------- | -------- |
| 1 | 1,2,3 | 2 |
| 2 | 1 | 1 |
| 3 | 4 | 0 |
| 4 | | 0 |
| 5 | 5 | 1 |
| 6 | | 0 |
However, I am unable to come up with logic that goes through the whole column of file 1 and counts, with an awk script, how many of those IDs are present inside the collection.
I thought I could create an array containing all ID1 values and split each string from ID1_collection at the separator inside the column (the global separator is "|") to then look for exact matches. But I am not able to figure out a) how efficient this would be (I guess not very) and b) how to write the syntax in a reasonable fashion...
Any help would be appreciated.
An approach using awk:
% awk 'NR == FNR { arr[$1]++; next }
       FNR == 1 { print $0, "count"; next }
       { n = split($2, a, ",")
         for (i in arr)
             for (j = 1; j <= n; j++)
                 if (i == a[j]) y++
         print $0, y; y = 0 }' file1 file2
ID2 ID1_collection count
1 1,2,3 2
2 1 1
3 4 0
4 "" 0
5 5 1
6 "" 0
Data
% cat file1 file2
ID1
1
2
5
7
ID2 ID1_collection
1 1,2,3
2 1
3 4
4 ""
5 5
6 ""

bash looping and extracting a fragment of a txt file

I am analyzing a large number of dlg text files located in the working directory. Each file has a table (usually located at a different position in the log) in the following format:
File 1:
CLUSTERING HISTOGRAM
____________________
________________________________________________________________________________
| | | | |
Clus | Lowest | Run | Mean | Num | Histogram
-ter | Binding | | Binding | in |
Rank | Energy | | Energy | Clus| 5 10 15 20 25 30 35
_____|___________|_____|___________|_____|____:____|____:____|____:____|____:___
1 | -5.78 | 11 | -5.78 | 1 |#
2 | -5.53 | 13 | -5.53 | 1 |#
3 | -5.47 | 17 | -5.44 | 2 |##
4 | -5.43 | 20 | -5.43 | 1 |#
5 | -5.26 | 19 | -5.26 | 1 |#
6 | -5.24 | 3 | -5.24 | 1 |#
7 | -5.19 | 4 | -5.19 | 1 |#
8 | -5.14 | 16 | -5.14 | 1 |#
9 | -5.11 | 9 | -5.11 | 1 |#
10 | -5.07 | 1 | -5.07 | 1 |#
11 | -5.05 | 14 | -5.05 | 1 |#
12 | -4.99 | 12 | -4.99 | 1 |#
13 | -4.95 | 8 | -4.95 | 1 |#
14 | -4.93 | 2 | -4.93 | 1 |#
15 | -4.90 | 10 | -4.90 | 1 |#
16 | -4.83 | 15 | -4.83 | 1 |#
17 | -4.82 | 6 | -4.82 | 1 |#
18 | -4.43 | 5 | -4.43 | 1 |#
19 | -4.26 | 7 | -4.26 | 1 |#
_____|___________|_____|___________|_____|______________________________________
The aim is to loop over all the dlg files and take from each table the single line corresponding to the widest cluster (the one with the most # marks in the Histogram column). In the above example that is the third line:
3 | -5.47 | 17 | -5.44 | 2 |##
Then I need to add this line to final_log.txt together with the name of the log file (which should precede the line). So in the end I should have something in the following format (for 3 different log files):
"Name of the file 1": 3 | -5.47 | 17 | -5.44 | 2 |##
"Name_of_the_file_2": 1 | -5.99 | 13 | -5.98 | 16 |################
"Name_of_the_file_3": 2 | -4.78 | 19 | -4.44 | 3 |###
A possible model of my BASH workflow would be:
#!/bin/bash
for f in ./*.dlg
do
    file_name2=$(basename "$f")
    file_name="${file_name2/.dlg}"
    echo "Processing of $f..."
    # take the name of the file and save it in the log
    echo "$file_name" >> "$PWD/final_results.log"
    # find the beginning of the table inside each file and save it after the name
    grep 'CLUSTERING HISTOGRAM' "$f" >> "$PWD/final_results.log"
    # check whether it works
    gedit "$PWD/final_results.log"
done
Here I need to replace the combination of echo and grep with something that takes the selected parts of the table.
You can use this one-liner; it should be fast enough, and extra lines in your files besides the tables are not a problem:
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'
grep fetches all the histogram lines, which are then sorted in reverse order by the last field, putting the lines with the most # at the top; finally awk keeps only the first line seen for each file, removing the duplicates. Note that when grep parses more than one file it prints the filename at the beginning of each line by default (-H), so if you test with a single file, add -H explicitly (see the example after the results).
Result should be like this:
file1.dlg: 3 | -5.47 | 17 | -5.44 | 2 |##########
file2.dlg: 3 | -5.47 | 17 | -5.44 | 2 |####
file3.dlg: 3 | -5.47 | 17 | -5.44 | 2 |#######
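For a single file, the same pipeline with an explicit -H so the filename prefix (and thus awk's $1) is still there:
grep -H "#$" file1.dlg | sort -rk11 | awk '!seen[$1]++'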
Here is a modification that keeps the first appearance in case a file has several equally wide max lines:
grep "#$" *.dlg | sort -k11 | tac | awk '!seen[$1]++'
We replaced sort's reverse flag with the tac command, which reverses the stream, so now for any equal lines the initial order is preserved.
Second solution
Here using only awk:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) print i ":" row[i]}' *.dlg
Update: if you execute it from a different directory and want to keep only the basename of every file, removing the path prefix:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) {sub(".*/","",i); print i ":" row[i]}}'
Probably makes more sense as an Awk script.
This picks the first line with the widest histogram in the case of a tie within an input file.
#!/bin/bash
awk 'FNR == 1 { if(sel) print sel; sel = ""; max = 0 }
FNR < 9 { next }
length($10) > max { max = length($10); sel = FILENAME ":" $0 }
END { if (sel) print sel }' ./"$prot"/*.dlg
This assumes the histograms are always the tenth field; if your input format is even messier than the lump you show, maybe adapt to taste.
In some more detail, the first rule triggers on the first line of each input file. If we have collected a line from a previous file (meaning this is not the first input file), print it, and start over; in either case, sel is reset to the empty string and max to zero.
The second line skips lines 1-8 which contain the header.
The third line checks if the current line's histogram is longer than max. If it is, update max to this histogram's length, and remember the current line in sel.
The last line is spillover for when we have processed all files. We never printed the sel from the last file, so print that too, if it's set.
If you mean that we should find the lines between CLUSTERING HISTOGRAM and the end of the table, we should probably have more information about what the surrounding lines look like. Maybe something like this, though:
awk '/CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0 }
!looking { next }
looking > 1 && $1 != looking { looking = 0; nextfile }
$1 == looking { if (length($10) > max) { max = length($10); sel = FILENAME ":" $0 }; looking++ }
END { if (sel) print sel }' ./"$prot"/*.dlg
This sets looking to 1 when we see CLUSTERING HISTOGRAM, then counts it up through the rank numbers in the first column; at the first line where the first field no longer matches the expected rank, the table has ended and we skip to the next file.
I would suggest processing using awk:
FILES=./*.dlg   # assumption: $FILES was referenced without a definition; point this glob at the dlg files
for i in $FILES
do
    echo -n "\"$i\": "
    awk 'BEGIN {
        output=""
        outputlength=0
    }
    /^ *[0-9]+/ {   # process only lines that start with a number
        if (length(substr($10, 2)) > outputlength) {   # if the line has more hashes, store it
            output=$0
            outputlength=length(substr($10, 2))
        }
    }
    END {
        print output   # output the resulting line
    }' "$i"
done

Split columntext to rows (extract delimiter in bracket) ORACLE SQL

I have a column in a database table that contains multiple values in one field, which I need as separate rows.
The column contains comma-delimited parts, but also parts with commas inside brackets, which I must not split. (Only split on commas that are NOT inside brackets.)
Versions
Oracle 11g
Example:
**ID | Kategory**
1 | "ATD 5(2830),ATO 4(510),EDI 1,EH A1,SCI 2,SS 1,STO-SE 1(oral, CNS, blood),STO-SE 2(oral, respiratory effects)"
I need this string as:
- 1 => ATD 5(2830)
- 1 => ATO 4(510)
- 1 => EDI 1
- 1 => EH A1
- 1 => SCI 2
- 1 => SS 1
- 1 => STO-SE 1(oral, CNS, blood)
- 1 => STO-SE 2(oral, respiratory effects)
Parts like (oral, CNS, blood), which contain commas inside brackets, must not be split.
You can use the regular expression (([^(]*?(\(.*?\))?)*)(,|$) to match:
[^(]*? Zero-or-more (but as few as possible) non-opening-bracket characters
(\(.*?\))? Then, optionally, an opening bracket and as few characters as possible until the closing bracket.
( )* Wrapped in a capturing group repeated zero-or-more times
( ) Wrapped in a capturing group to be able to reference the entire matched item
(,|$) Followed by either a comma or the end-of-string.
Like this:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE table_name ( ID, Kategory ) AS
SELECT 1, 'ATD 5(2830),ATO 4(510),EDI 1,EH A1,SCI 2,SS 1,STO-SE 1(oral, CNS, blood),STO-SE 2(oral, respiratory effects)' FROM DUAL;
Query 1:
SELECT ID,
l.COLUMN_VALUE AS item,
REGEXP_SUBSTR(
Kategory,
'(([^(]*?(\(.*?\))?)*)(,|$)',
1,
l.COLUMN_VALUE,
NULL,
1
) AS value
FROM table_name t
CROSS JOIN
TABLE(
CAST(
MULTISET(
SELECT LEVEL
FROM DUAL
CONNECT BY LEVEL < REGEXP_COUNT( t.Kategory, '(([^(]*?(\(.*?\))?)*)(,|$)' )
)
AS SYS.ODCINUMBERLIST
)
) l
Results (note that the CONNECT BY uses LEVEL < rather than LEVEL <= because the pattern also matches a zero-length string at the very end of the input, so REGEXP_COUNT returns one more than the number of items):
| ID | ITEM | VALUE |
|----|------|-------------------------------------|
| 1 | 1 | ATD 5(2830) |
| 1 | 2 | ATO 4(510) |
| 1 | 3 | EDI 1 |
| 1 | 4 | EH A1 |
| 1 | 5 | SCI 2 |
| 1 | 6 | SS 1 |
| 1 | 7 | STO-SE 1(oral, CNS, blood) |
| 1 | 8 | STO-SE 2(oral, respiratory effects) |

Extract timestamp from filename and add it in a new column (say, date) using Pig

I have a file named YYYYMMDD_claims_portal.csv; I need only the YYYYMMDD part, stored in a new column (say, date).
Currently there are 3 columns: Claim, User, ID. Now I need to add one more column, date, whose value is the YYYYMMDD taken from the file name.
You can use the input__file__name virtual column; the demo below uses Hive.
Demo
bash
[]$ mkdir mytable
[]$ cat>mytable/20170918_claims_portal.csv
1
2
[]$ cat>mytable/20170919_claims_portal.csv
3
[]$ cat>mytable/20170920_claims_portal.csv
4
5
6
hive
create external table mytable (i int) stored as textfile
;
select i
,regexp_extract(input__file__name,'(\\d{8})_claims_portal.csv',1) as dt
from mytable
;
+----+-----------+
| i | dt |
+----+-----------+
| 4 | 20170920 |
| 5 | 20170920 |
| 6 | 20170920 |
| 3 | 20170919 |
| 1 | 20170918 |
| 2 | 20170918 |
+----+-----------+
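Since the title asks for Pig specifically, here is a hedged sketch of the same idea, assuming Pig 0.10+ where PigStorage accepts the -tagFile option (which prepends the source file name to every tuple); the relation and field names are made up for illustration:
-- load with -tagFile so the file name becomes the first field of every tuple
claims = LOAD 'mytable' USING PigStorage(',', '-tagFile');
-- pull the leading YYYYMMDD out of the file name; $1 .. projects the remaining columns
with_date = FOREACH claims GENERATE
    REGEX_EXTRACT($0, '(\\d{8})_claims_portal.csv', 1) AS dt, $1 ..;
DUMP with_date;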

bash - extracting lines that contain only 3 columns

I have a file that includes the following lines:
2 | blah | blah
1 | blah | blah
3 | blah
2 | blah | blah
1
1 | high | five
3 | five
I want to extract only the lines that have 3 columns (3 fields, 2 separators...)
I want to pipe the result to the following commands:
| sort -nbsk1 | cut -d "|" -f1 | uniq -d
So after all I will get only :
2
1
Any suggestions?
It's part of a homework assignment; we are not allowed to use awk/sed and some other commands (grep/tr and what's written above can be used).
Thanks
Since you said grep is allowed:
grep -E '^([^|]*\|){2}[^|]*$' file
A simpler grep '.*|.*|.*' selects lines with at least two literal separators, i.e. at least three fields; the anchored -E pattern above matches exactly three.
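Putting it together with the pipeline from the question:
grep -E '^([^|]*\|){2}[^|]*$' file | sort -nbsk1 | cut -d "|" -f1 | uniq -d
On the sample input this prints the duplicated leading fields, 1 and 2.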
