Remove space between 2 columns and insert commas - bash

I am using:
cut -f1-2 input.txt|sed 1d
The data is outputting like this:
/mnt/Hector/Data/benign/binary/benign-pete/ fd0977d5855d1295bd57383b17981a09
/mnt/Hector/Data/benign/binary/benign-pete/ fd34c32786aadab513f506c30c2cba33
/mnt/Hector/Data/benign/binary/benign-pete/ fe7d03512e0731e40be628524efbf317
I am trying to get it to output without the space, with a comma between the file path and the MD5 checksum so Excel can separate it properly:
/mnt/Hector/Data/benign/binary/benign-pete/,fd0977d5855d1295bd57383b17981a09
/mnt/Hector/Data/benign/binary/benign-pete/,fd34c32786aadab513f506c30c2cba33
/mnt/Hector/Data/benign/binary/benign-pete/,fe7d03512e0731e40be628524efbf317

I didn't see your input.txt, but try this one-liner; it does the whole job in one shot:
awk -v OFS="," 'NR>1{print $1,$2}' input.txt

This can do it:
$ tr -s " " < your_file | sed 's/ /,/g'
/mnt/Hector/Data/benign/binary/benign-pete/,fd0977d5855d1295bd57383b17981a09
/mnt/Hector/Data/benign/binary/benign-pete/,fd34c32786aadab513f506c30c2cba33
/mnt/Hector/Data/benign/binary/benign-pete/,fe7d03512e0731e40be628524efbf317
tr -s " " < your_file removes extra spaces. sed 's/ /,/g' replaces spaces with commas.

Related

Getting last X fields from a specific line in a CSV file using bash

I'm trying to get, as a bash variable, the list of users that are in my CSV file. The problem is that the number of users is random and can be from 1-5.
Example CSV file:
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
I would like to get something like
list_of_users="cat file.csv | grep "record2_data2" | <something> "
echo $list_of_users
user1,user2,user3,user4
I'm trying this:
cat file.csv | grep "record2_data2" | awk -F, -v OFS=',' '{print $4,$5,$6,$7,$8 }' | sed 's/"//g'
My result is:
user2,user3,user4,,
Question:
How to remove all "," from the end of my result? Sometimes it is just one but sometimes can be user1,,,,
Can I do it in better way? Users always starts after 3rd column in my file.
This will do what your code seems to be trying to do (print the users for a given string record2_data2 which only exists in the 2nd field):
$ awk -F',' '{gsub(/"/,"")} $2=="record2_data2"{sub(/([^,]*,){3}/,""); print}' file.csv
user1,user2,user3,user4
but I don't see how that's related to your question subject of Getting last X fields from a specific line in a CSV file using bash, so I don't know if it's what you really want or not.
Better to use a bash array, and join it into a CSV string when needed:
#!/usr/bin/env bash
readarray -t listofusers < <(cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u)
IFS=,
printf "%s\n" "${listofusers[*]}"
cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u is the important bit - it first only prints out the fourth and following fields of the CSV input file, removes quotes, turns commas into newlines, and then sorts the resulting usernames, removing duplicates. That output is then read into an array with the readarray builtin, and you can manipulate it and the individual elements however you need.
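With the sample file.csv above, the de-duplicated user list printed at the end should be:
user1,user2,user3,user4
Setting IFS=, makes "${listofusers[*]}" expand to the array elements joined by commas.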
GNU sed solution, let file.csv content be
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
then
sed -n -e 's/"//g' -e '/record2_data/ s/[^,]*,[^,]*,[^,]*,// p' file.csv
gives output
user1,user2,user3,user4
Explanation: -n turns off automatic printing. The expressions work as follows: the 1st substitutes " globally with an empty string, i.e. deletes them; the 2nd, for lines containing record2_data, substitutes (s) everything up to and including the 3rd , with an empty string, i.e. deletes it, and prints (p) the changed line.
(tested in GNU sed 4.2.2)
awk -F',' '
/record2_data2/{
for(i=4;i<=NF;i++) o=sprintf("%s%s,",o,$i);
gsub(/"|,$/,"",o);
print o
}' file.csv
user1,user2,user3,user4
This might work for you (GNU sed):
sed -E '/record2_data/!d;s/"([^"]*)"(,)?/\1\2/4g;s///g' file
Delete all records except for that containing record2_data.
Remove double quotes from the fourth field onward.
Remove any double quoted fields.
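Applied to the sample file.csv, this should output:
user1,user2,user3,user4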

Linux get data from each line of file

I have a file with many (~2k) lines similar to:
117 VALID|AUTHEN tcp:10.92.163.5:64127 uniqueID=nwCelerra
....
991 VALID|AUTHEN tcp:10.19.16.21:58332 uniqueID=smUNIX
I want only the IP address (10.19.16.21 shown above) and the value of the uniqueID (smUNIX shown above)
I am able to get close with:
cat t.txt|cut -f2- -d':'
10.22.36.69:46474 uniqueID=smwUNIX
...
I am on Linux using bash.
Using awk:
awk '{split($3,a,":"); split($4,b,"="); print a[2] " " b[2]}'
By default it splits on whitespace; with some extra code you can split the subfields.
Update:
even easier overriding the default delimiter:
awk -F '[:=]' '{print $2 " "$4}'
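With the two sample lines, either variant should print just the IP address and the uniqueID value:
10.92.163.5 nwCelerra
10.19.16.21 smUNIX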
using grep and sed :
grep -oP "^\d+ [A-Z]+\|[A-Z]+ \w+:\K(.*)" | sed "s/ uniqueID=/ /g"
outputs:
10.92.163.5:64127 nwCelerra
10.19.16.21:58332 smUNIX

Bash: concatenate lines in csv file (1+2, 3+4 etc)

I have a csv file with increasing integers in the first column and some text after them.
1,text1a,text1b
2,text2a,text2b
3,text3a,text3b
4,text4a,text4b
...
I would like to join lines 1+2, 3+4 etc. and write the outcome to a new csv file.
The desired output would be
1,text1a,text1b,2,text2a,text2b
3,text3a,text3b,4,text4a,text4b
...
A second option without the numbers would be great as well. The actual input would be
1,text,text,,,text#text.com,2,text.text,text
2,text,text,,,text#text.com,3,text.text,text
3,text,text,,,text#text.com,2,text.text,text
4,text,text,,,text#text.com,3,text.text,text
Desired outcome
text,text,,,text#text.com,2,text.text,text,text,text,,,text#text.com,3,text.text,text
text,text,,,text#text.com,2,text.text,text,text,text,,,text#text.com,3,text.text,text
$ pr -2ats, file
gives you
1,text1a,text1b,2,text2a,text2b
3,text3a,text3b,4,text4a,text4b
UPDATE
for the second part
$ cut -d, -f2- file | pr -2ats,
will give you
text,text,,,text#text.com,2,text.text,text,text,text,,,text#text.com,3,text.text,text
text,text,,,text#text.com,2,text.text,text,text,text,,,text#text.com,3,text.text,text
awk solution:
awk '{ printf "%s%s",$0,(!(NR%2)? ORS:",") }' input.csv > output.csv
The output.csv content:
1,text1a,text1b,2,text2a,text2b
3,text3a,text3b,4,text4a,text4b
----------
Additional approach (to skip numbers):
awk -F',' '{ printf "%s%s",$2 FS $3,(!(NR%2)? ORS:FS) }' input.csv > output.csv
The output.csv content:
text1a,text1b,text2a,text2b
text3a,text3b,text4a,text4b
3rd approach (for your extended input):
awk -F',' '{ sub(/^[0-9]+,/,"",$0); printf "%s%s",$0,(!(NR%2)? ORS:FS) }' input.csv > output.csv
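On the extended input, this strips the leading counter and joins each pair of lines, so it should give:
text,text,,,text#text.com,2,text.text,text,text,text,,,text#text.com,3,text.text,text
text,text,,,text#text.com,2,text.text,text,text,text,,,text#text.com,3,text.text,text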
With bash, cut, sed and paste:
paste -d, <(cut -d, -f 2- file | sed '2~2d') <(cut -d, -f 2- file | sed '1~2d')
Output:
text1a,text1b,text2a,text2b
text3a,text3b,text4a,text4b
I hoped to get started with something as simple as
printf '%s,%s\n' $(<inputfile)
This turns out wrong when you have spaces inside your text fields.
The improvement is rather a mess:
source <(echo "printf '%s,%s\n' $(sed 's/.*/"&"/' inputfile|tr '\n' ' ')")
Skipping the first field can be done in the same sed command:
source <(echo "printf '%s,%s\n' $(sed -r 's/([^,]*),(.*)/"\2"/' inputfile|tr '\n' ' ')")
EDIT:
This solution will fail when the input has special characters, so you should use a simpler solution such as
cut -d, -f2- file | paste -d, - -
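Here cut -d, -f2- drops the leading number and paste -d, - - joins every two input lines with a comma, so on the first sample input it should give:
text1a,text1b,text2a,text2b
text3a,text3b,text4a,text4b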

Find unique words

Suppose there is a file.txt in which the following text is written:
ABC/xyz
ABC/xyz/rst
EFG/ghi
I need to write a shell script that can extract the unique words before the first /.
So as output, I want ABC and EFG to be written in one file.
You can extract the first word with cut (slash as delimiter), then pipe to sort with the -u (for "unique") option:
$ cut -d '/' -f 1 file.txt | sort -u
ABC
EFG
To get the output into a file, just redirect by appending > filename to the command. (Or pipe to tee filename to see the output and get it in a file.)
Try this:
cat file.txt | tr -s "/" ' ' | awk -F " " '{print $1}' | sort | uniq > outfile.txt
Another interesting variation:
awk -F'/' '{print $1 |" sort -u" }' file.txt > outfile.txt
Not that it matters here, but being able to pipe and redirect within awk can be very handy.
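With the sample file.txt, outfile.txt should again end up containing:
ABC
EFG
Inside awk, the print output is piped to the external sort -u command, whose stdout still goes to the redirected outfile.txt.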
Another easy way:
cut -d"/" -f1 file.txt|uniq > out.txt
You can use a mix of cut and sort like so:
cut -d '/' -f 1 file.txt | sort -u > newfile.txt
The cut part grabs any string up to the first slash /; the sort -u part sorts the text, removing any duplicate strings; and the redirection writes the result into newfile.txt.

Finding a pattern in a file

I have a txt file of 500 rows and one column.
The column in each row appears somewhat like this (as an example I am pasting two rows):
chr22:49367820-49368570_NR_021492_LOC100144603,chr22:49368010-49368760_NM_005198_CHKB,chr22:49368010-49368760_NM_152247_CPT1B,chr22:49368010-49368760_NM_152253_CHKB
chr22:49367820-49368570_NR_021492_LOC100144603,chr22:49368010-49368760_NM_005198_CHKB
What I want to extract from each row are the values starting with NM_ or NR_,
like
row 1 has NR_021492 NM_005198 NM_152247 NM_152253
row 2 has NR_021492 NM_005198
...
in a tab-delimited file.
Any suggestions for a bash command line?
Try:
sed -r -e 's/chr[0-9]+:[^_]*_(N[RM])_([0-9]+)_[^,_]+([, ]|$)/\1_\2'$'\t''/g;s/'$'\t''$//g'
Presuming GNU sed.
So
sed -r -e 's/chr[0-9]+:[^_]*_(N[RM])_([0-9]+)_[^,_]+([, ]|$)/\1_\2'$'\t''/g;s/'$'\t''$//g' your_file > tab_delimited_file
EDIT: Updated to not leave a trailing tab character on each row.
EDIT 2: Updated again to work for any chr-then-number sequence.
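For the two sample rows, the output should be one tab-separated line of IDs per row:
NR_021492	NM_005198	NM_152247	NM_152253
NR_021492	NM_005198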
grep "NM" yourfiname | cut -d_ -f3 | sed 's/[/\d]*/NM_/'
grep "NR" yourfiname | cut -d_ -f3 | sed 's/[/\d]*/NR_/'
cat file | sed 's/.*\(NR\)/\1/'
Use a regular expression to remove everything before the NR
awk -F '[,:_-]' '{
for (i=1; i<NF; i++)
if ($i == "NR" || $i == "NM")
printf("%s_%s ", $i, $(i+1))
print ""
}'
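Fed the two sample rows on stdin, this should print one space-separated line per row:
NR_021492 NM_005198 NM_152247 NM_152253
NR_021492 NM_005198
Change the space in the printf format string to \t if you need tab-delimited output.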
This will also work, but will print each match on its own line: egrep -o 'N[RM]_[0-9]+'
