Bash Script Write File Out Depending on Contents - bash

I have a file and it's in the form:
Thread1.Action1, wahhhhhh
Thread1.Action1, blahhhhhh
Thread1.Action2, wooooooo
Thread1.Action2, weeeeeee
Thread1.Action2, baaaaaaa
Thread2.Action1, mooooooo
Thread2.Action2, wooooooof
What I need to do is:
Write a file whose filename is the first bit before the comma. This file should then contain all the lines associated with it. E.g. there should be 4 files in this case: Thread1.Action1.out, Thread1.Action2.out, Thread2.Action1.out and Thread2.Action2.out.
For example, Thread1.Action2.out should contain:
Thread1.Action2, wooooooo
Thread1.Action2, weeeeeee
Thread1.Action2, baaaaaaa
Thread2.Action1.out should contain:
Thread2.Action1, mooooooo
etc..
Note: I want it to be agnostic to the name of the first column, i.e. I won't necessarily know what the data in the first column is before executing the script, but there will be groups of it...
I want to write a bash script that will be able to do this. I've tried doing bits of it in awk, but it's getting very messy.
Any help?

awk -F, '{print > $1".out"}' your_file
For your comment: execute the command below to create all the directories at the same time:
awk -F. '{print $1}' your_file | sort -u | xargs mkdir
Now execute this command:
awk -F, '{split($1,a,".");print >a[1]"/"$1".out"}' your_file
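If the first column has very many distinct values, awk implementations other than gawk can hit a "too many open files" limit with the one-liners above. A small variation on the same idea appends and closes each output as it goes (remove any stale .out files first, since >> appends):
awk -F, '{out = $1 ".out"; print >> out; close(out)}' your_file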

Related

Bash splitting a large gzipped file

I've recently been dealing with some file processing, and I am trying to write a bash one-liner that would look something like:
zcat largefile.gz | split_by_delimiter_into_separate_files
Things I tried:
zcat largefile.gz | awk '{print $0 " //"> "separate_file" NR}' RS='//'
The delimiter I am trying to split on is "//". I know something like Python could probably solve this in a couple of lines, but my project is not Python-based, so that is not an option.
Try it like this:
zcat largefile.gz | awk -vRS='//' '{print $0 " //"> "separate_file" NR}'
You can use split which I believe does exactly what you need:
zcat largefile.gz | split -p '//' - separate_file_
This will create files prefixed with separate_file_ containing the content of the large file, split on //.
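Note that -p is a BSD split extension; GNU coreutils split does not have it. On GNU systems, csplit can do something similar. A sketch, with the caveat that csplit splits at whole lines matching the pattern rather than at an inline //:
zcat largefile.gz | csplit -z -f separate_file_ - '/\/\//' '{*}'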

remove first text with shell script

Could someone please help me with this bash script?
Let's say I have lots of files with URLs like the ones below:
https://example.com/x/c-ark4TxjU8/mybook.zip
https://example.com/x/y9kZvVp1k_Q/myfilename.zip
My question is: how do I remove all the other text and leave only the file name?
I've tried the command described in this URL: How to delete first two lines and last four lines from a text file with bash?
But since the text is random, there are no exact numbers to work with, so that code is not working.
You can use the sed utility to parse out just the filenames:
sed 's_.*\/__'
You can use awk:
The easiest way, I find, is:
awk -F/ '{print $NF}' file.txt
or
awk -F/ '{print $6}' file.txt
You can also use sed:
sed 's;.*/;;' file.txt
You can use cut:
cut -d'/' -f6 file.txt
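If you are already looping over the URLs in bash, parameter expansion does the same job without an external command. A minimal sketch, assuming the URLs are in file.txt:
while IFS= read -r url; do
    echo "${url##*/}"    # strip everything up to and including the last /
done < file.txt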

Find string from a file to another file in shell script

I am new to shell scripting. I just want to know how I can obtain the result I want with the following:
I have two files (FILE_A and FILE_B)
FILE_A contains:
09228606355,71295939,1,http://sun.net.ph/043xafj.xml,01000001C123000D30
09228505450,71295857,1,http://sun.net.ph/004xafk.xml,01000001C123000D30
FILE_B contains:
http://sun.net.ph/161ybfq.xml ,9220002354016,93111
http://sun.net.ph/004xafk.xml ,9220002354074,93111
If the URL (4th field) in FILE_A is present in FILE_B, the output will be:
09228505450,71295857,1,http://sun.net.ph/004xafk.xml,01000001C123000D30,9220002354074,93111
It should display the whole line from FILE_A with the 2nd and 3rd fields of FILE_B appended.
I hope my question is clear. Thank you.
This might work for you (GNU sed):
sed -r 's/^\s*(\S+)\s*,(.*)/\\#^([^,]*,){3}\1#s#$#,\2#p/' fileB | sed -nrf - fileA
This builds a sed script from fileB and runs it against fileA. The second sed invocation is run in silent mode, and only those lines that match the generated script are printed out.
Try this:
paste -d , A B | awk -F , '{if ($4==$6) print "match", $1,$2,$3,$4,$5,$7,$8;}'
I removed the spaces in your file B for the $4==$6 to work.
I use paste with , as the delimiter to create a composite comma-separated line from both files. I then use an awk comparison to check the URLs from both files, and if a match is found I print all the fields you care about.
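If the two files are not guaranteed to have matching lines in the same order, a lookup-based approach may be safer. A sketch, assuming the URL is the 4th comma-separated field of FILE_A and the 1st field (with a trailing space) of FILE_B:
awk -F, 'NR==FNR {gsub(/ /, "", $1); extra[$1] = "," $2 "," $3; next}   # remember FILE_B fields 2-3 keyed by URL
         ($4 in extra) {print $0 extra[$4]}' FILE_B FILE_A              # append them to matching FILE_A lines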

How do I write an awk print command in a loop?

I would like to write a loop creating various output files with the first column of each input file, respectively.
So I wrote
for i in $(\ls -d /home/*paired.isoforms.results)
do
awk -F"\t" {print $1}' $i > $i.transcript_ids.txt
done
As an example, if there were 5 files in the home directory named
A_paired.isoforms.results
B_paired.isoforms.results
C_paired.isoforms.results
D_paired.isoforms.results
E_paired.isoforms.results
I would like to print the first column of each of these files into a separate output file, i.e. I would like to have 5 output files called
A.transcript_ids.txt
B.transcript_ids.txt
C.transcript_ids.txt
D.transcript_ids.txt
E.transcript_ids.txt
or any other names, as long as there are 5 different names and I can still link them back to the original files.
I understand that there is a problem with the double usage of $ in both the awk and the loop command, but I don't know how to change that.
Is it possible to write a command like this in a loop?
This should do the job:
for file in /home/*paired.isoforms.results
do
base=${file##*/}
base=${base%%_*}
awk -F"\t" '{print $1}' $file > $base.transcript_ids.txt
done
I assume that there can be spaces in the first field since you set the delimiter explicitly to tab. This runs awk once per file. There are ways to do it running awk once for all files, but I'm not convinced the benefit is significant. You could consider using cut instead of awk '{print $1}', too. Note that using ls as you did is less satisfactory than using globbing directly; it runs foul of file names with oddball characters (spaces, tabs, etc) in the name.
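For reference, a cut-based drop-in for the awk line inside that loop might look like this (tab is cut's default field delimiter):
cut -f1 "$file" > "$base.transcript_ids.txt"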
You can do that entirely in awk:
awk -F"\t" '{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; print $1 > out}' *_paired.isoforms.results
If your input files don't have names as indicated in the question, you'd have to split on something else (as well as use a different pattern match for the input files).
My original answer is actually doing extra name resolution every time something is printed. Here's a version that only updates the output filename when FILENAME changes:
awk -F"\t" 'FILENAME!=lf{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; lf=FILENAME} {print $1 > out}' *_paired.isoforms.results

awk execute same command on different files one by one

Hi, I have 30 txt files in a directory, each containing 4 columns.
How can I execute the same command on each file, one by one, and direct the output to a different file?
The command I am using is below, but it is being applied to all the files and producing a single output. All I want is to process each file one by one and direct each output to a new file.
start=$1
patterns=''
for i in $(seq -43 -14); do
patterns="$patterns /cygdrive/c/test/kpi/SIGTRAN_Load_$(exec date '+%Y%m%d' --date="-${i} days ${start}")*"; done
cat /cygdrive/c/test/kpi/*$patterns | sed -e "s/\t/,/g" -e "s/ /,/g"| awk -F, 'a[$3]<$4{a[$3]=$4} END {for (i in a){print i FS a[i]}}'| sed -e "s/ /0/g"| sort -t, -k1,2> /cygdrive/c/test/kpi/SIGTRAN_Load.csv
Something like this
for fileName in /path/to/files/foo*.txt
do
mangleFile "$fileName"
done
will mangle a list of files you give via globbing. If you want to generate the file name patterns as in your example, you can do it like this:
for i in $(seq -43 -14)
do
for fileName in /cygdrive/c/test/kpi/SIGTRAN_Load_"$(exec date '+%Y%m%d' --date="-${i} days ${start}")"*
do
mangleFile "$fileName"
done
done
This way the code stays much more readable, even if shorter solutions may exist.
mangleFile will then of course be the awk call, or whatever else you would like to do with each file.
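For example, mangleFile could be a shell function that runs the pipeline from the question on a single file and writes a per-file CSV; a sketch (the .csv output name is just an illustration):
mangleFile () {
    # turn tabs and spaces into commas, keep the maximum $4 per $3, then sort
    sed -e "s/\t/,/g" -e "s/ /,/g" "$1" |
        awk -F, 'a[$3]<$4{a[$3]=$4} END {for (i in a) print i FS a[i]}' |
        sed -e "s/ /0/g" | sort -t, -k1,2 > "$1.csv"
}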
Use the following idiom:
for file in *
do
./your_shell_script_containing_the_above.sh $file > some_unique_id
done
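As an illustration (the .out suffix is just a hypothetical choice), the unique id could simply be derived from the input file name:
for file in *
do
    ./your_shell_script_containing_the_above.sh "$file" > "$file.out"   # one output per input file
done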
You need to run a loop on all the matching files:
for i in /cygdrive/c/test/kpi/*$patterns; do
tr '[:space:]\n' ',\n' < "$i" | awk -F, 'a[$3]<$4{a[$3]=$4} END {for (i in a){print i FS a[i]}}'| sed -e "s/ /0/g"| sort -t, -k1,2 > "/cygdrive/c/test/kpi/SIGTRAN_Load-${i##*/}.csv"
done
PS: I haven't tried to refactor your piped commands much; they can probably be shortened too.
