OK, I'm an absolute noob at this (I only started trying to code a few weeks ago for my job), so please go easy on me.
I'm on an AIX system.
I have file1, file2 and file3, and they all contain a single column of data (text or numeric).
file1
VBDSBQ_KFGP_SAPECC_PRGX_ACCNT_WKLY
VBDSBQ_KFGP_SAPECC_PRGX_ADDRM_WKLY
VBDSBQ_KFGP_SAPECC_PRGX_COND_WKLY
VBDSBQ_KFGP_SAPECC_PRGX_CUSTM_WKLY
VBDSBQ_KFGP_SAPECC_PRGX_EPOS_DLY
VBDSBQ_KFGP_SAPECC_PRGX_INVV_WKLY
file2
MCMILS03
HGAHJK05
KARNEK93
MORROT32
LAWFOK12
LEMORK82
file3
8970597895
0923875
89760684
37960473
526238495
146407
There will be exactly the same number of lines in each of these files.
I have another file called "dummy_file", which is what I want to pull out, replace parts of, and pop into a new file.
WORKSTATION#JOB_NAME
SCRIPTNAME "^TWSSCRIPTS^SCRIPT"
STREAMLOGON "^TWSUSER^"
-job JOB_NAME -user USER_ID -i JOB_ID
RECOVERY STOP
There are only 3 strings I care about in this file that I want replaced, and they will always be the same for the dummy files I use in future:
JOB_NAME
JOB_ID
USER_ID
There are 2 entries for JOB_NAME and only 1 each for the others. What I want is to take the raw file, replace both JOB_NAME entries with line 1 from file1, replace USER_ID with line 1 from file2, replace JOB_ID with line 1 from file3, and then throw the result into a new file.
I want to repeat the process for all the lines in file1, file2 and file3, so the next block will have its entries replaced by line 2 from the 3 files, the one after that by line 3, and so on.
The raw file and the expected output are below:
WORKSTATION#JOB_NAME
SCRIPTNAME "^TWSSCRIPTS^SCRIPT"
STREAMLOGON "^TWSUSER^"
-job JOB_NAME -user USER_ID -i JOB_ID
RECOVERY STOP
WORKSTATION#VBDSBQ_KFGP_SAPECC_PRGX_ACCNT_WKLY
SCRIPTNAME "^TWSSCRIPTS^SCRIPT"
STREAMLOGON "^TWSUSER^"
-job VBDSBQ_KFGP_SAPECC_PRGX_ACCNT_WKLY -user MCMILS03 -i 8970597895
RECOVERY STOP
This is as far as I got (again, I know it's crap):
file="/dir/dir/dir/file1"
while IFS= read -r line
do
cat dummy_file | sed "s/JOB_NAME/$file1/" | sed "s/JOB_ID/$file2/" | sed "s/USER_ID/$file3" #####this is where i get stuck as i dont know how to reference file2 and file3##### >>new_file.txt
done
You really don't want a do/while loop in the shell. Just do:
awk '/^WORKSTATION/{
getline jobname < "file1";
getline user_id < "file2";
getline job_id < "file3"
}
{
gsub("JOB_NAME", jobname);
gsub("USER_ID", user_id);
gsub("JOB_ID", job_id)
}1' dummy_file
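To capture the result in a new file, you can redirect the output; for instance, if you save the program above (without the surrounding single quotes) as replace.awk (an illustrative name, not part of the original answer):
awk -f replace.awk dummy_file > new_file.txt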
This might work for you (GNU parallel and sed):
parallel -q sed 's/JOB_NAME/{1}/;s/USER_ID/{2}/;s/JOB_ID/{3}/' templateFile >newFile :::: file1 ::::+ file2 ::::+ file3
This builds newFile by appending a filled-in copy of templateFile for each set of lines taken jointly from file1, file2 and file3.
N.B. the ::::+ operator links the lines of file1, file2 and file3 one-to-one (first with first, second with second, and so on) rather than generating the default Cartesian product.
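If the linking is unclear, here is a small illustration of my own (not from the answer) using ::: / :::+, which take values directly from the command line instead of from files; -k keeps the output in job order:
parallel -k echo {1}-{2} ::: a b ::: 1 2     # product: a-1 a-2 b-1 b-2
parallel -k echo {1}-{2} ::: a b :::+ 1 2    # linked:  a-1 b-2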
Using GNU awk (ARGIND and 2d arrays):
$ gawk '
NR==FNR { # store the template file
t=t (t==""?"":ORS) $0 # to t var
next
}
{
a[FNR][ARGIND]=$0 # store filen records to 2d array
}
END { # in the end
for(i=1;i<=FNR;i++) { # for each record stored from filen
t_out=t # make a working copy of the template
gsub(/JOB_NAME/,a[i][2],t_out) # replace with data
gsub(/USER_ID/,a[i][3],t_out)
gsub(/JOB_ID/,a[i][4],t_out)
print t_out # output
}
}' template file1 file2 file3
Output:
WORKSTATION#VBDSBQ_KFGP_SAPECC_PRGX_ACCNT_WKLY
SCRIPTNAME "^TWSSCRIPTS^SCRIPT"
STREAMLOGON "^TWSUSER^"
-job VBDSBQ_KFGP_SAPECC_PRGX_ACCNT_WKLY -user MCMILS03 -i 8970597895
RECOVERY STOP
...
Bash variant
#!/bin/bash
exec 5<file1 # create file descriptor for the file with job names
exec 6<file2 # create file descriptor for the file with user ids
exec 7<file3 # create file descriptor for the file with job ids
dummy=$(cat dummy_file) # load dummy text
output () { # create output by inserting new values in a copy of dummy var
out=${dummy//JOB_NAME/$JOB_NAME}
out=${out//USER_ID/$USER_ID}
out=${out//JOB_ID/$JOB_ID}
printf "\n$out\n"
}
while read -u5 JOB_NAME; do # this will read from all files and print output
read -u6 USER_ID
read -u7 JOB_ID
output
done
From help read:
$ help read
...
-u fd read from file descriptor FD instead of the standard input
...
And a variant with paste
#!/bin/bash
dummy=$(cat dummy_file)
while read JOB_NAME USER_ID JOB_ID; do
out=${dummy//JOB_NAME/$JOB_NAME}
out=${out//USER_ID/$USER_ID}
out=${out//JOB_ID/$JOB_ID}
printf "\n$out\n"
done < <(paste file1 file2 file3)
Related
I'm looking to edit my 1.txt file: find a word and replace it with the corresponding word in 2.txt, and also add the rest of the string from file 2.
I'm interested in maintaining the order of my 1.txt file.
>title1
ID1 .... rest of string im not interested
>title2
ID2 .... rest of string im not interested
>title3
ID3 .... rest of string im not interested
>title....
But I want to add the information from my file 2:
>ID1 text i want to extract
>ID2 text i want to extract
>ID3 text i want to extract
>IDs....
In the end I'm looking to create a new file with this structure:
>title1
ID1 .... text I want
>title2
ID2 .... text I want
>title3
ID3 .... text I want
>title....
I have tried several sed commands, but most of them don't replace the ID# exactly with the one that is in the two files. Hopefully it can be done in bash.
Thanks for your help.
Failed attempts...
My commands are:
File 1 = cog_anotations.txt, File 2 = Real.cog.txt
ID = COG05764, COG015668, etc.
sed -e '/COG/{r Real.cog.txt' -e 'd}' cog_anotations.txt
sed "s/^.*COG.*$/$(cat Real.cog.txt)/" cog_anotations.txt
sed -e '/\$COG\$/{r Real.cog.txt' -e 'd}' cog_anotations.txt
grep -F -f cog_anotations.txt Real.cog.txt > newfile.txt
grep -F -f Real.cog.txt cog_anotations.txt > newfile.txt
file.awk :
BEGIN { RS=">" }
{
if (FILENAME == "1.txt") {
a[$2]=$1; b[$2]=$2;
}
else {
if ($1 == b[$1]) {
if ($1 !="") { printf(">%s\n%s",a[$1],$0) }
}
}
}
call:
gawk -f file.awk 1.txt 2.txt
The order of files is important.
result:
>title1
ID1 text i want to extract
>title2
ID2 text i want to extract
>title3
ID3 text i want to extract
explanation:
The first file is divided into records at each ">" and two associative arrays are created (a[ID]=title, b[ID]=ID). Only the else branch is executed for the second file. There we check whether field 1 of the second file is in array b, and if so format and print the corresponding lines.
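A tiny illustration of my own (not part of the answer) of how RS=">" splits 1.txt into records and fields:
printf '>title1\nID1 rest of string\n' |
gawk 'BEGIN{RS=">"} NF{print "first field:", $1, "| second field:", $2}'
# first field: title1 | second field: ID1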
DO NOT write some nested grep.
A simplistic one-pass-each logic with a lookup table:
declare -A lookup
while read -r key txt
do lookup["${key#>}"]="$txt"        # strip the leading ">" so the key matches the IDs in 1.txt
done < 2.txt
while read -r key txt
do if [[ -n ${lookup[$key]} ]]
   then echo "$key ${lookup[$key]}" # ID line: keep the ID, swap in the text from 2.txt
   else echo "$key${txt:+ $txt}"    # title line (or unknown ID): pass through unchanged
   fi
done < 1.txt
I have a large file with 2000 hostnames and I want to create multiple files with 25 hosts each, separated by commas, with no trailing comma.
Large.txt:
host1
host2
host3
.
.
host10000
The split command below creates multiple files (file00, file01, and so on); however, the hosts are not comma-separated, so it's not the expected output.
split -d -l 25 large.txt file
The expected output is:
host1,host2,host3
You'll need to perform 2 separate operations ... 1) split the file and 2) reformat the files generated by split.
The first step is already done:
split -d -l 25 large.txt file
For the second step let's work with the results that are dumped into the first file by the basic split command:
$ cat file00
host1
host2
host3
...
host25
We want to pull these lines into a single line using a comma (,) as delimiter. For this example I'll use an awk solution:
$ cat file00 | awk '{ printf "%s%s", sep, $0 ; sep="," } END { print "" }'
host1,host2,host3...,host25
Where:
sep is initially undefined (aka empty string)
on each successive line processed by awk we set sep to a comma
the printf doesn't include a linefeed (\n) so each successive printf will append to the 'first' line of output
we END the script by printing a linefeed to the end of the file
It just so happens that split has an option to call a secondary script/code-snippet to allow for custom formatting of the output (generated by split); the option is --filter. A few issues to keep in mind:
the initial output from split is (effectively) piped as input to the command listed in the --filter option
it is necessary to escape (with a backslash) certain characters in the command (e.g., double quotes, dollar signs) to keep them from being interpreted by the invoking shell; split simply passes the filter string on to a sub-shell
the --filter option automatically has access to the current split outfile name using the $FILE variable
Pulling everything together gives us:
$ split -d -l 25 --filter="awk '{ printf \"%s%s\", sep, \$0 ; sep=\",\" } END { print \"\" }' > \$FILE" large.txt file
$ cat file00
host1,host2,host3...,host25
Using the --filter option on GNU split:
split -d -l 25 --filter="(perl -ne 'chomp; print \",\" if \$i++; print'; echo) > \$FILE" large.txt file
You can use the bash code snippet below.
INPUT FILE
~$ cat domainlist.txt
domain1.com
domain2.com
domain3.com
domain4.com
domain5.com
domain6.com
domain7.com
domain8.com
Script
#!/usr/bin/env bash
FILE_NAME=domainlist.txt
LIMIT=4
OUTPUT_PREFIX=domain_
CMD="csplit ${FILE_NAME} ${LIMIT} {1} -f ${OUTPUT_PREFIX}"
eval ${CMD}
#=====#
for file in ${OUTPUT_PREFIX}*; do
echo $file
sed -i ':a;N;$!ba;s/\n/,/g' $file
done
OUTPUT
./mysplit.sh
36
48
12
domain_00
domain_01
domain_02
~$ cat domain_00
domain1.com,domain2.com,domain3.com
Change LIMIT, the OUTPUT_PREFIX file name prefix and the input file as per your requirements.
Using awk:
awk '
BEGIN { PREFIX = "file"; n = 0; }
{ hosts = hosts sep $0; sep = ","; }
function flush() { print hosts > (PREFIX n++); hosts = ""; sep = ""; }
NR % 25 == 0 { flush(); }
END { if (hosts != "") flush(); }
' large.txt
edit: improved comma separation handling stealing from markp-fuso's excellent answer :)
I have several files to compare. The con and ref files contain lists of paths to the .txt files that should be compared, and the output should contain the variable name of con_vs_ref_1.txt.
con:
/home/POP_xpclr/A.txt
/home/POP_xpclr/B.txt
ref:
/home/POP_xpclr/C.txt
/home/POP_xpclr/D.txt
#!/usr/bin/env bash
XPCLR="/home/Tools/XPCLR/bin/XPCLR"
CON="/home/POP_xpclr/con"
REF="/home/POP_xpclr/ref"
MAPS="/home/POP_xpclr/1"
OUTDIR="/home/POP_xpclr/Results"
$XPCLR -xpclr $CON $REF $MAPS $OUTDIR -w1 0.5 200 1000000 $MAPS -p1 0.95
Comments in code.
# create an MCVE, ie. input files:
cat <<EOF >con
/home/POP_xpclr/A.txt
/home/POP_xpclr/B.txt
EOF
cat <<EOF >ref
/home/POP_xpclr/C.txt
/home/POP_xpclr/D.txt
EOF
# join streams
paste <(
# repeat ref file times con file has lines
seq $(<con wc -l) |
xargs -i cat ref
) <(
# repeat each line from con file times ref file has lines
# from https://askubuntu.com/questions/594554/repeat-each-line-in-a-text-n-times
awk -v max=$(<ref wc -l) '{for (i = 0; i < max; i++) print $0}' con
) |
# ok, we have all combinations of lines
# now read them field by field and do whatever we want
while read -r file1 file2; do
# run the compare function
cmp "$file1" "$file2"
# probably you want something along:
"$XPCLR" -xpclr "$file1" "$file2" "$MAPS" "$OUTDIR" -w1 0.5 200 1000000 "$MAPS" -p1 0.95
done
Looping over the file paths in your con and ref files is pretty easy in bash.
As for "the output should contain the variable name of con_vs_ref_1.txt", you haven't explained what you want very well, but I'll guess that you want the file created to be named according to that formula and inside the output directory. Something like /home/POP_xpclr/Results/A_vs_C_1.txt.
#!/usr/bin/env bash
XPCLR="/home/Tools/XPCLR/bin/XPCLR"
CON="/home/POP_xpclr/con"
REF="/home/POP_xpclr/ref"
MAPS="/home/POP_xpclr/1"
OUTDIR="/home/POP_xpclr/Results"
for FILE1 in $(cat $CON)
do
for FILE2 in $(cat $REF)
do
OUTFILE="$OUTDIR/$(basename ${FILE1%.txt})_vs_$(basename ${FILE2%.txt})_1.txt"
$XPCLR -xpclr $FILE1 $FILE2 $MAPS $OUTFILE -w1 0.5 200 1000000 $MAPS -p1 0.95
done
done
What's this doing...
$(cat $CON) creates a subshell and runs cat to read your CON file, inserting the output (i.e. all the file paths) into the script at that point
for FILE1 in $(cat $CON) creates a loop where all the values read from your CON file are iterated across and assigned to the FILE1 variable one at a time.
for FILE2 in $(cat $REF) as above but with the REF file.
${FILE1%.txt} inserts the value of the FILE1 variable with the ".txt" extension removed from the end. This is called parameter expansion (see the short example after this list).
$(basename ${FILE1%.txt}) makes a subshell as before, basename strips the path of all the leading directories and returns just the filename, which we have already stripped of the ".txt" extension with the parameter expansion.
OUTFILE="$OUTDIR/$(basename ${FILE1%.txt})_vs_$(basename ${FILE2%.txt})_1.txt" combines the above two dot points to create your new file path based on your formula.
do and done are parts of the for loop construct that I hope are pretty self explanatory.
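Here is a tiny illustration of those two expansions with a made-up value:
FILE1=/home/POP_xpclr/A.txt
echo "${FILE1%.txt}"                # /home/POP_xpclr/A  (parameter expansion strips the suffix)
echo "$(basename "${FILE1%.txt}")"  # A                  (basename then strips the directories)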
My question is similar to How to sort files in paste command?, which has been solved.
I have 500 csv files (daily rainfall data) in a folder with naming convention chirps_yyyymmdd.csv. Each file has only 1 column (rainfall value) with 100,000 rows, and no header. I want to merge all the csv files into a single csv in chronological order.
When I tried the script ls -v file_*.csv | xargs paste -d, with only 100 csv files, it worked. But when I tried it with 500 csv files, I got this error: paste: chirps_19890911.csv: Too many open files
How do I handle the above error?
As a quick fix I could divide the csvs into two folders and run the above script on each, but the problem is that I have 100 folders with 500 csvs in each folder.
Thanks
Sample data and expected result: https://www.dropbox.com/s/ndofxuunc1sm292/data.zip?dl=0
You can do it with gawk like this...
Simply read all the files in, one after the other and save them into an array. The array is indexed by two numbers, firstly the line number in the current file (FNR) and secondly the column, which I increment each time we encounter a new file in the BEGINFILE block.
Then, at the end, print out the entire array:
gawk 'BEGINFILE{ ++col } # New file, increment column number
{ X[FNR,col]=$0; rows=FNR } # Save datum into array X, indexed by current record number and col
END { for(r=1;r<=rows;r++){
comma=","
for(c=1;c<=col;c++){
if(c==col)comma=""
printf("%s%s",X[r,c],comma)
}
printf("\n")
}
}' chirps*
The comma in the array subscript inserts awk's SUBSEP between the two numbers, so the row and column parts of the index can't run into each other. I am using gawk because BEGINFILE is useful for incrementing the column number.
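A quick illustration of my own (not part of the answer) of why the subscript separator matters:
gawk 'BEGIN{ print ((1 "" 12) == (11 "" 2)) }'          # 1 -> with an empty separator the indices collide
gawk 'BEGIN{ print ((1 SUBSEP 12) == (11 SUBSEP 2)) }'  # 0 -> SUBSEP keeps them distinct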
Save the above in your HOME directory as merge. Then start a Terminal and, just once, make it executable with the command:
chmod +x merge
Now change directory to where your chirps are with a command like:
cd subdirectory/where/chirps/are
Now you can run the script with:
$HOME/merge
The output will rush past on the screen. If you want it in a file, use:
$HOME/merge > merged.csv
First make one file without pasting and change that file into a oneliner with tr:
cat */chirps_*.csv | tr "\n" "," > long.csv
If the goal is a file with 100,000 lines and 500 columns then something like this should work:
paste -d, chirps_*.csv > chirps_500_merge.csv
Additional code can be used to sort the chirps_... input files into any desired order before pasting.
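The shell glob already expands chirps_YYYYMMDD.csv names in lexical, and therefore chronological, order; if you want the ordering to be explicit, a sketch (assuming the file names contain no whitespace) could be:
paste -d, $(printf '%s\n' chirps_*.csv | sort) > chirps_500_merge.csv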
The error comes from ulimit, from man ulimit:
-n or --file-descriptor-count The maximum number of open file descriptors
On my system ulimit -n returns 1024.
Happily we can paste the paste output, so we can chain it.
find . -type f -name 'file_*.csv' |
sort |
xargs -n$(($(ulimit -n) - 4)) sh -c '
tmp=$(mktemp);
paste -d, "$@" >$tmp;
echo $tmp
' -- |
xargs sh -c '
paste -d, "$@"
rm "$@"
' --
Don't parse ls output.
Once we move from parsing ls output to good old find, we find all the files and sort them.
The first xargs takes a bit less than ulimit -n files at a time (leaving a few descriptors free for stdin, stdout, stderr and the temporary file), creates a temporary file, pastes the chunk into it and outputs the temporary file's name.
The second xargs does the same with the temporary files, but also removes all the temporaries.
As the count of files would be 100*500 = 50,000, which is smaller than 1024*1024, we can get away with one pass.
Tested against test data generated with:
seq 1 2000 |
xargs -P0 -n1 -t sh -c '
seq 1 1000 |
sed "s/^/ $RANDOM/" \
>"file_$(date --date="-${1}days" +%Y%m%d).csv"
' --
The problem is much like a foldl with a maximum chunk size to fold in one pass. Basically we want paste -d, <(paste -d, <(paste -d, <1024 files>) <1023 files>) <rest of files>, run kind-of recursively. With a little fun I came up with the following:
func() {
paste -d, "$@"
}
files=()
tmpfilecreated=0
# read filenames from stdin
while IFS= read -r line; do
files+=("$line")
# if the limit of 1024 files is reached
if ((${#files[@]} == 1024)); then
tmp=$(mktemp)
func "${files[@]}" >"$tmp"
# remove the last tmp file
if ((tmpfilecreated)); then
rm "${files[0]}"
fi
tmpfilecreated=1
# start with fresh files list
# with only the tmp file
files=("$tmp")
fi
done
func "${files[@]}"
# remember to clear tmp file!
if ((tmpfilecreated)); then
rm "${files[0]}"
fi
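The loop reads the file names from standard input, so a hedged usage sketch (the script name is illustrative) would be:
find . -type f -name 'file_*.csv' | sort | bash fold_paste.sh > merged.csv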
I guess readarray/mapfile could be faster, and result in a bit clearer code:
func() {
paste -d, "$@"
}
tmp=()
tmpfilecreated=0
while readarray -t -n1023 files && ((${#files[@]})); do
newtmp=$(mktemp)
func "${tmp[@]}" "${files[@]}" >"$newtmp" # paste the previous tmp file (if any) with the next chunk
if ((tmpfilecreated)); then
rm "${tmp[0]}" # remove the previous tmp file
fi
tmp=("$newtmp")
tmpfilecreated=1
done
if ((tmpfilecreated)); then
cat "${tmp[0]}"
# remember to clear tmp file!
rm "${tmp[0]}"
fi
PS. I want to merge all the csv files into a single csv in chronological order. Wouldn't that be just cut? Right now each column represents one day.
You can try this Perl one-liner. It will work for any number of files matching *.csv under a directory.
$ ls -1 *csv
file_1.csv
file_2.csv
file_3.csv
$ cat file_1.csv
1
2
3
$ cat file_2.csv
4
5
6
$ cat file_3.csv
7
8
9
$ perl -e ' BEGIN { while($f=glob("*.csv")) { $i=0;open($FH,"<$f"); while(<$FH>){ chomp;@t=@{$kv{$i}}; push(@t,$_);$kv{$i++}=[@t];}} print join(",",@{$kv{$_}})."\n" for(0..$i-1) } '
1,4,7
2,5,8
3,6,9
$
I have thousands of files on Unix that I need to split into two parts, according to the following rules:
1) Find the first occurrence of the string ' JOB ' in the file
2) Find the first line after the occurrence found in point 1) which doesn't end with a comma ','
3) Split the file after the line found in point 2)
Below is a sample file; this one should be split after the line ending with the string 'DUMMY'.
//*%OPC SCAN
//*%OPC FETCH MEMBER=$BUDGET1,PHASE=SETUP
// TESTJOB JOB USER=TESTUSER,MSGLEVEL=5,
// CLASS=H,PRIORITY=10,
// PARAM=DUMMY
//*
//STEP1 EXEC DB2OPROC
//...
How can I achieve this?
Thanks
You can use sed for this task:
$ cat data1
//*%OPC SCAN
//*%OPC FETCH MEMBER=$BUDGET1,PHASE=SETUP
// TESTJOB JOB USER=TESTUSER,MSGLEVEL=5,
// CLASS=H,PRIORITY=10,
// PARAM=DUMMY
//*
//STEP1 EXEC DB2OPROC
//...
$ sed -n '0,/JOB/ p;/JOB/,/[^,]$/ p' data1 | uniq > part1
$ sed '0,/JOB/ d;0,/[^,]$/ d' data1 > part2
$ cat part1
//*%OPC SCAN
//*%OPC FETCH MEMBER=$BUDGET1,PHASE=SETUP
// TESTJOB JOB USER=TESTUSER,MSGLEVEL=5,
// CLASS=H,PRIORITY=10,
// PARAM=DUMMY
$ cat part2
//*
//STEP1 EXEC DB2OPROC
//...
$
My solution is:
find all files to be checked;
grep each file for the specified pattern with -n to get the matching line number;
split each matching file with head and tail using the line number from step two.
What's more, grep can handle regular expressions, such as grep -n "^.*JOB.*[^,]$" filename. A sketch of these steps follows.
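A minimal sketch of those three steps (mine, with an illustrative '*.jcl' name pattern); it uses awk rather than a single grep to get the line number, because the line that doesn't end with a comma may come after the ' JOB ' line rather than on it:
find . -type f -name '*.jcl' | while IFS= read -r f; do
    # first line at or after the first " JOB " line that does not end with a comma
    n=$(awk '/ JOB /{found=1} found && !/,$/{print NR; exit}' "$f")
    [ -n "$n" ] || continue                     # no split point found: leave the file alone
    head -n "$n" "$f" > "${f}.part1"            # everything up to and including that line
    tail -n +"$((n + 1))" "$f" > "${f}.part2"   # everything after it
done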
You can do this in a couple of steps using awk/sed:
line=`awk '/JOB/,/[^,]$/ {x=NR} END {print x}' filename`
next=`expr $line + 1`
sed -ne "1,$line p" filename > part_1
sed -ne "$next,\$ p" filename > part_2
where filename is the name of your file. This will create two files: part_1 and part_2.