Combine 2 files based on the date using awk or sed - shell

I have 2 files; the format is as follows.
File1's contents:
02-01-12 28.46
02-02-12 27.15
02-03-12 47.54
02-04-12 27.36
02-05-12 47.57
02-06-12 27.01
02-07-12 27.41
02-08-12 27.27
02-09-12 27.39
File2's contents:
02-01-12 11.46
02-02-12 12.15
02-03-12 14.54
02-04-12 15.36
02-05-12 17.57
02-06-12 17.01
02-07-12 17.41
02-08-12 21.27
02-09-12 17.39
I want to combine them into 1 file based on the date as below,
02-01-12 28.46 11.46
02-02-12 27.15 12.15
02-03-12 47.54 14.54
....................
....................
....................
Please help! Thanks in advance.

What you want is join.
From the man page:
join - join lines of two files on a common field
Try:
$ join file1 file2
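With the sample data above (both files already sorted on the date field, which is what join expects), that should produce something like:
02-01-12 28.46 11.46
02-02-12 27.15 12.15
02-03-12 47.54 14.54
...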

A full, real working example using paste:
paste FILE1 <(cut -d " " -f2 FILE2)
See:
man 1 paste

Using just sed (the R command below is a GNU sed extension):
/bin/sed -n '
p
R f2
' f1 |
/bin/sed 'N;s/\n[^ ]*//;'

Related

bash print only last 7 fields

I have hundreds of thousands of files with several hundred thousand lines in each of them.
2022-09-19/SALES_1.csv:CUST1,US,2022-09-19,43.31,17.56,47.1,154.48,154. 114
2022-09-20/SALES_2.csv:CUST2,NA,2022-09-20,12.4,16.08,48.08,18.9,15.9,3517
The lines may have different numbers of fields. No matter how many fields are present, I want to extract just the last 7 fields.
I'm trying with cut and awk but have only been able to print a range of fields, not the last 'n' fields.
Please could I request guidance.
$ rev file | cut -d, -f1-7 | rev
will give the last 7 fields regardless of varying number of fields in each record.
Using any POSIX awk:
$ awk -F',' 'NF>7{sub("([^,]*,){"NF-7"}","")} 1' file
US,2022-09-19,43.31,17.56,47.1,154.48,154. 114
2022-09-20,12.4,16.08,48.08,18.9,15.9,3517
{m,g}awk' BEGIN { _+=(_+=_^= FS = OFS = ",")+_
___= "^[^"(__= "\5") ("]*")__

} NF<=_ || ($(NF-_) = __$(NF-_))^(sub(___,"")*!_)'
US,2022-09-19,43.31,17.56,47.1,154.48,154. 114
2022-09-20,12.4,16.08,48.08,18.9,15.9,3517
In pure Bash, without any external processes and/or pipes:
(IFS=,; while read -ra line; do printf '%s\n' "${line[*]: -7}"; done;) < file
Prints the last 7 fields:
sed -E 's/.*,((.*,){6}.*)/\1/' file

Should I use a for loop to process text files line by line?

So I have two text files
FILE1: 1-40 names
FILE2: 1-40 names
Now what I want the program to do (in the terminal) is go through the files in lockstep, advancing by one line in each, so that the first name from FILE1 runs with the first line from FILE2, and the 20th name from FILE1 runs with the 20th line from FILE2.
But I don't want it to take the first name from FILE1 and then run through all of the names listed in FILE2, repeating that over and over again.
Should I do a for loop?
I was thinking of doing something like:
for f in (cat FILE1); do
flirt -in $f -ref (cat FILE2);
done
I'm doing this using BASH.
Yes, you can do it quite easily, but it will require reading from two different file descriptors at once. You can simply redirect one of the files onto the next available file descriptor and use it to feed your read loop, e.g.
while read f1var && read -u 3 f2var; do
echo "f1var: $f1var -- f2var: $f2var"
done <file1.txt 3<file2.txt
This will read line by line from each file, taking a line from file1.txt on the standard file descriptor into f1var and a line from file2.txt on file descriptor 3 into f2var.
A short example might help:
Example Input Files
$ cat f1.txt
a
b
c
$ cat f2.txt
d
e
f
Example Use
$ while read f1var && read -u 3 f2var; do \
echo "f1var: $f1var -- f2var: $f2var"; \
done <f1.txt 3<f2.txt
f1var: a -- f2var: d
f1var: b -- f2var: e
f1var: c -- f2var: f
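Applying the same two-descriptor loop to the flirt command from your question (a sketch; it assumes flirt takes one name via -in and one via -ref, as in your attempt):
while read -r in_name && read -u 3 ref_name; do
    flirt -in "$in_name" -ref "$ref_name"   # pairs line N of FILE1 with line N of FILE2
done < FILE1 3< FILE2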
Using paste as an alternative
The paste utility also provides a simple alternative for combining files line-by-line, e.g.:
$ paste f1.txt f2.txt
a d
b e
c f
In Bash, you might make use of arrays:
echo "Alice
> Bob
> Claire" > file-1
echo "Anton
Bärbel
Charlie" > file-2
n1=($(cat file-1))
n2=($(cat file-2))
for n in {0..2}; do echo ${n1[$n]} ${n2[$n]} ; done
Alice Anton
Bob Bärbel
Claire Charlie
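One caveat: n1=($(cat file-1)) splits on every whitespace character, so a name containing a space would be broken into two elements. If that can happen, mapfile (bash 4+) keeps one array element per line; a small sketch:
mapfile -t n1 < file-1
mapfile -t n2 < file-2
for i in "${!n1[@]}"; do echo "${n1[$i]} ${n2[$i]}"; done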
Getting familiar with join and nl (number lines) can't be wrong, so here is a different approach:
nl -w 1 file-1 > file1
nl -w 1 file-2 > file2
join -1 1 -2 1 file1 file2 | sed -r 's/^[0-9]+ //'
nl will put a large number of blanks in front of the small line numbers if we don't tell it -w 1.
We join the files by matching line number and remove the line number afterwards with sed.
Paste is of course much more elegant. Didn't know about this.

KornShell getting latest file versions

I want to write a KornShell (ksh) script to get the latest three versions of a file in a directory that holds a lot of files with various versions (the prefix is the same, but the time stamp suffix differs), compress them, and at the same time remove the remaining versions of the file (everything other than the latest three versions).
I'm new to KornShell scripting. Can anyone provide me with a solution?
The directory structure is like this:
abcd.11122013.txt
abcd.12122013.txt
abcd.10122013.txt
abcd.09122013.txt
xyz.11122013.txt
xyz.12122013.txt
xyz.10122013.txt
......................
From these I want the latest 3 versions of the files starting with the abcd prefix, and similarly for the files starting with xyz.
Assuming the file modification times are correct, it could be as simple as:
ls -tr abcd* | tail -3
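To also compress the three newest versions and delete the rest, as the question asks, here is a minimal ksh sketch building on the same idea (it assumes the modification times really do reflect the versions and that the filenames contain no whitespace):
#!/bin/ksh
for prefix in abcd xyz; do
    set -A files $(ls -t "${prefix}".*.txt)   # newest first
    i=0
    for f in "${files[@]}"; do
        if (( i < 3 )); then
            gzip "$f"        # keep and compress the three newest
        else
            rm -f "$f"       # drop the older versions
        fi
        (( i += 1 ))
    done
done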
One way using perl (splitting the filename on the literal dots; the hash key is rearranged into yyyymmdd order, assuming ddmmyyyy timestamps, so a descending numeric sort puts the newest first):
printf '%s\n' * |
perl -F'\.' -lane '
    /^abcd/ and
    $h{
        substr($F[1], 4, 4) .
        substr($F[1], 2, 2) .
        substr($F[1], 0, 2)
    } = $_;
    END {
        $c = 0;
        for (sort { $b <=> $a } keys %h) {
            print $h{$_};
            ++$c == 3 and last;
        }
    }
'
There are undoubtedly simpler ones but here is one way:
for base in abcd xyz
do
ls ${base}* | sed 's/\(.*\).\(..\)\(..\)\(....\).txt/\4 \3 \2 \1/' | sort | tail -3 | while read a b c d;do echo $d.$c$b$a.txt; done
done
Update: using an array to store the prefixes and showing one way to call the archive function for each file.
#!/bin/ksh
typeset -a prefix=(abcd xyz foo bar)
for base in "${prefix[@]}"
do
    for file in $(ls "${base}"* |
        sed 's/\(.*\)\.\(..\)\(..\)\(....\)\.txt/\4 \3 \2 \1/' |
        sort |
        tail -3 |
        while read a b c d; do
            echo $d.$c$b$a.txt
        done); do
        echo archive $file
    done
done

using paste command in a loop

I am using Fedora and bash to do some text manipulation with the files I have. I am trying to combine a large number of files, each one with two columns of data. From these files, I want to extract the data in the 2nd column and put it into a single file. Previously, I used the following script:
paste 0_0.dat 0_6.dat 0_12.dat | awk '{print $1, $2, $4}' >0.dat
But this gets painfully hard as the number of files grows larger -- I'm trying to do it with 100 files. So I looked around the web to see if there's a way to achieve this in a simple way, but came up empty-handed.
I'd like to invoke a 'for' loop, if possible -- for example,
for i in $(seq 0 6 600)
do
paste 0_0.dat | awk '{print $2}'>>0.dat
done
but this does not work, of course, with the paste command.
Please let me know if you have any recommendations on how to do what I'm trying to do ...
DATA FILE #1 looks like below (delimited by a space)
-180 0.00025432
-179 0.000309643
-178 0.000189226
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
178 0.0023454268
179 0.002352534
180 0.001504992
DATA FILE #2
-180 0.0002352
-179 0.000423452
-178 0.00019304
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
178 0.0023454268
179 0.002352534
180 0.001504992
First column goes from -180 to 180, with increment of 1.
DESIRED
(n is the # of columns and the # of files)
-180 0.00025432 0.00025123 0.000235123 0.00023452 0.00023415 ... n
-179 0.000223432 0.0420504 0.2143450 0.002345123 0.00125235 ... n
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
179 0.002352534 ... n
180 0.001504992 ... n
Thanks,
join can get you your desired result.
join <(sort -r file1) <(sort -r file2)
Test:
[jaypal:~/Temp] cat file1
-180 0.00025432
-179 0.000309643
-178 0.000189226
[jaypal:~/Temp] cat file2
-180 0.0005524243
-179 0.0002424433
-178 0.0001833333
[jaypal:~/Temp] join <(sort -r file1) <(sort -r file2)
-180 0.00025432 0.0005524243
-179 0.000309643 0.0002424433
-178 0.000189226 0.0001833333
Note that join only accepts two files per invocation, so a single call like find . -type f -name "file*" -exec join '{}' + won't combine more than two files at once; to do multiple files you have to apply join repeatedly, as sketched below.
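A hedged way to fold many files together two at a time ("combined" and "tmp" are scratch names assumed here; each input is re-sorted the way the join above expects):
cp 0_0.dat combined
for f in 0_6.dat 0_12.dat; do    # extend to your full list, e.g. $(seq --format="0_%g.dat" 6 6 600)
    join <(sort -r combined) <(sort -r "$f") > tmp && mv tmp combined
done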
How about this:
paste "$#" | awk '{ printf("%s", $1);
for (i = 2; i < NF; i += 2)
printf(" %s", $i); printf "\n";
}'
This assumes that you don't run into a limit with paste (check how many open files it can have). The "$@" notation means 'all the arguments given, exactly as given'. The awk script simply prints $1 from each line of pasted output, followed by the even-numbered columns, followed by a newline. It doesn't validate that the odd-numbered columns all match; it would perhaps be sensible to do so, and you could code a vaguely similar loop to do so in awk (see the sketch below). It also doesn't check that the number of fields on this line is the same as the number on the previous line; that's another reasonable check. But this does do the whole job in one pass over all the files - for an essentially arbitrary list of files.
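For what it's worth, a rough sketch of that validation (it warns on stderr when a pasted key column disagrees with column 1, and otherwise prints the same output as above):
paste "$@" | awk '{
    printf("%s", $1)
    for (i = 2; i <= NF; i += 2) {
        if (i > 2 && $(i-1) != $1)
            print "key mismatch on line " NR ": " $(i-1) " vs " $1 | "cat >&2"
        printf(" %s", $i)
    }
    printf "\n"
}'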
I have 100 input files -- how do I use this code to open up these files?
You put my original answer in a script 'filter-data'; you invoke the script with the 101 file names generated by seq. The paste command pastes all 101 files together; the awk command selects the columns you are interested in.
filter-data $(seq --format="0_%g.dat" 0 6 600)
The seq command with the format will list you 101 file names; these are the 101 files that will be pasted.
You could even do without the filter-data script:
paste $(seq --format="0_%g.dat" 0 6 600) | awk '{ printf("%s", $1);
    for (i = 2; i <= NF; i += 2)
        printf(" %s", $i);
    printf "\n";
}'
I'd probably go with the more general script as the main script, and if need be I'd create a 'one-liner' that invokes the main script with the specific set of arguments currently of interest.
The other key point which might be a stumbling block: paste is not limited to 2 files only; it can paste as many files as you can have open (give or take about 3).
Based on my assumptions that you see in the comments above, you don't need paste. Try this
awk '{ arr[$1] = arr[$1] "\t" $2 }
END { for (x = -180; x <= 180; x++) print x arr[x] }' *.txt |
sort -n
Note that we just read all of the values into an array keyed on the first field, appending each $2 to the entry for its $1 key. After all the data has been read in, the END section prints out each key and its accumulated values. You could add labels like "x=" and ":vals=" inside the print to help 'explain' what is happening; as written it prints clean tab-separated data. Change '\t' to ':' or '|', or ... shudder ... ',' if you need to. Change the *.txt to whatever your filespec is.
Be aware that all Unix command lines have limits on the number and total size of filenames (the length of the filenames, not the data inside them) that can be processed in one invocation. Let us know if you get error messages about that.
The pipe to sort ensures that the data is sorted by column 1.
With my test data, the output was
-178 0.0001892261 0.0001892262 0.0001892263 0.000189226
-179 0.0003096431 0.0003096432 0.0003096433 0.000309643
-180 0.000254321 0.000254322 0.000254323 0.00025432
178 0.0001892261 0.0001892262 0.0001892263 0.000189226
179 0.0003096431 0.0003096432 0.0003096433 0.000309643
180 0.000254321 0.000254322 0.000254323 0.00025432
Based on 4 files of input.
I hope this helps.
This might work for you:
echo *.dat | sed 's/\S*/<(cut -d" " -f2 &)/2g;s/^/paste /' | bash >all.dat

UNIX shell-scripting: Split a textfile by its entries

I'm trying to analyze an enormous text file (1.6GB), whose data lines look like this:
20090118025859 -2.400000 78.100000 1023.200000 0.000000
20090118025900 -2.500000 78.100000 1023.200000 0.000000
20090118025901 -2.400000 78.100000 1023.200000 0.000000
I don't even know how many lines there are, but I'm trying to split the file by date. The left number is a timestamp (these example lines are from January 18th, 2009).
How can I split this file into pieces according to the date?
The number of entries per date differs, so using split with a constant number won't work.
All I can think of is something like grep '^20090118' file > data20090118.dat, but there surely is a way to do all the dates at once, right?
Thanks in advance,
Alex
Using awk:
awk '{ print > ("data" substr($1, 1, 8) ".dat") }' myfile
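If the file spans many dates and your awk implementation limits the number of open output files, a hedged variant that closes each day's file when the date changes (it assumes the input is grouped by date, as in the sample):
awk '{
    d = "data" substr($1, 1, 8) ".dat"
    if (d != prev && prev != "") close(prev)   # done with the previous day
    print > d
    prev = d
}' myfile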
This reads the file line by line in the shell and appends each line to a file named after its date:
while IFS= read -r line
do
    date=$(echo "$line" | cut -d ' ' -f 1 | cut -c 1-8)   # yyyymmdd part of the timestamp
    echo "$line" >> "$date.dat"
done < log.dat
With the caveats that each day needs to have more than 1 record,
and that the output files will have blank lines:
uniq --all-repeated=separate -w8 file | csplit -s - '/^$/' '{*}'
We really should have an option for uniq to output even unique records.
Also csplit should have an option to suppress the matched line.
