Related
I have a huge file and I need to retrieve specific columns from File1 which is ~ 200000 rows and ~ 1000 Columns if it matches with the list of file2. (Prefer Bash over R )
for example my dummy data files are as follows,
file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
and file2
sample
s4
s3
s7
s8
My desired output is
gene s3 s4
a 1 2
b 2 3
c 1 1
d 2 2
likewise, i have 3 different file2 and i have to pick different samples from the same file1 into a new file.
I would be very greatful if you guys can provide me with your valuable suggestions
P.S: I am a Biologist, i have very little coding experience
Regards
Ateeq
$ cat file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
$ cat file2
gene
s4
s3
s8
s7
$ cat a
awk '
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
' $*
$ ./a file2 file1 | column -t
gene s4 s3 s8 s7
a 2 1 N/A N/A
b 3 2 N/A N/A
c 1 1 N/A N/A
d 2 2 N/A N/A
The above should get you on your way. It's an extremely optimistic program and no negative testing was performed.
Awk is a tool that applies a set of commands to every line of every file that matches an expression. In general, the awk script has the form:
<pattern> <command>
There are three such pairs above. Each needs a little explanation:
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
NR == FNR is a awk'ism. NR is the record number and FNR is the record number in the current file. NR is always increasing but FNR resets to 1 when awk parses the next file. NR==FNR is an idiom that is only true when parsing the first file.
I've designed the awk program to read the columns file first (you are calling this file2). File2 has a list of columns to output. As you can see, we are storing each line in the first file (file2) into an array called columns. We are also printing the columns out as we read them. In order to avoid newlines after each column name (since we want all the column headers to be on the same line), we use printf which doesn't output a newline (as opposed to print which does).
The 'next' at the end of the stanza tells awk to read the next line in the file without processing any of the other stanzas. After all, we just want to read the first file.
In summary, the first stanza remembers the column names (and order) and prints them out on a single line (without a newline).
The second "stanza":
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
FNR==1 will match on the first line of any file. Due to the next in the previous stanza, we'll only hit this stanza when we are on the first line of the second file (file1). The first print "" statement adds the newline that was missing from the first stanza. Now the line with the column headers is complete.
The split command takes the first parameter, $0, the current line and splits it according to whitespace. We know the current line is the first line and has the column headers in it. The split command writes to an array named in the second parameter , headers. Now headers[1] = "gene" and headers[2] = "s4" , headers[3] = "s3", etc.
We're going to need to map the column names to the column numbers. The next bit of code takes each header value and creates an aheaders entry. aheders is an associative array that maps column header names to the column number.
aheaders["gene"] = 1
aheaders["s1"] = 2
aheaders["s2"] = 3
aheaders["s3"] = 4
aheaders["s4"] = 5
aheaders["s5"] = 6
When we're done making the aheaders array, the next command tells awk to skip to the next line of the input. From this point on, only the third stanza is going to have a true condition.
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
The third stanza has no explicit . Awk will process this as always true. So this last is executed for every line of the second file.
At this point, we want to print the columns that are specified in columns array. We walk through each element of the array in order. The first time through the loop, columns[1] = "gene_symbol". This gives us:
printf "%s\t" , $aheaders[ "gene" ]
And since aheaders["gene"] = 1 this gives us:
printf "%s\t" , $1
And awk understands $1 to be the first field (or column) in the input line. Thus the first column is passed to printf which outputs the value with a tab (\t) appended.
The loop then executes another time with x=2 and columns[2]="s4". This results in the following print executing:
printf "%s\t" , $5
This prints the fifth column followed by a tab. The next iteration:
columns[3] = "s3"
aheaders["s3"] = 4
Which results in:
printf "%s\t" , $4
That is, the fourth field is output.
The next iteration we hit a failure situation:
columns[4] = "s8"
aheaders["s8"] = ''
In this case, the length( aheaders[ columns[x] ] ) == 0 is true so we just print out a placeholder - something to tell the operator their input may be invalid:
printf "N/A\t"
The same is output when we process the last columns[x] value "s7".
Now, since there are no more entries in columns, the loop exists and we hit the final print:
print ""
The empty string is provided to print because print by itself defaults to print $0 - the entire line.
At this point, awk reads the next line out of file1 hits the third block again (only). Thus awk continues until the second file is completely read.
I have two files:
File with strings (new line terminated)
File with integers (one per line)
I would like to print the lines from the first file indexed by the lines in the second file. My current solution is to do this
while read index
do
sed -n ${index}p $file1
done < $file2
It essentially reads the index file line by line and runs sed to print that specific line. The problem is that it is slow for large index files (thousands and ten thousands of lines).
Is it possible to do this faster? I suspect awk can be useful here.
I search SO to my best but could only find people trying to print line ranges instead of indexing by a second file.
UPDATE
The index is generally not shuffled. It is expected for the lines to appear in the order defined by indices in the index file.
EXAMPLE
File 1:
this is line 1
this is line 2
this is line 3
this is line 4
File 2:
3
2
The expected output is:
this is line 3
this is line 2
If I understand you correctly, then
awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile
should work, under the assumption that the index is sorted in ascending order or you want lines to be printed in their order in the data file regardless of the way the index is ordered. This works as follows:
NR == FNR { # while processing the first file
selected[$1] = 1 # remember if an index was seen
next # and do nothing else
}
selected[FNR] # after that, select (print) the selected lines.
If the index is not sorted and the lines should be printed in the order in which they appear in the index:
NR == FNR { # processing the index:
++counter
idx[$0] = counter # remember that and at which position you saw
next # the index
}
FNR in idx { # when processing the data file:
lines[idx[FNR]] = $0 # remember selected lines by the position of
} # the index
END { # and at the end: print them in that order.
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This can be inlined as well (with semicolons after ++counter and index[FNR] = counter, but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile. With an index file
1
4
3
and a data file
line1
line2
line3
line4
this will print
line1
line4
line3
The remaining caveat is that this assumes that the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file and remember the lines for each position. That is:
NR == FNR {
++counter
idx[$0] = idx[$0] " " counter # remember a list here
next
}
FNR in idx {
split(idx[FNR], pos) # split that list
for(p in pos) {
lines[pos[p]] = $0 # and remember the line for
# all positions in them.
}
}
END {
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This, finally, is the functional equivalent of the code in the question. How complicated you have to go for your use case is something you'll have to decide.
This awk script does what you want:
$ cat lines
1
3
5
$ cat strings
string 1
string 2
string 3
string 4
string 5
$ awk 'NR==FNR{a[$0];next}FNR in a' lines strings
string 1
string 3
string 5
The first block only runs for the first file, where the line number for the current file FNR is equal to the total line number NR. It sets a key in the array a for each line number that should be printed. next skips the rest of the instructions. For the file containing the strings, if the line number is in the array, the default action is performed (so the line is printed).
Use nl to number the lines in your strings file, then use join to merge the two:
~ $ cat index
1
3
5
~ $ cat strings
a
b
c
d
e
~ $ join index <(nl strings)
1 a
3 c
5 e
If you want the inverse (show lines that NOT in your index):
$ join -v 2 index <(nl strings)
2 b
4 d
Mind also the comment by #glennjackman: if your files are not lexically sorted, then you need to sort them before passing in:
$ join <(sort index) <(nl strings | sort -b)
In order to complete the answers that use awk, here's a solution in Python that you can use from your bash script:
cat << EOF | python
lines = []
with open("$file2") as f:
for line in f:
lines.append(int(line))
i = 0
with open("$file1") as f:
for line in f:
i += 1
if i in lines:
print line,
EOF
The only advantage here is that Python is way more easy to understand than awk :).
Okay, I have two files: one is baseline and the other is a generated report. I have to validate a specific string in both the files match, it is not just a single word see example below:
.
.
name os ksd
56633223223
some text..................
some text..................
My search criteria here is to find unique number such as "56633223223" and retrieve above 1 line and below 3 lines, i can do that on both the basefile and the report, and then compare if they match. In whole i need shell script for this.
Since the strings above and below are unique but the line count varies, I had put it in a file called "actlist":
56633223223 1 5
56633223224 1 6
56633223225 1 3
.
.
Now from below "Rcount" I get how many iterations to be performed, and in each iteration i have to get ith row and see if the word count is 3, if it is then take those values into variable form and use something like this
I'm stuck at the below, which command to be used. I'm thinking of using AWK but if there is anything better please advise. Here's some pseudo-code showing what I'm trying to do:
xxxxx=/root/xxx/xxxxxxx
Rcount=`wc -l $xxxxx | awk -F " " '{print $1}'`
i=1
while ((i <= Rcount))
do
record=_________________'(Awk command to retrieve ith(1st) record (of $xxxx),
wcount=_________________'(Awk command to count the number of words in $record)
(( i=i+1 ))
done
Note: record, wcount values are later printed to a log file.
Sounds like you're looking for something like this:
#!/bin/bash
while read -r word1 word2 word3 junk; do
if [[ -n "$word1" && -n "$word2" && -n "$word3" && -z "$junk" ]]; then
echo "all good"
else
echo "error"
fi
done < /root/shravan/actlist
This will go through each line of your input file, assigning the three columns to word1, word2 and word3. The -n tests that read hasn't assigned an empty value to each variable. The -z checks that there are only three columns, so $junk is empty.
I PROMISE you you are going about this all wrong. To find words in file1 and search for those words in file2 and file3 is just:
awk '
NR==FNR{ for (i=1;i<=NF;i++) words[$i]; next }
{ for (word in words) if ($0 ~ word) print FILENAME, word }
' file1 file2 file3
or similar (assuming a simple grep -f file1 file2 file3 isn't adequate). It DOES NOT involve shell loops to call awk to pull out strings to save in shell variables to pass to other shell commands, etc, etc.
So far all you're doing is asking us to help you implement part of what you think is the solution to your problem, but we're struggling to do that because what you're asking for doesn't make sense as part of any kind of reasonable solution to what it sounds like your problem is so it's hard to suggest anything sensible.
If you tells us what you are trying to do AS A WHOLE with sample input and expected output for your whole process then we can help you.
We don't seem to be getting anywhere so let's try a stab at the kind of solution I think you might want and then take it from there.
Look at these 2 files "old" and "new" side by side (line numbers added by the cat -n):
$ paste old new | cat -n
1 a b
2 b 56633223223
3 56633223223 c
4 c d
5 d h
6 e 56633223225
7 f i
8 g Z
9 h k
10 56633223225 l
11 i
12 j
13 k
14 l
Now lets take this "actlist":
$ cat actlist
56633223223 1 2
56633223225 1 3
and run this awk command on all 3 of the above files (yes, I know it could be briefer, more efficient, etc. but favoring simplicity and clarity for now):
$ cat tst.awk
ARGIND==1 {
numPre[$1] = $2
numSuc[$1] = $3
}
ARGIND==2 {
oldLine[FNR] = $0
if ($0 in numPre) {
oldHitFnr[$0] = FNR
}
}
ARGIND==3 {
newLine[FNR] = $0
if ($0 in numPre) {
newHitFnr[$0] = FNR
}
}
END {
for (str in numPre) {
if ( str in oldHitFnr ) {
if ( str in newHitFnr ) {
for (i=-numPre[str]; i<=numSuc[str]; i++) {
oldFnr = oldHitFnr[str] + i
newFnr = newHitFnr[str] + i
if (oldLine[oldFnr] != newLine[newFnr]) {
print str, "mismatch at old line", oldFnr, "new line", newFnr
print "\t" oldLine[oldFnr], "vs", newLine[newFnr]
}
}
}
else {
print str, "is present in old file but not new file"
}
}
else if (str in newHitFnr) {
print str, "is present in new file but not old file"
}
}
}
.
$ awk -f tst.awk actlist old new
56633223225 mismatch at old line 12 new line 8
j vs Z
It's outputing that result because the 2nd line after 56633223225 is j in file "old" but Z in file "new" and the file "actlist" said the 2 files had to be common from one line before until 3 lines after that pattern.
Is that what you're trying to do? The above uses GNU awk for ARGIND but the workaround is trivial for other awks.
Use the below code:
awk '{if (NF == 3) { word1=$1; word2=$2; word3=$3; print "Words are:" word1, word2, word3} else {print "Line", NR, "is having", NF, "Words" }}' filename.txt
I have given the solution as per the requirement.
awk '{ # awk starts from here and read a file line by line
if (NF == 3) # It will check if current line is having 3 fields. NF represents number of fields in current line
{ word1=$1; # If current line is having exact 3 fields then 1st field will be assigned to word1 variable
word2=$2; # 2nd field will be assigned to word2 variable
word3=$3; # 3rd field will be assigned to word3 variable
print word1, word2, word3} # It will print all 3 fields
}' filename.txt >> output.txt # THese 3 fields will be redirected to a file which can be used for further processing.
This is as per the requirement, but there are many other ways of doing this but it was asked using awk.
I have a CSV that was exported, some lines have a linefeed (ASCII 012) in the middle of a record. I need to replace this with a space, but preserve the new line for each record to load it.
Most of the lines are fine, however a good few have this:
Input:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :
Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
Output:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
I have been looking into Awk but cannot really make sense of how to preserve the actual row.
Another Example:
Input:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working
Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
Output:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
One way using GNU awk:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
FS = "[,~]"
}
NF < 21 {
line = (line ? line OFS : line) $0
fields = fields + NF
}
fields >= 21 {
print line
line=""
fields=0
}
NF == 21 {
print
}
Alternatively, you can use this one-liner:
awk -F "[,~]" 'NF < 21 { line = (line ? line OFS : line) $0; fields = fields + NF } fields >= 21 { print line; line=""; fields=0 } NF == 21 { print }' file.txt
Explanation:
I made an observation about your expected output: it seems each line should contain exactly 21 fields. Therefore if your line contains less than 21 fields, store the line and store the number of fields. When we loop onto the next line, the line will be joined to the stored line with a space, and the number of fields totaled. If this number of fields is greater or equal to 21 (the sum of the fields of a broken line will add to 22), print the stored line. Else if the line contains 21 fields (NF == 21), print it. HTH.
I think sed is your choice. I assume all the records end with non-colon character, thus if a line end with a colon, it is recognized as an exception and should be concatenated to the previous line.
Here is the code:
cat data | sed -e '/[^"]$/N' -e 's/\n//g'
The first execution -e '/[^"]$/N' match an abnormal case, and read in next record without empty the buffer. Then -e 's/\n//g' remove the new line character.
try this one-liner:
awk '{if(t){print;t=0;next;}x=$0;n=gsub(/"/,"",x);if(n%2){printf $0" ";t=1;}else print $0}' file
idea:
count the number of " in a line. if the count is odd, join the following line, otherwise the current line would be considered as a complete line.
I have 15 files like
file1.csv
a,cg2,0,0,0,21,0
a,cq1,10,0,0,0,0
a,cm2,0,19,0,0,0
...
a,ad10,0,0,0,37,0
file2.csv
d,cm1,0,3,0,0,0
d,cs2,0,32,0,0,0
d,cg2,0,0,9,0,0
...
d,az2,0,0,0,21,0
.
.
.
.
file15.csv
s,sd1,0,23,0,0,0
s,cw1,0,0,7,0,0
s,c23,0,0,90,0,0
...
s,cg2,0,45,0,0,0
I have different number of lines in each file and I want to compare the second field of all 15 files and extract the lines which are common to second field of all 15 files.
in this above case
output is:
cg2
(taking it is common to second field of all 15 files)
I am little new to unix and shell scripting, please help
Do you want the full lines from each of the fifteen files where field 2 appears in all fifteen files? Or do you only want a list of the field 2 values that appear in all fifteen files.
The former:
a,cg2,0,0,0,21,0
d,cg2,0,0,9,0,0
. . .
s,cg2,0,45,0,0,0
. . .
The latter:
cg2
. . .
If the latter, then this should work
awk -F, '{arr[$2]++; if (FILENAME != prevfile) {c++; prevfile = FILENAME}} END {for (i in arr) {if (arr[i] == c) {print i}}}' file*.csv
Broken out on multiple lines:
awk -F, '{
arr[$2]++;
if (FILENAME != prevfile) {
c++;
prevfile = FILENAME
}
}
END {
for (i in arr) {
if (arr[i] >= c) {
print i
}
}
}' file*.csv
Explanation:
increment the count of the number of times a field 2 value occurs
if the filename changes, increment the count of files (the first file changes from a null string to its filename and the count increments from 0 to 1)
save the current filename
once all the counting is done, iterate of the array by its keys
if the count contained in the array is greater than or equal to the number of files, then the field 2 value appeared in all the files (by checking for >= instead of == this will work in case a value appears more than once in a single file)
so print the key (which is a field 2 value)
a glob is used to get all the files, but you could list them explicitly
Edit:
Here's a way to print the full matching lines using a two-pass technique. It's a modification of the version above. Make sure to list the files twice.
awk -F, '
FILENAME == first && flag {
exit
}
! first {
first = FILENAME
}
FILENAME != first {
flag = 1
}
{
arr[$2]++;
if (FILENAME != prevfile) {
c++;
prevfile = FILENAME
}
}
END {
# print the matching lines
do {
if ($2 in arr) {
print;
}
} while (getline);
# print the list of words
for (i in arr) {
if (arr[i] >= c) {
print i
}
}
}' file*.csv file*.csv
It depends on the first file in the first group being the same name as the first file in the second group. Using globbing similar to what I've shown will take care of that requirement.
It prints the matching lines (not grouped, though), then it prints the list of words. If you want only one or the other, comment out or remove the loop that you don't want (do/while or for).
If you print only the full lines, you can pipe the output to:
sort -t , -k2,2
to have them grouped.
Piping only the list of words to:
sort
will put them in the same order for easier comparison.
Fun problem.
One way to do it, entirely in Bash, is as follows.
One thing you will need to invoke is join -t ',' -1 2 -2 2 file1 file2 to join on the second column of two files. Before you can join, though, you must sort on the second column.
Do successive joins in a for-loop, because join takes only two files as arguments.
ADDENDUM
Here is a little transcript showing successive joins. You can adapt it fairly easily, I think.
$ cat 1.csv
a,b,c,d
e,f,g,h
i,j,k,l
$ cat 2.csv
7,5,4,3
3,b,s,e
2,f,5,5
$ cat 3.csv
4,5,6,7
0,0,0,0
1,b,4,4
$ join -t ',' -1 2 -2 2 1.csv 2.csv | cut -f 1 -d ',' > temp
$ cat temp
b
f
$ join -t ',' -2 2 temp 3.csv | cut -f 1 -d ','
b
The first join (on the first two files) produces the joined value in the first column of the result. So as you join to file3, file4, file5, etc. You will be using the first column of the result you are generating, which is why you only need the -2 option. To keep things very efficient, always cut out all but the first column each time you do the join.