argument getting truncated while printing in unix after merging files - bash

I am trying to combine two tab-separated text files, but one of the fields is truncated when I run the command below (please suggest something other than awk if that would be easier):
pr -m -t test_v1 test.predict | awk -v OFS='\t' '{print $4,$5,$7}' > out_test8
The format of the test_v1 is
478 192 46 10203853138191712
but only 10203853138 is printed for $4; the remaining digits are cut off. Should I use a string format?
Update: after a suggestion, I found that pr -m -t itself does not produce the correct output. The command
pr -m -t test_v1 test.predict | cat -vte
prints
478^I192^I46^I10203853138^I^I
I used paste test_v1 test.predict instead of pr and got the right answer.

Your problem is the use of pr -m (merge), which truncates lines, as per the manual:
-m, --merge
print all files in parallel, one in each column, truncate lines, but join lines of full length with -J
You can use:
paste test_v1 test.predict
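As a rough sketch of the whole pipeline with paste in place of pr: the contents of test.predict are not shown in the question, so the three-column sample below is purely an assumption, as are the temporary file locations.

```shell
# Hypothetical sample data; the real layout of test.predict is not shown in the question.
tmp=$(mktemp -d)
printf '478\t192\t46\t10203853138191712\n' > "$tmp/test_v1"
printf '1\t0.25\t0.75\n' > "$tmp/test.predict"

# paste joins corresponding lines with a tab and does not truncate them.
# -F '\t' makes awk split on tabs only, matching the tab-separated input.
merged=$(paste "$tmp/test_v1" "$tmp/test.predict" |
    awk -F '\t' -v OFS='\t' '{print $4, $5, $7}')

printf '%s\n' "$merged"
rm -r "$tmp"
```

With this sample, the long number in $4 comes through with all of its digits, because awk prints an unmodified field as the original text.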

Run dos2unix on your files first, you've just got control-Ms in your input file(s).

Related

Passing parameter as control number and get table name

I have a scenario where there is a file with control number and table name, hereby an example:
1145|report_product|N|N|
1156|property_report|N|N
I need to pass a control number such as 1156 and get the table name back as PR. Once I have the table name as PR, I need to add some text to it.
Please help
Assuming the control file is:
# cat controlfile.txt
1145|report_product|N|N
1156|property_report|N|N
To find a line you can use:
grep 1156 controlfile.txt
If needed, you can save it in a variable: result=$(grep 1156 controlfile.txt)
Assuming you need to append something to this line, you can use:
sed '/^1156/s/$/ 123/' controlfile.txt
This example adds " 123" to the end of the line that starts with 1156.
If needed, add more details like what output you want or anything else to help us better understand your need.
You need to work in two stages:
You need to find the line, containing 1156.
You need to get the information from that line.
In order to find the line (as already indicated by Juranir), you can use grep:
Prompt> grep "1156" control.txt
1156|property_report|N|N
In order to get the information from that line, you need to get the second column, based on the vertical line (often referred to as a "pipe" character), for which there are different approaches. I'll give you two:
The cut approach: you can cut a line into different parts and take a character, a byte, a column, .... In this case, this is what you need:
grep "1156" control.txt | cut -d '|' -f 2
-d '|' : use the vertical line as a column separator
-f 2 : show the second field (column)
The awk approach: awk is a general "text modifier" with multiple features (showing parts of text, performing basic calculations, ...). For this case, it can be used as follows:
grep "1156" control.txt | awk -F '|' '{print $2}'
-F '|' : use the vertical line as a column separator
'{print $2}' : the awk script for showing the second field.
Oh, by the way, I've edited your question. You might press the edit button in order to learn how I did this :-)
For getting only the first letters, separated by the underscores:
grep "1156" control.txt | awk -F '|' '{print $2}' | awk -F '_' '{print substr($1,1,1) substr($2,1,1)}'
(something like that)
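The two stages can also be folded into a single awk call. The sketch below is one possible way to do it, not the only one; the function name table_initials is made up for illustration. It compares the first field exactly, so a control number like 11560 would not match 1156 the way a plain grep "1156" would.

```shell
# table_initials CONTROL_NUMBER FILE
# (hypothetical helper name, not part of the original answers)
table_initials() {
    awk -F '|' -v id="$1" '
        $1 == id {                      # exact match on the control number
            n = split($2, parts, "_")   # property_report -> property, report
            out = ""
            for (i = 1; i <= n; i++)
                out = out toupper(substr(parts[i], 1, 1))
            print out
            exit                        # stop at the first matching line
        }' "$2"
}

ctl=$(mktemp)
printf '1145|report_product|N|N\n1156|property_report|N|N\n' > "$ctl"
initials=$(table_initials 1156 "$ctl")
printf '%s\n' "$initials"    # prints PR
rm "$ctl"
```

Extra text can then be appended in the shell, for example "${initials}_suffix".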

Trying to sort a large csv file but the output is not being written to another file

I have a (very) large csv file almost around 70GB which I am trying to sort using the sort command. As much as I am trying, the output is not being written to file. Here is what I tried
sort -T /data/data/.tmp -t "," -k 38 /data/data/raw/KKR.csv > /data/data/raw/KKR_38.csv
sort -T /data/data/.tmp -t "," -k 38 /data/data/raw/KKR.csv -o /data/data/raw/KKR-38.csv
What happens is that the KKR_38.csv file is created and its size is the same as the KKR.csv file but there is nothing inside it. When I do
head -n 100 /data/data/raw/KKR_38.csv
It prints out 100 empty lines.
When you sort, it is quite normal for the empty lines to come first, so the actual data ends up at the end of the file. Try this:
tail -n 100 /data/data/raw/KKR_38.csv
You can use the following commands if you do not want to take the empty lines into account:
cat -s /data/data/raw/KKR_38.csv | less #to squeeze the successive empty lines to only one
or if you want to remove them:
sed '/^$/d' /data/data/raw/KKR_38.csv | less
You can redirect the output of those commands to create another file without the empty line (watch out for the space on your file system).
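To see the effect on a small scale, here is a sketch with a four-line stand-in for the real file (the sample data and the two-column layout are assumptions). It counts the empty lines, sorts, and then filters the empty lines out.

```shell
# Tiny stand-in for the 70GB CSV; two of the four lines are empty.
sample=$(mktemp)
printf 'b,2\n\n\na,1\n' > "$sample"

# How many empty lines does the input contain?
blank=$(grep -c '^$' "$sample")

# Sort on the second field only (-k 2,2; a bare -k 2 would use field 2 through end of line).
sorted=$(sort -t ',' -k 2,2 "$sample")

# The empty lines sort to the top, so the first line of the result is empty.
first_line=$(printf '%s\n' "$sorted" | head -n 1)

# Drop the empty lines to see the real data.
nonblank=$(printf '%s\n' "$sorted" | sed '/^$/d')

printf 'empty lines: %s\n' "$blank"
printf '%s\n' "$nonblank"
rm "$sample"
```

If the count of empty lines in the real file is large, the first 100 lines of the sorted output being empty is expected rather than a sign that sort failed.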

How to use shell to solve the scripts and about file?

I have a question:
file:
154891
145690
165211
190189
135901
290134
I want to output like this: (Every three uid separated by comma)
154891,145690,165211
190189,135901,290134
How can I do it?
You can use pr:
pr -3 -s, -l 1 file
Print in 3 columns, with commas as separators, with a 'page length' of 1.
154891,145690,165211
190189,135901,290134
sed ':1;N;s/\n/,/;0~3b;t1' file
or
awk 'ORS=NR%3?",":"\n"' file
There could be many ways to do that, pick one you like, with/out comma ",":
$ awk '{printf "%s%s",$0,(NR%3?",":RS)}' file
154891,145690,165211
190189,135901,290134
$ xargs -n3 -a file
154891 145690 165211
190189 135901 290134
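One more option, as a sketch: paste reads one line per "-" argument from standard input, so three dashes consume three lines per output row. The uids are fed in with printf here only to keep the example self-contained.

```shell
# Each "-" makes paste read one line from standard input per output row.
grouped=$(printf '%s\n' 154891 145690 165211 190189 135901 290134 |
    paste -d ',' - - -)

printf '%s\n' "$grouped"
# 154891,145690,165211
# 190189,135901,290134
```

With a real file, the same thing would be paste -d ',' - - - < file.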

How to quickly check a .gz file without unzip? [duplicate]

How do I get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error:
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing
them.
-d --decompress --uncompress
Decompress.
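A small self-contained sketch of the same idea, using a throwaway file (the file name and contents are made up): it compresses a sample, reads the first lines with gzip -cd, and checks that the .gz file is left in place.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' alpha beta gamma delta epsilon > "$tmpdir/sample.txt"
gzip "$tmpdir/sample.txt"              # replaces it with sample.txt.gz

# -c writes to stdout, -d decompresses; the .gz file itself is not modified.
first_two=$(gzip -cd "$tmpdir/sample.txt.gz" | head -n 2)

still_there=no
[ -f "$tmpdir/sample.txt.gz" ] && still_there=yes

printf '%s\n' "$first_two"
rm -r "$tmpdir"
```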
On some systems (e.g., Mac), you need to use gzcat.
On a mac you need to use the < with zcat:
zcat < CONN.20111109.0057.gz|head
If a contiguous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where lines 5 through 10 (both inclusive) of file.gz are extracted into a new subFile; the 11q makes sed quit at line 11 instead of reading the rest of the file. For sed options, refer to the manual.
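If every, say, 5th line is required (the first~step address is a GNU sed extension):
gunzip -c file.gz | sed -n '1~5p;6q' > subFile
which prints the 1st line, skips 4 lines, prints the 6th line, and then quits because of 6q. Remove the 6q, or raise it, to keep going with lines 11, 16, and so on.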
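Where a sed without the first~step address form is in use, a similar selection can be sketched with awk. The input is generated with printf here purely for illustration; in practice it would come from gunzip -c file.gz.

```shell
# NR % 5 == 1 is true for lines 1, 6, 11, ... (every 5th line, starting at the first).
every_fifth=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 | awk 'NR % 5 == 1')

printf '%s\n' "$every_fifth"
# 1
# 6
# 11
```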
If you want to use zcat, this will show the first 10 rows
zcat your_filename.gz | head
Let's say you want the first 16 rows:
zcat your_filename.gz | head -n 16
This awk snippet lets you show not only the first few lines but any range you specify. It also adds line numbers, which I needed for debugging an error message pointing to a certain line way down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (the number of records read so far), which is usually equivalent to the line number. The from and to variables are picked up from the command line via the -v options.
NR>=from {
    print NR,$0;
    if (NR>=to)
        exit 1
}

bash: shortest way to get n-th column of output

Let's say that during your workday you repeatedly encounter the following form of columnized output from some command in bash (in my case from executing svn st in my Rails working directory):
? changes.patch
M app/models/superman.rb
A app/models/superwoman.rb
In order to work with the output of your command (in this case the filenames), some sort of parsing is required so that the second column can be used as input for the next command.
What I've been doing is to use awk to get at the second column, e.g. when I want to remove all files (not that that's a typical usecase :), I would do:
svn st | awk '{print $2}' | xargs rm
Since I type this a lot, a natural question is: is there a shorter (thus cooler) way of accomplishing this in bash?
NOTE:
What I am asking is essentially a shell command question even though my concrete example is on my svn workflow. If you feel that workflow is silly and suggest an alternative approach, I probably won't vote you down, but others might, since the question here is really how to get the n-th column command output in bash, in the shortest manner possible. Thanks :)
You can use cut to access the second field:
cut -f2
Edit:
Sorry, didn't realise that SVN doesn't use tabs in its output, so that's a bit useless. You can tailor cut to the output but it's a bit fragile - something like cut -c 10- would work, but the exact value will depend on your setup.
Another option is something like: sed 's/.\s\+//'
To accomplish the same thing as:
svn st | awk '{print $2}' | xargs rm
using only bash you can use:
svn st | while read -r a b; do rm "$b"; done
Granted, it's not shorter, but it's a bit more efficient and it handles whitespace in your filenames correctly.
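To illustrate the whitespace point, here is a sketch that feeds made-up status lines (not real svn output) through both approaches. The function name second_column is hypothetical. With read, the last variable receives the rest of the line, so a path with a space in it stays in one piece, while awk's $2 stops at the first space.

```shell
# Print everything after the first whitespace-separated word of each line.
# (hypothetical helper name; -r stops read from interpreting backslashes)
second_column() {
    while read -r _status path; do
        printf '%s\n' "$path"
    done
}

# Made-up sample lines standing in for "svn st" output.
sample_lines() {
    printf '%s\n' '?       changes.patch' 'M       app/models/my file.rb'
}

with_read=$(sample_lines | second_column)
with_awk=$(sample_lines | awk '{print $2}')

printf '%s\n' "$with_read"
```

Here with_read keeps "app/models/my file.rb" whole, whereas with_awk ends up with "app/models/my" for that line.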
I found myself in the same situation and ended up adding these aliases to my .profile file:
alias c1="awk '{print \$1}'"
alias c2="awk '{print \$2}'"
alias c3="awk '{print \$3}'"
alias c4="awk '{print \$4}'"
alias c5="awk '{print \$5}'"
alias c6="awk '{print \$6}'"
alias c7="awk '{print \$7}'"
alias c8="awk '{print \$8}'"
alias c9="awk '{print \$9}'"
Which allows me to write things like this:
svn st | c2 | xargs rm
Try zsh. It supports global aliases (alias -g), so you can define X in your .zshrc to be
alias -g X="| cut -d' ' -f2"
then you can do:
cat file X
You can take it one step further and define it for the nth column:
alias -g X2="| cut -d' ' -f2"
alias -g X1="| cut -d' ' -f1"
alias -g X3="| cut -d' ' -f3"
which will output the nth column of file "file". You can do this for grep output or less output, too. This is very handy and a killer feature of the zsh.
You can go one step further and define D to be:
alias -g D="|xargs rm"
Now you can type:
cat file X1 D
to delete all files mentioned in the first column of file "file".
If you know the bash, the zsh is not much of a change except for some new features.
HTH Chris
Because you seem to be unfamiliar with scripts, here is an example.
#!/bin/sh
# usage: svn st | x 2 | xargs rm
col=$1
shift
awk -v col="$col" '{print $col}' "${@--}"
If you save this in ~/bin/x and make sure ~/bin is in your PATH (now that is something you can and should put in your .bashrc) you have the shortest possible command for generally extracting column n; x n.
The script should do proper error checking and bail if invoked with a non-numeric argument or the incorrect number of arguments, etc; but expanding on this bare-bones essential version will be in unit 102.
Maybe you will want to extend the script to allow a different column delimiter. Awk by default parses input into fields on whitespace; to use a different delimiter, use -F ':' where : is the new delimiter. Implementing this as an option to the script makes it slightly longer, so I'm leaving that as an exercise for the reader.
Usage
Given a file file:
1 2 3
4 5 6
You can either pass it via stdin (using a useless cat merely as a placeholder for something more useful);
$ cat file | sh script.sh 2
2
5
Or provide it as an argument to the script:
$ sh script.sh 2 file
2
5
Here, sh script.sh is assuming that the script is saved as script.sh in the current directory; if you save it with a more useful name somewhere in your PATH and mark it executable, as in the instructions above, obviously use the useful name instead (and no sh).
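For anyone who wants to attempt the exercise, the following is one possible sketch of the script with a basic argument check and an optional delimiter, written as a shell function so it can be tried in place. The name nth_col and the -F option handling are illustrative assumptions, not part of the original script.

```shell
# nth_col [-F delimiter] column [file...]
# (hypothetical name; function form of the script, with an optional -F)
nth_col() {
    delim=' '                       # a single space means awk's default whitespace splitting
    if [ "$1" = "-F" ]; then
        delim=$2
        shift 2
    fi
    case $1 in
        ''|*[!0-9]*)                # reject a missing or non-numeric column
            echo "usage: nth_col [-F delim] column [file...]" >&2
            return 2 ;;
    esac
    col=$1
    shift
    # With no file arguments, read standard input ("-").
    if [ "$#" -eq 0 ]; then
        set -- -
    fi
    awk -F "$delim" -v col="$col" '{print $col}' "$@"
}

printf '1 2 3\n4 5 6\n' | nth_col 2          # prints 2 and 5
printf 'a:b:c\n' | nth_col -F : 3            # prints c
```

The variables are not declared local because plain sh has no portable local keyword; in a standalone script that does not matter.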
It looks like you already have a solution. To make things easier, why not just put your command in a bash script (with a short name) and just run that instead of typing out that 'long' command every time?
If you are ok with manually selecting the column, you could be very fast using pick:
svn st | pick | xargs rm
Just go to any cell of the 2nd column, press c and then hit enter
Note that the file path does not have to be in the second column of the svn st output. For example, if you modify a file and also modify its properties, the path will be in the 3rd column.
See possible output examples in:
svn help st
Example output:
M wc/bar.c
A + wc/qax.c
I suggest cutting off the leading status columns with cut -c8-, which keeps everything from the 8th character onward:
svn st | cut -c8- | while read FILE; do echo whatever with "$FILE"; done
If you want to be 100% sure, and deal with fancy filenames with white space at the end for example, you need to parse xml output:
svn st --xml | grep -o 'path=".*"' | sed 's/^path="//; s/"$//'
Of course you may want to use some real XML parser instead of grep/sed.
