I am trying to split a 13 GB file into equal chunks using the Linux Bash shell on Windows 10 with:
split -n l/13 myfile.csv
and I am getting the following error:
split: 'xaa' would overwrite input; aborting
The xaa file which is created is empty.
I have also tried using:
split -l 9000000 myfile.csv
which yields the same result.
I have used the split command before with similar arguments with no problem.
Any ideas what I am missing?
Thanks in advance.
EDIT: even if I provide my own prefix I still get the same error:
split -n l/13 myfile.csv completenewprefix
split: 'completenewprefixaa' would overwrite input; aborting
EDIT2:
ls -di completenewprefixaa myfile.csv
1 completenewprefixaa 1 myfile.csv
findmnt -T .
TARGET  SOURCE  FSTYPE  OPTIONS
/mnt/u  U:      drvfs   rw,relatime,case=off
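The ls -di output looks like the culprit: on this drvfs mount both files report inode 1, and split aborts when the output file appears to be the same file as the input (it compares device and inode numbers). A possible workaround, assuming that diagnosis, is to write the chunks to a Linux-native filesystem where inodes are unique:
# assumption: the home directory is on the Linux-native ext4 filesystem (not drvfs),
# so each file gets a real inode and split's overwrite check no longer misfires
mkdir -p ~/chunks
split -n l/13 myfile.csv ~/chunks/myfile_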
I am trying to compress & index a VCF file and am facing several issues.
When I use bgzip/tabix, it throws an error saying the file cannot be indexed due to unsorted positions.
# code used to bgzip and tabix
bgzip -c fn.vcf > fn.vcf.gz
tabix -p vcf fn.vcf.gz
# below is the error returned
[E::hts_idx_push] Unsorted positions on sequence #1: 115352924 followed by 115352606
tbx_index_build failed: fn.vcf.gz
When I use bcftools sort to sort this VCF to tackle #1, it throws an error due to invalid entries.
# code used to sort
bcftools sort -O z --output-file fn.vcf.gz fn.vcf
# below is the error returned
Writing to /tmp/bcftools-sort.YSrhjT
[W::vcf_parse_format] Extreme FORMAT/AD value encountered and set to missing at chr12:115350908
[E::vcf_parse_format] Invalid character '\x0F' in 'GT' FORMAT field at chr12:115352482
Error encountered while parsing the input
Cleaning
I've tried sorting with plain Linux commands to get around #2. However, when I run the code below, the size of fout.vcf is almost half that of fin.vcf, indicating something might be going wrong.
grep "^#" fin.vcf > fout.vcf
grep -v "^#" fin.vcf | sort -k1,1V -k2,2n >> fout.vcf
Please let me know if you have any advice regarding:
How I could sort/fix the problematic entries in my VCF in a safe and feasible way. (The file is 340G, so I cannot simply open it and edit.)
Why my Linux sort might be behaving in an odd way (i.e. returning a file much smaller than the original).
Any comments or suggestions are appreciated!
Try this:
mkdir tmp                                           # 1. create a tmp folder in your working directory
                                                    #    (or point -T at any other directory with enough space)
bcftools sort file.vcf -T ./tmp -Oz -o file.vcf.gz  # 2. sort, using that folder for temporary files
You can index your file after sorting it:
bcftools index file.vcf.gz
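If downstream tools expect the tabix-style .tbi index from the question rather than the default .csi, bcftools index can also write that format; a sketch using the question's file name:
bcftools index --tbi fn.vcf.gz   # same on-disk result as: tabix -p vcf fn.vcf.gz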
I have a (very) large CSV file, around 70 GB, which I am trying to sort using the sort command. Try as I might, the output is not being written to the file. Here is what I tried:
sort -T /data/data/.tmp -t "," -k 38 /data/data/raw/KKR.csv > /data/data/raw/KKR_38.csv
sort -T /data/data/.tmp -t "," -k 38 /data/data/raw/KKR.csv -o /data/data/raw/KKR-38.csv
What happens is that the KKR_38.csv file is created and its size is the same as the KKR.csv file but there is nothing inside it. When I do
head -n 100 /data/data/raw/KKR_38.csv
It prints out 100 empty lines.
If you sort, it is quite normal that the empty lines come first. Try this:
tail -100 /data/data/raw/KKR_38.csv
You can use the following commands if you want to leave the empty lines out of view:
cat -s /data/data/raw/KKR_38.csv | less   # squeeze successive empty lines into one
or, if you want to remove them:
sed '/^$/d' /data/data/raw/KKR_38.csv | less
You can redirect the output of those commands to create another file without the empty lines (watch out for the free space on your file system).
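For example, to materialise a cleaned copy (the output name here is just an illustration):
sed '/^$/d' /data/data/raw/KKR_38.csv > /data/data/raw/KKR_38_clean.csv   # needs roughly the input's size in free space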
I am trying to find common names in a file whose name is generated dynamically. But when I try to give the file name using the $ sign it is not getting replaced; I tried echo and then eval, but I get an error about an unexpected token (.
The code is as below:
hive -e "use $1;show tables;">$1.txt
eval $(echo "comm -12 <(sort -u hub_table_list) <(sort -u $1.txt) >result.txt")
The hive command runs successfully and the file is created with the parameter name.
It contains the table names.
All help appreciated.
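One thing worth checking: process substitution <( ) is a bash/ksh/zsh feature, and a script run under plain sh stops parsing at the ( with exactly this kind of "unexpected token" error; the echo/eval indirection also adds nothing here. A minimal sketch of the direct form, assuming the script runs under bash:
#!/bin/bash
hive -e "use $1; show tables;" > "$1.txt"
comm -12 <(sort -u hub_table_list) <(sort -u "$1.txt") > result.txt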
How do I get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error:
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
Decompress.
On some systems (e.g., Mac), you need to use gzcat.
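For example, with the file from the question:
gzcat CONN.20111109.0057.gz | head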
On a Mac you need to use < with zcat:
zcat < CONN.20111109.0057.gz|head
If a continuous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where lines 5 through 10 (both inclusive) of file.gz are extracted into a new subFile. For sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p'
which extracts the 1st line, jumps over 4 lines, picks the 6th line, and so on (first~step is a GNU sed address; note this reads the whole file, unlike the range above, which quits early).
If you want to use zcat, this will show the first 10 rows:
zcat your_filename.gz | head
Let's say you want the first 16 rows:
zcat your_filename.gz | head -n 16
This awk snippet will let you show not only the first few lines but any range you specify. It will also add line numbers, which I needed for debugging an error message pointing to a certain line way down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (the number of records read so far), which is usually equivalent to the line number. The from and to variables are picked up from the command line via the -v options.
NR>=from {
print NR,$0;
if (NR>=to)
exit 1
}
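If you reach for this often, the same body can live in its own file and be pulled in with awk -f (range.awk is a hypothetical file name holding exactly the block above):
gunzip -c file.gz | awk -v from=10 -v to=20 -f range.awk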
I am trying to combine two tab-separated text files, but one of the fields is being truncated by awk when I use the command below (please suggest something other than awk if that is easier):
pr -m -t test_v1 test.predict | awk -v OFS='\t' '{print $4,$5,$7}' > out_test8
The format of test_v1 is:
478 192 46 10203853138191712
but $4 only prints 10203853138, truncating the remaining digits. Should I use a string format?
Actually, after a suggestion, I found out that pr -m -t itself does not give the correct output. The command
pr -m -t test_v1 test.predict | cat -vte
outputs
478^I192^I46^I10203853138^I^I
I used paste test_v1 test.predict instead of pr and got the right answer.
Your problem is the use of pr -m (merge) here which, as per the manual:
-m, --merge
print all files in parallel, one in each column, truncate lines, but join lines of full length with -J
You can use:
paste test_v1 test.predict
Run dos2unix on your files first, you've just got control-Ms in your input file(s).
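Putting the two answers together, a sketch of the full sequence (dos2unix rewrites the files in place, so keep copies if you still need the originals):
dos2unix test_v1 test.predict    # strip the carriage returns first
paste test_v1 test.predict | awk -v OFS='\t' '{print $4,$5,$7}' > out_test8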