I have a dataset that's 7GB worth of data. I am reading it as follows:
path = direc + '2018-01-*.*'
ddf = dd.read_json(path,blocksize=None)
I used this method because reading it through pandas kept crashing my kernel and exhausting my memory (I am running this on my local machine).
I need to do a bunch of analysis, but any command seems to crash the kernel, whether I'm saving to Parquet, doing a count, or dropping duplicates.
Any suggestions on how I can run commands/manipulate this dataset?
From what I can see, everything that you're doing is fine. Dask shouldn't ever crash the kernel; the worst that might happen in a situation like this is that you run out of memory.
So you may have to figure out how to provide more information concisely to create an MCVE.
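If the files happen to be newline-delimited JSON, one thing worth trying is giving read_json a blocksize so no single partition has to hold a whole file, and streaming results straight to Parquet instead of collecting them. This is only a sketch; the blocksize and the output folders are guesses, not values from your post:
import dask.dataframe as dd

# A sketch, not a fix: this assumes the files are newline-delimited JSON records,
# so Dask can split them into blocks. 'direc' is the same directory variable as in
# the question; the blocksize and output folders are placeholders.
path = direc + '2018-01-*.*'
ddf = dd.read_json(path, lines=True, blocksize=2**26)  # ~64MB partitions

# Stay lazy and stream straight to Parquet rather than collecting into memory.
ddf.to_parquet(direc + 'parquet/', write_index=False)

# Pull back only small results; heavier operations can also be written out lazily.
print(len(ddf))
ddf.drop_duplicates().to_parquet(direc + 'deduped/', write_index=False)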
I'm looping over ~10k large files using go-exiftool.
I'm using one instance of go-exiftool to get the info for all required files.
This call runs 10k times in the loop, with a different file each time:
fileInfos := et.ExtractMetadata(file)
After ~7k iterations the program hangs. I debugged go-exiftool and found that it hangs in
https://github.com/barasher/go-exiftool/blob/master/exiftool.go#L121
on the line:
fmt.Fprintln(io.WriteCloser, "-execute")
If I understood correctly, that io.WriteCloser is the stdin pipe returned by exec.Command(binary, initArgs...).StdinPipe().
So, the questions are:
Does exec.Command have some kind of execution limit?
If not, what else could be the reason?
Does it depend on the file sizes? I tried another folder, and it processed 35k files before hanging. How can I check that?
UPDATE:
I tried running the same file through the 10k loop iterations and it works fine. It looks like something runs out of memory; can that be? I see no problem in the system memory graph. Or maybe stdin is overflowing. I have no idea how to check that.
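A workaround sketch I am considering (not something I have verified): recycle the exiftool instance every few thousand files so that whatever the long-running exiftool process accumulates gets released. The batch size and the file listing below are placeholders:
package main

import (
	"fmt"
	"log"

	exiftool "github.com/barasher/go-exiftool"
)

// listFiles is a stub standing in for however the ~10k paths are collected.
func listFiles() []string { return nil }

func main() {
	files := listFiles()

	const batchSize = 1000 // guess: restart exiftool every N files

	var et *exiftool.Exiftool
	var err error
	for i, file := range files {
		if i%batchSize == 0 {
			if et != nil {
				et.Close() // shut down the old exiftool child process
			}
			et, err = exiftool.NewExiftool()
			if err != nil {
				log.Fatalf("starting exiftool: %v", err)
			}
		}
		for _, info := range et.ExtractMetadata(file) {
			if info.Err != nil {
				log.Printf("%s: %v", info.File, info.Err)
				continue
			}
			fmt.Println(info.File, len(info.Fields))
		}
	}
	if et != nil {
		et.Close()
	}
}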
New to this forum - looks great!
I have some Processing code that periodically reads data wirelessly from remote devices and writes that data as bytes to a file, e.g. data.dat. I want to write an Objective C program on my Mac Mini using Xcode to read this file, parse the data, and act on the data if data values indicate a problem. My question is: can my two different programs access the same file asynchronously without a problem? If this is a problem can you suggest a technique that will allow these operations?
Thanks,
Kevin H.
Multiple processes can read from the same file at a time without any problem. A process can also read from a file while another writes to it without problem, although you'll have to take care to ensure that you read in any new data that was written. Multiple processes should not write to the same file at the same time, though. The OS will let you do it, but the ordering of the data will be undefined and you'll likely overwrite data; in general, you're gonna have a bad time if you do that. So you should take care to ensure that only one process writes to a file at a time.
The simplest way to protect a file so that only one process can write to it at a time is with the C function flock(), although that function is admittedly a bit rudimentary and may or may not suit your use case.
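As a rough illustration, here is a minimal sketch of the reader side using flock(); the file name comes from your description, but the locking convention (reader takes LOCK_SH, writer takes LOCK_EX around its writes) is an assumption you would have to mirror in the writer, since flock() is advisory:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>      /* open */
#include <unistd.h>     /* read, close */
#include <sys/file.h>   /* flock */

int main(void)
{
    int fd = open("data.dat", O_RDONLY);  /* file name taken from the question */
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    /* Block until we hold a shared lock. This only helps if the writer also
       takes an exclusive lock (LOCK_EX) around its writes; flock() is advisory. */
    if (flock(fd, LOCK_SH) < 0) { perror("flock"); return EXIT_FAILURE; }

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* parse the n bytes in buf here */
    }

    flock(fd, LOCK_UN);  /* release the lock (closing the fd also releases it) */
    close(fd);
    return EXIT_SUCCESS;
}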
I have a relatively simple bash script that reads from a set of static input files, stores the input in bash variables and then does a bunch of processing over said input by calling out to external scripts (e.g. written in Python, Go, other bash scripts etc.) and using the intermediate results.
Lately I have been experiencing an intermittent problem where a single character seems to be getting altered somewhere during the processing which then causes subsequent errors. Specifically, a lot of the processing I'm doing involves slicing up a list of comma-separated records, and one of the values on each line is a unix timestamp, e.g. 1354245000.
What seems to be happening is that occasionally one of these values will get altered slightly, so I end up with a timestamp like 13542458=2 or 13542458>2 or 13542458;2 coming out of one of the intermediate scripts. This then subsequently gets fed into another script, which throws an exception when it tries to parse the value to an integer.
In the title of this question, I've suggested that this might be a potential CPU/RAM error. I know the general folly of thinking errors are caused by low-level things like hardware or compilers, but the nature of this particular error makes me think it may be possible, for the following reasons:
The input files are the same on each invocation of the script, and the script only fails on some invocations.
I cannot think of any sources of randomness in the source code prior to where the script is breaking. It's basically just slicing and dicing csv input.
I cannot think of any sources of concurrency in the source code -- even the Go scripts aren't actually written to run anything concurrently.
This problem has only arisen in the last week or so. Prior to this time, this error would never occur.
While I haven't documented every erroneous character, they seem to often be quite close in the ASCII table to numeric values (=, >, ; etc.). That said, I suppose the Hamming distance between two characters that are far apart in the table can also be small if a high-order bit changes.
The script often breaks at a different stage on different runs, i.e. I have a number of separate Python scripts, and sometimes it'll make it past one script and then the error will be induced in another. Other times it'll be induced in an earlier script.
What I'd like to know is, is there any methodical way to either confirm or rule out a hardware error for this problem? Or if it is a hardware problem, is it possibly undetectable by the operating system?
A bit of further info on the machine:
Linux 64-bit, Ubuntu 12.04
Intel i7 processor
16GB DDR3 RAM
I'm hoping someone can either point me to a reliable way to verify whether the hardware is to blame or otherwise a sound reason as to what else might be the cause.
Try booting into Memtest to check your memory.
While it is highly unlikely that it will be hardware, if you have exhausted your standard software debugging as suggested by @OliCharlesworth, here is an outline of a hardware error investigation:
(1) Check your log area for any MCE logs (machine check exceptions). If you find any, either in your log area (syslog) or sometimes in the present working dir or /dir, you have a hardware failure.
(2) Check your log area for disk errors, e.g.:
smartd[3963]: Device: /dev/sda [SAT], 34 Currently unreadable (pending) sectors
(3) Check your drive integrity, e.g. (as root): `smartctl -a /dev/sda`. If there is any abnormality, run:
smartctl -t short /dev/sda (change the drive as required)
(4) Download/install/boot into [memtest86](http://www.memtest86.com/download.htm) and run the complete test.
If your CPU/motherboard has thrown no MCEs, you have no disk errors, your drive tests OK with smartctl, and you have no memory errors with memtest86, then recheck the software debugging. While additional hardware errors can still be present (bad capacitors, etc.), the likelihood at this point is software. Good luck.
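For step (1), a sketch of what checking the logs could look like on Ubuntu (log locations vary by distribution):
#!/bin/bash
# Sketch for step (1): grep the system logs for machine check exceptions.
# Log file locations are the usual Ubuntu ones; adjust for your system.
grep -iE 'mce|machine check' /var/log/syslog /var/log/kern.log 2>/dev/null

# dmesg holds the current boot's kernel messages as well.
dmesg | grep -iE 'mce|machine check|hardware error'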
I have a program that creates a file about 50MB in size. During the process the program frequently rewrites sections of the file and forces the changes to disk (on the order of 100 times). It uses a FileChannel and direct ByteBuffers via fc.read(...), fc.write(...) and fc.force(...).
New text:
I have a better view on the problem now.
The problem appears to be that I use three different JVMs to modify a file (one creates it, two others (launched from the first) write to it). Every JVM closes the file properly before the next JVM is started.
The problem is that the cost of fc.write() to that file occasionally goes through the roof for the third JVM (on the order of 100 times the normal cost). That is, all write operations are equally slow; it is not just one that hangs for a very long time.
Interestingly, one way to help this is to insert delays (2 seconds) between the launching of the JVMs. Without a delay, writing is always slow; with the delay, the writing is slow about every second time or so.
I also found this Stack Overflow question: "How to unmap a file from memory mapped using FileChannel in java?", which describes a problem for mapped files, which I'm not using.
What I suspect might be going on:
Java does not completely release the file handle when I call close(). When the next JVM is started, Java (or Windows) recognizes concurrent access to that file and installs some expensive concurrency handler for that file, which makes writing expensive.
Would that make sense?
The problem occurs on Windows 7 (Java 6 and 7, tested on two machines), but not under Linux (SuSE 11.3 64).
Old text:
The problem:
Starting the program as a JUnit test harness from Eclipse or from the console works fine; it takes around 3 seconds.
Starting the program through an ant task (or through JUnit by kicking off a separate JVM using a ProcessBuilder) slows the program down to 70-80 seconds for the same task (a factor of 20-30).
Using -Xprof reveals that the usage of 'force0' and 'pwrite' goes through the roof, from 34.1% (76+20 ticks) to 97.3% (3587+2913+751 ticks):
Fast run:
27.0% 0 + 76 sun.nio.ch.FileChannelImpl.force0
7.1% 0 + 20 sun.nio.ch.FileDispatcher.pwrite0
[..]
Slow run:
Interpreted + native Method
48.1% 0 + 3587 sun.nio.ch.FileDispatcher.pwrite0
39.1% 0 + 2913 sun.nio.ch.FileChannelImpl.force0
[..]
Stub + native Method
10.1% 0 + 751 sun.nio.ch.FileDispatcher.pwrite0
[..]
GC and compilation are negligible.
More facts:
No other methods show a significant change in the -Xprof output.
It's either fast or very slow, never something in-between.
Memory is not a problem, all test machines have at least 8GB, the process uses <200MB
Rebooting the machine does not help
Switching off virus scanners and similar stuff has no effect
When the process is slow, there is virtually no CPU usage
It is never slow when running it from a normal JVM
It is pretty consistently slow when running it in a JVM that was started from the first JVM (via ProcessBuilder or as ant-task)
All JVMs are exactly the same. I output System.getProperty("java.home") and the JVM options via RuntimeMXBean RuntimemxBean = ManagementFactory.getRuntimeMXBean(); List<String> arguments = RuntimemxBean.getInputArguments();
I tested it on two machines with Windows 7 64-bit, Java 7u2, Java 6u26 and JRockit. The hardware of the machines differs, but the results are very similar.
I tested it also from outside Eclipse (command-line ant) but no difference there.
The whole program is written by myself; all it does is read from and write to this file. No other libraries are used, especially no native libraries.
And some scary facts that I just can't believe make any sense:
Removing all class files and rebuilding the project sometimes (rarely) helps. The program (nested version) runs fast one or two times before becoming extremely slow again.
Installing a new JVM always helps (every single time!) such that the (nested) program runs fast at least once! Installing a JDK counts as two because both the JDK-jre and the JRE-jre work fine at least once. Overinstalling a JVM does not help. Neither does rebooting. I haven't tried deleting/rebooting/reinstalling yet ...
These are the only two ways I ever managed to get fast program runtimes for the nested program.
Questions:
What may cause this performance drop for nested JVMs?
What exactly do these methods do (pwrite0/force0)?
Are you using local disks for all testing (as opposed to any network share)?
Can you set up Windows with a RAM drive to store the data? When a JVM terminates, by default its file handles will have been closed, but what you might be seeing is the flushing of the data to the disk. When you overwrite lots of data, the previous version of the data is discarded and may not cause disk IO. The act of closing the file might make the Windows kernel implicitly flush data to disk. So using a RAM drive would allow you to confirm that, since disk IO time would be removed from your stats.
Find a tool for Windows that allows you to force the kernel to flush all buffers to disk, use it in between JVM runs, and see how long that takes.
But I would guess you are hitting some interaction between the demands of the process and the demands of the kernel in attempting to manage the disk block buffer cache. On Linux there is a tool like "/sbin/blockdev --flushbufs" that can do this.
FWIW
"pwrite" is a Linux/Unix API for allowing concurrent writing to a file descriptor (which would be the best kernel syscall API to use for the JVM, I think Win32 API already has provision for the same kinds of usage to share a file handle between threads in a process, but since Sun have Unix heritige things get named after the Unix way). Google "pwrite(2)" for more info on this API.
"force" I would guess that is a file system sync, meaning the process is requesting the kernel to flush unwritten data (that is currently in disk block buffer cache) into the file on the disk (such as would be needed before you turned your computer off). This action will happen automatically over time, but transactional systems require to know when the data previously written (with pwrite) has actually hit the physical disk and is stored. Because some other disk IO is dependant on knowing that, such as with transactional checkpointing.
One thing that could help is making sure you explicitly set the FileChannel to null. Then call System.runFinalization() and maybe System.gc() at the end of the program. You may need more than 1 call.
System.runFinalizersOnExit(true) may also help, but it's deprecated so you will have to deal with the compiler warnings.
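A minimal sketch of that teardown sequence, with a placeholder file name:
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class CleanupSketch {
    static FileChannel fc; // assuming the channel is held in a field

    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("test.dat", "rw"); // placeholder name
        fc = raf.getChannel();
        // ... reads, writes, forces ...
        fc.close();
        raf.close();
        fc = null;                // drop the reference explicitly
        System.runFinalization(); // may need to be called more than once
        System.gc();
    }
}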
I have a VS 2005 application using C++. It basically imports a large XML file of around 9 GB into the application. After running for more than 18 hrs it gave an exception, 0xC0000006 (in-page error). The virtual memory consumed is 2.6 GB (I have set the 3GB flag).
Does anyone have a clue as to what caused this error and what the solution could be?
Instead of loading the whole file into memory, you can use a SAX parser to keep only a small part of the file in memory at a time.
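As a rough illustration (not necessarily how your code is structured), here is a minimal sketch of chunked parsing with Expat, which a later answer mentions; the file name, buffer size and handlers are placeholders, and the point is that only one small buffer is in memory at any time:
#include <cstdio>
#include <expat.h>

// Handlers are placeholders: accumulate only the elements you care about,
// and discard everything else as it streams past.
static void XMLCALL onStartElement(void *userData, const XML_Char *name,
                                   const XML_Char **atts) {
    (void)userData; (void)name; (void)atts;
}

static void XMLCALL onEndElement(void *userData, const XML_Char *name) {
    (void)userData; (void)name;
}

int main() {
    std::FILE *f = std::fopen("huge.xml", "rb"); // placeholder file name
    if (!f) return 1;

    XML_Parser parser = XML_ParserCreate(NULL);
    XML_SetElementHandler(parser, onStartElement, onEndElement);

    char buf[64 * 1024]; // only 64KB of the 9GB file is in memory at a time
    int done = 0;
    do {
        size_t len = std::fread(buf, 1, sizeof buf, f);
        done = std::feof(f) ? 1 : 0;
        if (XML_Parse(parser, buf, (int)len, done) == XML_STATUS_ERROR) {
            std::fprintf(stderr, "parse error at line %lu: %s\n",
                         (unsigned long)XML_GetCurrentLineNumber(parser),
                         XML_ErrorString(XML_GetErrorCode(parser)));
            break;
        }
    } while (!done);

    XML_ParserFree(parser);
    std::fclose(f);
    return 0;
}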
9GB seems overly large to read in; I would say that even 3GB is too large in one go.
Is your OS 64-bit?
What is the maximum pagefile size set to?
How much RAM do you have?
Were you running this in debug or release mode?
I would suggest that you try reading the XML in smaller chunks.
Why are you trying to read in such a large file in one go?
I would imagine that your application took so long to run before failing because it started to copy the file into virtual memory, which is basically a large file on the hard disk. Thus the OS was reading the XML from the disk and writing it back onto a different area of the disk.
**Edit - added text below**
Having had a quick peek at the Expat XML parser, it does look as if you're running into problems with stack or event handling; most likely you are adding too much to the stack.
Do you really need 3GB of data on the stack? At a guess I would say that you are trying to process an XML database file, but I can't imagine that you have a table row that is so large.
I think that really you should use it to search for key areas and discard what is not wanted.
I know nothing other than what I have just read about the Expat XML parser, but I would suggest that you are not using it in the most efficient manner.