Bug #2774

Segfault when plotting large rectilinear grid

Added by Paul Melis 5 months ago. Updated 3 months ago.

Status: Pending
Priority: Normal
Assigned to: -
Category: -
Target version: 2.13.0
Likelihood: 3 - Occasional
Severity: 4 - Crash / Wrong Results
Found in Version: 2.12.1
Start: 02/24/2017
Due date:
% Done: 0%
OS: Linux
Support Group: Any

Description

We have a user that is trying to visualize a 1536x1536x1297 rectilinear grid, specified using an Xdmf file and two HDF5 files (one for grid coordinates and one for the volume data itself).

Loading the data in VisIt 2.12.2 (official binary) works fine, but two cases cause a segfault, in different places as far as I can tell from the logs:

1. Add Mesh plot -> thetacut; Draw
2. Add Volume plot -> Azimuthal Velocity; disable lighting in the volume properties; Draw

I've added a set of log files for both cases, plus a dummy Xdmf file that uses the original grid coordinate HDF5 file and a script to generate dummy volume data (I can't share the data itself).

The viz node I'm running this on has 64 GB of memory, and I don't see memory usage getting close to that limit.

dummy_volume_rendering.log.tgz - Volume plot logs (80.9 KB) Paul Melis, 02/24/2017 09:22 am

dummy.xmf - Xdmf file (826 Bytes) Paul Melis, 02/24/2017 09:22 am

gen_dummy.py - Python script to generate dummy volume data (232 Bytes) Paul Melis, 02/24/2017 09:22 am

cordin_info.h5 - Grid coordinate (53 KB) Paul Melis, 02/24/2017 09:22 am

History

Updated by Paul Melis 5 months ago

The original volume data actually used 64-bit floats, but I've reduced it to 32-bit floats for this bug report.

Updated by Eric Brugger 5 months ago

  • Status changed from New to Pending
  • Target version set to 2.13.0
  • Likelihood changed from 5 - Always to 3 - Occasional

Updated by Paul Melis 5 months ago

The change of Likelihood to Occasional suggests there are situations in which this does not occur? Is there any workaround I can give to the user? He is currently unable to visualize this data.

Updated by Paul Melis 3 months ago

I looked a bit into the stack trace after the segfault (for the case of adding the Volume plot) and it seems to be a case of integer overflow. This is with 2.12.2 sources, btw.

(gdb) bt 
#0  0x00007fbf48429125 in *__GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fbf4842c3a0 in *__GI_abort () at abort.c:92
#2  0x00007fbf4b7b615a in signalhandler_core (sig=11) at /scratch/visit2.12.2/src/common/misc/DebugStreamFull.C:196
#3  <signal handler called>
#4  0x00007fbf524a9b6c in AssignEight<float> (vals=0x7ffc8176c9e0, index=0x7ffc8176c9c0, s=1, m=0, 
    array=0x7fbc652c5010) at /scratch/visit2.12.2/src/avt/Filters/avtMassVoxelExtractor.C:232
#5  0x00007fbf524a0455 in AssignEight (vartype=10, vals=0x7ffc8176c9e0, index=0x7ffc8176c9c0, s=1, m=0, 
    array=0x7fbc652c5010) at /scratch/visit2.12.2/src/avt/Filters/avtMassVoxelExtractor.C:262
#6  0x00007fbf524a9345 in avtMassVoxelExtractor::ExtractImageSpaceGrid (this=0x307c3e0, rgrid=0x306f4e0, 
    varnames=..., varsizes=...) at /scratch/visit2.12.2/src/avt/Filters/avtMassVoxelExtractor.C:2654
#7  0x00007fbf524a0753 in avtMassVoxelExtractor::Extract (this=0x307c3e0, rgrid=0x306f4e0, varnames=..., varsizes=...)
    at /scratch/visit2.12.2/src/avt/Filters/avtMassVoxelExtractor.C:348
#8  0x00007fbf524dba62 in avtSamplePointExtractor::RasterBasedSample (this=0x7ffc8176d7a0, ds=0x306f4e0, num=0)
    at /scratch/visit2.12.2/src/avt/Filters/avtSamplePointExtractor.C:1032
#9  0x00007fbf524da8e4 in avtSamplePointExtractor::ExecuteTree (this=0x7ffc8176d7a0, dt=...)
    at /scratch/visit2.12.2/src/avt/Filters/avtSamplePointExtractor.C:764
#10 0x00007fbf524d9501 in avtSamplePointExtractor::Execute (this=0x7ffc8176d7a0)
    at /scratch/visit2.12.2/src/avt/Filters/avtSamplePointExtractor.C:367
#11 0x00007fbf50c4776e in avtFilter::Update (this=0x7ffc8176da88, contract=...)
    at /scratch/visit2.12.2/src/avt/Pipeline/AbstractFilters/avtFilter.C:292
#12 0x00007fbf50bd2b9c in avtDataObject::Update (this=0x306d6e0, contract=...)
    at /scratch/visit2.12.2/src/avt/Pipeline/Data/avtDataObject.C:131
#13 0x00007fbf524d3ac8 in avtResampleFilter::ResampleInput (this=0x30476a0)
    at /scratch/visit2.12.2/src/avt/Filters/avtResampleFilter.C:461
#14 0x00007fbf524d2da2 in avtResampleFilter::Execute (this=0x30476a0)
    at /scratch/visit2.12.2/src/avt/Filters/avtResampleFilter.C:172
#15 0x00007fbf50c4776e in avtFilter::Update (this=0x30477d8, contract=...)
    at /scratch/visit2.12.2/src/avt/Pipeline/AbstractFilters/avtFilter.C:292
#16 0x00007fbf50bd2b9c in avtDataObject::Update (this=0x3046190, contract=...)
    at /scratch/visit2.12.2/src/avt/Pipeline/Data/avtDataObject.C:131
#17 0x00007fbf50c63bc5 in avtDataObjectSink::UpdateInput (this=0x30482f0, spec=...)
    at /scratch/visit2.12.2/src/avt/Pipeline/Sinks/avtDataObjectSink.C:157
#18 0x00007fbf50c4758a in avtFilter::Update (this=0x30482b0, contract=...)
    at /scratch/visit2.12.2/src/avt/Pipeline/AbstractFilters/avtFilter.C:258
#19 0x00007fbf50bd2b9c in avtDataObject::Update (this=0x3048330, contract=...)
    at /scratch/visit2.12.2/src/avt/Pipeline/Data/avtDataObject.C:131
#20 0x00007fbf50c63bc5 in avtDataObjectSink::UpdateInput (this=0x299d550, spec=...)
    at /scratch/visit2.12.2/src/avt/Pipeline/Sinks/avtDataObjectSink.C:157
#21 0x00007fbf50c4758a in avtFilter::Update (this=0x299d510, contract=...)
    at /scratch/visit2.12.2/src/avt/Pipeline/AbstractFilters/avtFilter.C:258
#22 0x00007fbf50bd2b9c in avtDataObject::Update (this=0x299d590, contract=...)
    at /scratch/visit2.12.2/src/avt/Pipeline/Data/avtDataObject.C:131
#23 0x00007fbf50c6f314 in avtTerminatingSink::Execute (this=0x30495a0, contract=...)
    at /scratch/visit2.12.2/src/avt/Pipeline/Sinks/avtTerminatingSink.C:208
#24 0x00007fbf52895b1a in avtPlot::Execute (this=0x299a100, input=..., contract=..., atts=0x28391c8)
    at /scratch/visit2.12.2/src/avt/Plotter/avtPlot.C:624
#25 0x00007fbf5607e600 in DataNetwork::GetWriter (this=0x298d4e0, dob=..., contract=..., atts=0x28391c8)
    at /scratch/visit2.12.2/src/engine/main/DataNetwork.C:247
#26 0x00007fbf560c7fae in NetworkManager::GetOutput (this=0x25fb270, respondWithNullData=false, 
    calledForRender=false, cellCountMultiplier=0x7ffc8176f26c)
    at /scratch/visit2.12.2/src/engine/main/NetworkManager.C:2557
#27 0x00007fbf560a2d94 in EngineRPCExecutor<ExecuteRPC>::Execute (this=0x25a6870, rpc=0x25961a8)
    at /scratch/visit2.12.2/src/engine/main/Executors.h:1003
#28 0x00007fbf560b585f in EngineRPCExecutor<ExecuteRPC>::Update (this=0x25a6870, s=0x25961d8)
    at /scratch/visit2.12.2/src/engine/main/EngineRPCExecutor.h:67
#29 0x00007fbf4b95de12 in Subject::Notify (this=0x25961d8) at /scratch/visit2.12.2/src/common/state/Subject.C:193
#30 0x00007fbf4b82f776 in AttributeSubject::Notify (this=0x25961a8)
    at /scratch/visit2.12.2/src/common/state/AttributeSubject.C:99
#31 0x00007fbf4b99adb1 in Xfer::Process (this=0x2588710) at /scratch/visit2.12.2/src/common/state/Xfer.C:416
#32 0x00007fbf560a9865 in Engine::ProcessInput (this=0x25885d0) at /scratch/visit2.12.2/src/engine/main/Engine.C:1906
#33 0x00007fbf560a9688 in Engine::EventLoop (this=0x25885d0) at /scratch/visit2.12.2/src/engine/main/Engine.C:1850
#34 0x00000000004019c7 in EngineMain (argc=2, argv=0x7ffc8176f5d8) at /scratch/visit2.12.2/src/engine/main/main.C:331
#35 0x0000000000401adc in main (argc=10, argv=0x7ffc8176f5d8) at /scratch/visit2.12.2/src/engine/main/main.C:394
(gdb) frame 4
#4  0x00007fbf524a9b6c in AssignEight<float> (vals=0x7ffc8176c9e0, index=0x7ffc8176c9c0, s=1, m=0, 
    array=0x7fbc652c5010) at /scratch/visit2.12.2/src/avt/Filters/avtMassVoxelExtractor.C:232
232            vals[i] = (double) array[s*index[i]+m];
(gdb) p s*index[i]+m
$25 = -2111524864
(gdb) p i
$26 = 0
(gdb) p index[0]
$27 = -2111524864
(gdb) p m
$28 = 0
(gdb) p s
$29 = 1
(gdb) frame 6
#6  0x00007fbf524a9345 in avtMassVoxelExtractor::ExtractImageSpaceGrid (this=0x307c3e0, rgrid=0x306f4e0, 
    varnames=..., varsizes=...) at /scratch/visit2.12.2/src/avt/Filters/avtMassVoxelExtractor.C:2654
2654                                            s, m, pt_array);
(gdb) p nX
$31 = 1297
(gdb) p nY
$32 = 1536
(gdb) p nZ
$33 = 1536

The index array is set in avtMassVoxelExtractor::ExtractImageSpaceGrid() around line 2640:

                    int index[8];
                    index[0] = (zind)*nX*nY + (yind)*nX + (xind);
                    index[1] = (zind)*nX*nY + (yind)*nX + (xind+1);
                    index[2] = (zind)*nX*nY + (yind+1)*nX + (xind);
                    index[3] = (zind)*nX*nY + (yind+1)*nX + (xind+1);
                    index[4] = (zind+1)*nX*nY + (yind)*nX + (xind);
                    index[5] = (zind+1)*nX*nY + (yind)*nX + (xind+1);
                    index[6] = (zind+1)*nX*nY + (yind+1)*nX + (xind);
                    index[7] = (zind+1)*nX*nY + (yind+1)*nX + (xind+1);

followed by the call to AssignEight() using index[].

The total number of points in this dataset is more than 2^31 (1536*1536*1297 = 3,060,006,912), causing negative values due to overflow in the index[] array, which then causes out-of-bounds access of array[] in AssignEight<T>.
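
To illustrate this outside of the VisIt sources, here is a minimal standalone sketch of the overflow for these grid dimensions, and of how widening the index to a 64-bit type would avoid it (this is only my assumption of what a fix could look like, not an actual patch):

#include <cstdint>
#include <iostream>

int main()
{
    // Grid dimensions from the gdb session above.
    const int nX = 1297, nY = 1536, nZ = 1536;

    // Worst case: the cell in the far corner of the grid.
    const int xind = nX - 2, yind = nY - 2, zind = nZ - 2;

    // In the snippet above, index[7] = (zind+1)*nX*nY + (yind+1)*nX + (xind+1)
    // is evaluated in 32-bit int arithmetic. (zind+1)*nX*nY alone already
    // exceeds INT_MAX (2147483647), so in practice the result wraps to a
    // negative value (formally, signed overflow is undefined behavior), which
    // is how index[0] ends up negative in the gdb session.

    // The same index computed in 64-bit arithmetic stays exact:
    // 1297 * 1536 * 1536 - 1 = 3,060,006,911.
    std::int64_t index64 =
        static_cast<std::int64_t>(zind + 1) * nX * nY + (yind + 1) * nX + (xind + 1);

    // The value the 32-bit computation wraps around to (reproduced here
    // without invoking undefined behavior):
    int index32 = static_cast<int>(static_cast<std::uint32_t>(index64));

    std::cout << "64-bit index: " << index64 << "\n";  // 3060006911
    std::cout << "32-bit index: " << index32 << "\n";  // negative
    return 0;
}

Presumably the index parameter of AssignEight() (and the s*index[i]+m arithmetic at line 232) would need the same widening.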

I'm not very familiar with the VisIt sources, so my analysis might be wrong, but it's a bit surprising that a dataset that isn't extremely large would trigger this kind of bug. Surely VisIt is regularly used for much larger datasets? Or is the fact that this is an unstructured grid the exception?

Updated by Mark Miller 3 months ago

Paul Melis wrote:

> The total number of points in this dataset is more than 2^31 (1536*1536*1297 = 3,060,006,912), causing negative values due to overflow in the index[] array, which then causes out-of-bounds access of array[] in AssignEight<T>.

> I'm not very familiar with the VisIt sources, so my analysis might be wrong, but it's a bit surprising that a dataset that isn't extremely large would trigger this kind of bug. Surely VisIt is regularly used for much larger datasets? Or is the fact that this is an unstructured grid the exception?

Yes, but ordinarily when VisIt is processing a large dataset, that dataset is broken into pieces by the data producer (e.g. the simulation code) such that no single piece has O(max-int) elements. Different ranks operate on different pieces, from bytes on disk to pixels in a framebuffer, and those framebuffers are then composited together in parallel to produce the final image. Or, in the case of volume rendering, rays are traced in parallel through those pieces.

It turns out it is kinda rare to encounter single blocks of the size you are dealing with here. That said, it seems like switching to a 64-bit integer type here, or even a double (which has 52 bits of integer precision), might work.
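
To put numbers on that (using only the grid size from this report, nothing VisIt-specific): a double has 52 explicit mantissa bits, so it represents every integer up to 2^53 exactly, which leaves ample headroom for this grid.

#include <cstdint>
#include <iostream>

int main()
{
    // Point count of the grid in this report.
    const std::int64_t npoints = 1297LL * 1536LL * 1536LL;  // 3,060,006,912

    const std::int64_t int32_max = 2147483647LL;  // 2^31 - 1: too small
    const std::int64_t dbl_exact = 1LL << 53;     // doubles are exact up to here
                                                  // (int64_t goes up to 2^63 - 1)

    std::cout << (npoints > int32_max) << "\n";   // 1: overflows a 32-bit index
    std::cout << (npoints <= dbl_exact) << "\n";  // 1: fine as int64_t or double
    return 0;
}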

Updated by Paul Melis 3 months ago

Mark Miller wrote:

> Paul Melis wrote:

> > The total number of points in this dataset is more than 2^31 (1536*1536*1297 = 3,060,006,912), causing negative values due to overflow in the index[] array, which then causes out-of-bounds access of array[] in AssignEight<T>.

> > I'm not very familiar with the VisIt sources, so my analysis might be wrong, but it's a bit surprising that a dataset that isn't extremely large would trigger this kind of bug. Surely VisIt is regularly used for much larger datasets? Or is the fact that this is an unstructured grid the exception?

> Yes, but ordinarily when VisIt is processing a large dataset, that dataset is broken into pieces by the data producer (e.g. the simulation code) such that no single piece has O(max-int) elements. Different ranks operate on different pieces, from bytes on disk to pixels in a framebuffer, and those framebuffers are then composited together in parallel to produce the final image. Or, in the case of volume rendering, rays are traced in parallel through those pieces.

Sure, I understand that. However, I expect that producing data in multiple files is avoided as long as the post-processing tools being used provide sufficient performance on a single block (and "sufficient performance" is relative to the effort required to add parallel writing to a simulation code). 3D volumes in particular can get quite large when rendered on a single high-end GPU these days. You'd be surprised how many users do not use the parallel rendering features of VisIt (or ParaView, for that matter) because it takes extra effort to get them working.

> It turns out it is kinda rare to encounter single blocks of the size you are dealing with here. That said, it seems like switching to a 64-bit integer type here, or even a double (which has 52 bits of integer precision), might work.

One of the issues here is that there is an implicit assumption about the maximum size of the dataset, which isn't checked or enforced. Even when using 64-bit integers there will be a limit, which is fine in itself, as long as that limit is clear. So is the current limit 2^31 points per volume dataset?

I will ask the user if writing multiple smaller pieces is an option.

Updated by Paul Melis 3 months ago

Paul Melis wrote:

> I will ask the user if writing multiple smaller pieces is an option.

The user's reaction:


> The advantage of using HDF5 is that we can output all the data from different cores to the same file automatically. Actually, the simulation that produced the dataset I gave you was running on 800 cores.

This actually seems quite reasonable. Having a single result file is a lot better than multiple smaller ones in terms of pre-processing with a diverse set of tools. Are there any tricks in VisIt to pick a spatial subset of the dataset, to work around the overflow bug here? I know Xdmf allows picking complete HDF5 datasets, but I haven't looked into picking a subset of a dataset.
