Index by title

April 27 2011


Axom and HDF5

Currently, this is just a landing place for hosting data files to point at from THG's Confluence Site


Bi-Weekly Meeting Notes

January 19, 2011
February 2, 2011
February 10, 2011
February 16, 2011
March 16, 2011
March 30, 2011
April 27, 2011
June 8, 2011
July 20, 2011
August 31, 2011
September 14, 2011
September 28, 2011
December 21, 2011
January 12, 2012
March 14, 2012
August 29, 2012

January 19 2011

Attendees: Mark, Richard, Quincey, Albert, Scott
Action items are footnoted with the responsible person (see key at the end)

Computer Accounts

Albert has his account and has logged into DawnDev. Quincey just got his crypto-card and will complete the sign-on process ASAP. Start the ball rolling on accounts for Scott on DawnDev and other Linux LC resources [1] and on the Silo redmine [2].

What about other DOE sites such as ANL/ORNL? Quincey/Albert might already have some access on other relevant/interesting sites. Will look into how appropriate it might be to use existing accounts for activities related to this work [3]. Look into ANL/ORNL accounts for Quincey, Albert, and Scott [1].

We'll start trying to get Cielito accounts in mid-February.

Regular meeting time

We agreed to meet bi-weekly, every other Wednesday at 1pm PST [2]. We will use the teleconference line Mark has set up. Note that this teleconference system also provides a shared whiteboard capability, which might come in handy for more detailed technical discussion.

Poor Man's parallel I/O (PMPIO) tests.

Silo already has some simple PMPIO tests, but we need to create semantically equivalent tests for HDF5's test suite. Mark gave a brief overview of Poor Man's Parallel I/O. For more details, download the Silo tarball from silo.llnl.gov, untar it, and look at

src/silo/pmpio.h
tests/pmpio_hdf5_test.c
tests/pmpio_silo_test_mesh.c

Note that pmpio_hdf5_test.c is purely HDF5 code and should compile and link without any Silo code. The pmpio.h file contains most of the magic for doing very simple PMPIO. A key feature of PMPIO is that it allows the application to vary the number of files created (i.e. the amount of concurrency) seamlessly, from 1 (basically serial I/O) up to numProcs (i.e. file per processor).
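
For readers new to the pattern, here is a minimal, hypothetical sketch of the PMPIO idea in plain MPI C. It does not use the actual pmpio.h API: ranks are split into numFiles contiguous groups, the first rank in each group creates the group's file, and later ranks append only after receiving a baton from their predecessor.

  /*
   * Minimal sketch of the PMPIO idea; NOT the actual pmpio.h API.
   * Ranks are split into numFiles contiguous groups; within a group,
   * ranks take turns (pass a baton) appending to the group's file.
   */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, numFiles = 8;              /* tunable level of concurrency */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int perGroup  = size / numFiles;           /* assume size divides evenly   */
      int group     = rank / perGroup;           /* which file this rank writes  */
      int rankInGrp = rank % perGroup;           /* position in the baton chain  */
      char fname[64], baton = 0;
      snprintf(fname, sizeof fname, "dump_%03d.dat", group);

      /* Wait for the baton from the previous rank in this file group. */
      if (rankInGrp > 0)
          MPI_Recv(&baton, 1, MPI_CHAR, rank - 1, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);

      FILE *fp = fopen(fname, rankInGrp == 0 ? "w" : "a");
      fprintf(fp, "data from rank %d\n", rank);  /* stand-in for real I/O */
      fclose(fp);

      /* Hand the baton to the next rank in this file group, if any. */
      if (rankInGrp < perGroup - 1)
          MPI_Send(&baton, 1, MPI_CHAR, rank + 1, 0, MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
  }

With numFiles = 1 this degenerates to serialized I/O through a single file; with numFiles equal to the number of ranks, every rank writes its own file with no waiting.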

Testing on <64 procs nightly and ~1024 procs weekly should be easily done on DawnDev as well as one or more of LLNL's linux clusters. Richard explains that quarterly tests on >16K processors may involve aligning tests with system upgrades (which occur about every 2 months) or getting some dedicated access time. Also, larger runs are likely to require Dawn so only Richard will be able to run those tests.

Note: In a perfect world, it would be really good to collect a whole scalability curve from 2-128K processors on a fairly regular basis. We should shoot for this even if we fall short.

We want to set up a cron job [3] that submits tests to the batch system at the desired intervals. By keeping the expected wall-clock time of the tests short and processor counts low (<1-2K), we should be able to ensure tests run within 8-10 hours of submission.

Albert is running into snags getting HDF5 built on DawnDev. Albert should feel free to pester Richard (or Mark) for help. Richard will work with Albert to get the current issues resolved [1]. Send vugraphs on building/using Dawn/DawnDev to Albert [1].

Quincey/Albert prefer to stand up the existing HDF5 test suite on DawnDev before adding new PMPIO tests. This is fine. Quincey/Albert are interested in integrating with the HDF5 SVN repo as well. Mark asks that SVN/repo integration be done in parallel with adding the new PMPIO tests.

Once the new HDF5 PMPIO tests are written, Mark also asks that we get broad-view baseline data with HDF5-1.8.4.

Quincey/Albert to include Silo in HDF5's external software test suite.

LLNL End-to-end routine testing

Once we have the HDF5 testing substrate in place on LLNL systems, we should build upon it from the applications down. Pick one Silo application such as Ares/Ale3d and one HDF5 application (Richard, can you suggest one?), and get some cron jobs running where benchmark I/O runs for these applications are routinely checked and compared [1]. This is potentially a lot of work. We also don't want the layers above HDF5 to change (e.g. Silo and the application should be kept at the same version).

Scoping Parallel I/O benchmarking tool

Would like to test a variety of I/O paradigms. Would like to test a variety of I/O interfaces (whole software stack).

HDF5-specific parameters represent one orthogonal axis of the benchmarking space. Other axes include processor counts, request sizes and the like.

We discussed existing benchmarks. IOR is available on SourceForge and Richard is inclined to suggest starting from that. We agree it doesn't use HDF5 intelligently, and some of the paradigms we wish to explore may be fundamentally outside the scope of IOR's current design. Richard/Albert will look into this [1]. Another option is H5perf.

As part of Silo, Mark developed a plugin-based infrastructure to facilitate comparing various I/O interfaces in a sane way. Mark asks that Albert take a look at the following sources in Silo...

tests/ioperf.c
tests/ioperf.h
tests/ioperf_hdf5.c
tests/ioperf_pdb.c
tests/ioperf_sec2.c
tests/ioperf_silo.c
tests/ioperf_stdio.c

We agreed that a parallel I/O benchmark achieving everything we all might like to include is a large effort, and we should probably adopt the strategy of getting some initial groundwork in place and then deciding on an as-needed basis how/where to enhance it as the project proceeds. FWIW, I think a fully developed tool including the various features we envision would be a) well received by the rest of the HPC and I/O community and b) very useful long term.

Not enough time to discuss

Silo's block-based VFD

Albert/Quincey/Scott, please get the Silo 4.8 source code and have a look at

src/hdf5_drv/H5FDsilo.c
src/hdf5_drv/silo_hdf5.c

The more important of the two is H5FDsilo.c, which is the VFD code. silo_hdf5.c is the driver code mapping the Silo API onto HDF5. It affects how Silo uses HDF5 and in this regard is relevant to The HDF Group. In fact, I'd be interested in a) walking you through this code if you are interested, b) your initial reactions to how we're using HDF5 from Silo, and c) suggestions to improve how we use HDF5. Also, please have a look at Overview of Silo, block-based VFD and Scaling Studies of block-based VFD on BG/P.

NIF 'open file on a buffer of bytes' VFD

Quincey says this should make it into the May release.
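
For reference, a minimal sketch of what "open a file on a buffer of bytes" looks like through HDF5's file image property, which (as I understand it) is the capability that grew out of this request. The buffer is assumed to already hold a complete HDF5 file image; names and sizes below are invented.

  #include <hdf5.h>

  /* Open an HDF5 file whose bytes live entirely in memory. */
  hid_t open_from_buffer(void *buf, size_t len)
  {
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_core(fapl, 64 * 1024, 0);   /* core VFD, no backing store       */
      H5Pset_file_image(fapl, buf, len);      /* seed the core VFD with the image */
      hid_t file = H5Fopen("in_memory.h5", H5F_ACC_RDONLY, fapl);
      H5Pclose(fapl);
      return file;                            /* caller closes with H5Fclose()    */
  }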

Planning Q2 activities.

[1] Richard

[2] Mark

[3] Albert

[4] Quincey

[5] Scott

February 02 2011

Computer Accounts

Quincey is all set now. Richard has everything on LLNL's end in place for Scott too. Scott's participation is likely in another month. Albert is in good shape now too. We won't start pinging folks for accounts on Cielito yet. Richard sees some major hurdles with I/O on that system to get over before intense performance work can really begin in earnest anyway; it would probably not be most productive to pursue this until those wrinkles get worked out.

Poor Man's Parallel I/O looksee

Albert has yet to look into this material. As you do, please feel free to contact Mark/Richard with questions.

Standing up HDF5 testing on DawnDev

Albert has yet to stand up HDF5's existing testing on DawnDev. Richard volunteered to refresh his own memory about building HDF5 on DawnDev and share what he remembers with Albert.

Parallel I/O Benchmark

Richard examined IOR and doesn't think the right path forward is to fix the way it uses HDF5. Maybe better to work from H5perf.

About other DOE site accounts

Richard is looking into this. ANL sponsor candidates: Bill Allcock and/or Tim Tautges. ORNL sponsor candidates: Sean Ahern or Jeremy Meredith. Richard/Mark will send email inquiries about the possibilities.

Cielito Class in March

Albert/Quincey cannot attend due to travel. Richard will get whatever course materials are made available and ensure Albert gets a copy.

Is the XE6 similar to the XT3 (Red Storm)? Yes and no. Red Storm used Lustre; the XE6 at LANL is using Panasas, so the I/O is very different. There are also many differences in architecture and usage, as per Richard's cursory observations of the existing development material.

February 10 2011

Quincey visited LLNL on site following the MPI forum in the bay area. Quincey and Mark met 10-5 on this date and discussed a variety of issues both short and long term.

Thread Issues at Exascale
What does ideal block/page based VFD look like?
Application of Compression at Exascale
Poor Man's vs. Rich Man's Parallel I/O

February 16 2011

Concurrency vs. Thread Safety

Quincey explained that the thread safety mechanism affects the ability of an app running on multiple threads to call into a lib concurrently. HDF5's existing thread safety (locking) mechanism permits NO concurrency. He introduced the concept of internal vs. external concurrency: internal concurrency is where the lib spawns/uses threads to perform work, while external concurrency is where the lib can be called from multiple threads.

Stackable VFD

Quincey scoped this. He says it's going to be necessary for good forward progress on VFDs in general. Mark M. approved this. Quincey will close the scoping ticket and add a new ticket for getting this into HDF5.

Chunking and I/O Requests.

There are new chunk-indexing methods in HDF5-1.10. The chunk indexing methods are useful because they bypass a general B-tree approach. We could introduce a new chunk indexing scheme, called simple chunk, for a dataset whose chunk is the size of the dataset. We can get all the benefits of chunking (e.g. compression, checksumming, etc.) but save an I/O call and reduce metadata (MD) overhead from 1-2 KB to ~50 bytes. Mark M. approved this. Quincey will add a new issue ticket for this work.
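
The simple-chunk index itself would be internal to HDF5, but the usage pattern it targets can be sketched from the API side: a dataset whose single chunk covers the whole extent, so the filter pipeline (compression, checksums) still applies. Dimensions and names below are invented for illustration.

  #include <hdf5.h>

  hid_t make_single_chunk_dset(hid_t file)
  {
      hsize_t dims[2] = {1024, 1024};
      hid_t   space = H5Screate_simple(2, dims, NULL);
      hid_t   dcpl  = H5Pcreate(H5P_DATASET_CREATE);
      H5Pset_chunk(dcpl, 2, dims);          /* single chunk = full extent */
      H5Pset_deflate(dcpl, 4);              /* gzip level 4               */
      H5Pset_fletcher32(dcpl);              /* checksum filter            */
      hid_t dset = H5Dcreate2(file, "pressure", H5T_NATIVE_DOUBLE, space,
                              H5P_DEFAULT, dcpl, H5P_DEFAULT);
      H5Pclose(dcpl);
      H5Sclose(space);
      return dset;
  }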

Work on Dawn

Albert is using xlc and bgxlc and having difficulty running on non-login nodes. He is trying salloc. May need to use '-p pdebug'.

People

Add John Mainzer to HDF-HPC redmine.

March 16 2011

HDF5 Leak/Bloat in free list stuff

Quincey ran the ioperf test but had to turn off some stuff there. He ran on a Mac for several hours to see some issue with growth. Niel took a closer look and described the problem as an ever-growing group heap internal data structure in HDF5 that is a direct result of writing many datasets to a single group. Note that Silo operates in this funky way where all the objects created by a Silo client result in HDF5 datasets created in a single '/.silo' group in the HDF5 file. Thus, the problem may be related to Silo's design in this regard. Silo has an option to run in such a way that datasets are not all created in the /.silo group, and that may sidestep this issue. In addition, Niel proposed using H5Pset_libver_bounds() to tell the HDF5 lib it is ok to use newer versions of some internal data structures that are more efficient in this regard. However, the impact is that the resultant files are not as backward compatible as they would ordinarily be: only software using HDF5 version 1.8 would be able to read them. Niel will send Mark a tar file example of how to do this.
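
For reference, a minimal sketch of that suggestion (file name and error handling are placeholders): tell the library it may use the newer file format internals, accepting that pre-1.8 software will not read the result.

  #include <hdf5.h>

  hid_t create_latest_format(const char *name)
  {
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
      hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
      H5Pclose(fapl);
      return file;   /* caller closes with H5Fclose() */
  }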

HDF5 mainline testing on LLNL Systems

Albert has this working but there are problems. It takes manual intervention, and he would like to automate it. It takes a long time to complete a 'make check' because each test has to flow through the batch submission system.

We need to identify some BG/P developers Albert can contact routinely for detailed questions. lc-support is fine but not necessarily able to address the details Albert is facing.

March 30 2011

Memory leak use case follow up

Quincey is looking into it with the ioperf tool Mark sent. valgrind/massif reports some peculiar information regarding B-tree blocks. Quincey sees only gradual growth in the tests he's run, but Mark sees more pronounced growth. Mark will follow up with real test data from the system where he last ran this test.

Multiple file open issue

VisIt has 120 database plugins and ~15-20 of those are HDF5 based. During automatic format detection (see here), several different HDF5 plugins may open and attempt to read a file's contents. If a plugin is buggy in the way it closes HDF5 objects, it may leave the file open even though it calls H5Fclose(). Another plugin will then wind up opening the same file (or, if the plugin attempts to open the same file with different properties, the open will fail) and problems ensue.

Advise having all plugins open files with the file close degree property set to strong (i.e. H5Pset_fclose_degree with H5F_CLOSE_STRONG).
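
A minimal sketch of that advice, assuming the plugin controls its own file access property list:

  #include <hdf5.h>

  /* Open so that H5Fclose() really closes the file even if objects leak. */
  hid_t open_strict(const char *name)
  {
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fclose_degree(fapl, H5F_CLOSE_STRONG);
      hid_t file = H5Fopen(name, H5F_ACC_RDONLY, fapl);
      H5Pclose(fapl);
      return file;
  }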

Also, requested an enhancement to have H5Fopen fail if HDF5 detects the file is already open.

Developer support resources for Dawn

Steve Langer said he'd be willing to field (occasional) emails from Albert. Mark to send an email about this.

Richard looked into this as well and suggests that John Jyllenhall (spelling?) would be a good resource. Maybe the best thing is for Albert to send an email to the hotline and request that it be assigned to John. Richard will email John with a heads-up that Albert may also be contacting him directly with Dawn questions.

Cielo training resources

Richard will email the group with the online pages.

Albert encounters occasional resources he can't access. He thinks it's a firewall issue. It may just be parts of LC's pages that point at LC-staff-only content. If Albert encounters this on pages he thinks he really needs, he should email Richard for help.

Round robin sanity check on progress

Richard wants to put more time into end-to-end test setup.

Quincey still a bit resource starved waiting for a key developer to finish up some other tasks

Albert's progress on Dawn has been a bit challenging. He expects now to get HDF5 testing stood up there in the next 1-2 weeks. He also expects to get Silo testing stood up on HDF Group resources. Richard mentioned that Albert's progress getting up to speed on Dawn has been really good given all the issues he is facing.

Mark feels like progress is sketchy; he doesn't have time to track everything.

April 27 2011

June 8, 2011

Manpower updates

Core VFD enhancements for NIF

Quincey pinged NIF on the most recent enhancement spec but got no response. Suggest pinging NIF again. However, is it possible to include a few paragraphs of high-level text outlining their options...

Explain why you think the best option is the fully general approach. What are the relative costs to develop? If you put all this in a few paragraphs and ask for feedback (positive or negative), I think you'll have a much better chance of getting a response than just attaching a 14-page design document. People might not have enough time to read all that, nor know specifically which of the 14 pages to read to get the gist of things.

Quincey to put up VFD enhancement document on this wiki.

Testing

Performance Benchmarking

Ruth starting to look into. Check out ioperf.c in Silo's tests directory. There is a makefile there plus a number of I/O plugins for each of the interfaces the driver (ioperf) runs. Email Mark with any questions.

July 20, 2011

Round-robin status

PMPIO spinning up. Some questions about goals/purpose.

File Image Proposal

Darshan (I/O profiling) tool

Silo testing

Possible Klocwork use

August 31, 2011

Round Robin Check in

September 14, 2011

Parallel Benchmark Architecture Discussion

Round-robin Checking

Staffing adjustments

September 28, 2011

Parallel benchmark architecture discussion

Ruth A: Good summary of the discussion. That said, I do want to comment regarding "There are concerns about what really is the HDF5-way"... I think my concerns are more along the lines of "equivalence is in the eye of the beholder", not only for the HDF5 interface but for other interfaces as well. I want to make sure the benchmark tool will support multiple views of equivalence (along the lines of the generality mentioned earlier in the notes) down the road. That said, I do understand the top priority is to support what the "beholder" funding this work wants to compare :-). I believe we can do that via the I/O pattern generation, use of default options, etc., based on what ioperf currently does or plans to do, within a more general framework.

Round-robin status checkin

December 21, 2011

Round-robin checkin/status

Integration Testing

File Image work

ioperf work

Funding situation

Quincey's visit to LLNL

Other notes

January 12, 2012

Quincey visited LLNL for all-day meetings. He gave a presentation on the state of the union of HDF5.

March 14, 2012

Due to reduced funding levels, we have reduced frequency of meetings. In addition, when we have had meetings, we've focused on simply keeping everyone in the loop and up to date on status of various activities which we haven't had a need to record. That said, this week's meeting included some information worth recording.

Attendees: Mark, Richard, Quincey.

Miscellaneous stuff

Chicago sponsor has interest in metadata aggregation solutions

They are working with Isilon storage server and seeing really bad performance.

Quincey proposed a solution to reserve space at beginning of file for metadata. If client exceeds that space, handle all other metadata normally in that it gets sprinkled about rest of file.

Potential new funding opportunities

August 29, 2012

CSSE funding

NIF work with Tim Frasier

WhamCloud

October 10, 2012

Status updates

Klocwork Static Analysis


Compression for Exascale

Compression in the rich man's parallel sense is hard; HDF5 doesn't support it. The problems have to do with not knowing ahead of time where data lands in the file due to variations in compressed size. A variable-loss but fixed-size compression might be better here (e.g. wavelets). That way, we can always hit a target compressed size, but the quality of the compressed result varies. This could be useful for plot files but obviously not for restart.

Alternatively, we can do compression in rich man's parallel if HDF5 operates in such a way as to assume a target compression ratio of R (R set by the application in a property list), and then assume each block compresses R:1, so the compressed block size is always 1/R of the original. This means the compressed block size is always predictable, though if actual compression exceeds R:1, some space savings will be sacrificed because we'll assume the size is 1/R of the original. So what? The real problem is if a given block cannot be compressed R:1. Then what? One option is to fail the write; the app can then re-try the write with a lower R. Another option is to have two kinds of blocks: those that hit or exceeded the compression target of R:1 and those that didn't. The former are always treated as size 1/R of the original, and the latter are the size of the original. Either way, the size is predictable and therefore manageable in rich man's parallel.
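
To make the bookkeeping concrete, here is a purely illustrative sketch of that fixed-ratio scheme. None of this is HDF5 code; it only captures the placement decision described above.

  /*
   * Every block gets a file slot of orig_size/R bytes. A block whose
   * compressed size fits the slot is written there (any leftover space
   * is wasted); one that does not fit either fails the write so the app
   * can retry with a lower R, or is stored uncompressed in a full-size slot.
   */
  #include <stddef.h>

  typedef enum { SLOT_COMPRESSED, SLOT_UNCOMPRESSED, WRITE_FAILED } placement_t;

  placement_t place_block(size_t orig_size, size_t compressed_size, double R,
                          int allow_uncompressed_slots)
  {
      size_t budget = (size_t)((double)orig_size / R);  /* predictable slot size */
      if (compressed_size <= budget)
          return SLOT_COMPRESSED;          /* always treated as orig_size/R bytes */
      if (allow_uncompressed_slots)
          return SLOT_UNCOMPRESSED;        /* second block kind: full orig_size   */
      return WRITE_FAILED;                 /* option 1: app retries with lower R  */
  }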

Adding additional preprocessing filters to the compression pipeline may give a better chance of achieving the R:1 compression ratio (or may allow the compression ratio to be increased), at the expense of additional computing power. Some examples include shuffle, delta, and/or space-filling-curve filters.

Eliminating the block-level indirection here might be useful. Yes, it's bad for an eventual attempt to subset on read, but if the caller accepts the limitations and/or costs of that, we allow it. Then the whole dataset is a single block and it is either compressed to the target R:1 (with possible wasted space if it exceeds the target) or not compressed at all.

Exascale may involve higher than double precision of 64 bits. Maybe 96 or 128 bits are required. What does this mean for compression of floating point data compared to single or double precision? Would we expect to be able to do better because there are more exponent bits or worse because there are more mantissa bits?

Also, see this HDF5 document, Chunking in HDF5

Part of an email thread with John Biddiscombe

Replying to multiple comments at once.

Quincey : "multiple processes may be writing into each chunk, which MPI-I/O can handle when the data is not compressed, but since compressed data is context-sensitive"
My initial use case would be much simpler. A chunk would be aligned with the boundaries of the domain decomposition and each process would write one chunk, one at a time. A compression filter would be applied by the process owning the data and then it would be written to disk (much like Mark's suggestion).
a) lossless. Problem understood, chunks varying in size, nasty metadata synchronization, sparse files, issues.
b) lossy. Seems feasible. We were in fact considering a wavelet type compression as a first pass (pun intended). "It's great from the perspective that it completely eliminates the space allocation problem". Absolutely. All chunks are known to be of size X beforehand, so nothing changes except for the indexing and actual chunk storage/retrieval + de/compression.

I also like the idea of using a lossless compression and having the IO operation fail if the data doesn't fit. Would give the user the chance to try their best to compress with some knowledge of the data type and if it doesn't fit the allocated space, to abort.

Mark : Multi-pass VFD. I like this too. It potentially allows a very flexible approach where, even if collective IO is writing to the same chunk, the collection/compression phase can do the sums and transmit the info into the hdf5 metadata layer. We'd certainly need to extend the chunking interface to handle variably sized chunks to allow for more/less compression in different areas of the data (actually this would be true for any option involving lossless compression). I think the chunk hashing relies on all chunks being the same size, so any change to that is going to be a huge compatibility breaker. Also, the chunking layer sits on top of the VFD, so I'm not sure the VFD would be able to manipulate the chunks in the way desired. Perhaps I'm mistaken and the VFD does see the chunks. Correct me anyway.

Quincey : One idea I had, and which I think Mark also expounded on, is ... each process takes its own data and compresses it as it sees fit, then the processes do a synchronization step to tell each other how much (newly compressed) data they have got - and then a dataset create is called using the size of the compressed data. Now each process creates a hyperslab for its piece of compressed data and writes into the file using collective IO. We then add an array of extent information and compression algorithm info to the dataset as an attribute, where each entry has a start and end index of the data for each process.

Now the only trouble is that reading the data back requires a double step of reading the attributes and decompressing the desired piece- quite nasty when odd slices are being requested.

Now I start to think that Mark's double-VFD suggestion would do basically this (in one way or another), but maintaining the normal data layout rather than writing a special dataset representing the compressed data.
step 1 : Data is collected into chunks (if already aligned with the domain decomposition, a no-op), and the chunks are compressed.
step 2 : Sizes of chunks are exchanged and space is allocated in the file for all the chunks.
step 3 : Chunks of compressed data are written.
I'm not sure two passes are actually needed, as long as the 3 steps are followed.

...but variable chunk sizes are not allowed in hdf (true or false?) - this seems like a showstopper.
Aha. I understand. The actual written data can/could vary in size, as long as the chunk indices, as referring to the original dataspace, are regular. Yes?
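
A sketch of what the size-exchange approach above might look like in HDF5 + MPI. The names are invented, the per-rank extent attribute and all error checking are omitted, and my_buf/my_len are assumed to already hold this rank's compressed bytes.

  #include <hdf5.h>
  #include <mpi.h>
  #include <stdlib.h>

  void write_compressed(MPI_Comm comm, const void *my_buf, hsize_t my_len)
  {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      /* 1. Exchange compressed sizes and compute this rank's offset. */
      unsigned long long len = my_len, *all = malloc(size * sizeof *all);
      MPI_Allgather(&len, 1, MPI_UNSIGNED_LONG_LONG, all, 1,
                    MPI_UNSIGNED_LONG_LONG, comm);
      hsize_t offset = 0, total = 0;
      for (int i = 0; i < size; i++) { if (i < rank) offset += all[i]; total += all[i]; }

      /* 2. Create the file and a 1-D byte dataset sized to the total. */
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
      hid_t file   = H5Fcreate("compressed.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
      hid_t fspace = H5Screate_simple(1, &total, NULL);
      hid_t dset   = H5Dcreate2(file, "packed", H5T_NATIVE_UCHAR, fspace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

      /* 3. Each rank writes its bytes into its hyperslab, collectively. */
      hid_t mspace = H5Screate_simple(1, &my_len, NULL);
      H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, NULL, &my_len, NULL);
      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
      H5Dwrite(dset, H5T_NATIVE_UCHAR, mspace, fspace, dxpl, my_buf);

      H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset);
      H5Sclose(fspace); H5Fclose(file); H5Pclose(fapl);
      free(all);
  }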



HDF5 Installations at LLNL Systems

HDF5 binaries are installed in
/usr/global/tools/hdf5dev/hdf5.

=======================
* v1.8.9 installation *
=======================
=== /usr/global/tools/hdf5dev/hdf5/v189/README ===

This is HDF5 v1.8.9 binary installation for different platforms.

linux64: (e.g., Aztec)
    Linux x86_64, using icc/ifort/icpc, static lib, zlib, no-szip-lib.

linux64-szip-encode: (e.g., Aztec)
    Linux x86_64, using icc/ifort/icpc, static lib, zlib, szip-lib.

linux64-szip-decode: (e.g., Aztec)
    Linux x86_64, using icc/ifort/icpc, static lib, zlib, szip-lib-decode only.

linuxppc64: (e.g., uDawn)
    Linux ppc64, using bgxlc/bgxlf90/bgxlC, static lib, no-zlib, no-szip-lib.

src:
    Holds the HDF5 source tar ball of this version.

README:
    This file.

Question?
Email Albert Cheng (acheng@hdfgroup.org)

====
Last update: May 15, 2012 by AKC


HPC Book Chapter Outline

This is supposed to be more or less of a case study (of Silo)

Ideal Block-Based VFD Characteristics

Some of these characteristics may be mutually exclusive. Which, I don't know. Let's elaborate as we flesh out what this thing looks like.

What is the ideal, block-based Virtual File Driver to support PMPIO?
  1. Blocks are either pure meta-data (MD) or pure-raw data (RD)
  2. MD blocks can be written throughout file (e.g. don't have to let MD grow without bound and write at close)
  3. Block size of RD controlled independently of MD
  4. Application can specify how many MD blocks and RD blocks are allowed to be kept in memory at any one time and/or total memory that is allowed to be used by VFD to cache/buffer blocks.
  5. When file is closed, whatever blocks are still in memory are written in increasing file index order.
  6. Produces a single file out the bottom (not one for RD and one for MD)
  7. Can be re-opened correctly by any standard HDF5 VFD (e.g. sec2 for example)
  8. PMPIO baton handoff is performed on the open file.
  9. Employs a least recently used MD block pre-emption algorithm for deciding which MD blocks to page out to disk and when
  10. Can handle MD async. (And likely RD async)
  11. A perfect block-based VFD decorrelates chunks from I/O requests.
  12. Computes diagnostic statistics for performance debugging (e.g. like Silo's VFD currently does)
  13. Can use MPI under the covers to aggregate blocks from different MPI-tasks 'files' to a single, shared file on disk.
  14. Option to ship blocks off processor via MPI message to...

Internal HDF5 lib communication with VFD.

The internal parts of the HDF5 lib can communicate directly with the VFD by adding what amounts to out-of-band read/write messages to the VFD. Currently, there is a mem type tag on each message that indicates the type of memory HDF5 is sending to or requesting from the VFD. We could add new types to this enum to support messages sent between the HDF5 lib proper and the VFD. For example, to send information about hot spots in MD, the HDF5 lib could write data to the VFD with a mem_type of MD_HOT_SPOTS. The VFD would advertise to HDF5 whether, and what kind of, out-of-band messaging it supports, so HDF5 would only send such messages to VFDs that claim to support them. This way, it is possible for the HDF5 lib proper to communicate with the VFD without changing the existing VFD API. (QAK: Cool idea!)

Likewise, HDF5 lib could request information from VFD by a read method with an appropriate mem_type.
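
A purely illustrative fragment of the idea; none of the new names below are real HDF5 symbols, only the general shape of H5FDwrite() is. A hypothetical mem-type tag carries the out-of-band message, and the library only sends it to VFDs that advertise support via a hypothetical feature flag.

  #include <hdf5.h>

  #define H5FD_FEAT_OOB_MESSAGES 0x10000u               /* hypothetical feature bit */
  #define H5FD_MEM_MD_HOT_SPOTS  (H5FD_MEM_NTYPES + 1)  /* hypothetical mem tag     */

  typedef struct { haddr_t addr; size_t len; } md_hot_spot_t;

  static void tell_vfd_about_hot_spots(H5FD_t *file, unsigned long vfd_feature_flags,
                                       md_hot_spot_t *spots, size_t nspots)
  {
      if (!(vfd_feature_flags & H5FD_FEAT_OOB_MESSAGES))
          return;                           /* this VFD did not opt in             */
      /* Address 0 and the default dxpl are ignored for out-of-band traffic here. */
      H5FDwrite(file, (H5FD_mem_t)H5FD_MEM_MD_HOT_SPOTS, H5P_DEFAULT,
                0, nspots * sizeof *spots, spots);
  }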

Stackable VFDs

Given the variety of functionality listed above and the desire for good software engineering practices and results, it seems likely that teasing apart the different kinds of functionality into multiple aspects, tied together by a common VFD framework, would be desirable. The HDF Group has tackled this before and nearly finished a prototype. We should resurrect that project and implement it, so that these features can be combined flexibly by application developers.

Possible VFD aspects, from characteristics above:


Markup usage conventions

For in-line commentary (e.g. marking up)

Some documents on the wiki have an overall structure that is essential to maintain relatively statically. Nonetheless, various developers need to be able to comment on the material within such documents and this commentary needs to be maintained. We call this marking up a wiki document.

We will use wiki footnotes for this purpose. This allows us to keep all commentary related to a given document on the same page at the bottom of the document. We adopt the following wiki usage conventions in marking up such documents with developer commentary...

For example...

Here is a line to which a footnote has been added (see footnote 3). Note that '3' was chosen arbitrarily.

Discussion notes

3 Note 3:

MCM: Mark Miller added this footnote


Monthly HDF Group Reports


Multi-threaded cores and HPC-HDF5

What are the issues we face in the long term on these exascale systems that will likely involve multi-core chips running multiple threads? How do we expect applications to use those threads to do their work? How do we expect the existence of threads to impact how/what the application does in the way of I/O?

Thread Safety versus Thread Implementation

These two issues can be confused. In the context of libraries like Silo and HDF5, threads can have implications on two levels:

Thread Safety & Concurrency

By thread safe we mean the library can be safely used by an application running on multiple threads: the application, running on each thread, can make calls into the library safely without encountering any problems endemic to threads, such as global data structures getting corrupted and/or race conditions. When a library is not thread safe, an application running on multiple threads has to be re-engineered (slightly) so that its use of that library occurs on only one thread. Concurrency refers to a thread safe library's ability to be used simultaneously by multiple threads (see note 2).

Silo is not a thread safe library. HDF5 is thread safe but is not concurrent: the whole library is locked when a thread enters an API routine. The problem with HDF5's thread safety is that locking is done on code (functions in the API or below it) but should instead be done on data. It is a challenging enough problem to fix that, although it has been discussed many times over the last several years, no funding agency has wanted to support it.

Threaded Implementation

By threaded implementation, we mean the library is designed to use multiple threads to do its work. For a computational kernel library like LINPACK, threaded implementations can be important, as the primary service the library provides is a computational one. The additional threads parallelize the computational work of the library.

For an I/O library, where the primary service the library performs is to move data between memory and disk, the value of employing threads is unclear. A purely I/O library is one that engages in no problem-sized work (operations on arrays/buffers passed into the library by the caller) and simply passes the application data it is handed on to the underlying I/O interfaces (section 2, stdio, MPI-IO, etc.). However, both Silo and HDF5 do support operations on the data as it moves between memory and disk. These operations include

For a detailed description and flowchart of HDF5 operations during an H5Dwrite call, read chunk write actions

Nonetheless, performance studies have shown that even without threads, the HDF5 library can perform these operations at speeds well above the associated disk I/O bandwidth. So, what value is there in making them any faster by employing threads when the associated disk I/O is going to dominate any particular data movement operation (see note 1)? The answer is unclear.

We have identified a few ways in which threads could perhaps be gainfully employed in HDF5 at exascale to facilitate I/O

MPI and OpenMP types of parallelism

MPI-like parallelism is where the application is engineered to explicitly handle messaging between tasks using the MPI message library (or equivalent). OpenMP-like parallelism is much finer grained and typically handled via #pragma statements with a compiler. Work on a large array of data is then assigned to a variable number of tasks (threads) and the compiler handles all the issues (messaging/locking whatever) under the covers.

Could/would every thread in an exascale app look like any other MPI task, as they do now? Apparently, the MPI community is aiming to enable this degree of transparency, such that a quality MPI-2/3 implementation would handle sending messages with whatever native efficiencies are possible between threads/chips. However, for an application like Ale3d, such an approach is likely to be impractical given the memory constraints of each domain-level task. So, instead, the way Ale3d might handle this is to run with one or just a few uber domains on a chip. MPI parallelism would occur between uber domains (chips) while OpenMP parallelism would be used within the chip to operate on the one or few domains over the threads there (Neely/Keasler). How would this affect the way Ale3d might do its I/O?

For many of the ways of employing a threaded implementation of HDF5 described above, OpenMP parallelism makes the most (only) sense.

Threads provide more parallelism in compute, not I/O

The I/O pathways on/off chip are not improving. Indeed, the gap between processor speeds and disk I/O has continued to widen in the last decade. This is true on single-CPU systems as well as parallel systems. Increasing parallelism on chip with extra cores and threads is great from a compute standpoint but can be leveraged very little, if at all, to help move data on and off the chip. About the only thing threads could possibly help with is hiding some I/O behind compute by doing async I/O and handing the actual I/O work off to a different thread. At the same time, few codes are designed with async I/O in mind; they would either have to be retrofitted substantially to add logic to check that buffer reads and writes have completed before using those buffers, or the underlying I/O libraries would have to make buffer copies (chewing up precious memory).

Conclusions

The existence of threaded execution in applications on exascale systems is unlikely to change how I/O needs to be done. We don't want to do I/O independently from each thread, do we? Is there any reason we should try? Threads are really for compute anyway. There is no additional off-processor bandwidth that using multiple threads for I/O will give us. So, although it seems counter-intuitive, from an I/O standpoint the thread headaches we're anticipating with exascale don't seem to be relevant.

Notes

1 Note 1:
None of HDF5's data operations listed in this section are performed apart from I/O operations. At least none that I know of. That means they are only performed as part of some larger I/O operation and are not an operation HDF5 provides apart from I/O. This may not be true of numeric architecture conversions.

2 Note 2:
Quincey explained that the thread safety mechanism affects the ability of an app running on multiple threads to call into a lib concurrently. HDF5's existing thread safety (locking) mechanism permits NO concurrency. He introduced the concept of internal vs. external concurrency: internal concurrency is where the lib spawns/uses threads to perform work, while external concurrency is where the lib can be called from multiple threads.


Parallel Benchmarking

Notes from 7/26/2011 telecon (Entered by Ruth 8/3/2011)

Mark, Ruth, and Albert attended.
Ruth circulated 2-page drawing that outlined a use case in advance.
Page 1
Page 2

Based on that drawing, we discussed how Silo would handle it.

This is a transcription of Ruth's notes... additions/ corrections welcome.

DBPutQuadmesh would be used to put the coordinate arrays / xy values.
DBPutQuadvar would be used to put the temp and pressure variables.

Raw data and metadata (name, units, etc.). This is Silo metadata.
A small # of calls would be made to HDF5 - an uberstruct w/ all the metadata would be written.
This would be Dataset create, write, close.

From the application perspective, they care about mesh and variable data at a minimum, but don't really think about the metadata. The metadata is not counted in terms of I/O request size. The application "thinks" in terms of # of nodes or zones of a field (variable). For example, in the use case drawing, D0 has 5x4 = 20 zones of 64-bit doubles for the temperature field = 160 bytes.
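
For concreteness, a sketch of the Silo calls for the D0 block above (5x4 zones with a zone-centered temperature field). The file name, coordinate values, and option lists are invented and error checking is omitted.

  #include <silo.h>

  void write_d0(void)
  {
      double x[6], y[5], temp[5 * 4];            /* 6x5 nodes -> 5x4 zones */
      void  *coords[2]   = { x, y };
      int    ndims       = 2;
      int    node_dims[] = { 6, 5 };
      int    zone_dims[] = { 5, 4 };

      /* ... fill x, y, temp ... */

      DBfile *db = DBCreate("d0.silo", DB_CLOBBER, DB_LOCAL, "use case D0", DB_HDF5);
      DBPutQuadmesh(db, "mesh", NULL, coords, node_dims, ndims,
                    DB_DOUBLE, DB_COLLINEAR, NULL);
      DBPutQuadvar1(db, "temp", "mesh", temp, zone_dims, ndims,
                    NULL, 0, DB_DOUBLE, DB_ZONECENT, NULL);
      DBClose(db);
  }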

The metric at the HDF5 level is multi-dimensional array.

At the filesystem level it's # of bytes.

Looking at the code snippet on page 2-172 of Silo UG:

  PMPIO_baton_t *bat = PMPIO_Init(...);
  dbFile = (DBfile *) PMPIO_WaitForBaton(bat, ...);
  /* local work (e.g. DBPutXXX() calls) for this processor */
  .
  .
  .
  PMPIO_HandOffBaton(bat, ...)
  PMPIO_Finish(bat);

In general, HPC codes don't do partial I/O -- they write or read all the data in a dataset at once. This is changing somewhat with wavelet compression, such as what FastBit does, and in some cases VisIt does partial reads if it helps improve viz speed. But in general that's the case.

The mock-up drawing does not accurately reflect the way applications and Silo work. The domains would be assigned "linearly" to processors. So, where the drawing shows D0, D2 to P0; D1, D3 to P1; D4 to P2, it would in fact be D0, D1 to P0; D2, D3 to P1; D4 to P2. Also, the timeline shows compute and I/O interspersed; in fact, all the compute would occur first, then all the I/O.

Currently there is no assignment of domains or division of processors into GROUPS for writing files based on I/O capabilities of the processors. And, currently no overlap of compute and I/O (as shown in the timing diagrams).

Parallel Benchmark Architecture (Initially discussed on 9/14/11 telecon)

A finer decomposition of pieces

Equivalence in I/O requests across I/O library interfaces

Be sure to see some discussion of this topic in the forum

A typical issue we face in winning adoption for higher level I/O libraries like HDF5 and Silo is I/O performance. We always hear application developers complain about how much it is costing them to write data to HDF5 vs. using a lower-level interface like Section 2 or stdio or MPI-IO.

This notion of equivalence in I/O requests across interfaces is something that requires further dialog. At a minimum, we need to be thinking of it in terms of what an application code needs to do with its data, irrespective of the idiosyncrasies with which that might be achieved through various I/O interfaces, even if said interfaces are used optimally.

An application has a bunch of data it would like to store to persistent storage. Generally, that data is scattered all over in pieces in various places in memory. Some of those pieces represent different parts of some larger single semantic data object and others of those pieces may be either a single data object unto themselves or even a whole set of smaller data objects. Ultimately, all that data is intended to wind up in one or more files. The application really has only a few choices...

  1. Gather all the pieces for one or more data object(s) into one place and pass that aggregated whole object on to a library below.
  2. Build some kind of map-like thing that indicates where all the pieces of one or more data objects are and pass that thing on to a library below (see the sketch below).
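
As a sketch of choice 2 in HDF5 terms: the application describes where the pieces live with a memory-dataspace selection and lets the library do the gathering during the write. The layout below (every other row of an 8x10 in-memory array going into a 4x10 dataset) is invented for illustration.

  #include <hdf5.h>

  void write_scattered(hid_t dset, const double *mem /* 8x10 array in memory */)
  {
      hsize_t mem_dims[2] = {8, 10};
      hsize_t start[2]    = {0, 0};
      hsize_t stride[2]   = {2, 1};            /* every other row          */
      hsize_t count[2]    = {4, 10};
      hid_t   mspace = H5Screate_simple(2, mem_dims, NULL);
      H5Sselect_hyperslab(mspace, H5S_SELECT_SET, start, stride, count, NULL);

      /* File dataspace: the 4x10 selected rows land contiguously in the dataset. */
      hid_t fspace = H5Dget_space(dset);       /* assume the dataset is 4x10 */
      H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, mem);
      H5Sclose(fspace);
      H5Sclose(mspace);
  }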

Possibility to play back a trace captured from another code

More on notion of I/O request equivalence across interfaces

We have identified a need to develop a notion of equivalence of (a set of) I/O operation(s) across a variety of I/O interfaces.

I believe it is best to ask this question from the context of the application needing to write/read data without regard for qualitative differences in how that data winds up being stored/handled by any given I/O interface. For example, when data is written to HDF5, it's possible to give the data a name, associate a datatype with it, convert from one datatype to another, checksum it, compress it, etc. The data is stored such that it can subsequently be randomly accessed. These are all useful features of HDF5.

But, given the basic action of an application writing/reading data to/from persistent storage, I claim all of these useful features represent something that is qualitatively different from raw I/O performance. Therefore, when developing performance and benchmarking metrics, we have two problems. One is quantifying the overhead higher-level interfaces impose on raw I/O performance. The other is developing a way to establish equivalence between I/O operations across interfaces operating at very different levels of abstraction.

There is no doubt that such qualitative differences are very important. And, if all of these other features had no impact on performance, we wouldn't really even need to be talking about them. But, I claim we need a way to factor these issues out of raw I/O performance measurements and comparisons so that we can normalize measurements across interfaces where the lowest common denominator is something like stdio or sec2 which in and of themselves support none of these features. In so doing, we in fact wind up with a good idea of the cost that applications pay in using a library like HDF5 as well as how to optimize use of such a library to minimize that cost.

But, we also have to be careful. We can envision I/O libraries like Silo which operate on meshes and fields (a level of abstraction above HDF5's structs and arrays) and which include ever more sophisticated operations on the data such that the operations themselves have a profound impact on the basic action of moving data from memory to persistent storage. For example, if we think of some really advanced scientific database that maybe includes very high level operations to detect vortices in fluid flow or high gradients in fields defined on a mesh and then only takes a snapshot of the data when the conditions are right, such operations will have such a profound impact on I/O performance that it does not make sense to exclude them when measuring I/O performance. In this context, the application's need isn't so much to store data to storage as it is to store snapshots of the data around the time(s) of important events in the evolution of the simulation.

So, in general, there is a spectrum: at one end is simply raw I/O and no operations on the data; at the other end is highly sophisticated database-like processing that can change entirely the nature of the data being stored. Then there are in-between operations that maintain the same data semantics but represent it in perhaps different ways. Stdio and sec2 are examples of the extreme raw-I/O end of the spectrum. HDF5 and Silo are examples of the in-between type of library. ITAPS, together with some specialized feature-detection service software, is at the other extreme end of the spectrum.

Note, for restart dumps, there is an implied requirement that the complete internal memory state of the simulation can be reconstructed from whatever is stored to persistent storage. The purpose of a restart dump is to store the state of the application so that the simulation can be restarted from that point forward. For plot dumps, there is no such implied requirement and so its conceivable that there can be many operations applied to the data that may change its characteristics dramatically from what is actually stored in the application's memory.

The notion of application data objects

Conceptually, we can think of all of the data the application wants to store as a collection of one or more data objects.

A data object is a whole, coherent entity of data that is treated, semantically, as an independent, single thing. For example, in a simulation of airflow over a wing, one of the data objects may be the velocity vector field of the air. There are many, many ways an application could choose to store this data object in memory as suiting the needs of the implementation of the numerical models the simulation uses. Below, we characterize examples by way of code showing how the memory of the data is allocated...

Ensuring that benchmark writes and reads are verifiable

It is sometimes convenient (and I myself have written simple I/O test code this way) to write data to disk in such a way that it cannot later be verified that the data in the file is actually what the writer handed off. For example, it's common to allocate a buffer of bytes to write but not set those bytes to any pre-defined values, because we're often thinking only about timing how long it takes for a given I/O library to push the bytes to disk. However, the more I think about this, the more I think it is prudent to design, and indeed require, a useful I/O benchmark such that the data it uses in testing is verifiable. That is, it should be possible to independently check that the data from the application was properly written to the file in such a way that it can later be read back. I think this represents the absolute minimum requirement of any I/O interface that might be a candidate for inclusion in a benchmarking study. I mean, would we really want to include a library for which it is not possible to ensure this?

In practical terms, this means something like Silo's ioperf needs to do a tiny bit more work constructing the data buffers it writes. In addition, I think there is value in designing things such that, given any bucket of data in the file, we can identify from its contents which processor and/or which number in the sequence of I/O requests originated it. Perhaps there is more we might like to be able to deduce from the contents of a given bucket of data in the file, but those are at least two useful features.
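
A minimal sketch of what that might look like (the encoding rule here is invented for illustration; it is not what ioperf currently does):

    #include <stddef.h>

    /* Fill a buffer of doubles so every value encodes which rank produced it,
       which I/O request it belongs to, and its offset within that request.
       Any bucket of data later found in the file can then be decoded and
       checked independently of the writer. */
    void fill_verifiable(double *buf, size_t n, int rank, int request_num)
    {
        size_t i;
        for (i = 0; i < n; i++)
            buf[i] = (double) rank * 1.0e12 +
                     (double) request_num * 1.0e6 + (double) i;
    }

    /* Check a buffer read back from the file against the same rule.
       Returns 1 if every value matches, 0 otherwise. */
    int check_verifiable(const double *buf, size_t n, int rank, int request_num)
    {
        size_t i;
        for (i = 0; i < n; i++)
            if (buf[i] != (double) rank * 1.0e12 +
                          (double) request_num * 1.0e6 + (double) i)
                return 0;
        return 1;
    }

The same sort of rule can be applied to character and integer buffers as well.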

I am also thinking it makes sense to include in the benchmark a few different fundamental data types, such as character, integer, and floating point data. We would then define a useful benchmark as one that handles all of these types; not necessarily portably across different machines, but in whatever way is natural for the underlying I/O interface being tested.


Focused Benchmarking and Auto-Tuning Activities (07Dec11 Telecon)

Attendees: Ruth, Quincey, Mark H., Mark M., Prabhat

One stop shopping for I/O relevant tuneables

The goal here is to outline development of a common interface to control all these parameters; a one-stop shop for I/O tuneables. This could be a useful capability apart from any specific product such as HDF5, so we might want to consider software engineering issues to make it packageable as such. This is an essential piece of many of the other benchmarking and auto-tuning activities we'd like to consider. Without the ability to vary/control parameters in a common way, it's difficult to develop software to do the other things we want. We need to be aware of situations where there are no well-defined interfaces to specific parts of the system with which to control parameters (e.g. the only way to affect parameter X is via some environment variable).

We're considering the notion of an HPC-specific (high-level) API in HDF5 for this purpose1. There are still problems with environment variables, since they are out of band.

Include ability to read parameter sets from human readable/editable settings files

Having the ability to set parameters via a common interface is good. Being able to vary them for different runs of a benchmark or application is also useful. But to do that, some part of the application has to take responsibility for accepting user-specified settings. The current proposed solution is to provide the ability to drive the interface defined above from the contents of human readable/editable text files, probably XML. So, part of defining the interface above will include the ability to write out and read back (XML) settings files.
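
As a rough sketch of the idea (the flat key/value format and the load_hints() name are placeholders for this example; the actual proposal is XML), settings could be read from a user-editable text file and collected into an MPI_Info object that is later handed to something like H5Pset_fapl_mpio():

    #include <stdio.h>
    #include <mpi.h>

    /* Read "key value" pairs from a text file and collect them into an
       MPI_Info object.  Placeholder sketch only; an XML settings file would
       be parsed with a real XML library and would cover more than hints. */
    MPI_Info load_hints(const char *fname)
    {
        char key[128], val[128];
        MPI_Info info;
        FILE *fp = fopen(fname, "r");

        MPI_Info_create(&info);
        if (!fp) return info;
        while (fscanf(fp, "%127s %127s", key, val) == 2)
            MPI_Info_set(info, key, val);
        fclose(fp);
        return info;
    }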

Mark M. proposed the notion of giving HDF5 library properties this ability so that literally any HDF5 properties could be stored persistently as XML strings, either within an HDF5 file or as a raw, standalone XML file. The HDF Group is nearly finished with a mechanism to serialize/deserialize property lists (into binary, not text, though).

Add interface defined above in some I/O benchmarks (Silo's ioperf, h5perf, h5part kernel)

Mark M. offered manpower to incorporate the one-stop-shopping I/O tuneables interface into Silo's ioperf. Timeframe would be sometime before end of March, 2012. Prabhat and Mark H. could adjust h5part and use as kernel.

Inventory I/O relevant tuneables

Do we even know what all the knobs are and what they are (intended to) do? Can we collect together in one place (and maintain this information as things change) all the tuneables that exist? Ruth suggested a world-readable wiki for this. We agreed that having it be HPC-specific would be best.

Triage tuneables for what's important and what's not (rationale too)?

Having a list of all possible tuneables is useful, but we really ought to have some idea of what's important and what's not, as well as our rationale for these choices. So, we need to triage/classify the tuneables list according to how important we think each one is, along with the assumptions/conditions under which such judgements are valid (e.g. the kinds of I/O application scenarios).

Do we know the answers already?

Mark M. argues that the single biggest factor in I/O performance is finding a way to make I/O requests as large as possible; focusing on maximizing request size achieves the biggest and best gains in performance. So, why worry about all the other, less significant knobs? On the other hand, if we have a good handle on the other knobs, we may be able to vary performance by several times on top of what we gain from I/O request size. So, the other knobs are still useful?

An initial, seat-of-the-pants triage of the various parameters, where we all sit around the table and say things like "I think X is useless" or "I think Y has a 10% effect", would be a useful thing to do at a future meeting once we've compiled a complete list of tuneables.

Also, even if we know the answers today, they are very likely to change over time or on other systems. And, we may add/retire tuneables from an auto-tuning framework over time, allowing it to stay relevant.

That said, in situations where it is possible to affect I/O request size without undue impact on the application or other parts of the software stack, I believe very strongly that I/O request size will always be relevant and always have a significant impact. On the other hand, should I/O request size really be considered a tuneable, given that it is controlled almost entirely by the application and/or I/O layers above HDF5?

Identifying driving applications (kernels)

We should spend some time characterizing the I/O scenarios used by our relevant applications. Prabhat mentions h5part can serve as a useful I/O kernel. Quincey suggested the following 3 kinds of I/O scenarios: restart/plot dumps (i.e. write once, read never), LSST-like transactions (write many, read frequently), and visualization/post-processing analyses (write once, read many).

We need to spend some time defining what the driving applications are and putting word descriptions to them.

Mark H. explained that there is already some literature available on automatically detecting and classifying I/O patterns. We should look to see what's out there and whether we can use it.

Granularity of timing and statistics?

Is a single number, total execution time, sufficient for characterizing results from benchmarking runs? Do we need to be able to take finer granularity measures?

Measuring I/O performance can be really, really hard. There are many pitfalls and gotchas. One is simply sanity checking to ensure a given test was indeed run with the settings you thought you specified. I can't tell you how often passing things like MPI-IO hints has silently failed due to misspellings in the hint names. This kind of thing is fraught with peril, and we need to ensure we do plenty of error checking to confirm the settings the test is supposed to run with are indeed being set.
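
For the MPI-IO hints case in particular, a cheap sanity check is to read the hints back from the open file handle and compare them against what the test meant to set; a minimal sketch:

    #include <stdio.h>
    #include <mpi.h>

    /* Print the hints the MPI-IO layer actually accepted for an open file.
       Misspelled hint keys are typically dropped silently, so comparing this
       output with the intended settings catches that whole class of mistakes. */
    void report_hints(MPI_File fh)
    {
        MPI_Info info;
        int nkeys, i, flag;
        char key[MPI_MAX_INFO_KEY + 1], val[MPI_MAX_INFO_VAL + 1];

        MPI_File_get_info(fh, &info);
        MPI_Info_get_nkeys(info, &nkeys);
        for (i = 0; i < nkeys; i++) {
            MPI_Info_get_nthkey(info, i, key);
            MPI_Info_get(info, key, MPI_MAX_INFO_VAL, val, &flag);
            if (flag)
                printf("hint in effect: %s = %s\n", key, val);
        }
        MPI_Info_free(&info);
    }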

Fine-granularity data is often helpful in diagnosing why a test went wrong, in concluding that a test is an outlier, or in identifying what part of the software stack went wrong. So, there is value in fine-grained timing data. Mark H. mentions that such data can also be gainfully exploited, along with some statistical analysis, to better understand performance and results.

Key next steps

Footnotes

1 Note that this problem is somewhat similar in spirit to the problem of multiple run-time type interfaces (HDF5, MPI, the C/C++ programming language, etc.). Presently, a user winds up having to learn multiple interfaces for defining run-time types and then re-specify the same type to the different interfaces to obtain consistent behavior across them.


Pdfs

Uploaded PDF files


Changes to Terminology

This document uses the terms Poor Man's and Rich Man's parallel I/O to refer to two different modalities of handling parallel I/O. These terms have fallen out of favor and are replaced by Multiple Independent File (MIF) and Single Shared File (SSF) parallel I/O. The new terminology is intended to capture where, and by what software, concurrency is handled. In MIF, concurrency is handled explicitly by the application in writing independent files. In SSF, because a single file is produced, concurrency must be handled implicitly by the file system.

Poor Man's vs Rich Man's Parallel IO

Poor Man's and Rich Man's Parallel I/O differ in one important respect: Poor Man's achieves concurrent parallelism by writing to multiple, different files while Rich Man's writes to a single, shared file. That's basically it in a nutshell.

People often confuse Poor Man's parallel with file-per-processor. That is a misconception. While file-per-processor does represent one extreme end-point in the spectrum of Poor Man's use cases, there is no fundamental reason to restrict Poor Man's to file-per-processor. In fact, there are a variety of reasons not to do it that way. Indeed, codes like Ale3d have a knob to control the number of files used for concurrent I/O, and that knob is entirely independent of the number of processors the code is run on. Using PMPIO in Ale3d, you can run on 1024 processors and write to 10 files, or 12, or 7, or 128. Typically, the number chosen is on par with the number of I/O nodes the code can see during its execution as well as the relative beefiness of those I/O nodes in handling I/O requests from multiple processors.
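
The core idea can be sketched with plain MPI (this is not the actual pmpio.h API, and the real code maps ranks to file groups differently; it only shows the baton-passing pattern that lets the number of files be chosen independently of the number of processors):

    #include <stdio.h>
    #include <mpi.h>

    /* Each rank belongs to one of num_files groups; within a group, a baton
       is passed so only one rank at a time has the group's file open. */
    void mif_write(MPI_Comm comm, int num_files)
    {
        int rank, size, baton = 0;
        char fname[64];
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int group = rank % num_files;   /* which file this rank writes to */
        int prev  = rank - num_files;   /* rank ahead of us in the group  */
        int next  = rank + num_files;   /* rank behind us in the group    */

        if (prev >= 0)                  /* wait for the baton */
            MPI_Recv(&baton, 1, MPI_INT, prev, group, comm, &status);

        snprintf(fname, sizeof(fname), "dump_%03d.dat", group);
        /* ...first rank in the group creates fname, later ranks open and
           append their portion, then close the file... */

        if (next < size)                /* hand the baton off */
            MPI_Send(&baton, 1, MPI_INT, next, group, comm);
    }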

What are the strengths and weaknesses here? If we are standing in the filesystem looking upwards and watching I/O activity, is there any fundamental difference in the two?

With the exception of additional metadata requirements due to additional filesystem object names (e.g. files), as you stand in the filesystem and look upwards at the I/O requests pouring down from an application, there is little to distinguish the two I/O paradigms.

Disadvantages of PMPIO

Advantages of PMPIO

Poor Man's Parallel I/O and String-based Metadata

When a large, parallel object is stored in its decomposed state (either to separate files or to separate objects within a single shared file), each piece winds up getting a unique name string that differs from the other pieces' name strings in only a few characters. In other words, there are a lot of very similar strings. Relative to the total raw data being stored, the storage cost for these unique name strings is not significant.

What is significant, however, is the potential scaling issues of the underlying software responsible for managing all these unique but highly similar name strings. At the filesystem level, Lustre winds up having to manage hundreds of thousands of filenames for a given single dataset. Likewise, in an HDF5 file, the HDF5 library winds up having to manage hundreds of thousands of object names. Again, the storage cost is not so much the issue as the scaling of the data structures the software uses to manage these names.

In the Silo library, a caller is responsible for constructing a multi-block object that holds all the individual object piece names. This introduces a scaling problem. First, only one processor can be responsible for writing Silo's multi-block objects to the root file. So, all the names need to get created in one large array on one processor and then written to the file. To correct for this, Silo was recently enhanced to support an sprintf-like namescheme pattern that defines a rule for computing an object name from its position (index or offset) in the list of names. This enhanced multi-block object avoids the need to gather a bunch of similarly named strings into one linear list. However, it still does not address the issue that all the component objects in the file still have names associated with them that need to be managed by the underlying software (HDF5 and/or Lustre).
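
The idea behind that enhancement, in miniature (a concept sketch only, not Silo's actual namescheme API):

    #include <stdio.h>

    /* Instead of gathering and storing hundreds of thousands of
       near-identical strings, store only a printf-style rule and compute
       the i-th block name on demand. */
    const char *block_name(int i)
    {
        static char name[64];
        snprintf(name, sizeof(name), "domain_%06d/mesh", i);
        return name;
    }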

The question arises whether Lustre or HDF5 could be enhanced to support the names of files/objects in a given collection in a similar way, thereby avoiding whatever scaling issues may arise in managing the names as the number of unique but highly similar name strings grows.


Quarter 1 Activities


HDF5 Release Management for Advanced HPC Features

The HDF Group has plans to change the way it manages releases of HDF5 software in the near future, probably by the time of the 1.8.7 release scheduled for May, 2011. Between major releases, The HDF Group will maintain and release what amounts to two variants of supported releases. The HDF Group calls these two types of releases stable and feature releases.

Compatibility

The stable release series will be designed for maximal compatibility with the most recent previously supported major release. This variant of an HDF5 release is designed for users who want bug fixes but cannot tolerate either API or file format changes. Trivial additions to the API are not considered changes in this context.

As part of addressing HPC needs, we recognize that HDF5 may require changes to either the API or the file format. Between major releases, new capabilities that require such changes will be supported only in the feature series of releases. Feature releases will differ from the stable releases primarily in compatibility of the API and/or file format.

Software Compatibility

Application code using the advanced features of a feature release of HDF5 will not be compatible with a stable release. It will compile and link only with a feature release. On the other hand, application code using only stable release features will compile with either variant of a release.

Data File Compatibility

Files resulting from the use of advanced features of a feature release may or may not be compatible with a stable release, depending on circumstances. Software using a stable release will likely not be able to read these files; only software using a feature release will be able to read them. On the other hand, files produced using a stable release will be compatible with either variant of a release.

Managing Compatibility

There will be conditional compilation macros and run-time functions to let an application enforce or allow the use of advanced features that could result in incompatibilities with stable releases. The method(s) will be similar to how the HDF5 library presently manages 1.6/1.8 API compatibility. In addition, if the need arises and as a last resort, there are options available to develop tools to convert advanced files to be compatible with a stable release.
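
A sketch of what that guard pattern might look like in application code (H5_VERSION_GE() is HDF5's existing version-check macro; the function and version number below are hypothetical placeholders, since the feature-release macros have not been defined yet):

    #include "hdf5.h"

    /* Enable an advanced, feature-release-only capability when building
       against a feature release; fall back to defaults otherwise.
       H5Pset_some_hpc_feature() is a made-up name for this sketch. */
    void configure_fapl(hid_t fapl)
    {
    #if H5_VERSION_GE(1, 9, 0)               /* hypothetical feature series */
        H5Pset_some_hpc_feature(fapl);
    #else
        (void) fapl;                         /* stable release: defaults */
    #endif
    }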

Release Intervals

At major release intervals, the feature and stable releases will be brought into sync. These intervals may be longer than 1-2 years. Thus, between major releases, in order to deliver HPC-specific advanced features -- which, if introduced to a stable release, would negatively impact compatibility, contrary to the goals of a stable release -- The HDF Group will deliver those features in a series of feature releases. In turn, this also means that in those cases where HPC-specific enhancements could conceivably be put into a stable release without causing API or format compatibility issues, they will nonetheless be required to go into at least the feature release series. This is because it is important that all HPC-relevant functionality be kept together in a common version of the library. Feature releases can and will be made frequently, perhaps as often as every 2-3 months.

As per a new version numbering scheme, feature releases will have an odd major number while stable releases will have an even major number.


SWTestStatus
[[ftp://ftp.hdfgroup.uiuc.edu/pub/outgoing/acheng/SW_Regularly_Tested_LLNL.pdf]]


Wiki

This redmine project site is devoted to LLNL's contract with HDF Group to address HPC specific enhancements of HDF5 library. A detailed statement of work is here.

This site will serve as a platform for all project related documentation and to provide visibility on project planning and progress. [Added by Neelam]

The material here is not presently organized in any particular way. As the project evolves, all participants should feel free to re-organize material as necessary, including editing existing pages, adding new pages, and/or changing the overall structure of the wiki. All developers' comments are welcomed and encouraged. Since the material is revision controlled, no one should worry about making changes that result in the loss of something someone else valued.

Extended topic areas

Other resources


ZFP Compressor

See H5Z-ZFP, an HDF5 compression plugin. This plugin addresses many of the issues discussed in the notes below.

Notes from a short teleconference with Quincey Koziol and Peter Lindstrom on implementation of zfp within HDF5 and Silo.

Peter has developed a new floating point compression algorithm named zfp.

Some of the interesting features of ZFP are

Lossless vs. Lossy compression

Just to give an idea of zfp's utility, below are a few images that show how zfp compares with other compressors. By comparison, gzip --best gives you 1.04x lossless compression on this data set, while fpzip (a lossless compressor also written by Peter Lindstrom) gives 1.11x lossless compression. Even at 89x zfp compression (i.e. 0.72 bits/value compared to the original 64 bits/value), you'll be hard pressed to tell that the data has been compressed. It's conceivable this is good enough for vis dumps, or for expanding the effective memory to allow much larger data sets to be visualized in-core. For instance, a 4K^3 grid fits comfortably on my 8 GB laptop at this compression level. We're working on adding such "decompression on demand" to VisIt. A similar capability could easily be supported through HDF5 via chunking.


HDF5 Filter using ZFP

We have developed an HDF5 filter plugin that stores its metadata efficiently, integrates with HDF5 features, and supports all of ZFP's modes.
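
As a minimal sketch of how such a filter gets applied through HDF5's generic filter interface (the id 32013 is H5Z-ZFP's registered filter id, stated here as an assumption; the plugin's own convenience routines for selecting ZFP modes are not shown):

    #include "hdf5.h"

    /* Create a dataset creation property list that chunks the dataset and
       requests the dynamically loaded ZFP filter.  Passing no cd_values
       leaves the filter to use its default parameters. */
    hid_t make_zfp_dcpl(int ndims, const hsize_t *chunk_dims)
    {
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, ndims, chunk_dims);
        H5Pset_filter(dcpl, 32013, H5Z_FLAG_MANDATORY, 0, NULL);
        return dcpl;
    }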

Other Stuff

Attachments