Tutorial - Users 0 - RSF file format


Objectives

  • Overview of Madagascar
  • Madagascar's file format (RSF)
  • Where and how data is stored in Madagascar

Overview of Madagascar

What Madagascar is

Madagascar is a modern, open-source software package used primarily for geophysical data processing and signal processing. Madagascar is powerful: users can draw on hundreds of previously developed algorithms and programs for signal processing. Some Madagascar programs use parallel computing technologies such as the Message Passing Interface (MPI) and OpenMP, bringing high-performance computing out of the box. Madagascar is flexible: users can import data from many different formats into a simple data structure used by all Madagascar programs. If you can't find what you need, you can write a Madagascar program in the language of your choice (C, C++, Fortran, Python, Java, Octave, Matlab, ...).

Madagascar is focused on computational reproducibility. Using Python and LaTeX, documents and numerical experiments can be easily reproduced on other machines, which allows greater collaboration and preserves your contribution to the scientific community for years to come. Greater reproducibility increases transparency in research and greatly reduces the time others spend reimplementing your work.

What Madagascar is NOT

Madagascar is not a package for symbolic computation (e.g. Mathematica). Madagascar is not designed to replace commercial processing packages. Madagascar is not highly optimized for specific hardware.

Madagascar's file format (RSF)

To truly understand Madagascar, one has to understand its basic building block: the RSF file format. RSF is the common file format that all Madagascar programs read from and write to. Beyond that requirement, Madagascar programs can do whatever they like. Individual programs can use MPI for parallel processing, link to various scientific or graphics libraries, or do anything else you might want. However, every program must read and/or write RSF files to be considered part of the Madagascar processing package. Thus, it's crucial to understand how the RSF file format works and how it is designed.

RSF stands for regularly sampled format. As the name implies, RSF files correspond to arrays (hypercubes) that are rectangular, or regularly sampled. RSF is also designed to be flexible and performant, so an RSF file is not just a single file stored in your local folder (unlike Seismic Unix). Instead, RSF files are usually composed of two parts: a header file and a binary file.

Overview of the header/binary separation for RSF files.

Header file

The header file is stored in your local directory with the suffix .rsf. All files with the .rsf suffix are automatically considered to be RSF header files. The header contains the following information:

  • A pointer to the actual data file (see binary data file below).
  • Hypercube dimensions (origin, spacing, and number of points).
  • Hypercube parameters (labels, units).
  • A processing workflow history, which lists all programs used to create this file along with their parameters and dates/times.
  • Special flags to indicate joint header/binary RSF files (see below).

Because the header file stores very little actual data, it is stored in plain text (ASCII). You can open a header file with your favorite text editor and read the parameters, or edit them if you'd like (note: don't edit the values unless you know what you're doing).
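Because the header is plain text, its parameters can be collected with a few lines of code. The following is a minimal sketch in Python, not Madagascar's own reader. It assumes the header holds key=value pairs (n1, o1, d1, label1, unit1, in, and so on), that string values are wrapped in double quotes, and that a later assignment replaces an earlier one. The function name parse_rsf_header and the sample header text are hypothetical and only serve as an illustration.

```python
import re

# key=value, where the value is either "double quoted" or a bare token
_PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

def parse_rsf_header(text):
    """Collect key=value pairs from header text; later pairs replace earlier ones."""
    params = {}
    for key, quoted, bare in _PAIR.findall(text):
        params[key] = bare or quoted
    return params

# Hypothetical header contents, written by hand for this example.
sample_header = (
    'sfspike\ttutorial:\tuser@host\n'
    '\tn1=4 n2=3 o1=0 d1=0.004\n'
    '\tlabel1="Time" unit1="s"\n'
    '\tdata_format="native_float" esize=4\n'
    '\tin="/tmp/local/test.rsf@"\n'
)

params = parse_rsf_header(sample_header)
print(params["n1"], params["d1"], params["label1"], params["in"])
```

All values come back as strings, so a caller would convert n1 to an integer or d1 to a float as needed.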

Binary file

The binary file contains the actual data that the header file points to. The binary data file is typically stored elsewhere on your local system (or even remotely) because it can be very large (many TB). Binary files have the suffix .rsf@. To find out where your binary files are being stored, you can view the header of a specific RSF file, or you can look in the location specified by your DATAPATH environment variable. By default, binary files are placed in the location given by DATAPATH. For example, with DATAPATH=/tmp/local/, the binary for test.rsf would be written to /tmp/local/test.rsf@.
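The two ways of finding a binary file described above can be sketched in Python. This is an illustration under stated assumptions rather than Madagascar's actual lookup logic: it assumes the header records the binary location as in="..." and that the default location is simply DATAPATH followed by the header's file name and an @ sign, as in the example above. The function name locate_binary is hypothetical.

```python
import re

def locate_binary(header_text, header_name, datapath):
    """Return the path recorded in the header, else a DATAPATH-based guess."""
    matches = re.findall(r'\bin="([^"]*)"', header_text)
    if matches:
        # Assumption: the last in= entry is the one that counts.
        return matches[-1]
    # Assumption: DATAPATH is used as a plain prefix, so it should end with "/".
    return datapath + header_name + "@"

print(locate_binary('n1=4\n\tin="/data/run1/test.rsf@"\n', "test.rsf", "/tmp/local/"))
print(locate_binary("n1=4\n", "test.rsf", "/tmp/local/"))
```

The first call prints the path stored in the header, and the second falls back to /tmp/local/test.rsf@.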

Binary files are not readable in a text editor. The only way to manipulate them is to use the built-in Madagascar programs, or to write your own RSF reader.
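To give a feel for what writing your own RSF reader involves, here is a minimal Python sketch using only the standard library. It assumes a two-dimensional dataset of 4-byte floats in the machine's native byte order (data_format="native_float"), with n1 as the fastest axis, and it ignores every other case. The function name read_rsf_floats and the demo files are hypothetical; the demo writes its own tiny header and binary so that it runs without Madagascar installed.

```python
import os
import re
import struct
import tempfile

def read_rsf_floats(header_path):
    """Read a 2-D native_float dataset as a list of n2 traces with n1 samples each."""
    with open(header_path) as f:
        text = f.read()
    pairs = re.findall(r'(\w+)=(?:"([^"]*)"|(\S+))', text)
    params = {key: (bare or quoted) for key, quoted, bare in pairs}
    if params.get("data_format") != "native_float":
        raise ValueError("this sketch only handles native_float data")
    n1 = int(params.get("n1", 1))
    n2 = int(params.get("n2", 1))
    count = n1 * n2
    # The header's in= entry points at the binary file.
    with open(params["in"], "rb") as f:
        raw = f.read(4 * count)
    values = struct.unpack("=%df" % count, raw)
    return [list(values[i * n1:(i + 1) * n1]) for i in range(n2)]

# Demo: build a tiny header/binary pair by hand, then read it back.
tmpdir = tempfile.mkdtemp()
binary_path = os.path.join(tmpdir, "demo.rsf@")
header_path = os.path.join(tmpdir, "demo.rsf")
with open(binary_path, "wb") as f:
    f.write(struct.pack("=6f", 0, 1, 2, 3, 4, 5))
with open(header_path, "w") as f:
    f.write('n1=3 n2=2 data_format="native_float" esize=4\n')
    f.write('in="%s"\n' % binary_path)

traces = read_rsf_floats(header_path)
print(traces)
```

In practice the built-in Madagascar programs and language APIs are the safer route, since they handle the other data types and combined files that this sketch skips.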

WARNING: There are pros and cons to this style of file storage. The pro is that we avoid bogging down our local file system with very large files. One con is that we might pay a performance penalty for keeping our data remotely. The major issue, though, is that we can't just move our headers or binaries around on the file system. If we move the binary data without updating the header, the header no longer points to the right data, and the binary becomes an orphaned data file. We can avoid this issue by using combined header/binary files.
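The broken-link problem can be checked for with a short script. The Python sketch below assumes the header records its binary as in="..." and, as a further assumption, that a combined file marks itself with in="stdin". The function name binary_is_reachable and the demo files are hypothetical. The demo builds a header/binary pair, moves the binary, and shows that the header then points at nothing.

```python
import os
import re
import tempfile

def binary_is_reachable(header_path):
    """Return True if the file named by the header's last in= entry exists."""
    with open(header_path, errors="replace") as f:
        text = f.read()
    matches = re.findall(r'\bin="([^"]*)"', text)
    if not matches:
        return False
    target = matches[-1]
    if target == "stdin":
        # Assumption: combined header/binary files carry their data inline.
        return True
    return os.path.isfile(target)

# Demo with hand-made files in a temporary directory.
tmpdir = tempfile.mkdtemp()
binary_path = os.path.join(tmpdir, "demo.rsf@")
header_path = os.path.join(tmpdir, "demo.rsf")
with open(binary_path, "wb") as f:
    f.write(b"\x00" * 16)
with open(header_path, "w") as f:
    f.write('n1=4 data_format="native_float" esize=4\n')
    f.write('in="%s"\n' % binary_path)

before_move = binary_is_reachable(header_path)
os.rename(binary_path, os.path.join(tmpdir, "moved.rsf@"))
after_move = binary_is_reachable(header_path)
print(before_move, after_move)
```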

Combined header/binary files

Combined header/binary files are exactly what the name says: the header and binary are attached to each other, so a single file holds both the header information and the binary data. The advantage is that a combined header/binary file can be moved around on the file system without breaking the link between header and binary. Combined header/binary files can be made by specifying the parameter --out=stdout for any Madagascar program.

Recap

  • Header file suffix: .rsf
  • Binary file suffix: .rsf@
  • RSF files are header-binary separated (stored in different locations).
  • The header file contains all the information about the binary file (dimensions, etc.) and is human readable.
  • The binary file contains the raw binary data and is not human readable.
  • Don't move header or binary files without first combining them into a single header/binary file.

Additional information

Guide to RSF file format