Difference between revisions of "Tutorial"

From Madagascar
Revision as of 17:13, 23 October 2011

This page was created from the LaTeX source in book/tutorial/ using latex2wiki

Introduction

Welcome to a brief introduction to Madagascar. The purpose of this document is to teach new users how to use the powerful tools in Madagascar to process data, produce documents, and build their own Madagascar programs. To introduce you to Madagascar gradually, we have created a series of tutorials that are targeted to distinct audiences and designed to make you an experienced Madagascar user in a short time. The tutorials are divided by interest into three main categories:

  • Users learn about Madagascar, how to use the processing programs, and how to build scripts.
  • Authors learn how to build reproducible documents using Madagascar.
  • Developers build new Madagascar programs that add functionality to Madagascar.

Each tutorial is designed to be completed in a short period of time. Additionally, each tutorial has hands-on examples that you should be able to reproduce on your computer as you go along. Most tutorials use scripts that you can edit, modify, or play with to gain further experience and understanding. By the end of the tutorial series, you should be able to use all of the tools inside of Madagascar. Please note that this tutorial series does not explicitly show you how to process certain types of data, or how to perform common data processing operations (e.g. CMP semblance picking, time migration, etc.). Additional tutorials on those specific subjects will be added over time. The purpose of this document is simply to familiarize you with the Madagascar framework in a general sense. Before you go on, here are some notes on notation:

  • important names and program names are usually bold in the text. For example: sfwindow
  • code snippets are always in the following formatting:
     sfwindow 

Users

The Users tutorials demonstrate how to use the Madagascar framework to create, process and visualize data, and to create reproducible scripts for processing data. The main goals of the Users tutorials are to learn about:

  • the Madagascar framework,
  • the RSF file format,
  • the command line interface,
  • how to interact with files on the command line,
  • commonly used programs,
  • how to make plots in Madagascar,
  • how to make reproducible scripts,
  • how to use SCons and Python,
  • how to visualize your data.

By the end of this tutorial group you should be able to fully use all of Madagascar's built-in tools for data processing and scripting. By using these tools, you'll be able to process data ranging from tens of megabytes to tens of terabytes in a reproducible fashion.

Introduction to Madagascar

To begin, let's talk about the core principles of Madagascar, and the RSF file format.

Figure: The hierarchy of Madagascar (layers from top to bottom: LaTeX, Python, SCons, then VPLOT and Programs side by side, then the RSF file format). Fundamentally, everything builds off of the RSF file format. As you go up the chain the complexity level increases, but the capabilities of the processing package increase.

Madagascar's design

Madagascar is built in layers. The bottom layer is the RSF file format, a common exchange format that all Madagascar programs use. Because RSF is an open exchange format, non-Madagascar programs can also read and write it. The next level contains the Madagascar programs themselves, which manipulate RSF files to process data. Alongside this level is the VPLOT graphics library, which allows users to plot and visualize RSF files. One level above the core programs are the scripting utilities in Python and SCons. These utilities allow users to write powerful scripts that can perform even the most advanced data processing tasks. The top level adds support for LaTeX, which allows you to produce documents that combine the features of Madagascar with the powerful typesetting of LaTeX. Throughout these tutorials, we will examine all of these components and demonstrate how they can be used individually as well as together. When combined, the individual components of Madagascar allow us to conduct experiments, process data, visualize our results, make reproducible scripts that we can share with others, and write papers to document our experiments. Thus, Madagascar is one of the first fully integrated research environments.

RSF file format

As previously mentioned, the lowest level of Madagascar is the RSF file format, which is the format used to exchange information between Madagascar programs. Conceptually, the RSF file format is one of the easiest to understand, as RSF files are simply regularly sampled hypercubes of information. For reference, a hypercube is best visualized as an N-dimensional array, where N ranges from 1 to 9 in Madagascar. RSF hypercubes are defined by two files: the header and the binary file. The header file contains information about the dimensionality of the hypercube as well as the data contained within the hypercube. Information contained in the header file includes the following:

  • number of elements on all axes,
  • the origin of the axes,
  • the sampling interval of elements on the axes,
  • the type of the elements (e.g. float, integer),
  • the size of the elements (e.g. single or double precision),
  • and the location of the actual binary file.
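The header contents listed above can be illustrated with a small Python sketch. It assumes a simplified header made only of key=value pairs, using the standard RSF keys n#, o#, d#, esize, data_format and in; the function names and the example path are hypothetical, and real headers may contain extra history lines that this sketch does not try to handle.

```python
import shlex


def parse_rsf_header(text):
    """Collect key=value pairs from header text; later values override earlier ones."""
    params = {}
    for token in shlex.split(text, comments=True):
        if "=" in token:
            key, value = token.split("=", 1)
            params[key] = value
    return params


def axis_coordinates(params, axis):
    """Return the coordinate of every sample on one axis: origin + index * interval."""
    n = int(params[f"n{axis}"])
    o = float(params.get(f"o{axis}", 0.0))
    d = float(params.get(f"d{axis}", 1.0))
    return [o + i * d for i in range(n)]


# Hypothetical header for a 5 x 3 cube of single-precision floats.
example_header = '''
n1=5 o1=0 d1=0.004
n2=3 o2=100 d2=25
esize=4 data_format="native_float"
in="/tmp/data/example.rsf@"
'''

params = parse_rsf_header(example_header)
shape = [int(params[f"n{k}"]) for k in (1, 2)]
print(shape, params["data_format"], params["in"])
print(axis_coordinates(params, 2))
```

Because the cube is regularly sampled, three numbers per axis (count, origin, interval) are enough to recover the position of every sample.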

Since we often want to view this information without deciphering it, we store the header file as an ASCII text file in the local directory with the suffix .rsf . At any time, you can view or edit the contents of a header file using a text editor such as Vim or Emacs. The binary file is stored remotely (i.e. in a separate directory) and contains the actual hypercube data. Because the hypercube data can be very large (tens of GB or TB), we usually store the binary files in a remote directory with the suffix .rsf@ . The remote directory is specified by the user through the DATAPATH environment variable. The advantage of doing this is that we can store the large binary data file on a fast remote filesystem if we want, and keep the large data out of our local working directories.
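As a rough illustration of the header/binary split, the following Python sketch writes a tiny cube as a text header in one directory and a raw binary in another. The function write_rsf and its arguments are hypothetical; the data directory is passed in explicitly here rather than read from DATAPATH, and temporary directories stand in for the real locations.

```python
import os
import struct
import tempfile


def write_rsf(name, samples, n1, work_dir, datapath):
    """Write float samples as a text header in work_dir and a raw binary in datapath."""
    binary_path = os.path.join(datapath, name + ".rsf@")
    header_path = os.path.join(work_dir, name + ".rsf")
    # Raw single-precision samples, native byte order, no padding.
    with open(binary_path, "wb") as binary:
        binary.write(struct.pack(f"={len(samples)}f", *samples))
    n2 = len(samples) // n1
    # The header is plain text and records where the binary lives.
    with open(header_path, "w") as header:
        header.write(f"n1={n1} n2={n2} o1=0 d1=1 o2=0 d2=1\n")
        header.write('esize=4 data_format="native_float"\n')
        header.write(f'in="{binary_path}"\n')
    return header_path, binary_path


with tempfile.TemporaryDirectory() as work_dir, tempfile.TemporaryDirectory() as datapath:
    header_path, binary_path = write_rsf(
        "demo", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 3, work_dir, datapath
    )
    with open(header_path) as header:
        header_text = header.read()
    header_size = os.path.getsize(header_path)
    binary_size = os.path.getsize(binary_path)

print(header_text)
print(header_size, binary_size)
```

The header stays a few lines long no matter how large the cube is, while the binary grows as esize times the number of samples.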

Figure (fig:rsfformat): Cartoon of the RSF file format (a small Header box with an arrow pointing down to a large Binary box). The header file points to the binary file, and the two can be stored separately. The header file, which is text, is very small compared to the binary file.

Because the header and binary are separated from one another, it is possible to lose either the header or the binary for a particular RSF file. If the header is lost, we can simply reconstruct it using our previous knowledge of the data and a text editor. However, if we lose the binary file, we cannot reconstruct the data regardless of what we do. Therefore, you should try to avoid losing either the header or the binary data. The best way to guard against data loss is to make your research reproducible so that your results can be replicated later.
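Rebuilding a lost header can be sketched in Python as follows. This is only an illustration under stated assumptions: we still know the length of the first axis and the element size, the cube is two-dimensional, and the function name rebuild_header and the example path are hypothetical.

```python
def rebuild_header(binary_path, binary_size, n1, esize=4, data_format="native_float"):
    """Return header text for a 2-D cube whose first-axis length n1 is known."""
    samples, leftover = divmod(binary_size, esize)
    if leftover or samples % n1:
        raise ValueError("binary size is inconsistent with n1 and esize")
    n2 = samples // n1
    return (
        f"n1={n1} n2={n2}\n"
        f'esize={esize} data_format="{data_format}"\n'
        f'in="{binary_path}"\n'
    )


# A 4000-byte binary of 4-byte floats with 250 samples per trace.
rebuilt = rebuild_header("/data/lost.rsf@", binary_size=4000, n1=250)
print(rebuilt)
```

The origins and sampling intervals cannot be recovered from the file size, so they would still have to come from your own records of the data.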

Sometimes, though, we need to store RSF files for archiving or transfer them to other machines. Fortunately, we can avoid transferring the header and binary separately by using the combined header/binary format for RSF files. Files can be constructed in the combined header/binary format by specifying an additional parameter on the command line, in particular out=stdout , for any Madagascar program. The output file will then be header/binary combined, which allows you to transfer the file without fear of losing either the header or the binary. Be careful though: header/binary combined files can be very large, and might slow down your local filesystem. A best practice is to use combined header/binary files only when absolutely necessary for file transfers. Note: header/binary combined files are automatically converted to header/binary separate files when processed by a Madagascar program.
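The idea of a combined file can be sketched in Python as below. The layout used here is an assumption for illustration only: header text ending in in="stdin", then a three-byte separator (form feed, form feed, EOT), then the raw samples. The function names are hypothetical, and the RSF guides listed under Additional documentation should be consulted before relying on any of these details.

```python
import struct

# Assumed header terminator in combined files: form feed, form feed, EOT.
SEPARATOR = b"\x0c\x0c\x04"


def pack_rsf(header_text, samples):
    """Join header text and float samples into one byte string."""
    header = header_text.rstrip("\n") + '\nin="stdin"\n\n'
    binary = struct.pack(f"={len(samples)}f", *samples)
    return header.encode("ascii") + SEPARATOR + binary


def unpack_rsf(blob):
    """Split a combined byte string back into header text and float samples."""
    header, _, binary = blob.partition(SEPARATOR)
    samples = struct.unpack(f"={len(binary) // 4}f", binary)
    return header.decode("ascii"), list(samples)


blob = pack_rsf('n1=3 esize=4 data_format="native_float"', [1.0, 2.0, 3.0])
packed_header, packed_samples = unpack_rsf(blob)
print(packed_header)
print(packed_samples)
```

Because everything travels in one stream, nothing can be separated in transit, at the cost of carrying the full binary wherever the header goes.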


Additional documentation

For more complete documentation on the RSF file format see the following links:

A gentle guide to the RSF file format: http://reproducibility.org/wiki/Guide_to_RSF_file_format

A detailed guide to the RSF file format: http://reproducibility.org/wiki/RSF_Comprehensive_Description


Introduction to Madagascar

To begin, let's talk about the core principles of Madagascar, and the RSF file format. \setlength{\unitlength}{1in} �egin{figure} �egin{picture}(4,4.5)(0,0) \put(2,4){�ramebox(2,0.5){LaTeX}} \put(3,4){�ector(0,-1){0.5}} \put(2,3){�ramebox(2,0.5){Python}} \put(3,3){�ector(0,-1){0.5}} \put(2,2){�ramebox(2,0.5){SCons}} \put(3,2){�ector(-2,-1){1}} \put(0,1){�ramebox(2,0.5){VPLOT}} \put(3,2){�ector(2,-1){1}} \put(2,1){�ector(2,-1){1}} \put(4,1){�ector(-2,-1){1}} \put(4,1){�ramebox(2,0.5){Programs}} \put(2,0){�ramebox(2,0.5){RSF file format}} \end{picture} \caption{The hierarchy of Madagascar. Fundamentally, everything builds off of the RSF file format. As you go up the chain the complexity level increases, but the capabilities of the processing package increase.} \end{figure}

Madagascar's design

There are a few layers to Madagascar. At the bottom-most layer, is the RSF file format, which is a common exchange format that all Madagascar programs use. Non-Madagascar programs can also read/write to and from RSF because it is an open exchange format. The next level of Madagascar contains the actual Madagascar programs that manipulate RSF files to process data. Concurrent to this level is the VPLOT graphics library which allows users to plot and visualize RSF files. The scripting utilities in Python and SCons are up another level from the core programs. These scripting utilities allow users to make powerful scripts that can perform even the most advanced data processing tasks. The last level includes support for LaTeX which allows you to make documents combining the features of Madagascar with the powerful typesetting of LaTeX. Throughout the course of these tutorials, we will examine all of these components, and demonstrate how they can be used individually, as well as together. When combined, the individual components of Madagascar allow us to: conduct experiments, process data, visualize our results, make reproducible scripts that we can share with others, and write papers to document our experiments. Thus, Madagascar is one of the first fully integrated research environments.

RSF file format

As previously mentioned, the lowest level of Madagascar is the RSF file format, which is the format used to exchange information between Madagascar programs. Conceptually, the RSF file format is one of the easiest to understand, as RSF files are simply regularly sampled hypercubes of information. For reference, a hypercube is a hyper dimensional cube (or array) that can best be visualized as an N-dimensional array, where N=[1,9] in Madagascar. RSF hypercubes are defined by two files, the header and the binary file. The header file contains information about the dimensionality of the hypercube as well as the data contained within the hypercube. Information contained in the header file includes the following:

  • number of elements on all axes,
  • the origin of the axes,
  • the sampling interval of elements on the axes,
  • the type of elements in the axes (i.e. float, integer),
  • the size of the elements (e.g. single or double precision),
  • and the location of the actual binary file.

Since we often want to view this information about files without deciphering it, we store the header file as an ASCII text file in the local directory with the suffix .rsf . At any time, you can view or edit the contents of the header files using a text editor such as VIM or Emacs. The binary file is a file stored remotely (i.e. in a separate directory) that contains the actual hypercube data. Because the hypercube data can be very large ( of GB or TB) we usually store the binary files in a remote directory with the suffix .rsf@ . The remote directory is specified by the user using the DATAPATH environmental variable. The advantage to doing this, is that we can store the large binary data file on a fast remote filesystem if we want, and we can avoid working in local directories.

�egin{figure} \setlength{\unitlength}{1cm} �egin{picture}(12,7)(0,0)

   \put(2,6){�ramebox(2,2){Header}}
   \put(3,6){�ector(0,-1){2}}
   \put(2,0){�ramebox(10,4){Binary}}

\end{picture} \caption{Cartoon of the RSF file format. The header file points to the binary file, which can be separate from one another. The header file, which is text, is very small compared to the binary file.}

(fig:rsfformat)

\end{figure}

Because the header and binary are separated from one another, it is possible that we can lose either the header or binary for a particular RSF file. If the header is lost, then we can simply reconstruct the header using our previous knowledge of the data and a text editor. However, if we lose the binary file, then we cannot reconstruct the data regardless of what we do. Therefore, you should try and avoid losing either the header or binary data. The best way to avoid data loss is to make your research reproducible so that your results can be replicated later.

Sometimes though we need to store RSF files for archiving or to transfer to other machines. Fortunately, we can avoid transferring the header and binary separately by using the combined header/binary format for RSF files. Files can be constructed using the combined header/binary format by specifying additional parameters on the command line, in particular out=stdout , for any Madagascar program. The output file will then be header/binary combined, which allows you to transfer the file without fear for losing either the header or binary. Be careful though: header/binary combined files can be very large, and might slow down your local filesystem. A best practice is to only use combined header/binary files when absolutely necessary for file transfers. Note: header/binary combined files are automatically converted to header/binary separate files when processed by a Madagascar program.


Additional documentation

For more complete documentation on the RSF file format see the following links:

A gentle guide to the RSF file format: http://reproducibility.org/wiki/Guide_to_RSF_file_format

A detailed guide to the RSF file format: http://reproducibility.org/wiki/RSF_Comprehensive_Description


Introduction to Madagascar

To begin, let's talk about the core principles of Madagascar, and the RSF file format. \setlength{\unitlength}{1in} �egin{figure} �egin{picture}(4,4.5)(0,0) \put(2,4){�ramebox(2,0.5){LaTeX}} \put(3,4){�ector(0,-1){0.5}} \put(2,3){�ramebox(2,0.5){Python}} \put(3,3){�ector(0,-1){0.5}} \put(2,2){�ramebox(2,0.5){SCons}} \put(3,2){�ector(-2,-1){1}} \put(0,1){�ramebox(2,0.5){VPLOT}} \put(3,2){�ector(2,-1){1}} \put(2,1){�ector(2,-1){1}} \put(4,1){�ector(-2,-1){1}} \put(4,1){�ramebox(2,0.5){Programs}} \put(2,0){�ramebox(2,0.5){RSF file format}} \end{picture} \caption{The hierarchy of Madagascar. Fundamentally, everything builds off of the RSF file format. As you go up the chain the complexity level increases, but the capabilities of the processing package increase.} \end{figure}

Madagascar's design

There are a few layers to Madagascar. At the bottom-most layer, is the RSF file format, which is a common exchange format that all Madagascar programs use. Non-Madagascar programs can also read/write to and from RSF because it is an open exchange format. The next level of Madagascar contains the actual Madagascar programs that manipulate RSF files to process data. Concurrent to this level is the VPLOT graphics library which allows users to plot and visualize RSF files. The scripting utilities in Python and SCons are up another level from the core programs. These scripting utilities allow users to make powerful scripts that can perform even the most advanced data processing tasks. The last level includes support for LaTeX which allows you to make documents combining the features of Madagascar with the powerful typesetting of LaTeX. Throughout the course of these tutorials, we will examine all of these components, and demonstrate how they can be used individually, as well as together. When combined, the individual components of Madagascar allow us to: conduct experiments, process data, visualize our results, make reproducible scripts that we can share with others, and write papers to document our experiments. Thus, Madagascar is one of the first fully integrated research environments.

RSF file format

As previously mentioned, the lowest level of Madagascar is the RSF file format, which is the format used to exchange information between Madagascar programs. Conceptually, the RSF file format is one of the easiest to understand, as RSF files are simply regularly sampled hypercubes of information. For reference, a hypercube is a hyper dimensional cube (or array) that can best be visualized as an N-dimensional array, where N=[1,9] in Madagascar. RSF hypercubes are defined by two files, the header and the binary file. The header file contains information about the dimensionality of the hypercube as well as the data contained within the hypercube. Information contained in the header file includes the following:

  • number of elements on all axes,
  • the origin of the axes,
  • the sampling interval of elements on the axes,
  • the type of elements in the axes (i.e. float, integer),
  • the size of the elements (e.g. single or double precision),
  • and the location of the actual binary file.

Since we often want to view this information about files without deciphering it, we store the header file as an ASCII text file in the local directory with the suffix .rsf . At any time, you can view or edit the contents of the header files using a text editor such as VIM or Emacs. The binary file is a file stored remotely (i.e. in a separate directory) that contains the actual hypercube data. Because the hypercube data can be very large ( of GB or TB) we usually store the binary files in a remote directory with the suffix .rsf@ . The remote directory is specified by the user using the DATAPATH environmental variable. The advantage to doing this, is that we can store the large binary data file on a fast remote filesystem if we want, and we can avoid working in local directories.

�egin{figure} \setlength{\unitlength}{1cm} �egin{picture}(12,7)(0,0)

   \put(2,6){�ramebox(2,2){Header}}
   \put(3,6){�ector(0,-1){2}}
   \put(2,0){�ramebox(10,4){Binary}}

\end{picture} \caption{Cartoon of the RSF file format. The header file points to the binary file, which can be separate from one another. The header file, which is text, is very small compared to the binary file.}

(fig:rsfformat)

\end{figure}

Because the header and binary are separated from one another, it is possible that we can lose either the header or binary for a particular RSF file. If the header is lost, then we can simply reconstruct the header using our previous knowledge of the data and a text editor. However, if we lose the binary file, then we cannot reconstruct the data regardless of what we do. Therefore, you should try and avoid losing either the header or binary data. The best way to avoid data loss is to make your research reproducible so that your results can be replicated later.

Sometimes though we need to store RSF files for archiving or to transfer to other machines. Fortunately, we can avoid transferring the header and binary separately by using the combined header/binary format for RSF files. Files can be constructed using the combined header/binary format by specifying additional parameters on the command line, in particular out=stdout , for any Madagascar program. The output file will then be header/binary combined, which allows you to transfer the file without fear for losing either the header or binary. Be careful though: header/binary combined files can be very large, and might slow down your local filesystem. A best practice is to only use combined header/binary files when absolutely necessary for file transfers. Note: header/binary combined files are automatically converted to header/binary separate files when processed by a Madagascar program.


Additional documentation

For more complete documentation on the RSF file format see the following links:

A gentle guide to the RSF file format: http://reproducibility.org/wiki/Guide_to_RSF_file_format

A detailed guide to the RSF file format: http://reproducibility.org/wiki/RSF_Comprehensive_Description


Introduction to Madagascar

To begin, let's talk about the core principles of Madagascar, and the RSF file format. \setlength{\unitlength}{1in} �egin{figure} �egin{picture}(4,4.5)(0,0) \put(2,4){�ramebox(2,0.5){LaTeX}} \put(3,4){�ector(0,-1){0.5}} \put(2,3){�ramebox(2,0.5){Python}} \put(3,3){�ector(0,-1){0.5}} \put(2,2){�ramebox(2,0.5){SCons}} \put(3,2){�ector(-2,-1){1}} \put(0,1){�ramebox(2,0.5){VPLOT}} \put(3,2){�ector(2,-1){1}} \put(2,1){�ector(2,-1){1}} \put(4,1){�ector(-2,-1){1}} \put(4,1){�ramebox(2,0.5){Programs}} \put(2,0){�ramebox(2,0.5){RSF file format}} \end{picture} \caption{The hierarchy of Madagascar. Fundamentally, everything builds off of the RSF file format. As you go up the chain the complexity level increases, but the capabilities of the processing package increase.} \end{figure}

Madagascar's design

There are a few layers to Madagascar. At the bottom-most layer, is the RSF file format, which is a common exchange format that all Madagascar programs use. Non-Madagascar programs can also read/write to and from RSF because it is an open exchange format. The next level of Madagascar contains the actual Madagascar programs that manipulate RSF files to process data. Concurrent to this level is the VPLOT graphics library which allows users to plot and visualize RSF files. The scripting utilities in Python and SCons are up another level from the core programs. These scripting utilities allow users to make powerful scripts that can perform even the most advanced data processing tasks. The last level includes support for LaTeX which allows you to make documents combining the features of Madagascar with the powerful typesetting of LaTeX. Throughout the course of these tutorials, we will examine all of these components, and demonstrate how they can be used individually, as well as together. When combined, the individual components of Madagascar allow us to: conduct experiments, process data, visualize our results, make reproducible scripts that we can share with others, and write papers to document our experiments. Thus, Madagascar is one of the first fully integrated research environments.

RSF file format

As previously mentioned, the lowest level of Madagascar is the RSF file format, which is the format used to exchange information between Madagascar programs. Conceptually, the RSF file format is one of the easiest to understand, as RSF files are simply regularly sampled hypercubes of information. For reference, a hypercube is a hyper dimensional cube (or array) that can best be visualized as an N-dimensional array, where N=[1,9] in Madagascar. RSF hypercubes are defined by two files, the header and the binary file. The header file contains information about the dimensionality of the hypercube as well as the data contained within the hypercube. Information contained in the header file includes the following:

  • number of elements on all axes,
  • the origin of the axes,
  • the sampling interval of elements on the axes,
  • the type of elements in the axes (i.e. float, integer),
  • the size of the elements (e.g. single or double precision),
  • and the location of the actual binary file.

Since we often want to view this information about files without deciphering it, we store the header file as an ASCII text file in the local directory with the suffix .rsf . At any time, you can view or edit the contents of the header files using a text editor such as VIM or Emacs. The binary file is a file stored remotely (i.e. in a separate directory) that contains the actual hypercube data. Because the hypercube data can be very large ( of GB or TB) we usually store the binary files in a remote directory with the suffix .rsf@ . The remote directory is specified by the user using the DATAPATH environmental variable. The advantage to doing this, is that we can store the large binary data file on a fast remote filesystem if we want, and we can avoid working in local directories.

\begin{figure}
\setlength{\unitlength}{1cm}
\begin{picture}(12,7)(0,0)
   \put(2,6){\framebox(2,2){Header}}
   \put(3,6){\vector(0,-1){2}}
   \put(2,0){\framebox(10,4){Binary}}
\end{picture}
\caption{Cartoon of the RSF file format. The header file points to the binary file, which can be separate from one another. The header file, which is text, is very small compared to the binary file.}
\label{fig:rsfformat}
\end{figure}

Because the header and binary are stored separately, it is possible to lose one of them. If the header is lost, we can reconstruct it with a text editor from our knowledge of the data. If the binary file is lost, however, the data cannot be recovered. You should therefore take care not to lose either file, the binary above all. The best protection against data loss is to make your research reproducible so that your results can be regenerated later.
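
Reconstructing a lost header amounts to writing a few key=value lines by hand. The Python sketch below generates such text from known dimensions. The key names follow the usual RSF conventions, while the function name, the example dimensions, and the binary path are hypothetical. Treat it as an illustration of what you would type in a text editor, not as an official tool.

```python
def build_header(shape, origins, steps, binary_path,
                 esize=4, data_format="native_float"):
    """Return RSF-style header text for a regularly sampled hypercube."""
    lines = []
    for axis, (n, o, d) in enumerate(zip(shape, origins, steps), start=1):
        lines.append(f"n{axis}={n} o{axis}={o} d{axis}={d}")
    lines.append(f'esize={esize} data_format="{data_format}"')
    lines.append(f'in="{binary_path}"')
    return "\n".join(lines) + "\n"


# Hypothetical dataset: 1000 samples per trace at 4 ms, 24 traces 25 m apart.
header = build_header(
    shape=(1000, 24),
    origins=(0.0, 0.0),
    steps=(0.004, 25.0),
    binary_path="/var/tmp/shots.rsf@",
)
print(header)
```

The reconstruction is only as good as your record of the dimensions and sampling, which is one more reason to keep the processing script that produced the data.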

Sometimes, though, we need to archive RSF files or transfer them to other machines. Fortunately, we can avoid transferring the header and binary separately by using the combined header/binary format. Any Madagascar program will write its output in the combined format if you add the parameter out=stdout on the command line. The output is then a single file containing both header and binary, which you can transfer without fear of losing either part. Be careful, though: combined files can be very large and might slow down your local filesystem. A best practice is to use combined header/binary files only when absolutely necessary for file transfers. Note: combined files are automatically converted to separate header and binary files when processed by a Madagascar program.
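
The idea of a combined file can be sketched in Python as header text followed by a separator and then the raw samples. The three separator bytes used here (two form feeds and an end-of-transmission character) and the in="stdin" entry are assumptions based on our reading of the format, so check them against the detailed RSF description before relying on them. The function names and sample values are hypothetical.

```python
import struct

# Assumed marker between the text header and the binary samples.
HEADER_END = b"\x0c\x0c\x04"


def pack_combined(header_text, data):
    """Join header text and raw bytes into one combined blob."""
    header = header_text.encode("ascii") + b'in="stdin"\n'
    return header + HEADER_END + data


def unpack_combined(blob):
    """Split a combined blob back into header text and raw bytes."""
    header, marker, data = blob.partition(HEADER_END)
    if not marker:
        raise ValueError("no header/binary separator found")
    return header.decode("ascii"), data


# Four single-precision samples, all exactly representable as floats.
samples = struct.pack("4f", 1.0, 2.0, 3.0, 4.0)
blob = pack_combined('n1=4 o1=0 d1=1 esize=4 data_format="native_float"\n', samples)

header_text, data = unpack_combined(blob)
values = struct.unpack("4f", data)
print(header_text, values)
```

Because everything sits in one stream, a combined file can be copied or piped as a single unit, at the cost of mixing large binary data into the directory where the header lives.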


Additional documentation

For more complete documentation on the RSF file format see the following links:

A gentle guide to the RSF file format: http://reproducibility.org/wiki/Guide_to_RSF_file_format

A detailed guide to the RSF file format: http://reproducibility.org/wiki/RSF_Comprehensive_Description

