MPI IO
Parallel Distributed Systems
File Management I/O
Peter Collins, Khadouj Fikry
File Management I/O
Overview
Problem Statement: as data sizes grow rapidly, I/O parallelization is becoming a necessity to avoid scalability bottlenecks in almost every application.
General: File management I/O allows large data files to be distributed among multiple computing nodes, where the instructions are executed.
The goal: use parallelism to increase bandwidth and reduce execution time when dealing with large data sets.
File Management I/O
Challenge: I/O can be challenging to implement, coordinate, and optimize, especially when dealing directly with the file system or the network protocol layer.
Solution: specialist-implemented infrastructures provide an intermediate layer that coordinates data access and maps from the application layer to the I/O layer.
Examples: MPI-IO, NFS, PVFS, Hadoop, Parallel HDF5, Parallel netCDF, T3PIO, ...
Sequential I/O
[Figure: processes 0-3 send Data 0-3 to a single process, which writes them to one file.]
• Very simple.
• The single writer is a performance bottleneck.
Parallel I/O: Multiple Files
[Figure: processes 0-3 each write their Data 0-3 to separate files (File 1 through File 4).]
• Better improvement compared to sequential I/O.
• Creates the need to manage the files and aggregate the results.
About: MPI-IO
The objective of MPI-IO is to read from and write to a single file in parallel.
• It interoperates with the file system to improve I/O performance in distributed-memory applications.
• Function calls are similar to POSIX commands.
• It can deliver potentially good performance, especially when dealing with small, distinct, and non-contiguous I/O requests [1].
• It is relatively easy to use, building on the existing MPI datatypes and structures.
• Portable: MPI-IO code can run on any compute node supporting MPI 2.0 or later. The binaries are not portable.
[1]: http://www.mcs.anl.gov/~thakur/papers/mpi-io-noncontig.pdf
Parallel I/O: Single File
[Figure: processes 0-3 read/write their Data 0-3 to distinct regions of a single common file.]
• High complexity.
• Performance improvement, optimization opportunity, and scalability.
3 keys to MPI-IO
• Positioning
  • Explicit (non-contiguous)
  • Implicit (contiguous)
• Coordination
  • Collective
  • Non-Collective
• Synchronization
  • Blocking (synchronous)
  • Non-Blocking (asynchronous)

Function               | Positioning    | Coordination   | Synchronization
MPI_File_read          | Contiguous     | Non-Collective | Blocking
MPI_File_read_at       | Non-Contiguous | Non-Collective | Blocking
MPI_File_read_all      | Contiguous     | Collective     | Blocking
MPI_File_read_at_all   | Non-Contiguous | Collective     | Blocking
MPI_File_iread         | Contiguous     | Non-Collective | Non-Blocking
MPI_File_iread_at      | Non-Contiguous | Non-Collective | Non-Blocking
MPI_File_iread_all     | Contiguous     | Collective     | Non-Blocking
MPI_File_iread_at_all  | Non-Contiguous | Collective     | Non-Blocking

The positioning distinction is illustrated by the sketch below.
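A minimal sketch of the positioning key, assuming a file named "data.bin" and a fixed per-rank count that are not from the slides: the same block is read once through the implicit individual file pointer (MPI_File_seek followed by MPI_File_read) and once with the explicit-offset variant MPI_File_read_at.

#include "mpi.h"

/* Sketch: implicit vs explicit positioning. The file name "data.bin"
 * and COUNT are illustrative assumptions. */
#define COUNT 4

int main(int argc, char **argv) {
    int rank, buf[COUNT];
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_File_open(MPI_COMM_WORLD, "data.bin", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* Implicit positioning: move the individual file pointer, then read. */
    MPI_File_seek(fh, (MPI_Offset)rank * COUNT * sizeof(int), MPI_SEEK_SET);
    MPI_File_read(fh, buf, COUNT, MPI_INT, &status);

    /* Explicit positioning: the offset travels with the call itself, and
     * the individual file pointer is neither used nor updated. */
    MPI_File_read_at(fh, (MPI_Offset)rank * COUNT * sizeof(int), buf, COUNT,
                     MPI_INT, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}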
MPI-IO Components

File Management Components:
• File Handle
  • Usually an abstract data type (ADT) used to access the file.
• File Pointer
  • The position in the file at which we read and write; managed through the file handle.
• File View
  • Defines the portion of the file visible to each process.
  • Enables efficient non-contiguous access patterns to the file.

I/O Components:
• Collective / Non-Collective
  • Collective: all processes in the communicator read/write data together and wait for each other (MPI_File_read_all()).
  • Non-Collective: no coordination by the MPI infrastructure (MPI_File_read()).
• Contiguous / Non-Contiguous
  • Contiguous: the MPI-IO default; the entire file is visible to the process, and data is read/written contiguously starting from the location given to the read/write call (MPI_File_read()).
  • Non-Contiguous: MPI-IO allows non-contiguous data access for reads and writes with a single I/O function call (MPI_File_read_at()).
• Asynchronous / Synchronous
  • Asynchronous: lets you continue your computation while the data is transferred in the background; use MPI_Test or MPI_Wait to check whether the transfer has completed. It is non-blocking (MPI_File_iread(); see the sketch below).
  • Synchronous: a call to MPI_File_read or MPI_File_write returns only once the data in the read/write buffer has been transferred; only then is it safe to perform other operations on the buffer. It is blocking (MPI_File_read()).
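To make the synchronization distinction concrete, here is a minimal sketch, assuming a file named "data.bin" and a fixed element count that are not from the slides: a non-blocking MPI_File_iread is overlapped with computation and completed with MPI_Wait, whereas a blocking MPI_File_read would return only once the buffer is filled.

#include "mpi.h"

/* Sketch: asynchronous (non-blocking) read overlapped with computation.
 * The file name "data.bin" and the element count are assumptions. */
int main(int argc, char **argv) {
    int buf[1024];
    MPI_File fh;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "data.bin", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* Start the read; the call returns immediately. */
    MPI_File_iread(fh, buf, 1024, MPI_INT, &request);

    /* ... computation that does not touch buf runs here while the data
     *     is transferred in the background ... */

    /* Block until the transfer completes; only now is buf safe to use. */
    MPI_Wait(&request, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}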
File Views
• Defines the portion of the file visible to each process.
  • Describes where in memory and in the file the current process can read and write.
• Enables efficient non-contiguous access patterns to the file.
  • Non-overlapping views allow the file to be accessed in parallel without data being corrupted.
• Basic / Derived
  • File views can use existing basic MPI datatypes or derived datatypes. The etype could be a simple MPI_INT or a complex struct of ints, doubles, and floats; the filetype is a group of elementary types.
[Figure: a file partitioned by views among Process 0 ... Process n, each with its own file pointer.]
File View Example

int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype,
                      MPI_Datatype filetype, char *datarep, MPI_Info info)

MPI_File_set_view(fh, 2 * etype, MPI_INT, filetype, "native", MPI_INFO_NULL);

[Figure: after a displacement of 2 etypes, the filetype pattern tiles the file; in each iteration (1, 2, 3) Proc 0 through Proc 3 each see only their own portion of the interleaved data.]
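One way an interleaved view like the one pictured could be built is sketched below. This is an assumption-laden variant: it uses a single MPI_Type_vector filetype and shifts the displacement per rank (the slide's figure keeps the displacement at 2 etypes and varies the filetype per process), and the file name "datafile" and the count of 100 ints are made up for the example.

#include "mpi.h"

/* Sketch: an interleaved file view built from a derived filetype.
 * Assumes nprocs ranks each owning every nprocs-th int of the file. */
int main(int argc, char **argv) {
    int rank, nprocs, buf[100];
    MPI_File fh;
    MPI_Datatype filetype;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Filetype: 100 blocks of 1 int each, strided nprocs ints apart. */
    MPI_Type_vector(100, 1, nprocs, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* Displacement skips the 2-etype header plus this rank's slot. */
    MPI_Offset disp = (MPI_Offset)(2 + rank) * sizeof(int);
    MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* Each rank now reads only the ints its view exposes. */
    MPI_File_read(fh, buf, 100, MPI_INT, &status);

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}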
Sample MPI-IO Program Sequence

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {
    int rank, rankSize, numberInts;
    MPI_Offset fileSize, bufferSize, offset;
    MPI_File fh;          // File handle
    MPI_Status status;    // Stores the status of each I/O operation

    // 1- Set up MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &rankSize);

    // 2- Open file: opens the file in a specific mode; the open is
    //    collective, so every rank in the communicator takes part.
    MPI_File_open(MPI_COMM_WORLD, "filename", MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);

    // 3- Set file views: the file handle and view give each rank its
    //    own position in the file.
    MPI_File_get_size(fh, &fileSize);
    bufferSize = fileSize / rankSize;           // amount of the file each rank handles
    numberInts = bufferSize / sizeof(int);      // # of ints that fit in the buffer
    int buffer[numberInts];                     // local buffer used with the view
    for (int i = 0; i < numberInts; i++)
        buffer[i] = rank;                       // fill the buffer with this rank's data
    offset = (MPI_Offset)rank * bufferSize;     // byte displacement for this rank
    MPI_File_set_view(fh, offset, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

    // 4- Read/Write: implicit (individual file pointer), blocking,
    //    non-collective Unix-style I/O.
    MPI_File_write(fh, buffer, numberInts, MPI_INT, &status);
    MPI_File_seek(fh, 0, MPI_SEEK_SET);         // move the file pointer back to the start of the view
    MPI_File_read(fh, buffer, numberInts, MPI_INT, &status);

    // 5- Close: synchronizes the file state, then closes it.
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

[Figure: timeline of the sequence (Start, MPI initialization, file open, set views, read/write by Process 0 ... Process n, communication with the file via the file pointer, file close, finalize, complete).]
Collective I/O
• The many I/O requests across all processes are merged into larger I/O requests.
• All processes' I/O occurs together.
• This is effective when there are many non-contiguous I/O requests at once:
  • The small non-contiguous requests are gathered and optimized.
  • The many non-contiguous requests are combined so that the reads/writes are performed efficiently.
  • How the requests are combined is determined by the MPI implementation.
  • Performance can increase because the data is read/written once and then distributed to the correct processes.
Source: http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture32.pdf
[Figure: many small individual requests are merged into one large collective access.]
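A hedged sketch of the difference follows: the same partitioned read is done first independently with MPI_File_read_at and then collectively with MPI_File_read_at_all, which lets the MPI implementation merge the requests. The file name "data.bin" and the per-rank count are assumptions for illustration.

#include "mpi.h"

/* Sketch: independent vs collective reads of the same partitioned file.
 * The file name and the per-rank count N are illustrative assumptions. */
#define N 1000

int main(int argc, char **argv) {
    int rank, buf[N];
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_File_open(MPI_COMM_WORLD, "data.bin", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(int);

    /* Independent: each rank issues its own request to the file system. */
    MPI_File_read_at(fh, offset, buf, N, MPI_INT, &status);

    /* Collective: all ranks call together, so the implementation can
     * merge the many small requests into fewer, larger accesses. */
    MPI_File_read_at_all(fh, offset, buf, N, MPI_INT, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}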
MPI IO Best Uses
• Performing operations on a large data file.
• High-performance parallel applications that require I/O.
• Making many small I/O requests to non-contiguous parts of the file.
• Describing non-contiguous file access patterns.
• Common Mistakes
  • Attempting to write to multiple files.
  • Miscalculating the offset/displacement.
  • Making frequent metadata accesses.
Example - Large Distributed Array
Source: https://www.tacc.utexas.edu/documents/13601/900558/MPI-IO-Final.pdf/eea9d7d3-4b81-471c-b244-41498070e35d
MPI IO 4 Levels of Access and Use Cases
Level 0: each process uses MPI_File_read to read single elements from its subarray (Unix style).
Level 1: processes use MPI_File_read_all to collectively read their elements from the subarray.
Level 2: create a derived datatype for the subarray and a file view describing the non-contiguous access; each process performs independent I/O.
Level 3: same as Level 2, but using collective function calls.
Source: https://www.tacc.utexas.edu/documents/13601/900558/MPI-IO-Final.pdf/eea9d7d3-4b81-471c-b244-41498070e35d
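As a rough sketch of Level 3 (not the exact code from the TACC slides), the example below assumes a 100x100 int array split across exactly 4 ranks in a 2x2 grid and stored in a file named "array.bin": a derived subarray filetype plus a file view describes the non-contiguous access, and MPI_File_read_all performs the collective read.

#include "mpi.h"

/* Sketch of "Level 3" access for a 2D distributed array. Array size,
 * block size, process grid, and file name are assumptions. */
int main(int argc, char **argv) {
    int rank, buf[50 * 50];
    int gsizes[2] = {100, 100};            /* whole array          */
    int lsizes[2] = {50, 50};              /* this rank's block    */
    int starts[2];
    MPI_File fh;
    MPI_Datatype filetype;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    starts[0] = (rank / 2) * 50;           /* block row start      */
    starts[1] = (rank % 2) * 50;           /* block column start   */

    /* Derived datatype describing this rank's block of the global array. */
    MPI_Type_create_subarray(2, gsizes, lsizes, starts, MPI_ORDER_C,
                             MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "array.bin", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* Level 3: non-contiguous access described by the view, performed
     * with a single collective read. */
    MPI_File_read_all(fh, buf, 50 * 50, MPI_INT, &status);

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}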
MPI I/O Command List
• MPI_File_c2f
• MPI_File_get_type_extent
• MPI_File_read_all
• MPI_File_set_info
• MPI_File_call_errhandler
• MPI_File_get_view
• MPI_File_read_all_begin
• MPI_File_set_size
• MPI_File_close
• MPI_File_iread
• MPI_File_read_all_end
• MPI_File_set_view
• MPI_File_create_errhandler
• MPI_File_iread_all
• MPI_File_read_at
• MPI_File_sync
• MPI_File_delete
• MPI_File_iread_at
• MPI_File_read_at_all
• MPI_File_write
• MPI_File_f2c
• MPI_File_iread_at_all
• MPI_File_read_at_all_begin
• MPI_File_write_all
• MPI_File_get_amode
• MPI_File_iread_shared
• MPI_File_read_at_all_end
• MPI_File_write_all_begin
• MPI_File_get_atomicity
• MPI_File_iwrite
• MPI_File_read_ordered
• MPI_File_write_all_end
• MPI_File_get_byte_offset
• MPI_File_iwrite_all
• MPI_File_read_ordered_begin
• MPI_File_write_at
• MPI_File_get_errhandler
• MPI_File_iwrite_at
• MPI_File_read_ordered_end
• MPI_File_write_at_all
• MPI_File_get_group
• MPI_File_iwrite_at_all
• MPI_File_read_shared
• MPI_File_write_at_all_begin
• MPI_File_get_info
• MPI_File_iwrite_shared
• MPI_File_seek
• MPI_File_write_at_all_end
• MPI_File_get_position
• MPI_File_open
• MPI_File_seek_shared
• MPI_File_write_ordered
• MPI_File_set_atomicity
• MPI_File_write_ordered_begin
• MPI_File_set_errhandler
• MPI_File_write_ordered_end
• MPI_File_get_position_shared
• MPI_File_preallocate
• MPI_File_read
• MPI_File_get_size
• MPI_File_write_shared
References
• Overview of MPI IO: http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture32.pdf
• Ref to MPI-IO Commands: http://www.mpich.org/static/docs/latest/
• Different MPI Performance Levels: http://www.mcs.anl.gov/~thakur/papers/mpi-io-noncontig.pdf
• Parallel IO, MPI IO, HDF5, T3PIO, and Strategies: https://www.tacc.utexas.edu/documents/13601/900558/MPI-IO-Final.pdf/eea9d7d3-4b81-471c-b244-41498070e35d
• Wiki Page: https://www.hpc.ntnu.no/display/hpc/MPI+IO
MPI IO Levels Read/Write Performance
Source: http://www.mcs.anl.gov/~thakur/papers/mpi-io-noncontig.pdf