RDMA for Hadoop Distributed FileSystem
README
Rev 1.3
www.mellanox.com
Mellanox Technologies Confidential
NOTE:
THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT (“PRODUCT(S)”) AND ITS RELATED
DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES “AS-IS” WITH ALL FAULTS OF ANY
KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE
THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT
HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S)
AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT
GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED.
IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT,
INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT
LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA,
OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Mellanox Technologies
350 Oakmead Parkway Suite 100
Sunnyvale, CA 94085
U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403
Mellanox Technologies, Ltd.
Hakidma 26
Ofer Industrial Park
Yokneam 2069200
Israel
www.mellanox.com
Tel: +972 (0)74 723 7200
Fax: +972 (0)4 959 3245
© Copyright 2015. Mellanox Technologies. All Rights Reserved.
Mellanox®, Mellanox logo, BridgeX®, ConnectX®, Connect-IB®, CoolBox®, CORE-Direct®, GPUDirect®, InfiniBridge®,
InfiniHost®, InfiniScale®, Kotura®, Kotura logo, MetroX®, MLNX-OS®, PhyX®, ScalableHPC®, SwitchX®, TestX®,
UFM®, Virtual Protocol Interconnect®, Voltaire® and Voltaire logo are registered trademarks of Mellanox Technologies,
Ltd.
ExtendX™, FabricIT™, FPGADirect™, HPC-X™, Mellanox Care™, Mellanox CloudX™, Mellanox Open Ethernet™,
Mellanox PeerDirect™, Mellanox Virtual Modular Switch™, MetroDX™, NVMeDirect™, Switch-IB™, UnbreakableLink™ are trademarks of Mellanox Technologies, Ltd.
All other trademarks are property of their respective owners.
Document Number: MLNX-15-3770
Table of Contents
1 Introduction
  1.1 Dependencies and Prerequisites
    1.1.1 Hardware
    1.1.2 Software
    1.1.3 Setup Verification
2 Architecture Highlights
  2.1 JXIO
3 Installing R4H
4 Using R4H
  4.1 Specific Usage Examples
5 Troubleshooting R4H
1 Introduction
RDMA for HDFS (R4H) is a plugin for Hadoop Distributed FileSystem (HDFS) which
accelerates HDFS by using RDMA (Remote Direct Memory Access) technology. R4H
enables HDFS write operations over RDMA using Mellanox ConnectX® interconnect for
Ethernet and InfiniBand fabrics.
This README describes the steps required to install and operate R4H on Cloudera and
HDP Hadoop clusters.
For more information on R4H, visit the R4H repository on GitHub:
https://github.com/Mellanox/R4H
1.1 Dependencies and Prerequisites
1.1.1 Hardware
• Mellanox ConnectX®-3 or ConnectX®-3 Pro:
  • Ethernet 10/40/56 GbE
  or
  • InfiniBand FDR links
• The network cards should be installed on:
  • NameNode and Secondary NameNode
  • All DataNodes
  • Client and server nodes that interact with the Hadoop Distributed File System
1.1.2 Software
• RedHat 6.4
• MLNX_OFED v2.4-1.0.1 and firmware v2.33.5100 installed on:
  • NameNode
  • All DataNodes
  • Client and server nodes that interact with the Hadoop Distributed File System
• Supported distributions:
  • Cloudera 5.1.2
  • Cloudera 5.3.0
  • Cloudera 5.3.1
  • Hortonworks HDP 2.1.2
  • Hortonworks HDP 2.2.0
• RDMA for HDFS plugin (supplied as a tarball that corresponds to a specific distribution)
1.1.3 Setup Verification
Verify RDMA connectivity between all relevant machines using Open MPI:
1. Set the HOSTS variable to a comma-delimited list of your cluster's host names.
2. Set NUM_OF_HOSTS to the number of hosts in your cluster.
3. Set the OPENMPI_VER variable to your Open MPI version.
4. Set MLX_PORT to the ConnectX port your cluster is using (e.g., for port 1, set it to
mlx4_0:1).
5. Run the following as the root user:
/usr/mpi/gcc/openmpi-${OPENMPI_VER}/bin/mpirun --allow-run-as-root \
  --display-map -H ${HOSTS} -np ${NUM_OF_HOSTS} \
  -mca btl_openib_ib_rnr_retry 0 -mca btl_openib_ib_retry_count 0 \
  -mca coll_fca_enable 0 --bind-to core --map-by node \
  -mca pml ob1 -mca btl self,sm,openib -mca btl_openib_cpc_include rdmacm \
  -mca btl_openib_if_include ${MLX_PORT} \
  /usr/mpi/gcc/openmpi-${OPENMPI_VER}/tests/IMB-3.2.4/IMB-MPI1 alltoall
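The variable substitution in the steps above can be sketched in Python. This is an illustrative helper, not part of R4H or Open MPI: the function name build_mpirun_command, the sample host names and the sample Open MPI version are hypothetical, and the sketch assumes the IMB tests directory sits under the same openmpi-<version> prefix as mpirun. It derives the -np value from the host list, so HOSTS and NUM_OF_HOSTS cannot disagree.

```python
import shlex


def build_mpirun_command(hosts, openmpi_ver, mlx_port):
    """Return the verification command from this section as an argv list.

    hosts       -- list of cluster host names (joined with commas for -H)
    openmpi_ver -- Open MPI version string used in the install path
    mlx_port    -- ConnectX device and port, e.g. "mlx4_0:1" for port 1
    """
    # Assumption: mpirun and the IMB tests share this install prefix.
    prefix = f"/usr/mpi/gcc/openmpi-{openmpi_ver}"
    return [
        f"{prefix}/bin/mpirun", "--allow-run-as-root", "--display-map",
        "-H", ",".join(hosts),
        "-np", str(len(hosts)),  # one process per host
        "-mca", "btl_openib_ib_rnr_retry", "0",
        "-mca", "btl_openib_ib_retry_count", "0",
        "-mca", "coll_fca_enable", "0",
        "--bind-to", "core", "--map-by", "node",
        "-mca", "pml", "ob1",
        "-mca", "btl", "self,sm,openib",
        "-mca", "btl_openib_cpc_include", "rdmacm",
        "-mca", "btl_openib_if_include", mlx_port,
        f"{prefix}/tests/IMB-3.2.4/IMB-MPI1", "alltoall",
    ]


if __name__ == "__main__":
    # Hypothetical values; substitute your own hosts, version and port.
    cmd = build_mpirun_command(["machine1-ib", "machine2-ib"], "1.8.4", "mlx4_0:1")
    print(shlex.join(cmd))
```

Printing the command rather than executing it makes it easy to review before running it as root.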
For further information, refer to the following links:
• http://www.open-mpi.org/faq/?category=openfabrics
• http://docs.oracle.com/cd/E19708-01/821-1319-10/ExecutingPrograms.html
• http://blogs.cisco.com/performance/process-affinity-hop-on-the-bus-gus/
Use the links above to make sure the aforementioned command covers the following features:
• RC QP
• all2all
• rdmacm
• Connection establishment
• Massive RDMA write
2 Architecture Highlights
The R4H plugin works side by side with the other HDFS communication layers, and does not
replace or interfere with TCP activity or other HDFS core tasks.
You can choose to use:
• the existing HDFS over TCP
• R4H over RDMA, to get faster writes with higher bandwidth
Upon startup, every DataNode loads the R4H plugin in addition to the standard HDFS code.
If the client application uses the standard HDFS JAR, the R4H plugin does not process any
data transfer. The client DFS uses the standard TCP connection to the respective
DataNode according to the NameNode's pipeline.
When the client application uses the R4H plugin JAR, the connection to the DataNode is
initiated over RDMA using the JXIO framework. With R4H, the client uses RDMA connectivity
for all write operations. All other client communication uses TCP/IP connectivity.
Similarly, the server-side R4H plugin processes all incoming write operations from the
clients. All other communication between the server and the NameNode is handled by the
TCP stack.
2.1 JXIO
Java over Accelio (JXIO) provides a Java Application Programming Interface (API) to Accelio.
Accelio is an open source, high-performance, asynchronous, reliable messaging and Remote
Procedure Call (RPC) library.
JXIO provides:
• A simple and abstract Java API for high-performance asynchronous communications
• Zero-copy data delivery
• Request/Reply
• Increased benefit from RDMA, hardware offloads, multi-core CPUs and multi-threaded
  applications
• Message combining and batch message processing optimization
For more information on Accelio, visit www.accelio.org
For more information on JXIO, visit the JXIO repository on GitHub:
https://github.com/accelio/JXIO
3 Installing R4H
1. Download the RDMA for HDFS plugin JAR that matches your Hadoop version from the
releases page: https://github.com/Mellanox/R4H/releases
2. Place the RDMA for HDFS and JXIO JAR files on every DataNode and NameNode in the
cluster, and on every other node that interacts with HDFS.
The preferred location is the default directory that already holds the native Hadoop and
HDFS JARs. This location differs between distributions. If you use it, no additional
configuration changes are required:
• CDH location: /opt/cloudera/parcels/CDH/lib/hadoop-hdfs/lib/
• HDP location: /usr/lib/hadoop-hdfs/lib/
3. Configure the R4H DataNode plugin in hdfs-site.xml as follows (see footnote 1):
<property>
  <name>dfs.datanode.plugins</name>
  <value>com.mellanox.r4h.R4HDatanodePlugin</value>
</property>
4. Cloudera Manager users must also set the following:
Category | Property | Value
DataNode Default Group / Resource Management | Maximum Memory Used for Caching (dfs.datanode.max.locked.memory) | 0 GiB
If you wish to enable the HDFS read caching feature, configure the following
parameters:
Category | Property | Value
Service-Wide / Advanced | HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml | <property><name>dfs.datanode.max.locked.memory</name><value>4294967296</value></property>
DataNode Default Group / Advanced | DataNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml | Same as above
Gateway Default Group / Advanced | HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml | Same as above
Footnote 1 (step 3): If you are using Cloudera Manager, the "dfs.datanode.plugins" parameter should be inserted in the DataNode (Default) / Plugins category.
5. Configure the hostnames of all machines according to the interface used.
NOTE: Make sure that, on every machine, the hostname corresponds to the IP address of
the network interface that will be used by RDMA.
Examples:
• InfiniBand interface ib0 is used and is configured with address 1.0.0.10 on the
  server with hostname machine1.
  Make sure that:
  i. All hostnames are defined as `hostname`-ib.
  ii. The file /etc/hosts on all machines contains the line:
      1.0.0.10 machine1-ib
  iii. All machines in Cloudera Manager (CDM) / HDP Manager (Ambari) are
       defined according to their new hostnames.
• Ethernet interface eth3 is used and is configured with address 2.0.0.10 on the
  server with hostname machine1.
  Make sure that:
  i. All hostnames are defined as `hostname`-10g.
  ii. The file /etc/hosts on all machines contains the line:
      2.0.0.10 machine1-10g
  iii. All machines in Cloudera Manager (CDM) / HDP Manager (Ambari) are
       defined according to their new hostnames.
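The naming convention in the two examples above can be sketched as a small Python helper that produces the /etc/hosts lines for a set of machines. The function name hosts_entries is hypothetical; only machine1, its two addresses and the -ib / -10g suffixes come from the examples. Treat it as a way to generate candidate lines for review, not as a provisioning tool.

```python
def hosts_entries(machines, suffix):
    """Build /etc/hosts lines of the form "<ip> <hostname>-<suffix>".

    machines -- mapping of base hostname to the IP address of the
                interface that RDMA will use on that machine
    suffix   -- interface tag appended to the hostname, e.g. "ib" or "10g"
    """
    lines = []
    for name, ip in sorted(machines.items()):
        lines.append(f"{ip} {name}-{suffix}")
    return lines


if __name__ == "__main__":
    # Values taken from the InfiniBand example in step 5.
    for line in hosts_entries({"machine1": "1.0.0.10"}, "ib"):
        print(line)
```

The same lines must be present on every machine, so generating them once and distributing the result avoids per-host typos.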
6. Tune R4H.
The following are the central parameters that affect the performance and functionality of
HDFS, native TCP and R4H:
Property | Comments
dfs.replication | The replication factor affects write performance, because each block is written to that many DataNodes.
dfs.block.size, dfs.blocksize | We advise against block sizes smaller than the default of 128MB.
DataNode Data Directory | The number of disks per DataNode affects performance. For a high-performing cluster, we recommend using more and faster disk drives.
r4h.server.portal.workers | The number of server-portal worker threads for the JXIO FORWARD network model (20 by default). Setting the number of workers to 0 enables the JXIO ACCEPT model, with a single thread for the network context.
r4h.io.executors | The number of IO executors for each DataNode (10 by default). We recommend 2-3 workers per disk.
r4h.msg.blocks.bind | The number of blocks used to calculate the initial number of JXIO messages and IO buffers at each DataNode startup (50 by default).
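The 2-3 workers per disk recommendation for r4h.io.executors is simple arithmetic; a minimal Python sketch follows. The function name io_executor_range is hypothetical, and the result is only a starting range to benchmark, not a guaranteed optimum.

```python
def io_executor_range(num_disks):
    """Return (low, high) candidate values for r4h.io.executors.

    Applies the recommendation of 2 to 3 workers per disk on a DataNode.
    """
    if num_disks < 1:
        raise ValueError("a DataNode needs at least one data disk")
    return (2 * num_disks, 3 * num_disks)


if __name__ == "__main__":
    low, high = io_executor_range(4)
    print(f"4 disks: try r4h.io.executors between {low} and {high}")
```

For a DataNode with four disks the range is 8 to 12, which contains the default of 10.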
7. Restart HDFS.
8. Verify that the installation was performed successfully. Refer to the Specific Usage
Examples section for further information.
• On the client side, the following line should appear in the log:
  “date time INFO fs.FileSystem Using Mellanox RDMA acceleration”
• On the server side, the following line should appear in the DataNode log:
  “date time INFO datanode.DataNode Started plug-in R4HDatanodePlugin {…}”
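The log check in step 8 can be automated with a short Python sketch. The patterns below are derived only from the two lines quoted above; the exact timestamp format, any punctuation after the logger name and the text following the plugin name are not specified in this README, so the patterns deliberately leave them open. The names CLIENT_MARKER, SERVER_MARKER and log_has_marker are hypothetical.

```python
import re

# Patterns built from the two log lines quoted in step 8. The leading
# date/time and anything after the plugin name are intentionally not matched.
CLIENT_MARKER = re.compile(
    r"INFO\s+fs\.FileSystem\b.*Using Mellanox RDMA acceleration")
SERVER_MARKER = re.compile(
    r"INFO\s+datanode\.DataNode\b.*Started plug-in .*R4HDatanodePlugin")


def log_has_marker(log_text, marker):
    """Return True if any line of log_text matches the given pattern."""
    return any(marker.search(line) for line in log_text.splitlines())


if __name__ == "__main__":
    # Made-up sample line; in practice read the client or DataNode log file.
    sample = "2015-01-01 00:00:00 INFO fs.FileSystem Using Mellanox RDMA acceleration"
    print("client uses R4H:", log_has_marker(sample, CLIENT_MARKER))
```

If the client marker is missing, the job most likely ran over the standard TCP path, which is worth checking before comparing benchmark results.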
4 Using R4H
The RDMA-accelerated HDFS plugin can be used in the following ways:
• Use an existing Hadoop application, supplying it with
  fs.hdfs.impl=com.mellanox.r4h.DistributedFileSystem
  Note that R4H is designed to accelerate the writing of large files. Some
  applications (such as Impala) currently do not support R4H. Therefore,
  setting the above parameter globally (i.e., in HDFS's core-site.xml) is not recommended.
  Instead, it is highly recommended to set this parameter per job or per application. For
  example, to configure it for YARN applications via Cloudera Manager:
Category | Property | Value
Gateway Default Group / Advanced | YARN Client Advanced Configuration Snippet (Safety Valve) for yarn-site.xml | <property><name>fs.hdfs.impl</name><value>com.mellanox.r4h.DistributedFileSystem</value></property>
• Write a Hadoop application that writes files directly using
  com.mellanox.r4h.DistributedFileSystem
NOTE: If you are configuring R4H for the HBase service, the HBase master may fail
to start with R4H while HDFS is in safe mode. In that case, wait for the NameNode to exit
safe mode (or force it to), and then restart the HBase master.
4.1 Specific Usage Examples
• A standard TestDFSIO benchmark. For example:
    hadoop jar tests.jar TestDFSIO \
      -Dfs.hdfs.impl=com.mellanox.r4h.DistributedFileSystem \
      -write -nrFiles 10 -fileSize 4GB -resFile /tmp/res
• Choosing TestDFSIO parameters:
  • Number of containers per node:
    To maximize the performance gain, choose this number according to:
    • the number of nodes
    • the number of disks
    • the number of cores in the machine
    You are advised to use the following script, which suggests the best YARN parameters,
    taking your hardware specification as input:
    http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1latest/bk_installing_manually_book/content/rpm-chap1-11.html
  • Number of files:
    Choose (number of containers per node x number of nodes) - 1.
    The "- 1" accounts for the application master container.
  • File size:
    To maximize the performance gain, choose a size between the block size and 4GB.
• A standard HDFS command that writes files. For instance, the copyFromLocal command:
    bin/hdfs dfs -Dfs.hdfs.impl=com.mellanox.r4h.DistributedFileSystem \
      -copyFromLocal filenameToCopy destinationFilename
5 Troubleshooting R4H
Issue #1: Changing the logging levels of R4H.
The logging levels can be controlled in two basic ways:
a. By increasing or decreasing the logging level of the entire HDFS component.
For example, you can increase the logging level of the DataNode to TRACE by setting:
Category | Property | Value
DataNode (Default) / Logs | DataNode Logging Threshold | TRACE
This also affects the R4H DataNode plugin.
b. By increasing or decreasing the logging level of the R4H plugin only:
Category | Property | Value
DataNode (Default) / Advanced | DataNode Logging Safety Valve | log4j.logger.com.mellanox.r4h=TRACE