With HAST (Highly Available Storage), we can set up transparent storage replicated between two remote machines connected over a TCP/IP network. HAST can be thought of as a network RAID1 (mirror).
The article is composed of several sections:
Introduction:
To demonstrate how HAST works, we will use two FreeBSD 13.1 servers, each with its own IP address, plus a CARP VIP:
VIP: 192.168.69.40
PeanutBrain01: 192.168.69.41
PeanutBrain02: 192.168.69.42
When working with HAST, we must take several aspects into account. HAST provides synchronous block-level replication, making it transparent to file systems and applications. There is no difference between using HAST devices and raw disks, partitions, etc.; all of them are just regular GEOM providers. HAST works in primary-secondary mode, so only one of the nodes can be active at any given time. The primary node handles I/O requests, and the secondary node is automatically synchronized from the primary. Write/delete/flush operations are sent to the primary and then replicated to the secondary. Read operations are served from the primary unless there is an I/O error, in which case the read is sent to the secondary.
HAST implements several synchronization modes:
- memsync: This mode reports a write operation as completed when the primary has written the data to its local disk and the secondary has acknowledged the start of data reception; only the start of reception, the data has not necessarily been written to the secondary's disk yet. This mode reduces latency while still providing reasonable reliability and is the default.
- fullsync: This mode reports a write operation as completed when both nodes have written the data to disk. It is the safest mode but also the slowest.
- async: This mode reports a write operation as completed when the primary has written the data to disk. It is the fastest mode but also the most dangerous and should only be used when the latency to the secondary is too high to use either of the other two modes.
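If we want a mode other than the default, it can be set per resource in the HAST configuration through the replication keyword. A minimal sketch, assuming the syntax described in hast.conf(5):
resource MySQLData {
    replication fullsync
    on PeanutBrain01 {
        local /dev/ada1
        remote 192.168.69.42
    }
    on PeanutBrain02 {
        local /dev/ada1
        remote 192.168.69.41
    }
}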
If we are using a custom kernel, we will have to include the following option in its configuration:
options GEOM_GATE
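With the stock GENERIC kernel it should be enough to have the geom_gate(4) module available; it can be loaded at boot, for example:
sysrc -f /boot/loader.conf geom_gate_load=YES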
HAST Configuration:
On both servers we have two disks in addition to the system disk, so we can test HAST with both UFS and ZFS:
- UFS: /dev/ada1
- ZFS: /dev/ada2
root@PeanutBrain01:~ # camcontrol devlist
<VBOX HARDDISK 1.0> at scbus0 target 0 lun 0 (pass0,ada0)
<VBOX HARDDISK 1.0> at scbus0 target 1 lun 0 (pass1,ada1)
<VBOX HARDDISK 1.0> at scbus1 target 0 lun 0 (pass2,ada2)
root@PeanutBrain02:~ # camcontrol devlist
<VBOX HARDDISK 1.0> at scbus0 target 0 lun 0 (pass0,ada0)
<VBOX HARDDISK 1.0> at scbus0 target 1 lun 0 (pass1,ada1)
<VBOX HARDDISK 1.0> at scbus1 target 0 lun 0 (pass2,ada2)
The disks must be clean:
It is not possible to use GEOM providers that already contain a file system, or to convert existing storage into a HAST-managed pool.
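One quick way to wipe any old metadata from the beginning of the disks (a sketch; double-check the device names before running it, as this is destructive):
dd if=/dev/zero of=/dev/ada1 bs=1m count=1
dd if=/dev/zero of=/dev/ada2 bs=1m count=1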
The HAST configuration, /etc/hast.conf, would be the following on both nodes:
resource MySQLData {
    on PeanutBrain01 {
        local /dev/ada1
        remote 192.168.69.42
    }
    on PeanutBrain02 {
        local /dev/ada1
        remote 192.168.69.41
    }
}

resource FilesData {
    on PeanutBrain01 {
        local /dev/ada2
        remote 192.168.69.42
    }
    on PeanutBrain02 {
        local /dev/ada2
        remote 192.168.69.41
    }
}
We create the HAST pools on both nodes:
hastctl create MySQLData
hastctl create FilesData
We enable and start the service:
sysrc hastd_enable=YES
service hastd start
We set one of the nodes to primary (PeanutBrain01):
hastctl role primary MySQLData
hastctl role primary FilesData
The other node to secondary (PeanutBrain02):
hastctl role secondary MySQLData
hastctl role secondary FilesData
We check the status on both:
root@PeanutBrain01:~ # hastctl status MySQLData
Name Status Role Components
MySQLData complete primary /dev/ada1 192.168.69.42
root@PeanutBrain01:~ # hastctl status FilesData
Name Status Role Components
FilesData complete primary /dev/ada2 192.168.69.42
root@PeanutBrain02:~ # hastctl status MySQLData
Name Status Role Components
MySQLData complete secondary /dev/ada1 192.168.69.41
root@PeanutBrain02:~ # hastctl status FilesData
Name Status Role Components
FilesData complete secondary /dev/ada2 192.168.69.41
We can see how the HAST device has only been generated on the primary:
root@PeanutBrain01:~ # ls -la /dev/hast/
total 1
dr-xr-xr-x 2 root wheel 512 Oct 30 18:11 .
dr-xr-xr-x 10 root wheel 512 Oct 30 18:04 ..
crw-r----- 1 root operator 0x61 Oct 30 18:11 FilesData
crw-r----- 1 root operator 0x5f Oct 30 18:11 MySQLData
root@PeanutBrain02:~ # ls -la /dev/hast/
ls: /dev/hast/: No such file or directory
Now we need to decide which file system we want to use on the HAST device. In my case, I will use both UFS and ZFS. The following commands should only be executed on the primary:
newfs -U /dev/hast/MySQLData
mkdir /var/db/mysql
mount /dev/hast/MySQLData /var/db/mysql
zpool create FilesData /dev/hast/FilesData
Now we check that it is mounted in the primary node:
df -Th /var/db/mysql
Filesystem Type Size Used Avail Capacity Mounted on
/dev/hast/MySQLData ufs 15G 8.0K 14G 0% /var/db/mysql
zpool status FilesData
pool: FilesData
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
FilesData ONLINE 0 0 0
hast/FilesData ONLINE 0 0 0
errors: No known data errors
df -Th /FilesData
Filesystem Type Size Used Avail Capacity Mounted on
FilesData zfs 15G 96K 15G 0% /FilesData
The idea is to offer two services, MySQL and NFS, both served by the primary node, which will be whichever node the VIP-CARP is configured on at any given time.
Install the mysql-server package on both nodes:
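For example, using the MySQL 8.0 package (the exact package name depends on the version we want to run):
pkg install -y mysql80-server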
Enable the service on both nodes:
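sysrc mysql_enable=YES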
Bind the service to all server IPs (for example in /usr/local/etc/mysql/my.cnf):
[mysqld]
bind-address = 0.0.0.0
Start the service on both nodes:
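service mysql-server start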
Now enable NFS for the /FilesData directory on the primary node:
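Since FilesData is a ZFS pool, the easiest way is the sharenfs property, which is also what the failover script will toggle later on:
zfs set sharenfs="maproot=root:wheel" FilesData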
Check if it is shared:
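zfs get sharenfs FilesData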
NAME PROPERTY VALUE SOURCE
FilesData sharenfs maproot=root:wheel local
Enable NFS services on both nodes so that everything is ready when the VIP-CARP migrates:
sysrc nfs_server_enable=YES
sysrc mountd_enable=YES
Start the NFS service on both nodes:
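service mountd start
service nfsd start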
Create a test file on the primary node:
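For example, this is the file we will later read from the client over NFS:
echo kr0m > /FilesData/AA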
From our PC, make sure we can access the content via NFS:
Garrus # ~> showmount -e 192.168.69.40
Exports list on 192.168.69.40:
/FilesData Everyone
Garrus # ~> mount 192.168.69.40:/FilesData /mnt/nfs/
Garrus # ~> df -Th /mnt/nfs/
Filesystem Type Size Used Avail Capacity Mounted on
192.168.69.40:/FilesData nfs 15G 100K 15G 0% /mnt/nfs
Garrus # ~> cat /mnt/nfs/AA
kr0m
The role change will be controlled through devd events. When the VIP-CARP migrates to a node, it will be configured as the primary, and when the VIP-CARP is lost, it will be configured as the secondary.
The necessary parameters for configuring devd rules are as follows:
System | Subsystem | Type | Description |
---|---|---|---|
CARP | | | Events related to the carp(4) protocol. |
CARP | vhid@inet | | The "subsystem" contains the actual CARP vhid and the name of the network interface on which the event took place. |
CARP | vhid@inet | MASTER | The node becomes the master for a virtual host. |
CARP | vhid@inet | BACKUP | The node becomes the backup for a virtual host. |
Add the following configuration to devd (for example in /etc/devd.conf) to execute an HAST reconfiguration script when a VIP-CARP migration is detected:
notify 30 {
    match "system" "CARP";
    match "subsystem" "1@em0";
    match "type" "MASTER";
    action "/usr/local/sbin/carp-hast-switch primary";
};

notify 30 {
    match "system" "CARP";
    match "subsystem" "1@em0";
    match "type" "BACKUP";
    action "/usr/local/sbin/carp-hast-switch secondary";
};
Restart devd:
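service devd restart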
Depending on the detected change, the failover script will act in a certain way:
- Primary: Waits for the HAST-secondary processes to die, changes the role, mounts/imports the file system, and starts the services.
- Secondary: Stops the services, unmounts/exports the file system, and changes the role.
All HAST resources are linked to a directory where they will be mounted, a file system type, and a service:
Resource | Directory | File System | Service |
---|---|---|---|
MySQLData | /var/db/mysql | UFS | mysql |
FilesData | /FilesData | ZFS | nfs |
The script in question would be as follows:
#!/bin/sh
# The names of the HAST resources, as listed in /etc/hast.conf
resources="MySQLData FilesData"
# Resource mountpoints
resource_mountpoints="/var/db/mysql /FilesData"
# Supported file system types: UFS, ZFS
resource_filesystems="UFS ZFS"
# Service types: mysql nfs
resource_services="mysql nfs"
# Delay in mounting HAST resource after becoming primary
delay=3
# logging
log="local0.debug"
name="carp-hast"
# end of user configurable stuff
case "$1" in
primary)
logger -p $log -t $name "Switching to primary provider for - ${resources} -."
sleep ${delay}
# -- SERVICE MANAGEMENT --
logger -p $log -t $name ">> Stopping services."
resource_counter=1
for resource in ${resources}; do
resource_service=`echo $resource_services | cut -d\ -f$resource_counter`
case "${resource_service}" in
mysql)
logger -p $log -t $name "Service MySQL detected for resource - ${resource} -."
logger -p $log -t $name "Stoping MySQL service for resource - ${resource} -."
service mysql-server stop
logger -p $log -t $name "Done for resource ${resource}"
;;
nfs)
logger -p $log -t $name "Service NFS detected for resource - ${resource} -."
logger -p $log -t $name "Disabling NFS-ZFS share for resource - ${resource} -."
zfs set sharenfs="off" FilesData
logger -p $log -t $name "Done for resource ${resource}"
logger -p $log -t $name "Stopping NFS service for resource - ${resource} -."
service nfsd stop
logger -p $log -t $name "Done for resource ${resource}"
;;
*)
logger -p local0.error -t $name "ERROR: Unknown service: ${resource_service}, exiting."
exit 1
;;
esac
resource_counter=$((resource_counter+1))
done
# -- HAST ROLE MANAGEMENT --
logger -p $log -t $name ">> Managing disks."
for resource in ${resources}; do
# When the primary HAST node is inaccessible, the secondary node stops its hastd secondary processes automatically
# Wait 30s for any "hastd secondary" processes to stop
num=0
logger -p $log -t $name "Waiting for secondary process of resource - ${resource} - to die."
while pgrep -lf "hastd: ${resource} \(secondary\)" > /dev/null 2>&1; do
num=$((num+1))
sleep 1
if [ $num -gt 29 ]; then
logger -p $log -t $name "ERROR: Secondary process for resource - ${resource} - is still running after 30 seconds, exiting."
exit 1
fi
done
logger -p $log -t $name "Secondary process for resource - ${resource} - died successfully."
# Switch role for resource
logger -p $log -t $name "Switching resource - ${resource} - to primary."
hastctl role primary ${resource}
if [ $? -ne 0 ]; then
logger -p $log -t $name "ERROR: Unable to change role to primary for resource - ${resource} -."
exit 1
fi
logger -p $log -t $name "Role for HAST resource - ${resource} - switched to primary."
done
# -- WAIT FOR HAST DEVICE CREATION --
logger -p $log -t $name ">> Waitting for hast devices."
for resource in ${resources}; do
num=0
logger -p $log -t $name "Waitting for hast device of resource - ${resource} -."
while [ ! -c "/dev/hast/${resource}" ]; do
let num=$num+1
sleep 1
if [ $num -gt 29 ]; then
logger -p $log -t $name "ERROR: GEOM provider /dev/hast/${resource} did not appear, exiting."
exit
fi
done
logger -p $log -t $name "Device /dev/hast/${resource} appeared for resource - ${resource} -."
done
# -- FILESYSTEM MANAGEMENT --
logger -p $log -t $name ">> Managing filesystems."
resource_counter=1
for resource in ${resources}; do
resource_mountpoint=`echo $resource_mountpoints | cut -d\ -f$resource_counter`
resource_filesystem=`echo $resource_filesystems | cut -d\ -f$resource_counter`
case "${resource_filesystem}" in
UFS)
logger -p $log -t $name "UFS filesystem detected in resource - ${resource} -."
mkdir -p ${resource_mountpoint} 2>/dev/null
logger -p $log -t $name "Checking /dev/hast/${resource} of resource - ${resource} -."
fsck -p -y -t ufs /dev/hast/${resource}
logger -p $log -t $name "Mounting /dev/hast/${resource} in ${resource_mountpoint}."
out=`mount /dev/hast/${resource} ${resource_mountpoint} 2>&1`
if [ $? -ne 0 ]; then
logger -p local0.error -t $name "ERROR: UFS mount - ${resource} - failed: ${out}."
exit 1
fi
logger -p local0.debug -t $name "UFS mount - ${resource} - mounted successfully."
;;
ZFS)
logger -p $log -t $name "ZFS filesystem detected in resource - ${resource} -."
logger -p $log -t $name "Importing ZFS pool of resource - ${resource} -."
out=`zpool import -f "${resource}" 2>&1`
if [ $? -ne 0 ]; then
logger -p local0.error -t $name "ERROR: ZFS pool import for resource - ${resource} - failed: ${out}."
exit 1
fi
logger -p local0.debug -t $name "ZFS pool for resource - ${resource} - imported successfully."
;;
*)
logger -p local0.error -t $name "ERROR: Unknown filesystem: ${resource_filesystem}, exiting."
exit 1
;;
esac
resource_counter=$((resource_counter+1))
done
# -- SERVICE MANAGEMENT --
logger -p $log -t $name ">> Starting services."
resource_counter=1
for resource in ${resources}; do
logger -p $log -t $name "Starting service for resource - ${resource} -."
resource_service=`echo $resource_services | cut -d\ -f$resource_counter`
case "${resource_service}" in
mysql)
logger -p $log -t $name "Service MySQL detected for resource - ${resource} -."
logger -p $log -t $name "Starting MySQL service for resource - ${resource} -."
service mysql-server start
logger -p $log -t $name "Done for resource . ${resource} -."
;;
nfs)
logger -p $log -t $name "Service NFS detected for resource - ${resource} -."
logger -p $log -t $name "Starting NFS service for resource - ${resource} -."
service nfsd start
logger -p $log -t $name "Done for resource - ${resource} -."
logger -p $log -t $name "Enabling NFS-ZFS share for resource - ${resource} -."
zfs set sharenfs="maproot=root:wheel" FilesData
logger -p $log -t $name "Done for resource - ${resource} -."
;;
*)
logger -p local0.error -t $name "ERROR: Unknown service: ${resource_service}, exiting."
exit 1
;;
esac
resource_counter=$((resource_counter+1))
done
;;
secondary)
logger -p $log -t $name "Switching to secondary provider for - ${resources} -."
# -- SERVICE MANAGEMENT --
logger -p $log -t $name ">> Stopping services."
resource_counter=1
for resource in ${resources}; do
resource_service=`echo $resource_services | cut -d\ -f$resource_counter`
logger -p $log -t $name "Stopping services for resource - ${resource} -."
case "${resource_service}" in
mysql)
logger -p $log -t $name "Service MySQL detected for resource - ${resource} -."
logger -p $log -t $name "Stopping MySQL service for resource - ${resource} -."
service mysql-server stop
logger -p $log -t $name "Done for resource - ${resource} -."
;;
nfs)
logger -p $log -t $name "Service NFS detected for resource - ${resource} -."
logger -p $log -t $name "Disabling NFS-ZFS share for resource - ${resource} -."
zfs set sharenfs="off" FilesData
logger -p $log -t $name "Done for resource - ${resource} -."
logger -p $log -t $name "Restarting NFS service for resource - ${resource} -."
service nfsd restart
logger -p $log -t $name "Done for resource - ${resource} -."
;;
*)
logger -p local0.error -t $name "ERROR: Unknown service: ${resource_service}, exiting."
exit 1
;;
esac
resource_counter=$((resource_counter+1))
done
# -- FILESYSTEM MANAGEMENT --
logger -p $log -t $name ">> Managing filesystems."
resource_counter=1
for resource in ${resources}; do
resource_mountpoint=`echo $resource_mountpoints | cut -d\ -f$resource_counter`
resource_filesystem=`echo $resource_filesystems | cut -d\ -f$resource_counter`
case "${resource_filesystem}" in
UFS)
logger -p $log -t $name "UFS filesystem detected in resource - ${resource} -."
if mount | grep -q "^/dev/hast/${resource} on "
then
logger -p $log -t $name "Unmounting - ${resource} -."
umount -f ${resource_mountpoint}
logger -p $log -t $name "Done."
fi
sleep $delay
;;
ZFS)
logger -p $log -t $name "ZFS filesystem detected in resources - ${resource} -."
if ! mount | grep -q "^${resource} on ${resource_mountpoint}"
then
else
logger -p $log -t $name "Umounting - ${resource} -."
zfs umount ${resource}
logger -p $log -t $name "Done."
logger -p $log -t $name "Exporting ZFS pool of resources - ${resource} -."
out=`zpool export -f "${resource}" 2>&1`
if [ $? -ne 0 ]; then
logger -p local0.error -t $name "ERROR: ZFS pool export for resource - ${resource} - failed: ${out}."
exit 1
fi
logger -p local0.error -t $name "ZFS pool for resource - ${resource} - exported successfully."
fi
sleep $delay
;;
*)
logger -p local0.error -t $name "ERROR: Unknown filesystem: ${resource_filesystem}, exiting."
exit 1
;;
esac
resource_counter=$((resource_counter+1))
done
# -- HAST ROLE MANAGEMENT --
logger -p $log -t $name ">> Managing resources."
resource_counter=1
for resource in ${resources}; do
logger -p $log -t $name "Switching resource - ${resource} - to secondary."
hastctl role secondary ${resource} 2>&1
if [ $? -ne 0 ]; then
logger -p $log -t $name "ERROR: Unable to switch resource - ${resource} - to secondary role."
exit 1
fi
logger -p $log -t $name "Role for resource - ${resource} - switched to secondary successfully."
resource_counter=$((resource_counter+1))
done
;;
esac
We assign the necessary permissions:
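chmod +x /usr/local/sbin/carp-hast-switch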
Testing:
MySQL:
To generate MySQL traffic, we will use sysbench. To do this, we create the test database on the primary node:
root@localhost [(none)]> create database kr0m;
Query OK, 1 row affected (0.00 sec)
root@localhost [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| kr0m |
| mysql |
| performance_schema |
| sys |
+--------------------+
5 rows in set (0.00 sec)
We create the access user with the necessary grants for the database:
root@localhost [(none)]> create user sbtest_user identified by 'password';
Query OK, 0 rows affected (0.01 sec)
root@localhost [(none)]> grant all on kr0m.* to `sbtest_user`@`%`;
Query OK, 0 rows affected (0.01 sec)
root@localhost [(none)]> show grants for sbtest_user;
+-------------------------------------------------------+
| Grants for sbtest_user@% |
+-------------------------------------------------------+
| GRANT USAGE ON *.* TO `sbtest_user`@`%` |
| GRANT ALL PRIVILEGES ON `kr0m`.* TO `sbtest_user`@`%` |
+-------------------------------------------------------+
2 rows in set (0.01 sec)
On our PC, we install sysbench:
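Assuming the client is also running FreeBSD, it is available as a package:
pkg install -y sysbench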
We create some tables and insert data into them:
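A prepare run along these lines creates the test tables in the kr0m database through the VIP (the number and size of tables are arbitrary values):
sysbench oltp_read_write --db-driver=mysql --mysql-host=192.168.69.40 --mysql-user=sbtest_user --mysql-password=password --mysql-db=kr0m --tables=4 --table-size=100000 prepare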
Now that we have data in the database, let’s have sysbench perform a read-write test:
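For example, a few minutes of read-write load with several threads (duration and thread count are just examples):
sysbench oltp_read_write --db-driver=mysql --mysql-host=192.168.69.40 --mysql-user=sbtest_user --mysql-password=password --mysql-db=kr0m --tables=4 --table-size=100000 --threads=4 --time=600 run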
NFS:
To test NFS access, we will use bonnie++. We install it on our PC where we have the NFS mounted:
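Again assuming a FreeBSD client:
pkg install -y bonnie++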
Let bonnie++ run while we perform the failover tests.
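A minimal invocation pointing it at the NFS mount could look like this (bonnie++ requires -u when run as root, and -x repeats the test so it keeps running during the failover):
bonnie++ -d /mnt/nfs -u root -x 100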
Failover:
To migrate the service between nodes, we have three possibilities:
- Migrate the VIP by running the following command on the primary node:
ifconfig em0 vhid 1 state backup
- Gracefully shut down the primary node:
service mysql-server stop
umount -f /var/db/mysql
service nfsd stop
zpool export -f FilesData
service hastd stop
shutdown -p now
- Abruptly shut down the primary node: pull the power cable.
In case of migrating the VIP, the node that has lost the floating IP will automatically reconfigure itself as secondary. In the other two cases, when the node comes back to life, the HAST resources will be in the init state:
Name Status Role Components
MySQLData - init /dev/ada1 192.168.69.42
Name Status Role Components
FilesData - init /dev/ada2 192.168.69.42
We need to switch it to secondary:
hastctl role secondary MySQLData
hastctl role secondary FilesData
service mysql-server stop
service nfsd stop
Resulting in the following state:
Name Status Role Components
MySQLData complete secondary /dev/ada1 192.168.69.42
Name Status Role Components
FilesData complete secondary /dev/ada2 192.168.69.42
Troubleshooting:
A very important point is that both nodes keep the same time, so using NTP is a good idea.
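For example, with the ntpd included in the base system:
sysrc ntpd_enable=YES
service ntpd start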
In case of split-brain, where data has been written on both nodes, the administrator must decide which node's data is more important, and on the other node discard its data and reconfigure it as secondary, repeating for each affected resource:
hastctl create FilesData
hastctl role secondary FilesData
There are occasions when the migration script fails. We can always run it manually to check if the script has any issues:
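/usr/local/sbin/carp-hast-switch primary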
We can see the steps performed by the failover script in the logs:
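The script logs through the local0 syslog facility, so where the messages end up depends on syslogd's configuration. One option (an assumption, adjust to taste) is to route the whole facility to its own file by adding local0.* /var/log/carp-hast.log to /etc/syslog.conf and then:
touch /var/log/carp-hast.log
service syslogd restart
tail -f /var/log/carp-hast.log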
We can also check the status of HAST:
hastctl status FilesData
Check that the ZFS pool is imported and that the NFS share is enabled:
zfs get sharenfs FilesData
Check that the test database exists:
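For example, using the sysbench user through the VIP:
mysql -h 192.168.69.40 -u sbtest_user -ppassword -e 'SHOW DATABASES;'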
Check that NFS access is available from the client:
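showmount -e 192.168.69.40
cat /mnt/nfs/AA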
We must be very clear that HAST is not a backup. If data is deleted, it will be replicated over the network to the secondary, resulting in data loss. Additionally, in certain scenarios such as its use as a backend for databases, it can cause problems. An incorrect shutdown of MySQL could leave the data corrupted, and it would be replicated at the block level to the secondary, leaving us with a corrupted database. In such cases, we can only restore a backup on the current primary.