This article collects recommendations to consider when deploying a Mongo server on Linux: the kernel, the file system, OS limits, and considerations for SELinux/GrSec, among others.
These are only basic guidelines; every deployment has its own particularities.
Kernel
2.6.36 or later.
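A minimal sketch for checking the running kernel against that minimum. The helper name version_ge is made up for this example, and it assumes GNU sort with the -V (version sort) option is available:

```shell
#!/bin/bash
# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V orders version strings; if $2 sorts first (or ties), then $1 >= $2
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Strip the distribution suffix, e.g. "3.10.0-1160.el7.x86_64" -> "3.10.0"
running="$(uname -r | cut -d- -f1)"

if version_ge "$running" "2.6.36"; then
    echo "kernel $running: OK"
else
    echo "kernel $running: older than 2.6.36"
fi
```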
File system
XFS or Ext4, preferably XFS. Mongo requires a file system that supports the fsync() call on directories; some exotic file systems, such as HGFS or VirtualBox shared folders, do not support it.
Add the noatime option to the partition where the DB files are located, so that reads do not trigger access-time updates.
Readahead is beneficial when reading contiguous areas of the disk, but Mongo does not follow this pattern; it mostly performs random reads, so we lower the value:
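The usual goal here is to avoid access-time updates, which means mounting with noatime. The sketch below shows what the fstab entry might look like and checks the current mount options. The device /dev/sdb1, the mount point /var/lib/mongodb and the helper has_noatime are placeholders for this example, and findmnt (from util-linux) is assumed to be installed:

```shell
#!/bin/bash
# Example /etc/fstab entry (device and mount point are placeholders):
#   /dev/sdb1  /var/lib/mongodb  xfs  defaults,noatime  0 0

# Succeeds when the comma-separated mount options in $1 include noatime.
has_noatime() {
    case ",$1," in
        *,noatime,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical data directory; pass the real dbPath as the first argument.
dbpath="${1:-/var/lib/mongodb}"
opts="$(findmnt -n -o OPTIONS --target "$dbpath" 2>/dev/null)"
if has_noatime "$opts"; then
    echo "$dbpath: noatime is set"
else
    echo "$dbpath: noatime is NOT set (options: ${opts:-unknown})"
fi
```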
blockdev --setra 32 /dev/sda
blockdev --getra /dev/sda
32
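blockdev expresses readahead in 512-byte sectors, so the value 32 corresponds to 16 KiB. A small sketch of the conversion (the function name ra_sectors_to_kib is made up for this example):

```shell
#!/bin/bash
# Convert a readahead value in 512-byte sectors (as used by blockdev) to KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

ra_sectors_to_kib 32    # prints 16
ra_sectors_to_kib 256   # prints 128 (256 sectors is a frequent default)
```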
We can set it at OS startup:
#! /bin/bash
blockdev --setra 32 /dev/sda
OS Limits
The limits recommended by Mongo (as shown by ulimit) are:
-f (file size): unlimited
-t (cpu time): unlimited
-v (virtual memory): unlimited
-n (open files): 64000
-m (memory size): unlimited
-u (processes/threads): 64000
The corresponding entries in /etc/security/limits.conf, assuming mongod runs as the mongodb user:
mongodb soft fsize unlimited
mongodb hard fsize unlimited
mongodb soft cpu unlimited
mongodb hard cpu unlimited
mongodb soft as unlimited
mongodb hard as unlimited
mongodb soft nofile 64000
mongodb hard nofile 64000
mongodb soft nproc 64000
mongodb hard nproc 64000
With this small script we can check that the limits are applied to the running processes:
#!/bin/bash
# Usage: ./check_limits.sh mongod [other_process ...]
for process in "$@"; do
    # ps pads PIDs with spaces; the unquoted loop below splits them correctly
    process_pids=$(ps -C "$process" -o pid --no-headers)
    if [ -z "$process_pids" ]; then
        echo "[no $process running]"
    else
        for pid in $process_pids; do
            echo "[$process #$pid -- limits]"
            cat "/proc/$pid/limits"
        done
    fi
done
NTP
Keep the clocks synchronized with NTP. This is especially important for sharded clusters.
Transparent Huge Pages
The mapping between virtual and physical memory is carried out by the CPU's MMU, and this process is relatively slow. To mitigate this, a mapping cache called the TLB was created, but the TLB is of limited size, so its entries are constantly being replaced.
Since the TLB is limited, memory pages can be made larger so that the same number of TLB entries covers more physical memory.
- Huge pages are appropriate for applications that access large contiguous memory regions.
- They are a poor fit for applications that access small portions of memory at scattered positions, as is the case with Mongo.
We disable huge pages in the startup script:
#!/bin/bash
blockdev --setra 32 /dev/sda
# Locate the THP directory (RHEL 6 uses a different name)
if [ -d /sys/kernel/mm/transparent_hugepage ]; then
    thp_path=/sys/kernel/mm/transparent_hugepage
elif [ -d /sys/kernel/mm/redhat_transparent_hugepage ]; then
    thp_path=/sys/kernel/mm/redhat_transparent_hugepage
else
    # No THP support in this kernel; nothing to do
    exit 0
fi
echo 'never' > "${thp_path}/enabled"
echo 'never' > "${thp_path}/defrag"
re='^[0-1]+$'
if [[ $(cat "${thp_path}/khugepaged/defrag") =~ $re ]]; then
    # RHEL 7
    echo 0 > "${thp_path}/khugepaged/defrag"
else
    # RHEL 6
    echo 'no' > "${thp_path}/khugepaged/defrag"
fi
unset re
unset thp_path
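After a reboot it is worth confirming that the setting took effect. The kernel marks the active value with square brackets, for example "always madvise [never]". A minimal sketch that extracts it (the helper name thp_active is made up for this example, and only the standard, non-RHEL 6 path is checked):

```shell
#!/bin/bash
# Print the active (bracketed) value of a THP setting line,
# e.g. "always madvise [never]" -> "never".
thp_active() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/defrag; do
    # Skip files that do not exist on this kernel
    [ -r "$f" ] || continue
    echo "$f: $(thp_active "$(cat "$f")")"
done
```

Both files should report never once the startup script has run.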
NUMA
The NUMA architecture was designed to overcome scalability problems in SMP architectures, where all CPUs compete for access to the RAM memory bus, causing a bottleneck. NUMA, on the other hand, proposes a RAM access bus for every X CPUs, with access to this RAM being faster than access to RAM from another group of CPUs.
On paper, NUMA looks very good, but Mongo is known to run into problems in this type of environment. First, we find out whether our hardware is NUMA:
numactl --hardware
available: 1 nodes (0) --> There is only one node, it is NOT NUMA: Intel Xeon CPU E5-1660 v3 @ 3.00GHz
available: 2 nodes (0-1) --> It is NUMA: AMD Opteron Processor 4386
NUMA itself cannot be switched off from the OS, but we can disable zone reclaim:
sysctl -w vm.zone_reclaim_mode=0
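sysctl -w only changes the running kernel, so the value is lost on reboot. A sketch of how it might be persisted and verified follows. The helper name zone_reclaim_status is made up for this example, and /proc/sys/vm/zone_reclaim_mode is only present on kernels built with NUMA support:

```shell
#!/bin/bash
# To persist across reboots, add this line to /etc/sysctl.conf
# (or to a file under /etc/sysctl.d/):
#   vm.zone_reclaim_mode = 0

# Describe a zone_reclaim_mode value: 0 means zone reclaim is disabled.
zone_reclaim_status() {
    if [ "$1" = "0" ]; then
        echo "disabled"
    else
        echo "enabled (mode $1)"
    fi
}

f=/proc/sys/vm/zone_reclaim_mode
if [ -r "$f" ]; then
    echo "zone reclaim: $(zone_reclaim_status "$(cat "$f")")"
else
    echo "zone reclaim: $f not available on this kernel"
fi
```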
We start mongo using numactl, so that memory is interleaved across all nodes:
cd
screen -S mongo
numactl --interleave=all /usr/bin/mongod --config /etc/mongodb.conf
SELinux
SELinux can block mongod unless its policy is adjusted; see the official guide:
https://docs.mongodb.com/manual/tutorial/install-mongodb-on-red-hat/#install-rhel-configure-selinux
GrSec
https://jira.mongodb.org/browse/SERVER-12991
V8, the Javascript engine used by mongodb, requires the ability to write executable pages of memory for Just-In-Time (JIT) compilation.
If an operating system has been configured so it’s not possible to write to executable memory regions, V8 cannot function.