Jeff Hubbs via Ale
2018-09-07 20:46:23 UTC
For the past few months, I've been operating an Apache Hadoop cluster at
Emory University's Goizueta Business School. That cluster is
Gentoo-Linux-based and consists of a dual-homed "edge node" and three
16-GiB-RAM 16-thread two-disk "worker" nodes. The edge node provides NAT
for the active cluster nodes and holds a complete mirror of the Gentoo
package repository that is updated nightly. There is also an auxiliary
edge node (a one-piece Dell Vostro 320) with xorg and xfce that I mostly
use to display exported instances of xosview from all of the other nodes
so that I can keep an eye on the cluster's operation. Each of the worker
nodes carries a standalone Gentoo Linux instance that was flown in via
rsync from another node while booted to a liveCD-style distribution
(SystemRescueCD, which happens to be Gentoo-based).
I have since set up the main edge node to form a "shadow cluster" in
addition to the one I've been operating. Via iPXE and dnsmasq on the
edge node, any x86_64 system that is connected to the internal cluster
network and allowed to PXE-boot will download a stripped-down Gentoo
instance via HTTP (served up by nginx), boot to this instance in RAM,
and execute a bash script that finds, partitions, and formats all of
that system's disks, downloads and writes to those disks a complete
Gentoo Linux instance, installs and configures the GRUB bootloader, sets
a hostname based on the system's first NIC's MAC address, and reboots
the system into that freshly-written instance.
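As an illustration of the MAC-based hostname step, here is a minimal Python sketch. It is not the actual bash script: the "node-" prefix, the function name, and the exact format are assumptions made for the example.

```python
def hostname_from_mac(mac: str, prefix: str = "node") -> str:
    """Build a hostname from a MAC address.

    The naming scheme (prefix plus twelve lowercase hex digits) is
    hypothetical; the original script's format is not given.
    """
    # Keep only the hex digits so ':' and '-' separators are both accepted.
    hex_digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return f"{prefix}-{hex_digits}"


if __name__ == "__main__":
    print(hostname_from_mac("00:1A:2B:3C:4D:5E"))
```

Deriving the name from the hardware address means a node gets the same hostname every time it is re-imaged, with no per-node state kept on the edge node.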
At present, there is only one read/write NFS export on the edge node and
it holds a flat file that Hadoop uses as a list of available worker
nodes. The list is populated by the aforementioned node setup script
after the hostname is generated.
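The registration step could look roughly like the following Python sketch, which appends a hostname to the workers file only if it is not already listed. The function name and file path are hypothetical, and the sketch does not guard against two nodes writing to the NFS export at the same moment.

```python
from pathlib import Path


def register_worker(workers_file: Path, hostname: str) -> bool:
    """Add hostname to the workers file unless it is already present.

    Returns True if the file was changed. One hostname per line is the
    layout Hadoop expects for its workers file.
    """
    existing = workers_file.read_text().split() if workers_file.exists() else []
    if hostname in existing:
        return False  # re-imaged node: nothing to do
    with workers_file.open("a") as fh:
        fh.write(hostname + "\n")
    return True


if __name__ == "__main__":
    # Hypothetical path used only for this demonstration.
    print(register_worker(Path("/tmp/workers.example"), "node-001a2b3c4d5e"))
```

Making the step idempotent matters here because a node that is PXE-booted a second time regenerates the same hostname.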
Both the PXE-booted Gentoo Linux instance and the on-disk instance are
managed within a chroot on the edge node in a manner not unlike how
Gentoo Linux is conventionally installed on a system. Once set up as
desired, these instances are compressed into separate squashfs files and
placed in the nginx doc root. In the case of the PXE-booted instance,
there is an intermediate step where much of the instance is stripped
away just to reduce the size of the squashfs file, which is currently
431MiB. The full cluster node distribution file is 1.6GiB but I
sometimes exclude the kernel source tree and local package
meta-repository to bring it down to 1.1GiB. The on-disk footprint of the
complete worker node instance is 5.9GiB.
The node setup script takes the first drive it finds and GPT-partitions
it six ways: 1) a 2MiB "spacer" for the bootloader; 2) 256MiB for /boot;
3) 32GiB for root; 4) 2xRAM for swap (this is WAY overkill; it's set by
ratio in the script and a ratio of one or less would suffice); 5) 64GiB
for /tmp/hadoop-yarn (more about this later); 6) whatever is left for
/hdfs1. Any remaining disks identified are single-partitioned as /hdfs2,
/hdfs3, etc. All partitions are formatted btrfs with the exception of
/boot, which is vfat for UEFI compatibility (a route I went down because
I have one old laptop I found that was UEFI-only and I expect that will
become more the case than less over time). A quasi-boolean in the script
optionally enables compression at mount time for /tmp/hadoop-yarn.
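The first-disk layout described above can be worked through as arithmetic. This Python sketch computes the six partition sizes from a disk size, the RAM size, and the swap ratio; it ignores GPT metadata and alignment padding, and the function and label names are assumptions, not taken from the setup script.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def plan_first_disk(disk_bytes: int, ram_bytes: int, swap_ratio: float = 2.0):
    """Return [(label, size_in_bytes), ...] for the six-way layout."""
    fixed = [
        ("bootloader-spacer", 2 * MIB),
        ("/boot", 256 * MIB),
        ("/", 32 * GIB),
        ("swap", int(ram_bytes * swap_ratio)),  # ratio-driven, as in the post
        ("/tmp/hadoop-yarn", 64 * GIB),
    ]
    remaining = disk_bytes - sum(size for _, size in fixed)
    if remaining <= 0:
        raise ValueError("disk too small for the fixed partitions")
    # Whatever is left becomes the first HDFS data partition.
    return fixed + [("/hdfs1", remaining)]


if __name__ == "__main__":
    for label, size in plan_first_disk(1000 * GIB, 16 * GIB):
        print(f"{label:20s} {size / GIB:8.2f} GiB")
```

On a 16-GiB-RAM node, dropping the swap ratio from 2 to 1 hands 16 GiB back to /hdfs1.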
One of Gentoo Linux's strengths is the ability to compile software
specifically for the CPU but the node instance is set up with the gcc
option -mtune=generic. Another quasi-boolean setting in the node setup
script will change that to -march=native but that change will only
take effect when packages are built or rebuilt locally (as opposed to in
chroot on the edge node, where everything must be built generic). I can
couple this feature with another feature to optionally rebuild all the
system's binaries native but that's an operation that would take a fair
bit of time (that's over 500 packages and only some of them would affect
cluster operation). Similarly, in the interest of run-what-ya-brung
flexibility, I'm using Gentoo's genkernel utility to generate a kernel
and initrd befitting a liveCD-style instance that will boot on basically
any x86-64 along with whatever NICs and disk controllers it finds.
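The generic-to-native switch amounts to a text substitution in the node's compiler-flag settings. A minimal Python sketch, assuming the flags live on a COMMON_FLAGS line in make.conf (the real script's variable name and mechanism are not stated):

```python
import re


def set_native(make_conf_text: str) -> str:
    """Swap -mtune=generic for -march=native wherever it appears."""
    return re.sub(r"-mtune=generic\b", "-march=native", make_conf_text)


if __name__ == "__main__":
    sample = 'COMMON_FLAGS="-O2 -pipe -mtune=generic"\n'  # assumed layout
    print(set_native(sample), end="")
```

The edit alone changes nothing that is already installed; only packages built after it pick up the new flags.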
I am using the Hadoop binary distribution (currently 3.1.1) as
distributed directly by Apache (no Hortonworks; no Cloudera). Each
cluster node has its own Hadoop distribution and each node's Hadoop
distribution has configuration features both in common and specific to
that node, modified in place by the node setup script. In the latter
case, the amount of available RAM, the number of available CPU threads,
and the list of available HDFS partitions on a system are flown into the
proper local config files. Hadoop services run in a Java VM; I am
currently using the IcedTea 3.8.0 source distribution supplied within
Gentoo's packaging system. I have also run it under the IcedTea binary
distribution and the Oracle JVM with equal success.
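To show how per-node values might land in the Hadoop config files, here is a hedged Python sketch. The property names are real Hadoop settings, but the 2 GiB reserved for the OS and daemons, the function names, and the choice of exactly these three properties are assumptions, not the setup script's actual behavior.

```python
import xml.etree.ElementTree as ET


def node_settings(ram_mib: int, threads: int, hdfs_mounts, reserve_mib: int = 2048):
    """Map detected hardware onto YARN and HDFS properties."""
    yarn = {
        # Memory YARN may hand out to containers on this node.
        "yarn.nodemanager.resource.memory-mb": max(ram_mib - reserve_mib, 1024),
        # CPU threads YARN may schedule on this node.
        "yarn.nodemanager.resource.cpu-vcores": threads,
    }
    # The DataNode spreads its blocks across every listed directory.
    hdfs = {"dfs.datanode.data.dir": ",".join(hdfs_mounts)}
    return yarn, hdfs


def to_hadoop_xml(props: dict) -> str:
    """Render properties in Hadoop's <configuration> XML format."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    yarn, hdfs = node_settings(16384, 16, ["/hdfs1", "/hdfs2"])
    print(to_hadoop_xml(yarn))
    print(to_hadoop_xml(hdfs))
```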
Hadoop is made up of three primary constructs. HDFS (Hadoop
Distributed File System) consists of a NameNode daemon that runs on a
single machine and controls the filesystem namespace and user access to
it; DataNode daemons run on each worker node and coordinate between the
NameNode daemon and the local machine's on-disk filesystem. You access
the filesystem with shell-like options to the hdfs binary's dfs
subcommand (-put, -get, -ls, -mkdir, etc.) but in the on-disk filesystem
underneath
/hdfs1.../hdfsN, the files you write are cut up into "blocks" (default
size: 128MiB) and those blocks are replicated (default: three times)
among all the worker nodes. My initial cluster with standalone workers
reported 7.2TiB of HDFS available spread across six physical spindles.
As you can imagine, it's possible to accumulate tens of TiB of HDFS
across only a handful of nodes but doing so isn't necessarily helpful.
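The block and replication defaults translate into simple arithmetic. The sketch below (function name hypothetical) gives the number of blocks a file is cut into and the raw disk space its replicas occupy across the cluster.

```python
BLOCK_SIZE = 128 * 1024 ** 2  # default HDFS block size, 128 MiB
REPLICATION = 3               # default replication factor


def hdfs_footprint(file_bytes: int,
                   block_size: int = BLOCK_SIZE,
                   replication: int = REPLICATION):
    """Return (block_count, raw_bytes_across_cluster) for one file."""
    blocks = -(-file_bytes // block_size)  # integer ceiling division
    raw_bytes = file_bytes * replication   # every block is stored `replication` times
    return blocks, raw_bytes


if __name__ == "__main__":
    blocks, raw = hdfs_footprint(1024 ** 3)  # a 1 GiB file
    print(blocks, raw / 1024 ** 3)
```

One consequence: if the reported HDFS capacity is raw space, as I believe it is, then with three-way replication it holds roughly a third of that in unique data.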
YARN (Yet Another Resource Negotiator) is the construct that manages the
execution of work among the nodes. Part of the whole point behind Hadoop
is to /move the processing to where the data is/ and it's YARN that
coordinates all that. It consists of a ResourceManager daemon that
communicates with all the worker nodes and NodeManager daemons that run
on each of the worker nodes. You can run the ResourceManager daemon and
HDFS' NameNode daemon on the same machines that act as worker nodes but
past a point you won't want to and past /that/ point you'd want to run
each of NameNode and ResourceManager on two separate machines. In that
regime, you'd have two machines dedicated to those roles (their names
would be taken out of the centrally-located workers file) and the rest
would run both the DataNode and NodeManager daemons, forming the HDFS
storage subsystem and the YARN execution subsystem.
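The dedicated-masters arrangement can be expressed as a small role assignment. In this Python sketch the hostnames and the function are made up for illustration; it simply removes the two master hosts from the worker list, as described above.

```python
def assign_roles(hosts, namenode, resourcemanager):
    """Return (roles_by_host, workers) for a cluster with dedicated masters."""
    roles = {namenode: ["NameNode"]}
    # setdefault lets one host carry both master roles on a small cluster.
    roles.setdefault(resourcemanager, []).append("ResourceManager")
    workers = [h for h in hosts if h not in (namenode, resourcemanager)]
    for host in workers:
        roles[host] = ["DataNode", "NodeManager"]
    return roles, workers


if __name__ == "__main__":
    roles, workers = assign_roles(["n1", "n2", "n3", "n4", "n5"], "n1", "n2")
    print(workers)  # this list is what would go in the workers file
```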
There is another construct, MapReduce, whose architecture I don't fully
understand yet; it comes into play as a later phase in Hadoop
computations and there is a JobHistoryServer daemon associated with it.
Another place where the bridge is out with respect to my understanding
of Hadoop is coding for it - but I'll get there eventually. There are
other apps like Apache's Spark and Hive that use HDFS and/or YARN that I
have better mental insight into, and I have successfully gotten
Python/Spark demo programs to run on YARN in my cluster.
One thing I have learned is that Hadoop clusters do not "genericize"
well. When I first tried running the Hadoop-supplied teragen/terasort
example (goal: make a file of 10^10 100-character lines and sort it), it
failed for want of space available in /tmp/hadoop-yarn but it ran
perfectly when the file was cut down to 1/100th that size. For my
PXE-boot-based cluster, I gave my worker nodes a separate partition for
/tmp/hadoop-yarn and gave it optional transparent compression. There are
a lot of parameters for controlling things like minimum size and minimum
size increment of memory containers and JVM parameters that I haven't
messed with but to optimize the cluster for a given job, one would
expect to.
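Some back-of-the-envelope arithmetic shows why the full-size run is demanding. The sketch below assumes 100-byte rows and, as a rough simplification, that the sort's intermediate data is about the size of the input and spread evenly over the workers; neither assumption was measured on this cluster.

```python
GIB = 1024 ** 3


def teragen_bytes(rows: int, row_bytes: int = 100) -> int:
    """Total size of the generated input."""
    return rows * row_bytes


def per_node_share_gib(total_bytes: int, workers: int) -> float:
    """Even split of the data across workers, in GiB (a simplification)."""
    return total_bytes / workers / GIB


if __name__ == "__main__":
    full = teragen_bytes(10 ** 10)
    small = teragen_bytes(10 ** 8)  # the 1/100th-size run
    print(round(per_node_share_gib(full, 3), 1))
    print(round(per_node_share_gib(small, 3), 1))
```

Under those assumptions, the full run puts on the order of hundreds of GiB of scratch data on each of three workers, while the 1/100th-size run needs only a few GiB.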
What I have right now - basically, a single Gentoo Linux instance for
installation on a dual-homed edge node - is able to generate a working
Hadoop cluster with an arbitrary number of nodes, limited primarily by
space, cooling, and electric power (the Dell Optiplex desktops I'm using
right now max out at about an amp, so you have to be prepared to supply
at least N amps for N nodes). They can be purpose-built rack-mount
servers, a lab environment full of thin clients, or wire shelf units
full of discarded desktops and laptops.
- Jeff