Discussion:
[ale] HADTOO: Automatic-node-generating Hadoop cluster on Gentoo Linux
Jeff Hubbs via Ale
2018-09-07 20:46:23 UTC
Permalink
For the past few months, I've been operating an Apache Hadoop cluster at
Emory University's Goizueta Business School. That cluster is
Gentoo-Linux-based and consists of a dual-homed "edge node" and three
16-GiB-RAM 16-thread two-disk "worker" nodes. The edge node provides NAT
for the active cluster nodes and holds a complete mirror of the Gentoo
package repository that is updated nightly. There is also an auxiliary
edge node (a one-piece Dell Vostro 320) with xorg and xfce that I mostly
use to display exported instances of xosview from all of the other nodes
so that I can keep an eye on the cluster's operation. Each of the worker
nodes carries a standalone Gentoo Linux instance that was flown in via
rsync from another node while booted to a liveCD-style distribution
(SystemRescueCD, which happens to be Gentoo-based).

I have since set up the main edge node to form a "shadow cluster" in
addition to the one I've been operating. Via iPXE and dnsmasq on the
edge node, any x86_64 system that is connected to the internal cluster
network and allowed to PXE-boot will download a stripped-down Gentoo
instance via HTTP (served up by nginx), boot to this instance in RAM,
and execute a bash script that finds, partitions, and formats all of
that system's disks, downloads and writes to those disks a complete
Gentoo Linux instance, installs and configures the GRUB bootloader, sets
a hostname based on the system's first NIC's MAC address, and reboots
the system into that freshly-written instance.
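
For the curious, the whole boot chain is just a small dnsmasq fragment
plus an iPXE script that nginx serves. Roughly like this, with the
interface name, addresses, and file names as placeholders rather than my
exact config:

  # /etc/dnsmasq.d/cluster-pxe.conf on the edge node
  interface=enp2s0                          # internal cluster NIC
  dhcp-range=10.10.10.100,10.10.10.250,12h
  enable-tftp
  tftp-root=/var/lib/tftpboot
  dhcp-match=set:ipxe,175                   # requests from iPXE carry option 175
  dhcp-boot=tag:!ipxe,undionly.kpxe         # plain PXE firmware gets iPXE via TFTP
  dhcp-boot=tag:ipxe,http://10.10.10.1/boot.ipxe

  # /srv/www/boot.ipxe, served by nginx
  #!ipxe
  # (the real kernel line also carries arguments telling the initrd where
  # to fetch the squashfs image)
  kernel http://10.10.10.1/pxe/vmlinuz
  initrd http://10.10.10.1/pxe/initramfs
  boot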

At present, there is only one read/write NFS export on the edge node and
it holds a flat file that Hadoop uses as a list of available worker
nodes. The list is populated by the aforementioned node setup script
after the hostname is generated.
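
The hostname/registration step in the setup script amounts to something
like this (mount points and the naming scheme here are illustrative):

  # derive the node name from the first NIC's MAC address
  nic=$(ls /sys/class/net | grep -v '^lo$' | head -n 1)
  mac=$(cat /sys/class/net/"$nic"/address)
  myname="worker-${mac//:/}"
  echo "hostname=\"${myname}\"" > /mnt/gentoo/etc/conf.d/hostname  # new on-disk instance

  # register with the cluster via the edge node's one rw NFS export
  mkdir -p /mnt/cluster
  mount -t nfs 10.10.10.1:/export/cluster /mnt/cluster
  grep -qx "$myname" /mnt/cluster/workers || echo "$myname" >> /mnt/cluster/workers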

Both the PXE-booted Gentoo Linux instance and the on-disk instance are
managed within a chroot on the edge node in a manner not unlike how
Gentoo Linux is conventionally installed on a system. Once set up as
desired, these instances are compressed into separate squashfs files and
placed in the nginx doc root. In the case of the PXE-booted instance,
there is an intermediate step where much of the instance is stripped
away just to reduce the size of the squashfs file, which is currently
431MiB. The full cluster node distribution file is 1.6GiB but I
sometimes exclude the kernel source tree and local package
meta-repository to bring it down to 1.1GiB. The on-disk footprint of the
complete worker node instance is 5.9GiB.
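
The squash step itself is nothing exotic once the chroot is in shape;
it's basically this, with the paths and the exclude list illustrative:

  # compress the prepared chroot and drop it where nginx can serve it;
  # leaving out the kernel sources and the local overlay is what gets the
  # image from 1.6GiB down to 1.1GiB
  mksquashfs /srv/chroots/worker /srv/www/worker.squashfs \
      -comp xz -noappend \
      -e usr/src var/db/repos/local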

The node setup script takes the first drive it finds and GPT-partitions
it six ways: 1) a 2MiB "spacer" for the bootloader; 2) 256MiB for /boot;
3) 32GiB for root; 4) 2xRAM for swap (this is WAY overkill; it's set by
ratio in the script and a ratio of one or less would suffice); 5) 64GiB
for /tmp/hadoop-yarn (more about this later); 6) whatever is left for
/hdfs1. Any remaining disks identified are single-partitioned as /hdfs2,
/hdfs3, etc. All partitions are formatted btrfs with the exception of
/boot, which is vfat for UEFI compatibility (a route I went down because
I found one old laptop that was UEFI-only, and I expect that will become
more rather than less common over time). A quasi-boolean in the script
optionally enables compression at mount time for /tmp/hadoop-yarn.
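
In sgdisk terms the first-drive layout comes out roughly like this (the
device name is fixed here for illustration; the real script discovers
the disks and computes the swap size from RAM):

  disk=/dev/sda
  sgdisk --zap-all "$disk"
  sgdisk -n1:0:+2M   -t1:ef02 "$disk"   # 1: 2MiB bootloader "spacer" (BIOS boot partition)
  sgdisk -n2:0:+256M -t2:ef00 "$disk"   # 2: /boot, vfat ESP
  sgdisk -n3:0:+32G  -t3:8300 "$disk"   # 3: root
  sgdisk -n4:0:+32G  -t4:8200 "$disk"   # 4: swap (2 x RAM on these 16GiB boxes)
  sgdisk -n5:0:+64G  -t5:8300 "$disk"   # 5: /tmp/hadoop-yarn
  sgdisk -n6:0:0     -t6:8300 "$disk"   # 6: /hdfs1 gets whatever is left

  mkfs.vfat -F 32 "${disk}2"
  for p in 3 5 6; do mkfs.btrfs -f "${disk}${p}"; done   # one btrfs per partition
  mkswap "${disk}4"

  # the quasi-boolean just adds a compress option when mounting the YARN scratch space
  mkdir -p /tmp/hadoop-yarn
  mount -o compress=lzo "${disk}5" /tmp/hadoop-yarn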

One of Gentoo Linux's strengths is the ability to compile software
specifically for the CPU but the node instance is set up with the gcc
option -mtune=generic. Another quasi-boolean setting in the node setup
script will change that to -march=native, but that change only takes
effect when packages are built or rebuilt locally (as opposed to in
chroot on the edge node, where everything must be built generic). I can
couple this feature with another feature to optionally rebuild all the
system's binaries native but that's an operation that would take a fair
bit of time (that's over 500 packages and only some of them would affect
cluster operation). Similarly, in the interest of run-what-ya-brung
flexibility, I'm using Gentoo's genkernel utility to generate a kernel
and initrd befitting a liveCD-style instance that will boot on basically
any x86-64 along with whatever NICs and disk controllers it finds.
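
Concretely, that's just the CFLAGS line in the chroot's make.conf plus a
plain genkernel run, with the setup script optionally rewriting the flag
on the node; roughly:

  # /etc/portage/make.conf in the edge-node chroot: generic, runs anywhere
  CFLAGS="-O2 -pipe -mtune=generic"
  CXXFLAGS="${CFLAGS}"

  # node setup script, if the "native" quasi-boolean is set; only packages
  # (re)built on the node itself pick this up
  sed -i 's/-mtune=generic/-march=native/' /etc/portage/make.conf
  # full native rebuild (slow; 500+ packages):  emerge -e @world

  # liveCD-style kernel and initrd that cope with whatever hardware they land on
  genkernel all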

I am using the Hadoop binary distribution (currently 3.1.1) as
distributed directly by Apache (no Hortonworks; no Cloudera). Each
cluster node has its own Hadoop distribution, with configuration that is
partly common across the cluster and partly specific to that node,
modified in place by the node setup script. In the latter
case, the amount of available RAM, the number of available CPU threads,
and the list of available HDFS partitions on a system are flown into the
proper local config files. Hadoop services run in a Java VM; I am
currently using the IcedTea 3.8.0 source distribution supplied within
Gentoo's packaging system. I have also run it under the IcedTea binary
distribution and the Oracle JVM with equal success.
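
The "flown into" part is the setup script measuring the box and patching
its own copies of the stock config in place. A hedged sketch follows; the
@...@ tokens are my own template markers, not Hadoop syntax, and the
comments name the real properties they stand in for:

  # gather what this box has
  mem_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
  vcores=$(nproc)
  datadirs=$(ls -d /hdfs*/ | sed 's|/$|/dfs|' | paste -sd, -)

  # patch the node's own copies of the config (reserving ~2GiB of RAM for the OS)
  conf=/opt/hadoop/etc/hadoop
  sed -i "s|@YARN_MEM_MB@|$(( mem_mb - 2048 ))|" "$conf/yarn-site.xml"  # yarn.nodemanager.resource.memory-mb
  sed -i "s|@YARN_VCORES@|${vcores}|"            "$conf/yarn-site.xml"  # yarn.nodemanager.resource.cpu-vcores
  sed -i "s|@HDFS_DATADIRS@|${datadirs}|"        "$conf/hdfs-site.xml"  # dfs.datanode.data.dir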

Hadoop is built from three primary constructs. HDFS (Hadoop
Distributed File System) consists of a NameNode daemon that runs on a
single machine and controls the filesystem namespace and user access to
it; DataNode daemons run on each worker node and coordinate between the
NameNode daemon and the local machine's on-disk filesystem. You access
the filesystem with command-line-like options to the hdfs binary like
-put, -get, -ls, -mkdir, etc. but in the on-disk filesystem underneath
/hdfs1.../hdfsN, the files you write are cut up into "blocks" (default
size: 128MiB) and those blocks are replicated (default: three times)
among all the worker nodes. My initial cluster with standalone workers
reported 7.2TiB of HDFS available spread across six physical spindles.
As you can imagine, it's possible to accumulate tens of TiB of HDFS
across only a handful of nodes but doing so isn't necessarily helpful.
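
In practice that looks like this (paths made up):

  hdfs dfs -mkdir -p /user/jeff/input
  hdfs dfs -put big-local-file.csv /user/jeff/input/
  hdfs dfs -ls /user/jeff/input
  hdfs dfs -get /user/jeff/input/big-local-file.csv /tmp/copy.csv
  hdfs dfsadmin -report    # capacity, live DataNodes, block counts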

YARN (Yet Another Resource Negotiator) is the construct that manages the
execution of work among the nodes. Part of the whole point behind Hadoop
is to /move the processing to where the data is/ and it's YARN that
coordinates all that. It consists of a ResourceManager daemon that
communicates with all the worker nodes and NodeManager daemons that run
on each of the worker nodes. You can run the ResourceManager daemon and
HDFS' NameNode daemon on the same machines that act as worker nodes but
past a point you won't want to and past /that/ point you'd want to run
each of NameNode and ResourceManager on two separate machines. In that
regime, you'd have two machines dedicated to those roles (their names
would be taken out of the centrally-located workers file) and the rest
would run both the DataNode and NodeManager daemons, forming the HDFS
storage subsystem and the YARN execution subsystem.
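
Bringing all of that up under Hadoop 3.x is just a handful of commands on
the appropriate machines, something like:

  hdfs namenode -format               # once, on the NameNode machine

  hdfs --daemon start namenode        # on the NameNode machine
  hdfs --daemon start datanode        # on every worker
  yarn --daemon start resourcemanager # on the ResourceManager machine
  yarn --daemon start nodemanager     # on every worker
  mapred --daemon start historyserver # for the MapReduce job history (below)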

There is another construct, MapReduce, whose architecture I don't fully
understand yet; it comes into play as a later phase in Hadoop
computations and there is a JobHistoryServer daemon associated with it.

Another place where the bridge is out with respect to my understanding
of Hadoop is coding for it - but I'll get there eventually. There are
other apps like Apache's Spark and Hive that use HDFS and/or YARN that I
have better mental insight into, and I have successfully gotten
Python/Spark demo programs to run on YARN in my cluster.

One thing I have learned is that Hadoop clusters do not "genericize"
well. When I first tried running the Hadoop-supplied teragen/terasort
example (goal: make a file of 10^10 100-character lines and sort it), it
failed for want of space available in /tmp/hadoop-yarn but it ran
perfectly when the file was cut down to 1/100th that size. For my
PXE-boot-based cluster, I gave my worker nodes a separate partition for
/tmp/hadoop-yarn and gave it optional transparent compression. There are
a lot of parameters I haven't messed with yet, such as the minimum size
and minimum size increment of memory containers and the JVM settings,
but to optimize the cluster for a given job one would expect to.
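
For reference, the runs themselves are just the stock examples jar (the
path is where the Apache tarball puts it):

  # 10^10 rows x 100 bytes is roughly 1TB; the run that worked used 10^8 rows
  hadoop jar \
    $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar \
    teragen 10000000000 /teraInput
  hadoop jar \
    $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar \
    terasort /teraInput /teraOutput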

What I have right now - basically, a single Gentoo Linux instance for
installation on a dual-homed edge node - is able to generate a working
Hadoop cluster with an arbitrary number of nodes, limited primarily by
space, cooling, and electric power (the Dell Optiplex desktops I'm using
right now max out at about an amp, so you have to be prepared to supply
at least N amps for N nodes). They can be purpose-built rack-mount
servers, a lab environment full of thin clients, or wire shelf units
full of discarded desktops and laptops.

- Jeff
dev null zero two via Ale
2018-09-07 20:52:25 UTC
Permalink
that is pretty incredible (never thought to use Gentoo for this purpose).

have you thought about using orchestration tools for this (Kubernetes etc.)?
Jeff Hubbs via Ale
2018-09-08 03:54:04 UTC
Permalink
Post by dev null zero two via Ale
that is pretty incredible (never thought to use Gentoo for this purpose).
I use it for everything. :)
Post by dev null zero two via Ale
have you thought about using orchestration tools for this (Kubernetes etc.)?
I have been immersed in enough IRC/forum/StackExchange traffic that I
know that people do this, but thus far I haven't seen a need, beyond
simply crafting a single readily replicable Linux instance, that
justifies the added complexity. In addition to being able to make
changes to that instance in chroot on the edge node and reboot the
workers and any other daemon-running machines (to facilitate this, I set
the machines to boot to disk first and PXE second and then I have a
script that "breaks" the first drive on each worker node by overwriting
the GPT partition table and forcing a reboot), I can still "fan" changes
via ssh across all nodes serially or simultaneously and I still have the
NFS mechanism available to me (right now, it handles only the Hadoop
workers file).
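
Both tricks are one-liners from the edge node; roughly (paths and the
placeholder command are illustrative):

  # wipe each worker's partition table so its next boot falls through to
  # PXE and the node rebuilds itself from scratch
  for h in $(cat /export/cluster/workers); do
      ssh root@"$h" 'sgdisk --zap-all /dev/sda && reboot' &
  done
  wait

  # the same loop, minus the destruction, is the ssh "fan"
  for h in $(cat /export/cluster/workers); do
      ssh root@"$h" 'whatever-needs-doing'   # append & and wait to do them all at once
  done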

By the way, I've done this thing where the node setup script opens the
optical media tray, which then automatically closes after a few minutes
when the machine reboots. The on-disk instance is set to open and
immediately close the tray at boot, so if all the machines have optical
drives with trays that can open and close on command there will be quite
an entertaining racket when a whole cluster starts up.

Hey, Jim/Aaron - just think; I could have turned Sutton Middle School
into a 500-node Hadoop cluster! There's a 24-seat lab at work that'd be
good for about 3.3TiB of HDFS.
Jim Kinney via Ale
2018-09-09 13:35:21 UTC
Permalink
Ha!!

My ultimate plan with the APS project was to cobble all the "waste machines" together into a system-wide distributed storage cluster and then run a single-system-image cluster tool like MOSIX or Kerrighed. That would make the entire school district one giant, single-instance supercomputer that K-12 students use :-)

Excellent work on the hadoop cluster!

I have my cluster nodes set to always PXE boot, with the PXE boot default falling back to the local boot drive. That way I can drive a new install/rebuild by twiddling a file on the DHCP server and rebooting a node. Eventually, the nodes will use no local storage at all: every drive will be reserved for /tmp (RAID0 across all drives for speed), with an NFS-mounted root and remote logging. Basically a homegrown PaaS setup controlled by job-submission-defined need.

My current jobs are compiled MATLAB and custom Python. Hadoop is coming back. I can't run Hadoop on the same nodes as the other jobs since it assumes full control of the system, and there isn't enough demand yet for a dedicated Hadoop stack. So a reboot into an NFS-root Hadoop cluster, with a temporary "node offline" status in Torque, seems feasible: use pre-run scripts in Torque to reboot a node into Hadoop mode and post-run scripts to reboot it back to normal cluster mode.
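
Very rough sketch of the idea (untested; the path is TORQUE's stock
mom_priv location and the PXE twiddle is hand-waved):

  #!/bin/bash
  # /var/spool/torque/mom_priv/prologue: runs as root before the job.
  # For a hadoop-queue job: mark the node offline so nothing else lands
  # on it, point its PXE entry at the NFS-root hadoop image, and reboot.
  node=$(hostname -s)
  pbsnodes -o "$node" -N "rebooting into hadoop image"
  ssh pxeserver "ln -sf hadoop.cfg /tftpboot/pxelinux.cfg/$node"
  reboot
  # the matching epilogue flips the link back and clears it: pbsnodes -c "$node"
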
--
Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity.
Jeff Hubbs via Ale
2018-09-10 15:22:46 UTC
Permalink
Post by Jim Kinney via Ale
Ha!!
My ultimate plan with the APS project was to cobble all the "waste
machines" together into a system-wide distributed storage cluster then
run a single-system-image cluster tool like MOSIX or Kerrighed. That
would make the entire school district a single instance, giant super
computer that K12 students use :-)
I haven't had occasion to use this yet but Hadoop has a thing where you
can divide your nodes into "racks" such that the ResourceManager knows
to expect network bottlenecks between groups of nodes and moves
processing and storage accordingly; in the APS environment one would
group each school into a "rack" (no idea if Hadoop supports "racks" of
"racks").
Post by Jim Kinney via Ale
Excellent work on the hadoop cluster!
Thanks! It's been interesting and I've covered a few things along the
way I'd long wanted to be able to do (like PXE-booting), so, bonus.
Post by Jim Kinney via Ale
I have my cluster nodes set to always PXE boot, with the PXE boot
default falling back to the local boot drive. That way I can drive a
new install/rebuild by twiddling a file on the DHCP server and
rebooting a node. Eventually, the nodes will use no local storage at
all: every drive will be reserved for /tmp (RAID0 across all drives for
speed), with an NFS-mounted root and remote logging. Basically a
homegrown PaaS setup controlled by job-submission-defined need.
I thought about running the working instance out of RAM leaving
everything NFS-based but I decided against it. For one thing, Hadoop
activity alone hammers the LAN and if a lot of additional traffic (the
Hadoop binary distribution is about 830MiB) is trying to concentrate on,
say, the edge node where the worker node instance actually lives while
there's a lot of HDFS traffic shooting from node to node, that's not cool.
Post by Jim Kinney via Ale
My current jobs are compiled MATLAB and custom Python. Hadoop is
coming back. I can't run Hadoop on the same nodes as the other jobs
since it assumes full control of the system, and there isn't enough
demand yet for a dedicated Hadoop stack. So a reboot into an NFS-root
Hadoop cluster, with a temporary "node offline" status in Torque, seems
feasible: use pre-run scripts in Torque to reboot a node into Hadoop
mode and post-run scripts to reboot it back to normal cluster mode.
Sounds reasonable.