Advanced Development and Delivery (ADD) [Part-4]

This is the fourth installment in a series describing a radically more productive development and delivery environment.

The first part is here: Intro. In the previous parts I described the big picture and the Vagrant and EC2 bootstrap.

Node initialization

The previous parts described how to get an operational node on Vagrant and on EC2. On Vagrant, the node leverages ‘host’ virtual disk access to configure and bootstrap itself. On EC2, it leverages CloudFormation to do the same. In both cases the very last thing the node does in the bootstrap is:

cd /root/gitrepo/`cat /root/nodeinfo/initgitrepo.txt`
include () { if [[ -f "$1" ]]; then source "$1"; else echo "Skipped missing: $1"; fi }
include it/nodeinit/common/init.sh

The script is pulled in with an ‘include/source’ so that it runs at the same level as the initial bootstrap script. For EC2 this affects logging, so continual sourcing is preferred. In other cases, the ‘source’ lets sub-scripts set values for subsequent scripts, which would not work from a subshell because subshells are more isolated.
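
To make the difference concrete, here is a minimal sketch that runs the same one-line script twice: once as a child process and once sourced. The file name `setvalue.sh` and the variable `STACK_COLOR` are made up for this illustration and are not part of ADD.

```shell
#!/bin/bash
# Hypothetical demo: compare running a script as a child process with sourcing it.
demo_dir="$(mktemp -d)"
printf 'STACK_COLOR=blue\n' > "${demo_dir}/setvalue.sh"

# Run as a child process: the assignment dies with that process.
unset STACK_COLOR
bash "${demo_dir}/setvalue.sh"
after_child="${STACK_COLOR:-unset}"

# Source into the current shell ('.' is the portable spelling of 'source'):
# the assignment is still visible afterwards.
. "${demo_dir}/setvalue.sh"
after_source="${STACK_COLOR:-unset}"

echo "after child process: ${after_child}"
echo "after source:        ${after_source}"
rm -rf "${demo_dir}"
```

Only the sourced run leaves the value behind, which is why the init scripts are chained with ‘include’.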

init.sh

The init script first figures out where it is and sets up some important paths.

#!/bin/bash

#===================================================
#=== Want DIR to be root of the 'nodeinit' directory
#===================================================

export DIR="$( cd -P "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/../"
export RESOURCE=${DIR}/resource
export COMMON=${DIR}/common

cron_1m.sh

It then gets some AWS resources, sets up a shared ‘cron’, and so on. I like a single ‘cron’ job running every minute so it is easy to understand what is going on. This is the ‘heartbeat’ of the server configuration infrastructure: a server can decide to change at any ‘minute’. Servers look every minute for something that makes them want to change, and then they launch an activity. The ‘look’ needs to be fast: it should take only a second or two and not cause much load. But the ‘change’ does not have to be fast: it could take minutes to reconfigure based on the change. So while changing, the ‘looking’ is disabled. For example, deploying a new WAR can take a while, so the server stops looking for new WARs while it is deploying one and starts looking again when it is back online.

At scale (say 100 servers) with servers all on NTP this one-minute rhythm can cause resource rushing. To counter that we need to ‘jitter’ the servers so they work on a different second of the minute, or even as much as minutes later at super-scale (1000 servers). That is done within the cron_1m.sh script after the look has established something needs to be done.
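
The jitter code itself is not shown here, so the following is only a sketch of one way it could be done. It assumes a stable per-host offset derived from a checksum of the host name; the function name `jitter_seconds` is hypothetical. A random offset (for example from `$RANDOM`) would work as well, but would move each server to a different second on every run.

```shell
#!/bin/bash
# Hypothetical helper: map a host name to a stable offset in [0, window).
#   $1 = host name, $2 = window size in seconds
jitter_seconds() {
    local host="$1" window="$2" crc
    # cksum prints "<crc> <byte count>"; keep only the CRC.
    crc="$(printf '%s' "${host}" | cksum | cut -d ' ' -f 1)"
    echo $(( crc % window ))
}

# Possible use inside the cron script, after the 'look' found work to do:
#   sleep "$(jitter_seconds "$(hostname)" 60)"
```

With a window of 60 the servers spread over the seconds of one minute; a larger window (say 300) would spread them over several minutes for the super-scale case.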

mkdir -p /root/bin
cp ${RESOURCE}/cron_1m.sh /root/bin/cron_1m.sh
chmod +x /root/bin/cron_1m.sh

cat <<EOS > /var/spool/cron/root
MAILTO=""

* * * * *  /root/bin/cron_1m.sh
EOS

More specific initialization

The above activities are done for any node. They all need to have heartbeats and some other common resources. But beyond that, it depends on the type of node and the type of stack what should be put on a particular node. This is done by simple ‘includes’ with the ‘nodeinfo’ that came from the configuration.

include ${DIR}part/`cat /root/nodeinfo/nodepart.txt`/init.sh
include ${DIR}stacktype/`cat /root/nodeinfo/stacktype.txt`/init.sh
include ${DIR}stacktype/`cat /root/nodeinfo/stacktype.txt`/part/`cat /root/nodeinfo/nodepart.txt`/init.sh

You can see the layout in the directory picture.

As of that picture, no ‘part’ or ‘stacktype’ exists. So a machine that is brought up is simply a heart-beating server, but a heart-beating server that can mutate on command every minute.

What are nodes doing every minute?

The next cool feature of ADD is that nodes do work based on the state of git repositories. For any given repository, they look for a ‘work.sh’ file within a ‘nodework’ directory for either all types of nodes (i.e. common again), or the specific type of node they are. So just like the other ‘include’ we get:

        bash_ifexist bin/nodework/common/work.sh
        bash_ifexist bin/nodework/part/`cat /root/nodeinfo/nodepart.txt`/work.sh
        bash_ifexist bin/nodework/stacktype/`cat /root/nodeinfo/stacktype.txt`/work.sh
        bash_ifexist bin/nodework/stacktype/`cat /root/nodeinfo/stacktype.txt`/part/`cat /root/nodeinfo/nodepart.txt`/work.sh

where the only change is that these are not ‘sourced’ but executed within a sub-shell, since they could otherwise do weird things to each other. This also enables them not to block each other (if desired).
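
The definition of ‘bash_ifexist’ is not shown in this post. A plausible sketch, modeled on the ‘include’ function from the bootstrap and on the "Skipped missing" lines in the cron log, might look like this (treat it as an assumption, not the actual ADD code):

```shell
#!/bin/bash
# Assumed definition: like 'include', but runs the script in a child bash
# process instead of sourcing it into the current shell.
bash_ifexist() {
    if [ -f "$1" ]; then
        # Variables set or directories changed inside the script cannot leak out.
        bash "$1"
    else
        echo "Skipped missing: $1"
    fi
}
```

Appending `&` to the `bash "$1"` line would be one way to keep the work scripts from blocking each other, at the cost of interleaved log output.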

All of these ‘work’ scripts should quickly determine if anything has changed and then release themselves. While ‘work’ is going on, the main cron script is locked out.

     if [[ -e ${CURRENT_ACTION_FILE} ]]; then
         : #Don't do anything until the current action completes
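
The rest of that ‘if’ is not shown, so here is a small self-contained sketch of the same lock-out idea. The wrapper name `run_unless_busy` and the way the marker file is written are assumptions for illustration; only the "skip while the marker file exists" behavior comes from the snippet above.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command unless an action marker file exists.
#   $1 = marker file, remaining arguments = the command to run
run_unless_busy() {
    local marker="$1" status
    shift
    if [ -e "${marker}" ]; then
        # Something is still running: report it and do nothing this minute.
        echo "Busy: $(cat "${marker}")"
        return 0
    fi
    # Record what is running so the next cron tick can report it.
    echo "$*" > "${marker}"
    "$@"
    status=$?
    rm -f "${marker}"
    return "${status}"
}
```

The check and the creation of the marker are not atomic, which is tolerable when the only competitor is the next cron run a minute later. If the command is killed midway the marker stays behind, so a real version would probably want a cleanup `trap` or a staleness check.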

work.sh

The main purpose of ‘work.sh’ is to detect changes. Any actual work will be in ‘work_ActualWork.sh’. Taking these in reverse order, the example ActualWork is simply:

echo "Doing the work for ${GIT_VERSION}"

So a one-liner appears in the log for the cron job just to prove the ‘ActualWork’ was done.

But ‘work.sh’ has to do a few things (very quickly) to detect if there are changes of relevance. It stores files in the ‘repo/.temp/add’ directory that keep track of state. The example ‘work.sh’ will detect changes to the repository based on a watched ‘path’. This allows multiple things to use the same git repository but look at different parts of it. By default they look at the root, but that can be changed. No matter what ‘path’ is watched, the version of the ‘work’ is always the version of the git repository, not of the path itself. In total, there are four ‘outer’ states possible:

The version of the work previously done is identical to the version of git now
The version of the work previously done is different from the version of git now, but the version of the watched path is the same
The version of the work previously done is different from the version of git now, and the watched path has changed
There is no work previously done (the first run of the work)

Of the above, only the last two should trigger work. You can branch differently based on the first run or subsequent runs, but generally it is best to be ‘idempotent’ with the work: you change the state of the server to a new state without caring what the previous state is/was.
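
The four states can be written down as a small decision function. This is a sketch for illustration only: the function names and state labels are made up, and the inputs are passed as plain arguments rather than read from the state files.

```shell
#!/bin/bash
# Hypothetical classifier for the four 'outer' states.
#   $1 = version of the work previously done (empty on the first run)
#   $2 = current git version
#   $3 = commits touching the watched path since $1 (empty if none)
classify_work_state() {
    local prev="$1" now="$2" path_commits="$3"
    if [ -z "${prev}" ]; then
        echo "first-run"
    elif [ "${prev}" = "${now}" ]; then
        echo "same-version"
    elif [ -z "${path_commits}" ]; then
        echo "path-unchanged"
    else
        echo "path-changed"
    fi
}

# Only a first run or a change on the watched path should trigger work.
should_work() {
    case "$(classify_work_state "$@")" in
        first-run|path-changed) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `should_work "" "$(git rev-parse HEAD)" ""` succeeds because there is no previous work, while passing the same version twice does not.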

The ‘inner’ state issue is that the server could already be doing ‘ActualWork’, so you have to wait until that is done.

The core of the work.sh script is:

export WORK_VERSION=$ADD_TEMP/work.sh_VERSION
export WORK_DOING_VERSION=$ADD_TEMP/work.sh_DOING_VERSION
export WORK_WATCH_PATH=$ADD_TEMP/work.sh_WATCH_PATH.txt

export WORK_WATCH_VALUE=`cat $WORK_WATCH_PATH`
export PREV_WORK_VERSION=`cat $WORK_VERSION`

#====================================================
#=== Now do comparison
#====================================================

export GIT_VERSION=`git rev-parse HEAD`

export DETECT_GIT_CHANGE=`git log --pretty=oneline ${PREV_WORK_VERSION}..  -- ${WORK_WATCH_VALUE} | awk '{print $1}'`

echo "Compared ${PREV_WORK_VERSION} to ${GIT_VERSION} for ${WORK_WATCH_VALUE} and got ${DETECT_GIT_CHANGE}"

mkdir -p ${ADD_TEMP}

if [[ -n "${DETECT_GIT_CHANGE}" ]] ;
then
    echo "Detected Change in Git Version!";

    if [[ -e ${WORK_DOING_VERSION} ]] ;
    then
       echo "Already doing `cat ${WORK_DOING_VERSION}`"
    else
       echo $GIT_VERSION > ${WORK_DOING_VERSION}
       source ${COMMON}/work_ActualWork.sh

       #Update the state.  This also does a clean startup on first run

       echo $GIT_VERSION > ${WORK_VERSION}
       echo $WORK_WATCH_VALUE > ${WORK_WATCH_PATH}

       rm -fr ${WORK_DOING_VERSION}
    fi
else
    echo "No change, move along";
fi

Speed!

How long does this detection take? Basically one second for it to figure out which of the variations it is in, plus the time of the ‘git pull’. With a ‘small’ server and a small change, this is a single second and basically no load:

Difference detected: start the work

cron_1m.sh: Start  20151002-013401
~/gitrepo/repo2_petulant-cyril ~

==> /root/log/cron_1m.sh_error.txt <==
From github.com:shaklee/repo2_petulant-cyril
   57f20ff..6f85e76  master     -> origin/master

==> /root/log/cron_1m.sh_log.txt <==
Updating 57f20ff..6f85e76
Fast-forward
 bin/nodework/common/work.sh | 2 ++
 1 file changed, 2 insertions(+)
Compared 57f20ff734c8836fa34f938bcc540a89bad9215c to 6f85e76a507ce599f42762ad7bf4ae639884ae12 for  and got 6f85e76a507ce599f42762ad7bf4ae639884ae12
Detected Change in Git Version!
Starting ActualWork at 20151002-013402
Doing the work for 6f85e76a507ce599f42762ad7bf4ae639884ae12

Difference detected (but already doing something)

cron_1m.sh: Start  20151002-012301
~/gitrepo/repo2_petulant-cyril ~
Already up-to-date.
Compared 7916053bb9f8bc3d952588a87a48da96dda7abe6 to 57f20ff734c8836fa34f938bcc540a89bad9215c for  and got 57f20ff734c8836fa34f938bcc540a89bad9215c
Detected Change in Git Version!
Already doing 57f20ff734c8836fa34f938bcc540a89bad9215c
Skipped missing: bin/nodework/part/controlnode/work.sh
Skipped missing: bin/nodework/stacktype/ControlServer1/work.sh
Skipped missing: bin/nodework/stacktype/ControlServer1/part/controlnode/work.sh
~
cron_1m.sh: Finish 20151002-012302

No Difference

cron_1m.sh: Start  20151002-012801
~/gitrepo/repo2_petulant-cyril ~
Already up-to-date.
Compared 57f20ff734c8836fa34f938bcc540a89bad9215c to 57f20ff734c8836fa34f938bcc540a89bad9215c for  and got
No change, move along
Skipped missing: bin/nodework/part/controlnode/work.sh
Skipped missing: bin/nodework/stacktype/ControlServer1/work.sh
Skipped missing: bin/nodework/stacktype/ControlServer1/part/controlnode/work.sh
~
cron_1m.sh: Finish 20151002-012802
