Introduction

Redundancy and high availability are necessary for a very wide variety of server activities. Having a single point of failure in terms of data storage is a very dangerous configuration for any critical data.

While many databases and other software allow you to spread data out in the context of a single application, other systems can operate on the filesystem level to ensure that data is copied to another location whenever it is written to disk. A clustered storage solution like GlusterFS provides this exact functionality.

In this guide, we will be setting up a redundant GlusterFS cluster between two 64-bit Ubuntu 12.04 VPS instances. This will act similarly to a NAS server with mirrored RAID. We will then access the cluster from a third 64-bit Ubuntu 12.04 VPS.

General Concepts

A clustered environment allows you to pool resources (generally either computing or storage) so that you can treat various computers as a single, more powerful unit. With GlusterFS, we are able to pool the storage of various VPS instances and access it as if it were a single server.

GlusterFS allows you to create different kinds of storage configurations, many of which are functionally similar to RAID levels. For instance, you can stripe data across different nodes in the cluster, or you can implement redundancy for better data availability.

In this guide, we will be creating a redundant clustered storage array, also known as a distributed file system. Basically, this will allow us to have similar functionality to a mirrored RAID configuration over the network. Each independent server will contain its own copy of the data, allowing our applications to access either copy, which will help distribute our read load.

Steps to Take on Each VPS

There are some steps that we will be taking on each VPS instance that we are using for this guide. We will need to configure DNS resolution between the hosts and set up the software sources that we will be using to install the GlusterFS packages.

Configure DNS Resolution

In order for our different components to be able to communicate with each other easily, it is best to set up some kind of hostname resolution between each computer.

If you have a domain name that you would like to configure to point at each system, you can follow this guide to set up domain names with DigitalOcean.

If you do not have a spare domain name, or if you just want to set up something quickly and easily, you can instead edit the hosts file on each computer.

Open this file with root privileges on your first computer:

sudo nano /etc/hosts

You should see something that looks like this:

127.0.0.1       localhost gluster2

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Below the local host definition, you should add each VPS’s IP address followed by the long and short names you wish to use to reference it.

It should look something like this when you are finished:

127.0.0.1       localhost hostname
first_ip gluster0.droplet.com gluster0
second_ip gluster1.droplet.com gluster1
third_ip gluster2.droplet.com gluster2

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

The gluster0.droplet.com and gluster0 portions of the lines can be changed to whatever name you would like to use to access each droplet. We will be using these settings for this guide.

When you are finished, copy the lines you added and add them to the /etc/hosts files on your other VPS instances. Each /etc/hosts file should contain the lines that link your IPs to the names you’ve selected.

Save and close each file when you are finished.
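
Before moving on, it can help to confirm that every name made it into each hosts file. The following is a minimal sketch of such a check, not part of the original procedure. The function name missing_hosts is made up for illustration, and the check only looks for each name as a whole field after the address column on non-comment lines.

```shell
# Report which of the expected names are missing from a hosts file.
# Usage: missing_hosts FILE NAME...
# (missing_hosts is a hypothetical helper, not a standard tool.)
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Skip comment lines, then look for the name as a whole field
        # in any column after the address.
        if ! awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit found ? 0 : 1 }
        ' "$hosts_file"; then
            echo "$name"
        fi
    done
}
```

Running missing_hosts /etc/hosts gluster0 gluster1 gluster2 on each VPS should print nothing if all three short names are present. Any name it prints still needs to be added.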

Set Up Software Sources

Although Ubuntu 12.04 contains GlusterFS packages, they are fairly out-of-date, so we will be using the latest stable version as of the time of this writing (version 3.4) from the GlusterFS project.

We will be setting up the software sources on all of the computers that will function as nodes within our cluster, as well as on the client computer.

We will actually be adding a PPA (personal package archive) that the project recommends for Ubuntu users. This will allow us to manage our packages with the same tools as other system software.

First, we need to install the python-software-properties package, which will allow us to manage PPAs easily with apt:

sudo apt-get update
sudo apt-get install python-software-properties

Once the PPA tools are installed, we can add the PPA for the GlusterFS packages by typing:

sudo add-apt-repository ppa:semiosis/ubuntu-glusterfs-3.4

With the PPA added, we need to refresh our local package database so that our system knows about the new packages available from the PPA:

sudo apt-get update

Repeat these steps on all of the VPS instances that you are using for this guide.

Install Server Components

In this guide, we will be designating two of our machines as cluster members and the third as a client.

We will be configuring the computers we labeled as gluster0 and gluster1 as the cluster components. We will use gluster2 as the client.

On our cluster member machines (gluster0 and gluster1), we can install the GlusterFS server package by typing:

sudo apt-get install glusterfs-server

Once this is installed on both nodes, we can begin to set up our storage volume.

On one of the hosts, we need to peer with the second host. It doesn’t matter which server you use, but we will be performing these commands from our gluster0 server for simplicity:

sudo gluster peer probe gluster1.droplet.com
peer probe: success

This means that the peering was successful. We can check that the nodes are communicating at any time by typing:

sudo gluster peer status
Number of Peers: 1

Hostname: gluster1.droplet.com
Port: 24007
Uuid: 7bcba506-3a7a-4c5e-94fa-1aaf83f5729b
State: Peer in Cluster (Connected)

At this point, our two servers are communicating and they can set up storage volumes together.
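
If you want to check the peer state from a script rather than by eye, the output shown above can be parsed. This is a rough sketch that assumes the output format stays as printed in this guide; peers_all_connected is a hypothetical helper name.

```shell
# Succeeds only if every listed peer is "Peer in Cluster (Connected)".
# Reads `gluster peer status` output on standard input.
# (peers_all_connected is a hypothetical helper, not a gluster command.)
peers_all_connected() {
    awk '
        /^Number of Peers:/ { expected = $4 }
        /^State:/ {
            seen++
            if ($0 !~ /Peer in Cluster \(Connected\)/) bad++
        }
        # Require at least one peer, one State line per peer, and no bad states.
        END { exit (seen > 0 && seen == expected && bad == 0) ? 0 : 1 }
    '
}
```

On a storage node, this could be used as: sudo gluster peer status | peers_all_connected && echo "all peers connected"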

Create a Storage Volume

Now that we have our pool of servers available, we can make our first volume.

Because we are interested in redundancy, we will set up a volume that has replica functionality. This will allow us to keep multiple copies of our data, saving us from a single point of failure.

Since we want one copy of data on each of our servers, we will set the replica option to “2”, which is the number of servers we have. The general syntax we will be using to create the volume is this:

sudo gluster volume create volume_name replica num_of_servers transport tcp domain1.com:/path/to/data/directory domain2.com:/path/to/data/directory ... force

The exact command we will run is this:

sudo gluster volume create volume1 replica 2 transport tcp gluster0.droplet.com:/gluster-storage gluster1.droplet.com:/gluster-storage force
volume create: volume1: success: please start the volume to access data

This will create a volume called volume1. It will store the data from this volume in directories on each host at /gluster-storage. If this directory does not exist, it will be created.
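
To make the relationship between the general syntax and the exact command concrete, here is a small sketch that assembles the command string from a volume name, a brick path, and a list of hosts. It only prints the command and does not run it. The helper name replica_create_cmd is an assumption for illustration, and it assumes one brick per server with the replica count equal to the number of servers, as in this guide.

```shell
# Print (without running) the volume-create command for a replicated volume
# with one brick per server.
# Usage: replica_create_cmd VOLUME BRICK_PATH HOST...
# (replica_create_cmd is a hypothetical helper, not a gluster command.)
replica_create_cmd() {
    volume=$1
    brick_path=$2
    shift 2
    # After the shift, $# is the number of hosts, which is the replica count.
    cmd="gluster volume create ${volume} replica $# transport tcp"
    for host in "$@"; do
        cmd="${cmd} ${host}:${brick_path}"
    done
    echo "${cmd} force"
}
```

Calling replica_create_cmd volume1 /gluster-storage gluster0.droplet.com gluster1.droplet.com prints the same command we ran above, minus the leading sudo.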

At this point, our volume is created, but inactive. We can start the volume and make it available for use by typing:

sudo gluster volume start volume1
volume start: volume1: success

Our volume should now be online.

Install and Configure the Client Components

Now that we have our volume configured, it is available for use by our client machine.

Before we begin though, we need to actually install the relevant packages from the PPA we set up earlier.

On your client machine (gluster2 in this example), type:

sudo apt-get install glusterfs-client

This will install the client application, along with the FUSE filesystem tools needed to provide filesystem functionality outside of the kernel.

We are going to mount our remote storage volume on our client computer. In order to do that, we need to create a mount point. Traditionally, this is in the /mnt directory, but anywhere convenient can be used.

We will create a directory at /storage-pool:

sudo mkdir /storage-pool

With that step out of the way, we can mount the remote volume. To do this, we just need to use the following syntax:

sudo mount -t glusterfs domain1.com:/volume_name path_to_mount_point

Notice that we are using the volume name in the mount command. GlusterFS abstracts the actual storage directories on each host. We are not looking to mount the /gluster-storage directory, but the volume1 volume.

Also notice that we only have to specify one member of the storage cluster.

The actual command that we are going to run is this:

sudo mount -t glusterfs gluster0.droplet.com:/volume1 /storage-pool

This should mount our volume. If you use the df command, you will see that the GlusterFS volume is mounted at the correct location.
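
As an alternative to reading the df output, the mount table can be checked directly. The sketch below assumes the client uses the FUSE-based mount from this guide, which appears in /proc/mounts with the filesystem type fuse.glusterfs. The function name is_gluster_mount is made up for illustration.

```shell
# Succeeds if MOUNT_POINT appears as a GlusterFS FUSE mount in a mounts table.
# Usage: is_gluster_mount MOUNT_POINT [MOUNTS_FILE]
# MOUNTS_FILE defaults to /proc/mounts, where field 2 is the mount point
# and field 3 is the filesystem type.
# (is_gluster_mount is a hypothetical helper, not a standard tool.)
is_gluster_mount() {
    awk -v mp="$1" '
        $2 == mp && $3 == "fuse.glusterfs" { found = 1 }
        END { exit found ? 0 : 1 }
    ' "${2:-/proc/mounts}"
}
```

On the client, this could be used as: is_gluster_mount /storage-pool && echo "volume is mounted"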

Testing the Redundancy Features

Now that we have set up our client to use our pool of storage, let’s test the functionality.

On our client machine (gluster2), we can type this to add some files into our storage-pool directory:

cd /storage-pool
sudo touch file{1..20}

This will create 20 files in our storage pool.

If we look at our /gluster-storage directories on each storage host, we will see that all of these files are present on each system:

# on gluster0.droplet.com and gluster1.droplet.com
cd /gluster-storage
ls
file1  file10  file11  file12  file13  file14  file15  file16  file17  file18  file19  file2  file20  file3  file4  file5  file6  file7  file8  file9

As you can see, this has written the data from our client to both of our nodes.
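
With more than a handful of files, comparing the two listings by eye gets tedious. One possible approach, sketched here under the assumption that you can copy a saved listing from one node to the other, is to write each brick's listing to a file and compare the files. Both function names are hypothetical.

```shell
# Print the visible entries of a brick directory, one per line, in a stable
# order. `ls` without -a skips hidden entries such as GlusterFS's own
# .glusterfs metadata directory.
# (brick_listing and same_listing are hypothetical helpers.)
brick_listing() {
    LC_ALL=C ls -1 "$1"
}

# Succeeds if two saved listings are byte-for-byte identical.
same_listing() {
    cmp -s "$1" "$2"
}
```

For example, run brick_listing /gluster-storage > gluster0.list on the first node and the equivalent on the second, copy one file across, and then run same_listing gluster0.list gluster1.list. This only compares file names, not file contents.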

If one of the nodes in your storage cluster goes down and changes are made to the filesystem in the meantime, performing a read operation on the client mount point after the node comes back online should prompt it to fetch any missing files:

ls /storage-pool

Restrict Access to the Volume

Now that we have verified that our storage pool can be mounted and replicate data to both of the machines in the cluster, we should lock down our pool.

Currently, any computer can connect to our storage volume without any restrictions. We can change this by setting an option on our volume.

On one of your storage nodes, type:

sudo gluster volume set volume1 auth.allow gluster_client_IP_addr

You will have to substitute the IP address of your cluster client (gluster2) in this command. Currently, at least with /etc/hosts configuration, domain name restrictions do not work correctly. If you set a restriction this way, it will block all traffic. You must use IP addresses instead.

If you need to remove the restriction at any point, you can type:

sudo gluster volume set volume1 auth.allow '*'

This will allow connections from any machine again. This is insecure, but may be useful for debugging issues.

If you have multiple clients, you can specify their IP addresses at the same time, separated by commas:

sudo gluster volume set volume1 auth.allow gluster_client1_ip,gluster_client2_ip
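
Because hostnames do not work here, it can be worth rejecting them before they reach the volume option. The following sketch builds the comma-separated list and prints the resulting command without running it. The helper name auth_allow_cmd is an assumption, the pattern only checks for a dotted-quad shape (it does not validate octet ranges), and the 203.0.113.x addresses in the usage note are documentation placeholders.

```shell
# Print (without running) an auth.allow command for one or more client IPs.
# Usage: auth_allow_cmd VOLUME IP...
# (auth_allow_cmd is a hypothetical helper, not a gluster command.)
auth_allow_cmd() {
    volume=$1
    shift
    [ "$#" -gt 0 ] || { echo "no client addresses given" >&2; return 1; }
    list=
    for ip in "$@"; do
        # Accept only dotted-quad shapes; hostnames are rejected.
        if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
            echo "not an IPv4 address: $ip" >&2
            return 1
        fi
        # Append with a comma separator, except before the first entry.
        list="${list:+$list,}$ip"
    done
    echo "gluster volume set $volume auth.allow $list"
}
```

For example, auth_allow_cmd volume1 203.0.113.20 203.0.113.21 prints a command with both addresses joined by a comma, while auth_allow_cmd volume1 gluster2 fails with an error message.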

Getting Info with GlusterFS Commands

When you begin changing some of the settings for your GlusterFS storage, you might get confused about what options you have available, which volumes are live, and which nodes are associated with each volume.

There are a number of different commands that are available on your nodes to retrieve this data and interact with your storage pool.

If you want information about each of your volumes, type:

sudo gluster volume info
Volume Name: volume1
Type: Replicate
Volume ID: 3634df4a-90cd-4ef8-9179-3bfa43cca867
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster0.droplet.com:/gluster-storage
Brick2: gluster1.droplet.com:/gluster-storage
Options Reconfigured:
auth.allow: 111.111.1.11
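
If you need individual values from this output in a script, simple text filters are usually enough. The sketch below assumes the "Key: value" layout shown above; volume_field and volume_bricks are hypothetical helper names.

```shell
# Print the value of one "Key: value" line from `gluster volume info` output
# read on standard input. Usage: volume_field KEY
# (volume_field and volume_bricks are hypothetical helpers.)
volume_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}

# Print the brick definitions (host:/path), one per line. The pattern needs
# a digit after "Brick", so the "Bricks:" heading line is skipped.
volume_bricks() {
    sed -n 's/^Brick[0-9][0-9]*: //p'
}
```

For example, sudo gluster volume info volume1 | volume_field Status should print Started for the volume in this guide, and piping the same output to volume_bricks should list the two /gluster-storage bricks.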

Similarly, to get information about the peers that this node is connected to, you can type:

sudo gluster peer status
Number of Peers: 1

Hostname: gluster0.droplet.com
Port: 24007
Uuid: 6f30f38e-b47d-4df1-b106-f33dfd18b265
State: Peer in Cluster (Connected)

If you want detailed information about how each node is performing, you can profile a volume by typing:

sudo gluster volume profile volume_name start

When this command is complete, you can obtain the information that was gathered by typing:

sudo gluster volume profile volume_name info
Brick: gluster1.droplet.com:/gluster-storage
--------------------------------------------
Cumulative Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us             20     RELEASE
      0.00       0.00 us       0.00 us       0.00 us              6  RELEASEDIR
     10.80     113.00 us     113.00 us     113.00 us              1    GETXATTR
     28.68     150.00 us     139.00 us     161.00 us              2      STATFS
     60.52     158.25 us     117.00 us     226.00 us              4      LOOKUP
 
    Duration: 8629 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
. . .

You will receive a lot of information about each node with this command.

For a list of all of the GlusterFS associated components running on each of your nodes, you can type:

sudo gluster volume status
Status of volume: volume1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick gluster0.droplet.com:/gluster-storage             49152   Y       2808
Brick gluster1.droplet.com:/gluster-storage             49152   Y       2741
NFS Server on localhost                                 2049    Y       3271
Self-heal Daemon on localhost                           N/A     Y       2758
NFS Server on gluster0.droplet.com                      2049    Y       3211
Self-heal Daemon on gluster0.droplet.com                N/A     Y       2825

There are no active volume tasks
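
The Online column is the part of this output you will usually care about. As a rough sketch, assuming the three trailing columns stay Port, Online, and Pid as shown above, the following filter prints the name of every process that is not online. The helper name offline_processes is made up for illustration.

```shell
# Print the name of every process whose Online column is "N".
# Reads `gluster volume status` output on standard input.
# (offline_processes is a hypothetical helper, not a gluster command.)
offline_processes() {
    awk 'NF >= 4 && $(NF-1) == "N" {
        # The last three fields are Port, Online, and Pid;
        # everything before them is the process name.
        name = $1
        for (i = 2; i <= NF - 3; i++) name = name " " $i
        print name
    }'
}
```

For example, sudo gluster volume status volume1 | offline_processes prints nothing when every brick and daemon is up, as in the output above.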

If you are going to be administering your GlusterFS storage volumes, it may be a good idea to drop into the GlusterFS console. This will allow you to interact with your GlusterFS environment without needing to type sudo gluster before everything:

sudo gluster

This will give you a prompt where you can type your commands. This is a good one to get yourself oriented:

help

When you are finished, exit like this:

exit

Conclusion

At this point, you should have a redundant storage system that allows you to write to two separate servers simultaneously. This can be useful for a great number of applications and can ensure that your data is available even when one server goes down.