Introduction
Redundancy and high availability matter for a wide variety of server workloads. A single point of failure in data storage is a dangerous configuration for any critical data.
While many databases and other software allow you to spread data out in the context of a single application, other systems can operate on the filesystem level to ensure that data is copied to another location whenever it is written to disk. A clustered storage solution like GlusterFS provides this exact functionality.
In this guide, we will be setting up a redundant GlusterFS cluster between two 64-bit Ubuntu 12.04 VPS instances. This will act similarly to a NAS server with mirrored RAID. We will then access the cluster from a third 64-bit Ubuntu 12.04 VPS.
General Concepts
A clustered environment allows you to pool resources (generally either computing or storage) so that you can treat several computers as a single, more powerful unit. With GlusterFS, we are able to pool the storage of various VPS instances and access it as if it were a single server.
GlusterFS allows you to create different kinds of storage configurations, many of which are functionally similar to RAID levels. For instance, you can stripe data across different nodes in the cluster, or you can implement redundancy for better data availability.
In this guide, we will be creating a redundant clustered storage array, also known as a distributed file system. This gives us functionality similar to a mirrored RAID configuration, but over the network. Each independent server will contain its own copy of the data, allowing our applications to access either copy, which will help distribute our read load.
Steps to Take on Each VPS
There are some steps that we will be taking on each VPS instance that we are using for this guide. We will need to configure DNS resolution between the hosts and set up the software sources that we will be using to install the GlusterFS packages.
Configure DNS Resolution
In order for our different components to be able to communicate with each other easily, it is best to set up some kind of hostname resolution between each computer.
If you have a domain name that you would like to configure to point at each system, you can follow this guide to set up domain names with DigitalOcean.
If you do not have a spare domain name, or if you just want to set up something quickly and easily, you can instead edit the hosts file on each computer.
Open this file with root privileges on your first computer:
sudo nano /etc/hosts
You should see something that looks like this:
127.0.0.1 localhost gluster2
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Below the local host definition, you should add each VPS’s IP address followed by the long and short names you wish to use to reference it.
It should look something like this when you are finished:
127.0.0.1 localhost hostname
first_ip gluster0.droplet.com gluster0
second_ip gluster1.droplet.com gluster1
third_ip gluster2.droplet.com gluster2
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
The gluster0.droplet.com and gluster0 portions of the lines can be changed to whatever name you would like to use to access each droplet. We will be using these settings for this guide.
When you are finished, copy the lines you added and add them to the /etc/hosts files on your other VPS instances. Each /etc/hosts file should contain the lines that link your IPs to the names you’ve selected.
Save and close each file when you are finished.
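Because the same entries have to be added on every VPS, this step can also be scripted. The following is a minimal sketch rather than part of the procedure above: add_host_entry is a hypothetical helper name, the 203.0.113.x addresses are documentation placeholders, and the hosts file path is passed as an argument so that the function can be tried on a scratch copy first.

```shell
#!/bin/sh
# Append "IP FQDN SHORTNAME" to a hosts-style file unless the FQDN is
# already listed as a name on a non-comment line.
# add_host_entry is a hypothetical helper, not part of GlusterFS or Ubuntu.
add_host_entry() {
    hosts_file=$1; ip=$2; fqdn=$3; short=$4
    # Exit status 0 from awk means the name is already present.
    if awk -v name="$fqdn" '
            $1 ~ /^#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
            END { exit !found }
        ' "$hosts_file" 2>/dev/null; then
        return 0
    fi
    printf '%s %s %s\n' "$ip" "$fqdn" "$short" >> "$hosts_file"
}

# Example against a scratch copy (placeholder addresses, replace with your own):
#   cp /etc/hosts /tmp/hosts.test
#   add_host_entry /tmp/hosts.test 203.0.113.10 gluster0.droplet.com gluster0
#   add_host_entry /tmp/hosts.test 203.0.113.11 gluster1.droplet.com gluster1
#   add_host_entry /tmp/hosts.test 203.0.113.12 gluster2.droplet.com gluster2
# Run as root when the target is /etc/hosts itself.
```

Checking for an existing entry before appending makes the function safe to run more than once on the same machine.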
Set Up Software Sources
Although Ubuntu 12.04 contains GlusterFS packages, they are fairly out-of-date, so we will be using the latest stable version as of the time of this writing (version 3.4) from the GlusterFS project.
We will be setting up the software sources on all of the computers that will function as nodes within our cluster, as well as on the client computer.
We will actually be adding a PPA (personal package archive) that the project recommends for Ubuntu users. This will allow us to manage our packages with the same tools as other system software.
First, we need to install the python-software-properties package, which will allow us to manage PPAs easily with apt:
sudo apt-get update
sudo apt-get install python-software-properties
Once the PPA tools are installed, we can add the PPA for the GlusterFS packages by typing:
sudo add-apt-repository ppa:semiosis/ubuntu-glusterfs-3.4
With the PPA added, we need to refresh our local package database so that our system knows about the new packages available from the PPA:
sudo apt-get update
Repeat these steps on all of the VPS instances that you are using for this guide.
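To confirm that apt will now pick the PPA build instead of the older package from the Ubuntu archive, we can look at the candidate version that apt-cache policy reports. This is a small sketch, assuming the usual "Candidate:" line in that command's output; candidate_version is a hypothetical helper name.

```shell
#!/bin/sh
# Print the version apt would install, read from `apt-cache policy` output
# on stdin. candidate_version is a hypothetical helper name.
candidate_version() {
    awk '$1 == "Candidate:" { print $2 }'
}

# Example, run on any VPS after adding the PPA and refreshing the package list:
#   apt-cache policy glusterfs-server | candidate_version   # on gluster0, gluster1
#   apt-cache policy glusterfs-client | candidate_version   # on gluster2
```

A candidate version beginning with 3.4 suggests that the PPA is being used.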
Install Server Components
In this guide, we will be designating two of our machines as cluster members and the third as a client.
We will be configuring the computers we labeled as gluster0 and gluster1 as the cluster components. We will use gluster2 as the client.
On our cluster member machines (gluster0 and gluster1), we can install the GlusterFS server package by typing:
sudo apt-get install glusterfs-server
Once this is installed on both nodes, we can begin to set up our storage volume.
On one of the hosts, we need to peer with the second host. It doesn’t matter which server you use, but we will be performing these commands from our gluster0 server for simplicity:
sudo gluster peer probe gluster1.droplet.com
peer probe: success
This means that the peering was successful. We can check that the nodes are communicating at any time by typing:
sudo gluster peer status
Number of Peers: 1
Hostname: gluster1.droplet.com
Port: 24007
Uuid: 7bcba506-3a7a-4c5e-94fa-1aaf83f5729b
State: Peer in Cluster (Connected)
At this point, our two servers are communicating and they can set up storage volumes together.
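If you want to check the peering from a script instead of reading the output by eye, the status text can be piped through a small filter. This is a sketch that assumes the output layout shown above ("Number of Peers:" followed by one "State:" line per peer); all_peers_connected is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed only if `gluster peer status` output (on stdin) reports at least
# one peer and every peer's State line ends in "(Connected)".
# all_peers_connected is a hypothetical helper name.
all_peers_connected() {
    awk '
        /^Number of Peers:/ { expected = $NF + 0 }
        /^State:/ {
            seen++
            if ($0 ~ /\(Connected\)/) ok++
        }
        END { exit !(expected > 0 && seen == expected && ok == seen) }
    '
}

# Example, run on one of the storage nodes:
#   sudo gluster peer status | all_peers_connected && echo "all peers connected"
```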
Create a Storage Volume
Now that we have our pool of servers available, we can make our first volume.
Because we are interested in redundancy, we will set up a volume that has replica functionality. This will allow us to keep multiple copies of our data, saving us from a single point of failure.
Since we want one copy of data on each of our servers, we will set the replica option to “2”, which is the number of servers we have. The general syntax we will be using to create the volume is this:
sudo gluster volume create volume_name replica num_of_servers transport tcp domain1.com:/path/to/data/directory domain2.com:/path/to/data/directory ... force
The exact command we will run is this:
sudo gluster volume create volume1 replica 2 transport tcp gluster0.droplet.com:/gluster-storage gluster1.droplet.com:/gluster-storage force
volume create: volume1: success: please start the volume to access data
This will create a volume called volume1. It will store the data from this volume in directories on each host at /gluster-storage. If this directory does not exist, it will be created.
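To make the relationship between the general syntax and the exact command concrete, the sketch below assembles the command string from a volume name, a brick path, and a list of hosts, using the number of hosts as the replica count as we did above. build_volume_create is a hypothetical helper name, and the function only prints the command so that it can be reviewed before being run.

```shell
#!/bin/sh
# Print a `gluster volume create` command with one brick per host and a
# replica count equal to the number of hosts (one copy per server).
# build_volume_create is a hypothetical helper; it does not run gluster.
build_volume_create() (
    volume=$1
    brick_path=$2
    shift 2
    replica=$#
    if [ "$replica" -lt 2 ]; then
        echo "need at least two hosts for a replicated volume" >&2
        exit 1
    fi
    bricks=
    for host in "$@"; do
        bricks="$bricks $host:$brick_path"
    done
    # "force" is kept because the command in this guide uses it.
    printf 'gluster volume create %s replica %s transport tcp%s force\n' \
        "$volume" "$replica" "$bricks"
)

# Example:
#   build_volume_create volume1 /gluster-storage \
#       gluster0.droplet.com gluster1.droplet.com
```

With the two hosts from this guide, the printed line matches the command we ran above, minus the leading sudo.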
At this point, our volume is created, but inactive. We can start the volume and make it available for use by typing:
sudo gluster volume start volume1
volume start: volume1: success
Our volume should now be online.
Install and Configure the Client Components
Now that we have our volume configured, it is available for use by our client machine.
Before we begin though, we need to actually install the relevant packages from the PPA we set up earlier.
On your client machine (gluster2 in this example), type:
sudo apt-get install glusterfs-client
This will install the client application, along with the fuse filesystem tools needed to provide filesystem functionality outside of the kernel.
We are going to mount our remote storage volume on our client computer. In order to do that, we need to create a mount point. Traditionally, this is in the /mnt directory, but anywhere convenient can be used.
We will create a directory at /storage-pool:
sudo mkdir /storage-pool
With that step out of the way, we can mount the remote volume. To do this, we just need to use the following syntax:
sudo mount -t glusterfs domain1.com:volume_name path_to_mount_point
Notice that we are using the volume name in the mount command. GlusterFS abstracts the actual storage directories on each host. We are not looking to mount the /gluster-storage directory, but the volume1 volume.
Also notice that we only have to specify one member of the storage cluster.
The actual command that we are going to run is this:
sudo mount -t glusterfs gluster0.droplet.com:/volume1 /storage-pool
This should mount our volume. If we use the df command, we will see that our GlusterFS volume is mounted at the correct location.
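The same check can be done from a script by reading the kernel's mount table. This is a sketch under one stated assumption: that the native client's mount is reported with the filesystem type fuse.glusterfs, as FUSE-based GlusterFS mounts usually are. is_gluster_mount is a hypothetical helper name, and the optional second argument exists only so the function can be pointed at a test file.

```shell
#!/bin/sh
# Succeed if the given mount point appears in the mount table with the
# filesystem type fuse.glusterfs (assumed type for the native client).
# is_gluster_mount is a hypothetical helper name.
is_gluster_mount() {
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    awk -v mp="$mount_point" '
        $2 == mp && $3 == "fuse.glusterfs" { found = 1 }
        END { exit !found }
    ' "$mounts_file"
}

# Example, run on the client (gluster2):
#   is_gluster_mount /storage-pool && echo "volume is mounted"
```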
Testing the Redundancy Features
Now that we have set up our client to use our pool of storage, let’s test the functionality.
On our client machine (gluster2), we can type this to add some files into our storage-pool directory:
cd /storage-pool
sudo touch file{1..20}
This will create 20 files in our storage pool.
If we look at our /gluster-storage directories on each storage host, we will see that all of these files are present on each system:
# on gluster0.droplet.com and gluster1.droplet.com
cd /gluster-storage
ls
file1 file10 file11 file12 file13 file14 file15 file16 file17 file18 file19 file2 file20 file3 file4 file5 file6 file7 file8 file9
As you can see, this has written the data from our client to both of our nodes.
If one of the nodes in your storage cluster goes down while changes are made to the filesystem, performing a read operation on the client mount point after the node comes back online should prompt it to fetch any missing files:
ls /storage-pool
Restrict Access to the Volume
Now that we have verified that our storage pool can be mounted and replicate data to both of the machines in the cluster, we should lock down our pool.
Currently, any computer can connect to our storage volume without any restrictions. We can change this by setting an option on our volume.
On one of your storage nodes, type:
sudo gluster volume set volume1 auth.allow gluster_client_IP_addr
You will have to substitute the IP address of your cluster client (gluster2) in this command. Currently, at least with /etc/hosts configuration, domain name restrictions do not work correctly. If you set a restriction this way, it will block all traffic. You must use IP addresses instead.
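If you need to remove the restriction at any point, you can type the following. The asterisk is quoted so that the shell passes it to gluster literally instead of expanding it to the file names in the current directory:
sudo gluster volume set volume1 auth.allow '*'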
This will allow connections from any machine again. This is insecure, but may be useful for debugging issues.
If you have multiple clients, you can specify their IP addresses at the same time, separated by commas:
sudo gluster volume set volume1 auth.allow gluster_client1_ip,gluster_client2_ip
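Since a hostname in this list would block all traffic, it can be worth building the comma-separated value with a small guard. The sketch below is only a loose check (it accepts anything made of digits and dots, so it will not catch every malformed address); build_auth_allow is a hypothetical helper name and the addresses in the example are documentation placeholders.

```shell
#!/bin/sh
# Join the given addresses with commas for use as an auth.allow value.
# Rejects any argument containing characters other than digits and dots,
# which catches hostnames but is not a full IPv4 validation.
# build_auth_allow is a hypothetical helper name.
build_auth_allow() (
    list=
    for addr in "$@"; do
        case $addr in
            ""|*[!0-9.]*)
                echo "not an IPv4 address: $addr" >&2
                exit 1
                ;;
        esac
        list=${list:+$list,}$addr
    done
    printf '%s\n' "$list"
)

# Example, run on one of the storage nodes (placeholder addresses):
#   allow=$(build_auth_allow 203.0.113.12 203.0.113.13) &&
#       sudo gluster volume set volume1 auth.allow "$allow"
```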
Getting Info with GlusterFS Commands
When you begin changing some of the settings for your GlusterFS storage, you might get confused about what options you have available, which volumes are live, and which nodes are associated with each volume.
There are a number of different commands that are available on your nodes to retrieve this data and interact with your storage pool.
If you want information about each of your volumes, type:
sudo gluster volume info
Volume Name: volume1
Type: Replicate
Volume ID: 3634df4a-90cd-4ef8-9179-3bfa43cca867
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster0.droplet.com:/gluster-storage
Brick2: gluster1.droplet.com:/gluster-storage
Options Reconfigured:
auth.allow: 111.111.1.11
Similarly, to get information about the peers that this node is connected to, you can type:
sudo gluster peer status
Number of Peers: 1
Hostname: gluster0.droplet.com
Port: 24007
Uuid: 6f30f38e-b47d-4df1-b106-f33dfd18b265
State: Peer in Cluster (Connected)
If you want detailed information about how each node is performing, you can profile a volume by typing:
sudo gluster volume profile volume_name start
When this command is complete, you can obtain the information that was gathered by typing:
sudo gluster volume profile volume_name info
Brick: gluster1.droplet.com:/gluster-storage
--------------------------------------------
Cumulative Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us             20     RELEASE
      0.00       0.00 us       0.00 us       0.00 us              6  RELEASEDIR
     10.80     113.00 us     113.00 us     113.00 us              1    GETXATTR
     28.68     150.00 us     139.00 us     161.00 us              2      STATFS
     60.52     158.25 us     117.00 us     226.00 us              4      LOOKUP
    Duration: 8629 seconds
   Data Read: 0 bytes
Data Written: 0 bytes
. . .
You will receive a lot of information about each node with this command.
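Because the report is long, a filter can help pick out the operation that accounts for the most latency. This sketch assumes data rows shaped like the sample above (nine whitespace-separated fields, with "us" after each latency value); busiest_fop is a hypothetical helper name, and it reports a single overall maximum even when the input covers several bricks.

```shell
#!/bin/sh
# Print the file operation (Fop) with the highest %-latency value found in
# `gluster volume profile ... info` output on stdin.
# Row shape assumed: %-latency, avg "us", min "us", max "us", calls, Fop.
# busiest_fop is a hypothetical helper name.
busiest_fop() {
    awk '
        NF == 9 && $3 == "us" && $5 == "us" && $7 == "us" {
            if ($1 + 0 >= max) { max = $1 + 0; fop = $9 }
        }
        END { if (fop != "") print fop }
    '
}

# Example, run on one of the storage nodes:
#   sudo gluster volume profile volume1 info | busiest_fop
```

Profiling adds some overhead, so once you have the numbers you need you can turn it off again with sudo gluster volume profile volume_name stop.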
For a list of all of the GlusterFS associated components running on each of your nodes, you can type:
sudo gluster volume status
Status of volume: volume1
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick gluster0.droplet.com:/gluster-storage     49152   Y       2808
Brick gluster1.droplet.com:/gluster-storage     49152   Y       2741
NFS Server on localhost                         2049    Y       3271
Self-heal Daemon on localhost                   N/A     Y       2758
NFS Server on gluster0.droplet.com              2049    Y       3211
Self-heal Daemon on gluster0.droplet.com        N/A     Y       2825
There are no active volume tasks
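For monitoring, the Online column is usually the part of this output that matters. The sketch below lists any brick whose Online value is not "Y". It assumes the column layout shown above, where the Online flag is the second-to-last field on each "Brick" line; offline_bricks is a hypothetical helper name.

```shell
#!/bin/sh
# Print the host:path of every brick that is not marked online ("Y") in
# `gluster volume status` output read from stdin. Prints nothing when all
# bricks are online. offline_bricks is a hypothetical helper name.
offline_bricks() {
    awk '$1 == "Brick" && $(NF - 1) != "Y" { print $2 }'
}

# Example, run on one of the storage nodes:
#   sudo gluster volume status volume1 | offline_bricks
```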
If you are going to be administering your GlusterFS storage volumes, it may be a good idea to drop into the GlusterFS console. This will allow you to interact with your GlusterFS environment without needing to type sudo gluster before everything:
sudo gluster
This will give you a prompt where you can type your commands. This is a good one to get yourself oriented:
help
When you are finished, exit like this:
exit
Conclusion
At this point, you should have a redundant storage system that allows you to write to two separate servers simultaneously. This can be useful for a great number of applications and can ensure that your data is available even when one server goes down.