Webmaster Tools: An Introduction to Apache

If Apache has always seemed like a black box to you, it’s time to learn just what’s going on behind the scenes!

A web server’s job is to accept requests from clients and send back responses. For a static request, the server receives a URL, translates it to a filename, and sends that file back from the local disk; for a dynamic request, it translates the URL to a program name, executes the program, and sends its output back to the requesting party. If for any reason the web server cannot process and complete the request, it returns an error message instead. The term “web server” can refer to the machine itself (the hardware) or to the software that receives requests and sends out responses.
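To make the static case concrete, here is a minimal Python sketch of the URL-to-file step. It is not how Apache itself is implemented; the function name, the index.html default, and the 403 check for paths that escape the document directory are all assumptions made for illustration.

```python
from pathlib import Path
import tempfile


def handle_static_request(document_root: Path, url_path: str) -> tuple[int, bytes]:
    """Map a URL path to a file under document_root and return (status, body)."""
    root = document_root.resolve()
    # "/" conventionally maps to an index file; the file name is an assumption.
    relative = url_path.lstrip("/") or "index.html"
    target = (root / relative).resolve()
    # Refuse paths that escape the document root (for example "/../secret").
    if root not in target.parents and target != root:
        return 403, b"Forbidden"
    # Anything that is not an existing file becomes an error response.
    if not target.is_file():
        return 404, b"Not Found"
    return 200, target.read_bytes()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        docroot = Path(tmp)
        (docroot / "index.html").write_text("<h1>Hello</h1>")
        print(handle_static_request(docroot, "/"))
        print(handle_static_request(docroot, "/missing.html"))
```

A dynamic request would follow the same lookup, but run the target as a program and return its output rather than the file contents.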
Apache is the most popular web server available, followed by Microsoft’s IIS. The reasons behind its popularity include:

It is free to download and install.
It is open source: the source code is visible to everyone, which lets anyone who is up to the challenge adjust the code, optimize it, and fix errors and security holes. People can also add new features and write new modules.
It suits all needs: Apache can be used for small websites of one or two pages, or huge websites of hundreds and thousands of pages, serving millions of regular visitors each month. It can serve both static and dynamic content.

What is Apache?

The Apache HTTP server is a program that runs in the background under a multitasking operating system and provides services to other applications that connect to it, such as client web browsers. It was first developed for Linux/Unix operating systems and was later adapted to other systems, including Windows and Mac. The Apache binary is called httpd (short for HTTP daemon) under UNIX and Apache.exe under win32.
Installing Apache on Linux requires some command-line skill, though it is not too difficult. Installing it on Windows is straightforward, as you can run the installer through a graphical user interface.
Apache’s core is fairly basic and contains a limited number of features. Its power comes from functionality added through modules, which are written by programmers and can be installed to extend the server’s capabilities. To add a new module, all you need to do is install it and restart the Apache server. Functionality that you don’t need or want can easily be removed, which is considered good practice: it keeps the server small and light, so it starts faster, consumes fewer system resources and less memory, and is less prone to security holes. The Apache server also supports third-party modules, some of which have been added to Apache 2 as permanent features. It also integrates easily with other open source applications, such as PHP and MySQL, which makes it even more powerful.

A web server in its simplest form is a computer with special software, and an internet connection that allows it to connect to other devices.

Every device connected to a network has an IP address through which others connect to and communicate with it. This IP address is sort of like a regular address that you need in real life to call or visit any contact of yours. If they didn’t have an address, you wouldn’t know how to call or reach them. IP addresses serve the exact same purpose. If a device didn’t have one, the other machines on the same network wouldn’t know how to reach it.
A server machine offers a number of services that clients might use, and Apache provides one of them: HTTP. Each service uses its own protocol and is reached through its own port. Typical examples are the hypertext transfer protocol (HTTP), usually on port 80; the simple mail transfer protocol (SMTP), usually on port 25; the domain name service (DNS), which maps domain names to their corresponding IP addresses, generally on port 53; and the file transfer protocol (FTP) for uploading and downloading files, usually on port 21.

How Apache Works

Apache’s main role is communication over networks, and for that it uses TCP/IP (Transmission Control Protocol/Internet Protocol), which allows devices with IP addresses to communicate with one another across a network.

The TCP/IP protocol is a set of rules that define how clients make requests and how servers respond, and determine how data is transmitted, delivered, received, and acknowledged.

The Apache server is set up through configuration files, in which directives control its behavior. In its idle state, Apache listens on the IP addresses and ports identified in its config file (httpd.conf). Whenever it receives a request, it analyzes the headers, applies the rules specified in the config file, and takes action.
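As a rough illustration of what "directives in a configuration file" means, the Python sketch below reads a small fragment with one directive per line and skips blank lines and lines starting with #. The sample values are invented for the example, and real configuration files also contain things this sketch ignores, such as container sections and line continuations.

```python
SAMPLE_CONFIG = """\
# Illustrative fragment only; not a complete httpd.conf
ServerName www.example.com
Listen 80
DocumentRoot "/var/www/html"

# Log locations
ErrorLog logs/error_log
"""


def parse_directives(text: str) -> list[tuple[str, str]]:
    """Return (directive, arguments) pairs, skipping blank lines and comments."""
    directives = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # The first word is the directive name; the rest are its arguments.
        name, _, arguments = line.partition(" ")
        directives.append((name, arguments.strip()))
    return directives


for name, arguments in parse_directives(SAMPLE_CONFIG):
    print(f"{name}: {arguments}")
```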
But one server can host many websites, not just one, though to the outside world they seem separate from one another. To achieve this, each website is assigned a different name, even if all the names eventually map to the same machine. This is accomplished using what are known as virtual hosts.
Since IP addresses are difficult to remember, we, as visitors to specific sites, usually type their domain names into the address box of our browsers. The browser then contacts a DNS server, which translates the domain name to its IP address, and connects to that address. The browser also sends a Host header with the request so that a server hosting multiple sites knows which one to serve.
For example, typing www.google.com into your browser’s address field might send the following request to the server at that IP address:

GET / HTTP/1.1
Host: www.google.com

The first line contains several pieces of information: the method (in this case GET); the URI, which specifies which page is to be retrieved or which program is to be run (in this case the root directory, denoted by /); and the HTTP version (in this case HTTP/1.1).
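The Python sketch below splits a request like the one above into its parts and then uses the Host header to pick a site, which is the idea behind virtual hosts. The host names and directories in the table are hypothetical, and the lookup is a simplification rather than Apache's actual matching logic.

```python
def parse_request(raw: str) -> dict:
    """Split a raw HTTP request into its request line and headers."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    # Request line: method, URI, and HTTP version separated by spaces.
    method, uri, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {"method": method, "uri": uri, "version": version, "headers": headers}


# Hypothetical virtual-host table: host name -> document directory.
VIRTUAL_HOSTS = {
    "www.example.com": "/var/www/example",
    "blog.example.com": "/var/www/blog",
}


def select_site(request: dict, default: str = "/var/www/default") -> str:
    """Choose a site from the Host header, ignoring any :port suffix."""
    host = request["headers"].get("host", "").split(":")[0].lower()
    return VIRTUAL_HOSTS.get(host, default)


request = parse_request("GET / HTTP/1.1\r\nHost: www.example.com\r\n\r\n")
print(request["method"], request["uri"], request["version"])
print(select_site(request))
```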

HTTP is a stateless request/response protocol. It is a set of rules that govern communication between a client and a server. The client (usually, but not necessarily, a web browser) makes a request, the server sends back a response, and the exchange ends. The server does not wait for further communication, unlike protocols that remain in a waiting state after a request is over.
If the request is successful, the server returns a 200 status code (which means the request succeeded), the response headers, and the requested data. The response headers from an Apache server might look something like the following:

HTTP/1.1 200 OK
Date: Sun, 10 Jun 2012 19:19:21 GMT
Server: Apache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Pragma: no-cache
Last-Modified: Sun, 10 Jun 2012 19:19:21 GMT
Vary: Accept-Encoding,User-Agent
Content-Type: text/html; charset=UTF-8
Content-Length: 7560

The first line in the response header is the status line. It contains the HTTP version and the status code. The date follows next, and then some information about the host server and the retrieved data. The Content-Type header tells the client the type of data retrieved so it knows how to handle it. Content-Length tells the client the size of the response body. If the request didn’t go through, the client would get an error code and message, such as the following status line for a page-not-found error:

HTTP/1.1 404 Not Found
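
A client reads these responses by splitting the status line from the headers. The Python sketch below does that for a shortened copy of the 200 response shown earlier and for the 404 status line; the function name is made up for this example, and it assumes every status line includes a reason phrase.

```python
SAMPLE_RESPONSE_HEAD = """\
HTTP/1.1 200 OK
Date: Sun, 10 Jun 2012 19:19:21 GMT
Server: Apache
Content-Type: text/html; charset=UTF-8
Content-Length: 7560
"""


def parse_response_head(raw: str) -> tuple[str, int, str, dict]:
    """Return (version, status code, reason phrase, headers) from a response head."""
    lines = raw.strip().splitlines()
    # Status line: HTTP version, numeric status code, then the reason phrase.
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values such as dates contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, int(code), reason, headers


version, code, reason, headers = parse_response_head(SAMPLE_RESPONSE_HEAD)
print(code, reason, headers["Content-Type"], headers["Content-Length"])
print(parse_response_head("HTTP/1.1 404 Not Found")[1:3])
```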

TCP/IP Protocol

TCP/IP is actually two protocols, one built on top of the other. IP is responsible for getting the transferred data from one point to another. It takes the data to be transferred between the two points, splits it into smaller packets, attaches the source and destination addresses to each packet, and sends the packets on their way.
TCP handles establishing the connection between the two parties, making sure the data arrives at its destination, detecting any data loss, and managing recovery.
Once data is received, the destination sends an acknowledgment (ACK) back to the sending host, notifying it that the data has arrived. If something goes wrong, such as a packet being lost, no acknowledgment arrives; TCP has no separate “not acknowledged” message, so the sender notices the missing ACK (typically when a timer expires) and resends the data.
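The toy Python model below loosely follows this description: data is split into addressed, numbered packets, and any packet that is not acknowledged is sent again. It is a deliberately simplified sketch, not real TCP/IP; the class and function names, the addresses, and the "lose the first delivery" rule are all assumptions for illustration.

```python
def split_into_packets(data: bytes, src: str, dst: str, size: int) -> list[dict]:
    """Split data into numbered packets, each labelled with source and destination."""
    return [
        {"src": src, "dst": dst, "seq": seq, "payload": data[offset:offset + size]}
        for seq, offset in enumerate(range(0, len(data), size))
    ]


class LossyReceiver:
    """Toy receiver that 'loses' the first delivery of the chosen packets."""

    def __init__(self, drop_once: set[int]):
        self.drop_once = set(drop_once)
        self.received: dict[int, bytes] = {}

    def deliver(self, packet: dict) -> bool:
        """Return True (an acknowledgment) if the packet arrived, False if lost."""
        if packet["seq"] in self.drop_once:
            self.drop_once.discard(packet["seq"])
            return False
        self.received[packet["seq"]] = packet["payload"]
        return True

    def reassemble(self) -> bytes:
        """Put the payloads back together in sequence order."""
        return b"".join(self.received[seq] for seq in sorted(self.received))


def send_reliably(packets: list[dict], receiver: LossyReceiver, max_attempts: int = 3) -> int:
    """Send each packet, resending until acknowledged; return total transmissions."""
    transmissions = 0
    for packet in packets:
        for _ in range(max_attempts):
            transmissions += 1
            if receiver.deliver(packet):
                break
        else:
            raise ConnectionError(f"packet {packet['seq']} was never acknowledged")
    return transmissions


packets = split_into_packets(b"Hello, Apache!", "192.0.2.10", "192.0.2.20", size=5)
receiver = LossyReceiver(drop_once={1})
sent = send_reliably(packets, receiver)
print(len(packets), "packets,", sent, "transmissions:", receiver.reassemble())
```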
As discussed earlier, a server can offer many services that clients might want to connect to. Each service is accessed through a particular port, which is how TCP differentiates between them. This way, any one interface (or host) can offer multiple services. So when a client connects to a host, it passes the port number along with the IP address. Browsers use HTTP, which by default uses port 80, so there’s no need to specify the port.
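In code, "passing the port number along with the IP address" looks like the Python sketch below, which uses the standard socket module. A tiny one-shot service is started on the local machine purely so the client has something to connect to; port 0 asks the operating system for any free port, whereas a real web server would normally use port 80.

```python
import socket
import threading


def serve_once(listener: socket.socket, reply: bytes) -> None:
    """Accept a single connection, send a reply, and close it."""
    conn, _addr = listener.accept()
    with conn:
        conn.sendall(reply)


# Bind to port 0 so the OS picks a free port for this demonstration.
listener = socket.create_server(("127.0.0.1", 0))
host, port = listener.getsockname()
worker = threading.Thread(target=serve_once, args=(listener, b"hello from the service"))
worker.start()

# The client names the service by IP address and port number together.
with socket.create_connection((host, port), timeout=5) as client:
    received = client.recv(1024)

worker.join()
listener.close()
print(f"connected to {host}:{port}, received {received!r}")
```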
The following image is a snapshot of my FTP software (WinSCP). As you can see, to connect to my server over FTP I need to provide not only the IP address (or the domain name) but also the port number through which my server provides the service. For FTP, the port number is 21. For SFTP (secure FTP), it is 22.

Under UNIX, a list of services offered along with their respective port numbers can be found in the file /etc/services. The following command will display the contents of the file:

more /etc/services

Below is a screenshot showing part of the file. As you can see, services are listed in the first column, followed by the port number at which each is accessed and the name of the protocol it uses.
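
The same file can be read programmatically. The Python sketch below parses a short sample in that layout; the sample lines are a hand-written excerpt rather than a copy of any particular system's file, and the function name is an assumption for this example.

```python
SAMPLE_SERVICES = """\
# service-name  port/protocol  [aliases]
ftp             21/tcp
ssh             22/tcp
smtp            25/tcp          mail
domain          53/udp
http            80/tcp          www
"""


def parse_services(text: str) -> dict[tuple[str, str], int]:
    """Map (service name, protocol) to port number, ignoring comments."""
    table = {}
    for line in text.splitlines():
        # Drop anything after a '#', then skip lines that end up empty.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        name, port_and_protocol = fields[0], fields[1]
        port, protocol = port_and_protocol.split("/")
        table[(name, protocol)] = int(port)
    return table


services = parse_services(SAMPLE_SERVICES)
print(services[("http", "tcp")], services[("ftp", "tcp")], services[("domain", "udp")])
```

On a UNIX machine, passing the contents of /etc/services to the same function should give the full table.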

Under Windows the file is called services, and can be found under C:\WINNT\system32\drivers\etc\ (C:\Windows\System32\drivers\etc\ on newer versions).

Inetd

To preserve system resources, UNIX handles many of its services through the internet daemon (inetd) rather than through constantly running daemons. inetd is a super server that listens on the various ports and handles each connection request by starting a new copy of the appropriate daemon (program). The new copy takes it from there and works with the client, while inetd goes back to listening for new client requests. Once the request is processed and the communication is over, that copy of the daemon exits.
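
The pattern can be sketched in Python as follows. This is only a model of the idea, not inetd itself: threads stand in for the separate programs a real super server would start, and the two toy services, their names, and the fixed connection count are assumptions that keep the example short and self-terminating.

```python
import selectors
import socket
import threading


def echo_upper(conn: socket.socket) -> None:
    """Stand-in service: reply once with the upper-cased input, then exit."""
    with conn:
        conn.sendall(conn.recv(1024).upper())


def echo_reverse(conn: socket.socket) -> None:
    """Stand-in service: reply once with the reversed input, then exit."""
    with conn:
        conn.sendall(conn.recv(1024)[::-1])


def super_server(services: list, connections_to_handle: int) -> None:
    """Watch every listening socket; hand each new connection to a fresh handler."""
    selector = selectors.DefaultSelector()
    for listener, handler in services:
        selector.register(listener, selectors.EVENT_READ, data=handler)
    workers = []
    while len(workers) < connections_to_handle:
        for key, _events in selector.select(timeout=5):
            conn, _addr = key.fileobj.accept()
            # Start a new handler for this client, then go back to listening.
            worker = threading.Thread(target=key.data, args=(conn,))
            worker.start()
            workers.append(worker)
    for worker in workers:
        worker.join()
    selector.close()


def ask(port: int, message: bytes) -> bytes:
    """Connect to a local port, send a message, and return the reply."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(message)
        return client.recv(1024)


upper_listener = socket.create_server(("127.0.0.1", 0))
reverse_listener = socket.create_server(("127.0.0.1", 0))
services = [(upper_listener, echo_upper), (reverse_listener, echo_reverse)]
server = threading.Thread(target=super_server, args=(services, 2))
server.start()

upper_reply = ask(upper_listener.getsockname()[1], b"hello")
reverse_reply = ask(reverse_listener.getsockname()[1], b"hello")

server.join()
upper_listener.close()
reverse_listener.close()
print(upper_reply, reverse_reply)
```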

General Structure

As mentioned earlier, Apache can be installed on a variety of operating systems. Regardless of the platform used, a hosted website will typically have four main directories: htdocs, conf, logs, cgi-bin.
htdocs is the default Apache web server document directory, meaning it is the public directory whose contents are usually available for clients connecting through the web. It contains all static pages and dynamic content to be served once an HTTP request for them is received. Since files and sub-directories under htdocs are available to the public, correct handling of file permissions is of great importance so as not to compromise the server’s safety and security.
conf is the directory where all server configuration files are located. Configuration files are plain text files in which directives are added to control the web server’s behavior and functionality. Each directive is usually placed on a separate line, and a line that begins with a hash (#) is a comment and is ignored.
logs is the directory where server logs are kept, and includes Apache access logs and error logs. The Apache HTTP Server provides a variety of different mechanisms for logging everything that happens on it, from the initial request, through the URL mapping process, to the final resolution of the connection, including any errors that may have occurred in the process. In addition to this, third-party modules may provide logging capabilities, or inject entries into the existing log files, and applications such as PHP scripts, or other handlers, may send messages to the server error log.
cgi-bin is the directory where CGI scripts are kept. The CGI (Common Gateway Interface) defines a way for a web server to interact with external content-generating programs, which are often referred to as CGI programs or CGI scripts. These are programs or shell scripts that are written to be executed by Apache on behalf of its clients.
It is important to note that the file and directory names discussed above (as well as their locations) can differ from one server to another, depending on the Apache flavor installed and the operating system it runs under. Their roles, though, remain the same.
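
To make the CGI idea from the cgi-bin description more concrete, here is a minimal Python sketch of a server running an external program and collecting its output. The tiny program is passed inline instead of living in a cgi-bin directory, which is an assumption to keep the example self-contained; REQUEST_METHOD and QUERY_STRING are standard CGI environment variables, while the function name is made up for this example.

```python
import os
import subprocess
import sys

# Stand-in CGI program: prints a header block, a blank line, then the body.
CGI_PROGRAM = (
    "import os\n"
    "print('Content-Type: text/plain')\n"
    "print()\n"
    "print('You asked for ' + os.environ['QUERY_STRING'])\n"
)


def run_cgi(program: str, query_string: str) -> tuple[dict, str]:
    """Run the program with CGI-style environment variables; return (headers, body)."""
    env = dict(os.environ, REQUEST_METHOD="GET", QUERY_STRING=query_string)
    result = subprocess.run(
        [sys.executable, "-c", program],
        env=env,
        capture_output=True,
        text=True,
        check=True,
        timeout=10,
    )
    # The program's output is headers, a blank line, then the response body.
    head, _, body = result.stdout.partition("\n\n")
    headers = dict(line.split(": ", 1) for line in head.splitlines())
    return headers, body


headers, body = run_cgi(CGI_PROGRAM, "page=1")
print(headers)
print(body.strip())
```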

Conclusion

Apache has been the most popular web server on the internet since 1996, with more than half the sites on the web running on it. It played a key role in shaping the World Wide Web into what it is today. The reasons behind its success are clear, and the way things are looking, it will probably stay in the lead for quite some time. This was meant to be an introductory session on this powerful piece of software, and I hope it helped in understanding what this great tool is and how it generally works.

Thursday, July 12, 2012
