Dorja Web Statistics
more information on Dorja webalizer statistics
Notes on
Visits/Entry/Exit Figures
------------------
Any request made to the server that is logged is
considered a 'hit'. The requests can be for anything: HTML pages, graphic
images, audio files, CGI scripts, etc. Each valid line in the server log is
counted as a hit.
This number represents the total number of requests
that were made to the server during the specified report period.
------------------
Some requests made to the server require that the
server then send something back to the requesting client, such as an HTML page
or graphic image. When this happens, it is considered a 'file' and the files
total is incremented. The relationship between 'hits' and 'files' can be
thought of as 'incoming requests' and 'outgoing responses'.
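As a rough illustration of the hits/files distinction, the sketch below
tallies both from a list of HTTP status codes, one per logged request. It
assumes a request counts as a 'file' when the server answered with status 200;
that rule and the function name count_hits_and_files are assumptions made for
this example, not something stated above.

```python
def count_hits_and_files(status_codes):
    """Return (hits, files) for an iterable of HTTP status codes."""
    hits = 0
    files = 0
    for status in status_codes:
        hits += 1              # every valid log line is a hit
        if status == 200:      # assumption: 200 means content was sent back
            files += 1
    return hits, files


# Four logged requests, two of which returned content.
print(count_hits_and_files([200, 304, 404, 200]))
```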
------------------
Pages are, well, pages! Generally, any HTML
document, or anything that generates an HTML document, would be considered a
page. This does not include the other stuff that goes into a document, such as
graphic images, audio clips, etc... This number represents the number of
'pages' requested only, and does not include the other 'stuff' that is in the
page. What actually constitutes a 'page' can vary from server to server. The
default action is to treat anything with the extension '.htm', '.html' or
'.cgi' as a page. A lot of sites will probably define other extensions, such
as '.phtml', '.php3' and '.pl', as pages as well. Some
people consider this number as the number of 'pure' hits... I'm not sure if I
totally agree with that viewpoint. Some other programs (and people :) refer to
this as 'Pageviews'.
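The extension test described above can be sketched roughly as follows. This is
a simplified illustration that matches the extension exactly; the names
is_page and DEFAULT_PAGE_TYPES are hypothetical, and the real program's
matching rules may differ.

```python
import os
from urllib.parse import urlsplit

# Default page extensions named in the text; a site may configure others,
# such as 'phtml', 'php3' or 'pl'.
DEFAULT_PAGE_TYPES = ("htm", "html", "cgi")


def is_page(url, page_types=DEFAULT_PAGE_TYPES):
    """Return True if the URL path ends in one of the page extensions."""
    path = urlsplit(url).path                       # drop any query string
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return ext in page_types


print(is_page("/index.html"))   # a page
print(is_page("/logo.gif"))     # an image, not a page
```

Passing a longer tuple as page_types, for example
DEFAULT_PAGE_TYPES + ("php3",), mimics a site that defines extra page types.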
------------------
Each request made to the server comes from a unique
'site', which can be referenced by a name or ultimately, an IP address. The
'sites' number shows how many unique IP addresses made requests to the server
during the reporting time period. This DOES NOT mean the number of unique
individual users (real people) that visited, which is impossible to determine
using just logs and the HTTP protocol (however, this number might be about as
close as you will get).
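Counting 'sites' amounts to counting distinct client addresses, which can be
sketched as below. The function name count_sites and the sample addresses are
made up for illustration.

```python
def count_sites(client_addresses):
    """Return the number of distinct addresses that made requests."""
    return len(set(client_addresses))


# Three requests from two distinct addresses.
print(count_sites(["10.0.0.1", "10.0.0.2", "10.0.0.1"]))
```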
------------------
Whenever a request is made to the server from a
given IP address (site), the amount of time since the previous request from
that address (if any) is calculated. If the time difference is greater than a
pre-configured 'visit timeout' value, or the address has never made a request
before, it is considered a 'new visit', and this total is incremented (both
for the site and for the IP address). The default timeout value is 30 minutes
(it can be changed), so if a user visits your site at 1:00 in the afternoon
and then returns at 3:00, two visits would be registered. Note: in the 'Top
Sites' table, the visits total should be discounted on 'Grouped' records, and
thought of as the "minimum number of visits" that came from that grouping
instead.
Note: Visits only occur on PageType requests, that is, for any request whose
URL is one of the 'page' types defined with the PageType option. Due to the
limitations of the HTTP protocol, log rotations and other factors, this number
should not be taken as absolutely accurate; rather, it should be considered a
pretty close "guess".
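The visit timeout rule can be sketched as follows. This is a minimal
illustration, assuming the input is a time-ordered list of (address,
timestamp) pairs for page requests only; the name count_visits and the sample
data are hypothetical.

```python
from datetime import datetime, timedelta


def count_visits(requests, timeout=timedelta(minutes=30)):
    """Count visits from time-ordered (address, timestamp) page requests."""
    last_seen = {}   # address -> time of its most recent page request
    visits = 0
    for address, when in requests:
        previous = last_seen.get(address)
        # New visit: first request from this address, or the gap since its
        # previous request exceeds the timeout.
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[address] = when
    return visits


# The example from the text: a visit at 1:00 pm and a return at 3:00 pm.
requests = [
    ("10.0.0.1", datetime(2000, 1, 15, 13, 0)),
    ("10.0.0.1", datetime(2000, 1, 15, 13, 10)),
    ("10.0.0.1", datetime(2000, 1, 15, 15, 0)),
]
print(count_visits(requests))
```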
------------------
The KBytes (kilobytes) value shows the amount of
data, in KB, that was sent out by the server during the specified reporting
period. This value is generated directly from the log file, so it is up to the
web server to produce accurate numbers in the logs (some web servers do stupid
things when it comes to reporting the number of bytes). In general, this should
be a fairly accurate representation of the amount of outgoing traffic the
server had, regardless of the web server's reporting quirks.
Note: A kilobyte is 1024 bytes, not 1000 :)
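As a small worked example of the conversion, the sketch below sums the byte
counts taken from log records and divides by 1024. The function name
total_kbytes and the sample values are illustrative only.

```python
def total_kbytes(byte_counts):
    """Sum per-request byte counts and convert to kilobytes (1 KB = 1024 bytes)."""
    return sum(byte_counts) / 1024


# 1024 + 2048 + 512 bytes is 3584 bytes, i.e. 3.5 KB.
print(total_kbytes([1024, 2048, 512]))
```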
------------------
The Top Entry and Exit tables give a rough estimate
of which URLs are used to enter your site, and what the last pages viewed are.
Because of limitations in the HTTP protocol, log rotations, etc., these numbers
should be considered a good "rough guess" of the actual numbers. However, they
will give a good indication of the overall trend in where users come into, and
exit, your site.
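One way to derive entry and exit pages from visits is sketched below: the
first page of each visit counts as an entry, and the last page seen before the
visit ends counts as an exit. This is an assumption-laden illustration (the
30 minute timeout, the name entry_exit_pages and the sample data are all
chosen for this example), not a description of the program's internals.

```python
from collections import Counter
from datetime import datetime, timedelta


def entry_exit_pages(requests, timeout=timedelta(minutes=30)):
    """Tally entry and exit pages from time-ordered (address, time, url) page requests."""
    entries = Counter()
    exits = Counter()
    last = {}   # address -> (time, url) of its most recent page request
    for address, when, url in requests:
        previous = last.get(address)
        if previous is None or when - previous[0] > timeout:
            entries[url] += 1                 # first page of a new visit
            if previous is not None:
                exits[previous[1]] += 1       # the earlier visit ended there
        last[address] = (when, url)
    for _, url in last.values():
        exits[url] += 1                       # visits still open at end of log
    return entries, exits


requests = [
    ("10.0.0.1", datetime(2000, 1, 15, 13, 0), "/index.html"),
    ("10.0.0.1", datetime(2000, 1, 15, 13, 5), "/about.html"),
    ("10.0.0.1", datetime(2000, 1, 15, 15, 0), "/news.html"),
]
entries, exits = entry_exit_pages(requests)
print(dict(entries))
print(dict(exits))
```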
------------------
Referrers are weird critters... They take many
shapes and forms, which makes them much harder to analyze than a typical URL,
which at least has some standardization. What is contained in the referrer
field of your log files varies depending on many factors, such as what site
did the referral, what type of system it comes from and how the actual
referral was generated. Why is this? Well, because a user can get to your site
in many ways... They may have your site bookmarked in their browser, they may
simply type your site's URL into their browser, they could have clicked on a
link on some remote web page, or they may have found your site from one of the
many search engines and site indexes found on the web. The Webalizer attempts
to deal with all this variation in an intelligent way by doing certain things
to the referrer string which make it easier to analyze. Of course, if your web
server doesn't provide referrer information, you probably don't really care
and are asking yourself why you are reading this section...
Most referrers will take the form of
"http://somesite.com/somepage.html", which is what you will get if the user
clicks on a link somewhere on the web in order to get to your site. Some will
be a variation of this, and look something like "file:/some/such/sillyname",
which is a reference from an HTML document on the user's local machine.
Several variations of this can be used, depending on what type of system the
user has, whether he/she is on a local network, the type of network, etc. To
complicate things even more, dynamic HTML documents and HTML documents that
are generated by CGI scripts or external programs produce lots of extra
information which is tacked on to the end of the referrer string in an almost
infinite number of ways. If the user just typed your URL into their browser or
clicked on a bookmark, there won't be any information in the referrer field,
and it will take the form "-".
In order to handle all these variations, The
Webalizer parses the referrer field in a certain way. First, if the referrer
string begins with "http", it assumes it is a normal referral and converts the
"http://" and following hostname to lowercase in order to simplify hiding if
desired. For example, the referrer
"HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html" will become
"http://www.myhost.com/This/Is/A/HTML/Document.html". Notice that only the
"http://" and hostname are converted to lower case... The rest of the referrer
field is left alone. This follows standard convention, as the actual method
(HTTP) and hostname are always case insensitive, while the document name
portion is case sensitive.
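The lowercasing step can be sketched as follows. This is a minimal
illustration of the behaviour described above, not the program's actual code;
the function name normalize_referrer_case is hypothetical.

```python
def normalize_referrer_case(referrer):
    """Lowercase the scheme and hostname of an http referrer; leave the rest alone."""
    if not referrer.lower().startswith("http"):
        return referrer                    # e.g. "file:/..." or "-"
    scheme_end = referrer.find("://")
    if scheme_end == -1:
        return referrer
    # The hostname runs from just after "://" up to the next "/" (or the end).
    host_end = referrer.find("/", scheme_end + 3)
    if host_end == -1:
        host_end = len(referrer)
    return referrer[:host_end].lower() + referrer[host_end:]


print(normalize_referrer_case("HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html"))
```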
Referrers that came from search engines, dynamic
HTML documents, CGI scripts and other external programs usually tack on
additional information that was used to create the page. A common example of
this can be found in referrals that come from search engines and site indexes
common on the web. Sometimes, these referrer URLs can be several hundred
characters long and include all the information that the user typed in to
search for your site. The Webalizer deals with this type of referrer by
stripping off all the query information, which starts with a question mark
'?'. The referrer "http://search.yahoo.com/search?p=usa%26global%26link" will
be converted to just "http://search.yahoo.com/search".
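Stripping the query information amounts to cutting the string at the first
question mark, as in this small sketch (the function name strip_query is
hypothetical):

```python
def strip_query(referrer):
    """Drop everything from the first '?' onward."""
    return referrer.split("?", 1)[0]


print(strip_query("http://search.yahoo.com/search?p=usa%26global%26link"))
```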
When a user comes to your site by using one of
their bookmarks or by typing your URL directly into their browser, the
referrer field is blank, and looks like "-". Most sites will get more of these
referrals than any other type. The Webalizer converts this type of referral
into the string "- (Direct Request)". This is done in order to make it easier
to hide via a command line option or configuration file option: the character
"-" is a valid character elsewhere in a referrer field, and if it were not
turned into something unique, it could not be hidden without possibly hiding
other referrers that shouldn't be.
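The sketch below illustrates why the relabelling helps. It assumes hiding
works by substring match, which is an assumption made for this example; the
names label_direct_request and hide, and the sample hostname, are likewise
hypothetical.

```python
DIRECT_REQUEST = "- (Direct Request)"


def label_direct_request(referrer):
    """Replace a blank ("-") referrer with a unique, easily matched label."""
    return DIRECT_REQUEST if referrer == "-" else referrer


def hide(referrers, pattern):
    """Drop every referrer containing the pattern (assumed substring match)."""
    return [r for r in referrers if pattern not in r]


raw = ["-", "http://my-site.example/links.html"]
labelled = [label_direct_request(r) for r in raw]

print(hide(raw, "-"))                  # hides both, including the real referrer
print(hide(labelled, DIRECT_REQUEST))  # hides only the direct request
```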
----------------------
The Webalizer will do a minimal analysis on referrer strings that it finds,
looking for well known search string patterns. Most of the major search
engines are supported, such as yahoo, altavista, lycos, etc. Unfortunately,
search engines are always changing their internal/CGI query formats, new
search engines are coming on line every day, and detecting _all_ search
strings is nearly impossible. However, it should be accurate enough to give a
good indication of what users were searching for when they stumbled across
your site. Note: as of version 1.31, search engines can now be specified
within a configuration file. See the sample.conf file for examples of how to
specify additional search engines.
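Search string detection can be sketched roughly as a lookup from a search
engine's hostname to the query parameter that carries the search terms. The
table below is an illustrative assumption (only the yahoo 'p' parameter
appears in the referrer example earlier in this document), and the names
SEARCH_ENGINES and extract_search_string are hypothetical. Note that this has
to run on the original referrer, before the query information is stripped.

```python
from urllib.parse import parse_qs, urlsplit

# Hostname fragment -> query parameter holding the search terms.
# Illustrative values only; real engines change their formats over time.
SEARCH_ENGINES = {
    "yahoo.com": "p",
    "altavista.com": "q",
    "lycos.com": "query",
}


def extract_search_string(referrer):
    """Return the lowercased search terms from a known engine's referrer, or None."""
    parts = urlsplit(referrer)
    host = (parts.hostname or "").lower()
    for engine, param in SEARCH_ENGINES.items():
        if engine in host:
            values = parse_qs(parts.query).get(param)
            if values:
                return values[0].lower()
    return None


print(extract_search_string("http://search.yahoo.com/search?p=web+statistics"))
```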
----------------------------------
The majority of data analyzed and reported on by
The Webalizer is as accurate and correct as possible based on the input log
file.
However, due to the limitations of the HTTP
protocol, the use of firewalls, proxy servers, multi-user systems, the rotation
of your log files, and a myriad of other conditions, some of these numbers
cannot be calculated with absolute accuracy. In particular, Visits, Entry
Pages and Exit Pages are subject to random errors due to the above and other
conditions. The reason for this is twofold: 1) log files are finite in size
and time interval, and 2) there is no way to tell multiple individual users
apart given only an IP address. Because log files are finite, they have a
beginning and an ending, which can be represented as a fixed time period.
There is no way of knowing what happened previous to this time period, nor is
it possible to predict future events based on it. Also, because it is
impossible to tell individual users apart, multiple users that have the same
IP address all appear to be a single user, and are treated as such. This is
most common where corporate users sit behind a proxy/firewall to the outside
world, and all requests appear to come from the same location (the address of
the proxy/firewall itself). Dynamic IP assignment (used with dial-up internet
accounts) also presents a problem, since the same user will appear to come
from multiple places.
For example, suppose two users visit your server
from XYZ company, which has its network connected to the internet by a proxy
server 'fw.xyz.com'. All requests from the network look as though they
originated from 'fw.xyz.com', even though they were really initiated by two
separate users on different PCs. The Webalizer would see these requests as
coming from the same location, and would record only 1 visit, when in reality
there were two. Because entry and exit pages are calculated in conjunction
with visits, this situation would also only record 1 entry and 1 exit page,
when in reality there should be 2.
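The proxy example can be reproduced with a small sketch. The 'user' column
below is hypothetical ground truth that a real log does not contain; it is
included only to compare what the log shows against what actually happened.
The user names, times, the 30 minute timeout and the count_visits helper are
all assumptions for illustration.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

# (real user, address seen in the log, time of page request)
requests = [
    ("alice", "fw.xyz.com", datetime(2000, 1, 15, 10, 0)),
    ("bob", "fw.xyz.com", datetime(2000, 1, 15, 10, 5)),
]


def count_visits(keys_and_times):
    """Count visits from time-ordered (key, timestamp) pairs."""
    last_seen, visits = {}, 0
    for key, when in keys_and_times:
        previous = last_seen.get(key)
        if previous is None or when - previous > TIMEOUT:
            visits += 1
        last_seen[key] = when
    return visits


# Keyed by address, as a log analyzer must do: both users collapse into one.
visits_from_log = count_visits((addr, t) for _, addr, t in requests)
# Keyed by the real person, which the log cannot reveal.
visits_in_reality = count_visits((user, t) for user, _, t in requests)

print(visits_from_log, visits_in_reality)
```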
As another example, say a single user at XYZ
company is surfing around your website. They arrive at 11:52pm on the last day
of the month, and continue surfing until 12:30am, which is now a new day (in a
new month). Since a common practice is to rotate (save then clear) the server
logs at the end of the month, you now have the user's visit logged in two
different files (current and previous months). Because of this (and the fact
that the Webalizer clears history between months), the first page the user
requests after midnight will be counted as an entry page. This is unavoidable,
since it is the first request seen from that particular IP address in the new
month.
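The month boundary effect can be illustrated with the sketch below, which
counts visits for a single address twice: once with history kept across
midnight, and once with history cleared at each month boundary. The request
times, the 30 minute timeout and the count_visits helper are assumptions
chosen to match the example above.

```python
from datetime import datetime, timedelta
from itertools import groupby

TIMEOUT = timedelta(minutes=30)


def count_visits(times):
    """Count visits for one address from a time-ordered list of page requests."""
    visits, previous = 0, None
    for when in times:
        if previous is None or when - previous > TIMEOUT:
            visits += 1
        previous = when
    return visits


# One user surfing from 11:52pm on January 31 until 12:30am on February 1.
times = [
    datetime(2000, 1, 31, 23, 52),
    datetime(2000, 2, 1, 0, 10),
    datetime(2000, 2, 1, 0, 30),
]

continuous = count_visits(times)   # history kept across midnight
per_month = sum(                   # history cleared at each month boundary
    count_visits(list(group))
    for _, group in groupby(times, key=lambda t: (t.year, t.month))
)

print(continuous, per_month)
```

With history cleared, the request at 12:10am is the first one seen in the new
month, so it starts a new visit and its page is counted as an entry page.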
For the most part, the numbers shown for visits,
entry and exit pages are pretty good 'guesses', even though they may not be
100% accurate. They do provide a good indication of overall trends, and
shouldn't be far enough off from the real numbers to matter much.
You should probably consider them as the 'minimum'
amount possible, since the actual (real) values should always be equal to or
greater than them in all cases.