Musings of a Data professional

Stuart Moore

Category: R

Getting Perfmon Data from SQL Server into R for analysis

In the last couple of posts we’ve looked at how to migrate perfmon data from CSV files into SQL Server (Migrating perfmon CSV files into SQL Server for analysis) and how to use R to graph Perfmon data (Simple plot of perfmon data in R).

So, the next obvious step is to combine the two, and use R to graph Perfmon data you’ve imported into SQL Server. It turns out that R makes this pretty simple via the RODBC Package.

Assuming you’ve created a SQL Server system DSN called perfmon_odbc, then the following code snippet would plot you a lovely graph of the counters:

  • % Processor Time
  • Avg. Disk Queue Length
  • Current Connections
  • Processor Queue Length

as recorded on Server A on the 19th June 2013:

install.packages(c("ggplot2","reshape2","RODBC"))
library("ggplot2")
library("reshape2")
library("RODBC")

perfconn<-odbcConnect("perfmon_odbc")
perf<-sqlQuery(perfconn,"select distinct a.CounterDateTime, b.CounterName, a.CounterValue from CounterData a inner join CounterDetails b on a.CounterID=b.CounterID where b.MachineName='\\\\Server A' and b.CounterName in('% Processor Time','Avg. Disk Queue Length','Current Connections','Processor Queue Length') and a.CounterDateTime>'2013-06-19' and a.CounterDateTime<'2013-06-20'")

perf$CounterDateTime <- as.POSIXct(perf$CounterDateTime)
ggplot(data=perf, aes(x=CounterDateTime,y=CounterValue, colour=CounterName)) +
geom_point(size=.5) +
stat_smooth() +
ggtitle("Server A OS Metrics - 19/06/2013")
odbcCloseAll()

Note that this time we didn’t have to melt our data like we did before, as the resultset from RODBC is already in the format we need. But also spot that we’ve had to add some extra \’s to the machine name. \ is an escape character in R, so we need the extra \ to escape the escape character.
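As a quick check of the escaping described above, here is a minimal sketch using only base R. The variable name machine is just for illustration; the point is that the four backslashes typed in the source become two in the string R actually stores and sends to SQL Server.

```r
# Four backslashes in the source text become two in the stored string.
machine <- "\\\\Server A"

# nchar() counts the characters actually stored: \ \ S e r v e r <space> A
nchar(machine)

# writeLines() prints the string unescaped, i.e. as SQL Server will see it.
writeLines(machine)
```

Printing the variable with print() shows the escaped form again, which is why writeLines() is the more useful check here.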

Now you’ve all the flexibility of creating the dataset via T-SQL. This means you can start doing things like comparing 2 Servers across the same time period:

install.packages(c("ggplot2","reshape2","RODBC"))
library("ggplot2")
library("reshape2")
library("RODBC")

perfconn<-odbcConnect("perfmon_odbc")
perf<-sqlQuery(perfconn,"select distinct a.CounterDateTime, b.MachineName+' '+b.CounterName as 'CounterName', a.CounterValue from CounterData a inner join CounterDetails b on a.CounterID=b.CounterID where b.MachineName in('\\\\Server A', '\\\\Server B') and b.CounterName in('% Processor Time','Avg. Disk Queue Length','Current Connections','Processor Queue Length') and a.CounterDateTime>'2013-06-19' and a.CounterDateTime<'2013-06-20'")

perf$CounterDateTime <- as.POSIXct(perf$CounterDateTime)
ggplot(data=perf, aes(x=CounterDateTime,y=CounterValue, colour=CounterName)) +
stat_smooth() +
ggtitle("Server A & B OS Metrics - 19/06/2013")
odbcCloseAll()

Note that in this SQL query we’ve specified 2 machine names, and then to make sure that R can distinguish between the counters we’ve appended the Machine Name to the Counter Name in the column list. I’ve also taken out the geom_point(size=.5) line, as with the number of counters now being plotted, having the data points makes the curves hard to see and compare.
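If you find yourself comparing different sets of servers, the list of machine names can be built from an R vector rather than typed into the query by hand. The following is only a sketch in base R: the names machines, in_list and query are made up for illustration, and it assumes the machine names are values you trust, since they are pasted straight into the SQL text.

```r
# Hypothetical list of servers to compare; each entry is \\Server X once R
# has processed the escape characters.
machines <- c("\\\\Server A", "\\\\Server B")

# Wrap each name in single quotes and join them with commas for the IN list.
in_list <- paste0("'", machines, "'", collapse = ", ")

# Assemble the same style of query used above around the generated list.
query <- paste0(
  "select distinct a.CounterDateTime, ",
  "b.MachineName+' '+b.CounterName as 'CounterName', a.CounterValue ",
  "from CounterData a inner join CounterDetails b on a.CounterID=b.CounterID ",
  "where b.MachineName in(", in_list, ")"
)

# writeLines() shows the query text exactly as it would be sent.
writeLines(query)
```

The resulting string can then be passed to sqlQuery in place of the literal query text.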

You can extend this to pull the counters for a Server from 2 different days. This makes it easy to check if the spike is a normal daily occurrence, or really is the source of your current issues:

install.packages(c("ggplot2","reshape2","RODBC"))
library("ggplot2")
library("reshape2")
library("RODBC")

perfconn<-odbcConnect("perfmon_odbc")
perf<-sqlQuery(perfconn,"select distinct a.CounterDateTime, 'Day 1 -'+b.CounterName as 'CounterName', a.CounterValue from CounterData a inner join CounterDetails b on a.CounterID=b.CounterID where b.MachineName='\\\\Server A' and b.CounterName in('% Processor Time','Avg. Disk Queue Length','Current Connections','Processor Queue Length') and a.CounterDateTime>'2013-06-19' and a.CounterDateTime<'2013-06-20' union select distinct a.CounterDateTime, 'Day 2 -'+b.CounterName, a.CounterValue from CounterData a inner join CounterDetails b on a.CounterID=b.CounterID where b.MachineName='\\\\Server A' and b.CounterName in('% Processor Time','Avg. Disk Queue Length','Current Connections','Processor Queue Length') and a.CounterDateTime>'2013-06-12' and a.CounterDateTime<'2013-06-13'")

perf$CounterDateTime <- as.POSIXct(paste("1900-01-01", substr(perf$CounterDateTime,12,28)))
ggplot(data=perf, aes(x=CounterDateTime,y=CounterValue, colour=CounterName)) +
stat_smooth() +
ggtitle("Server A Metrics - 12/06/2013 and 19/06/2013")
odbcCloseAll()

In this case we union the 2 result sets, and rename the counters to make sure R can identify which set belongs to which date. We also set the date component of the timestamps to the same day (I use 01/01/1900 as it’s nice and obvious when looking at a chart later that it’s been reset). This makes sure R plots the time values against each other correctly.
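To see what the date reset does in isolation, here is a small base R sketch. The two timestamps are made-up sample values in the same layout as CounterDateTime, and for simplicity it keeps only the hh:mm:ss part (characters 12 to 19) and fixes the time zone to UTC, which differs slightly from the snippet above.

```r
# Two made-up samples taken at nearly the same time on different days.
stamps <- c("2013-06-19 09:15:30.123", "2013-06-12 09:15:45.456")

# Characters 12 to 19 hold the hh:mm:ss part of each timestamp.
time_only <- substr(stamps, 12, 19)

# Glue every time onto the same dummy date so the days overlay on one axis.
same_day <- as.POSIXct(paste("1900-01-01", time_only), tz = "UTC")

format(same_day, "%Y-%m-%d %H:%M:%S")
diff(same_day)
```

After the reset the two samples sit 15 seconds apart on the x axis instead of a week apart.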

Using R to average perfmon statistics and plot them

In the last post (Simple plot of perfmon data in R) I covered how to do a simple plot of perfmon counters against time. This post will cover a couple of slightly more advanced ways of plotting the data.

First up is if you want to average your data to take out some of the high points. This could be useful if you’re sampling at 15 second intervals with perfmon but don’t need that level of detail.

The initial setup and load of data is the same as before (if you need the demo csv, you can download it here):

install.packages(c("ggplot2","reshape2"))
library("ggplot2")
library("reshape2")

data <-read.table("C:\\R-perfmon\\R-perfmon.csv",sep=",",header=TRUE)
cname<-c("Time","Avg Disk Queue Length","Avg Disk Read Queue Length","Avg Disk Write Queue Length","Total Processor Time%","System Processes","System Process Queue Length")
colnames(data)<-cname
data$Time<-as.POSIXct(data$Time, format='%m/%d/%Y %H:%M:%S')

avgdata<-aggregate(data,list(segment=cut(data$Time,"15 min")),mean)
avgdata$Time<-NULL
avgdata$segment<-as.POSIXct(avgdata$segment, format='%Y-%m-%d %H:%M:%S')
mavgdata<-melt(avgdata,id.vars="segment")

ggplot(data=mavgdata,aes(x=segment,y=value,colour=variable))+
geom_point(size=.2) +
stat_smooth() +
theme_bw()

The first 8 lines of R code should look familiar as they’re the same used last time to load the Perfmon data and rename the columns. Once that’s done, then we:

10: Create a new dataframe from our base data using the aggregate function. We tell it to work on the data dataframe, that we want to segment it into 15 minute intervals, and that we want the mean average across each 15 minute section.
11: We drop the Time column from our new dataframe, as it’s no longer of any use to us.
12: Convert the segment column to a datetime format (note that we use a different format string here to previous calls; this is due to the way that aggregate writes the segment values).
13: We melt the dataframe to make plotting easier.
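The aggregate and cut step can be tried on its own without the demo CSV. This is a minimal sketch on made-up data: demo is an invented dataframe holding one hour of 15 second samples of a single counter, with values chosen so the expected averages are obvious.

```r
# One hour of made-up samples at 15 second intervals (240 rows).
times <- seq(as.POSIXct("2013-04-15 00:00:00", tz = "UTC"),
             by = "15 sec", length.out = 240)

# The counter steps up every 15 minutes, so each segment has a known mean.
demo <- data.frame(Time = times, cpu = rep(c(10, 20, 30, 40), each = 60))

# cut() assigns every sample to a 15 minute bucket; aggregate() averages each bucket.
avgdemo <- aggregate(demo["cpu"], list(segment = cut(demo$Time, "15 min")), mean)

print(avgdemo)
```

The 240 samples collapse to one row per 15 minute segment, which is the same reduction applied to the perfmon counters above.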

And then we use the same plotting options as we did before, which gives us:

R plot of perfmon data at 15 minute average

If you compare it to this chart we plotted before with all the data points, you can see that it is much cleaner, but we’ve lost some information as it’s averaged out some of the peaks and troughs throughout the day:

Perfmon data plotted on graph using R

But we can quickly try another sized segment to help out. In this case we can just run:

minavgdata<-aggregate(data,list(segment=cut(data$Time,"1 min")),mean)
minavgdata$Time<-NULL
minavgdata$segment<-as.POSIXct(minavgdata$segment, format='%Y-%m-%d %H:%M:%S')
mminavgdata<-melt(minavgdata,id.vars="segment")

ggplot(data=mminavgdata,aes(x=segment,y=value,colour=variable))+
geom_point(size=.2) +
stat_smooth() +
theme_bw()

Which provides us with a clearer plot than our original, but keeps much more of the information than the 15 minute average:

R plot of perfmon data at 1 minute average
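As a rough guide to how much each segment size thins the data, the arithmetic below works out the number of points per counter for a full day, assuming the 15 second sampling interval used for the demo data. The variable names are only for illustration.

```r
# Perfmon sampling interval assumed for the demo data.
sample_interval_secs <- 15

# Points per counter for one full day at each level of detail.
raw_points   <- 24 * 60 * 60 / sample_interval_secs  # every sample
points_1min  <- 24 * 60                               # 1 minute averages
points_15min <- 24 * 60 / 15                          # 15 minute averages

c(raw = raw_points, avg_1min = points_1min, avg_15min = points_15min)
```

So the 1 minute average keeps a quarter of the original points, while the 15 minute average keeps one in sixty.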

Simple plot of perfmon data in R

In the last part (here) we set up a simple R install so we could look at analysing and plotting perfmon data in R. In this post we’ll look at creating a very simple plot from a perfmon CSV. In later posts I’ll show some examples of how to clean the data up, to pull it from a SQL Server repository, combine datasets for analysis and some of the other interesting things R lets you do.

So let’s start off with some perfmon data. Here’s a CSV (R-perfmon) that contains the following counters:

  • Physical Disk C:\ Average Disk Queue Length
  • Physical Disk C:\ Average Disk Read Queue Length
  • Physical Disk C:\ Average Disk Write Queue Length
  • % Processor Time
  • Processes
  • Processor Queue Length

Perfmon was set to capture data every 15 seconds.

Save this to somewhere. For the purposes of the scripts I’m using I’ll assume you’ve put it in the folder c:\R-perfmon.

Fire up your R environment of choice; I’ll be using R Studio. So opening a new instance, I’m greeted by a clean workspace:

On the left hand side I’ve the R console where I’ll be entering the commands, and on the right various panes that let me explore the data and graphs I’ve created.

As mentioned before R is a command line language, it’s also cAse Sensitive. So if you get any strange errors while running through this example it’s probably worth checking exactly what you’ve typed. If you do make a mistake you can use the cursor keys to scroll back through commands, and then edit the mistake.

So the first thing we need to do is to install some packages. Packages are a means of extending R’s capabilities. The 2 we’re going to install are ggplot2, which is a graphing library, and reshape2, which is a library that allows us to reshape the data (basically a Pivot in SQL Server terms). We do this with the following command:

install.packages(c("ggplot2","reshape2"))

You may be asked to pick a CRAN mirror, select the one closest to you and it’ll be fine. Assuming everything goes fine you should be informed that the packages have been installed, so they’ll now be available the next time you use R. To load them into your current session, you use the commands:

library("ggplot2")
library("reshape2")

So that’s all the basic housekeeping out of the way, now let’s load in some Perfmon data. R handles data as vectors, or dataframes. As we have multiple rows and columns of data we’ll be loading it into a dataframe.

data <-read.table("C:\\R-perfmon\\R-perfmon.csv",sep=",",header=TRUE)

If everything’s worked, you’ll see no response. What we’ve done is to tell R to read the data from our file, telling it we’re using , as the separator and that the first row contains the column headers. R uses ‘<-’ as an assignment operator.
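If you’d like to try read.table without the demo file, the same call works on text held in a string. This is a sketch only: csv_text is a made-up two row extract, not the real R-perfmon.csv, and the column headers are simplified.

```r
# A made-up two row extract in the same general shape as a perfmon CSV.
csv_text <- '"Time","% Processor Time","Processes"
"04/15/2013 00:00:19.279",12.5,77
"04/15/2013 00:00:34.279",14.0,78'

# text= reads from the string instead of a file; the other arguments match
# the call used for the real CSV.
demo <- read.table(text = csv_text, sep = ",", header = TRUE)

# Headers that aren't valid R names get rewritten, hence the odd column names.
names(demo)
str(demo)
```

That renaming is why the column names in a freshly loaded perfmon file are full of dots, and why it is worth replacing them with friendlier ones before plotting.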

To prove that we’ve loaded up some data we can ask R to provide a summary:

summary(data)
    X.PDH.CSV.4.0...GMT.Daylight.Time...60.
 04/15/2013 00:00:19.279:   1
 04/15/2013 00:00:34.279:   1
 04/15/2013 00:00:49.275:   1
 04/15/2013 00:01:04.284:   1
 04/15/2013 00:01:19.279:   1
 04/15/2013 00:01:34.275:   1
 (Other)                :5754
 X..testdb1.PhysicalDisk.0.C...Avg..Disk.Queue.Length
 Min.   :0.000854
 1st Qu.:0.008704
 Median :0.015553
 Mean   :0.037395
 3rd Qu.:0.027358
 Max.   :4.780562

 X..testdb1.PhysicalDisk.0.C...Avg..Disk.Read.Queue.Length
 Min.   :0.000000
 1st Qu.:0.000000
 Median :0.000980
 Mean   :0.017626
 3rd Qu.:0.003049
 Max.   :4.742742

 X..testdb1.PhysicalDisk.0.C...Avg..Disk.Write.Queue.Length
 Min.   :0.0008539
 1st Qu.:0.0076752
 Median :0.0133689
 Mean   :0.0197690
 3rd Qu.:0.0219051
 Max.   :2.7119064

 X..testdb1.Processor._Total....Processor.Time X..testdb1.System.Processes
 Min.   :  0.567                               Min.   : 77.0
 1st Qu.:  7.479                               1st Qu.: 82.0
 Median : 25.589                               Median : 85.0
 Mean   : 25.517                               Mean   : 87.1
 3rd Qu.: 38.420                               3rd Qu.: 92.0
 Max.   :100.000                               Max.   :110.0

 X..testdb1.System.Processor.Queue.Length
 Min.   : 0.0000
 1st Qu.: 0.0000
 Median : 0.0000
 Mean   : 0.6523
 3rd Qu.: 0.0000
 Max.   :58.0000

And there we are, some nice raw data there. Some interesting statistical information given for free as well. Looking at it we can see that our maximum Disk queue lengths aren’t anything to worry about, and even though our Processor peaks at 100% utilisation, we can see that it spends 75% of the day at less than 39% utilisation. And we can see that our Average queue length is nothing to worry about.
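The "75% of the day" reading comes from the 3rd Qu. row of the summary, which is the value that three quarters of the samples fall at or below. A small sketch with made-up numbers (the cpu vector is invented, not taken from the demo CSV) shows the same idea using quantile:

```r
# Made-up processor samples, including one spike to 100%.
cpu <- c(5, 8, 20, 30, 35, 38, 45, 100)

# The third quartile, which is the figure summary() labels "3rd Qu.".
q3 <- quantile(cpu, 0.75)
q3

# Proportion of samples at or below the third quartile.
mean(cpu <= q3)
```

Even with the spike to 100 in the data, three quarters of the samples sit at or below the third quartile, which is why it is a more useful figure than the maximum when judging typical load.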

But let’s get on with the graphing. At the moment R doesn’t know that column 1 contains DateTime information, and the names of the columns are rather less than useful. To fix this we do:

cname<-c("Time","Avg Disk Queue Length","Avg Disk Read Queue Length","Avg Disk Write Queue Length","Total Processor Time%","System Processes","System Process Queue Length")
colnames(data)<-cname
data$Time<-as.POSIXct(data$Time, format='%m/%d/%Y %H:%M:%S')
mdata<-melt(data=data,id.vars="Time")

First we build up an R vector of the column names we’d rather use; “c” is the constructor that lets R know that the data that follows is to be interpreted as a vector. Then we pass this vector as an input to the colnames function that renames our dataframe’s columns for us.

On line 3 we convert the Time column to a datetime format using the POSIXct function and passing in a formatting string.
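The conversion can be checked on a couple of values by hand. In this sketch raw holds two timestamps copied from the summary output above, and the time zone is pinned to UTC purely so the result doesn’t depend on the machine it runs on.

```r
# Two timestamps in the month/day/year layout that perfmon wrote to the CSV.
raw <- c("04/15/2013 00:00:19.279", "04/15/2013 00:00:34.279")

# The format string describes the text up to the seconds; anything after the
# part it describes (here the fractional seconds) is ignored.
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

class(parsed)
diff(parsed)
```

Once the column is a proper datetime, R can do arithmetic on it, such as working out the gap between samples, and ggplot can draw a sensible time axis.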

Line 4, we melt our data. Basically we’re turning our data from this:

Time                 Variable A  Variable B  Variable C
19/04/2013 14:55:15  A1          B2          C9
19/04/2013 14:55:30  A2          B2          C8
19/04/2013 14:55:45  A3          B2          C7

to this:

ID                   Variable    Value
19/04/2013 14:55:15  Variable A  A1
19/04/2013 14:55:30  Variable A  A2
19/04/2013 14:55:45  Variable A  A3
19/04/2013 14:55:15  Variable B  B2
19/04/2013 14:55:30  Variable B  B2
19/04/2013 14:55:45  Variable B  B2
19/04/2013 14:55:15  Variable C  C9
19/04/2013 14:55:30  Variable C  C8
19/04/2013 14:55:45  Variable C  C7
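The reshaping shown in the two tables can be reproduced by hand in base R, which may help make clear what melt is doing. This is an illustration only, not how reshape2 implements it, and the wide dataframe uses made-up numeric values in place of A1, B2 and so on.

```r
# A small wide dataframe: one row per timestamp, one column per variable.
wide <- data.frame(Time = c("14:55:15", "14:55:30", "14:55:45"),
                   A = c(1, 2, 3),
                   B = c(2, 2, 2),
                   C = c(9, 8, 7))

# Long form: repeat the timestamps once per variable, label each block with
# its variable name, and stack the value columns on top of each other.
long <- data.frame(
  Time     = rep(wide$Time, times = ncol(wide) - 1),
  variable = rep(names(wide)[-1], each = nrow(wide)),
  value    = unlist(wide[-1], use.names = FALSE)
)

print(long)
```

Three rows by three variables become nine rows, each carrying its own variable label.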

Melting the data this way allows us to very quickly plot all the variables and their values against time without having to specify each series:

ggplot(data=mdata,aes(x=Time,y=value,colour=variable))+
geom_point(size=.2) +
stat_smooth() +
theme_bw()

This snippet introduces a new R technique. By putting + at the end of the line you let R know that the command is spilling over to another line, which makes complex commands like this easier to read and edit. Breaking it down line by line:

  1. Tells R that we want to use ggplot to draw a graph. The data parameter tells ggplot which dataframe we want to plot. aes lets us pass in aesthetic information; in this case we tell it that we want Time along the x axis, the value of the variable on the y axis, and to group/colour the values by variable.
  2. Tells ggplot how large we’d like the data points on the graph.
  3. Tells ggplot we want it to draw a best fit curve through the data points.
  4. Tells ggplot which theme we’d like it to use. This is a simple black and white theme.

Run this and you’ll get:

Perfmon data plotted on graph using R

The “banding” in the data points for the System processes count is due to the few discrete values that the variable takes.

The grey banding around the fitted line is a 95% confidence interval. The wider it is the greater variance of values there are at that point, but in this example it’s fairly tight to the plot so you can assume that it’s a good fit to the data.

We can see from the plot that the Server is ticking over first thing in the morning, then we see an increase in load as people start getting into the office from 07:00 onwards. There appears to be a drop off over lunch as users head away from their clients, and then load picks up through the afternoon. Load stays high through the rest of the day, and in this example it ends higher than it started as there’s an overnight import running that evening, though if you hadn’t known that, this plot would probably have raised questions and you’d have investigated. Looking at the graph it appears that all the load indicators (Processor time% and System Processes) follow each other nicely; later in this series we’ll look at analysing which one actually leads the other.

You can also plot the graph without all the data points for a cleaner look:

ggplot(data=mdata,aes(x=Time,y=value,colour=variable))+
stat_smooth() +
theme_bw()

R plot of perfmon without points

So putting all that together we have the following script which can just be cut and pasted across into R:

install.packages(c("ggplot2","reshape2"))
library("ggplot2")
library("reshape2")
data <-read.table("C:\\R-perfmon\\R-perfmon.csv",sep=",",header=TRUE)
cname<-c("Time","Avg Disk Queue Length","Avg Disk Read Queue Length","Avg Disk Write Queue Length","Total Processor Time%","System Processes","System Process Queue Length")
colnames(data)<-cname
data$Time<-as.POSIXct(data$Time, format='%m/%d/%Y %H:%M:%S')
mdata<-melt(data=data,id.vars="Time")
ggplot(data=mdata,aes(x=Time,y=value,colour=variable))+
geom_point(size=.2) +
stat_smooth() +
theme_bw()

Plotting/Graphing Perfmon data in R – Part 1

Perfmon is a great tool for getting lots of performance information about Microsoft products. But once you’ve got it it’s not always the easiest to work with, or present to management. As a techy you want to be able to use the data to compare systems across time, or even to cross correlate performance between separate systems.

And management want graphs. Well, we all want graphs. For many of my large systems I now know the shape of graph that plotting CPU against time should give me, or the amount of throughput per hour, and it only takes me a few seconds to spot when something isn’t right rather than poring through reams of data.

Perfmon data plotted on graph using R

There are a number of ways of doing this, but my preferred method is to use a tool called R. R is a free programming language aimed at Statistical analysis. This means it’s perfectly suited to working with the sort of data that comes out of perfmon, and its power means that when you want to create baselines or compare days it’s very useful. It’s also very easy to write simple scripts in R, so all the work can be done automatically each night.

It’s also one of the tools of choice for statistical analysis in the Big Data game, so getting familiar with it is a handy CV point for SQL Server DBAs.

Setting R up is fairly straightforward, but I thought I’d cover it here just so the next parts of this series follow on.

The main R application is available from the CRAN (Comprehensive R Archive Network) repository – http://cran.r-project.org/mirrors.html From there, pick a local mirror, Click the Download for Windows link, select the “base” subdirectory, and then “Download R 3.0.0 for Windows” (version is correct at time of writing 18/04/2013).

You now have an installer; double click on it and accept all the defaults. If you’re running a 64-bit OS then it will by default install both the 32-bit and 64-bit versions. You’ll now have an R folder on your Start Menu, and if you select it you’ll end up at the R console:

R console on Windows

R is a command line language out of the box. You can do everything from within the console, but it’s not always the nicest place. So the next piece of software to obtain and install is RStudio, which can be downloaded from here, and installed using the defaults. This is a nicer place to work:

R studio analysing perfmon data

You should now have a working R setup. In the next part we’ll load up some perfmon data and do some work with it.
