I’m currently migrating a lot of SQL Server instances onto newer virtual machines. Quite a few of these instances talk to each other via Linked Server for various historical reasons. And a lot of that chat is done via distributed transactions, which means configuring MS Distributed Transaction Coordinator.
So one of the first things I need to do is check that DTC is working between the 2 boxes. This is dead simple with the official PowerShell module for MSDTC: it’s just a case of running Test-Dtc and then working through any errors. So, we just run this:
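# A minimal sketch; SOURCE and DESTINATION are stand-ins for the two servers being tested,
# and the resource manager port just needs to be one that's free for the test
Test-Dtc -LocalComputerName "SOURCE" -RemoteComputerName "DESTINATION" -ResourceManagerPort 17100 -Verbose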
and as this is a post about fixing something, it won’t surprise you that I’m going to get an error message:
"The OleTx CID on SOURCE and DESTINATION is the same. The CID should be unique to each computer."
At C:\Windows\system32\WindowsPowerShell\v1.0\Modules\MsDtc\TestDtc.psm1:266 char:13
+ throw ([string]::Format($Strings.SameCids, "OleTx", $Loca ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OperationStopped: ("The OleTx CID ...each computer.":String) [], RuntimeException
+ FullyQualifiedErrorId : "The OleTx CID on SOURCE and DESTINATION is the same. The CID should be uniqu
e to each computer."
But in much more fetching shades of red and yellow, announcing that “The OleTx CID on SOURCE and DESTINATION is the same. The CID should be unique to each computer.”
The cause is really simple to grasp. When the Distributed Transaction Coordinator is installed it registers a GUID to identify itself, the theory being that a GUID clash should be a vanishingly rare occurrence.
That is, until someone’s cloning Virtual Machines. So I have a batch of shiny new VMs that all think they’re the same instance of DTC. That’s not so good. The accepted fix used to be to manually remove the Distributed Transaction Coordinator, clean up the registry, restart and then reinstall everything. That sounds like a lot of work to me!
The msdtc module makes it very simple to do. While DTC normally installs with a default log path, we’ll grab the current log location first just to be sure. Then it’s just a case of Uninstall-Dtc followed by Install-Dtc:
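# A sketch of the fix, assuming the default local DTC instance; check the property names
# on the Get-DtcLog output for your OS version before relying on them
$dtcLog = Get-DtcLog

# Remove the cloned DTC instance, then reinstall it, which generates a fresh CID
Uninstall-Dtc
Install-Dtc -LogPath $dtcLog.Path -StartType AutoStart

# Re-run the test to confirm the two boxes now have their own CIDs
Test-Dtc -LocalComputerName "SOURCE" -RemoteComputerName "DESTINATION" -ResourceManagerPort 17100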
So you’ve got the when, and you’ve got the who; now how do you proceed? Remember, you could be holding someone’s job or professional reputation in your hands.
As a long time DBA I’m well aware that applications are not always the best written things, and instructions don’t always spell out correctly what an option is going to do.
In the case I was writing about I withheld the name from general knowledge; a single senior manager was made aware that I knew who it was and would be investigating. Why did I do this?
In the case of this particular third party application, there were a couple of reasons:
Deleting data within the app should write to an internal audit trail, and this hadn’t happened
Users cannot generate their own SQL queries, so this wasn’t an ad hoc query either
The single row delete was in its own transaction
This particular app is badly written. We’ve seen it fail to correctly write its own ini files on exit
It has a habit of crashing out for no reason
So a bit like a lot of applications out there. Once I had some time to investigate, I started by going through logs on the client Windows machine and trying to replicate the issue on my own.
Eventually I traced it back to a coding error in the application where they processed an Update as:
Delete old data
Insert new data
Write audit log
None of this happened inside an encompassing transaction (as I could see from studying the Transaction Log). So when the app crashed just after step 1, the row was simply left missing, with nothing to show what had happened.
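Just to illustrate the difference (the audit table and column names here are made up, not the vendor’s actual schema), wrapping the three steps in a single transaction means a crash part-way through is rolled back as a whole during recovery, rather than leaving the row missing:

-- Hypothetical sketch only; dbo.ObjectAudit and the column names are invented for illustration
BEGIN TRANSACTION;

    DELETE FROM dbo.Objects WHERE ObjectID = 1;

    INSERT INTO dbo.Objects (ObjectID, ObjectValue)
    VALUES (1, 'updated value');

    INSERT INTO dbo.ObjectAudit (ObjectID, Action, ChangedAt)
    VALUES (1, 'UPDATE', SYSDATETIME());

COMMIT TRANSACTION;
-- If the application dies anywhere before the COMMIT, SQL Server rolls all three
-- statements back on recovery, so you never end up with the row silently missing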
The user who was logged in at the time wouldn’t have been aware of the crash, as the GUI doesn’t display application status. The crash also happened at 16:55, which is pretty much home time, so they wouldn’t have looked at the record again.
Now that I had the evidence I could talk to the senior manager and show that we needed to raise this with the supplier, that the user was blameless, and that we should recommend users double check any updates to be sure.
Releasing the name to a less trusted audience without this background information would probably have led to the user being blamed for something that wasn’t really their fault.
So, just because you have the information it doesn’t mean it’s actionable without adding more context and knowledge around it.
Nice quiet day in the office, busily cracking through the To Do list when suddenly something pops into the ticketing system as a P1. Data has gone ‘missing’ from a corporate system, and there’s nothing in the audit trail about when it went missing or how it went missing. For how, read ‘Whodunnit’!
Now if there’s nothing in the audit trails, and no one is confessing, then that doesn’t leave me with very much to go on. A slight breakthrough is that someone clearly recalls seeing the data at 09:45 on the 3rd of February, but that’s as good as it gets. We’re now on the 6th of February, so that’s a window of just over 3 days in which it could have gone missing.
Now, as any DBA knows, SQL Server comes equipped with a built-in ‘audit’ of every operation that modifies data. It’s just awkward and complex to read and search.
The name of that audit is the Transaction Log. SQL Server won’t make any data modification without it being written into the log, as it doesn’t count a transaction as committed until it’s been logged; it needs that record for recovery after a restart.
The transaction log holds a LOT of information. Searching 3 days’ worth of it would be like looking for something in a haystack. And at the moment we don’t even know if that something is a needle, a pin or an apple seed.
An aside about reading Transaction Logs
SQL Server comes with 2 built-in, but undocumented, functions: fn_dblog and fn_dump_dblog. Undocumented means Microsoft doesn’t publish documentation for them and reserves the right to modify them without notice. But they’re a pretty open secret, and there’s plenty of information out there on how to use them.
For the purposes of this post, you just need to know this about them (there’s a quick example at the end of this aside):
fn_dblog reads the current transaction log
fn_dump_dblog reads transaction log backups
If you’re on a SQL version lower than SQL Server 2014 or SQL Server 2012 SP2 then there is a known bug that uses up threads and can cause hanging.
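If you just want to see what they return, here’s a minimal sketch against the current log (run it in the database you’re interested in):

-- fn_dblog takes a start and end LSN; NULL, NULL means 'everything in the active log'
SELECT Operation, COUNT(*) AS LogRecords
FROM fn_dblog(NULL, NULL)
GROUP BY Operation
ORDER BY LogRecords DESC;
-- fn_dump_dblog works the same way against a log backup, but also takes a device type,
-- a backup file number and up to 64 file name parameters, with the unused ones passed as DEFAULT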
Finding the time of deletion
So the first thing to do is to work out a time window when the data went missing. This was going to be a tedious process, so let’s automate it. If we’re automating a SQL Server task, then dbatools becomes the obvious answer.
The plan is:
Restore the database to the last time the data was seen
Roll forward in 6 hour increments until the data disappears
Roll forward in 30 minute steps from the beginning of the last 6 hour slot until the data disappears
Roll forward from the 30 minute start to the disappearance of the data in 5 minute increments
So we now have a 5 minute window we need to search in the transaction log backup, which is much better than 72 hours
To speed things up we’ll scan all the backup headers first and save them into an object which we can reuse throughout the process. We’ll also rename the database and change the filenames on restore, because we’re restoring the database onto the same instance it came from; the reason we have to use the same instance will become clearer later on.
We also gather some extra information about the missing rows, which I’ll explain in a moment.
# Time to start Restoring from
$startTime = Get-Date '2020-02-03 09:45'
# The time in minutes between restore points
$windowSize = @(360, 30, 5)
# Used to track how far through the windows we are
$counterLoop = 0
# Query to test if the object has been deleted during the window
# We also gather some internal SQL Server page information we'll need later
$sqlQuery = 'select db_id(), sys.fn_PhysLocFormatter(%%physloc%%),* from dbo.Objects where ObjectID=''1'''
# Scan all the backup headers first to speed things up
$backups = Get-DbaBackupInformation -Path c:\RestoredBackups -SqlInstance MyInstance
# Set a variable to tell restore whether to start a new restore, or to continue
$restoreContinue = $false
while ($counterLoop -lt $windowSize.count) {
    # Restore up to the end of the current window, either continuing the existing
    # standby restore or starting afresh at a finer granularity
    $backups | Restore-DbaDatabase -SqlInstance MyInstance -DatabaseName RestoreDb -ReplaceDbNameInFile -TrustDbBackupHistory -RestoreTime $startTime.AddMinutes($windowSize[$counterLoop]) -StandbyDirectory c:\Standby -Continue:$restoreContinue
    # Keep the previous result so the page information isn't lost once the row disappears
    $oldResult = $sqlResult
    $sqlResult = Invoke-DbaQuery -SqlInstance MyInstance -Database RestoreDb -Query $sqlQuery
    if ($sqlResult.count -eq 0) {
        # If we get no results, we're past the point the data disappeared,
        # so move to the next (smaller) window size and start a new restore
        $counterLoop++
        $restoreContinue = $false
    } else {
        # If the data is still there, move the start time on and continue restoring
        $startTime = $startTime.AddMinutes($windowSize[$counterLoop])
        $restoreContinue = $true
    }
}
Write-Host "Data disappeared between $startTime and $($startTime.AddMinutes($windowSize[$counterLoop]-1))`n"
Write-Host "Missing data was on the following pages:`n"
$oldResult
Now we know when the data disappeared; next we need to work out how.
Searching the Transaction Logs
SQL Server Transaction Logs store a LOT of information, so querying them will return more information than you really want. Also, the information in the Transaction Log is there for the benefit of SQL Server, so not all of it is easily readable for us humans.
For instance, when we’re looking for a specific missing row the easiest way is to search for transaction log records that modified the physical location the row existed on. While the actual row data and SQL details are in the log, they’re encoded, so getting to them is a lot of work.
We’re looking for a specific row in a specific table, so we need the following 3 bits of information to identify its physical location:
Datafile ID
Page ID
Slot ID
These can be found for a specific row with the following query using the fn_PhysLocFormatter function which handily returns the physical location of a row in a table:
select db_id(),sys.fn_PhysLocFormatter(%%physloc%%) from dbo.Objects where ObjectID='1'
This will give you a result like:
db_id    File:Page:Slot
5        (1:232:58)
The reason we were saving this from the previous iteration of the loop is that it no longer exists once the data has been deleted.
Now all we need to do is to search the transaction log. However, there’s one more small step before we can do that. In the transaction log the FileID and PageID are stored as fixed length hexadecimal values, and we’ve just pulled them out in decimal format.
I convert them with PowerShell as it’s a little easier for me to remember:
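# A quick sketch of the conversion; the file and page IDs below are just the example
# values used in the query that follows
$fileId = 1     # datafile ID from fn_PhysLocFormatter
$pageId = 336   # page ID from fn_PhysLocFormatter (336 decimal = 150 hex)
'{0}:{1}' -f $fileId.ToString('X4'), $pageId.ToString('X8')
# Returns 0001:00000150, the format fn_dblog uses in its [Page ID] column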
Note how the FileID and PageID are padded out to 4 characters and 8 characters respectively. The following query will return all the transactions that affected that slot on that page in that file. Depending on how busy that page is, that might be a lot.
select * from fn_dblog(NULL,NULL) where [Page ID]='0001:00000150' and [Slot ID]='58'
Finding out who and when
You’ll instantly find out just how much information there is in a transaction log record. So let’s trim the data down to just the couple of things we want to know:
Who issued the command
Exactly when it was run
And also limit it to just delete records:
select [Transaction ID],[Begin Time],[Transaction SID] from fn_dblog(NULL,NULL) where [Page ID]='0001:00000150' and [Slot ID]='58' and Operation='LOP_DELETE_ROWS'
If you’re lucky and this returns a single row, you’ll find that only [Transaction ID] is populated, as the other goodies are recorded at the Transaction wrapper level, not the statement level. So we take the Transaction ID and use that:
select [Transaction ID],[Begin Time],suser_sname([Transaction SID]) from fn_dblog(NULL,NULL) where [Transaction ID]='0000:000003ba'
This returns all the rows for the specified transaction in the transaction log. You’ll have 2 rows for the Transaction wrapper, the BEGIN and the COMMIT; these will be the first and last rows.
The rows in between will be the actual deletion records. You might have one or more, depending on how much data was in the row, or if it was part of a delete that removed multiple rows.
The BEGIN transaction record contains the information you wanted:
[Begin Time] gives you the time the delete started
[Transaction SID] gives you the Security Identifier of the account that executed the transaction.
We use SUSER_SNAME to convert the SID to a username. This is the reason I mentioned that you want to do this on the same instance the original database was on: it makes sure the SIDs match up and you get the right name!
Conclusion
As you can see, everything you’d ever want to know about what happened in SQL Server is in the transaction log. But it’s not an easy beast to work with, so it’s much better to have proper auditing in place to save you having to do this.
It’s all been a bit quiet around here with lots on at work and training for a marathon. One piece of news I’ve not mentioned is that I’m now CISSP certified 👍. I sat the exam in December, but it’s taken a while for the paperwork and accreditation to get sorted and for everything to become official.
The main thing I’ve taken away from studying for the certification is that process rules everything. As in-the-trenches DBAs we’re more worried about whether the specifics of SQL Server permissions or Oracle auditing are correct. But why are we doing those things, and are we doing them correctly?
Correctly doesn’t just mean technically correct; we’re almost certainly doing them that way, or things would break. What I mean is: are we correctly implementing the processes and policies that drive the rest of the business?
For example, everyone loves backups (yeah, I know, this is my favourite thing as well). But how much should you be keeping? Do you really need 3 years of backups? Is that just increasing the amount of data you could lose in a breach? Are the older backups encrypted, and would restoring and encrypting them break the purposes they were kept for?
If you’re never going to use them why are you keeping them? If it’s just for a CYA audit reason, then why not just keep the audit logs? Less chance of leaking PII or Financial data if you’ve just got the bare bones of X did Y on Z.
This alignment with organisational policy is a core requirement for doing SQL Server security correctly. It will allow you to concentrate on exactly what needs to be done, and not just running around implementing ad-hoc fixes every time a hole appears.
Expect more posts on this topic over the coming months. And if you want a deep dive into SQL Server Security, then I’m presenting a full day workshop at SqlBits 2020 – SQL Server Security from the Ground Up on Wednesday 1st April.