Reading SQL Server Backup file headers directly using PowerShell.

As part of my work on the Backup/Restore portions of DBATools I was interested to see if I could avoid the overhead of having SQL Server parse files to find out if they were backup files and, if they were, what they contained. (I’m a DBA, ergo I’m paranoid, ergo I’m not trusting a file extension!)

Given that SQL Server has a RESTORE HEADERONLY option, and the speed with which it returns results even from large backup files, I was hoping to find all the information I needed fairly early on in the file headers.

In this post I’m going to show how I worked through this using PowerShell (DBATools is trying to avoid using anything other than easily available PS cmdlets, so pulling in DLLs and other .NET objects would feel a little wrong!).

tl;dr

It sort of works. Not very useful with compressed backups, but reliable for filtering normal backups in a hurry.

Read on for code

So the first thing I wanted was a good sample of database backups that I could be sure were going to be pretty similar, and also tiny just in case I needed to scan the entire file.

So using this T-SQL I spun up an empty database, and then created 10 Full Database Backups and 10 Transaction Log Backups. Between each backup I did the least amount of work I could think of, just to make sure the LSNs were incremented nicely.

use master
go

if exists (select * from sys.databases where name='logcreate')
begin
	ALTER DATABASE logcreate SET single_USER with rollback immediate
	drop database logcreate
end
go


create database logcreate
go

alter database logcreate set recovery full
go

use logcreate
go

IF EXISTS (SELECT * FROM SYS.tables WHERE name='steps')
begin
	drop table steps
end
go

create table steps(
step integer,
dt datetime2
);
go

declare @i integer
declare @fn varchar(50)

backup database [logcreate] to disk='C:\SQLAutoRestoreUAT\logcreate\logcreate.bak'
set @i=1

while (@i<10)
begin
insert into steps values (@i, getdate())
select @i=@i+1
select @fn='C:\SQLAutoRestoreUAT\logcreate\logcreate_'+convert(varchar(2),@i)+'.bak'
backup database [logcreate] to disk=@fn
waitfor delay '00:00:30'
end

set @i=1
while (@i<10)
begin
insert into steps values (@i, getdate())
select @i=@i+1
select @fn='C:\SQLAutoRestoreUAT\logcreate\logcreate_'+convert(varchar(2),@i)+'.trn'
backup log [logcreate] to disk=@fn
waitfor delay '00:00:30'
end

Marvellous: a solid set of full backup and transaction log files.

Using Get-Content to read in a file is pretty standard fare in PowerShell. Everyone uses it to grab text/CSV/JSON data before piping it to something more useful. It can also handle various encodings, including raw bytes (-Encoding Byte).

I looked at Format-Hex, but that’s only available where $PSVersionTable.PSVersion.Major -ge 5, so not too portable unfortunately, and it also doesn’t make it easy to cross-check across a large number of files.

The methodology was:

  • read in 10 files of one particular type from a folder
  • read the first 10000 bytes of each file into an array, then add that to an array of arrays ($bfiles)
  • loop through the 10000 byte offsets
  • on each iteration, cross-check the values from each file at that offset
  • if they’re all the same and non-zero, turn the byte into a character and add it to an output array

And lo, this thinking begat this:

$cnt = 10000
$files = Get-ChildItem C:\SQLAutoRestoreUAT\logcreate\ -Filter *.bak -Recurse -File
$bfiles = New-Object System.Collections.ArrayList
$obytes = New-Object System.Collections.ArrayList
foreach ($file in $files)
{
    $null = $bfiles.Add([Byte[]] (Get-Content -Path $file.FullName -TotalCount $cnt -Encoding Byte))
}

for ($i=0; $i -lt $cnt; $i++){
    $q = 0
    for ($j=1; $j -lt $files.count; $j++){
        if ($bfiles[($j-1)][$i] -eq $bfiles[$j][$i] -and $bfiles[$j][$i] -ne 0){
            $q++
        }
    }
    # if every pairwise comparison matched, this byte is common to all 10 files
    if ($q -eq ($files.count-1)){
        $null = $obytes.Add([char][byte]$bfiles[0][$i])
    }
}

Now there was a nice array, $obytes, full of information; it just needed turning into something readable, so a quick -join:

$obytes -join ('')

And I had this for the log files:

TAPEŒ‡,^ƒ Microsoft SQL ServerRAID ;­ðSPAD&4SFMB€‡7HSSET˜dózƒ LAB\STUARTADMINSPADR@VOLBtbIM[ƒ C:WindowsSTU100LAB-02SPADvdMSCI8:MQCIVXSCINØA	""""B³ƒ 
|pƒ L‚	T6¢	BOÑ·Ieé@‹ó²¥9å¶Ú"÷ç¼ú˜8C»O|ƒ,L#Âq[[oÝ·L¿èä®ÏnM	Ð÷ç¼ú˜8C»O|ƒ,L#Â"B³"logcreateSTU100LAB-02\SQLEXPRESS2016SFGIîà""""B³""÷ç¼ú˜8C»O|ƒ,L#ÂPRIMARYSFINÀ €ªàÿÿÿ€àÿÿÿ€"B³q
[[oÝ·L¿èä®ÏnM‚ôöÒÔÈO»[O“„%="""¿ÊùpWÒHlogcreateC:\Program Files\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016\MSSQL\DATA\logcreate.mdfSFINÐ €²€àÿÿÿ€JÁ¹û‰‘ØL™Zžwj´Ô"""Hlogcreate_logC:\Program File
s\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016\MSSQL\DATA\logcreate_log.ldfSPADFTMSTLTFø\" @ÀAPAD€~þMQTLP" @0 

and this for the Full backups:

TAPEŒ‡,^ƒ Microsoft SQL ServerRAID ;­ðSPAD&4SFMB€‡7HSSET˜dózƒ LAB\STUARTADMINSPADR@VOLBtbIM[ƒ C:WindowsSTU100LAb-02SPADvdMSCI8:MQCIVXSCINØA	"""ƒ |p
ƒ D‚	T6¢	BOÑ·Ieé@‹ó²¥9å¶Ú"÷ç¼ú˜8C»O|ƒ,L#Â	Ð÷ç¼ú˜8C»O|ƒ,L#Â"B³"logcreateSTU100LAB-02\SQLEXPRESS2016SFGIîà""""B³""÷ç¼ú˜8C»O|ƒ,L#ÂPRIMARYSFINÀ„ €ªàÿÿÿ€àÿÿÿ€‚ôöÒÔÈO»[O“„%="""Hl
ogcreateC:\Program Files\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016\MSSQL\DATA\logcreate.mdfSFINÐ €²€àÿÿÿ€JÁ¹û‰‘ØL™Zžwj´Ô"""Hlogcreate_logC:\Program Files\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016
\MSSQL\DATA\logcreate_log.ldfSPADFTMSDA<>APAD€–MQDAci01.ƒ‡‹“§±±µ¹½ÁËçñû/3==O_______oooyƒŸ©¹ÕÝ5BOÑ·Ieé@‹ó²¥9å¶Úÿÿÿÿ ÿÿÿÿ‚ôöÒÔÈO»[O“„%=logcreate

A couple of things quickly jump out:

  • The files start with TAPE, which gives us a fairly quick way of checking it’s a SQL Server backup file.
  • I can see the DB owner (my lab account – LAB\STUARTADMIN)
  • I can see the server the db was on (a lab machine – STU100LAB-02)
  • I can see the filelist information:

logcreateC:\Program Files\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016\MSSQL\DATA\logcreate.mdf
logcreate_logC:\Program Files\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016\MSSQL\DATA\logcreate_log.ldf

The most interesting piece, though, is the single character at offset 3750. For a Full backup it’s a D; for a Log backup it’s an L. That seems remarkably useful and sane to me. Just to prove the point I checked a bunch of Differential backups and found an I at that offset (a hangover from “Incremental”, perhaps?).

And the other information?

So far so good, but the other thing I’d really want is the Log Sequence Number (LSN) information. So how about searching for that?

Using RESTORE HEADERONLY on my first trn backup I got the values:

FirstLSN          34000000006600179
LastLSN           34000000038100001
CheckpointLSN     34000000006600179
DatabaseBackupLSN 34000000006600179

Let’s try searching the file for that value, by converting it to characters:

$spfile = ([Byte[]] (Get-Content -Path C:\SQLAutoRestoreUAT\logcreate\logcreate_1.trn -Encoding Byte))
$charfile = ($spfile | ForEach-Object{[char][byte]$_}) -join ('')
$charfile.contains('34000000006600179')

No success. A quick review of my LSN notes, and I remember that SQL Server actually stores them as hexadecimal values in log files, the decimal version only being for the wetware to use. Luckily the notes also pointed to a Microsoft function to convert them – dbo.fn_convertnumericlsntobinary. So now I knew I was actually looking for 00000022:00000042:00B3. So let’s try that, but first we need to convert the file contents to hex:
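Working that split out by hand is straightforward; here’s a scratch-pad sketch (in Python rather than PowerShell, purely for illustration) of how the numeric LSN appears to decompose — a VLF sequence number, a 10-digit block and a 5-digit slot glued together as decimal digits:

```python
def numeric_lsn_to_hex(lsn):
    # The numeric LSN appears to be three fields concatenated as digits:
    # <VLF sequence><10-digit block><5-digit slot>. Split them back out...
    vlf = lsn // 10**15
    block = (lsn // 10**5) % 10**10
    slot = lsn % 10**5
    # ...and format as the colon-separated hex form SQL Server stores
    return f"{vlf:08X}:{block:08X}:{slot:04X}"

print(numeric_lsn_to_hex(34000000006600179))  # 00000022:00000042:00B3
```

Running it against the FirstLSN above gives exactly the value dbo.fn_convertnumericlsntobinary produced.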

$hexfile=($spfile | %{("0{0:X}" -f $_).Substring(("0{0:X}" -f $_).length-2)}) -join ('')

Yep, bad coding style, but it works for playtime!
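(As an aside, the same padded two-digit hex dump is a one-liner in most languages; in Python, for instance, bytes.hex() does the padding for you:)

```python
# A few sample bytes, including one below 0x10 to show the zero padding
spfile = bytes([0x54, 0x41, 0x50, 0x45, 0x00, 0xB3])
hexfile = spfile.hex().upper()
print(hexfile)  # 5441504500B3
```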

So let’s search for that hex value:

$hexfile.contains("000000220000004200B3")
False

Oops, that shouldn’t have happened. But then again, Windows is little-endian, so let’s try it that way!
(Just in case you’ve not come across endianness before: it’s the order in which a computer stores the bytes of a multi-byte number. A little-endian system stores the least significant byte first (so 00B3 is stored as B300 — each hex pair is one byte), while a big-endian system stores it as 00B3. Windows is little-endian; some other platforms, such as SPARC Solaris or the old PowerPC Macs, are big-endian.)
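To see the byte order in action — here the 00B3 slot value from the LSN above — a quick Python sketch:

```python
# The slot part of the LSN, 0xB3, written out as a 2-byte value
little = (0xB3).to_bytes(2, 'little').hex().upper()
big    = (0xB3).to_bytes(2, 'big').hex().upper()
print(little, big)  # B300 00B3
```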

$hexfile.Contains("2200000042000000B300")
True
$hexfile.IndexOf("2200000042000000B300")
7388

Yep, we’ve found it! Now, there should be 4 LSNs in there, so let’s see if we can find them all. PowerShell’s string IndexOf method only returns the first occurrence unless we seed it with a new starting point, so again, quick and dirty:

$a = $hexfile.IndexOf("2200000042000000B300")
$b = $hexfile.IndexOf("220000007D0100000100")
$c = $hexfile.IndexOf("2200000042000000B300", $a+1)
$d = $hexfile.IndexOf("2200000042000000B300", $c+1)
write-host "$a - $b - $c - $d"
7388 - 7408 - 7428 - 7448

20 positions between each one, so they line up nicely next to each other. To save us having to do the full hex conversion, can we pull them out of the byte array piecemeal?

Yes, we can:

($spfile[3694..3703] | %{("0{0:X}" -f $_).Substring(("0{0:X}" -f $_).length-2)}) -join ('')

And a quick peek at my other backup files and it seems to hold true for Full and Differentials as well. The one thing I’ve not checked is what happens if you have multiple backups in a single file; that’s on the to-do list! I’ve also left out converting the hex back to the numeric LSNs we normally see, but that can’t be too hard…..

Success?

So now I can grab all that info, let’s try it in the wild. I’m going to scan a random collection of backup files I happen to have lying around from various sources and systems. To keep things simple, I’m just going to check they’re SQL Server backup files, and filter them into Full, Diff and Transaction backups:

$files = Get-ChildItem c:\SQLBackups\ -Recurse -File
foreach ($file in $files){
    $r = ([Byte[]] (Get-Content -Path $file.fullname -TotalCount 3750 -Encoding Byte))
    $btype = switch ([char][byte]$r[3749])
    {
        'D' {'FULL'}
        'L' {'LOG'}
        'I' {'DIFF'}
        Default {'WHO KNOWS'}
    }
    write-host "$($file.fullname) is $btype"
}

And we get a *LOT* of “WHO KNOWS”! Wait, what, how? Everything in that folder is a SQL Server backup of one kind or another. They’re all under the same folder, so probably from the same database. Grab some, run them through RESTORE HEADERONLY, and everything comes back nicely. The Compressed column (the 5th one) has the answer: they’re compressed backups.

Wondering if there was a ‘normal’ set of headers in there somewhere, I pushed the files through the process above to see if anything came through; unfortunately the only commonality is that they all start with MSSQLBAK. All my digging around suggests the compression algorithm is proprietary, so that’s the end of that avenue.
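So the best we can do for compressed files is recognise them. A rough classifier based purely on the two magic strings seen above (a hypothetical helper sketched in Python — the prefixes are just what my sample files showed, not a documented format):

```python
def classify_backup(path):
    """Classify a file by its leading magic bytes, as observed in this post:
    'TAPE' for a native SQL Server backup, 'MSSQLBAK' for a compressed one.
    Hypothetical helper for illustration, not part of DBATools."""
    with open(path, 'rb') as f:
        head = f.read(8)
    if head.startswith(b'TAPE'):
        return 'native backup'
    if head.startswith(b'MSSQLBAK'):
        return 'compressed backup'
    return 'not a SQL Server backup'
```

Anything falling into the last bucket still needs a trip through RESTORE HEADERONLY to be sure.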

And I’m guessing encrypted backups are going to be an even bigger pain to work with…..

A waste of time then?

No, not really. I’ve got a faster way of checking whether a file is a SQL Server backup file, so I don’t need to waste time sending a badly named text file up to the SQL box to read its headers. That’s a good first pass. If it’s not compressed or encrypted then I can also pull information out of it directly. And in the wide world that still leaves a lot of files we can potentially rip through without touching a SQL box.

And I’ve got to poke around with some low-level PowerShell, which is a good learning experience.
