As part of my work on the Backup/Restore portions of DBATools I was interested to see if I could avoid the overhead of having SQL Server parse files to find out whether they were backup files and, if they were, what they contained. (I’m a DBA, ergo I’m paranoid, ergo I’m not trusting a file extension!)
Working on the basis that SQL Server has a RESTORE HEADERONLY option, and that results come back quickly even from large backup files, I was hoping to find all the information I needed fairly early on in the file headers.
In this post I’m going to show how I worked through this using PowerShell (DBATools is trying to avoid using anything other than easily available PS cmdlets, so pulling dlls and other .Net objects would feel a little wrong!).
tl;dr
It sort of works. Not very useful with compressed backups, but reliable for filtering normal backups in a hurry.
Read on for code
So the first thing I wanted was a good sample of database backups that I could be sure were going to be pretty similar, and also tiny just in case I needed to scan the entire file.
So using this T-SQL I spun up an empty database, and then created 10 full database backups and 10 transaction log backups. Between each backup I did the least amount of work I could think of, just to make sure the LSNs were incremented nicely.
use master
go

if exists (select * from sys.databases where name = 'logcreate')
begin
    alter database logcreate set single_user with rollback immediate
    drop database logcreate
end
go

create database logcreate
go

alter database logcreate set recovery full
go

use logcreate
go

if exists (select * from sys.tables where name = 'steps')
begin
    drop table steps
end
go

create table steps(
    step integer,
    dt datetime2
);
go

declare @i integer
declare @fn varchar(50)

-- Initial full backup
backup database [logcreate] to disk = 'C:\SQLAutoRestoreUAT\logcreate\logcreate.bak'

-- More full backups, with a little work between each one
set @i = 1
while (@i < 10)
begin
    insert into steps values (@i, getdate())
    select @i = @i + 1
    select @fn = 'C:\SQLAutoRestoreUAT\logcreate\logcreate_' + convert(varchar(2), @i) + '.bak'
    backup database [logcreate] to disk = @fn
    waitfor delay '00:00:30'
end

-- And the same again for transaction log backups
set @i = 1
while (@i < 10)
begin
    insert into steps values (@i, getdate())
    select @i = @i + 1
    select @fn = 'C:\SQLAutoRestoreUAT\logcreate\logcreate_' + convert(varchar(2), @i) + '.trn'
    backup log [logcreate] to disk = @fn
    waitfor delay '00:00:30'
end
Marvellous, 10 solid backup and transaction files.
Using Get-Content to read in a file is pretty standard fare in PowerShell. Everyone uses it to grab text/CSV/JSON data before piping it to something more useful, and it can also handle various encodings, including raw bytes.
I looked at Format-Hex, but that’s only available on PowerShell 5 and above ($PSVersionTable.PSVersion -ge 5), so it’s not too portable unfortunately, and it also doesn’t make it easy to cross check values across a large number of files.
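For example, something like this reads the first few bytes of one of the backups as raw bytes and shows them as hex pairs alongside their character equivalents (note -Encoding Byte is Windows PowerShell; PowerShell 6+ uses -AsByteStream instead):

# Read the first 16 bytes of one of the backups as raw bytes
$bytes = Get-Content -Path C:\SQLAutoRestoreUAT\logcreate\logcreate.bak -Encoding Byte -TotalCount 16

# Show each byte as a hex pair alongside its character equivalent
$bytes | ForEach-Object { '{0:X2}  {1}' -f $_, [char]$_ }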
The methodology was:
- Read in the 10 files of one particular type in a folder
- Read the first 10,000 bytes of each file into a byte array, and add that to an array of arrays ($bfiles)
- Loop through all 10,000 byte offsets
- On each iteration, cross check the values from each file at that offset
- If they’re all the same and non-zero, turn the value into a character and add it to an output array ($obytes)
And lo, this thinking begat this:
$cnt = 10000
$files = Get-ChildItem C:\SQLAutoRestoreUAT\logcreate\ -Filter *.bak -Recurse -File
$bfiles = New-Object System.Collections.ArrayList
$obytes = New-Object System.Collections.ArrayList

# Read the first $cnt bytes of each file into $bfiles
foreach ($file in $files) {
    $null = $bfiles.Add([Byte[]](Get-Content -Path $file.FullName -TotalCount $cnt -Encoding Byte))
}

# Walk every offset, comparing each file against its neighbour
for ($i = 0; $i -lt $cnt; $i++) {
    $q = 0
    for ($j = 1; $j -lt $files.Count; $j++) {
        if ($bfiles[$j - 1][$i] -eq $bfiles[$j][$i] -and $bfiles[$j][$i] -ne 0) {
            $q++
        }
    }
    # If every adjacent pair matched, keep the character at this offset
    if ($q -eq ($files.Count - 1)) {
        $null = $obytes.Add([char][byte]$bfiles[$j - 1][$i])
    }
}
Now there was a nice array, $obytes, full of information; it just needed turning into something readable, so a quick -join:
$obytes -join ('')
And I had this for the log files:
TAPE,^ Microsoft SQL ServerRAID ;ðSPAD&4SFMB7HSSETdóz LAB\STUARTADMINSPADR@VOLBtbIM[ C:WindowsSTU100LAB-02SPADvdMSCI8:MQCIVXSCINØA """"B³ |p L T6¢ BOÑ·Ieé@ó²¥9å¶Ú"÷ç¼ú8C»O|,L#Âq[[oÝ·L¿èä®ÏnM Ð÷ç¼ú8C»O|,L#Â"B³"logcreateSTU100LAB-02\SQLEXPRESS2016SFGIîà""""B³""÷ç¼ú8C»O|,L#ÂPRIMARYSFINÀ ªàÿÿÿàÿÿÿ"B³q [[oÝ·L¿èä®ÏnMôöÒÔÈO»[O%="""¿ÊùpWÒHlogcreateC:\Program Files\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016\MSSQL\DATA\logcreate.mdfSFINÐ ²àÿÿÿJÁ¹ûØLZwj´Ô"""Hlogcreate_logC:\Program File s\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016\MSSQL\DATA\logcreate_log.ldfSPADFTMSTLTFø\" @ÀAPAD~þMQTLP" @0
and this for the Full backups:
TAPE,^ Microsoft SQL ServerRAID ;ðSPAD&4SFMB7HSSETdóz LAB\STUARTADMINSPADR@VOLBtbIM[ C:WindowsSTU100LAb-02SPADvdMSCI8:MQCIVXSCINØA """ |p D T6¢ BOÑ·Ieé@ó²¥9å¶Ú"÷ç¼ú8C»O|,L# Ð÷ç¼ú8C»O|,L#Â"B³"logcreateSTU100LAB-02\SQLEXPRESS2016SFGIîà""""B³""÷ç¼ú8C»O|,L#ÂPRIMARYSFINÀ ªàÿÿÿàÿÿÿôöÒÔÈO»[O%="""Hl ogcreateC:\Program Files\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016\MSSQL\DATA\logcreate.mdfSFINÐ ²àÿÿÿJÁ¹ûØLZwj´Ô"""Hlogcreate_logC:\Program Files\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016 \MSSQL\DATA\logcreate_log.ldfSPADFTMSDA<>APADMQDAci01.§±±µ¹½ÁËçñû/3==O_______oooy©¹ÕÝ5BOÑ·Ieé@ó²¥9å¶Úÿÿÿÿ ÿÿÿÿôöÒÔÈO»[O%=logcreate
A couple of things quickly jump out:
The files start with TAPE, which gives us a fairly quick way of checking it’s a SQL Server backup file.
I can see the DB Owner (my lab account – LAB\STUARTADMIN)
The server the db was on (a lab machine – STU100LAb-02)
I can see the Filelist information:
logcreateC:\Program Files\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016\MSSQL\DATA\logcreate.mdf
logcreate_logC:\Program Files\Microsoft SQL Server\MSSQL13.SQLEXPRESS2016\MSSQL\DATA\logcreate_log.ldf
The most interesting piece, though, is the single character at offset 3750. For a Full backup it’s a D. For a Log backup it’s an L. That seems remarkably useful and sane to me. Just to prove a point I checked a bunch of Differential backups and found an I at that offset (a hangover from Incremental perhaps?)
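Putting those observations together gives a very cheap sanity check; something along these lines (the path is just one of my lab files, and the offsets are only what I’ve observed so far):

# Grab just enough of the header to cover the type byte at offset 3750
$hdr = [Byte[]](Get-Content -Path C:\SQLAutoRestoreUAT\logcreate\logcreate.bak -TotalCount 3750 -Encoding Byte)

# Native SQL Server backups seem to start with the characters TAPE
$magic = ($hdr[0..3] | ForEach-Object { [char]$_ }) -join ''
if ($magic -eq 'TAPE') {
    # ...and byte 3750 (index 3749) appears to carry the backup type: D = Full, L = Log, I = Diff
    [char]$hdr[3749]
}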
And the other information?
So far so good, but the other thing I’d really want is the Log Sequence Number (LSN) information. So how about searching for that?
Using RESTORE HEADERONLY on my first trn backup I got the values:
| FirstLSN | LastLSN | CheckpointLSN | DatabaseBackupLSN |
|---|---|---|---|
| 34000000006600179 | 34000000038100001 | 34000000006600179 | 34000000006600179 |
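(For reference, those values are straight out of RESTORE HEADERONLY; if you want them from PowerShell rather than SSMS, something like this does the job, assuming the SqlServer module’s Invoke-Sqlcmd is to hand and using my lab server and path as placeholders:)

# Assumes the SqlServer (or SQLPS) module is available for Invoke-Sqlcmd
Invoke-Sqlcmd -ServerInstance 'STU100LAB-02\SQLEXPRESS2016' -Query "RESTORE HEADERONLY FROM DISK = N'C:\SQLAutoRestoreUAT\logcreate\logcreate_1.trn'" |
    Select-Object FirstLSN, LastLSN, CheckpointLSN, DatabaseBackupLSN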
Let’s try searching the file for that value, by converting it to characters:
$spfile = [Byte[]](Get-Content -Path C:\SQLAutoRestoreUAT\logcreate\logcreate_1.trn -Encoding Byte)
$charfile = ($spfile | ForEach-Object { [char][byte]$_ }) -join ('')
$charfile.Contains('34000000006600179')
No success. A quick review of my LSN notes, and I remember that SQL Server actually stores them as hexadecimal values in log files, the decimal version only being for the wetware to use. Luckily the notes also pointed to a Microsoft function to convert them – dbo.fn_convertnumericlsntobinary. So now I knew I was actually looking for 00000022:00000042:00B3.
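If you don’t happen to have that function to hand, the conversion is easy enough to sketch in PowerShell, assuming the usual decomposition of a numeric LSN into a VLF sequence number, a 10-digit log block offset and a 5-digit slot:

# Rough equivalent of what fn_convertnumericlsntobinary does:
# split the numeric LSN into VLF sequence : log block offset : slot, then show each as hex
$lsn   = '34000000006600179'
$slot  = [int]$lsn.Substring($lsn.Length - 5)            # last 5 digits
$block = [int64]$lsn.Substring($lsn.Length - 15, 10)     # preceding 10 digits
$vlf   = [int64]$lsn.Substring(0, $lsn.Length - 15)      # whatever is left

'{0:X8}:{1:X8}:{2:X4}' -f $vlf, $block, $slot            # 00000022:00000042:00B3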
So let’s try that, but first we need to convert the file contents to hex:
$hexfile=($spfile | %{("0{0:X}" -f $_).Substring(("0{0:X}" -f $_).length-2)}) -join ('')
Yep, bad coding style, but it works for playtime!
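(If you want the same hex string without the string surgery, plain .NET that’s already loaded will do it; either of these gives the same result:)

# Same hex string, built with .NET helpers instead
$hexfile = [System.BitConverter]::ToString($spfile) -replace '-', ''

# or, byte by byte:
$hexfile = ($spfile | ForEach-Object { $_.ToString('X2') }) -join ''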
So let’s search for that hex value:
$hexfile.Contains("000000220000004200B3")
False
Oops, that shouldn’t have happened. But then again, Windows is Little Endian, so let’s try that way!
(Just in case you’ve not come across endianness before, it’s about the order in which a computer stores the bytes of a number: a little endian system stores the least significant byte first (so 00B3 is stored as B3 00, each hex pair being one byte), while a big endian system stores it as 00 B3. Windows on x86 is little endian; big endian turns up on platforms like SPARC Solaris or PowerPC-era Macs.)
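Rather than reversing the byte pairs by hand, we can let .NET do it for us; on a little endian box BitConverter.GetBytes hands the bytes back in exactly the order they sit in the file. A quick sketch to build the search string:

# Build the little endian search string for 00000022:00000042:00B3
# GetBytes returns the bytes in the machine's own (little endian) byte order
$lsnBytes = [System.BitConverter]::GetBytes([uint32]0x22) +
            [System.BitConverter]::GetBytes([uint32]0x42) +
            [System.BitConverter]::GetBytes([uint16]0xB3)
($lsnBytes | ForEach-Object { $_.ToString('X2') }) -join ''    # 2200000042000000B300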
$hexfile.Contains("2200000042000000B300")
True
$hexfile.IndexOf("2200000042000000B300")
7388
Yep, we’ve found that! Now, there should be 4 LSNs in there, so let’s see if we can find them. PowerShell’s String IndexOf method only returns the first occurrence in the string unless we seed it with a new starting point, so again, quick and dirty:
$a = $hexfile.IndexOf("2200000042000000B300")
$b = $hexfile.IndexOf("220000007D0100000100")
$c = $hexfile.IndexOf("2200000042000000B300", $a+1)
$d = $hexfile.IndexOf("2200000042000000B300", $c+1)
Write-Host "$a - $b - $c - $d"

7388 - 7408 - 7428 - 7448
20 positions between each one, so they line up nicely next to each other (remember each byte is two hex characters, so position 7388 in the hex string is byte 3694 in the file). To save us having to do the full hex conversion, can we pull them out of the byte array piecemeal?
Yes, we can:
($spfile[3694..3702] | %{("0{0:X}" -f $_).Substring(("0{0:X}" -f $_).length-2)}) -join ('')
And a quick peek at my other backup files and it seems to hold true for Fulls and Differentials as well. The one thing I’ve not checked is what happens if you have multiple backups in a single file; that’s on the to-do list! I’ve also left out converting the hex back to the numeric LSNs we normally see, but that can’t be too hard…
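For what it’s worth, going back the other way looks simple enough too; a rough sketch, assuming the 4-byte / 4-byte / 2-byte little endian layout holds (note I’m grabbing ten bytes here, as the slot is two bytes):

# Rebuild the familiar numeric LSN from the raw bytes (VLF : block : slot)
$lsnBytes = [byte[]]$spfile[3694..3703]

$vlf   = [System.BitConverter]::ToUInt32($lsnBytes, 0)
$block = [System.BitConverter]::ToUInt32($lsnBytes, 4)
$slot  = [System.BitConverter]::ToUInt16($lsnBytes, 8)

'{0}{1:D10}{2:D5}' -f $vlf, $block, $slot    # 34000000006600179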
Success?
So now I can grab all that info, let’s try it in the wild. I’m going to scan a random collection of backup files I happen to have lying around from various sources and systems. To keep things simple, I’m just going to check that they’re SQL Server backup files, and filter them into Full, Diff and Transaction Log backups:
$files = Get-ChildItem c:\SQLBackups\ -Recurse -File

foreach ($file in $files) {
    # Only need the header up to the backup type byte at offset 3750
    $r = [Byte[]](Get-Content -Path $file.FullName -TotalCount 3750 -Encoding Byte)
    $btype = switch ([char][byte]$r[3749]) {
        'D' { 'FULL' }
        'L' { 'LOG' }
        'I' { 'DIFF' }
        Default { 'WHO KNOWS' }
    }
    Write-Host "$($file.FullName) is $btype"
}
And we get a *LOT* of “WHO KNOWS”! Wait, what, how? Everything in that folder is a SQL Server backup of one kind or another. They’re all under the same folder, so probably from the same database. Grab some and run them through RESTORE HEADERONLY and everything comes back nicely. The fifth column (Compressed) has the answer: they’re compressed backups.
Wondering if there’s a ‘normal’ set of headers hiding somewhere else in the file, I pushed them through the process above to see if anything came through; unfortunately the only commonality is that they all start with MSSQLBAK. All my digging around shows that the compression algorithm is proprietary, so that’s the end of that avenue.
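At least that MSSQLBAK marker gives us another cheap filter to bolt onto the earlier foreach loop, so compressed files stop showing up as “WHO KNOWS”; a rough sketch of the extra check (the marker strings are just what I’ve observed, so treat with caution):

# Compressed backups seen so far start with MSSQLBAK instead of TAPE
$hdr = [Byte[]](Get-Content -Path $file.FullName -TotalCount 8 -Encoding Byte)
$marker = ($hdr | ForEach-Object { [char]$_ }) -join ''

if ($marker -eq 'MSSQLBAK') { $btype = 'COMPRESSED' }
elseif ($marker.StartsWith('TAPE')) { $btype = 'NATIVE' }
else { $btype = 'NOT A SQL BACKUP?' }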
And I’m guessing encrypted backups are going to be an even bigger pain to work with…
A waste of time then?
No, not really. I’ve got a faster way of checking whether a file is a SQL Server backup file, so I don’t need to waste time sending a badly named text file up to the SQL box to read headers. So that’s a good first pass. If it’s not compressed or encrypted then I can also pull information out of it. And in the wide world that still leaves a lot of files we can potentially rip through without touching a SQL box.
And I’ve got to poke around with some low-level PowerShell, which is a good learning experience.