Musings of a Data professional

Stuart Moore

Deleting Large amounts of data in SQL Server – Part 1

There comes a time when you’ll be asked to remove a large amount of data from one of your SQL Server databases. Perhaps an archive of order information needs pruning, or those session records that aren’t needed anymore have to go.

You’ve been off to have a look at it, and it all looks simple enough. There’s a couple of million rows to delete, but you can identify them all easily. How easy is this going to be, you think:

DELETE FROM Table1 where DateAdded<getdate()-90;

And then it just sits there, and sits there, and sits there. Or you get a nice error message about breaking a key constraint. What to do about it?

The reason it sat there spinning away is that SQL Server runs it as one big transaction. So when you tell it to delete 20 million rows, it’s going to take locks on all 20 million records, escalating nicely from row locks up to a full table lock, and every one of those deletes has to be written to the transaction log before the transaction can commit.

Now not only is your SSMS window spinning away, but you’re putting a large load onto the back-end SQL Server. Users are going to be blocked from performing their actions, and you’re going to have some phenomenally large transaction log growth to manage.
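
If you want to watch this happening, something like the following (run from a second query window while the big delete is going; it’s general-purpose monitoring, not specific to this post’s script) will show you the locks the session is holding and how full the transaction logs are getting:

--Run from a second session while the big delete is executing
--How many locks each session is holding, broken down by type and mode
select request_session_id, resource_type, request_mode, count(*) as lock_count
from sys.dm_tran_locks
group by request_session_id, resource_type, request_mode
order by lock_count desc;

--How full each database's transaction log currently is
dbcc sqlperf(logspace);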

Hopefully you’ve killed the process off before there’s been too much damage. So what’s Plan B? What you need to do is break the job down into smaller batches and run those in a loop. We’re not going to use cursors here, just a table variable and a T-SQL while loop. This solution is based on a template I use a lot. I’m going to build it up in sections throughout this post, so I’ll be introducing some variables that might not seem that useful now but will make sense later. The demo assumes your table has a unique primary key which is a GUID; it can be modified to handle composite keys as well if you aren’t that lucky (there’s a sketch of that variation after the first script below).

declare @RowsToDelete integer, @RowCount integer, @RowsToIgnore integer;
declare @Stop bit;
declare @DeletedTable table (UniqueID uniqueidentifier not null primary key);

--Size of batch
select @RowsToDelete = 10000;

--This allows us to say how many rows to ignore. Sometimes rows are added faster than you can delete them, and this is to prevent constantly looping
Select @RowsToIgnore = 5;

--This controls if we continue with the deletions
select @Stop = 0;

while (@stop = 0)
begin
    begin transaction
    --Grab the keys of the next batch of rows to delete
    insert into @DeletedTable select top (@RowsToDelete) Table1UniqueID from Table1 where DateAdded<getdate()-90;
    select @RowCount = count(1) from @DeletedTable;
    --If we're down to the last few rows, stop looping
    select @stop = case when @RowCount <= @RowsToIgnore then 1 else 0 end;
    if (@stop = 0)
    begin
        --Only delete the rows whose keys we captured above
        delete T1 from Table1 T1 inner join @DeletedTable dt on T1.Table1UniqueID = dt.UniqueID;
    end
    --Empty the table variable ready for the next batch
    delete from @DeletedTable;
    commit transaction
end

There we have a nice little bit of T-SQL that will continuously nibble away at the records you want to delete. Hopefully any locks taken will be short-lived, and your normal transaction log backups will take care of the log growth.
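
I mentioned above that the template can be adapted for composite primary keys. Here’s a rough sketch of how the key capture and the delete change; KeyPart1 and KeyPart2 are just placeholder column names and types, so adjust them to match your table:

--Composite key version: capture every key column and join on all of them
declare @DeletedTable table (KeyPart1 int not null, KeyPart2 int not null, primary key (KeyPart1, KeyPart2));

insert into @DeletedTable (KeyPart1, KeyPart2)
select top (@RowsToDelete) KeyPart1, KeyPart2 from Table1 where DateAdded<getdate()-90;

delete T1
from Table1 T1
inner join @DeletedTable dt on T1.KeyPart1 = dt.KeyPart1 and T1.KeyPart2 = dt.KeyPart2;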

You’ll find that there’ll be an optimal value for @RowsToDelete for your particular scenario. It might be ten thousand or it might be lower. You’ll need to experiment on a test system to find your sweet spot. Talking of testing, let’s add another option to our mix. We’ll add a counter so we can say ‘only run this many times’ and get a good estimate of timings before doing a full run:

declare @RowsToDelete integer, @RowCount integer, @RowsToIgnore integer, @RunCount integer;
declare @Stop bit;
declare @DeletedTable table (UniqueID uniqueidentifier not null primary key);

--Size of batch
select @RowsToDelete = 10000;

--This allows us to say how many rows to ignore. Sometimes rows are added faster than you can delete them, and this is to prevent constantly looping
Select @RowsToIgnore = 5;

--This controls if we continue with the deletions
select @Stop = 0;

--We'll set this to run 10 times:
select @RunCount = 10;

while (@stop = 0)
begin
    begin transaction
    insert into @DeletedTable select top (@RowsToDelete) Table1UniqueID from Table1 where DateAdded<getdate()-90;
    select @RowCount = count(1) from @DeletedTable;
    select @stop = case when @RowCount <= @RowsToIgnore then 1 else 0 end;
    --Stop once the run counter reaches 0 (set it to -1 to run until finished)
    select @stop = case when @RunCount = 0 then 1 else @stop end;
    if (@stop = 0)
    begin
        delete T1 from Table1 T1 inner join @DeletedTable dt on T1.Table1UniqueID = dt.UniqueID;
    end
    delete from @DeletedTable;
    commit transaction
    select @RunCount  = @RunCount - 1;
end

Now our little routine will run 10 times and then stop. You’ll notice that I do an explicit check for @RunCount equal to 0; this is so I can set @RunCount to -1 if I want the routine to run until it’s finished while still recording the number of executions.

Excellent, we can now run it in small controlled sections gradually nibbling away at the huge task. We can extend it a little more as well if we want. How about adding a control to stop it after it’s been running for a certain time? Say you want to start it as you leave the office, but want it to stop before the overnight processing happens. Something like this:

declare @RowsToDelete integer, @RowCount integer, @RowsToIgnore integer, @RunCount integer, @HoursToRun integer;
declare @Stop bit;
declare @StartTime datetime;
declare @DeletedTable table (UniqueID uniqueidentifier not null primary key);

--Size of batch
select @RowsToDelete = 10000;

--This allows us to say how many rows to ignore. Sometimes rows are added faster than you can delete them, and this is to prevent constantly looping
Select @RowsToIgnore = 5;

--This controls if we continue with the deletions
select @Stop = 0;

--Set to -1 this time so we keep on running.
select @RunCount = -1;

--Need to record the start time so we can compare later

select @StartTime = CURRENT_TIMESTAMP;

--Let's let it run for 4 hours.
select @HoursToRun = 4;

while (@stop = 0)
begin
    begin transaction
    insert into @DeletedTable select top (@RowsToDelete) Table1UniqueID from Table1 where DateAdded<getdate()-90;
    select @RowCount = count(1) from @DeletedTable;
    select @stop = case when @RowCount <= @RowsToIgnore then 1 else 0 end;
    select @stop = case when @RunCount = 0 then 1 else @stop end;
    --Stop once we've passed the time limit
    select @stop = case when datediff(hh, @StartTime, CURRENT_TIMESTAMP) <= @HoursToRun then @stop else 1 end;
    if (@stop = 0)
    begin
        delete T1 from Table1 T1 inner join @DeletedTable dt on T1.Table1UniqueID = dt.UniqueID;
    end
    delete from @DeletedTable;
    commit transaction;
    select @RunCount  = @RunCount - 1;
end

This is a case where you’ll want to be careful with your deletion batch sizing: the script only stops the first batch that starts after the time is up. So if your batches take 30 minutes to run, the batch that starts 3 hours and 59 minutes after the start will carry on running until 4 hours and 29 minutes.
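
Also bear in mind that datediff(hh, …) only counts the hour boundaries crossed, so the check can be out by the best part of an hour depending on when you start. If you want a tighter cut-off, a small tweak (not in the script above) is to do the comparison in minutes instead, although a batch already in flight will still run to completion:

    --Minute-granularity version of the elapsed time check
    select @stop = case when datediff(mi, @StartTime, CURRENT_TIMESTAMP) >= @HoursToRun * 60 then 1 else @stop end;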

I’ve been using this script as a basis for large batch deletions for a number of years. You can easily extend it to fit different scenarios. If you find it useful, or find a scenario it really helps out with, then please let me know.

(And those Foreign key violation error messages? I’ll be dealing with those in Part 2)


4 Comments

  1. Dan

    Hi Stuart,

    I am curious, what is your opinion of your divide-and-conquer vs. the save to temp table, truncate, bulk insert from temp table method described here:
    http://social.technet.microsoft.com/wiki/contents/articles/20651.sql-server-delete-a-huge-amount-of-data-from-a-table.aspx ? Is there any benefit to having the individual delete transactions in your transaction log backups? Is there some other advantage of this method?

    • Stuart Moore

      Depends on the size of the table and how much space you have. I’ve recently had to trim a 100GB table down to size. The organisation couldn’t afford to have a 2nd copy of that table on their limited SAN (that’s why they wanted the table pruning in the first place).

      Also, that approach would need to have the application offline, or have users with missing/inconsistent data during the reinsert phase. And during that insert you’re going to have to be very careful about reseeding any identity columns to prevent clashes.
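
      For reference, reseeding would be something along the lines of the following (the table name and seed value here are just placeholders):

      dbcc checkident ('dbo.Table1', reseed, 1000000);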

      And as I go into the now published Part 2 (sorry about that again!), this method also works when you have foreign keys or constraints on your data. Those would be a major worry for me with the truncate/insert method; there are workarounds, but the risk of data inconsistency would stop me from considering most of them.

      Transaction log size overhead is usually the least of my worries. If needed I’ll just run more frequent t-log backups while the script is running (I’ve dropped to 2 minutes before).
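
      A minimal example of that kind of ad-hoc log backup (the database name and path are placeholders):

      backup log [YourDatabase] to disk = N'X:\SQLBackups\YourDatabase_log.trn' with stats = 10;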

      As always with these decisions it’s going to come down to what you need to do, what resources you have available and what the business requirements are. Test, test and test some more before trying either approach on production data where you can’t undo!

  2. Dan

    Where is Part 2? It’s really annoying when blogs make you search for the other parts of a multi-part series rather than just linking them together.
