I've been tasked to try to recover a database which suffered from corruption (due to I/O failure, which has been fixed since). I'm not familiar with the database or what it contains.
I've been given an old (~3 weeks) full backup and a series of transaction logs... however there are missing transaction logs, so I can only recover up to a certain date. There's like 2.5 weeks of data missing (and there's a lot of data being added to this database constantly).
I also have been given a copy of the corrupt database (which is accessible, but with a lot of pages corrupt/missing).
I've tried the typical DBCC CHECKDB commands (no repair_allow_data_loss yet; that will be my last resort if nothing else works).
After much back and forth with the database (the db is a 1.5 terabyte little monster and everything I do is slow and takes a while), I've tried an online page restore of the corrupt pages from the last known good backup.
To do that, I wrote a script that generates many RESTORE DATABASE <foo> PAGE='pages' FROM DISK='<bar.bak>' commands from the DBCC CHECKDB output (basically a regex and a distinct)... so far so good; this worked up to the point where it said I had reached a limit of 1000 pages per file (there are 8 files on this db) per restore.
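For reference, a minimal sketch of that generator script in Python (the function and parameter names are mine, and the exact DBCC CHECKDB output format varies, but corrupt pages are reported as (file_id:page_id) pairs):

```python
import re

# Hypothetical sketch of the "regex and a distinct" generator: pull the
# (file_id:page_id) pairs that DBCC CHECKDB reports, de-duplicate them,
# and emit RESTORE ... PAGE commands in batches.
PAGE_RE = re.compile(r"\((\d+):(\d+)\)")  # pages show up as e.g. (1:57)

def restore_commands(checkdb_output, database, backup_file, batch=64):
    # distinct, sorted (file_id, page_id) pairs
    pages = sorted({(int(f), int(p)) for f, p in PAGE_RE.findall(checkdb_output)})
    page_ids = [f"{f}:{p}" for f, p in pages]
    return [
        f"RESTORE DATABASE [{database}] PAGE='{', '.join(page_ids[i:i + batch])}' "
        f"FROM DISK='{backup_file}' WITH NORECOVERY;"
        for i in range(0, len(page_ids), batch)
    ]
```

Note that batching per command doesn't get around the limit: per the documentation quoted in the answer below, the 1000-pages-per-file cap applies to the whole restore sequence, not to each command.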
So it asks me to "complete the online restore", but I'm at a loss as to how to do that... I don't have a tail log or anything more recent than the full backup I'm starting with, so I basically don't know how to complete the restore so I can keep trying with the rest of the pages.
I've tried a RESTORE DATABASE <foo> WITH RECOVERY, but that didn't work either; it asks me for a log which I don't have.
Does anyone have any tips on how I could try to recover anything from here? Or how to "complete" the online restore so I can keep trying to recover more pages? Would I have the same problem if I tried an offline restore (basically adding WITH NORECOVERY to everything and then trying to bring it back online at the end)?
Working through the database by hand is basically undoable... there are hundreds of tables with millions of rows and no clear meaning of what any of it is. The corrupt DB will fail on SELECT queries after some millions of rows, but I'm not sure I can work out exactly where. I've tried rebuilding all non-clustered indexes, but there are corrupt pages with row data, so that didn't work either.
Some data loss would be acceptable, but the DB should at least end up consistent.
The corrupt database is -still- online and clients are working on it (so it keeps getting new data), so any process I work out on the lab bench must be reproducible on the production database afterwards (downtime would be hard to arrange for it).
This is SQL Server 2014 Enterprise.
PS: I'm no DBA... I'm a programmer, but the client has already tried some "expert" SQL disaster recovery services and they have given up, so I've been asked to take a look and see if I can do anything.
Update: after many tests, the page-by-page restore was a no-go, so we've ditched the idea. We are going for a manual recovery (manually selecting the missing records from the corrupt tables and inserting them into the last known good backup), with some automated tooling to help (again, there are hundreds and hundreds of tables).
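A sketch of one way that tooling could drive the merge (every name here is hypothetical, and it assumes each table has a single-column key plus a linked server pointing at the corrupt copy): generate one INSERT ... SELECT per table that copies only the rows missing from the restored backup.

```python
# Hypothetical generator for the table-by-table merge: for each
# (table, key_column) pair, emit an INSERT ... SELECT that copies rows
# present in the corrupt copy but missing from the restored backup.
# Assumes single-column keys and a linked server named CORRUPT.
def merge_statements(tables, corrupt_db="CORRUPT.corruptdb.dbo"):
    stmts = []
    for table, key in tables:
        stmts.append(
            f"INSERT INTO dbo.[{table}] "
            f"SELECT src.* FROM {corrupt_db}.[{table}] AS src "
            f"WHERE NOT EXISTS (SELECT 1 FROM dbo.[{table}] AS dst "
            f"WHERE dst.[{key}] = src.[{key}]);"
        )
    return stmts
```

In practice each statement would need its own error handling (TRY/CATCH or per-table retries), since SELECTs against the corrupt copy can fail partway through a scan, and tables with identity columns would additionally need SET IDENTITY_INSERT ON.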
The standard procedure would be to:

    RESTORE DATABASE <database> PAGE='1:57, 1:202, 1:916, 1:1016'
        FROM <file_backup_of_file_B>
        WITH NORECOVERY;

    RESTORE LOG <database> FROM <log_backup> WITH NORECOVERY;
    RESTORE LOG <database> FROM <log_backup> WITH NORECOVERY;

    BACKUP LOG <database> TO <new_log_backup>;

    RESTORE LOG <database> FROM <new_log_backup> WITH RECOVERY;
    GO

After the new log backup has been applied, the page restore is completed and the pages are then usable.
However, you have holes in your TLOG backups, and restoring with the above procedure might bring your database back to a point in time that you do not want.
You are in a complicated situation.
Your database has corrupt pages and your company is constantly adding new data to a database with issues. This could result in a total downtime of the database. Do you want to risk that?
Somebody is going to be held responsible and the more you try to fix it, the more management might be inclined to decide that you might be that person in the end. Do you want to risk that?
You are putting yourself in a difficult situation by taking on a role that you were not employed for. You are trying to achieve something that neither your company DBAs nor your external consultant were capable of. While it may seem to be a noble gesture, you are putting yourself at risk. You might have "implicitly promised" something that you will never be able to fulfil. Do you want to risk that?
When somebody working with the database queries data that is corrupt, they will possibly receive an error message. Daily work is already being impacted, and the longer you put off the inevitable, the more productivity will suffer. Do you want to risk that? (This question could also be raised with management.)
Your company's backup procedure seems to be faulty (otherwise how would TLOG backups be missing?) and you are still running your production database as if there were no issues. Do you want to risk that?
The best recommendation I can give you is to halt production and call Microsoft! Or at least call Microsoft and possibly halt production.
While my writing may seem overly cautious and slightly dramatised from your perspective, I can personally relate, having experienced a similar situation as a DBA where data was lost. We only lost half a day's data, but we had to re-synchronise a lot of data with surrounding systems.
The longer you wait the more expensive recovery could become.
As for the limitation on page restores, here is a quote from the official documentation:
The maximum number of pages that can be restored into any single file in a restore sequence is 1000. However, if you have more than a small number of damaged pages in a file, consider restoring the whole file instead of the pages.
Reference: RESTORE Statements - Arguments (Transact-SQL) (Microsoft Docs)
When all is back to normal, the DBAs and/or external consultants might want to consider implementing a different backup/restore policy and procedure for your database. As it has to be up 24x7, you can't risk a backup procedure that doesn't provide adequate restore capabilities for every situation.
I see you have tried different methods, including working with data recovery "experts", to repair this corrupt database. Its size of over 1 TB makes the process much harder and a race against time. As an experienced DBA, I have come across similar situations; most of the time there are good backups available to restore from. In cases of inherited bad backups and a corrupt database, I have relied heavily on a third-party tool called Stellar Phoenix SQL Database Repair. This tool is well known for repairing corrupt databases (.mdf and .ndf). Below are a few of the tool's functions:
Performs recovery of deleted records from a SQL database
Saves the scan result of a database to perform recovery at a later stage
The tool requires the .mdf and .ndf files to be offline, so it works out well that you have a copy of the corrupted PROD database and don't have to stop the SQL Server services.
The best part is that the trial version provides the full functionality of the tool, except that the repaired database cannot be exported/saved. You would still be able to view all the recovered database objects and the extensive repair log file, which provides details on the different stages of the repair process.
Feel free to download and see if it helps out. Download here
I also wrote a blog on how the tool works at this site: samosql blogs
Thanks and HTH to make you the HERO of the day!
PS. When this storm is over, remember to tell management that there needs to be a major overhaul of their backup procedures, especially for such a database. A repeat of this scenario is totally unacceptable! :)