Odds & Ends: Tombstones on The Internet
Or Why Anything You Delete Might Not Actually Be Gone
Imagine this situation:
You just signed up for Face+, a new social media service. You find family and friends here, and you are just ecstatic about all the new bells and whistles. Suddenly, an ex finds you and sends you a message. OMG, what a stalker! You write back a nasty reply and post it. A few hours later you realize that you might have made a mistake, so you delete your reply and compose a more thoughtful one that doesn't include threats of bodily harm.
Here's the big question: was your original reply really deleted?
To answer this question, first we need to discuss how information is stored on the Internet.
The vast majority of data on the Internet is stored in distributed systems. This means your data is physically stored on multiple computers in various locations, so that you have access to your data regardless of where you are and what's going on in the real world. For example, if Houston has a flood and all local data centers go offline, the information must already be replicated in other data centers so that users aren't impacted by the outage. All of the major players on the Internet do this, which is why a single hard drive failing in a data center typically doesn't lose anyone's data.
This distribution of data means that to truly delete something, you need to delete that piece of information from every hard drive in every location, which is something of a logistical nightmare. To deal with the problem, tombstones were created. A tombstone is a placeholder that states that a piece of data has been deleted. When a request for the data flows around the Internet, every pointer to the data now leads to a tombstone, so whoever or whatever is looking for the deleted information is told that the data has been deleted.
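To make the idea concrete, here is a minimal sketch of a tombstone in a replicated store. The `Replica` class, its `disk` log, and the `TOMBSTONE` marker are hypothetical names invented for this illustration; real services are far more elaborate, and this is not how any particular one is built.

```python
# Hypothetical marker object standing in for "this data has been deleted".
TOMBSTONE = object()


class Replica:
    """A toy stand-in for one data center's copy of the data."""

    def __init__(self):
        self.visible = {}  # what readers are allowed to see
        self.disk = []     # append-only log of everything ever written

    def write(self, key, value):
        self.disk.append((key, value))
        self.visible[key] = value

    def delete(self, key):
        # Nothing is erased here: a tombstone is simply written on top.
        self.disk.append((key, TOMBSTONE))
        self.visible[key] = TOMBSTONE

    def read(self, key):
        value = self.visible.get(key)
        return None if value is TOMBSTONE else value


# Three replicas, as if the reply were stored in three locations.
replicas = [Replica() for _ in range(3)]
for replica in replicas:
    replica.write("reply", "nasty text")
for replica in replicas:
    replica.delete("reply")

hidden_everywhere = all(r.read("reply") is None for r in replicas)
still_on_disk = all(("reply", "nasty text") in r.disk for r in replicas)
```

In this sketch both `hidden_everywhere` and `still_on_disk` end up true: no reader can get the reply back, yet every replica still holds the original text in its log.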
The problem comes from how and when tombstones are deleted. You see, I/O (input/output) is expensive, as are CPU cycles. But storage media, such as hard drives, are relatively cheap. So leaving tombstones around for a long time is not a problem because they take up a tiny amount of space. What matters to most users, though, is that the data referenced by the tombstone is only marked for deletion. Nothing says the data has actually been deleted from the hard drive.
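One way a system might eventually clean up is a periodic sweep that physically drops a deleted item only after its tombstone has aged past some grace period. The sketch below assumes that policy; the `Record` type, the `compact` function, and the `grace_seconds` parameter are hypothetical names for this illustration, and real systems choose their own rules and timing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    key: str
    value: Optional[str]  # None stands for a tombstone in this sketch
    written_at: float     # seconds since some arbitrary starting point


def compact(log: List[Record], now: float, grace_seconds: float) -> List[Record]:
    """Return a new log without keys whose tombstone is older than the grace period."""
    deleted_at = {}
    for rec in log:  # the log is assumed to be in write order
        if rec.value is None:
            deleted_at[rec.key] = rec.written_at
        else:
            deleted_at.pop(rec.key, None)  # a later write revives the key
    expired = {k for k, t in deleted_at.items() if now - t >= grace_seconds}
    return [rec for rec in log if rec.key not in expired]


log = [
    Record("reply", "nasty text", written_at=0.0),
    Record("reply", None, written_at=100.0),  # the delete
]

soon_after = compact(log, now=200.0, grace_seconds=1000.0)
much_later = compact(log, now=2000.0, grace_seconds=1000.0)
```

Here `soon_after` still contains both records, including the original text, while `much_later` is empty. Until a sweep like this runs and the grace period has passed, the "deleted" data is still sitting on the disk.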
There are a lot of practices in place for how and when to defragment disks, how to store and retrieve data in a timely fashion, and how to manage data centers. These practices balance the need to provide fast service with the expense of a data center. To get faster service, you can:
Increase the CPU speed
Increase the amount of memory
Increase the speed of the hard drives
Increase the speed of the network connection
Increase the I/O speed by not doing a lot of writes and keeping related data physically close on the media for faster reads
The last item in the list is key to this discussion. By using tombstones, reads are faster and return a definite answer when someone asks for a piece of data that has been marked for deletion. Without a tombstone, a request comes back "not found" with no reason behind the error. With a tombstone, a request comes back "data has been deleted".
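The difference between the two answers can be sketched as a lookup that returns a status along with the value. The `Status` enum, the `lookup` function, and the `TOMBSTONE` marker are hypothetical names made up for this example, not any real service's API.

```python
from enum import Enum


class Status(Enum):
    FOUND = "found"
    NOT_FOUND = "not found"
    DELETED = "data has been deleted"


# Hypothetical marker object standing in for a tombstone.
TOMBSTONE = object()


def lookup(index, key):
    """Return (status, value) for a key in a dict-like index."""
    if key not in index:
        # No record at all: the reader gets no explanation.
        return Status.NOT_FOUND, None
    value = index[key]
    if value is TOMBSTONE:
        # The tombstone answers immediately, without searching any further.
        return Status.DELETED, None
    return Status.FOUND, value


index = {
    "new_reply": "a more thoughtful message",
    "old_reply": TOMBSTONE,
}

found = lookup(index, "new_reply")
deleted = lookup(index, "old_reply")
missing = lookup(index, "never_written")
```

In this sketch the deleted reply and the never-written one both come back without a value, but only the tombstoned one tells the reader why.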
What does this mean for the average user? Basically, if you delete something from the Internet, you will probably lose the ability to access that piece of data through regular websites. But that does not mean that your data has been physically deleted from all of the places where it exists in data centers.
So to answer the original question, your original reply is marked with a tombstone, but there is no guarantee as to when it will be deleted or how long it will remain hidden on the hard drives.