Program to Find Duplicate Files in a File System

Given in a directory, we have to write a program to find duplicate files in the file system.
For eg:
Let’s assume we have following files and it’s content in the file system.

char files[][80][80] = {{"1.txt", "abcd"},
                        {"2.txt", "efgh"},
                        {"3.txt", "efgh"},
                        {"4.txt", "abcd"},
                        {"5.txt", "efgh"},
                        {"6.txt", "efgh"},
                        {"7.txt", "xyz"}

Based on above input, duplicate files are – “1.txt” and “4.txt”. “2.txt”, “3.txt”, “5.txt” and “6.txt” are also duplicate files.

Algorithm

  • Create a hash table of the files in which key is the size of file.
  • For every file, search in hash table.
    • If an entry is found, then files are same if content are same.

Implementation

#include <iostream>
#include <map>

using namespace std;

char files[][80][80] = {{"1.txt", "abcd"},
						{"2.txt", "efgh"},
						{"3.txt", "efgh"},
						{"4.txt", "abcd"},
						{"5.txt", "efgh"},
						{"6.txt", "efgh"},
						{"7.txt", "xyz"}
					   };

int main ()
{
	int size = sizeof(files)/sizeof(files[0]);
	cout << "Size: " << size << endl;
	for (int i = 0; i < size; i++) {
		printf ("%s -- %s\n", files[i][0], files[i][1]);
	}
	map <size_t, string> file_content;
	for (int i = 0; i < size; i++) {
		size_t hash_id = hash<string>{} (files[i][1]);

		map <size_t, string>::iterator it = file_content.find (hash_id);
		if (it != file_content.end ()) {
			/* We can compare the content of both the files.
			   Here, we am assuming since file size is same, content will also be same. */
			cout << "Duplicate file found: " << files[i][0] << " and " << it->second << endl;
			continue;
		}
		file_content.insert ({hash_id, files[i][0]});
	}
}

Let’s have a look into the output of the above program.

Size: 7
1.txt -- abcd
2.txt -- efgh
3.txt -- efgh
4.txt -- abcd
5.txt -- efgh
6.txt -- efgh
7.txt -- xyz
Duplicate file found: 3.txt and 2.txt
Duplicate file found: 4.txt and 1.txt
Duplicate file found: 5.txt and 2.txt
Duplicate file found: 6.txt and 2.txt

Leave a Reply

Your email address will not be published. Required fields are marked *