Given a directory, we have to write a program that finds the duplicate files in it.
For example:
Let’s assume the file system contains the following files and their contents.
char files[][80][80] = {{"1.txt", "abcd"},
                        {"2.txt", "efgh"},
                        {"3.txt", "efgh"},
                        {"4.txt", "abcd"},
                        {"5.txt", "efgh"},
                        {"6.txt", "efgh"},
                        {"7.txt", "xyz"}
                       };
Based on the above input, “1.txt” and “4.txt” are duplicates of each other, and so are “2.txt”, “3.txt”, “5.txt” and “6.txt”.
Algorithm
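- Create a hash table of the files in which the key is a hash of the file's content (the file size can serve as a cheaper first-pass key).
- For every file, search for its key in the hash table.
- If an entry is found, the two files are duplicates only if their contents are also the same; a matching key alone does not prove it. If no entry is found, add the file to the table.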
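The steps above can be sketched as a small function. This is a minimal sketch, not a definitive implementation: the name find_duplicates and the (file name, content) pair layout are assumptions made for illustration, and the content is held in memory rather than read from disk. Unlike a hash-only lookup, it compares the content before reporting a duplicate.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: each input pair is (file name, file content).
// Returns (duplicate, first file seen with the same content) pairs in input order.
std::vector<std::pair<std::string, std::string>>
find_duplicates(const std::vector<std::pair<std::string, std::string>>& files) {
    // Key: hash of the content. Value: indices of distinct files with that hash.
    std::unordered_map<std::size_t, std::vector<std::size_t>> seen;
    std::vector<std::pair<std::string, std::string>> result;

    for (std::size_t i = 0; i < files.size(); ++i) {
        const std::size_t key = std::hash<std::string>{}(files[i].second);
        std::vector<std::size_t>& bucket = seen[key];

        bool matched = false;
        for (std::size_t j : bucket) {
            // Equal hashes only suggest equality; compare the content to confirm.
            if (files[j].second == files[i].second) {
                result.emplace_back(files[i].first, files[j].first);
                matched = true;
                break;
            }
        }
        if (!matched) {
            bucket.push_back(i);  // first file seen with this content
        }
    }
    return result;
}
```

Called with the seven sample files above, the function pairs 3.txt, 5.txt and 6.txt with 2.txt, and 4.txt with 1.txt.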
Implementation
#include <functional>
#include <iostream>
#include <map>
#include <string>
using namespace std;

// Each entry is {file name, file content}.
char files[][80][80] = {{"1.txt", "abcd"},
                        {"2.txt", "efgh"},
                        {"3.txt", "efgh"},
                        {"4.txt", "abcd"},
                        {"5.txt", "efgh"},
                        {"6.txt", "efgh"},
                        {"7.txt", "xyz"}
                       };

int main ()
{
    int size = sizeof(files)/sizeof(files[0]);
    cout << "Size: " << size << endl;
    for (int i = 0; i < size; i++) {
        cout << files[i][0] << " -- " << files[i][1] << endl;
    }

    // Maps the hash of a file's content to the first file name seen with it.
    map <size_t, string> file_content;
    for (int i = 0; i < size; i++) {
        size_t hash_id = hash<string>{} (files[i][1]);
        map <size_t, string>::iterator it = file_content.find (hash_id);
        if (it != file_content.end ()) {
            /* A full check would also compare the content of both files here.
               This example assumes that equal hashes mean equal content. */
            cout << "Duplicate file found: " << files[i][0] << " and " << it->second << endl;
            continue;
        }
        file_content.insert ({hash_id, files[i][0]});
    }
}
Let’s look at the output of the above program.
Size: 7
1.txt -- abcd
2.txt -- efgh
3.txt -- efgh
4.txt -- abcd
5.txt -- efgh
6.txt -- efgh
7.txt -- xyz
Duplicate file found: 3.txt and 2.txt
Duplicate file found: 4.txt and 1.txt
Duplicate file found: 5.txt and 2.txt
Duplicate file found: 6.txt and 2.txt
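Each line of this output pairs a duplicate with the first file that had the same content. If one entry per set of duplicates is preferred, the files can be grouped by content instead. The following is a minimal sketch under the same assumptions as the sample data: group_duplicates is a hypothetical name, the input is a list of (file name, content) pairs, and the content is small enough to use directly as a map key.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: each input pair is (file name, file content).
// Returns one group of file names per content shared by two or more files.
std::vector<std::vector<std::string>>
group_duplicates(const std::vector<std::pair<std::string, std::string>>& files) {
    // Key: the content itself, so no hash collision can merge different files.
    std::map<std::string, std::vector<std::string>> by_content;
    for (const auto& file : files) {
        by_content[file.second].push_back(file.first);
    }

    std::vector<std::vector<std::string>> groups;
    for (const auto& entry : by_content) {
        if (entry.second.size() > 1) {  // keep only content seen more than once
            groups.push_back(entry.second);
        }
    }
    return groups;
}
```

For the sample files this gives two groups: 1.txt with 4.txt, and 2.txt, 3.txt, 5.txt with 6.txt. Groups come out ordered by content because std::map keeps its keys sorted.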