Prevent File Duplication with MD5 Checksums

Duplicate uploads can account for 5-30% of the data your service stores. In large applications that hold gigabytes or terabytes of data, this wastes a significant amount of storage, and it adds unnecessary cost when you pay for third-party storage such as AWS S3. This article covers how to prevent duplicate file storage using MD5 hashes.

Hashing

A hash function produces a fixed-size output for any data passed into it. Hash functions have many use cases, such as password storage in databases, integrity checksums for files, and blockchain technology. MD5 is a popular and fast hash algorithm. It is no longer considered safe against deliberately crafted collisions, but it is well suited to this article's purpose of detecting accidental duplicates.

MD5 generates a 128-bit hash value for any arbitrary data. It is deterministic, so a given input always produces the same hash. This is why it can be used to check whether a duplicate file has been uploaded.
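To make this concrete, here is a minimal sketch using Node's built-in crypto module. The md5Hex helper is a hypothetical name introduced only for this illustration; it is not part of the project we build later.

```typescript
import { createHash } from 'crypto';

// Hypothetical helper: returns the MD5 digest of some bytes as a hex string.
function md5Hex(data: Buffer): string {
    return createHash('md5').update(data).digest('hex');
}

const first = md5Hex(Buffer.from('hello world'));
const second = md5Hex(Buffer.from('hello world'));  // exactly the same bytes
const other = md5Hex(Buffer.from('hello world!'));  // one extra byte

console.log(first.length);     // 32 hex characters, i.e. 128 bits
console.log(first === second); // true: same input, same hash
console.log(first === other);  // false: different input, different hash
```

Identical bytes always map to the same 32-character hex string, which is the property the rest of the article relies on.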

Getting Started

For this article, we will set up a simple Express server with MongoDB to store the file hashes and file paths. We will start by creating our project folder and navigating into it:

mkdir express-file-manager
cd express-file-manager

Then we will initialize a new Node project with the following:

npm init -y

Next, we will install all project dependencies required for this project.

npm install typescript --save-dev
npm install ts-node --save-dev
npm install express
npm install @types/express --save-dev
npm install multer
npm install @types/multer --save-dev
npm install mongoose

We have installed Express, the backend framework for this project. Multer parses file uploads in incoming requests, and Mongoose is an ODM (object document mapper) for interacting with MongoDB.

Next up, we will create a tsconfig.json file with the following:

npx tsc --init

Setting up a Mongoose Model

We will create a model called FileModel, which represents a collection on the MongoDB instance. This collection keeps track of files that are uploaded through the service and stores important information about them. To start, we will create a model.ts file in our project folder. Below is the code:

import { Schema, model, Document } from 'mongoose';

interface IFile extends Document {
    name: string;
    path: string;
    size: number;
    fileHash: string;
    createdAt: Date;
    updatedAt: Date;
}

const fileSchema = new Schema<IFile>({
    name: { type: String, required: true },
    path: { type: String, required: true },
    size: { type: Number, required: true },
    fileHash: { type: String, required: true },
    createdAt: { type: Date, default: Date.now },
    updatedAt: { type: Date, default: Date.now }
});

export const FileModel = model<IFile>('File', fileSchema);

The fileSchema has a few important properties:

  • name: The name of the file.

  • path: The location of the file on the server.

  • size: The size of the file in bytes.

  • fileHash: The MD5 checksum of the file.

Setting up the Application Server

In this section, we will set up an Express server. It will have an endpoint for uploading a file. It will also connect to the MongoDB instance. To get started, we will create an app.ts file and write the following code:

import express from 'express';
import mongoose from 'mongoose';
import fs from 'fs/promises'
import path from 'path'


const app = express();
const port = 3000;


mongoose.connect("mongodb://127.0.0.1:27017/file-service?directConnection=true")
    .then(
        () => {
            console.log("Mongoose connection initialised.")

            // create storage directory
            fs.mkdir(path.join(__dirname, 'files'))
                .then(() => console.log("File directory created"))
                .catch(() => console.log("File directory has already been initialized"))

            app.listen(port, () => {
                console.log(`Server is running on http://localhost:${port}`);
            })
        }
    )
    .catch(err => console.log("mongoose connection failed with error:", err.message))

The app.ts currently has a lot going on:

  1. We are initializing an Express Application with const app = express();

  2. We are using mongoose.connect to connect to a MongoDB instance. A local connection string is used but can be replaced with any suitable options.

  3. After connecting to the database, we create the directory where our files will be uploaded, using fs.mkdir. The Promise returned by the function rejects if the directory already exists, so we handle that case with .catch to avoid an unhandled rejection.

  4. We bind the server to the port, which is currently 3000.
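As an alternative to catching the rejection in step 3, fs.mkdir accepts a recursive option that makes the call succeed when the directory is already there. The sketch below assumes a hypothetical ensureDir helper and uses a temporary directory purely as an example location; the server in this article creates its directory next to app.ts instead.

```typescript
import { mkdir, stat } from 'fs/promises';
import { tmpdir } from 'os';
import { join } from 'path';

// Hypothetical helper: creates the directory if it is missing,
// and does nothing if it already exists.
async function ensureDir(dir: string): Promise<void> {
    // With `recursive: true`, mkdir does not reject when the directory exists.
    await mkdir(dir, { recursive: true });
}

// Example location only, chosen so the sketch can run anywhere.
const exampleDir = join(tmpdir(), 'express-file-manager-example', 'files');

async function demo(): Promise<boolean> {
    await ensureDir(exampleDir);
    await ensureDir(exampleDir); // the second call succeeds as well
    return (await stat(exampleDir)).isDirectory();
}

demo().then((isDir) => console.log('directory ready:', isDir));
```

One advantage of this approach is that genuine failures, such as missing permissions, still surface as errors instead of being reported as "already initialized".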

Setting Up the Upload Endpoint

Next, we need to set up an endpoint to store files. To do this, we will adjust the previous app.ts code.

import express from 'express';
import mongoose from 'mongoose';
import fs from 'fs/promises'
import path from 'path';
import multer from 'multer';
import crypto from "crypto";
import { FileModel } from './model';

const app = express();
const port = 3000;

const upload = multer({
    storage: multer.memoryStorage()
})

app.post('/files', upload.single('file'), async (req, res) => {
    // req.file is undefined when no file was sent under the "file" key
    const file = req.file;
    if (!file) {
        return res.status(400).json({ message: "file is required" });
    }

    // hash the raw file bytes with md5
    const fileHash = crypto.createHash('md5').update(file.buffer).digest('hex');

    // if a file with the same hash is already stored, return it instead of saving again
    const existingFile = await FileModel.findOne({ fileHash });
    if (existingFile) return res.json({ message: "file retrieved", data: existingFile });

    // unique name that keeps the original extension (path.extname includes the dot, or is empty)
    const filename = `${crypto.randomUUID()}${path.extname(file.originalname)}`;

    // relative path saved in the database
    const filePath = path.join('files', filename);

    // store file in the directory created at startup (next to this script)
    await fs.writeFile(path.join(__dirname, filePath), file.buffer);

    const fileData = await FileModel.create({
        name: filename,
        path: filePath,
        size: file.size,
        fileHash
    });

    return res.json({ message: "file created", data: fileData });
});



mongoose.connect("mongodb://127.0.0.1:27017/file-service?directConnection=true")
    .then(
        () => {
            console.log("Mongoose connection initialised.")

            // create storage directory
            fs.mkdir(path.join(__dirname, 'files'))
                .then(() => console.log("File directory created"))
                .catch(() => console.log("File directory has already been initialized"))

            app.listen(port, () => {
                console.log(`Server is running on http://localhost:${port}`);
            })
        }
    )
    .catch(err => console.log("mongoose connection failed with error:", err.message))

Let’s go through the key aspects of the code:

  1. A POST endpoint is registered on the app with app.post.

  2. The multer middleware parses the request and looks for an uploaded file under the file key. Because we use memory storage, the file’s contents are available as file.buffer.

  3. crypto.createHash('md5').update(file.buffer).digest('hex') creates an MD5 hash of the file’s raw bytes. Since the hash is always the same for the same file, we use it to check whether the file has already been stored. If it has, we return the existing record to the user instead of saving the file again and expending resources.

  4. We create a unique file name with the format ${uuid}.${file_extension}.

  5. Lastly, if the file did not previously exist, we save it to our files directory and store its metadata in MongoDB through the FileModel.
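The endpoint hashes the whole upload from memory, which is fine for small files. For larger files that already sit on disk, the same digest can be computed chunk by chunk with a read stream. The following is only a sketch of that idea: md5OfFile is a hypothetical helper, and the sample file in the temporary directory exists solely for the demonstration.

```typescript
import { createHash } from 'crypto';
import { createReadStream, writeFileSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';

// Hypothetical helper: hashes a file on disk without loading it fully into memory.
function md5OfFile(filePath: string): Promise<string> {
    return new Promise((resolve, reject) => {
        const hash = createHash('md5');
        createReadStream(filePath)
            .on('data', (chunk) => hash.update(chunk)) // feed each chunk into the hash
            .on('error', reject)
            .on('end', () => resolve(hash.digest('hex')));
    });
}

// Sample data for the demonstration only.
const samplePath = join(tmpdir(), 'md5-stream-example.txt');
const sampleBytes = Buffer.from('example upload contents');
writeFileSync(samplePath, sampleBytes);

md5OfFile(samplePath).then((streamed) => {
    const inMemory = createHash('md5').update(sampleBytes).digest('hex');
    console.log(streamed === inMemory); // true: both approaches hash the same bytes
});
```

Because the streamed digest matches the in-memory one, either approach can feed the same duplicate lookup.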

To test the code, we can run the following on the command line:

npx ts-node app.ts

After running it, we should see that a files directory has been created. We can use Postman to test the endpoint by sending a POST request to http://localhost:3000/files with a form-data body.

Note: Make sure the key of the file you are uploading is named file. The multer middleware checks if a file with a key named file exists.

After running the request, the file is saved in the files directory that was created. If we run the request with the same file, the message in the response becomes “file retrieved” instead of “file created,” and no new file will be saved to the files directory.
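The same behaviour can be reproduced without Express or MongoDB. The sketch below replaces the database with an in-memory Map, so it only illustrates the lookup logic. The StoredFile type, the filesByHash map and the uploadFile function are hypothetical names used for this example.

```typescript
import { createHash } from 'crypto';

// Hypothetical record type, loosely mirroring some of the fields kept in FileModel.
interface StoredFile {
    name: string;
    size: number;
    fileHash: string;
}

// In-memory stand-in for the MongoDB collection, keyed by MD5 hash.
const filesByHash = new Map<string, StoredFile>();

function uploadFile(name: string, data: Buffer): { message: string; data: StoredFile } {
    const fileHash = createHash('md5').update(data).digest('hex');

    // Same bytes already stored: return the existing record.
    const existing = filesByHash.get(fileHash);
    if (existing) return { message: 'file retrieved', data: existing };

    // New bytes: remember them under their hash.
    const stored: StoredFile = { name, size: data.length, fileHash };
    filesByHash.set(fileHash, stored);
    return { message: 'file created', data: stored };
}

const firstUpload = uploadFile('report.pdf', Buffer.from('same bytes'));
const secondUpload = uploadFile('copy-of-report.pdf', Buffer.from('same bytes'));

console.log(firstUpload.message);  // file created
console.log(secondUpload.message); // file retrieved
console.log(filesByHash.size);     // 1
```

Note that the second upload uses a different file name; only the contents decide whether a file counts as a duplicate.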

Conclusion

We have learned about the MD5 hash and seen a useful application for it. Additionally, we have explored some basic filesystem methods available in Node.js. Preventing file duplication can save a lot of storage, which can be costly to maintain.