Duplicate files uploaded by users can account for 5-30% of the data stored by your service. In large applications that store gigabytes or terabytes of data, this wastes a significant amount of resources, and it leads to unnecessary costs when using third-party storage services like AWS S3. This article covers how to prevent duplicate file storage with the MD5 hash.
Hashing
Hash functions generate a fixed-size output for any data passed into the function. They have various use cases, such as secure password storage in databases, integrity checksums for files, and blockchain technology. MD5 is a popular and fast hash algorithm that works well for this article's purpose of detecting accidental duplicates. It is not collision-resistant against deliberately crafted inputs, so it should not be relied on for security.
The MD5 hash generates a 128-bit hash value for any arbitrary data. It is deterministic, so the hash value is always the same for a specific input. This is why it can be used to check if a duplicate file has been uploaded.
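To make this concrete, here is a minimal sketch using Node's built-in crypto module. The helper name md5Hex is hypothetical and exists only for illustration.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper: returns the MD5 digest of the input as a hex string.
function md5Hex(data: string | Buffer): string {
  return createHash('md5').update(data).digest('hex');
}

const first = md5Hex(Buffer.from('hello'));
const second = md5Hex(Buffer.from('hello'));
const other = md5Hex(Buffer.from('hello!'));

console.log(first.length);     // 32 hex characters, i.e. 128 bits
console.log(first === second); // true: identical bytes give identical hashes
console.log(first === other);  // false: these two inputs hash differently
```

Only the bytes matter here. The file name plays no part in the hash, which is why the same file uploaded under two different names is still detected as a duplicate.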
Getting Started
For this article, we will set up a simple Express server with MongoDB to store the file hashes and access URLs. We will start with creating our project folder and then navigate to the folder:
mkdir express-file-manager
cd express-file-manager
Then we will initialize a new Node project with the following:
npm init -y
Next, we will install all project dependencies required for this project.
npm install typescript --save-dev
npm install ts-node --save-dev
npm install express
npm install @types/express --save-dev
npm install multer
npm install @types/multer --save-dev
npm install mongoose
We have installed Express, the backend framework for this project. Multer handles file uploads in requests, and Mongoose is an ODM (object data modeling library) for interacting with MongoDB.
Next up, we will create a tsconfig.json file with the following:
npx tsc --init
Setting up a Mongoose Model
We will create a model called FileModel, which represents a collection on the MongoDB instance. This collection keeps track of files that are uploaded through the service. It will also store important information about the uploaded files. To start, we will create a model.ts file in our project folder. Below is the code:
import { Schema, model, Document } from 'mongoose';

interface IFile extends Document {
  name: string;
  path: string;
  size: number;
  fileHash: string;
  createdAt: Date;
  updatedAt: Date;
}

const fileSchema = new Schema<IFile>({
  name: { type: String, required: true },
  path: { type: String, required: true },
  size: { type: Number, required: true },
  fileHash: { type: String, required: true },
  createdAt: { type: Date, default: Date.now },
  updatedAt: { type: Date, default: Date.now }
});

export const FileModel = model<IFile>('File', fileSchema);
The fileSchema has a few important properties:
name: The name of the file.
path: The location of the file on the server.
size: The size of the file in bytes.
fileHash: The MD5 checksum of the file.
Setting up the Application Server
In this section, we will set up an Express server. It will have an endpoint for uploading a file. It will also connect to the MongoDB instance. To get started, we will create an app.ts file and write the following code:
import express from 'express';
import mongoose from 'mongoose';
import fs from 'fs/promises';
import path from 'path';

const app = express();
const port = 3000;

mongoose.connect("mongodb://127.0.0.1:27017/file-service?directConnection=true")
  .then(() => {
    console.log("Mongoose connection initialised.");

    // create storage directory
    fs.mkdir(path.join(__dirname, 'files'))
      .then(() => console.log("File directory created"))
      .catch(() => console.log("File directory has already been initialized"));

    app.listen(port, () => {
      console.log(`Server is running on http://localhost:${port}`);
    });
  })
  .catch(err => console.log("mongoose connection failed with error:", err.message));
The app.ts file currently has a lot going on:
We initialize an Express application with const app = express();
We use mongoose.connect to connect to a MongoDB instance. A local connection string is used, but it can be replaced with any suitable one.
After connecting to the database, we create the directory where our files will be uploaded, using fs.mkdir. The promise returned by the function rejects if the directory already exists, so we prevent an unexpected error by using .catch.
We bind the server to the port, which is currently 3000.
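As an alternative to catching the rejection, mkdir accepts a recursive option, and with it the call does not fail when the directory already exists. The sketch below uses a temporary directory and the synchronous API purely to keep the example self-contained. The ensureDir helper and the directory names are hypothetical.

```typescript
import { existsSync, mkdirSync, mkdtempSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper: creates the directory if needed and returns its path.
// With { recursive: true }, mkdir does not throw when the directory exists.
function ensureDir(dir: string): string {
  mkdirSync(dir, { recursive: true });
  return dir;
}

// Use a throwaway base directory so the example does not touch the project.
const demoBase = mkdtempSync(join(tmpdir(), 'file-service-'));
const demoDir = join(demoBase, 'files');

ensureDir(demoDir); // creates the directory
ensureDir(demoDir); // second call is a no-op instead of an error

console.log(existsSync(demoDir)); // true
```

In app.ts, the equivalent would be passing { recursive: true } as the second argument to fs.mkdir. A .catch on that call would then only fire for genuine failures, such as a permissions error, rather than for the expected "already exists" case.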
Setting Up the Upload Endpoint
Next, we need to set up an endpoint to store files. To do this, we will adjust the previous app.ts code.
import express from 'express';
import mongoose from 'mongoose';
import fs from 'fs/promises';
import path from 'path';
import multer from 'multer';
import crypto from 'crypto';
import { FileModel } from './model';

const app = express();
const port = 3000;

// resolve the storage directory once so mkdir and writeFile use the same location
const storageDir = path.join(__dirname, 'files');

// keep uploads in memory so the raw bytes can be hashed before anything is written
const upload = multer({
  storage: multer.memoryStorage()
});

app.post('/files', upload.single('file'), async (req, res) => {
  const file = req.file;
  if (!file) {
    res.status(400).json({ message: "file is required" });
    return;
  }

  // hash file data with md5
  const fileHash = crypto.createHash('md5').update(file.buffer).digest('hex');

  // check if a file with the same hash has already been stored
  const existingFile = await FileModel.findOne({ fileHash });
  if (existingFile) {
    res.json({ message: "file retrieved", data: existingFile });
    return;
  }

  // path.extname returns the extension with its leading dot, or '' if there is none
  const filename = `${crypto.randomUUID()}${path.extname(file.originalname)}`;
  const filePath = path.join(storageDir, filename);

  // store file
  await fs.writeFile(filePath, file.buffer);

  const fileData = await FileModel.create({
    name: filename,
    path: filePath,
    size: file.size,
    fileHash
  });

  res.json({ message: "file created", data: fileData });
});

mongoose.connect("mongodb://127.0.0.1:27017/file-service?directConnection=true")
  .then(() => {
    console.log("Mongoose connection initialised.");

    // create storage directory
    fs.mkdir(storageDir)
      .then(() => console.log("File directory created"))
      .catch(() => console.log("File directory has already been initialized"));

    app.listen(port, () => {
      console.log(`Server is running on http://localhost:${port}`);
    });
  })
  .catch(err => console.log("mongoose connection failed with error:", err.message));
Let’s go through the key aspects of the code:
A POST endpoint is set for the app with app.post.
We use the multer middleware to check for a file in the request under the file key.
crypto.createHash('md5').update(file.buffer).digest('hex'); creates an MD5 hash of the file’s raw byte data. Since the hash will always be the same for the same file, we use it to check whether the file has already been stored. If it has, we return the existing file's data to the user instead of saving the file on our server again and expending resources.
We create a unique file name, which has the format ${uuid}.{file_extension}
Lastly, if the file did not previously exist, we save it to our files directory and store the file data in MongoDB through the FileModel.
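The flow described above can also be sketched without Express or MongoDB, which makes the deduplication logic easier to see in isolation. This is an illustration under stated assumptions, not the endpoint itself: an in-memory Map stands in for the MongoDB collection, a temporary directory stands in for the files directory, the synchronous fs API is used for brevity, and storeIfNew, StoredFile, hashIndex, and uploadDir are hypothetical names.

```typescript
import { createHash, randomUUID } from 'node:crypto';
import { mkdtempSync, readdirSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { extname, join } from 'node:path';

interface StoredFile {
  name: string;
  path: string;
  size: number;
  fileHash: string;
}

// Stand-in for the MongoDB collection: maps an MD5 hash to its stored record.
const hashIndex = new Map<string, StoredFile>();

// Throwaway directory standing in for the `files` directory.
const uploadDir = mkdtempSync(join(tmpdir(), 'dedup-demo-'));

// Hypothetical helper mirroring the endpoint: hash, look up, store only if new.
function storeIfNew(buffer: Buffer, originalName: string): { message: string; data: StoredFile } {
  const fileHash = createHash('md5').update(buffer).digest('hex');

  // A known hash means these exact bytes are already on disk.
  const existing = hashIndex.get(fileHash);
  if (existing) return { message: 'file retrieved', data: existing };

  const name = `${randomUUID()}${extname(originalName)}`;
  const filePath = join(uploadDir, name);
  writeFileSync(filePath, buffer);

  const record: StoredFile = { name, path: filePath, size: buffer.length, fileHash };
  hashIndex.set(fileHash, record);
  return { message: 'file created', data: record };
}

const firstUpload = storeIfNew(Buffer.from('same bytes'), 'report.pdf');
const secondUpload = storeIfNew(Buffer.from('same bytes'), 'copy-of-report.pdf');

console.log(firstUpload.message);            // file created
console.log(secondUpload.message);           // file retrieved
console.log(readdirSync(uploadDir).length);  // 1: only one copy is on disk
```

The second upload uses a different file name but the same bytes, so it resolves to the record created by the first upload and nothing new is written.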
To test the code, we can run the following on the command line:
npx ts-node app.ts
After running it, we should see that a files directory has been created. We can use Postman to test the request.
Note: Make sure the key of the file you are uploading is named file. The multer middleware checks if a file with a key named file exists.
After running the request, the file is saved in the files directory that was created. If we run the request with the same file, the message in the response becomes “file retrieved” instead of “file created,” and no new file will be saved to the files directory.
Conclusion
We have learned about the MD5 hash and seen a useful application for it. Additionally, we have explored some basic filesystem methods available in Node.js. Preventing file duplication can save a lot of storage, which can be costly to maintain.