What is hashing and why is it important in cybersecurity?

Hashing is a mathematical operation commonly used to protect data integrity. Learn about how it works and its applications in computer science and cybersecurity.

What is Hashing?

Hashing is a process that converts any input to a fixed-length, seemingly random output. For example, you could hash the word “Hello” or the first billion digits of pi with SHA-256 (a widely used hash function), and the result would always be 256 bits long.
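The fixed-length property is easy to check for yourself. The following is a minimal sketch using Python's standard hashlib module; the helper name sha256_hex and the long stand-in input (repeated digits, not the real digits of pi) are assumptions made for illustration.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

short_digest = sha256_hex("Hello".encode("utf-8"))
# A long stand-in input (about one million bytes); not the actual digits of pi.
long_digest = sha256_hex(b"3" + b"1415926535" * 100_000)

# Each digest is 64 hexadecimal characters = 32 bytes = 256 bits,
# no matter how large the input was.
print(len(short_digest), len(long_digest))
```

Both digests have the same length even though one input is five bytes and the other is roughly a megabyte.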

This is accomplished using a mathematical function called a hash function, which is a collision-resistant one-way function. This means that it’s essentially impossible to find two inputs that produce the same hash value or the input that produced a specific hash value. Of course, this assumes that you’re using a secure hash function.

The Mechanics of Hashing

Hash functions are a type of cryptographic algorithm commonly used to ensure data integrity. To be effective, a hash function needs to have certain properties, including the following:

One-Way Function

Hash functions are mathematical operations that take any input and output a fixed-size value. This involves mapping many different inputs to the same output value.

The modulo operation is an example of a function that accomplishes this without the full capabilities of a hash function. Modulo is the remainder after division, so, for example, 25 modulo 10 equals 5. In this example, any integer value that ends in a 5 produces the same output (5). Because infinitely many inputs share each output, the output alone cannot tell you which input produced it.
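The modulo example can be written out in a few lines. This is a toy sketch only; the name toy_hash is hypothetical, and the function has none of the security properties of a real hash function.

```python
def toy_hash(n: int) -> int:
    """Toy 'hash': the remainder after dividing by 10. Not secure."""
    return n % 10

inputs = [5, 25, 105, 1_000_005]

# Every input ending in 5 maps to the same output, so the output
# cannot reveal which of these inputs was used.
print([toy_hash(n) for n in inputs])
```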

Collision Resistance

The hash functions used for cybersecurity and computer science add collision resistance, which means that it is infeasible to find two inputs that produce the same output. This is important to common applications of hash functions, such as integrity protections and hash tables. In fact, it’s so important that a hash function is considered broken as soon as a hash collision has been successfully generated for it. This is why the MD5 and SHA-1 hash functions are no longer used for security-related applications.

Hash function non-locality is important to collision resistance and means that similar inputs to the hash function produce very different outputs. In general, flipping a single bit of the input to a cryptographically secure hash function should flip about half of the output bits on average, a property often called the avalanche effect. This is important because otherwise, tools such as hill climbing algorithms could be used to find hash collisions.
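One way to observe this behavior is to flip a single input bit and count how many output bits change. The sketch below uses SHA-256 from Python's hashlib; the helper bit_difference is a name introduced here for illustration.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

original = b"Hello"
# Flip the lowest bit of the first byte ("H" becomes "I").
flipped = bytes([original[0] ^ 0x01]) + original[1:]

d1 = hashlib.sha256(original).digest()
d2 = hashlib.sha256(flipped).digest()

changed = bit_difference(d1, d2)
print(f"{changed} of {len(d1) * 8} output bits changed")
```

For a well-designed hash function, the count should land somewhere near half of the 256 output bits, though the exact number varies from input to input.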

Deterministic, Not Random

Hash functions are designed to be one-way functions, meaning that you can’t calculate the input from the output. Think of a hash function like a paper shredder. Once you put a document through the shredder, it’s easy to see the pile of shredded pieces (the output), but it’s almost impossible to put those pieces back together to recreate the original document (the input).

Additionally, hash functions should be collision-resistant. This means that it’s difficult to find two inputs that produce the same output.

This combination might make it seem like hash functions should involve a source of randomness. If the calculation is random, then it’s much easier to hit these goals. However, this would destroy most of the applications of hash functions, which rely on them being deterministic calculations. This means that hashing the same input will always produce the same output.
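Determinism can be demonstrated directly: hashing the same bytes twice gives the same digest, while a different input gives a different one. A minimal sketch with Python's hashlib (the sample messages are arbitrary):

```python
import hashlib

message = b"The same input"

first = hashlib.sha256(message).hexdigest()
second = hashlib.sha256(message).hexdigest()
other = hashlib.sha256(b"A different input").hexdigest()

print(first == second)  # True: no randomness is involved
print(first == other)   # False: a different input gives a different digest
```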

The Evolution of Hash Functions

Hash functions have existed for decades with modern hash functions replacing older ones as they became insecure. Some well-known hash functions include:

Message Digest 5 (MD5): MD5 is one of several hash functions created by Ronald Rivest. It was developed in 1991 and published in 1992. However, the hash function has been broken since 2004 when the first full collision was found in it.
SHA-1: Secure Hash Algorithm 1 (SHA-1) is a hash function designed by the National Security Agency (NSA) and published in 1995 as a replacement for SHA (now often called SHA-0), which was published in 1993. In 2005, theoretical attacks on SHA-1 were published, undermining trust in it, and a practical collision was demonstrated in 2017.
SHA-2: SHA-2 is a family of hash functions released in 2001 to replace SHA-1. This includes the well-known SHA-256 algorithm.
SHA-3: Between 2007 and 2012, the National Institute of Standards and Technology (NIST) ran a contest to choose SHA-3. In 2012, Keccak was chosen as SHA-3 and officially standardized in 2015.

Hash functions have also been created using various techniques over the years. For example, the SHA-1 and SHA-2 hash families were built using the Merkle-Damgård construction, which breaks the input into blocks of a particular size (padding them if needed) and passes them one after another through a compression function.
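The overall shape of the Merkle-Damgård construction can be sketched in a few lines. This is a toy illustration, not SHA-1 or SHA-2: the block size, the all-zero initial state, and the function names are assumptions, and the compression function simply borrows SHA-256 as a stand-in so that the example stays short.

```python
import hashlib

BLOCK_SIZE = 64  # bytes per block in this toy example

def pad(message: bytes) -> bytes:
    """Append 0x80, zero bytes, and the 8-byte message length in bits."""
    padded = message + b"\x80"
    padded += b"\x00" * ((-len(padded) - 8) % BLOCK_SIZE)
    return padded + (len(message) * 8).to_bytes(8, "big")

def compress(state: bytes, block: bytes) -> bytes:
    """Toy compression function: mixes one block into the running state."""
    return hashlib.sha256(state + block).digest()

def toy_md_hash(message: bytes) -> bytes:
    """Chain the compression function over every padded block."""
    state = bytes(32)  # fixed initial value (hypothetical)
    padded = pad(message)
    for i in range(0, len(padded), BLOCK_SIZE):
        state = compress(state, padded[i:i + BLOCK_SIZE])
    return state

print(toy_md_hash(b"Hello").hex())
```

The final state after the last block is the hash value, which is why the output size stays fixed regardless of how many blocks the input needed.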

Keccak, or SHA-3, works differently from earlier hash functions because it uses something called a 'sponge construction,' which you can think of like a kitchen sponge soaking up water. During the 'absorbing' stage, the input data is mixed into the sponge (the internal state), similar to how a sponge soaks up liquid. Then, when you squeeze the sponge, it releases water, just like SHA-3 produces the hash output. What’s unique is that you can keep squeezing the sponge to get more output, which is how the SHAKE functions defined alongside SHA-3 can produce output of any requested length, unlike its predecessors.
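The "keep squeezing" idea can be seen with SHAKE128, one of the extendable-output functions standardized with SHA-3 and available in Python's hashlib. In this small sketch, the requested lengths (16 and 64 bytes) are arbitrary choices.

```python
import hashlib

data = b"Hello"

# Ask the same sponge for different amounts of output.
short_out = hashlib.shake_128(data).hexdigest(16)  # 16 bytes
long_out = hashlib.shake_128(data).hexdigest(64)   # 64 bytes

print(len(short_out) // 2, len(long_out) // 2)  # output sizes in bytes

# Squeezing longer only extends the output; the beginning stays the same.
print(long_out.startswith(short_out))
```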

Hashing in the Realm of Cybersecurity

Hashing is used for various purposes in cybersecurity, including ensuring data integrity and as part of digital signatures. It’s distinct from encryption, which is used to protect data confidentiality.

Ensuring Data Integrity Through Hash Functions

One of the most common applications of hash functions in cybersecurity is ensuring data integrity. The goal is to make it easy to detect if data is changed, whether maliciously or through some transmission or storage error.

Hash functions are used to protect data integrity because they are deterministic, collision-resistant functions. This is useful because, given some data, no one can modify that data in a way that ensures it keeps the same hash. Therefore, if you have a trusted hash value and the corresponding data, you can easily compute the hash of the data and compare it to the trusted hash. If they match, then no changes have been made to the data. Any change to the input, even of a single bit, should produce a significant and easily detectable change in the output.
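The check described above can be sketched as follows. The function name matches_trusted_hash and the sample data are hypothetical, and the sketch assumes the trusted hash reached you through a channel an attacker could not modify.

```python
import hashlib
import hmac

def matches_trusted_hash(data: bytes, trusted_hex: str) -> bool:
    """Recompute the SHA-256 hash of data and compare it to the trusted value."""
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, trusted_hex.lower())

document = b"Transfer 100 to Alice"
trusted = hashlib.sha256(document).hexdigest()  # assumed to come from a trusted source

print(matches_trusted_hash(document, trusted))                  # unchanged data
print(matches_trusted_hash(b"Transfer 900 to Alice", trusted))  # modified data
```

Changing a single character of the data produces a completely different hash, so the second comparison fails.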

Hashing vs. Encryption

Hash functions and encryption algorithms are both cryptographic algorithms, meaning that they use similar techniques and mathematical operations to achieve their goals. However, hashing is not encryption, and the two types of algorithms are designed for different purposes.

Encryption is designed to protect the confidentiality of data using a secret key. If someone encrypts a message, then it is scrambled in a way that makes it unreadable to anyone without the decryption key. Encryption is designed to protect against eavesdroppers and is distinct from hashing due to its use of secret keys and reversibility (e.g., you want the recipient to actually be able to read the message).

In contrast, hash functions are designed to protect data integrity. There are no secret keys involved, and these functions are specifically designed to be irreversible. Also, a hash value is commonly sent alongside the original data, so hash functions don’t protect that original data from being read.

Encryption and hashing are distinct but complementary processes. Encryption prevents eavesdroppers from reading sensitive information, and hashing can help to detect modifications to data. Together, they address potential security risks of storing or transmitting sensitive data on untrusted systems (like the Internet).

The Importance of Cryptographic Hash Functions in Digital Signatures

A digital signature is a cryptographic operation that ensures both data integrity and authenticity. It accomplishes this by combining hashing and encryption.

The role of a hash function in digital signatures is to improve efficiency and enable easy detection of changes to the data. The first step of the digital signature process is to hash the data to be signed.

The signature is generated using a public key encryption algorithm in reverse. The signer uses their private key to “encrypt” the hash of the message, generating the signature. When the message is sent, the digital signature and the signer’s public key are transmitted with it.

Using the public key, the recipient can “decrypt” the signature to produce the hash of the message. They can then compute the hash of the message they received and compare the two versions to see if anything changed.

In a digital signature, hashing makes it easy to detect changes to the data, protecting integrity. Public key cryptography provides authentication since only someone with knowledge of the private key could generate a valid digital signature that could be validated using their public key.
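The hash-then-sign flow can be illustrated with textbook RSA. The sketch below is a toy under loud assumptions: the key is built from two small, publicly known primes, there is no padding, and the names sign and verify are hypothetical. Real systems should use a vetted library and a standardized signature scheme instead.

```python
import hashlib

# Toy key built from two small, publicly known primes. Far too weak for real use.
P = 2**61 - 1
Q = 2**31 - 1
N = P * Q
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)

def message_hash(message: bytes) -> int:
    """Hash the message and reduce it into the range the toy key can handle."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message: bytes) -> int:
    """'Encrypt' the hash with the private key to produce the signature."""
    return pow(message_hash(message), D, N)

def verify(message: bytes, signature: int) -> bool:
    """'Decrypt' the signature with the public key and compare the hashes."""
    return pow(signature, E, N) == message_hash(message)

signature = sign(b"Pay Bob 10 coins")
print(verify(b"Pay Bob 10 coins", signature))    # original message
print(verify(b"Pay Bob 1000 coins", signature))  # altered message
```

Only the short hash is signed rather than the whole message, which is where the efficiency gain comes from.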

Practical Applications of Hashing

Hash functions are used for various purposes. These include everything from enhancing the efficiency of data lookups to security applications to enabling technologies like blockchain.

Hashing in Data Retrieval

Hash tables are an application of hashing in computer science designed to enhance the speed and efficiency of data retrieval. They use hash functions for indexing when storing records. For example, imagine the scenario where an application needs to store various user records that are accessed based on their name. Without a hash table, the application would need to search through the list from the beginning of the alphabet until it reached that user’s record, which could take a variable amount of time depending on the number of users and the name in question.

With a hash table, the records are organized into memory based on the hash of the user’s name. When performing a lookup, the application would hash the name in question and would know where in memory to look for the record.

This use case takes advantage of the fact that a good hash function spreads its outputs evenly. In theory, the hash values of the users’ names should be uniformly distributed over the space of possible outputs. For example, if you had 128 users and a hash function with 256 possible outputs, each output value would have half a record on average. Some names will still share a hash value (with this many users, at least one shared value is almost certain), but it’s unlikely that more than a handful of users would share any single value.

This is useful because it dramatically speeds up the process of retrieving a user’s record. For the price of a single hash calculation, the application moves from searching through a list of 128 user records to only checking the set associated with a particular hash value. On average, this changes the complexity from 64 checks to 1-2.
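A bare-bones version of such a table might look like the sketch below. The bucket count, the generated user names, and the function names are assumptions for illustration; real hash tables typically use faster non-cryptographic hash functions and resize as they fill up.

```python
import hashlib

NUM_BUCKETS = 256

def bucket_index(name: str) -> int:
    """Map a name to one of 256 buckets using the first byte of its hash."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return digest[0] % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]

def store(name: str, record: dict) -> None:
    buckets[bucket_index(name)].append((name, record))

def lookup(name: str):
    """Check only the one bucket that the name hashes to."""
    for stored_name, record in buckets[bucket_index(name)]:
        if stored_name == name:
            return record
    return None

for i in range(128):
    store(f"user{i}", {"id": i})

print(lookup("user42"))
print("largest bucket:", max(len(b) for b in buckets))
```

A lookup only ever scans one bucket, so its cost depends on the size of that bucket rather than on the total number of users.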

How Hashing Facilitates Secure Password Storage

Password-based authentication systems need a way to determine if a provided password is correct. The easiest way of doing so would be to store a list of usernames and the associated passwords. When a user tries to authenticate, the system could then compare a provided password with the stored version. However, this approach is horribly insecure. Anyone with access to the password file could steal the passwords of everyone else on the system.

Password storage best practices involve storing a hashed and salted password rather than the plaintext password. Since hash functions are deterministic, comparing the hash of a provided password to a stored hash is just as good as comparing the two passwords themselves. Collision resistance means that it’s infeasible for an attacker to find another password that would produce the same hash and wrongfully provide access to a user’s account.

Salting is a mechanism used to protect against password-guessing attacks using hashes. Since hash functions are deterministic, two users with the same password, a regrettably common occurrence, would have the same password hash. Additionally, an attacker could precompute the hashes of common passwords, producing something called a rainbow table. Salting involves adding a randomly generated value to a password before hashing it. This value is stored alongside the password hash, enabling it to be used when computing hashes during the authentication process. Its role is to ensure that identical passwords don’t produce identical hashes and to render rainbow tables unusable.
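Salted password storage can be sketched with Python's standard library. Note the swap: rather than a single plain hash, this sketch uses PBKDF2-HMAC-SHA256 (hashlib.pbkdf2_hmac), which combines the salt with many rounds of hashing. The iteration count, salt length, and function names are illustrative assumptions, not recommendations.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor; tune for real deployments

def hash_password(password: str):
    """Return (salt, digest) for storage; the plaintext is never stored."""
    salt = os.urandom(16)  # unique random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the attempt with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest, expected)

salt_a, hash_a = hash_password("hunter2")
salt_b, hash_b = hash_password("hunter2")

print(hash_a == hash_b)                             # same password, different salts
print(verify_password("hunter2", salt_a, hash_a))   # correct password
print(verify_password("wrong", salt_a, hash_a))     # incorrect password
```

Because each user gets a different salt, two users with the same password end up with different stored hashes.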

Signature-Based Antimalware

Signature-based analysis is a key component of any antimalware system. These systems will maintain a database of known malware variants and check new files against it. If a file matches a signature in the database, then the file is flagged as malware.

If antimalware used filenames in this database, it would be easy to evade detection, and storing the entirety of each malicious executable would be inefficient and unscalable. Instead, the signatures in the malware database are hash values of the executables. This provides an efficient but secure method of identifying whether a file is a known malicious executable.
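In its simplest form, a hash-based signature check is a set lookup. The sketch below uses a made-up one-entry database and a harmless placeholder byte string in place of a real malicious file; actual products combine this with many other detection techniques.

```python
import hashlib

# Hypothetical signature database: SHA-256 hashes of known-malicious files.
known_bad_sample = b"pretend this is a malicious executable"
KNOWN_MALWARE_HASHES = {hashlib.sha256(known_bad_sample).hexdigest()}

def is_known_malware(file_bytes: bytes) -> bool:
    """Hash the file contents and look the hash up in the database."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_MALWARE_HASHES

print(is_known_malware(known_bad_sample))           # matches regardless of filename
print(is_known_malware(b"an ordinary document"))    # unknown file
```

The lookup depends only on the file contents, so renaming a known-malicious file does not change the result.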

Blockchain and Hashing

Blockchain technology is an example of a technology that couldn’t exist without hash functions. Many key features of the blockchain are based on the properties of hash functions, including:

Consensus Algorithms: Blockchain consensus algorithms officially determine who creates the next block in the chain. This requires an algorithm that each node in the network can independently calculate and that is impossible to cheat. Hash functions play a key role in this because they are deterministic, one-way functions.
Merkle Trees: Blockchains use Merkle trees or similar data structures to include a summary of a block’s transactions in a block header. Merkle trees are built using hash functions, which allow a single hash value to be stored in the block header while ensuring the integrity of the tree and all the transactions it contains.
Ledger Integrity: Blockchain gets its name from the fact that the distributed ledger is built using a series of blocks that are “chained” together. The links in these “chains” are hash values. Each block’s header contains the hash of the previous block header, making it impossible to rewrite one block in the chain without rewriting every block that follows it.
Digital Signatures: Blockchain uses digital signatures to ensure the integrity and authenticity of each transaction recorded on the digital ledger. Hash functions are a core part of digital signature algorithms.
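The Merkle tree and ledger integrity ideas above can be sketched together. This is a simplified toy, not the format of any real blockchain: the block layout, the all-zero genesis value, the rule of duplicating the last hash on odd levels, and every function name are assumptions made for illustration.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(transactions) -> str:
    """Hash transactions pairwise, level by level, down to a single root."""
    if not transactions:
        return sha256_hex(b"")
    level = [sha256_hex(tx.encode("utf-8")) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [sha256_hex((level[i] + level[i + 1]).encode("utf-8"))
                 for i in range(0, len(level), 2)]
    return level[0]

def header_hash(block: dict) -> str:
    return sha256_hex((block["prev_hash"] + block["merkle_root"]).encode("utf-8"))

def build_chain(batches) -> list:
    chain, prev = [], "0" * 64  # placeholder "previous hash" for the first block
    for txs in batches:
        block = {"prev_hash": prev, "merkle_root": merkle_root(txs),
                 "transactions": list(txs)}
        chain.append(block)
        prev = header_hash(block)
    return chain

def is_valid(chain) -> bool:
    """Recompute every Merkle root and every link between blocks."""
    prev = "0" * 64
    for block in chain:
        if block["prev_hash"] != prev:
            return False
        if block["merkle_root"] != merkle_root(block["transactions"]):
            return False
        prev = header_hash(block)
    return True

chain = build_chain([["a->b:5", "b->c:2"], ["c->a:1"]])
print(is_valid(chain))
chain[0]["transactions"][0] = "a->b:500"  # tamper with an old transaction
print(is_valid(chain))
```

Even if the tampered block's Merkle root were recomputed, its header hash would change and no longer match the value stored in the next block.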

Securely Using Hash Functions

Some best practices for using hash functions include:

Choose the Right Cryptographic Algorithm: Hash functions and encryption algorithms may seem similar, but they’re designed for different purposes. Hash functions should only be used to ensure data integrity, not confidentiality.
Select a Secure Hash Function: Hash functions are considered insecure once a hash collision has been detected for them. Although these can be used for some purposes (like hash tables), only secure algorithms should be used for password storage and other security-focused applications.
Use an Existing Implementation: Cryptographic algorithms like hash functions can be tricky to implement and fragile. Use an existing implementation of a trusted algorithm rather than developing your own.
Follow Best Practices: Always follow established best practices when using hash functions for security-related tasks. For example, passwords should always be stored using a secure hash algorithm and a unique, random salt for each password.

How Bitdefender can help

Hashing is a core component of many cybersecurity solutions. For example, GravityZone uses file hashes to help identify known malware attempting to infect a computer. GravityZone Integrity Monitoring can monitor entire systems for unauthorized changes.

This includes the ability to create rules to identify hash changes in files and the Windows registry. This can serve as a powerful tool to organizations looking to improve their security or who are in the process of achieving a zero-trust architecture.

Frequently Asked Questions

How is hashing used for software security?

Hashing is commonly used to ensure the integrity of software that is downloaded from the Internet. A webpage might publish a file hash alongside an application, or the developer might digitally sign the application. If the hash computed from the download doesn’t match the published one, the file shouldn’t be trusted, and if the signature is invalid, the operating system may refuse to run the application.

How can hash functions be reversed?

Hash functions are one-way functions, meaning that it’s impossible to reverse them to extract the original input. However, it is possible to perform a brute force search for a matching input. For example, password crackers hash common passwords and look for a match within a database of leaked password hashes.
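The brute force search described here amounts to hashing guesses until one matches. The sketch below uses a tiny made-up wordlist and an unsalted SHA-256 hash standing in for a leaked database entry; the function name find_matching_input is hypothetical.

```python
import hashlib

def find_matching_input(leaked_hash: str, candidates):
    """Hash each candidate and return the first one whose hash matches."""
    for candidate in candidates:
        if hashlib.sha256(candidate.encode("utf-8")).hexdigest() == leaked_hash:
            return candidate
    return None

common_passwords = ["123456", "password", "qwerty", "letmein"]  # hypothetical wordlist
leaked = hashlib.sha256(b"qwerty").hexdigest()  # stands in for a leaked, unsalted hash

print(find_matching_input(leaked, common_passwords))  # qwerty
```

The hash function is never reversed; the search succeeds only because the original input was guessable.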

How does hashing relate to key derivation functions (KDFs)?

A key derivation function (KDF) converts a password into a cryptographic key. This typically involves combining the password with a salt and hashing it many times (often thousands of iterations or more) to make a brute force search much slower and more expensive for an attacker.