How Compression Programs Compress and Decompress Your Files
Most computer files contain a great deal of redundancy: the same information appears in them many times over. File compression programs simply eliminate this excess. Instead of storing a repeated piece of data over and over, a compression program stores it once and, wherever it was repeated, leaves a reference pointing back to it. The decompression program later uses those references to put every piece back in its place and rebuild the original file. If this is not clear yet, do not rush, as we have not finished. You will understand more from the examples and details that follow, God willing. As an example, I will present to you a verse that says:
“If you do not know, that is a tragedy… And if you know, the tragedy is greater”
In its original wording, before translation, this verse contains 12 words made up of 43 letters, plus 12 spaces and two periods “..”. If each letter, space, or punctuation mark takes up one byte of memory, the total size of the verse is 57 bytes. To reduce this size, we need to look for repeated items that can be dispensed with. You will notice:
- “you” appears twice
- “know” appears twice
- “tragedy” appears twice
These three words make up nearly half the letters in the second half of the verse. To write the second half, then, we do not need to spell these words out again: we can point back to where they already appear in the first half. Let us look more closely at how compression programs carry out this process.
How Files are Constructed Inside a Compressed File
Most compression programs use a system called the LZ adaptive dictionary-based algorithm. “LZ” refers to the designers of the algorithm, Lempel and Ziv, and “dictionary” refers to the method used to catalogue pieces of data inside the compressed file. Systems differ in how they arrange and build these dictionaries, but a dictionary can be as simple as a numbered list. Going back to our verse, we take the repeated words and give each one a number, then replace every occurrence of the word with its number.
1- you
2- know
3- tragedy
And thus the verse will be:
If 1 do not 2, that is a 3… And if 1 2, the 3 is greater
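The numbered-list substitution can be tried out with a short Python sketch. It only illustrates the hand-built word dictionary and is not how real compression tools work. It assumes the English rendering of the verse with the ellipsis written as three plain periods, and a dictionary in which 1, 2, and 3 stand for “you”, “know”, and “tragedy”. The names VERSE, DICTIONARY, repeated_words, encode, and decode are made up for this example.

```python
import re
from collections import Counter

VERSE = ("If you do not know, that is a tragedy... "
         "And if you know, the tragedy is greater")

# Assumed dictionary for this English rendering: number -> repeated word.
DICTIONARY = {1: "you", 2: "know", 3: "tragedy"}


def repeated_words(text):
    """Return the words that occur more than once (case-sensitive)."""
    counts = Counter(re.findall(r"[A-Za-z]+", text))
    return {word: n for word, n in counts.items() if n > 1}


def encode(text, dictionary):
    """Replace every dictionary word with its number."""
    number_of = {word: str(num) for num, word in dictionary.items()}
    pattern = r"\b(" + "|".join(map(re.escape, number_of)) + r")\b"
    return re.sub(pattern, lambda m: number_of[m.group(1)], text)


def decode(text, dictionary):
    """Replace every number with the word it stands for."""
    return re.sub(r"\b\d+\b", lambda m: dictionary[int(m.group())], text)


encoded = encode(VERSE, DICTIONARY)
print(repeated_words(VERSE))
print(encoded)
print(decode(encoded, DICTIONARY) == VERSE)
```

The word count also reports “is” as repeated. It is left out of the assumed dictionary because swapping a two-letter word for a digit saves very little. Note too that this toy decoder would misread any digits already present in the text, which is one reason real formats do not work this way.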
If you understand how this system works, you can easily reconstruct the original sentence using only the dictionary and the numbers. This is exactly what the decompression program on your device does when it unpacks a compressed file you downloaded from the internet. You may also have come across compressed files that open themselves. For this to happen, a small extraction program is bundled with the compressed file, and it automatically rebuilds the original file, using the agreed-upon dictionary, once it has been downloaded to your device and opened. But how much space did we actually save with this system?
The encoded sentence is certainly shorter than the original, “If you do not know, that is a tragedy… And if you know, the tragedy is greater,” but keep in mind that the dictionary must always travel with the file so that the decompression program can undo the substitutions. Real compression of real files is much more complex than this, but to get a feel for the numbers, let’s return to the assumption that each letter, space, or punctuation mark takes one byte. The original sentence was 57 bytes, and our compressed sentence, including spaces, is 39 bytes. Adding the dictionary at 15 bytes (3 words + 3 numbers) brings the file to 54 bytes, which is not far from the original file size!
However, this is just one verse from a complete poem. Imagine the compression program working through the rest of the poem, where these words and others recur. It keeps adding repeated items to the dictionary, and the savings grow as it arrives at the dictionary best suited to the whole text.
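The same byte accounting can be repeated in Python for the English rendering of the verse. This is a rough sketch under stated assumptions: the ellipsis is written as three plain periods, the dictionary holds “you”, “know”, and “tragedy”, and every character costs one byte. Its figures therefore differ from the 57, 39, 15, and 54 bytes quoted above, which were counted on the verse in its original wording. All names here are hypothetical.

```python
ORIGINAL = "If you do not know, that is a tragedy... And if you know, the tragedy is greater"
ENCODED = "If 1 do not 2, that is a 3... And if 1 2, the 3 is greater"

# Assumed dictionary: number -> repeated word.
DICTIONARY = {1: "you", 2: "know", 3: "tragedy"}


def dictionary_size(dictionary):
    """One byte per letter of each word plus one byte per digit of its number."""
    return sum(len(word) + len(str(num)) for num, word in dictionary.items())


original_size = len(ORIGINAL)             # one byte per character
encoded_size = len(ENCODED)
dict_size = dictionary_size(DICTIONARY)
total_size = encoded_size + dict_size     # the dictionary travels with the file
saved = original_size - total_size

print(original_size, encoded_size, dict_size, total_size, saved)
```

For this rendering the script reports 80 bytes for the original, 58 for the encoded sentence, and 17 for the dictionary, giving 75 bytes in total. The saving is again only a few bytes, which matches the point made above: on a single verse the dictionary eats most of the gain.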
Finding Common Parts
In the example we discussed, we picked out the repeated words and formed a dictionary from them, which is the most obvious way for a person to build one. Compression programs, however, find common parts and build the dictionary differently. They have no concept of individual words. Instead, they carefully select repeated parts. Note that I said repeated parts, not repeated words. Applied to the previous verse, this gives a completely different dictionary. If a compression program analyzed this verse (let’s agree from now on to call it a phrase, for simplicity), it would find that the letter “f” is followed by a space twice, in “If” and in “if,” so it could add “f + Space” to the dictionary. Within this one phrase such an entry is of little use, since it hardly reduces the size. But if the program carries on through the rest of the poem, it will find “f + Space” repeated many times, and the entry becomes well worth adding.
If, on the other hand, the program has only this phrase to compress, it will look for other ways to build the dictionary so as to shrink the text as much as possible: replacing two or three letters together, a letter plus a space, or the last letter of one word together with the first letters of the next, and so on until it arrives at a suitable dictionary. This is a very complex process, and it explains the word “adaptive” in “LZ adaptive dictionary-based algorithm”: the dictionary adapts to the file being compressed.
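To see a dictionary built from repeated parts rather than words, here is a minimal Python sketch of LZ78, one published member of the LZ family. It is chosen only because it is short. The text does not say which variant any particular program uses, and real tools add many refinements. The function names are made up for this example.

```python
def lz78_compress(text):
    """Encode text as (dictionary index, next character) pairs."""
    dictionary = {"": 0}      # phrase -> index; index 0 is the empty phrase
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch      # keep extending while the phrase is already known
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)   # learn the new phrase
            phrase = ""
    if phrase:                # flush a known phrase left over at the end
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs


def lz78_decompress(pairs):
    """Rebuild the text, regrowing the same dictionary on the fly."""
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)


verse = "If you do not know, that is a tragedy... And if you know, the tragedy is greater"
pairs = lz78_compress(verse)
print(len(verse), "characters ->", len(pairs), "pairs")
print(lz78_decompress(pairs) == verse)
```

The dictionary entries here are arbitrary runs of characters, spaces included, and they grow longer as more of the text is seen. In this particular variant the dictionary is never stored with the output, because the decoder rebuilds it step by step from the pairs.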
Now, How Good is this System?
The percentage by which a file shrinks depends on a number of factors, including the type and size of the file and the compression program used. In most languages, certain letters and words often appear together in the same patterns. Because of this high rate of repetition, text files compress very well, with size reductions of 50% or more in a normal text file. Most programming languages are also repetitive in nature, because they rely on a small set of commands used again and again, so source code compresses easily too. Files that contain a lot of unique information, such as MP3 files and graphics, are hard to compress with this system because they repeat very little of the information within them.
So now you know: the more repetition a file contains, the more it can be compressed, as you can see by referring back to our earlier example. Had we shown you the rest of the poem from which this verse was taken, we would have replaced the repeated parts throughout it, merging letters together and comparing different substitutions to arrive at a suitable dictionary. I believe you have grasped the idea.
The efficiency of compression also depends on the scheme used by the compression and decompression program. Some programs concentrate on the patterns found in particular types of files, while others keep dictionaries inside other dictionaries, which compresses large files efficiently but not small ones. All compression programs of this type share the same basic idea, however, and their developers are always trying to build a better system than the one currently in use.
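The effect of repetition is easy to observe with Python’s standard zlib module, which implements DEFLATE, a scheme that combines LZ77-style matching with Huffman coding. The sketch below compares highly repetitive text with seeded pseudo-random bytes. The random bytes are only a stand-in for data with no repeated patterns, an extreme case rather than a real MP3 or image file, and the helper name ratio is made up for this example.

```python
import random
import zlib


def ratio(data):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


verse = b"If you do not know, that is a tragedy... And if you know, the tragedy is greater\n"
repetitive = verse * 200                      # highly repetitive text

rng = random.Random(0)                        # fixed seed so runs are repeatable
patternless = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

print(f"repetitive text : {ratio(repetitive):.3f}")
print(f"patternless data: {ratio(patternless):.3f}")
```

The repetitive input should shrink to a small fraction of its size, while the patternless input stays at roughly its original size or even grows slightly because of format overhead.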
Lossless Compression – The type of compression we have talked about so far is called lossless compression because it lets you recover the original file without any loss when it is decompressed. All lossless compression programs are built on the same idea: break the file into a smaller form for transmission or storage, then restore it to its original form so that it can be used again.
Lossy Compression – This is another type of compression, meaning compression with loss. It works differently: it simply discards bits of information judged unnecessary, reshaping the file so that it is smaller. This type is often used to reduce the size of bitmap images, which tend to be large. To see how it works, let us think about the following question.
How can your device compress an image coming to it from a scanner?
Lossless compression programs cannot do much with this type of file. Some parts of an image may look alike, such as a sky that is entirely blue, but most of the individual pixels differ slightly from one another. To make the image smaller without reducing its resolution, you have to change the color values of certain pixels. If the image contains a large area of blue sky, the compression program picks a single shade of blue to stand for all the sky pixels, then rewrites the file so that every sky pixel refers to that one shade. If the compression works well, you will not notice the change, but the file size will certainly decrease.
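The “one blue for the whole sky” idea can be imitated with a crude color quantization step in Python. This is a sketch only: real lossy image formats are considerably more sophisticated, and the pixel data, the step size of 32, and the names quantize and packed_size are all invented for the illustration.

```python
import random
import zlib


def quantize(pixels, step=32):
    """Snap every color channel down to a multiple of `step`, discarding detail."""
    return [tuple((c // step) * step for c in px) for px in pixels]


def packed_size(pixels):
    """Bytes needed after lossless (zlib) compression of the raw RGB values."""
    raw = bytes(c for px in pixels for c in px)
    return len(zlib.compress(raw))


rng = random.Random(0)                        # fixed seed so runs are repeatable
# Made-up "sky": 2000 pixels, each a slightly different shade of blue.
sky = [(100 + rng.randrange(8), 150 + rng.randrange(8), 230 + rng.randrange(8))
       for _ in range(2000)]
flat_sky = quantize(sky)

print(len(set(sky)), "distinct colors ->", len(set(flat_sky)))
print(packed_size(sky), "bytes ->", packed_size(flat_sky), "bytes")
```

After quantization every sky pixel carries the same value, so the lossless stage that follows has far more repetition to exploit. The small differences between the original pixels are gone for good, which is exactly the loss the name refers to.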
Of course, with lossy compression you cannot get the original file back after compression. For this reason, you should not use this type of compression for anything that must be restored exactly as it was, such as software programs, databases, and the like.
Thanks be to Allah, I have finished… If the topic is not yet clear to you and you still have difficulty with any point in it, I will be happy to reply to you. And remember not to let anything pass you by without understanding at least its basic concept… Thank you.