1st International Conference on Advances in Science, Engineering and Robotics Technology 2019 (ICASERT 2019)
Efficient Compression Scheme for Large
Natural Text Using Zipf Distribution
Md. Ashiq Mahmood and K.M. Azharul Hasan
Department of Computer Science and Engineering
Khulna University of Engineering & Technology (KUET)
Khulna, Bangladesh
Email: ashiqmahmoodbipu@gmail.com, azhasan@gmail.com
Abstract— Data compression is the process of modifying, encoding or converting the bit structure of data so that it consumes less space. Character encoding is closely related to data compression: it represents each character by some encoding system, putting a sequence of characters into a specific format for efficient transmission or storage. Data compression covers a huge domain of applications, including data communication, data storage and database development. In this paper we propose a new and efficient compression algorithm for large natural-text datasets, called 5-Bit Compression (5BC), in which any character is encoded by 5 bits. The algorithm encodes any English or Bangla character with 5 bits using table lookup. The lookup table is constructed using the Zipf distribution, a discrete distribution of the commonly used characters in different languages. 8-bit characters are converted to 5 bits by splitting the characters into 7 sets held in a single table; a character's position in the table then encodes it uniquely in 5 bits. Text compressed by 5BC is reduced by more than 60% of its original size. The algorithm for decompressing back to the original data is also described. After the 5BC output string is produced, LZW and Huffman techniques compress it further. Our experimental results demonstrate promising performance.
Keywords— encoding; compression; decompression; 5-bit compression; compression ratio.
I. INTRODUCTION
Data compression is a widely used and important topic in information science because of the storage capacity and transfer speed limits of different systems [1][2]. Its evident benefits include reduced storage requirements and transmission bandwidth, encoding with fewer bits, shorter transmission times and effective channel utilization [3][5]. A good compression scheme that remaps text onto shorter bit sequences is therefore vital [1][2][5], and a new encoding system is valuable if it represents a data source as precisely as possible using the fewest bits while leaving the meaning unaltered [1][2][11]. In many cases it is also essential to restore the compressed form back to the original representation [6]. In this paper the Zipf distribution is used to construct the dictionary of characters. It derives from Zipf's law, an empirical law of mathematical statistics which refers to the fact that many kinds of data studied in the physical and social sciences can be approximated by a Zipf distribution. Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table [4]. Thus the most frequent word occurs approximately twice as often as the second most frequent word and three times as often as the third most frequent word. Data compression techniques are broadly classified as lossy and lossless [8]. A lossy technique separates essential from immaterial data, compresses the essential part losslessly and discards the rest; in effect it reduces the size of a document by removing superfluous text or data [1][11]. The proposed 5BC algorithm is lossless [1][2][8] and supports both forward and backward mapping. 5BC converts characters from 8 bits to 5 bits by splitting the characters into 7 sets held in a single table; the characters in this table, i.e. the dictionary, are arranged using the Zipf distribution [4][12]. Similar characters are kept together in the same set, since this produces shorter bit-code sequences. 5BC can compress data by more than 60% of the original size, and the scheme shows promising efficiency.
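To make the rank-frequency idea concrete, the following minimal Python sketch (the sample string is only a placeholder, not the corpus used in this paper) ranks the characters of a text by frequency, which is the ordering principle behind the 5BC lookup table; under Zipf's law the r-th ranked symbol occurs with frequency roughly proportional to 1/r.

from collections import Counter

def rank_by_frequency(text):
    """Return characters ordered from most to least frequent."""
    return [ch for ch, _ in Counter(text).most_common()]

sample = "this is a small sample of natural english text"  # placeholder corpus
ranking = rank_by_frequency(sample)
# Under Zipf's law the top-ranked character should occur roughly twice as
# often as the second-ranked one and three times as often as the third.
print(ranking[:5])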
II. RELATED WORKS
The Huffman technique is a lossless technique that relies on entropy [9]. It constructs a binary tree, and the codes are read off from paths in that tree. Huffman coding is a variable-length encoding, since the code lengths vary with symbol frequency; in contrast, our 5BC algorithm always works with exactly 5 bits. The rule used in this technique is to encode the more frequently occurring data with fewer bits. Huffman coding is used in JPEG files, and JPEG 2000 (JP2) is an image compression standard and coding system [13]. Huffman algorithms are classified as Static Huffman and Adaptive Huffman [9]. Static Huffman first computes the frequencies and then builds one tree used for both the forward and the backward mapping [9][15], whereas Adaptive Huffman builds up its trees while computing the frequencies on the fly, constructing a tree in each of the two running processes [14]. There are also compression techniques based on dictionaries rather than statistical structure [13][15]. A well-known dictionary-based technique is Lempel-Ziv-Welch (LZW) [9]. In this algorithm, a string αK is formed from a string α, found in the dictionary, followed by a character K; αK is then added to the dictionary [10], and the code of α replaces it in the output [5]. The dictionary holds dynamic information rather than being static [10]; it is not transmitted, because the same dictionary can be rebuilt from the compressed information at decoding time. It is constructed identically during compression and decompression and discarded once the compression or decompression process has ended [10]. Another technique, the LZ technique [5][13], is likewise popular for its simplicity and its effective compression ratios. Our technique instead uses a lookup table that remains static and does not change when new data arrives; the lookup table, i.e. the dictionary, stays fixed and can be used for compression and decompression whenever required.
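For reference, here is a minimal Python sketch of the textbook LZW loop described above. It is the generic algorithm, not necessarily the exact variant benchmarked in Section V: the dictionary is primed with single characters, the longest known match α is emitted as a code, and the new string αK is added.

def lzw_compress(data):
    """Textbook LZW: emit the code for the longest known string, then add it + next char."""
    dictionary = {chr(i): i for i in range(256)}   # primed with single characters
    next_code = 256
    alpha, out = "", []
    for k in data:
        if alpha + k in dictionary:
            alpha += k                             # grow the current match
        else:
            out.append(dictionary[alpha])          # emit code for alpha
            dictionary[alpha + k] = next_code      # add the new string alpha+K
            next_code += 1
            alpha = k
    if alpha:
        out.append(dictionary[alpha])
    return out

print(lzw_compress("abababab"))   # repeated patterns collapse to a few codes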
III. EFFICIENT COMPRESSION SCHEME FOR LARGE NATURAL TEXT
Generally, encoding a character in memory requires 8 bits of space [1][2]. We introduce a 5-bit character encoding scheme, called 5BC, that represents a character by 5 bits instead of 8. The algorithm works with all the ordinary printable characters of English and Bangla found on a normal keyboard. A lookup table, illustrated in TABLE I, is built to represent the characters in 5 bits. With 5 bits, 2^5 = 32 combinations are possible; 25 of those combinations are used to encode characters and the remaining 7 are used for set representation. The characters are split into 7 sets. In the lookup table,
• Bangla characters are placed in Set1, Set2 and Set3;
• English characters are placed in Set4, Set5, Set6 and Set7;
• 7 binary combinations are reserved for the 7 set codes.
Unicode takes 16 bits of memory to represent natural-language character sets, and the modified 5-bit characters are finally converted into 16-bit Unicode characters.
TABLE I. Lookup table for 5BC
[Columns Set1-Set3 hold the Bangla characters; their glyphs are not legible in this copy and are left out below.]

Serial no.  Decimal value  Binary value  Set4   Set5    Set6   Set7
1           0              00000         A      Space   1      X
2           1              00001         B      ,       2      Y
3           2              00010         C      a       3      Z
4           3              00011         D      b       4      x
5           4              00100         E      c       5      y
6           5              00101         F      d       6      z
7           6              00110         G      e       7      '
8           7              00111         H      f       8      !
9           8              01000         I      g       9      "
10          9              01001         J      h       0      #
11          10             01010         K      i       +      \
12          11             01011         L      j       -      ~
13          12             01100         M      k       *      ^
14          13             01101         N      l       /      |
15          14             01110         O      m       =      $
16          15             01111         P      n       (      :
17          16             10000         Q      o       )      ;
18          17             10001         R      p       {      _
19          18             10010         S      q       }      `
20          19             10011         T      r       <
21          20             10100         U      s       >
22          21             10101         V      t       [
23          22             10110         W      u       ]
24          23             10111         .      v       %
25          24             11000         ?      w       &
26          25             11001         Set-1 (set-change code)
27          26             11010         Set-2 (set-change code)
28          27             11011         Set-3 (set-change code)
29          28             11100         Set-4 (set-change code)
30          29             11101         Set-5 (set-change code)
31          30             11110         Set-6 (set-change code)
32          31             11111         Set-7 (set-change code)
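For illustration, the English half of TABLE I can be written down as plain data. The Python sketch below omits Set1-Set3, whose Bangla glyphs are not recoverable from this copy; the 5-bit values 25-31 act as the set-change codes.

# Positions 0-24 inside each set are the 5-bit character codes; the values
# 25-31 (binary 11001-11111) select Set1-Set7 respectively (code = 24 + set).
SET4 = list("ABCDEFGHIJKLMNOPQRSTUVW") + [".", "?"]
SET5 = [" ", ","] + list("abcdefghijklmnopqrstuvw")
SET6 = list("1234567890") + list("+-*/=(){}<>[]") + ["%", "&"]
SET7 = list("XYZxyz") + list("'!\"#\\~^|$:;_`")   # only 19 entries are used
SETS = {4: SET4, 5: SET5, 6: SET6, 7: SET7}       # Set1-Set3 (Bangla) omitted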
Algorithm 1: Forward Mapping
Input: Normal string S
Output: An encoded compressed string Sc
1: Insert a set-change code wherever the character set changes,
2: representing the string S as S'.
3: Using the lookup table,
4: replace every symbol of S' by its corresponding 5-bit representation.
5: Let the resulting bit string contain K bits.
6: if K % 16 != 0 then
7:    append m bits taken one by one from the bits of the last 5-bit set code,
8:    cycling through them until (K + m) % 16 = 0
9: end if
10: Divide the bit string into (K + m)/16 groups of 16 bits.
11: Form the compressed string Sc by replacing each 16-bit group with the
    corresponding Unicode character.
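A compact Python sketch of the forward mapping follows, reusing the SETS table above. The padding rule is our reading of steps 6-9: the bits of the current set-change code are appended one at a time until the length is a multiple of 16.

def encode_5bc(text):
    """Algorithm 1 sketch: 5-bit codes plus set changes, packed into 16-bit chars."""
    codes, current = [], None
    for ch in text:
        for k, members in SETS.items():
            if ch in members:
                if k != current:                  # set changed: emit code 24+k
                    codes.append(24 + k)
                    current = k
                codes.append(members.index(ch))   # position inside the set
                break
        else:
            raise ValueError(f"{ch!r} is not in the lookup table")
    bits = "".join(f"{c:05b}" for c in codes)
    i = 0
    while len(bits) % 16 != 0:                    # pad with bits of the last
        bits += f"{24 + current:05b}"[i % 5]      # set-change code, cyclically
        i += 1
    # Pack each 16-bit group into one Unicode character.  A production version
    # would also have to avoid the surrogate range U+D800-U+DFFF.
    return "".join(chr(int(bits[j:j + 16], 2)) for j in range(0, len(bits), 16))

print(encode_5bc("I am Bangladeshi"))

Running this on the text of Example 1 below reproduces its decimal sequence 28 8 29 0 2 14 0 28 1 29 … and its first two 16-bit groups.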
Example 1:
Original Text (Input):
I am Bangladeshi
Set Representation: Set4 I Set5 space am space Set4 B
Set5 angladeshi
Decimal Representation: 28 8 29 0 2 14 0 28 1 29 2 15
8 13 2 5 6 20 9 10
5 bit representation: 11100 01000 11101 00000 00010
01110 00000 11100 00001 ….…
16 Bit representation: 1110001000111010
0000000100111000 …………
Compressed String: 脻Ჟ䑨ꡉǐ廷
Algorithm 2: Backward Mapping
Input: Compressed string Sc
Output: The decoded original string S
1: Represent Sc by the corresponding 16-bit binary value of each of its
   characters from the Unicode table.
2: Let the resulting bit stream K' contain K bits.
3: To determine the 5-bit representation, read K' five bits at a time.
4: Store the first set code of K' in C.
5: Compare every 5-bit set code of K' with C.
6: if C is not repeated then
7:    a set change has occurred, so C is updated with the new 5-bit set code and
8:    this new C is compared with the next set code;
9:    when C is repeated, the remaining bit stream is cropped,
10:   including the bit code of the repeated C.
11: end if
12: From the remaining bit stream, take 5 bits at a time and map each group to
    a character of the current set in the lookup table.
13: After excluding the set codes, the original string is obtained.
Example 2:
Compressed String: 脻Ჟ䑨ꡉǐ廷
16 Bit representation: 1110001000111010 0000000100111000 …
After dividing it into 16-bit groups, the 5 bit representation is read off:
11100 --- 28 --- Set4
01000 --- 8 --- I
11101 --- 29 --- Set5
00000 --- 0 --- space …
Decompressed String:
Set4 I Set5 space am space Set4 B Set5 angladeshi
Original Text: I am Bangladeshi
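The backward mapping admits an equally short sketch (again reusing SETS and encode_5bc from above); a repeated set-change code signals the start of the padding, which is cropped as in steps 9-10 of Algorithm 2.

def decode_5bc(compressed):
    """Algorithm 2 sketch: unpack 16-bit chars and replay the 5-bit codes."""
    bits = "".join(f"{ord(ch):016b}" for ch in compressed)
    out, current = [], None
    for j in range(0, len(bits) - 4, 5):     # whole 5-bit groups only
        code = int(bits[j:j + 5], 2)
        if code >= 25:                       # a set-change code: Set-(code-24)
            if code - 24 == current:         # repeated set code marks padding
                break
            current = code - 24
        else:
            out.append(SETS[current][code])
    return "".join(out)

print(decode_5bc(encode_5bc("I am Bangladeshi")))   # -> I am Bangladeshi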
IV. ANALYTICAL ANALYSIS
In 5BC, each 8-bit character is represented by 5 bits, so 3 bits are saved per character and the baseline saving of 5BC is (8-5)/8 = 37.5%. After the 5-bit stream is packed into 16-bit Unicode characters and compressed further, the best saving exceeds 60%. To calculate the efficiency precisely we develop a theoretical analysis; the parameters considered are shown in TABLE II.
TABLE II. PARAMETERS FOR ANALYTICAL EVALUATION

Parameter   Description
I           Original data
E           Encoded data
Ls          Length of the set codes
Li          Length of I
Ld          Length of E; Ld = Li + Ls
T           Total bytes for I
Te          Total bytes for E
η           Efficiency (Te/T)
Let the input data I have length Li. After encoding, it becomes E with length Ld, where

Ld = Li + Ls

Total bytes for I: T = 8 * Li / 8 = Li
Total bytes for E: Te = 5 * (Li + Ls) / 8 = 5 * Ld / 8

Efficiency: η = Te / T = 5 * (Li + Ls) / (8 * Li)

For usable compression we need Te < T, that is 5 * (Li + Ls) / 8 < Li, which reduces to

Ls / Li < 3/5

This ratio is the proportion between the set length and the original length. It shows that the technique remains usable as long as the length of the set codes is at most 3/5 of the actual data length; in that case the saving is positive, while if the ratio exceeds 3/5 the saving becomes negative.
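The bound is easy to check numerically; a minimal sketch follows (the figures are illustrative, not measurements from Section V).

def eta(li, ls):
    """Encoded-to-original size ratio: eta = 5 * (Li + Ls) / (8 * Li)."""
    return 5 * (li + ls) / (8 * li)

print(eta(1000, 100))   # 0.6875: a 31.25% saving before further compression
print(eta(1000, 600))   # 1.0: at Ls = (3/5) * Li the scheme stops paying off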
V. EXPERIMENTAL RESULT
Since 5BC produces a large Unicode string in a printable format, its output can be compressed further with LZW and Huffman. We experiment with five algorithms, namely 5BC, LZW, Huffman, 5BC+LZW and 5BC+Huffman, and compare them in three kinds of test: compression of files, compression ratio and compression time.
Compression of files: the compressed file sizes achieved after applying the above compression techniques to the different datasets.
Compression ratio: the compression ratios achieved after applying the above compression techniques to the different datasets.
Compression time: the compression times taken after applying the above compression techniques to the different datasets.
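As an illustration of how such measurements can be taken, the following sketch times one compressor on one file. It uses Python's zlib merely as a stand-in second-stage compressor, corpus.txt is a hypothetical test file, and encode_5bc refers to the forward-mapping sketch of Section III.

import time
import zlib

def measure(name, compress, data):
    """Report compressed size, compression ratio and wall-clock time."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(out)} bytes, "
          f"ratio {len(out) / len(data):.3f}, {elapsed:.3f} s")

data = open("corpus.txt", "rb").read()      # hypothetical test file
measure("zlib alone", zlib.compress, data)
# A 5BC+X pipeline first runs the 5BC encoder and then feeds its Unicode
# output (serialized, e.g., as UTF-16 bytes) to the second-stage compressor:
# measure("5BC+zlib",
#         lambda d: zlib.compress(encode_5bc(d.decode()).encode("utf-16")),
#         data)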
A. Compression of files:
Fig. 1. Compression of file for different data set
From Fig. 1, the results show that LZW achieves the best compression across the different file sizes, but our proposed 5BC algorithm also gives very promising results and performs considerably better than the well-known Huffman algorithm. Moreover, as the data or file size grows, our technique converges toward the compression achieved by LZW. When these two techniques are combined with 5BC, both 5BC+Huffman and 5BC+LZW improve on their stand-alone performance.
B. Compression Ratio:
Fig. 2. Compression ratio for different data set
Fig. 2 demonstrates the compression ratio for the different datasets. The results show that LZW provides a very good compression ratio initially, while the file size is small, but as the amount of data grows its efficiency declines: LZW yields a poor compression ratio when the dictionary is small and, as the character length increases, the dictionary can no longer be built up easily, so its performance drops. Our proposed scheme, in contrast, provides a steady efficiency at any file size, and its performance improves as the number of datasets increases. It achieves a distinctly better efficiency than Huffman, and at some point it also converges toward the efficiency of LZW. Huffman's efficiency is poor because the technique derives its codes from the probability or frequency of occurrence of the possible values of the source symbols [13]; when the dataset is large, frequency generation becomes more difficult and a large number of individual symbols is created, so it delivers only an average compression ratio. Another important observation is that after combining both algorithms with 5BC, the performance of Huffman improves dramatically, because it then works on data already compressed by 5BC, which is much smaller than the original. The performance of LZW is improved as well.
C. Compression time:
Fig. 3 shows the compression times for the different datasets. Our proposed scheme on its own takes an average amount of time, quite similar to Huffman alone, but after combining it with LZW or Huffman the performance improves drastically.
Fig. 3. Compression time for different data set
VI. CONCLUSION
In this research we propose an efficient encoding and decoding technique for data compression that delivers fast and promising performance. The technique is a 5-bit compression scheme whose character set, i.e. dictionary, is arranged by the Zipf distribution. The proposed technique provides a promising saving of more than 60% and can easily be used to compress huge natural-text datasets. The forward and backward mappings together make the compression complete: on decompression we recover the original text exactly. The technique can be used to build innovative and efficient software for data communication, data storage and database engineering, and it lends itself to parallel processing environments and load balancing techniques to achieve promising encoding times.
REFERENCES
[1] Ashiq Mahmood, Tarique Latif, and K. M. Azharul Hasan, “An Efficient 6 bit Encoding Scheme for Printable Characters by table lookup (6BE),” International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 468-472, 2017.
[2] Md. Ashiq Mahmood, Tarique Latif, K. M. Azharul Hasan and Riadul Islam, “A Feasible 6 Bit Text Database Compression Scheme with Character Encoding (6BC),” 2018 21st International Conference of Computer and Information Technology (ICCIT), pp. 1-6, 2018.
[3] Nieves R. Brisaboa, Antonio Fariña and Gonzalo Navarro, “Lightweight natural language text compression,” Information Retrieval, 10(1), pp. 1-33, 2007.
[4] Wentian Li, “Random texts exhibit Zipf's-law-like word frequency distribution,” IEEE Transactions on Information Theory, vol. 38, no. 6, November 1992.
[5] Md. Abul Kalam Azad, Rezwana Sharmeen, Shabbir Ahmad and S. M. Kamruzzaman, “An Efficient Technique for Text Compression,” The 1st International Conference on Information Management and Business, pp. 467-473, 2005.
[6] M. M. Kodabagi, M. V. Jerabandi and Nagaraj Gadagin, “Multilevel security and compression of text data using bit stuffing and huffman coding,” 2015 International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), pp. 800-804, 2015.
[7] Muthukumar Murugesan and T. Ravichandran, “Evaluate Database Compression Performance and Parallel Backup,” International Journal of Database Management Systems (IJDMS), vol. 5(4), pp. 17-25, 2013.
[8] Vu H. Nguyen, Hien T. Nguyen, Hieu N. Duong and Vaclav Snasel, “n-Gram-Based Text Compression,” Computational Intelligence and Neuroscience, vol. 2016.
[9] Senthil Shanmuga Sundaram and Robert Lourdusamy, “A Comparative Study Of Text Compression Algorithms,” International Journal of Wisdom Based Computing, vol. 1(3), pp. 68-76, 2011.
[10] A. Carus and A. Mesut, “Fast text compression using multiple static dictionaries,” Information Technology Journal, pp. 1013-1021, 2010.
[11] “Replacement And Bit Reduction,” SIPM, FCST, ITCA, WSE, ACSIT, CS & IT 06, pp. 407-416, 2012.
[12] N. Miladinovic, “Minimum-cost encoding of information. Coding with digits of unequal cost,” 4th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Services (TELSIKS'99), pp. 600-603, vol. 2, 1999.
[13] M. Sangeetha, P. Betty and G.S. Nanda Kumar, “A biometric iris image compression using LZW and hybrid LZW coding algorithm,” 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), pp. 1-6, 2017.
[14] Wenjun Huang, Weimin Wang and Hui Xu, “A Lossless Data Compression Algorithm for Real-time Database,” 2006 6th World Congress on Intelligent Control and Automation, pp. 6645-6648, 2006.
[15] Ahmed Mokhtar, A. Mansour and Mona A. M. Fouad, “Dictionary based optimization for adaptive compression techniques,” 2013 36th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 421-425, 2013.