A Dictionary based Compression Scheme for Natural Language Text with Reduced Bit Encoding
Md. Ashiq Mahmood and K. M. Azharul Hasan
Department of Computer Science and Engineering
Khulna University of Engineering & Technology (KUET)
Khulna, Bangladesh
Email: ashiqmahmoodbipu@gmail.com, azhasan@gmail.com
Abstract— Data compression, also called compaction, is the process of reducing the amount of data needed for the storage or transmission of a given piece of information, typically by the use of encoding techniques. Character encoding is closely related to data compression, since it represents characters with a particular encoding technique. Encoding is the process of putting a sequence of characters into a specific format for efficient transmission or storage. Data compression covers a wide range of applications, including data communication, data storage and database development. Two well-known compression techniques, Huffman and LZW, are widely used for text compression. In this paper, we propose an effective and simple compression technique for large natural-language text, based on a 5-bit encoding scheme named 5 Bit Encoding Scheme (5BE) that converts 8-bit characters to 5 bits. It can outperform Huffman and LZW in terms of compression ratio. The scheme provides an encoding algorithm that converts any 8-bit character in English and Bangla to 5 bits by using a lookup table. The lookup table is built using the Zipf distribution, a discrete distribution of commonly used characters in different languages. After converting the characters into 5 bits, we apply a k-Series scheme to build a database dictionary. With the penalty of storage for the dictionary, we compress a natural text by 87%. The dictionary is used by both the compression and decompression algorithms and is deployed on the client side, so it is constructed only once and the compression technique can then be used without interruption. The reverse algorithm to recover the original data is also presented. We compare our algorithm with the well-known Huffman and LZW techniques. Our experimental results show promising efficiency.
Keywords—
encoding; compression; decompression; 5-bit
compression; compression ratio.
IEEE International Conference on Robotics, Automation, Artificial-intelligence and Internet-of-Things (RAAICON), November-December, Dhaka, Bangladesh.
Authorized licensed use limited to: Macquarie University. Downloaded on June 23,2020 at 06:21:00 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION
Data compression, often called source coding, is a method of representing or encoding data using fewer bits than an unencoded representation would use [ ]. It relies on the receiver of the data knowing the encoding scheme used by the sender and being able to decode it to recover the original form of the data. The main purpose of compression is to reduce the extra space required to store the data and the bandwidth needed to transmit it, thereby decreasing the total cost [ ]. Although a large amount of storage space is available, the data can still exceed the limit. Data compression techniques fall into two categories [ ]: lossless compression and lossy compression [ ]. Lossless compression techniques usually exploit statistical redundancy so that the sender's data can be represented more compactly without error [ ]. Lossless compression is possible because most real-world data has statistical redundancy [ ]. In a lossless compression scheme data loss is not allowed, and the original data can be recovered from the compressed data. The other technique, lossy data compression, is possible if a minor loss of data is acceptable [ ]. In that case the original data cannot be reconstructed from the compressed data, because some redundant data is removed during the compression process [ ]. Data compression can be made more efficient by the hybridization of different techniques [ ]. The key advantage of such a technique is that it can compress the output file produced after applying certain compression schemes to a file, which yields a better result.

The Zipf distribution is used in this paper to build the lookup table of characters. It originates from Zipf's law [ ]. Zipf's law is an empirical law, formulated using mathematical statistics, which refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipf distribution. Zipf's law states that, given some corpus of natural-language utterances, the frequency of any word is inversely proportional to its rank in the frequency table [ ]. Thus the most frequent word occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.

The technique we propose, 5BE, is a lossless scheme. The scheme handles both the forward and the reverse mapping. It converts characters from 8 bits to 5 bits by dividing the characters into 7 sets and using them in a lookup table. The characters in the table are mapped using the Zipf distribution. Similar characters are kept together under the same set code, since this produces a shorter sequence of bit codes. 5BE can compress data by about 87% of the original data.
II. RELATED WORKS
In 1952 David Huffman invented the Huffman coding technique [ ]. It works with integer-length codes. A Huffman tree represents the Huffman codes for the characters that may appear in a text file. Unlike ASCII or Unicode, a Huffman code uses a different number of bits to encode different letters [ ]. Huffman coding approximates the population distribution with power-of-two probabilities. If the true distribution does consist of power-of-two probabilities and the input symbols are completely uncorrelated, Huffman coding is optimal. It is, however, optimal among all encodings that assign specific sets of bits to specific symbols in the input. The technique is a lossless scheme that depends on entropy [ ]. It builds binary trees so that the codes can be read from the binary tree structure. The Huffman technique is called variable-length encoding because it works with codes of variable length [ ] [ ]. The algorithm dynamically represents each datum with a shorter number of bits for encoding. The Huffman technique is used in JPEG files. JPEG 2000 (JP2) is an image compression standard and coding system [ ]. Static Huffman and Adaptive Huffman [ ] are the two categories of the Huffman algorithm. Static Huffman computes the frequencies first and then builds a common tree for both the forward and the reverse mapping, whereas Adaptive Huffman builds up the trees while computing the frequencies and consequently develops two trees in both of these processes.

Some compression techniques depend on dictionaries rather than on statistical relations [ ]. A well-known dictionary-based scheme is the Lempel-Ziv-Welch (LZW) technique [ ]. LZW is a universal lossless data compression technique invented by Abraham Lempel, Jacob Ziv and Terry Welch [ ]. Welch published it in 1984 as an improved implementation of the LZ78 algorithm [ ], which was previously published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the widely used Unix file compression utility compress and is used in the GIF image format [ ]. In this algorithm an individual string αK is built from a string α, which can be found in the dictionary, followed by a character K, and αK is then also added to the dictionary [ ]. Finally, the characters of the string are replaced again. The dictionary therefore behaves as dynamic data and is not static [ ]. This dictionary can be recovered from the compressed data at the time of decoding. The LZW dictionary is built up during compression and decompression and is discarded after the compression or decompression process has finished [ ]. Our scheme uses a lookup table that remains static and does not change with the entry of new data. The lookup table, i.e. the dictionary, stays fixed and can be used for compression and decompression whenever required.
III. EFFICIENT COMPRESSION SCHEME FOR LARGE NATURAL TEXT
Eight bits of memory are normally required to encode any character. The proposed technique, 5BE, is a 5-bit character encoding algorithm that represents a character by 5 bits instead of 8 bits. The scheme works with all the regular printable characters, covering both the Bangla and the English characters that can be found on a normal keyboard. A lookup table is used to represent the characters by 5 bits. The lookup table is developed by using the Zipf distribution [ ], which is a discrete distribution of commonly used characters in different languages. The table is shown in TABLE I.
Using 5 bits we can form 2^5 = 32 combinations. Of these 32 combinations, 25 are used for converting the original 8-bit characters to 5 bits, and the remaining 7 combinations are used to represent the sets. The printable English and Bangla characters are divided into 7 sets.
In the lookup table:
Characters of the Bengali alphabet are placed in Set-1, Set-2 and Set-3.
Characters of the English alphabet are placed in Set-4, Set-5, Set-6 and Set-7.
The remaining 7 binary combinations are taken for the set codes.
7$%/(,
/RRNXSWDEOHIRU%(
Seri
al
no.
Deci
mal
value
Bina
ry
valu
e
Set-
1
Set
-2
Se
t-3
S
et-4
Set
-5
S
et-6
S
et-7
 
ͧ
΃
äΨ( H  4
 ͨ΄äβ7 W  ;

ͩ
΅äζ$ D  =
 ͯ
Ά
äι2 R  -
 äΚ
·
γ5 U  
 Λä
͸
Ζ
, L  "
 ä
Ή
Ͱ1 Q  µ
 äΤ
Ί
6 V  
 έä
Η
Ͳ+ K  
 
 έäΚ
Ό
' G  
 
 έäδ
ͻ
Ҙ/ O  ?
 

͵
΂
҇& F  a
 

Ͷ
;҈8 X  A
 

ͷ
΀
҉3 S  _
 

Έ
͹Ҋ0 P 
 

Α
Β
ҋ: Z  
 

Γ
ͺ
Ҍ) I  
 

΍
Ύ
ҍ* J ^ B
 

ͼ
Ε
Ҏ< \ ` C
 

΋
ͽ
ҏ% E  
 

Ώ
ͫҐ9 Y ! 
 
 Ϳͱ.N>
 

ΐ
ͭ][@
 

΁
äΧM
 

Δ
äθTVSDFH
 
  6HW
 
  6HW

Authorized licensed use limited to: Macquarie University. Downloaded on June 23,2020 at 06:21:00 UTC from IEEE Xplore. Restrictions apply.


  6HW
 
  6HW
 
  6HW
 
  6HW
 
  6HW
A. Database Dictionary creation using k-Series:
6XEVHTXHQWWRFKDQJLQJRYHUWKHELWFKDUDFWHUVLQWRELWV
ZHKDYHDELWVWUHDPRIELWVRIHDFKFKDUDFWHU)RUDQ\LQIR
FRQWHQW7 ZHPDNHDELWVWUHDPELWVIRU HDFKFKDUDFWHU
DQGIURPWKLV ELW VWUHDP ZH SDUWLWLRQLWE\ WR WDNH  ELWV
HDFK:HVHWWUDLOLQJWKHODVW VHWQXPEHUWRPDNHLW PRG
HTXLYDOHQWWR]HURLIWKHOHQJWKRIWKHELWVWUHDPLVQWPRG
HTXLYDOHQWWR]HUR)URPWKLVELWVZHKDYHA YDULRXV
EOHQGVRIELWV6LQFHHYHU\RQHRIWKHFKDUDFWHUVLVVSRNHQWR
E\ELWV ZHLQFOXGHDIL[HGSLHFHGHVLJQEHIRUHHYHU\RQH
RIWKHELWV$IWHUDGGLQJIL[HGELWSDWWHUQZHJHW
$6&,, YDOXH UDQJH  7DEOH  VKRZV WKH FKDUDFWHUV
DORQJZLWKLWVGHFLPDODQG$6&,,YDOXHV+HQFHDQ\RIWKH
FKDUDFWHUVVKRZQLQ7DEOH EHFRPHVD FKDUDFWHUVKRZQLQ
7DEOH
7$%/(,,
&RQYHUWHGOLVWRIFKDUDFWHUV
6HULDO
1R
&KDUDFWHUV 'HFLPDO
9DOXH
%LQDU\YDOXH
 #  
 $  
 %  
 &  
 '  
 (  
 )  
 *  
 +  
 ,  
 -  
 .  
 /  
 0  
 1  
 2  
Original Text (Input):
A Compression scheme
Set Representation:
Set-4 A Set-5 space Set-4 C Set-5 ompression space scheme
Decimal Representation:
28 2 29 24 28 11 29 3 14 13 4 0 7 7 5 3 6 24 7 11 8 0 14 0
5 bit representation:
11100 00010 11101 11000 11100 …
After Dividing by 4:
1110 0000 1011 1011 1000 …
Adding 0100 to every combination:
01001110 01000000 01001011 01001011 01001000 …
Corresponding ASCII Character:
N@KKHNBOJCGCDH@CILJCCF@NKD@AL@
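The regrouping step of this example can be sketched as follows: the 5-bit codes are joined into one bit stream, cut into 4-bit groups, and each group is prefixed with 0100 to give a character in the ASCII range 64 to 79. The function name is hypothetical, and the sketch assumes the stream length is already a multiple of 4, so padding is not handled here.

```python
def codes_to_ascii(codes):
    """Join 5-bit codes, regroup into 4 bits, and prefix each group with 0100."""
    bits = "".join(format(code, "05b") for code in codes)
    if len(bits) % 4 != 0:
        raise ValueError("bit stream length must be a multiple of 4 (padding not handled)")
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(chr(int("0100" + group, 2)) for group in groups)


# 5-bit codes of "A Compression scheme", including the set-change codes 28 and 29.
codes = [28, 2, 29, 24, 28, 11, 29, 3, 14, 13, 4, 0,
         7, 7, 5, 3, 6, 24, 7, 11, 8, 0, 14, 0]
ascii_text = codes_to_ascii(codes)
print(ascii_text)
```

Twenty-four 5-bit codes give 120 bits, hence 30 characters drawn from the 16 symbols @ to O.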
 GLIIHUHQWFRPELQDWLRQV ZLWK WKHVH GHILQHG FKDUDFWHUV LQ
7$%/(,,DUHFRQVWUXFWHGZKLFKDUHVWRUHGLQDGDWDEDVH6R
ZHFDOOHGLWN6HULHVZKHUHNDFWXDOO\GHILQHVWKHQXPEHURI
FRPELQDWLRQRIWKHFKDUDFWHUV
:H YDULDWH WKH YDOXH RI N IURP  WR  DQG IRXQG WKH
GLFWLRQDU\VL]HZKLFKLVVWRUHGELQWKH'DWDEDVH
7KLVYDULDWLRQLVGHPRQVWUDWHGE\DJUDSKLQ)LJ
)URP)LJLWFDQEHVHHQWKDWE\WKHLQFUHDVLQJYDOXHRIN
WKHGLFWLRQDU\VL]HLQWKHGDWDEDVHLQLQFUHDVLQJJUDGXDOO\
)LJ 'LFWLRQDU\VL]HRIWKH'DWDEDVHIRUGLIIHUHQWN
6HULHVYDOXH
:HVWRUHWKHN6HULHVLQGDWDEDVHE\XVLQJWKHIROORZLQJ
DOJRULWKP
,IN 'DWDEDVHFUHDWHGLQWKLVIRUP
 $$
 $%
 $&
 $'
« «
7KLV DOJRULWKP LV GHYHORSHG E\ VLPSO\ PXOWLSO\ RQH
FKDUDFWHURUDVWHDPRIFKDUDFWHUWRDQRWKHUFKDUDFWHU
/LNHLIZHZDQWWRVWRUH$%
7KHQILUVWO\ZHVWRUH$WKHQPXOWLSO\$ZLWK%ZKLFKPDNHV
LW$%
$IWHUVWRULQJLWLQWRWKHGDWDEDVHZHMXVWTXHU\LW¶VLQGH[E\
D VLPSOH VHOHFWLRQ ZLWK SURMHFWLRQ RSHUDWLRQ RI GDWDEDVH
ZKHUHWKHSURMHFWLRQRSHUDWLRQFRQWDLQVWKHN6HULHVQXPEHU
DQGWKHSDUWLFXODUN6HULHVFRPELQDWLRQ
6RLIZHZDQWWRILQGWKHLQGH[RI$%ZHMXVWTXHU\LQWRWKH
GDWDEDVH ZLWK N6HULHV QXPEHU  VLQFH $% FRQWDLQV D 
FKDUDFWHUFRPELQDWLRQDQGJRWRWKHIROORZLQJWDEOHZKHUHLW
PDWFKHV$%ZLWKWKH$%RIWKHWDEOHDQGILQGWKHLQGH[
7KHGDWDEDVHLVFUHDWHGE\XVLQJWKHN6HULHVVFKHPHVRLWLV
FDOOHGN6HULHV'DWDEDVH'LFWLRQDU\
$IWHUJHWWLQJWKHLQGH[HVXVLQJWKHDERYHGDWDEDVHGLFWLRQDU\
ZHUHSUHVHQWHDFKLQGH[ZLWKE\WHE\XVLQJWKH-DYD
OutputStreamWriter() IXQFWLRQ LQ -DYD SODWIRUP ZKLFK LV
XVHGWRFRQYHUWWKHZULWWHQFKDUDFWHUVWRWKHE\WHVZULWWHQWR
WKH XQGHUO\LQJ OutputStream +HUH ZHFRQYHUW WKH ZULWWHQ

Authorized licensed use limited to: Macquarie University. Downloaded on June 23,2020 at 06:21:00 UTC from IEEE Xplore. Restrictions apply.
LQGH[WR87)ZKLFK DFWXDOO\PHDQV8QLFRGH WH[WGHILQHV
E\WH
B. Compression and Decompression Algorithm
1) Compression Algorithm
Input:1RUPDOVWULQJ6
Output:$QHQFRGHGFRPSUHVVHGVWULQJ6F
Step1: %\DGGLQJVHWFKDQJHUHSUHVHQWLQJWKHVWULQJ6E\
6C
Step2: %\XVLQJWKHORRNXSWDEOH UHSUHVHQWLQJ WKH VWULQJ
6CE\LW¶VFRUUHVSRQGLQJELWVUHSUHVHQWDWLRQ
/HWWKHUHSUHVHQWDWLYHVWULQJ6¶FRQWDLQ.ELWV
Step3: 'LYLGH.LQWKLVEHORZZD\WRILQGWKHQXPEHURI
ELWFRPELQDWLRQVDYDLODEOHLQ.
:KLOH. WKHQMXPSLQWR6WHS
:KLOH. WKHQVXPXSPELWVLQWKHIRUPDWLRQRIWKH
ODVWELWVHWFRGHE\WDNLQJHYHU\ELWRIWKHVHWFRGHRQHE\
RQHDQGLQFUHDVLQJLQWKHVDPHZD\XQWLOO.P 
Step4: 6WRUHHYHU\ELWFRPELQDWLRQVLQ.¶
Step5: $GGWRHYHU\ELWFRPELQDWLRQRI.¶WRPDNH
WKH ELQDU\ FRPELQDWLRQ RQO\ OLPLWHG WR DERYH N
6HULHVGHVFULEHGLQ7DEOH,,
Step6: 'LYLGH .¶ E\  WR ILQG WKH FRUUHVSRQGLQJ $6&,,
FKDUDFWHUVDQGVWRUHWKHUHVXOWLQ5
Step7: 'LYLGH 5 E\ WKH GLIIHUHQW YDOXH RI WKH N6HULHV
ZKHUHN WR DQGILQGWKHH[DFWORFDWLRQRIWKH
SDUWLFXODUFRPELQDWLRQ RIWKHN6HULHVIURPWKH N
6HULHV'DWDEDVH'LFWLRQDU\E\XVLQJVHOHFWLRQZLWK
SURMHFWLRQ RSHUDWLRQ JLYLQJ WKH N6HULHV QXPEHU
DQG WKH SDUWLFXODU N6HULHV FRPELQDWLRQ DQG VWRUH
WKHFRUUHVSRQGLQJLQGH[HVLQ1
Step8: 5HSUHVHQWLQJ1E\LW¶VFRUUHVSRQGLQJ8QLFRGHWH[W
LQ6FDQGPDNLQJDPDSRIWKHSDUWLFXODULQGH[HVWR
LW¶V FRUUHVSRQGLQJ 8QLFRGH WH[W LQ D GDWDEDVH
QDPHG,QGH[WR8QLFRGH
Step9: 8QLFRGHWH[WLQ6FLVFRPSUHVVHGWH[W
Example 2:
Original Text (Input):
A Compression scheme
Set Representation:
Set-4 A Set-5 space Set-4 C Set-5 ompression space scheme
Decimal Representation:
28 2 29 24 28 …
5 bit representation:
11100 00010 11101 11000 11100 …
After Dividing by 4:
1110 0000 1011 1011 1000 …
Adding 0100 to every combination:
01001110 01000000 01001011 01001011 01001000 …
Corresponding ASCII representation:
N@KKHNBOJCGCDH@CILJCCF@NKD@AL@NO
Corresponding k-Series after dividing by k (Where k=8):
N@KKHNBO JCGCDH@C ILJCCF@N KD@AL@NO
Index: …
Compressed String:
ᠵـ
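The padding in Step 3 and the split into k-character combinations can be sketched as follows. The block length of 4 × k bits is an assumption inferred from Example 2, where the 30-character string is extended by the two characters NO for k = 8; the padding bits are taken cyclically from the last set code (Set-5, code 29). Function names are hypothetical.

```python
def pad_with_set_code(bits, last_set_code, k):
    """Append bits of the last 5-bit set code until the length is a multiple of 4*k."""
    block_bits = 4 * k                       # assumed: k characters of 4 payload bits each
    pattern = format(last_set_code, "05b")
    position = 0
    while len(bits) % block_bits != 0:
        bits += pattern[position % 5]        # take the set-code bits one by one, cyclically
        position += 1
    return bits


def to_k_series_blocks(bits, k):
    """Prefix every 4-bit group with 0100 and cut the characters into blocks of k."""
    chars = "".join(chr(int("0100" + bits[i:i + 4], 2)) for i in range(0, len(bits), 4))
    return [chars[i:i + k] for i in range(0, len(chars), k)]


codes = [28, 2, 29, 24, 28, 11, 29, 3, 14, 13, 4, 0,
         7, 7, 5, 3, 6, 24, 7, 11, 8, 0, 14, 0]
stream = "".join(format(code, "05b") for code in codes)      # 120 bits
padded = pad_with_set_code(stream, last_set_code=29, k=8)    # 128 bits
blocks = to_k_series_blocks(padded, k=8)
print(blocks)
```

Each block is one k-Series combination whose index would then be looked up in the k-Series Database Dictionary.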
2) Decompression Algorithm
Input:&RPSUHVVHG6WULQJ6F
Output:2ULJLQDO7H[W6
Step1: )URPWKH,QGH[WR8QLFRGHGDWDEDVHUHSUHVHQWLQJ
6FWRLW¶VFRUUHVSRQGLQJLQGH[YDOXH
Step2: )URPWKHORFDWLRQRIWKHLQGH[YDOXHLQWKH6FILQG
WKH H[DFW N6HULHV FRPELQDWLRQ IURP WKH N6HULHV
'DWDEDVH'LFWLRQDU\DQGVWRUHLWLQ.
Step3: )URP . ILQG LW¶V FRUUHVSRQGLQJ  ELW ELQDU\
FRPELQDWLRQIURPWKH$6&,,WDEOHDQGVWRUHWKH
UHVXOWDQWELQDU\ELWVLQ.¶
Step4: 5HPRYH  IURP HYHU\  ELW ELQDU\
FRPELQDWLRQV
Step5: )URPWKHUHPDLQLQJ.¶WKELWVVWUHDPWDNLQJELWV
DQG UHSUHVHQWLQJ LW E\ WKH FKDUDFWHU VHW RI WKH
ORRNXSWDEOH
Step6: $IWHUH[FOXGLQJWKHVHWQXPEHUIURPN¶WKHRULJLQDO
VWULQJ6FDQEHIRXQG
Example 3:
Compressed String:
ᠵـ
Index: …
Corresponding k-Series:
N@KKHNBO JCGCDH@C ILJCCF@N KD@AL@NO
Corresponding 8 bit Representation:
01001110 01000000 01001011 01001011 01001000 …
After removing 0100 from every 8 bit combination:
1110 0000 1011 1011 1000 …
After dividing it by 5
5 bit representation:
11100 00010 11101 11000 11100 …
Corresponding Decimal Representation:
28 2 29 24 28 …
Corresponding Set Representation: Set-4 A Set-5 space Set-4 C Set-5 ompression space scheme
Original Text: A Compression scheme
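Steps 3 to 6 of the decompression can be sketched as below, starting from the k-Series characters. The sketch uses the same partial table assumed earlier (uppercase letters in Set-4, lowercase letters and the space in Set-5, set codes 25 to 31) and simply ignores trailing padding: a set code with no character after it produces nothing, and fewer than 5 leftover bits are dropped. Identifiers are hypothetical.

```python
SET4 = "ETAORINSHDLCUPMWFGYBVK"   # assumed Set-4 order, codes 0..21
SET5 = "etaorinshdlcupmwfgybvk"   # assumed Set-5 order, codes 0..21
DECODE = {4: dict(enumerate(SET4)), 5: dict(enumerate(SET5))}
DECODE[5][24] = " "               # space, assumed to be code 24 in Set-5


def decode_k_series(chars):
    """Strip the 0100 prefix, regroup into 5 bits, and follow the set-change codes."""
    bits = "".join(format(ord(ch), "08b")[4:] for ch in chars)   # keep the low 4 bits
    usable = len(bits) - len(bits) % 5                           # drop incomplete tail
    text = []
    current_set = None
    for i in range(0, usable, 5):
        code = int(bits[i:i + 5], 2)
        if 25 <= code <= 31:
            current_set = code - 24                              # set-change code
        elif current_set is None:
            raise ValueError("character code found before any set code")
        else:
            text.append(DECODE[current_set][code])
    return "".join(text)


restored = decode_k_series("N@KKHNBOJCGCDH@CILJCCF@NKD@AL@NO")
print(restored)
```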
IV. ANALYTICAL ANALYSIS
A theoretical analysis is developed to calculate the efficiency of 5BE precisely. The parameters considered for the analytical analysis are shown in TABLE III.
Let us suppose,
\[ P_1 = I \ \text{byte} \]
\[ P_2 = 8I \ \text{bit} \]
\[ P_3 = I b + s \approx I b \ \text{bit} \quad \text{[since the size of } s \text{ is quite small, we can ignore it]} \]
\[ n = \frac{P_3}{\theta} = \frac{I b}{\theta} \quad \text{(combinations of } \theta \text{ bits)} \]
\[ P_4 = n(\theta + \pi) = \frac{I b}{\theta}(\theta + \pi) = \frac{I b (\theta + \pi)}{\theta} \ \text{bit} \]
\[ P_5 = \frac{P_4}{8} = \frac{I b (\theta + \pi)}{8\theta} \ \text{byte} \]
\[ P_6 = \gamma \quad \text{(the k-Series length)} \]
\[ P_7 = \frac{P_4}{8\gamma} = \frac{I b (\theta + \pi)}{8 \gamma \theta} \]
\[ P_8 = \mu \quad \text{(bits for representing the integer)} \]
\[ P_9 = P_7 \cdot P_8 = \frac{I b \mu (\theta + \pi)}{8 \gamma \theta} \ \text{bit} \]
\[ P_{10} = \frac{P_9}{16} = \frac{I b \mu (\theta + \pi)}{8 \cdot 16 \cdot \gamma \theta} \]
\[ \tau = \frac{P_{10}}{P_1} = \frac{I b \mu (\theta + \pi)}{I \cdot 8 \cdot 16 \cdot \gamma \theta} = \frac{b \mu (\theta + \pi)}{128 \gamma \theta} \]
So we can say that with this algorithm we can achieve a maximum saving of 87%. An interesting and important finding is that our compression ratio does not depend on the dictionary size or the file size. It depends only on the size of the k-Series, i.e. γ. Using the above equation we evaluated the trend of τ for varying values of γ. Fig. 2 shows the analytical result.
Fig. 2. Analytical evaluation for compression ratio for different k-Series value
From this graph it can easily be seen that τ improves gradually as the value of γ increases, and at the largest value of γ the efficiency approaches its maximum.
V. EXPERIMENTAL RESULT
We experiment with different k-Series values of 5BE and compare them with the commonly used Huffman [ ] and LZW [ ] algorithms. We have carried out three kinds of test: compression of files, compression ratio and compression time.
A. Compression of files:
)URP)LJ7KHUHVXOWVKRZVWKDWRXUDOJRULWKPZLWKN6HULHV
SURYLGHVWKHEHVWUHVXOWLQWHUPVRIFRPSUHVVLRQRIGLIIHUHQW
VL]HRIILOH7KHRWKHUVHTXHQFHDOVRSURYLGHVH[FHOOHQWUHVXOW
/=:VKRZVJRRGUHVXOWLQWKHLQLWLDOVWDJHEXWDWWKHLQFUHDVLQJ
VL]H RI WKH ILOH WKH SHUIRUPDQFH LV GHFUHDVLQJ JUDGXDOO\
%HFDXVH
WKH KHDUWRI /=: LV WUDQVODWH GXSOLFDWHGE\WHV LQWR
V\PEROWKHQZULWHWKHV\PEROVWRELWVWUHDP>@$QGWKH
SDFNHGELWVZLOOVDYHPDQ\VSDFH
/=:FRPSUHVVHVE\ILQGLQJ
UHSHDWHG VHTXHQFHV VR DQ LQSXW ZLWKRXW DQ\ UHSHDWLQJ
VHTXHQFHV ZLOO E\ QHFHVVLW\ JHW ODUJHU6R LI WKH ILOH VL]H LV
VPDOO LW ZLOO ILQG D ORW RI UHSHDWHG VHTXHQFH DQG SHUIRUPV
EHWWHU%XWLIWKHVL]HLVLQFUHDVLQJWKHUHSHDWHGVWUHDPLVTXLWH
GLIILFXOW WR ILQG VR WKH SHUIRUPDQFH GHFUHDVLQJ ,Q WHUPV RI
+XIIPDQ LW SURYLGHV WKH ZRUVW UHVXOW %HFDXVH LQ +XIIPDQ
FRGLQJWKHELQDU\VWULQJVRU FRGHV LQWKHHQFRGHGGDWDDUHDOO
GLIIHUHQWOHQJWKV>@6RLIDKXJHDPRXQWRIGLIIHUHQWV\PERO
RI GLIIHUHQW OHQJWK LV FUHDWHG LW UHDOO\ GHJUDGHV WKH
SHUIRUPDQFH6RLWSURYLGHVZRUVWUHVXOW
)LJ &RPSUHVVLRQRIILOHVIRUGLIIHUHQWGDWDVHW
B. Compression Ratio:
)LJ6KRZVWKHFRPSUHVVLRQUDWLRIRUGLIIHUHQWVL]HRIILOHV
7KHUHVXOW GHPRQVWUDWHVWKDWRXUDOJRULWKPSURYLGHVWKHEHVW
3DUDPHWH
U
 'HVFULSWLR
Q
ܲ
7RWDOLQSXW&KDUDFWH
U
ܲ
2ULJLQDOVL]HRIWKHLQSXW
b
s
6HW%LW
ܲ
6L]HRIWKHLQSXWDIWHUXVLQJEELW
UHSUHVHQWDWLRQDQGDGGLQJ6HWELW
ș %LWXVHGWRWDNHWKHNFRPELQDWLRQIURPWKH
FRPSUHVVHGELWVWUHDP
n 1XPEHURI&RPELQDWLRQIRUPDNLQJWKH N
6HULHV
ʌ
%LWDGGWRDOOWKHFRPELQDWLRQVR
I
݊
ܲ
1RRIELWVDIWHUDGGLQJʌELWWRHDFKFKDUDFWHU
RIܲ
ܲ
6L]HRIWKHGLFWLRQDU
\
ܲ
N6HULHVOHQJWKȖ
ܲ
1XPEHURILQGLFHV
ܲ
6L]HRIWKHWDNHQLQWHJH
U
ܲ
6L]HRILQGLFHV
ܲ
ଵ଴
8QLFRGH&KDUDFWHU6L]H
IJ &RPSUHVVHGUDWLR

Authorized licensed use limited to: Macquarie University. Downloaded on June 23,2020 at 06:21:00 UTC from IEEE Xplore. Restrictions apply.
UHVXOWLQWHUPVRIFRPSUHVVLRQUDWLR,W UHDFKHVPD[LPXPRI
 VDYLQJ DW WKH N6HULHV  2Q WKH RWKHU KDQG /=:
SURYLGHVJRRGFRPSUHVVLRQUDWLRDWWKHLQLWLDOOHYHOZKHQWKH
ILOH VL]H LV VPDOO EXW DV WKH VL]H RI GDWD LV LQFUHDVLQJ WKH
HIILFLHQF\ UDWH ZLOO EH SRRU 7KH UHDVRQ EHKLQG WKLV /=:
VKRZVZRUVWUDWLRRIFRPSUHVVLRQIRUGLFWLRQDU\VL]HZKLFKLV
OHVVDQGDVFKDUDFWHUOHQJWKLQFUHDVHVGLFWLRQDU\ZLOOQRWHDVLO\
EHPDGHRI6RSHUIRUPDQFHUHGXFHG%XW+XIIPDQSURYLGHV
WKHZRUVWUHVXOW7KHUHDVRQEHKLQG+XIIPDQWHFKQLTXHVKRZV
SRRUHIILFLHQF\LVWKDWWKHGDWDLVGHULYHGE\+XIIPDQIURPWKH
SUREDELOLW\RUIUHTXHQF\RIRFFXUUHQFHRIWKHSRVVLEOHYDOXHV
LQWKHVRXUFHV\PERO>@6RLIWKHVL]HRIGDWDLVTXLWHODUJH
WKHIUHTXHQF\JHQHUDWLRQZRXOGEHPRUHGLIILFXOWDQGDODUJH
QXPEHURILQGLYLGXDOV\PEROZLOOEHFUHDWHG
)LJ &RPSUHVVLRQUDWLRIRUGLIIHUHQWGDWDVHW
C. Compression Time:
Fig. 5 shows the compression time of the different algorithms. It shows that 5BE provides a better result than Huffman. LZW initially performs badly, but as the file size increases it reaches a similar range to 5BE.
Fig. 5. Compression time for different dataset
VI. CONCLUSION
In this paper we propose an effective compression scheme for text, named 5BE, which gives promising performance. The scheme is a 5-bit compression technique whose character set, i.e. the dictionary, is derived from the Zipf distribution. With the penalty of storage for the dictionary shown in Fig. 1, we compress a natural text by 87%. The dictionary is used by both the compression and the decompression techniques and is deployed on the client side, so it needs to be constructed only once and the compression technique can then be used without interruption. The scheme can be used effectively for compressing large natural-language datasets. The forward and reverse mappings together make the compression mapping complete, since the original content is recovered on decompression. The technique can be implemented as efficient software that can be applied in data communication, data storage and database design. It can also be used in a parallel processing environment, together with a load balancing technique, to achieve a promising encoding time.
REFERENCES
[1] Wenjun Huang, Weimin Wang, Hui Xu, "A Lossless Data Compression Algorithm for Real-time Database", World Congress on Intelligent Control and Automation.
[2] K. Kalajdzic, S. H. Ali, and A. Patel, "Rapid lossless compression of short text messages", Computer Standards & Interfaces.
[3] Nieves R. Brisaboa, Antonio Fariña, Gonzalo Navarro, "Lightweight natural language text compression", Information Retrieval.
[4] Vu H. Nguyen, Hien T. Nguyen, Hieu N. Duong, and Vaclav Snasel, "n-Gram-Based Text Compression", Computational Intelligence and Neuroscience.
[5] K. M. Azharul Hasan, Compression Schemes of High Dimensional Data for MOLAP, in Evolving Application Domains of Data Warehousing and Mining: Trends and Solutions, edited by Pedro Furtado, University of Coimbra, Portugal, Chapter IV.
[6] Senthil Shanmuga Sundaram and Robert Lourdusamy, "A Comparative Study Of Text Compression Algorithms", International Journal of Wisdom Based Computing.
[7] Md. Ashiq Mahmood, Tarique Latif, K. M. Azharul Hasan and Riadul Islam, "A Feasible 6 Bit Text Database Compression Scheme with Character Encoding (6BC)", International Conference of Computer and Information Technology (ICCIT).
[8] A. Carus and A. Mesut, "Fast text compression using multiple static dictionaries", Information Technology Journal.
[9] … Replacement And Bit Reduction", SIPM, FCST, ITCA, WSE, ACSIT, CS & IT.
[10] Ashiq Mahmood, Tarique Latif, and K. M. Azharul Hasan, "An Efficient 6 bit Encoding Scheme for Printable Characters by table look up", International Conference on Electrical, Computer and Communication Engineering (ECCE).
[11] Wentian Li, "Random Texts Exhibit Zipf's-Law-Like Word", IEEE Transactions on Information Theory, November.
[12] Stephen Fagan and Ramazan Gençay, An introduction to textual econometrics, Handbook of Empirical Economics and Finance.
[13] D. A. Huffman, "A method for the construction of minimum redundancy codes", Proceedings of the IRE.
[14] M. M. Kodabagi, M. V. Jerabandi, Nagaraj Gadagin, "Multilevel security and compression of text data using bit stuffing and huffman coding", International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT).
[15] Jingzheng Wu, Yongji Wang, Liping Ding, Xiaofeng Liao, "Improving performance of network covert timing channel through Huffman coding", Mathematical and Computer Modeling, Elsevier.
[16] Wei Wang, Wei Zhang, "Huffman Coding-Based Adaptive Spatial Modulation", IEEE Transactions on Wireless Communications.
[17] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression", IEEE Transactions on Information Theory.
[18] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding", IEEE Transactions on Information Theory.
[19] Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda, "Efficient LZ78 Factorization of Grammar Compressed Text", L. Calderón-Benavides et al. (Eds.): SPIRE 2012, LNCS 7608, Springer-Verlag Berlin Heidelberg.
[20] M. Sangeetha, P. Betty, G. S. Nanda Kumar, "A biometric iris image compression using LZW and hybrid LZW coding algorithm", International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS).