Basic Techniques Agung Toto Wibowo, Universitas Telkom
Outline
Information and Entropy
Compression Performance
Run Length Encoding
Move to Front
Scalar Quantization
Prefix Code
INFORMATION VALUE Consider a probabilistic experiment with a discrete random variable S, S = {s1, s2, ..., sN}. The amount of information produced by the event sk is I(sk) = log(1/pk) = -log(pk). If pk = 1 (the event is certain to occur), then I(sk) = 0: an event that is certain to happen carries no information. Properties of the information value: I(sk) ≥ 0 for 0 ≤ pk ≤ 1, and I(sk) > I(si) whenever pk < pi. An event with a smaller probability of occurring carries a larger amount of information when it does occur.
INFORMATION VALUE If sk and si are independent, then I(sk si) = I(sk) + I(si). The logarithm base used to compute the information value in the equation above can vary; for digital systems that use binary numbers, base 2 is used.
ENTROPY The entropy H is the average information value per symbol for the output of a given information source: H = E[I(sk)] = -Σ pk log2(pk) bits/symbol. For a binary source (N = 2) with occurrence probabilities p and (1 - p): H = -(p log2(p) + (1 - p) log2(1 - p)).
ENTROPY Some notes on entropy: the bit as the unit of information for base-2 logarithms is not the same thing as a binary digit. Entropy connotes uncertainty: the entropy is maximal when the outcome is as unpredictable as possible. Example: for a coin toss with equal probabilities (0.5), the outcome is hard to predict (uncertain), so the entropy is at its maximum.
ENTROPY OF TWO EVENTS For a binary signal with occurrence probabilities p and 1 - p: H = -(p log2(p) + (1 - p) log2(1 - p)). (The original slide plots H against p; the maximum, H = 1 bit/symbol, occurs at p = 0.5.)
ENTROPY: EXAMPLE Example: compute the entropy (average information value), in bits/character, of the 26-letter Latin alphabet when:
a. every letter has the same probability of occurrence;
b. the probabilities are distributed as follows: p = 0.10 for the letters a, e, o, t; p = 0.07 for the letters h, i, n, r, s; p = 0.02 for the letters c, d, f, l, m, p, u, y; p = 0.01 for the remaining letters.
Answer:
a. H = log2(26) ≈ 4.70 bits/character.
b. H = -(4 x 0.10 log2(0.10) + 5 x 0.07 log2(0.07) + 8 x 0.02 log2(0.02) + 9 x 0.01 log2(0.01)) = 4.17 bits/character.
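The calculation above can be checked with a few lines of Python; this is a small sketch that is not part of the original slides and simply reuses the letter groups and probabilities from case (b), plus log2(26) for case (a).

from math import log2

groups = [          # (number of letters, probability of each letter)
    (4, 0.10),      # a, e, o, t
    (5, 0.07),      # h, i, n, r, s
    (8, 0.02),      # c, d, f, l, m, p, u, y
    (9, 0.01),      # the remaining 9 letters
]

H = -sum(n * p * log2(p) for n, p in groups)
print(f"{H:.2f}")           # 4.17 bits/character (case b)
print(f"{log2(26):.2f}")    # 4.70 bits/character (case a, equal probabilities)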
Performance – Compression Ratio The compression ratio, also called bpb (bits per bit), equals the size of the compressed stream divided by the size of the input stream, i.e. the number of bits needed, on average, in the compressed stream to represent one bit of the input stream. A value of 0.6 means the data occupy 60% of their original size after compression. Values greater than 1 mean the output stream is bigger than the input stream. Related terms are bpp (bits per pixel) and bpc (bits per character); the general term is bit rate.
Performance – Compression Factor The inverse of the compression ratio is called the compression factor. A value greater than 1 indicates compression, and a value less than 1 indicates expansion.
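As a quick illustration (not from the slides), both measures follow directly from the two stream sizes; the sizes used here are arbitrary example values.

def compression_ratio(compressed_size, original_size):
    return compressed_size / original_size      # e.g. 0.6 = 60% of the original size

def compression_factor(compressed_size, original_size):
    return original_size / compressed_size      # > 1 means compression

print(compression_ratio(60, 100))    # 0.6
print(compression_factor(60, 100))   # 1.666...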
Run Length Encoding Idea: if a data item d occurs n consecutive times in the input stream, replace the n occurrences with the single pair nd. This approach is called Run Length Encoding (RLE).
RLE Text Compression Encoding "2._all_is_too_well" as "2._a2_is_t2_we2" does not work: the decompressor must be able to tell that the first 2 is part of the text while the other 2s are repetition factors for "o" and "l". "2._a2l_is_t2o_we2l" does not solve the problem either: it has the same length as the original and the ambiguity remains. "2._a@2l_is_t@2o_we@2l", which uses a special escape character, does solve the problem, but it generates a longer text.
Simple RLE Compression
charCount ← 0, repeatCount ← 0
Read next character (CH)
While not end of string do
    increment charCount
    if charCount = 1 then savedCharacter ← CH
    else if savedCharacter = CH then increment repeatCount
    else
        if repeatCount < 4 then write savedCharacter repeatCount times
        else write the compressed format (3 characters)
        repeatCount ← 0, savedCharacter ← CH
    Read next character (CH)
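A compact Python sketch of the same idea (an interpretation of the pseudocode above, not a literal transcription): runs shorter than 4 bytes are copied literally, while longer runs, and any occurrence of the escape byte itself, are written in the 3-byte compressed format escape, count, character. Using '@' as the escape and capping counts at 255 are assumptions carried over from the text example.

ESC = ord("@")      # escape byte, as in the text example
MIN_RUN = 4         # runs shorter than this are copied literally
MAX_RUN = 255       # the repetition count must fit in one byte

def rle_compress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        ch, run = data[i], 1
        while i + run < len(data) and data[i + run] == ch and run < MAX_RUN:
            run += 1
        if run < MIN_RUN and ch != ESC:
            out.extend([ch] * run)          # short run: write the characters as-is
        else:
            out.extend([ESC, run, ch])      # compressed format: escape, count, character
        i += run
    return bytes(out)

print(rle_compress(b"abcccccb"))            # b'ab@\x05cb'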
Simple RLE Decompression
Repeat
    compressionFlag ← off
    While compressionFlag = off and not end of string do
        Read next character (CH)
        if CH = '@' then compressionFlag ← on
        else write CH on the output stream
    if compressionFlag = on then
        read nRepetition, read dChar
        generate nRepetition copies of dChar
Until end of string
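A matching decompressor sketch, assuming the same 3-byte format (escape, count, character) produced by the compression sketch above.

ESC = ord("@")                              # same escape byte as the compressor

def rle_decompress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC:                  # compressed triple: escape, count, character
            count, ch = data[i + 1], data[i + 2]
            out.extend([ch] * count)
            i += 3
        else:                               # literal byte
            out.append(data[i])
            i += 1
    return bytes(out)

print(rle_decompress(b'ab@\x05cb'))         # b'abcccccb'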
RLE on Binary Images RLE compression of images is based on the observation that if we select a pixel at a random position in the image, there is a good chance that its neighbours have the same color. E.g. if a row of the bitmap starts with 17 white pixels, followed by 1 black pixel, followed by 55 white pixels, then only the numbers 17, 1, 55, ... need to be written on the output.
RLE on Grayscale Images Runs are encoded as pairs (run length, pixel value), with runs of up to 255 pixels. E.g. 12, 12, 12, 12, 12, 12, 12, 12, 12, 35, 76, 112, 67, 87, 87, 87, 5, 5, 5, 5, 5, 5, 1, ... is compressed into 9, 12, 35, 76, 112, 67, 3, 87, 6, 5, 1, ... (problem: how to distinguish a count from a grayscale value?). If the image is limited to 128 grayscale values, one bit of each byte can be devoted to indicating whether the byte holds a grayscale value or a count. If there are 256 grayscale values, they can be reduced to 255 so that one byte value (e.g. 255) is reserved as a flag indicating that a count follows: 255, 9, 12, 35, 76, 112, 67, 255, 3, 87, 255, 6, 5, 1, ... Alternatively, one bit can be devoted to each byte, marking whether that byte is a count or grayscale data: 10000010, 9, 12, 35, 76, 112, 67, 255, 3, 87, 255, 100....., 6, 5, 1, ...
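A sketch of the flag-byte variant described above, assuming pixel values are limited to 0..254 (255 is reserved as the flag) and that only runs of three or more pixels are flagged; these thresholds are assumptions chosen so that the slide's example sequence is reproduced.

FLAG = 255          # reserved value: the next two numbers are (count, pixel value)
MAX_RUN = 255

def rle_gray_encode(pixels):
    out, i = [], 0
    while i < len(pixels):
        v, run = pixels[i], 1
        while i + run < len(pixels) and pixels[i + run] == v and run < MAX_RUN:
            run += 1
        if run >= 3:
            out += [FLAG, run, v]           # flagged triple
        else:
            out += [v] * run                # short runs are stored as plain values
        i += run
    return out

row = [12] * 9 + [35, 76, 112, 67] + [87] * 3 + [5] * 6 + [1]
print(rle_gray_encode(row))
# [255, 9, 12, 35, 76, 112, 67, 255, 3, 87, 255, 6, 5, 1]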
RLE on True-Color Images In color images each pixel is stored as three bytes (its RGB values), so each color component should be encoded separately. E.g. the pixels (171, 85, 34), (172, 85, 35), (172, 85, 30), (173, 85, 33) should be separated into (171, 172, 172, 173, ...), (85, 85, 85, 85, ...) and (34, 35, 30, 33, ...). It is preferable to encode each row individually; thus if a row ends with four pixels of intensity 87 and the next row starts with 9 pixels of intensity 87, it is better to write ..., 4, 87, 9, 87, ..., or even better ..., 4, 87, eol, 9, 87, ..., than to merge the two runs.
Exercises – RLE What is the compression ratio? (The bitmap to be compressed is shown as a figure on the original slide and is not reproduced here.)
Relative Encoding Another technique, called differencing, is used when the data to be compressed consist of numbers that do not differ from each other by much (i.e. are similar). E.g. telemetry or temperature readings (70, 71, 72, 72.5, 73.1, ...) can be expressed as (70, 1, 1, 0.5, 0.6, ...). Note that a difference can be negative: (110, 115, 121, 119, 200, 202, ...) can be compressed into (110, 5, 6, -2, 200, 2, ...). Distinguishing raw data from differences is done with extra bits. E.g. (110, 5, 6, -2, 200, 2, ...) is sent together with the extra bits 10001000 (binary), where a 1 marks a raw value and a 0 marks a difference.
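A sketch of relative encoding with the extra flag bits; the threshold of 64 (differences at least that large are sent as raw values) is an illustrative assumption, not something specified on the slide.

THRESHOLD = 64      # assumed limit: larger jumps are transmitted as raw values

def relative_encode(values):
    out, flags = [values[0]], [1]           # flag 1 = raw value, 0 = difference
    for prev, cur in zip(values, values[1:]):
        diff = cur - prev
        if abs(diff) < THRESHOLD:
            out.append(diff)                # small change: send the difference
            flags.append(0)
        else:
            out.append(cur)                 # large change: send the raw value
            flags.append(1)
    return out, flags

print(relative_encode([110, 115, 121, 119, 200, 202]))
# ([110, 5, 6, -2, 200, 2], [1, 0, 0, 0, 1, 0])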
Difference coding: Example (worked examples shown as figures on the original slides; not reproduced here)
Move to Front Coding [1] Idea: maintain the alphabet A of symbols as a list in which frequently occurring symbols are located near the front. E.g., with the alphabet A = ("a", "b", "c", "d", "m", "n", "o", "p"), the input stream "abcddcbamnopponm" is encoded as C = (0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3); without move-to-front (i.e. with fixed indices) it is encoded as C' = (0, 1, 2, 3, 3, 2, 1, 0, 4, 5, 6, 7, 7, 6, 5, 4).
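A short Python sketch of move-to-front encoding and decoding (not part of the slides), reproducing the example above.

def mtf_encode(text, alphabet):
    symbols = list(alphabet)                    # working list, reordered as we go
    codes = []
    for ch in text:
        i = symbols.index(ch)
        codes.append(i)
        symbols.insert(0, symbols.pop(i))       # move the matched symbol to the front
    return codes

def mtf_decode(codes, alphabet):
    symbols = list(alphabet)
    chars = []
    for i in codes:
        ch = symbols.pop(i)                     # the i-th symbol of the current list
        chars.append(ch)
        symbols.insert(0, ch)
    return "".join(chars)

C = mtf_encode("abcddcbamnopponm", "abcdmnop")
print(C)                            # [0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3]
print(mtf_decode(C, "abcdmnop"))    # abcddcbamnopponm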
Move to Front Coding [2] Comparison of the strings "abcddcbamnopponm" and "abcdmnopabcdmnop" (step-by-step tables on the original slide). The second string is a bad case for move-to-front: after the first eight symbols, every symbol is found at the end of the list, so it is encoded as (0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7).
Move to Front Coding [3] Move-to-front can be combined with other methods (Huffman or arithmetic coding) in the following ways: Assign Huffman codes to the integers in the range [0, n] such that smaller integers get shorter codes, e.g. 0-0, 1-10, 2-110, 3-1110, 4-11110, 5-111110, 6-1111110, 7-1111111. Assign codes to the integers such that the code of an integer i ≥ 1 is its binary representation preceded by ⌊log2 i⌋ zeros (see the figure on the original slide). Use adaptive Huffman coding. For maximum compression, perform two passes over C: the first pass counts the frequencies of the codes, the second performs the actual encoding.
Move To Front Coding [4] Move-ahead-k: the element of A matched by the current symbol is moved ahead k positions instead of all the way to the front of A. The parameter k is specified by the user; if k = n, this is equivalent to move-to-front. Wait-c-and-move: an element of A is moved to the front only after it has been matched c times by symbols from the input stream. It may also make sense to treat each word, rather than each character, as a symbol.
Exercises – Move To Front Consider the following lyrics: Selamat Hari Raya selamat hari lebaran Selamat Idul Fitri maaf lahir dan batin Selamat Hari Raya selamat lebaran Selamat Idul Fitri maaf lahir dan batin. If the available buffer is 6 words, what is the compression ratio?
Scalar Quantization Reduces the amount of data; it is a lossy compression technique. If the data take the form of large numbers, they are converted to smaller numbers, so not all of the data is kept. If the data to be compressed are analog, quantization is used to sample and digitize them into small numbers. The smaller the numbers, the better the compression, but also the greater the loss of information.
Scalar Quantization – Example 8-bit data: delete the LSB to obtain 7-bit data. Input data in [0, 255]: keep only the quantized values 0, s, 2s, ..., ks, where ks ≤ 255. For s = 3 the output values are 0, 3, 6, 9, 12, ..., 255; for s = 4 they are 0, 4, 8, 12, ..., 252, 255. PCM (Pulse Code Modulation) for voice: a 4 kHz (analog) voice signal is sampled at 8000 samples/s and encoded with 8 bits per sample = 64 kbps.
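A one-line quantizer matching the multiples-of-s example above (a sketch, not from the slides); it rounds each sample down to the nearest lower multiple of the step.

def quantize(samples, step):
    # replace each sample in [0, 255] with the largest multiple of `step` not above it
    return [(x // step) * step for x in samples]

print(quantize([0, 1, 2, 3, 7, 100, 255], 4))   # [0, 0, 0, 0, 4, 100, 252]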
Statistical Methods Use variable-size codes: shorter codes are assigned to symbols that appear more often (i.e. have a higher probability of occurrence). Examples: Morse code, Huffman code, etc.
Fixed-Length Code Each symbol is represented by a code of fixed length. Example: ASCII, with code length 7 bits + 1 parity bit = 8 bits; total number of bits = number of characters x 8 bits.
Variable-Size Code Requirements: assign codes that can be decoded unambiguously, and assign codes with the minimum average size. Example (four symbols): entropy = 1.57 bits/symbol, average code size = 1.77 bits/symbol; if the symbols had equal probability (0.25), the entropy would be 2 bits/symbol.
Symbol   Probability   Code 1   Code 2
A1       0.49          1        1
A2       0.25          01       01
A3       0.25          010      000
A4       0.01          001      001
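The entropy and average code size quoted above can be verified directly from the table; this is a small sketch (not from the slides) using the probabilities and Code 2.

from math import log2

probs = [0.49, 0.25, 0.25, 0.01]
code2 = ["1", "01", "000", "001"]

entropy  = -sum(p * log2(p) for p in probs)
avg_bits = sum(p * len(c) for p, c in zip(probs, code2))
print(f"{entropy:.2f} {avg_bits:.2f}")   # 1.57 1.77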
Prefix Code (= Prefix-Free Code) A prefix code is a type of code system (typically a variable-length code) distinguished by its "prefix property", which states that no valid code word in the system is a prefix (start) of any other valid code word in the set. A receiver can therefore identify each word without requiring a special marker between words.
Prefix Code: Example 1 The code with code words {9, 59, 55} has the prefix property; a code consisting of {9, 5, 59, 55} does not, because "5" is a prefix of both "59" and "55". A prefix code is an example of a uniquely decodable code.
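A small helper (illustrative, not from the slides) that tests the prefix property of a set of code words.

def is_prefix_free(codewords):
    # a code has the prefix property iff no codeword is a prefix of another codeword
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

print(is_prefix_free(["9", "59", "55"]))        # True
print(is_prefix_free(["9", "5", "59", "55"]))   # False ("5" is a prefix of "59" and "55")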
Binary Prefix Code A binary prefix code can be represented as a binary tree. Characteristic feature: every symbol is a leaf node; no symbol is an internal node.
Prefix Code: Example 2 Prefix-free code {01, 10, 11, 000, 001}. If ni is the number of codewords of length i bits, then: n2 = 3 (3 codewords at level 2), n3 = 2 (2 codewords at level 3).
Prefix Code: Example 3 The code {0, 01, 011, 0111} is not a prefix-free code: "0" is a prefix of every other codeword. The code {0, 01, 11} is not prefix-free either, since "0" is a prefix of "01".
Exercise – Prefix Code What are the entropy and the bit rate when the string "aebbcchhffabbacdaaaaaffghbbfff" is encoded using the accompanying prefix code (given as a table on the original slide)? If 1 symbol takes 1 byte in the original representation, what is the compression ratio?
ASSIGNMENT #1 Write a descriptive paper (complete with the algorithms) on the Tunstall code and the Golomb code, in Bahasa Indonesia.
References [1] Salomon, David, "Data Compression: The Complete Reference", 3rd Edition, Springer, 2004.