
Author Topic: [Scala] huffman coding  (Read 5813 times)


Offline Deque

[Scala] huffman coding
« on: August 07, 2012, 12:02:34 PM »
Hello Evilzone,

For comparison with the dictionary-based compression technique Lempel-Ziv-Welch (link), I also implemented Huffman coding, which is based on character probabilities.

Simple explanation: an ASCII character needs 8 bits to be saved, and that length is fixed. Huffman coding uses variable lengths instead: it looks for the characters that occur most often and gives them the shortest codes. For example, it might save the letter 'e' (which is a common one in English text) with the binary code '01' and the letter 'z' with the code '1000'.
So you get compression by reducing the average number of bits needed to save one character.
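
A quick made-up example of the effect (the code table here is hypothetical, not necessarily what my implementation would produce for this string):

Code: (Scala) [Select]
object HuffmanToy {
  def main(args: Array[String]): Unit = {
    //hypothetical code table: the most frequent character 'e' gets the shortest code
    val codes = Map('e' -> "0", 'b' -> "10", 's' -> "11")
    val text = "beeees"
    val encoded = text.map(codes).mkString //"10" + "0000" + "11"

    println("fixed 8-bit size: " + text.length * 8 + " bits") //48 bits
    println("huffman size: " + encoded.length + " bits") //8 bits
  }
}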

For details about the compression technique have a look here: http://www.cs.duke.e...poop/huff/info/
It explains very well how to create a tree for encoding and decoding.
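
A tiny walkthrough of the tree construction (made-up frequencies; Leaf, Node and merge are from HuffTree.scala further down, and this is meant for the REPL):

Code: (Scala) [Select]
val c = Leaf('c', 1)
val b = Leaf('b', 2)
val a = Leaf('a', 5)

val cb = c.merge(b) //the two rarest characters are merged first, value = 3
val root = cb.merge(a) //then the two remaining trees, value = 8
//'a' ends up one level below the root (1-bit code),
//'b' and 'c' end up two levels below (2-bit codes)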

Of course this is not as efficient as dictionary-based compression; Huffman coding is rather used in addition to other techniques. Here is a comparison of the compression ratios for my Scala implementations:



The average compression ratio for Huffman is 34.08 %, for LZW 53.52 %.
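
printCompressionRatio is from my LempelZivWelch object (see the LZW release linked at the end of this post). A stand-alone helper that computes a ratio of this kind, read as space saving in percent, could look roughly like this (compressionRatio is just an illustrative name, not the method from the LZW post):

Code: (Scala) [Select]
import java.io.File

object Ratio {
  //space saving in percent: 1 - (compressed size / original size)
  def compressionRatio(original: File, compressed: File): Double =
    (1.0 - compressed.length.toDouble / original.length) * 100

  def main(args: Array[String]): Unit = {
    val saved = compressionRatio(new File("input.txt"), new File("huffOut.txt"))
    printf("saved %.2f %%%n", saved)
  }
}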

Finally, the code:

Huffman.scala

Code: (Scala) [Select]
package model

import scala.collection.mutable.LinkedList
import scala.collection.mutable.ListBuffer
import scala.collection.immutable.Map
import scala.collection.breakOut
import java.io.{
  File,
  FileOutputStream,
  FileInputStream,
  FileWriter,
  FileReader
}

class Huffman {

  private val EOF = -1

  private val CHAR_FREQUENCY_FILE = "charlist.txt"

  /**
   * A map that contains each character as key and its absolute frequency as value.
   * Every character starts with a frequency of 1; the data is then updated by
   * reading the CHAR_FREQUENCY_FILE.
   */
  private val frequencyTable: Map[Char, Long] = {
    var map: Map[Char, Long] =
      (for (i <- 0 to 255) yield (i.toChar -> 1.toLong)).toMap

    for (line <- scala.io.Source.fromFile(CHAR_FREQUENCY_FILE).getLines) {
      if (line.contains(":")) {
        try {
          val number = line.split(':')(1).toLong
          map += (line.charAt(0) -> number)
        } catch {
          case e: NumberFormatException => //ignore
        }
      }
    }
    map
  }

  /**
   * The Huffman tree that is used to associate a code with each character.
   * Characters that occur most often get the shortest codes and vice versa.
   */
  private val huffTree: HuffTree = {
    //Create initial list of trees: One tree for every character, containing
    //only the root node as a leaf.
    var forest: LinkedList[HuffTree] =
      (for ((key, value) <- frequencyTable) yield Leaf(key, value))(breakOut)

    while (forest.size > 1) { //merge until one tree is left
      forest = forest.sortWith((t1, t2) => t1.value < t2.value)
      val merged = forest(0).merge(forest(1)) //merges the two minimal trees
      forest = merged +: forest.drop(2)
    }
    assert(forest.size == 1, "forest size = " + forest.size)
    forest(0)
  }

  /**
   * @return a map (table) that contains the characters as key and their code as
   * value
   */
  private def encodingTable(): Map[Char, List[Int]] = {
    //traverses the huffman tree recursively to create a list of tuples
    def traverse(t: HuffTree, code: List[Int]): List[(Char, List[Int])] = t match {
      case Leaf(k, v) => List((k, code))
      case Node(l, r, v) => traverse(l, code ::: List(0)) ++ traverse(r, code ::: List(1))
    }
    traverse(huffTree, List()).toMap
  }

  /**
   * Compresses the given input file and writes the result to the output file.
   *
   * @param inFile
   * @param outFile
   */
  def compress(inFile: File, outFile: File): Unit = {
    val writer = new BitOutputStream(new FileOutputStream(outFile))
    val reader = new FileReader(inFile)
    try {
      var input = reader.read()
      val encoding = encodingTable()
      while (input != EOF) {
        val char = input.toChar
        if (encoding.contains(char)) {
          //write the variable-length code for this character bit by bit
          encoding(char).foreach(writer.writeBit)
        }
        input = reader.read()
      }

    } finally {
      writer.close()
      reader.close()
    }
  }

  /**
   * Decompresses the given input file and writes the result to the output file.
   *
   * @param inFile
   * @param outFile
   */
  def decompress(inFile: File, outFile: File): Unit = {
    val writer = new FileWriter(outFile)
    val reader = new BitInputStream(new FileInputStream(inFile))
    try {
      var input = reader.readBit()
      var bits: ListBuffer[Int] = ListBuffer()

      while (input != EOF) {
        bits += input
        input = reader.readBit()
      }

      writer.write(decode(bits.toList))

    } finally {
      writer.close()
      reader.close()
    }
  }

  /**
   * Takes a list of bits and the huffman tree to decode the given bits.
   *
   * @param code a list of bits that represent the code
   * @return decoded string
   */
  private def decode(code: List[Int]): String = {
    var result = new StringBuilder()
    var tree = huffTree

    for (bit <- code) {
      tree = tree match {
        case Node(l, r, v) => if (bit == 0) l else r
        case Leaf(k, v) =>
          result.append(k)
          //a leaf was reached: emit its character, then restart from the root
          //and consume the current bit there (TODO doesn't that work better somehow?)
          huffTree match {
            case Node(l, r, v) => if (bit == 0) l else r
            case _ => huffTree
          }
      }
    }

    tree match { //decode the very last character too
      case Leaf(k, v) => result.append(k)
      case _ => //ignore
    }
    result.toString()
  }

  /**
   * Takes a list of bits and the huffman tree to decode the given bits.
   *
   * @deprecated use decode(code: List[Int]): String instead
   *              would have been so nice, but produces stack overflow with
   *              large files
   * @param code a list of bits that represent the code
   * @param tree the huffman tree used for decoding
   * @return decoded string
   */
  private def decode(code: List[Int], tree: HuffTree): String = code match {
    case x :: xs => tree match {
      case Node(l, r, v) => if (x == 0) decode(xs, l) else decode(xs, r)
      case Leaf(k, v) => k + decode(code, huffTree)
    }
    case Nil => tree match {
      case Leaf(k, v) => k.toString()
      case _ => "parsing error"
    }
  }

}

object Huffman {

  def main(args: Array[String]): Unit = {
    println("Huffman test")
    val huff = new Huffman()
    huff.compress(new File("input.txt"), new File("huffOut.txt"))
    huff.decompress(new File("huffOut.txt"), new File("huffResult.txt"))
    LempelZivWelch.printCompressionRatio(new File("input.txt"), new File("huffOut.txt"))
  }
}


HuffTree.scala

Code: (Scala) [Select]
package model

/**
 * Binary tree for Huffman coding.
 */
abstract class HuffTree {
  /**
   * absolute frequency of the characters saved in this tree
   */
  val value: Long
  /**
   * Merges this tree with the tree given as parameter. A new root node is
   * created with this tree as right subtree and the parameter tree (that)
   * as left subtree. The root node's value is the sum of the two subtree values.
   *
   * @param that the tree to merge with
   * @return merged huffman tree
   */
  def merge(that: HuffTree): HuffTree = Node(that, this, that.value + this.value)
}

/**
 * @constructor
 * @param left the left subtree
 * @param right the right subtree
 * @param value absolute frequency of the characters in the leaves below
 */
case class Node(left: HuffTree, right: HuffTree, value: Long) extends HuffTree

/**
 * @constructor
 * @param char a character
 * @param value absolute frequency of the character char
 */
case class Leaf(char: Char, value: Long) extends HuffTree

It uses BitInputStream and BitOutputStream again, which I grabbed from http://www.developer...t_data_transfer

If you want to compile the code, you can also use the BitInputStream.scala and BitOutputStream.scala classes from the Lempel-Ziv-Welch implementation:
http://evilzone.org/evilzone-releases/%28scala%29-text-compression-cli-tool/
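
If you just want something minimal to compile against, a sketch with the same method names as used above (writeBit(Int), readBit(): Int returning -1 at the end of the stream, and close()) could look like this. It is not the implementation from either of the links above:

Code: (Scala) [Select]
import java.io.{ InputStream, OutputStream }

class BitOutputStream(out: OutputStream) {
  private var buffer = 0
  private var count = 0

  //collects bits into a byte buffer and writes the byte once it is full
  def writeBit(bit: Int): Unit = {
    buffer = (buffer << 1) | (bit & 1)
    count += 1
    if (count == 8) flushByte()
  }

  private def flushByte(): Unit = {
    out.write(buffer)
    buffer = 0
    count = 0
  }

  def close(): Unit = {
    if (count > 0) { //pad the last byte with zero bits
      buffer = buffer << (8 - count)
      flushByte()
    }
    out.close()
  }
}

class BitInputStream(in: InputStream) {
  private var buffer = 0
  private var count = 0

  //returns the next bit (0 or 1), or -1 at the end of the stream
  def readBit(): Int = {
    if (count == 0) {
      buffer = in.read()
      if (buffer == -1) return -1
      count = 8
    }
    count -= 1
    (buffer >> count) & 1
  }

  def close(): Unit = in.close()
}

Note that any bit packing like this has to pad the last byte, so the decompressed file may end with a few extra characters unless the original length (or an end marker) is stored as well. The classes from the links may handle that differently.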

Deque
« Last Edit: August 07, 2012, 12:02:48 PM by Deque »

 


