Yet Another Guide to Understand Iteratee for Haskell Programmers

Iteratees are still a topic that little material covers. They became popular as an elegant way to overcome the main disadvantage of lazy IO, which most people blame for the lack of control over resources. Although iteratees are generally categorized as an advanced topic, I don’t think they are too difficult to get the hang of. They count as advanced mostly because lazy IO satisfies your needs in most cases; you only need iteratees when you run into pretty tough or big problems.

I am not going to elaborate on the design of iteratees, since there are already great introductions on the web. I will just recommend a reading order that reduces the difficulty ahead. Here are three of them, ordered from the one that stresses context most to the one closest to a direct implementation:

Yesod Book
Oleg Kiselyov’s talk
John Millikin’s tutorial

I believe following this order eases the load on your brain. The introduction in the Yesod Book approaches the problem from the benefits of inversion of control, without burdening you too much with the problems of lazy IO. Oleg’s slides give an excellent and clear explanation of lazy IO and the spirit of iteratees; they don’t require much background knowledge and are very easy to read. John’s tutorial corresponds directly to a concrete package on Hackage, namely enumerator, which makes it easy to experiment by yourself.

I will illustrate the use of iteratees with an example, accompanied by test cases that demonstrate the problem of lazy IO. The example is a modification of the one in Kazu Yamamoto’s tutorial: I replace the enumerator package with the iteratee package, and instead of finding filenames that match a given pattern, the example traverses the directories and prints the first line of each enumerated file (as a ByteString). The example is designed so that a naive implementation exhausts a limited resource, namely file descriptors.

Print all the first lines of files in a directory

Let’s look at the naive implementation first. Since the iteratee implementation uses ByteString, the naive one imports Data.ByteString.Lazy.Char8 to stay comparable. It also intentionally uses readFile so that it is a lazy IO program. Otherwise there is not much to explain.

import Control.Monad
import Control.Monad.IO.Class
import Control.Applicative
import System.Environment
import System.Directory
import System.FilePath
import qualified Data.List as L
import qualified Data.ByteString.Lazy.Char8 as B


getValidContents :: FilePath -> IO [String]
getValidContents path =
    filter (`notElem` [".", "..", ".git", ".svn"])
    <$> getDirectoryContents path


isSearchableDir :: FilePath -> IO Bool
isSearchableDir dir =
    (&&) <$> doesDirectoryExist dir
         <*> (searchable <$> getPermissions dir)


getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents dir = do
    cnts <- map (dir </>) <$> getValidContents dir
    cnts' <- forM cnts $ \path -> do
        isDirectory <- isSearchableDir path
        if isDirectory
            then getRecursiveContents path
            else return [path]
    return . concat $ cnts'


firstLine :: FilePath -> IO B.ByteString
firstLine file = do
    b <- B.readFile file
    return $ head . B.lines $ b


allFirstLines :: FilePath -> IO ()
allFirstLines dir = do
    filepaths <- getRecursiveContents dir
    l <- mapM firstLine $ filepaths
    mapM_ B.putStrLn l


main = do
    dir:_ <- getArgs
    allFirstLines dir
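
As a side note, the descriptors pile up in the naive version because the lazy readFile keeps each handle open until the whole content has been consumed (or the handle is garbage collected), and firstLine only ever demands the first line. The following is a minimal sketch, not part of the original comparison, of a variant that scopes the handle with withFile and reads strictly. The name firstLineStrict and the demo file name are made up for illustration.

```haskell
import qualified Data.ByteString.Char8 as BS
import System.IO (IOMode (ReadMode), hIsEOF, withFile)

-- Hypothetical helper: read only the first line, strictly.
-- withFile closes the handle before the result is returned, so at most
-- one descriptor is open at a time. An empty file yields an empty
-- ByteString instead of the exception that `head` would raise.
firstLineStrict :: FilePath -> IO BS.ByteString
firstLineStrict file = withFile file ReadMode $ \h -> do
    eof <- hIsEOF h
    if eof
        then return BS.empty
        else BS.hGetLine h

main :: IO ()
main = do
    writeFile "firstline_demo.txt" "alpha\nbeta\n"
    firstLineStrict "firstline_demo.txt" >>= BS.putStrLn  -- prints "alpha"
```

This fixes the descriptor leak for this particular task, but it does so by giving up composable, incremental processing, which is the part iteratees are meant to keep.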

Next we look at the iteratee implementation. I re-implement the enumDir function from Kazu’s tutorial with the iteratee package. It may be a matter of personal taste, but in my opinion an implementation with enumerator would be more beginner friendly than one with iteratee. I also implement an enumeratee, firstLineE, which adapts an enumerator of file paths to an enumerator of bytestrings. In firstLineE I use the enumerator and enumeratees provided by the iteratee package to reduce the work.

import Control.Monad
import Control.Monad.IO.Class
import Control.Applicative
import System.Environment
import System.Directory
import System.FilePath
import qualified Data.List as L
import qualified Data.ByteString as B
import qualified Data.Iteratee as I
import Data.Iteratee.Iteratee
import qualified Data.Iteratee.Char as EC
import qualified Data.Iteratee.IO.Fd as EIO
import qualified Data.Iteratee.ListLike as EL


getValidContents :: FilePath -> IO [String]
getValidContents path =
    filter (`notElem` [".", "..", ".git", ".svn"])
    <$> getDirectoryContents path


isSearchableDir :: FilePath -> IO Bool
isSearchableDir dir =
    (&&) <$> doesDirectoryExist dir
         <*> (searchable <$> getPermissions dir)


getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents dir = do
    cnts <- map (dir </>) <$> getValidContents dir
    cnts' <- forM cnts $ \path -> do
        isDirectory <- isSearchableDir path
        if isDirectory
            then getRecursiveContents path
            else return [path]
    return . concat $ cnts'


printI :: Iteratee [B.ByteString] IO ()
printI = do
    mx <- EL.tryHead
    case mx of
         Nothing -> return ()
         Just l -> do
             liftIO . B.putStrLn $ l
             printI


firstLineE :: Enumeratee [FilePath] [B.ByteString] IO ()
firstLineE = mapChunksM $ \filenames -> do
    forM filenames $ \filename -> do
        i <- EIO.enumFile 1024 filename $ joinI $ ((mapChunks B.pack) ><> EC.enumLinesBS) EL.head
        result <- run i
        return result


enumDir :: FilePath -> Enumerator [FilePath] IO b
enumDir dir iter = runIter iter idoneM onCont
    where
        onCont k Nothing = do
            (files, dirs) <- liftIO getFilesDirs
            if null dirs
                then return $ k (Chunk files)
                else walk dirs $ k (Chunk files)
        walk dirs = foldr1 (>>>) $ map enumDir dirs
        getFilesDirs = do
            cnts <- map (dir </>) <$> getValidContents dir
            (,) <$> filterM doesFileExist cnts
                <*> filterM isSearchableDir cnts


allFirstLines :: FilePath -> IO ()
allFirstLines dir = do
    i' <- enumDir dir $ joinI $ firstLineE printI
    run i'


main = do
    dir:_ <- getArgs
    allFirstLines dir
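
For readers who have not yet met the vocabulary used in the code above, here is a toy model of the three roles. It is only a sketch under simplifying assumptions: the types Stream and Iter and the functions headI, enumList and runToy are made up here and are much simpler than the real definitions in the iteratee package (no monad, no error handling, no leftover input).

```haskell
-- Toy types for illustration only; the iteratee package defines richer ones.
data Stream a = Chunk [a] | EOF

-- An iteratee is either finished with a result or waiting for more input.
data Iter a b
    = Done b
    | Cont (Stream a -> Iter a b)

-- Consumer: return the first element it is fed, if any.
headI :: Iter a (Maybe a)
headI = Cont step
  where
    step (Chunk [])      = headI
    step (Chunk (x : _)) = Done (Just x)
    step EOF             = Done Nothing

-- Producer: feed list elements one at a time and stop as soon as the
-- consumer is Done. The producer, not the consumer, drives the loop.
enumList :: [a] -> Iter a b -> Iter a b
enumList (x : xs) (Cont k) = enumList xs (k (Chunk [x]))
enumList _        i        = i

-- Signal end of input and extract the result, if the consumer yields one.
runToy :: Iter a b -> Maybe b
runToy (Done b) = Just b
runToy (Cont k) = case k EOF of
    Done b -> Just b
    Cont _ -> Nothing

main :: IO ()
main = do
    print (runToy (enumList "hello" headI))  -- Just (Just 'h')
    print (runToy (enumList "" headI))       -- Just Nothing
```

In this picture, enumDir plays the role of enumList, printI is a consumer like headI, and run corresponds to runToy. Because the producer knows exactly when the consumer has finished, it also knows when the underlying resource can be released.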

To test the effect of the different implementations, I prepared three test cases.

#!/usr/bin/bash

LOGDIR=log
mkdir -p $LOGDIR

ulimit -aS > $LOGDIR/ulimit_stat.log

if [[ ! -d test01 ]]; then
    mkdir -p test01
    cd test01
    for f in $(seq 1 500)
    do
        echo $RANDOM > $f
    done
    cd ..
fi

./allfirstlines_naive test01/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_naive_test01.log 1>/dev/null
./allfirstlines_iteratee test01/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_iteratee_test01.log 1>/dev/null

if [[ ! -d test02 ]]; then
    mkdir -p test02
    cd test02
    for f in $(seq 1 2048)
    do
        echo $RANDOM > $f
    done
    cd ..
fi

./allfirstlines_naive test02/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_naive_test02.log 1>/dev/null
./allfirstlines_iteratee test02/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_iteratee_test02.log 1>/dev/null


if [[ ! -d test03 ]]; then
    mkdir -p test03
    ./gen_nuclear_test test03
fi

./allfirstlines_naive test03/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_naive_test03.log 1>/dev/null
./allfirstlines_iteratee test03/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_iteratee_test03.log 1>/dev/null

The first test case is the easiest one: a directory containing 500 files. The second is a directory containing more than 2000 files, which can exhaust the file descriptors. The third adds another layer of complication: a directory hierarchy containing more than 2000 files in total. The third one is generated with a Haskell program as follows:

import Control.Monad
import System.Directory
import System.Environment

layer :: Int -> IO ()
layer depth = do
    if (depth <= 0)
        then do mapM_ (flip writeFile "b!") (map show [1..2])
                return ()
        else do createDirectory "0"
                setCurrentDirectory "0"
                layer $ depth-1
                setCurrentDirectory ".."

                createDirectory "1"
                setCurrentDirectory "1"
                layer $ depth-1
                setCurrentDirectory ".."


main = do
    (dir:_) <- getArgs
    setCurrentDirectory dir
    putStrLn "creating ..." >> layer 10 >> putStrLn "done!"
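
The size of the tree that layer 10 produces can be checked with a little arithmetic: every level doubles the number of branches, and every leaf directory receives two files. The sketch below mirrors the recursion of layer without touching the file system; the names filesCreated and dirsCreated are introduced here for illustration only.

```haskell
-- Number of files written by `layer depth`: two files at each leaf,
-- and each non-leaf level recurses into two subdirectories.
filesCreated :: Int -> Int
filesCreated depth
    | depth <= 0 = 2
    | otherwise  = 2 * filesCreated (depth - 1)

-- Number of directories created by `layer depth`: two per non-leaf level,
-- plus everything created beneath each of them.
dirsCreated :: Int -> Int
dirsCreated depth
    | depth <= 0 = 0
    | otherwise  = 2 + 2 * dirsCreated (depth - 1)

main :: IO ()
main = do
    print (filesCreated 10)  -- 2048
    print (dirsCreated 10)   -- 2046
```

So test03 holds 2048 files, the same number as test02, but spread over 1024 leaf directories with two files each.
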

For convenience, I wrote a simple Makefile so that running the tests is reduced to make test and cleaning up to make clean. (Note that I use ghc 7.0.2, where you have to compile with -rtsopts to enable most runtime flags.)

.PHONY: clean test

test: allfirstlines_iteratee allfirstlines_naive gen_nuclear_test
        bash run_test.sh

allfirstlines_naive: allfirstlines_naive.hs
        ghc --make -rtsopts -O2 $@

allfirstlines_iteratee: allfirstlines_iteratee.hs
        ghc --make -rtsopts -O2 $@

gen_nuclear_test: gen_nuclear_test.hs
        ghc --make -O2 $@

clean:
        rm -f *.o *.hi
        rm -rf test01/ test02/ test03/

My ulimit -aS output is as follows; it shows the number of open files allowed for a process.

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 401408
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
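
The only line that matters for these tests is the open files limit. As a sketch, the same value can also be read from inside a Haskell program, assuming a POSIX system and the unix package, which provides System.Posix.Resource. The helper describe is a made-up name used for display only.

```haskell
import System.Posix.Resource
    ( Resource (ResourceOpenFiles)
    , ResourceLimit (..)
    , ResourceLimits (..)
    , getResourceLimit
    )

-- Hypothetical helper: render a limit the way ulimit prints it.
describe :: ResourceLimit -> String
describe ResourceLimitInfinity = "unlimited"
describe ResourceLimitUnknown  = "unknown"
describe (ResourceLimit n)     = show n

main :: IO ()
main = do
    -- Soft and hard limits on the number of open file descriptors.
    limits <- getResourceLimit ResourceOpenFiles
    putStrLn $ "open files (soft): " ++ describe (softLimit limits)
    putStrLn $ "open files (hard): " ++ describe (hardLimit limits)
```

The printed values depend on the machine and shell settings; on the system used for this article the soft limit would correspond to the 1024 shown above.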

The naive result for test case 1. It is inefficient: garbage collection takes 21.3% of the run time.

./allfirstlines_naive test01/ +RTS -sstderr 
      30,731,840 bytes allocated in the heap
       1,160,436 bytes copied during GC
       8,165,024 bytes maximum residency (4 sample(s))
       2,214,040 bytes maximum slop
              12 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:    51 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     4 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.02s  (  0.02s elapsed)
  GC    time    0.01s  (  0.01s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.02s  (  0.02s elapsed)

  %GC time      21.3%  (21.9% elapsed)

  Alloc rate    1,656,435,077 bytes per MUT second

  Productivity  74.1% of total user, 76.2% of total elapsed

The iteratee result for test case 1. The garbage collection time is reduced to 6.6%. It is still high because we use a naive approach to enumerate the files in a given directory layer.

./allfirstlines_iteratee test01/ +RTS -sstderr 
       3,222,812 bytes allocated in the heap
         268,504 bytes copied during GC
         266,296 bytes maximum residency (1 sample(s))
          16,348 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     5 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.01s  (  0.01s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.02s  (  0.01s elapsed)

  %GC time       6.6%  (7.0% elapsed)

  Alloc rate    221,758,205 bytes per MUT second

  Productivity  86.3% of total user, 91.9% of total elapsed

Test case 2 results for the naive implementation. The resource is exhausted, as expected.

./allfirstlines_naive test02/ +RTS -sstderr 
allfirstlines_naive: test02/1752: openFile: resource exhausted (Too many open files)
      25,549,600 bytes allocated in the heap
       3,387,656 bytes copied during GC
      16,887,092 bytes maximum residency (5 sample(s))
       4,258,052 bytes maximum slop
              22 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:    40 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     5 collections,     0 parallel,  0.01s,  0.01s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.05s  (  0.05s elapsed)
  GC    time    0.01s  (  0.01s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.06s  (  0.06s elapsed)

  %GC time      18.5%  (18.7% elapsed)

  Alloc rate    548,298,210 bytes per MUT second

  Productivity  79.9% of total user, 80.9% of total elapsed

Test case 2 for the iteratee implementation. It completes without error, but with a fairly high percentage of GC time (9.2%). As stated previously, this is because we use a naive approach to enumerate the files in a given layer.

./allfirstlines_iteratee test02/ +RTS -sstderr 
      13,091,456 bytes allocated in the heap
       2,215,796 bytes copied during GC
       1,590,196 bytes maximum residency (2 sample(s))
          38,256 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:    21 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     2 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.06s  (  0.06s elapsed)
  GC    time    0.01s  (  0.01s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.07s  (  0.07s elapsed)

  %GC time       9.2%  (9.3% elapsed)

  Alloc rate    216,670,627 bytes per MUT second

  Productivity  89.0% of total user, 90.3% of total elapsed

Test case 3 for the naive implementation. Again, the file descriptors are exhausted.

./allfirstlines_naive test03/ +RTS -sstderr 
allfirstlines_naive: test03/0/1/1/1/1/1/1/1/1/0/1: openFile: resource exhausted (Too many open files)
      32,198,888 bytes allocated in the heap
       6,282,568 bytes copied during GC
      16,901,268 bytes maximum residency (5 sample(s))
       4,255,424 bytes maximum slop
              22 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:    50 collections,     0 parallel,  0.01s,  0.01s elapsed
  Generation 1:     5 collections,     0 parallel,  0.01s,  0.01s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.13s  (  0.13s elapsed)
  GC    time    0.02s  (  0.02s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.15s  (  0.15s elapsed)

  %GC time      11.5%  (11.6% elapsed)

  Alloc rate    239,243,962 bytes per MUT second

  Productivity  87.8% of total user, 88.2% of total elapsed

Test case 3 for the iteratee implementation. The GC time is greatly reduced, since each layer contains at most 2 files in this test case.

./allfirstlines_iteratee test03/ +RTS -sstderr 
      24,048,476 bytes allocated in the heap
          67,792 bytes copied during GC
          22,964 bytes maximum residency (1 sample(s))
          20,068 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:    40 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.20s  (  0.20s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.20s  (  0.20s elapsed)

  %GC time       0.4%  (0.4% elapsed)

  Alloc rate    122,192,573 bytes per MUT second

  Productivity  99.0% of total user, 99.3% of total elapsed

On the one hand, this example served as an exercise to help me understand iteratees; on the other hand, it demonstrates the problem of lazy IO with a simple, concrete case, which I have not seen elsewhere on the web.

I list other useful links for further info.