The concept of iteratee is currently one of topics not much material cover it. Its rising resulted from its elegancy to overcome the disadvantage of lazy IO, to which most people blame for the lack of control on resource. Although it is generally categorized as an advanced topic, I don’t think it is too difficult to get the hang of it. It is an advanced topic because in most case lazy IO can just satisfy your needs, you only need it when you run into pretty tough/big problems.
I am not going to elaborate the design of iteratee, since there are already great introductions on the web. I will just recommend a reading priority to reduce the extra difficulties ahead. Here I list three of them from richly stressing in the context to direct implementation
- Yesod Book
- Oleg Kiselyov’s talk
- John Milllikin’s tutorial
I believe following this order can ease the loading of your brain. The introduction provided by Yesod Book adress the problem from the benefit of the inverse of control, without jamming you too much about the problems of lazy IO. About lazy IO and the spirit of iteratee, Oleg’s slide give an excellent and clear explanation. It doesn’t require you a lot of background knowlege, very easy to read. And John’s tutorial can be directly correspond to concrete package on hackage, that is enumerator. That would facilitate doing experiment by yourself.
I will illustrate an example about the usage of iteratee, with accompanying testcase demonstrating the problem of lazy IO. This example is a modification from Kazu Yamamoto’s tutorial. I replace the enumerator package with the iteratee package. And instead of finding filenames matching given pattern, this example traverse the directories and print the first line of enumerated files (in ByteString.) The design of the example is intended to exhaust the limited resource, file descriptors, of naive implementation.
Print all the first lines of files in a directory Let’s look at the naive implementation first. Since the ByteString is used in the iteratee implementation, to be parallel, it imports Data.ByteString.Lazy.Char8. It also uses readFile intentionally to be an lazy IO program. Otherwise not much to address.
import Control.Monad
import Control.Monad.IO.Class
import Control.Applicative
import System.Environment
import System.Directory
import System.FilePath
import qualified Data.List as L
import qualified Data.ByteString.Lazy.Char8 as B
getValidContents :: FilePath -> IO [String]
getValidContents path =
filter (`notElem` [".", "..", ".git", ".svn"])
<$> getDirectoryContents path
isSearchableDir :: FilePath -> IO Bool
isSearchableDir dir =
(&&) <$> doesDirectoryExist dir
<*> (searchable <$> getPermissions dir)
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents dir = do
cnts <- map (dir </>) <$> getValidContents dir
cnts' <- forM cnts $ \path -> do
isDirectory <- isSearchableDir path
if isDirectory
then getRecursiveContents path
else return [path]
return . concat $ cnts'
firstLine :: FilePath -> IO B.ByteString
firstLine file = do
b <- B.readFile file
return $ head . B.lines $ b
allFirstLines :: FilePath -> IO ()
allFirstLines dir = do
filepaths <- getRecursiveContents dir
l <- mapM firstLine $ filepaths
mapM_ B.putStrLn l
main = do
dir:_ <- getArgs
allFirstLines dir
Next we look at the iteratee implementation. I re-implement enumDir function from Kazy’s tutorial with iteratee package. It may be a personal taste, an implementation with enumerator would be beginner friendly than an implementation with iteratee in my opinion. I also implement an enumeratee to adapt an enumerator of filepaths to an enumerator of bytestrings: firstLineE. In firstLineE I adopt enumerator and enumeratee provided by the iteratee package to reduce the work.
import Control.Monad
import Control.Monad.IO.Class
import Control.Applicative
import System.Environment
import System.Directory
import System.FilePath
import qualified Data.List as L
import qualified Data.ByteString as B
import qualified Data.Iteratee as I
import Data.Iteratee.Iteratee
import qualified Data.Iteratee.Char as EC
import qualified Data.Iteratee.IO.Fd as EIO
import qualified Data.Iteratee.ListLike as EL
getValidContents :: FilePath -> IO [String]
getValidContents path =
filter (`notElem` [".", "..", ".git", ".svn"])
<$> getDirectoryContents path
isSearchableDir :: FilePath -> IO Bool
isSearchableDir dir =
(&&) <$> doesDirectoryExist dir
<*> (searchable <$> getPermissions dir)
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents dir = do
cnts <- map (dir </>) <$> getValidContents dir
cnts' <- forM cnts $ \path -> do
isDirectory <- isSearchableDir path
if isDirectory
then getRecursiveContents path
else return [path]
return . concat $ cnts'
printI :: Iteratee [B.ByteString] IO ()
printI = do
mx <- EL.tryHead
case mx of
Nothing -> return ()
Just l -> do
liftIO . B.putStrLn $ l
printI
firstLineE :: Enumeratee [FilePath] [B.ByteString] IO ()
firstLineE = mapChunksM $ \filenames -> do
forM filenames $ \filename -> do
i <- EIO.enumFile 1024 filename $ joinI $ ((mapChunks B.pack) ><> EC.enumLinesBS) EL.head
result <- run i
return result
enumDir :: FilePath -> Enumerator [FilePath] IO b
enumDir dir iter = runIter iter idoneM onCont
where
onCont k Nothing = do
(files, dirs) <- liftIO getFilesDirs
if null dirs
then return $ k (Chunk files)
else walk dirs $ k (Chunk files)
walk dirs = foldr1 (>>>) $ map enumDir dirs
getFilesDirs = do
cnts <- map (dir </>) <$> getValidContents dir
(,) <$> filterM doesFileExist cnts
<*> filterM isSearchableDir cnts
allFirstLines :: FilePath -> IO ()
allFirstLines dir = do
i' <- enumDir dir $ joinI $ firstLineE printI
run i'
main = do
dir:_ <- getArgs
allFirstLines dir
To test the effect of different implementation, I prepare three test cases.
#!/usr/bin/bash
LOGDIR=log
mkdir -p $LOGDIR
ulimit -aS > $LOGDIR/ulimit_stat.log
if [[ ! -d test01 ]]; then
mkdir -p test01
cd test01
for f in $(seq 1 500)
do
echo $RANDOM > $f
done
cd ..
fi
./allfirstlines_naive test01/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_naive_test01.log 1>/dev/null
./allfirstlines_iteratee test01/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_iteratee_test01.log 1>/dev/null
if [[ ! -d test02 ]]; then
mkdir -p test02
cd test02
for f in $(seq 1 2048)
do
echo $RANDOM > $f
done
cd ..
fi
./allfirstlines_naive test02/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_naive_test02.log 1>/dev/null
./allfirstlines_iteratee test02/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_iteratee_test02.log 1>/dev/null
if [[ ! -d test03 ]]; then
mkdir -p test03
./gen_nuclear_test test03
fi
./allfirstlines_naive test03/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_naive_test03.log 1>/dev/null
./allfirstlines_iteratee test03/ +RTS -sstderr -RTS 2> $LOGDIR/allfirstlines_iteratee_test03.log 1>/dev/null
The first test case is the most easy one, a directory containing 500 files, and the second one is a directory containing more than 2000 files, that would possibly exhaust file descriptors. The third one add another layer of complication, with directory hierarchy containing a total of more than 2000 files, The third one is generated with a haskell script as follows:
import Control.Monad
import System.Directory
import System.Environment
layer :: Int -> IO ()
layer depth = do
if (depth <= 0)
then do mapM_ (flip writeFile "b!") (map show [1..2])
return ()
else do createDirectory "0"
setCurrentDirectory "0"
layer $ depth-1
setCurrentDirectory ".."
createDirectory "1"
setCurrentDirectory "1"
layer $ depth-1
setCurrentDirectory ".."
main = do
(dir:_) <- getArgs
setCurrentDirectory dir
putStrLn "creating ..." >> layer 10 >> putStrLn "done!"
For convenience, I write a simple Makefile to make the test command be reduced to make test, and the clean command to make clean (Note that I use ghc 7.0.2, you have to compile with -rtsopts to enable most runtime flags.)
.PHONY: clean test
test: allfirstlines_iteratee allfirstlines_naive gen_nuclear_test
bash run_test.sh
allfirstlines_naive: allfirstlines_naive.hs
ghc --make -rtsopts -O2 $@
allfirstlines_iteratee: allfirstlines_iteratee.hs
ghc --make -rtsopts -O2 $@
gen_nuclear_test: gen_nuclear_test.hs
ghc --make -O2 $@
clean:
rm -f *.o *.hi
rm -rf test01/ test02/ test03/
And my ulimit -aS is as follows to see the allowed number of open files for a process.
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 20
file size (blocks, -f) unlimited
pending signals (-i) 401408
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
The naive result for test case 1. We can see it is inefficient for the garbage collection time taking 21.3% of time.
./allfirstlines_naive test01/ +RTS -sstderr
30,731,840 bytes allocated in the heap
1,160,436 bytes copied during GC
8,165,024 bytes maximum residency (4 sample(s))
2,214,040 bytes maximum slop
12 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 51 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 4 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.02s ( 0.02s elapsed)
GC time 0.01s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.02s ( 0.02s elapsed)
%GC time 21.3% (21.9% elapsed)
Alloc rate 1,656,435,077 bytes per MUT second
Productivity 74.1% of total user, 76.2% of total elapsed
The iteratee benchmark for test case 1. The garbage collection time is reduced to 6.6%. It is still high because we use naive approach to enumerate files in a specific layer of directory.
./allfirstlines_iteratee test01/ +RTS -sstderr
3,222,812 bytes allocated in the heap
268,504 bytes copied during GC
266,296 bytes maximum residency (1 sample(s))
16,348 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 5 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.01s ( 0.01s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.02s ( 0.01s elapsed)
%GC time 6.6% (7.0% elapsed)
Alloc rate 221,758,205 bytes per MUT second
Productivity 86.3% of total user, 91.9% of total elapsed
test case 2 results for naive implementation. The resource is exhausted, as expected
./allfirstlines_naive test02/ +RTS -sstderr
allfirstlines_naive: test02/1752: openFile: resource exhausted (Too many open files)
25,549,600 bytes allocated in the heap
3,387,656 bytes copied during GC
16,887,092 bytes maximum residency (5 sample(s))
4,258,052 bytes maximum slop
22 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 40 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 5 collections, 0 parallel, 0.01s, 0.01s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.05s ( 0.05s elapsed)
GC time 0.01s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.06s ( 0.06s elapsed)
%GC time 18.5% (18.7% elapsed)
Alloc rate 548,298,210 bytes per MUT second
Productivity 79.9% of total user, 80.9% of total elapsed
test case 2 for iteratee implemetation. It completes without error, but with a high percentage of GC time. As previously stated, it is because we use a naive approach to enumerate files at a specific layer.
./allfirstlines_iteratee test02/ +RTS -sstderr
13,091,456 bytes allocated in the heap
2,215,796 bytes copied during GC
1,590,196 bytes maximum residency (2 sample(s))
38,256 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 21 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 2 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.06s ( 0.06s elapsed)
GC time 0.01s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.07s ( 0.07s elapsed)
%GC time 9.2% (9.3% elapsed)
Alloc rate 216,670,627 bytes per MUT second
Productivity 89.0% of total user, 90.3% of total elapsed
test case 3 for naive case. Again, file descriptors are exhausted.
./allfirstlines_naive test03/ +RTS -sstderr
allfirstlines_naive: test03/0/1/1/1/1/1/1/1/1/0/1: openFile: resource exhausted (Too many open files)
32,198,888 bytes allocated in the heap
6,282,568 bytes copied during GC
16,901,268 bytes maximum residency (5 sample(s))
4,255,424 bytes maximum slop
22 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 50 collections, 0 parallel, 0.01s, 0.01s elapsed
Generation 1: 5 collections, 0 parallel, 0.01s, 0.01s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.13s ( 0.13s elapsed)
GC time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.15s ( 0.15s elapsed)
%GC time 11.5% (11.6% elapsed)
Alloc rate 239,243,962 bytes per MUT second
Productivity 87.8% of total user, 88.2% of total elapsed
test case 3 for iteratee case. The GC time is greatly reduced since a layer is at most 2 files in it in this test case.
./allfirstlines_iteratee test03/ +RTS -sstderr
24,048,476 bytes allocated in the heap
67,792 bytes copied during GC
22,964 bytes maximum residency (1 sample(s))
20,068 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 40 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.20s ( 0.20s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.20s ( 0.20s elapsed)
%GC time 0.4% (0.4% elapsed)
Alloc rate 122,192,573 bytes per MUT second
Productivity 99.0% of total user, 99.3% of total elapsed
On one hand, this example serves as my excercise to help me understand iteratee, on the otherhand, it addresses the problem of lazy IO with such a simple concrete case, which is not seen on the web.
I list other useful links for further info.