Learn Haskell — 2
Today, let’s program a simple but useful command-line tool in Haskell. The program takes one argument, N, the number of words to output. It then reads text from stdin, parses out each word, and prints up to N words. The trick is that the whitespace between the words must be preserved exactly. Let’s see an example.
Suppose we have the following input, main.rs:
fn main() {
    // Statements here are executed when the compiled binary is called.
    // Print text to the console.
    println!("Hello World!");
}
Let’s say our program is called tokenize. Here is what we should expect from it:
# print out only the first 7 words
$ ./tokenize 7 < main.rs
fn main() {
// Statements here are
# print out only the first 3 words
$ ./tokenize 3 < main.rs
fn main() {
Start with Simple
If the program did not need to preserve the whitespace, it would be very simple, as Haskell has the built-in words and unwords functions. Let’s first implement this easier version, which simply replaces every run of whitespace with a single space:
import System.Environment (getArgs)

main :: IO ()
main = do
  args <- getArgs
  let n = read (head args) :: Int
  content <- getContents
  putStrLn $ unwords $ take n $ words content
Let’s test that it works as expected:
$ ./tokenize 7 < main.rs
fn main() { // Statements here are
$ ./tokenize 3 < main.rs
fn main() {
This works just as expected.
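Note, however, the limitation: the newline before // in the original file has been collapsed into a single space. That is because words discards all whitespace and unwords rejoins with single spaces. A small sketch of my own illustrating the behavior:

```haskell
main :: IO ()
main = do
  -- `words` splits on any run of whitespace and drops it entirely:
  print (words "fn main() {\n// Statements")
  -- ["fn","main()","{","//","Statements"]

  -- `unwords` rejoins with single spaces, so the newline is lost:
  putStrLn (unwords (words "a\n\nb"))
  -- a b
```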
Step by step
The next step is to implement a words_ws function that returns not just the list of words but also the runs of contiguous whitespace between them. The example below shows how it differs from the built-in words function:
words " a dog and a puppy" == ["a", "dog", "and", "a", "puppy"]
words_ws " a dog and a puppy" == [" ", "a", " ", "dog", " ", "and", " ", "a", " ", "puppy"]
With this function, all we need to do is take 2*n tokens (each word is preceded by one whitespace token) and concatenate them.
import System.Environment (getArgs)
import Data.Char (isSpace)

main :: IO ()
main = do
  args <- getArgs
  let n = read (head args) :: Int
  content <- getContents
  putStrLn $ concat $ take (2 * n) $ words_ws content

words_ws :: String -> [String]
words_ws = undefined -- TODO
Now, all we need to do is implement the words_ws function. For this, we are going to use the built-in span function:
span :: (a -> Bool) -> [a] -> ([a], [a])
span, applied to a predicate p and a list xs, returns a tuple where first element is the longest prefix (possibly empty) of xs of elements that satisfy p and second element is the remainder of the list:
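For instance, here is a quick GHCi-style check of my own, splitting on whitespace:

```haskell
import Data.Char (isSpace)

main :: IO ()
main = do
  print (span isSpace "  a dog")        -- ("  ","a dog")
  print (span (not . isSpace) "a dog")  -- ("a"," dog")
```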
With this, we can alternate between spanning whitespace characters and non-whitespace characters, prepending each pair to the result of the recursive call.
words_ws :: String -> [String]
words_ws "" = []
words_ws xs =
  let (spaces, rest) = span isSpace xs
      (word, rest') = span (not . isSpace) rest
  in [spaces, word] ++ words_ws rest'
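A useful sanity check (my own, not from the original post): because words_ws keeps every character of the input, concatenating its result must reconstruct the input exactly, i.e., concat . words_ws == id:

```haskell
import Data.Char (isSpace)

words_ws :: String -> [String]
words_ws "" = []
words_ws xs =
  let (spaces, rest) = span isSpace xs
      (word, rest') = span (not . isSpace) rest
  in [spaces, word] ++ words_ws rest'

main :: IO ()
main = do
  -- Round-trip: no character is lost or reordered.
  let s = " a dog and a puppy\n// with  extra   spaces"
  print (concat (words_ws s) == s)  -- True
```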
Now, let’s assemble the entire program as tokenize.hs, compile it, and test:
-- tokenize.hs
import Data.Char (isSpace)
import System.Environment (getArgs)

main :: IO ()
main = do
  args <- getArgs
  let n = read (head args) :: Int
  content <- getContents
  putStrLn $ concat $ take (2 * n) $ words_ws content

words_ws :: String -> [String]
words_ws "" = []
words_ws xs =
  let (spaces, rest) = span isSpace xs
      (word, rest') = span (not . isSpace) rest
  in [spaces, word] ++ words_ws rest'
$ ghc tokenize
$ ./tokenize 7 < main.rs
fn main() {
// Statements here are
$ ./tokenize 3 < main.rs
fn main() {
Everything works as expected! What surprised me was the memory efficiency: the program’s footprint stayed constant at about 4 MB regardless of the input size. This is because getContents uses lazy I/O, so the program streams through the input instead of holding it all in memory.
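The constant footprint comes from lazy evaluation: take stops demanding input once it has produced enough tokens. The same effect can be seen on an infinite input (my own sketch):

```haskell
main :: IO ()
main = do
  -- `cycle` produces an infinite string, yet this terminates, because
  -- `take` only forces the first five words into existence.
  print (take 5 (words (cycle "hello world ")))
  -- ["hello","world","hello","world","hello"]
```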
How about in Rust?
I tried to implement the same program in Rust. In particular, I wanted to make it as lazy as possible, i.e., to use iterators rather than materialized arrays. For that reason, I avoided reading the entire input into a String, since that would require O(m) space, where m is the size of the input. Below is my attempt at a recursive approach, similar to the Haskell implementation.
use std::env::args;
use std::io::{stdin, stdout, BufReader, BufWriter, Read, Result, Write};
use std::iter::{empty, Peekable};

fn main() -> Result<()> {
    let n = args().nth(1).unwrap().parse::<usize>().unwrap();
    let reader = BufReader::new(stdin().lock());
    let mut writer = BufWriter::new(stdout().lock());
    for bytes in words_ws(reader.bytes().map(|r| r.unwrap()).peekable())
        .into_iter()
        .take(2 * n)
    {
        writer.write_all(&bytes)?;
    }
    Ok(())
}

fn is_whitespace(x: u8) -> bool {
    match x {
        b'\t' | b'\n' | b'\x0C' | b'\r' | b' ' => true,
        _ => false,
    }
}

fn span<I: Iterator<Item = u8>>(peekable: &mut Peekable<I>, pred: impl Fn(u8) -> bool) -> Vec<u8> {
    let mut bytes = Vec::new();
    while let Some(x) = peekable.peek() {
        if pred(*x) {
            bytes.push(*x);
            peekable.next();
        } else {
            break;
        }
    }
    bytes
}

// The `'static` bound is needed so `Peekable<I>` can be moved into the
// boxed trait object returned by the recursive call.
fn words_ws<I: Iterator<Item = u8> + 'static>(
    mut peekable: Peekable<I>,
) -> Box<dyn Iterator<Item = Vec<u8>>> {
    if peekable.peek().is_none() {
        return Box::new(empty());
    }
    let spaces = span(&mut peekable, is_whitespace);
    let words = span(&mut peekable, |x| !is_whitespace(x));
    Box::new(
        [spaces, words]
            .into_iter()
            .chain(words_ws(peekable).into_iter()),
    )
}
Not only is the Rust implementation much more verbose and complex than the Haskell one, but, more importantly, the program overflowed the stack for moderate-size inputs due to the recursion in words_ws(). Unlike in Haskell, the argument to chain() is evaluated eagerly, so words_ws() recurses all the way to the end of the input before the first token is ever returned.
I finally gave in and implemented it in a more idiomatic Rust way, without recursion. Although this implementation ran faster and used less memory than the Haskell one, I find it far too verbose now that I have written the same thing in just a few lines of Haskell.
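The eager-construction pattern can be reproduced in isolation. Below is a sketch of my own, with a hypothetical countdown function: the recursive call runs to completion during construction, one stack frame per level, before a single item is consumed.

```rust
use std::iter::{empty, once};

// Build a boxed iterator recursively, like words_ws does. The recursive
// call is evaluated *before* `chain` runs, so the stack depth grows
// linearly with n -- a large enough n overflows the stack.
fn countdown(n: u64) -> Box<dyn Iterator<Item = u64>> {
    if n == 0 {
        Box::new(empty())
    } else {
        Box::new(once(n).chain(countdown(n - 1)))
    }
}

fn main() {
    let v: Vec<u64> = countdown(3).collect();
    println!("{:?}", v); // [3, 2, 1]
}
```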
use std::env::args;
use std::io::{stdin, stdout, BufReader, BufWriter, Read, Result, Write};
use std::iter::Peekable;

fn main() -> Result<()> {
    let n = args().nth(1).unwrap().parse::<usize>().unwrap();
    let reader = BufReader::new(stdin().lock());
    let mut writer = BufWriter::new(stdout().lock());
    let iter = WhitespaceTokenizer::new(reader.bytes().map(|r| r.unwrap()).peekable());
    for bytes in iter.take(2 * n) {
        writer.write_all(&bytes)?;
    }
    writer.write_all(&[b'\n'])?;
    Ok(())
}

fn is_whitespace(x: u8) -> bool {
    match x {
        b'\t' | b'\n' | b'\x0C' | b'\r' | b' ' => true,
        _ => false,
    }
}

fn span<I: Iterator<Item = u8>>(peekable: &mut Peekable<I>, pred: impl Fn(u8) -> bool) -> Vec<u8> {
    let mut bytes = Vec::new();
    while let Some(x) = peekable.peek() {
        if pred(*x) {
            bytes.push(*x);
            peekable.next();
        } else {
            break;
        }
    }
    bytes
}

enum State {
    WhiteSpace,
    Word,
}

struct WhitespaceTokenizer<I: Iterator<Item = u8>> {
    peekable: Peekable<I>,
    state: State,
}

impl<I: Iterator<Item = u8>> WhitespaceTokenizer<I> {
    fn new(iter: I) -> Self {
        Self {
            peekable: iter.peekable(),
            state: State::WhiteSpace,
        }
    }
}

impl<I: Iterator<Item = u8>> Iterator for WhitespaceTokenizer<I> {
    type Item = Vec<u8>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.peekable.peek().is_none() {
            None
        } else {
            let xs = match self.state {
                State::WhiteSpace => {
                    self.state = State::Word;
                    span(&mut self.peekable, is_whitespace)
                }
                State::Word => {
                    self.state = State::WhiteSpace;
                    span(&mut self.peekable, |x| !is_whitespace(x))
                }
            };
            Some(xs)
        }
    }
}
My conclusion is that Haskell makes it much easier to write lazy versions of operations than typical languages do. There is a steep learning curve, and I’m still at the very bottom of it, but the recursive implementation is typically more elegant. Yes, it is still slower, but not every application requires performance on par with Rust. I do think Haskell is a good language, and I can see why there are so many Haskell zealots.