Io has a normally distributed random number generator:

bash $ io
Io 20090105
Io> Random gaussian
==> 1.0989883665373978
Io> Random gaussian(100,10)
==> 88.2743995098800269

The underlying source code uses the Box-Muller transform:

double RandomGen_gaussian(RandomGen *self, double m, double s)
{
		// http://www.taygeta.com/random/gaussian.html
	double x1, x2, w, y1, y2;

	do {
		x1 = 2.0 * RandomGen_randomDouble(self) - 1.0;
		x2 = 2.0 * RandomGen_randomDouble(self) - 1.0;
		w = x1 * x1 + x2 * x2;
	} while ( w >= 1.0 );

	w = sqrt( (-2.0 * log( w ) ) / w );
	y1 = x1 * w;
	y2 = x2 * w;

		// The following code resulted in a lot of nans being returned. The
		// following code *should* also be slower.
		/*
	double x1 = 2.0 * RandomGen_randomDouble(self) - 1.0;
	double x2 = 2.0 * RandomGen_randomDouble(self) - 1.0;
	double y1 = sqrt( - 2.0 * log(x1) ) * cos( M_PI_2 * x2 );
		*/

	return ( m + y1 * s );
}

Factor also has a normally distributed generator:

bash $ factor
( scratchpad ) USE: random
( scratchpad ) 0 1 normal-random-float

--- Data stack:
-1.651647700885478

The interesting story here is the Factor source is algorithmically identical to what is commented out in the Io source:

: normal-random-float ( mean sigma -- n )
    0.0 1.0 uniform-random-float
    0.0 1.0 uniform-random-float
    [ 2 pi * * cos ]
    [ 1.0 swap - log -2.0 * sqrt ]
    bi* * * + ;

So a normally distributed random variable is easy to simulate, but what about the other distributions? It occurred to me that the inverse of the cumulative distribution function (qv quantile function) of an RV will convert a standard uniform random number into a sample from the RV.

Since most languages don’t have cumulative distribution functions, much less quantile functions, this converts the problem into an exercise in numerical analysis. I wondered if this was the technique used in the R source code. It wasn’t so for either rgamma or rnorm. rnorm underlyingly calls this routine, which consists of almost 300 lines of C:

/*
 *  REFERENCE
 *
 *    Ahrens, J.H. and Dieter, U.
 *    Extensions of Forsythe's method for random sampling from
 *    the normal distribution.
 *    Math. Comput. 27, 927-937.
 *
 *    The definitions of the constants a[k], d[k], t[k] and
 *    h[k] are according to the abovementioned article
 */
double norm_rand(void)
{

The paper mentioned in the comment doesn’t appear to be available online, but it is summarized here. It goes back to an idea von Neumann had in the 1950s. The technique still involves multiple uniform distribution samples, which are apparently cheaper than evaluating a quantile function. It uses far fewer samples on average than the 30 I used in rnorm.

Here is a relevant Knuth quote:

It is conceivable that someday somebody will invent a random number generator that produces one of these other random quantities directly, instead of getting it indirectly via the uniform distribution. But no direct methods have as yet proved to be practical, except for the “random bit” generator described in Section 3.2.2.

The Art of Computer Programming: Seminumerical Algorithms, p. 119

According to the central limit theorem, a large enough sample from any distribution will converge to a normal distribution. So we might as well use the built in uniform distribution, which has variance 1/12, and the fact that the sum of independent random variables has variance equal to the sum of the variances.

def rnorm(mean=0, sd=1)
  n = 30
  sum = Array.new(n) { rand }.inject { |m,o| m+o }
  zscore = (sum - (n/2.0))/(n/12.0)**0.5
  zscore * sd + mean
end

How to make a quick ASCII histogram of the results:

h = Hash.new(0)
1000.times { h[(10*rnorm).to_i.to_f/10] += 1 }
(-35..35).each { |k| puts "#{k.abs/10.0} #{'#' * h[k/10.0]}" }

A javascript version:

function rnorm(mean,sd) {
  n=30;
  if (null == mean) { mean = 0; }
  if (null == sd) { sd = 1;}
  for(sum=0, i=0;i < n; i++) {
    sum += Math.random();
  }
  zscore = (sum - (n/2))/Math.pow(n/12,0.5);
  return zscore * sd + mean;
}

6/16 update: code fix: tripped up by the fact that 3/2 == 1 in ruby. In javascript 3/2 == 1.5

A nice R feature that I haven’t seen in any other language is random number generators for distributions other than the uniform distribution. The most useful is the normal distribution: here’s a crude simulation of rnorm in ruby

def rnorm(mean=0, sd=1)
  coin_flips=37
  binomial = Array.new(coin_flips) { rand(2) }.inject { |m,o| m+o }
  zscore = 2 * (binomial - (coin_flips/2))/coin_flips**0.5
  zscore * sd + mean
end

and in javascript:

function rnorm(mean,sd) {
  coin_flips=37;
  if (null == mean) { mean = 0; }
  if (null == sd) { sd = 1;}
  binomial=0;
  for(i=0;i < coin_flips; i++) {
    if ( 0.5 < Math.random() ) {
      binomial += 1;
    }
  }
  zscore = 2 * (binomial - (coin_flips/2))/Math.pow(coin_flips,0.5);
  return zscore  * sd + mean;
}

Unlike the R version, there are only 38 possible values that above functions can assume, which might not be acceptable if repeated trials are being made. To fix that, here is a ruby version which mixes in another random number.

def fuzz(width)
  [(rand-0.5)*width, ((width**2)/12.0)**0.5]
end

def rnorm(mean=0, sd=1)
  coin_flips=37
  fuzz, fuzz_sd = fuzz(1/coin_flips**0.5)
  binomial = Array.new(coin_flips) { rand(2) }.inject { |m,o| m+o }
  zscore = 2 * (binomial - (coin_flips/2))/coin_flips**0.5
  zscore * (sd**2-fuzz_sd**2)**0.5 + mean + fuzz
end

The previous solution is a lot slower than the system provided grep:

Macintosh-2:Haskell clark$ ghc --make Main.hs
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
Macintosh-2:Haskell clark$ time ./Main -r teen regex-tdfa-1.1.2

real	0m0.643s
user	0m0.546s
sys	0m0.089s
Macintosh-2:Haskell clark$ time grep -r teen regex-tdfa-1.1.2

real	0m0.018s
user	0m0.008s
sys	0m0.010s

ghc has a profiler which we can use to find out where the code is spending its time and how much memory is getting allocated.

The profiler requires that options be passed to it on the command line. This is inconvenient when the program to be profiled itself takes command line options. Thus, to profile the program, the first step was to hardcode some command line arguments.

-- a <- getArgs
a <- return [ "-r", "teen", "regex-tdfa-1.1.2" ]

Next compile a profiled version of the code:

ghc --make -prof -auto-all Main.hs

Then run the executable:

./Main +RTS -p

The profiling output appears in Main.prof:

        Mon May 25 09:02 2009 Time and Allocation Profiling Report  (Final)

           Main +RTS -p -RTS

        total time  =        0.25 secs   (5 ticks @ 50 ms)
        total alloc = 179,197,136 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

doesLineMatch                  Main                  60.0   32.1
grepHandlePerLine              Main                  40.0   67.4

                                                                                               individual    inherited
COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc

MAIN                     MAIN                                                   1           0   0.0    0.0   100.0  100.0
  main                   Main                                                 262           0   0.0    0.0     0.0    0.0
   grepFiles             Main                                                 263           0   0.0    0.0     0.0    0.0
    grepFile             Main                                                 264           0   0.0    0.0     0.0    0.0
 CAF                     Main                                                 236           9   0.0    0.0   100.0  100.0
  grepHandle             Main                                                 252           0   0.0    0.0     0.0    0.0
  grepHandlePerLine      Main                                                 251           0   0.0    0.0     0.0    0.0
  main                   Main                                                 242           1   0.0    0.0   100.0  100.0
   getGrepPattern        Main                                                 259           1   0.0    0.0     0.0    0.0
    matchCaseInsensitive Main                                                 260           1   0.0    0.0     0.0    0.0
   grepFiles             Main                                                 245          47   0.0    0.0   100.0  100.0
    grepFile             Main                                                 246          37   0.0    0.1   100.0  100.0
     grepHandle          Main                                                 253       10705   0.0    0.0   100.0   99.9
      grepHandlePerLine  Main                                                 255       10705  40.0   67.4   100.0   99.8
       outputNonMatch    Main                                                 261       10677   0.0    0.3     0.0    0.3
       doesLineMatch     Main                                                 256       10677  60.0   32.1    60.0   32.1
        matchCaseInsensitive Main                                                 258       10677   0.0    0.0     0.0    0.0
        matchRegularExpression Main                                                 257       10677   0.0    0.0     0.0    0.0
      outputPerFile      Main                                                 254       10705   0.0    0.1     0.0    0.1
     recursiveDir        Main                                                 250           9   0.0    0.0     0.0    0.0
   getOptions            Main                                                 244           4   0.0    0.0     0.0    0.0
   getNonOptionArguments Main                                                 243           4   0.0    0.0     0.0    0.0
 CAF                     GHC.Int                                              201           1   0.0    0.0     0.0    0.0
 CAF                     GHC.Handle                                           189           6   0.0    0.0     0.0    0.0
 CAF                     System.Posix.Internals                               172           7   0.0    0.0     0.0    0.0
 CAF                     System.Directory                                     125           2   0.0    0.0     0.0    0.0
  main                   Main                                                 247           0   0.0    0.0     0.0    0.0
   grepFiles             Main                                                 248           0   0.0    0.0     0.0    0.0
    grepFile             Main                                                 249           0   0.0    0.0     0.0    0.0

The profiling did not immediately suggest ways to improve performance. It appears to be an IO issue, and this article demonstrates faster ways to do IO in Haskell.

The challenge is to parse a CSV file and insert it into a database table.

The first row of the CSV file will contain the column names of the target table. The CSV parser should support double quoted and unquoted fields, as well as the doubling mechanism for escaping double quotes within a double quoted fields. The parser should correctly handle commas and newlines within double quotes.

See the problem statement in a previous post.

A Complete Solution

This solution adds support for recursive searches, the -l and -L flags, and regular expressions.

Regular expressions were difficult. This blog post explains how to use them, but I had difficulty compiling the code. The command that was good enough previously, namely

ghc grep.hs

resulted in linker errors when I included Test.Regex.Posix. The fix was to rename my file Main.hs and compile it with

ghc --make Main.hs

module Main where
import System
import Data.List
import IO
import Char
import Directory
import Text.Regex.Posix

getOptions [] = []
getOptions (x:xs) | x == "--" = []
                  | x /= "--" = if length x > 1 && x !! 0 == '-'
                                  then (tail x) ++ getOptions xs
                                  else getOptions xs 

recursiveDir opts = elem 'r' opts
outputNonMatch opts = ( elem 'v' opts && not (elem 'l' opts) ) || elem 'L' opts
outputPerFile opts = elem 'l' opts || elem 'L' opts
matchCaseInsensitive opts = elem 'i' opts
matchRegularExpression opts = elem 'E' opts
matchExact opts = not (matchCaseInsensitive opts) && not (matchRegularExpression opts)
showLine opts = not (outputPerFile opts)
showOffset opts = not (outputPerFile opts) && elem 'b' opts
showLineNumber opts = not (outputPerFile opts) && elem 'n' opts
showFile opts = elem 'H' opts || elem 'r' opts

getNonOptionArguments [] = []
getNonOptionArguments (x:xs) | x == "--" = xs
	                     | x/= "--" = if x == "-" || x !! 0 /= '-'
                                            then x : getNonOptionArguments xs
                                            else getNonOptionArguments xs

getGrepPattern opts firstArg = if matchCaseInsensitive opts
                                 then map toLower firstArg
                                 else firstArg

data Location a b c = Location { file :: String, lineNumber :: Int, offset :: Int }
incLocation loc l = Location (file loc ) (lineNumber loc + 1) (offset loc + length l + 1)

showResult loc opts pattern l = do
  if showFile opts
    then do putStr (file loc)
            putChar ':'
    else return ()
  if showLineNumber opts
    then do putStr (show (lineNumber loc + 1))
            putChar ':'
    else return ()
  if showOffset opts
    then do putStr (show (offset loc))
            putChar ':'
    else return ()
  putStrLn l

doesLineMatch opts pattern l | matchRegularExpression opts = l =~ pattern :: Bool
                             | matchCaseInsensitive opts   = isInfixOf pattern (map toLower l)
                             | True                        = isInfixOf pattern l

grepHandlePerFile loc opts pattern h = do
  eof <- hIsEOF h
  if eof
    then
      if outputNonMatch opts
        then putStrLn (file loc)
        else return ()
    else do
      l <- hGetLine h
      if doesLineMatch opts pattern l
        then
          if outputNonMatch opts
            then return ()
            else putStrLn (file loc)
        else
          grepHandlePerFile (incLocation loc l) opts pattern h

grepHandlePerLine loc opts pattern h = do
  eof <- hIsEOF h
  if eof
    then return ()
    else do
      l <- hGetLine h
      if ( doesLineMatch opts pattern l ) == ( outputNonMatch opts )
        then return ()
        else showResult loc opts pattern l
      grepHandle (incLocation loc l) opts pattern h

grepHandle loc opts pattern h = do
  if outputPerFile opts
    then grepHandlePerFile loc opts pattern h
    else grepHandlePerLine loc opts pattern h

grepFile opts pattern file = do
  if file == "-"
    then grepHandle (Location file 0 0) opts pattern stdin
    else do
      d <- doesDirectoryExist file
      if d
        then
          if recursiveDir opts
            then do
              dc <- getDirectoryContents file
              grepFiles opts pattern (map ( (file ++ "/") ++ ) (filter (\x -> x !! 0 /= '.' ) dc))
            else return ()
        else do
          h  <- openFile file ReadMode `catch` (\e -> do hPutStrLn stderr (show e); return (stderr))
          if h == stderr
            then return ()
            else grepHandle (Location file 0 0) opts pattern h

grepFiles opts pattern files = do
  if length files == 0
    then return ()
    else do
      grepFile opts pattern (head files)
      grepFiles opts pattern (tail files)

main = do
  usage <- return "Usage: grep [OPTION]... PATTERN [FILE]..."
  a <- getArgs
  userOpts <- return ( getOptions a )
  args <- return ( getNonOptionArguments a )
  badOpts <- return ( userOpts \\ "bEhHilLnrv" )
  opts <- if length args > 2 && not ( elem 'h' userOpts )
            then return ('H':userOpts)
            else return (userOpts)
  if length badOpts > 0
    then do putStr "grep: invalid option -- "
            putChar (badOpts !! 0)
	    putStrLn ""
            putStrLn usage
    else if length args == 0
           then putStrLn usage
           else do pattern <- return ( getGrepPattern opts (head args) )
                   if length args == 1
                     then grepHandle (Location "(standard input)" 0 0) opts pattern stdin
                     else grepFiles opts pattern (tail args)

The problem statement and partial solution #1

Partial Solution #2: source code

The following flags are implemented: -b, -h, -H, -i, -n, -v

The -b flag displays the byte offset of the start of the line containing the occurrence. I originally implemented finding the byte offset of the start of the occurrence itself, which is harder, but that’s not how grep works so I was wasting my time.

Haskell has a library System.Console.GetOpt for processing command line options, but we are doing it by hand for code-writing practice.

Rather than defining a data type to contain all the option information, this solution passes around the options as a string of characters. For each behavior variant, there is a function which examines the options and returns a boolean value.

The challenge is to implement grep in Haskell.

grep has a lot of options. To simplify the problem somewhat, we only implement a subset of the single letter options such as the following, which don’t take arguments:

  • -b: output offset of occurrence
  • -E: pattern is regular expression
  • -h: suppress filename in output: default for one file
  • -H: show filename in output: default for two or more files
  • -i: case insensitve match
  • -l: file name of matching files only
  • -L: file name of non-matching files only
  • -n: output line # of occurrence
  • -r: if an argument is a directory, recursively grep all files in the directory
  • -v: output lines that don’t match instead of lines that match

grep displays its usage if there are no non-option arguments. If there are non-option arguments, the first is the pattern, and the remaining, if any, are the files to be grepped. If no files are specified, then standard input is grepped.

Options and non-option arguments can appear in any order. Usually, options are recognized by the hyphen, but any arguments that appear after two hyphens: -- are treated as non-options. Thus, grep torque -- -o will grep the file -o for the pattern torque.

Multiple options can be grouped into a single argument: grep -iv is the same as grep -i -v.

Also, a single hyphen: - as an argument is always a non-option, and if it is not the pattern, it refers to the standard input. There is no provision that I am aware of for grepping a file named -.

Partial Solution #1

This solution parses the arguments correctly, storing the options as characters in a list. However, it ignores the options and behaves as if no options have been specified:

module Main where
import System
import Data.List
import IO

getOptions [] = []
getOptions (x:xs) | x == "--" = []
                  | x /= "--" = if length x > 1 && x !! 0 == '-'
                                          then (tail x) ++ getOptions xs
                                          else getOptions xs 

getArguments [] = []
getArguments (x:xs) | x == "--" = xs
	            | x/= "--" = if x == "-" || x !! 0 /= '-'
                                           then x : getArguments xs
                                           else getArguments xs

grepHandle pattern h = do
  eof <- hIsEOF h
  if eof then return () else
    do l <- hGetLine h
       if isInfixOf pattern l
         then putStrLn l
         else return ()
       grepHandle pattern h

grepFile pattern file = do
  if file == "-"
    then grepHandle pattern stdin
    else do h <- openFile file ReadMode
            grepHandle pattern h

grepFiles pattern files = do
  if length files == 0
    then return ()
    else do grepFile pattern (head files)
            grepFiles pattern (tail files)

main = do
  a <- getArgs
  opts <- return ( getOptions a )
  args <- return ( getArguments a )
  if length args == 0
    then print "Usage: grep [OPTION]... PATTERN [FILE]..."
    else if length args == 1
      then grepHandle ( args !! 0 ) stdin
      else grepFiles (head args) (tail args)

Starting with Haskell

May 17, 2009

A beginner might find these links useful:

Haskell isn’t a good first language, but someone who knows C and Lisp has a good chance. Lisp is helpful because there are no loops in Haskell. Where normally one would write a loop, a recursive function must be written instead.

C is helpful because the Glasgow Haskell Compiler works like the GNU compiler. In particular, if the code is contained in grep.hs, then

ghc grep.hs

will create an executable called a.out.

Learning Haskell reminds me of learning C because of the time I spend puzzling over compiler error messages. The key to understanding Haskell error messages is the Hindley-Milner type inference algorithm.

The REPLs lack global variables or definitions. At least I haven’t figured out how to define a function with hugs and get the definition to stick around. Fortunately hugs has readline.

Macintosh-2:Haskell clark$ hugs
__   __ __  __  ____   ___      _________________________________________
||   || ||  || ||  || ||__      Hugs 98: Based on the Haskell 98 standard
||___|| ||__|| ||__||  __||     Copyright (c) 1994-2005
||---||         ___||           World Wide Web: http://haskell.org/hugs
||   ||                         Bugs: http://hackage.haskell.org/trac/hugs
||   || Version: September 2006 _________________________________________

Haskell 98 mode: Restart with command line option -98 to enable extensions

Type :? for help
Hugs> inc x = x + 1
ERROR - Syntax error in input (unexpected `=')
Hugs> let inc x = x + 1 in inc 1
2
Hugs> inc 2
ERROR - Undefined variable "inc"