Nonuniform Random Number Generators IV
June 12, 2009
Io has a normally distributed random number generator:
bash $ io Io 20090105 Io> Random gaussian ==> 1.0989883665373978 Io> Random gaussian(100,10) ==> 88.2743995098800269
The underlying source code uses the Box-Muller transform:
double RandomGen_gaussian(RandomGen *self, double m, double s)
{
// http://www.taygeta.com/random/gaussian.html
double x1, x2, w, y1, y2;
do {
x1 = 2.0 * RandomGen_randomDouble(self) - 1.0;
x2 = 2.0 * RandomGen_randomDouble(self) - 1.0;
w = x1 * x1 + x2 * x2;
} while ( w >= 1.0 );
w = sqrt( (-2.0 * log( w ) ) / w );
y1 = x1 * w;
y2 = x2 * w;
// The following code resulted in a lot of nans being returned. The
// following code *should* also be slower.
/*
double x1 = 2.0 * RandomGen_randomDouble(self) - 1.0;
double x2 = 2.0 * RandomGen_randomDouble(self) - 1.0;
double y1 = sqrt( - 2.0 * log(x1) ) * cos( M_PI_2 * x2 );
*/
return ( m + y1 * s );
}
Factor also has a normally distributed generator:
bash $ factor ( scratchpad ) USE: random ( scratchpad ) 0 1 normal-random-float --- Data stack: -1.651647700885478
The interesting story here is the Factor source is algorithmically identical to what is commented out in the Io source:
: normal-random-float ( mean sigma -- n )
0.0 1.0 uniform-random-float
0.0 1.0 uniform-random-float
[ 2 pi * * cos ]
[ 1.0 swap - log -2.0 * sqrt ]
bi* * * + ;
Nonuniform Random Number Generators III
June 11, 2009
So a normally distributed random variable is easy to simulate, but what about the other distributions? It occurred to me that the inverse of the cumulative distribution function (qv quantile function) of an RV will convert a standard uniform random number into a sample from the RV.
Since most languages don’t have cumulative distribution functions, much less quantile functions, this converts the problem into an exercise in numerical analysis. I wondered if this was the technique used in the R source code. It wasn’t so for either rgamma or rnorm. rnorm underlyingly calls this routine, which consists of almost 300 lines of C:
/*
* REFERENCE
*
* Ahrens, J.H. and Dieter, U.
* Extensions of Forsythe's method for random sampling from
* the normal distribution.
* Math. Comput. 27, 927-937.
*
* The definitions of the constants a[k], d[k], t[k] and
* h[k] are according to the abovementioned article
*/
double norm_rand(void)
{
The paper mentioned in the comment doesn’t appear to be available online, but it is summarized here. It goes back to an idea von Neumann had in the 1950s. The technique still involves multiple uniform distribution samples, which are apparently cheaper than evaluating a quantile function. It uses far fewer samples on average than the 30 I used in rnorm.
Here is a relevant Knuth quote:
It is conceivable that someday somebody will invent a random number generator that produces one of these other random quantities directly, instead of getting it indirectly via the uniform distribution. But no direct methods have as yet proved to be practical, except for the “random bit” generator described in Section 3.2.2.
The Art of Computer Programming: Seminumerical Algorithms, p. 119
Nonuniform Random Number Generators II
June 11, 2009
According to the central limit theorem, a large enough sample from any distribution will converge to a normal distribution. So we might as well use the built in uniform distribution, which has variance 1/12, and the fact that the sum of independent random variables has variance equal to the sum of the variances.
def rnorm(mean=0, sd=1)
n = 30
sum = Array.new(n) { rand }.inject { |m,o| m+o }
zscore = (sum - (n/2.0))/(n/12.0)**0.5
zscore * sd + mean
end
How to make a quick ASCII histogram of the results:
h = Hash.new(0)
1000.times { h[(10*rnorm).to_i.to_f/10] += 1 }
(-35..35).each { |k| puts "#{k.abs/10.0} #{'#' * h[k/10.0]}" }
A javascript version:
function rnorm(mean,sd) {
n=30;
if (null == mean) { mean = 0; }
if (null == sd) { sd = 1;}
for(sum=0, i=0;i < n; i++) {
sum += Math.random();
}
zscore = (sum - (n/2))/Math.pow(n/12,0.5);
return zscore * sd + mean;
}
6/16 update: code fix: tripped up by the fact that 3/2 == 1 in ruby. In javascript 3/2 == 1.5
Nonuniform Random Number Generators
June 10, 2009
A nice R feature that I haven’t seen in any other language is random number generators for distributions other than the uniform distribution. The most useful is the normal distribution: here’s a crude simulation of rnorm in ruby
def rnorm(mean=0, sd=1)
coin_flips=37
binomial = Array.new(coin_flips) { rand(2) }.inject { |m,o| m+o }
zscore = 2 * (binomial - (coin_flips/2))/coin_flips**0.5
zscore * sd + mean
end
and in javascript:
function rnorm(mean,sd) {
coin_flips=37;
if (null == mean) { mean = 0; }
if (null == sd) { sd = 1;}
binomial=0;
for(i=0;i < coin_flips; i++) {
if ( 0.5 < Math.random() ) {
binomial += 1;
}
}
zscore = 2 * (binomial - (coin_flips/2))/Math.pow(coin_flips,0.5);
return zscore * sd + mean;
}
Unlike the R version, there are only 38 possible values that above functions can assume, which might not be acceptable if repeated trials are being made. To fix that, here is a ruby version which mixes in another random number.
def fuzz(width)
[(rand-0.5)*width, ((width**2)/12.0)**0.5]
end
def rnorm(mean=0, sd=1)
coin_flips=37
fuzz, fuzz_sd = fuzz(1/coin_flips**0.5)
binomial = Array.new(coin_flips) { rand(2) }.inject { |m,o| m+o }
zscore = 2 * (binomial - (coin_flips/2))/coin_flips**0.5
zscore * (sd**2-fuzz_sd**2)**0.5 + mean + fuzz
end
Haskell Problem 1: Unix grep (part 4)
May 25, 2009
The previous solution is a lot slower than the system provided grep:
Macintosh-2:Haskell clark$ ghc --make Main.hs [1 of 1] Compiling Main ( Main.hs, Main.o ) Linking Main ... Macintosh-2:Haskell clark$ time ./Main -r teen regex-tdfa-1.1.2 real 0m0.643s user 0m0.546s sys 0m0.089s Macintosh-2:Haskell clark$ time grep -r teen regex-tdfa-1.1.2 real 0m0.018s user 0m0.008s sys 0m0.010s
ghc has a profiler which we can use to find out where the code is spending its time and how much memory is getting allocated.
The profiler requires that options be passed to it on the command line. This is inconvenient when the program to be profiled itself takes command line options. Thus, to profile the program, the first step was to hardcode some command line arguments.
-- a <- getArgs
a <- return [ "-r", "teen", "regex-tdfa-1.1.2" ]
Next compile a profiled version of the code:
ghc --make -prof -auto-all Main.hs
Then run the executable:
./Main +RTS -p
The profiling output appears in Main.prof:
Mon May 25 09:02 2009 Time and Allocation Profiling Report (Final)
Main +RTS -p -RTS
total time = 0.25 secs (5 ticks @ 50 ms)
total alloc = 179,197,136 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
doesLineMatch Main 60.0 32.1
grepHandlePerLine Main 40.0 67.4
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 262 0 0.0 0.0 0.0 0.0
grepFiles Main 263 0 0.0 0.0 0.0 0.0
grepFile Main 264 0 0.0 0.0 0.0 0.0
CAF Main 236 9 0.0 0.0 100.0 100.0
grepHandle Main 252 0 0.0 0.0 0.0 0.0
grepHandlePerLine Main 251 0 0.0 0.0 0.0 0.0
main Main 242 1 0.0 0.0 100.0 100.0
getGrepPattern Main 259 1 0.0 0.0 0.0 0.0
matchCaseInsensitive Main 260 1 0.0 0.0 0.0 0.0
grepFiles Main 245 47 0.0 0.0 100.0 100.0
grepFile Main 246 37 0.0 0.1 100.0 100.0
grepHandle Main 253 10705 0.0 0.0 100.0 99.9
grepHandlePerLine Main 255 10705 40.0 67.4 100.0 99.8
outputNonMatch Main 261 10677 0.0 0.3 0.0 0.3
doesLineMatch Main 256 10677 60.0 32.1 60.0 32.1
matchCaseInsensitive Main 258 10677 0.0 0.0 0.0 0.0
matchRegularExpression Main 257 10677 0.0 0.0 0.0 0.0
outputPerFile Main 254 10705 0.0 0.1 0.0 0.1
recursiveDir Main 250 9 0.0 0.0 0.0 0.0
getOptions Main 244 4 0.0 0.0 0.0 0.0
getNonOptionArguments Main 243 4 0.0 0.0 0.0 0.0
CAF GHC.Int 201 1 0.0 0.0 0.0 0.0
CAF GHC.Handle 189 6 0.0 0.0 0.0 0.0
CAF System.Posix.Internals 172 7 0.0 0.0 0.0 0.0
CAF System.Directory 125 2 0.0 0.0 0.0 0.0
main Main 247 0 0.0 0.0 0.0 0.0
grepFiles Main 248 0 0.0 0.0 0.0 0.0
grepFile Main 249 0 0.0 0.0 0.0 0.0
The profiling did not immediately suggest ways to improve performance. It appears to be an IO issue, and this article demonstrates faster ways to do IO in Haskell.
Haskell Problem 2: CSV and Database Access
May 24, 2009
The challenge is to parse a CSV file and insert it into a database table.
The first row of the CSV file will contain the column names of the target table. The CSV parser should support double quoted and unquoted fields, as well as the doubling mechanism for escaping double quotes within a double quoted fields. The parser should correctly handle commas and newlines within double quotes.
Haskell Problem 1: Unix grep (part 3)
May 24, 2009
See the problem statement in a previous post.
A Complete Solution
This solution adds support for recursive searches, the -l and -L flags, and regular expressions.
Regular expressions were difficult. This blog post explains how to use them, but I had difficulty compiling the code. The command that was good enough previously, namely
ghc grep.hs
resulted in linker errors when I included Test.Regex.Posix. The fix was to rename my file Main.hs and compile it with
ghc --make Main.hs
module Main where
import System
import Data.List
import IO
import Char
import Directory
import Text.Regex.Posix
getOptions [] = []
getOptions (x:xs) | x == "--" = []
| x /= "--" = if length x > 1 && x !! 0 == '-'
then (tail x) ++ getOptions xs
else getOptions xs
recursiveDir opts = elem 'r' opts
outputNonMatch opts = ( elem 'v' opts && not (elem 'l' opts) ) || elem 'L' opts
outputPerFile opts = elem 'l' opts || elem 'L' opts
matchCaseInsensitive opts = elem 'i' opts
matchRegularExpression opts = elem 'E' opts
matchExact opts = not (matchCaseInsensitive opts) && not (matchRegularExpression opts)
showLine opts = not (outputPerFile opts)
showOffset opts = not (outputPerFile opts) && elem 'b' opts
showLineNumber opts = not (outputPerFile opts) && elem 'n' opts
showFile opts = elem 'H' opts || elem 'r' opts
getNonOptionArguments [] = []
getNonOptionArguments (x:xs) | x == "--" = xs
| x/= "--" = if x == "-" || x !! 0 /= '-'
then x : getNonOptionArguments xs
else getNonOptionArguments xs
getGrepPattern opts firstArg = if matchCaseInsensitive opts
then map toLower firstArg
else firstArg
data Location a b c = Location { file :: String, lineNumber :: Int, offset :: Int }
incLocation loc l = Location (file loc ) (lineNumber loc + 1) (offset loc + length l + 1)
showResult loc opts pattern l = do
if showFile opts
then do putStr (file loc)
putChar ':'
else return ()
if showLineNumber opts
then do putStr (show (lineNumber loc + 1))
putChar ':'
else return ()
if showOffset opts
then do putStr (show (offset loc))
putChar ':'
else return ()
putStrLn l
doesLineMatch opts pattern l | matchRegularExpression opts = l =~ pattern :: Bool
| matchCaseInsensitive opts = isInfixOf pattern (map toLower l)
| True = isInfixOf pattern l
grepHandlePerFile loc opts pattern h = do
eof <- hIsEOF h
if eof
then
if outputNonMatch opts
then putStrLn (file loc)
else return ()
else do
l <- hGetLine h
if doesLineMatch opts pattern l
then
if outputNonMatch opts
then return ()
else putStrLn (file loc)
else
grepHandlePerFile (incLocation loc l) opts pattern h
grepHandlePerLine loc opts pattern h = do
eof <- hIsEOF h
if eof
then return ()
else do
l <- hGetLine h
if ( doesLineMatch opts pattern l ) == ( outputNonMatch opts )
then return ()
else showResult loc opts pattern l
grepHandle (incLocation loc l) opts pattern h
grepHandle loc opts pattern h = do
if outputPerFile opts
then grepHandlePerFile loc opts pattern h
else grepHandlePerLine loc opts pattern h
grepFile opts pattern file = do
if file == "-"
then grepHandle (Location file 0 0) opts pattern stdin
else do
d <- doesDirectoryExist file
if d
then
if recursiveDir opts
then do
dc <- getDirectoryContents file
grepFiles opts pattern (map ( (file ++ "/") ++ ) (filter (\x -> x !! 0 /= '.' ) dc))
else return ()
else do
h <- openFile file ReadMode `catch` (\e -> do hPutStrLn stderr (show e); return (stderr))
if h == stderr
then return ()
else grepHandle (Location file 0 0) opts pattern h
grepFiles opts pattern files = do
if length files == 0
then return ()
else do
grepFile opts pattern (head files)
grepFiles opts pattern (tail files)
main = do
usage <- return "Usage: grep [OPTION]... PATTERN [FILE]..."
a <- getArgs
userOpts <- return ( getOptions a )
args <- return ( getNonOptionArguments a )
badOpts <- return ( userOpts \\ "bEhHilLnrv" )
opts <- if length args > 2 && not ( elem 'h' userOpts )
then return ('H':userOpts)
else return (userOpts)
if length badOpts > 0
then do putStr "grep: invalid option -- "
putChar (badOpts !! 0)
putStrLn ""
putStrLn usage
else if length args == 0
then putStrLn usage
else do pattern <- return ( getGrepPattern opts (head args) )
if length args == 1
then grepHandle (Location "(standard input)" 0 0) opts pattern stdin
else grepFiles opts pattern (tail args)
Haskell Problem 1: Unix grep (part 2)
May 20, 2009
The problem statement and partial solution #1
Partial Solution #2: source code
The following flags are implemented: -b, -h, -H, -i, -n, -v
The -b flag displays the byte offset of the start of the line containing the occurrence. I originally implemented finding the byte offset of the start of the occurrence itself, which is harder, but that’s not how grep works so I was wasting my time.
Haskell has a library System.Console.GetOpt for processing command line options, but we are doing it by hand for code-writing practice.
Rather than defining a data type to contain all the option information, this solution passes around the options as a string of characters. For each behavior variant, there is a function which examines the options and returns a boolean value.
Haskell Problem 1: Unix grep (part 1)
May 17, 2009
The challenge is to implement grep in Haskell.
grep has a lot of options. To simplify the problem somewhat, we only implement a subset of the single letter options such as the following, which don’t take arguments:
- -b: output offset of occurrence
- -E: pattern is regular expression
- -h: suppress filename in output: default for one file
- -H: show filename in output: default for two or more files
- -i: case insensitve match
- -l: file name of matching files only
- -L: file name of non-matching files only
- -n: output line # of occurrence
- -r: if an argument is a directory, recursively grep all files in the directory
- -v: output lines that don’t match instead of lines that match
grep displays its usage if there are no non-option arguments. If there are non-option arguments, the first is the pattern, and the remaining, if any, are the files to be grepped. If no files are specified, then standard input is grepped.
Options and non-option arguments can appear in any order. Usually, options are recognized by the hyphen, but any arguments that appear after two hyphens: -- are treated as non-options. Thus, grep torque -- -o will grep the file -o for the pattern torque.
Multiple options can be grouped into a single argument: grep -iv is the same as grep -i -v.
Also, a single hyphen: - as an argument is always a non-option, and if it is not the pattern, it refers to the standard input. There is no provision that I am aware of for grepping a file named -.
Partial Solution #1
This solution parses the arguments correctly, storing the options as characters in a list. However, it ignores the options and behaves as if no options have been specified:
module Main where
import System
import Data.List
import IO
getOptions [] = []
getOptions (x:xs) | x == "--" = []
| x /= "--" = if length x > 1 && x !! 0 == '-'
then (tail x) ++ getOptions xs
else getOptions xs
getArguments [] = []
getArguments (x:xs) | x == "--" = xs
| x/= "--" = if x == "-" || x !! 0 /= '-'
then x : getArguments xs
else getArguments xs
grepHandle pattern h = do
eof <- hIsEOF h
if eof then return () else
do l <- hGetLine h
if isInfixOf pattern l
then putStrLn l
else return ()
grepHandle pattern h
grepFile pattern file = do
if file == "-"
then grepHandle pattern stdin
else do h <- openFile file ReadMode
grepHandle pattern h
grepFiles pattern files = do
if length files == 0
then return ()
else do grepFile pattern (head files)
grepFiles pattern (tail files)
main = do
a <- getArgs
opts <- return ( getOptions a )
args <- return ( getArguments a )
if length args == 0
then print "Usage: grep [OPTION]... PATTERN [FILE]..."
else if length args == 1
then grepHandle ( args !! 0 ) stdin
else grepFiles (head args) (tail args)
Starting with Haskell
May 17, 2009
A beginner might find these links useful:
- A Gentle Intro to Haskell, a tutorial
- Standard Prelude, definitions of functions that are always available
- Haskell Quiz, some practice problems with solutions
- Haskell 98 Report, the language specification
Haskell isn’t a good first language, but someone who knows C and Lisp has a good chance. Lisp is helpful because there are no loops in Haskell. Where normally one would write a loop, a recursive function must be written instead.
C is helpful because the Glasgow Haskell Compiler works like the GNU compiler. In particular, if the code is contained in grep.hs, then
ghc grep.hs
will create an executable called a.out.
Learning Haskell reminds me of learning C because of the time I spend puzzling over compiler error messages. The key to understanding Haskell error messages is the Hindley-Milner type inference algorithm.
The REPLs lack global variables or definitions. At least I haven’t figured out how to define a function with hugs and get the definition to stick around. Fortunately hugs has readline.
Macintosh-2:Haskell clark$ hugs __ __ __ __ ____ ___ _________________________________________ || || || || || || ||__ Hugs 98: Based on the Haskell 98 standard ||___|| ||__|| ||__|| __|| Copyright (c) 1994-2005 ||---|| ___|| World Wide Web: http://haskell.org/hugs || || Bugs: http://hackage.haskell.org/trac/hugs || || Version: September 2006 _________________________________________ Haskell 98 mode: Restart with command line option -98 to enable extensions Type :? for help Hugs> inc x = x + 1 ERROR - Syntax error in input (unexpected `=') Hugs> let inc x = x + 1 in inc 1 2 Hugs> inc 2 ERROR - Undefined variable "inc"