Cuneiform Ruby

July 3, 2009

I’m guessing that you don’t have a cuneiform font installed. Install one now.

Cuneiform characters were officially assigned Unicode points in 2006. None of these points are in the Basic Multilingual Plane. A lot of software which claims to handle Unicode cannot handle points outside of the BMP. For example, before version 1.5 java assumed all characters could fit into 16 bits. So cuneiform is a good test of the system.

Ruby 1.9 extends the \u notation for entering Unicode points in strings:


bash$ irb
irb(main):001:0> puts "\u{1204C}"
𒁌

Better yet, use cuneiform as a variable


bash$ cat cuneiform.rb
# encoding: utf-8
𒀻     = 13
puts 𒀻

bash$ ruby cuneiform.rb
13

I had better luck with vi than emacs editing the file. There were display issues. Beacuse 𒀻 is quite wide for a single character and Terminal is fixed width, the cuneiform character was overlapping the characters that follow it.


bash$ cat cuneiform.rb
# encoding: utf-8

class 𒀽
  def puts
    puts "I am cuneiform: \u{1203D}"
  end
end

c = 𒀽.new
c.puts
bash$ ruby cuneiform.rb
cuneiform.rb:3: class/module name must be CONSTANT

So this tells us that Ruby considers the character to be lower case. Ruby doesn’t appear to recognize any non-ASCII capital letters in source code:


bash$ cat greek.rb
#encoding: utf-8
class Λ
  def initialize(λ)
    @λ = λ
  end
  def apply(*a)
    @λ.call(*a)
  end
end

λ = Λ.new(lambda { |x,y| x+y })
puts λ.apply(3,7)

bash$ ruby greek.rb
greek.rb:2: class/module name must be CONSTANT

Ruby is aware of capitals in strings:


irb(main):010:0> s = "\u039B"; puts s; /\p{Upper}/.match(s)
Λ
=> #<MatchData "Λ">
irb(main):011:0> s = "\u03BB"; puts s; /\p{Upper}/.match(s)
λ
=> nil
irb(main):012:0> s = "\u{12040}"; puts s; /\p{Upper}/.match(s)
𒁀
=> nil

We can use Unicode in symbols:


# encoding: utf-8
alias :λ :lambda
add = λ { |x,y| x + y }
puts add.call(3,11)

Kindle 2 Thoughts

June 28, 2009

I like this device. I’d like to get rid of my four shelves of books and rely on this thing. Unfortunately, we’re not quite there yet, since many of the books aren’t available as Kindle editions. The replacement cost would be high, though if I paid for my Kindle books as I needed them, it might not be too bad. Many of my books I probably won’t use again.

Some areas for improvement in order of declining importance:

availability and quality of Kindle books:
Lots of books still not available. Some Kindle books have quality issues. Some lack a table of contents with links, and left and right on the 5-way controller, which is supposed to move by chapter or section, doesn’t always work. I see books with bits like “procrast- ination”, which suggest some sort of problem in the conversion to Kindle process. Downloading a sample of the book first or reading reviews can help avoid these bad books, but it might not be a big deal since the bad books are usually the dollar books.

navigation:
Should be able to execute commands in the menu with one or two keystrokes. Droppping the menu and navigating to the desired action with the 5 way can easily be 5 or more keystrokes (counting the 5 way as a key). Also, sometimes a sentence or paragraph is broken by the page break, and it is desirable to quickly see the whole. I would like to be able to reposition the page by one location with a single stroke. Best I can do now is use the goto location in the menu. There is some weirdness I don’t understand about the goto location, because sometimes the location provided doesn’t become the beginning of the page. Are not all locations positionable at the top of the page? Also, if you page backward, sometimes it takes more page forwards to return to the original position, which is disorienting.

screen refresh:
Too slow, affects navigation, both when paging and moving the cursor. Is this a limitation of the screen technology or the speed of the processor?

symbols
Using the symbol palette that pops up when you hit the SYM key stinks. First of all, there aren’t enough symbols. When making a note on a mathematical text, I would like to have Greek letters. Also, when using the 5 way, I find it too easy to push in instead of push left, which results in the wrong symbol selected. Not sure what the solution is. Have every symbol in the table numbered, so it can be selected with one or two keystrokes? Or add symbols to the number keys like a regular keyboard (plenty of space above the keyboard)?

internationalization:
There are some foreign language texts, and accented Latin displays correctly, but no foreign language dictionaries. Ideally, I would like to set two dictionaries simultaneously when reading a foreign text, to have for example a Spanish dictionary and a Spanish-to-English dictionary.

pdf:
Sent a PDF to amazon to have it converted and downloaded. It was a challenging document, 3M of images scans of a 1937 master’s thesis. Amazon chopped it up into pages. One page would be the best guess of the text in the image, and the next would be the image. Document size on the kindle didn’t increase much, and the process took about 5-10 minutes and cost $0.75. Although the text was ugly and had lots of errors and was not completely necessary because the image quality was readable, I see the value of including it for search. Unfortunately, I didn’t change the file name when I attached it in the email, and ended up with 344541425.pdf as the document name on my home screen. Could not find out how to change it. Not even hooking up my laptop via USB and editing the file name.

web, mp3:
The browser could be improved. The external speakers make crackle noise. I haven’t tried plugging in headphones.

other:
One can image other improvements: color screen, backlit screen. These may add more cost than value. I hanker for a stylus like the old palm pilot, for taking notes, or even doing the exercises in a technical work, so I don’t have to have an extra notebook and pen. Seems like there is room for increasing the screen size without increasing the overall size of the device: about an inch is wasted horizontally.

gcc and the mac

June 20, 2009

Re-acquainting myself with gcc. Since the man page is extensive, I summarize some useful options.

-c : compile and assemble, don’t link
-S : compile, don’t assemble
-E : preprocess, don’t compile
-o FILE : create target named FILE
-v : print version number. Output commands used during compilation
-Wall : highest warning level
-w : turn off warning
-g : add debugging info to object (for gdb)
-p : add profiling info to object (for prof)
-pg : add profiling info to object (for gprof)
-Q : output compiler stats for each function
-O,-O1 : optimize
-O2 : optimize more
-O3 : highest optimization level
-fast : optimize for speed
-Os : optimize for size
-D NAME : for preprocessor, set macro NAME to 1
-D NAME=VALUE: set macro NAME to VALUE
-U NAME : undefine macro NAME
-I DIR : add DIR to include search path
-L IDR : add DIR to library search path
-M : output make dependencies
-l LIB : search libLIB for symbol definitions (linker)

Some oddities peculiar to the mac that I’m not used to. Here’s how to create a shared object:

gcc -c stack.c
gcc -dynamiclib -o libstack.dylib -dylib stack.o

Also, the mac version of gcc appends an underscore to symbols

bash $ cat main.c
#include <stdio.h>

void
hello_world() {
  printf("hello world\n");
}

int
main(int argc, char **argv) {
  hello_world();
}

bash $ gcc main.c
bash $ ./a.out
hello world
bash $ nm a.out
0000200c D _NXArgc
00002008 D _NXArgv
00002000 D ___progname
00001fb4 t __dyld_func_lookup
00001000 A __mh_execute_header
00002004 D _environ
         U _exit
00001fc2 T _hello_world
00001fe3 T _main
         U _puts
00002010 d dyld__mach_header
00001fa0 t dyld_stub_binding_helper
00001f60 T start

A common example of this is Covey’s set:

  • be proactive
  • begin with the end in mind
  • first things first
  • think win-win
  • first understand, then be understood
  • synergize
  • sharpen the saw

Lean software development is also septofascicular:

  • deliver fast
  • decide late
  • eliminate waste
  • empower the employee
  • amplify learning
  • build in integrity
  • see the whole

The Rock Warrior’s Way organizes itself into seven principles, but Ilgner doesn’t distill them out as mnemonic imperatives. So I attempt my own formulations:

  • become self-aware
  • see subtlety
  • accept responsibility
  • ask what you can give
  • commit with unbending intent
  • don’t expect, listen
  • value the journey, not the destination

Shortcut Keys

June 13, 2009

Note to self: these shortcut keys are useful. Practice using them.

mac

I set the dock to hide itself. I set the number of spaces to two, so that I am never searching for a window. If I don’t see it, it’s in the other space. Also, both arrow keys take me to the other space, so I don’t need to remember where I am. Also just learned about pbcopy and pbpaste.


cmd-del        move selection to trash
cmd-rarrow   move between spaces (click and hold on window to carry it with you)
cmd-larrow    move between spaces
cmd-E            eject
cmd-W           close window
cmd-Q           quit app
cmd-+           adjust magnification
cmd--
F11                 expose (at least that's where I want it)
cmd-M            minimize window
cmd-TAB         change app

emacs

Don’t exit emacs to prowl the the filesystem or run a command. Wasteful.


M-!              execute external cmd
M-x shell    shell buffer
C-c C-c       send C-c (shell mode)
C-c C-d       send C-d (shell mode)
C-c C-z       send C-z (shell mode)
C-x C-b      display buffer list
M-x sort-lines
M-x reverse-region
^                  visit parent directory (dired)
d                  mark for delete (dired)
i                   display contents of subdir at bottom of same buffer (dired)
o                  open file in different window (C-o doesn't change focus)
m                 mark (dired)
u                  unmark (dired)
x                  expunge files marked for delete (dired)
#                  mark auto-save files for deletion (dired)
~                  mark emacs backup files for deletion (dired)
D                 delete marked files (dired)
R                  rename marked files (multiple files are moved into dir)
C                  copy marked files
+                 make a subdir (dired)
$                  hide subdir (dired)
M-$              hide all subdirs (dired)
M-x find-name-dired
M-x find-grep-dired
M-x find-grep
M-x wdired-change-to-wdired-mode   rename and move files by editing the dired buffer

ssh, screen

Remember to ssh user@host instead of ssh host followed by su user. Use ssh -X to launch X windows, and then emacs can be in its own window. Also, you pay for that up front laziness: how to set up ssh keys.

Basic screen procedure: run screen, run the comand, then detach with C-a d. Reattach with screen -r. If multiple screen sesssions, specify pid and terminal. Screen is unkind to emacs users because of the C-a binding but that is all you need. Stop using nohup.

firefox

installed “It’s All Text!” and reinstalled Firebug and YSlow. For the record the other addons I find useful: Live HTTP Headers, Web Developer, google and del.icio.us toobars.

websites

got a github account. Pastie at gist.github.com

also look into

  • running svn commands from within emacs
  • keep some standard dot files .bashrc .irbrc .emacs at github
  • install these emacs files
  • color highlighting in emacs not working on linux

Io has a normally distributed random number generator:


bash $ io
Io 20090105
Io> Random gaussian
==> 1.0989883665373978
Io> Random gaussian(100,10)
==> 88.2743995098800269

The underlying source code uses the Box-Muller transform:


double RandomGen_gaussian(RandomGen *self, double m, double s)
{
		// http://www.taygeta.com/random/gaussian.html
	double x1, x2, w, y1, y2;

	do {
		x1 = 2.0 * RandomGen_randomDouble(self) - 1.0;
		x2 = 2.0 * RandomGen_randomDouble(self) - 1.0;
		w = x1 * x1 + x2 * x2;
	} while ( w >= 1.0 );

	w = sqrt( (-2.0 * log( w ) ) / w );
	y1 = x1 * w;
	y2 = x2 * w;

		// The following code resulted in a lot of nans being returned. The
		// following code *should* also be slower.
		/*
	double x1 = 2.0 * RandomGen_randomDouble(self) - 1.0;
	double x2 = 2.0 * RandomGen_randomDouble(self) - 1.0;
	double y1 = sqrt( - 2.0 * log(x1) ) * cos( M_PI_2 * x2 );
		*/

	return ( m + y1 * s );
}

Factor also has a normally distributed generator:


bash $ factor
( scratchpad ) USE: random
( scratchpad ) 0 1 normal-random-float

--- Data stack:
-1.651647700885478

The interesting story here is the Factor source is algorithmically identical to what is commented out in the Io source:


: normal-random-float ( mean sigma -- n )
    0.0 1.0 uniform-random-float
    0.0 1.0 uniform-random-float
    [ 2 pi * * cos ]
    [ 1.0 swap - log -2.0 * sqrt ]
    bi* * * + ;

So a normally distributed random variable is easy to simulate, but what about the other distributions? It occurred to me that the inverse of the cumulative distribution function (qv quantile function) of an RV will convert a standard uniform random number into a sample from the RV.

Since most languages don’t have cumulative distribution functions, much less quantile functions, this converts the problem into an exercise in numerical analysis. I wondered if this was the technique used in the R source code. It wasn’t so for either rgamma or rnorm. rnorm underlyingly calls this routine, which consists of almost 300 lines of C:


/*
 *  REFERENCE
 *
 *    Ahrens, J.H. and Dieter, U.
 *    Extensions of Forsythe's method for random sampling from
 *    the normal distribution.
 *    Math. Comput. 27, 927-937.
 *
 *    The definitions of the constants a[k], d[k], t[k] and
 *    h[k] are according to the abovementioned article
 */
double norm_rand(void)
{

The paper mentioned in the comment doesn’t appear to be available online, but it is summarized here. It goes back to an idea von Neumann had in the 1950s. The technique still involves multiple uniform distribution samples, which are apparently cheaper than evaluating a quantile function. It uses far fewer samples on average than the 30 I used in rnorm.

Here is a relavant Knuth quote:

It is conceivable that someday somebody will invent a random number generator that produces one of these other random quantities directly, instead of getting it indirectly via the uniform distribution. But no direct methods have as yet proved to be practical, except for the “random bit” generator described in Section 3.2.2.

The Art of Computer Programming: Seminumerical Algorithms, p. 119

According to the central limit theorem, a large enough sample from any distribution will converge to a normal distribution. So we might as well use the built in uniform distribution, which has variance 1/12, and the fact that the sum of independent random variables has variance equal to the sum of the variances.


def rnorm(mean=0, sd=1)
  n = 30
  sum = Array.new(n) { rand }.inject { |m,o| m+o }
  zscore = (sum - (n/2.0))/(n/12.0)**0.5
  zscore * sd + mean
end

How to make a quick ASCII histogram of the results:


h = Hash.new(0)
1000.times { h[(10*rnorm).to_i.to_f/10] += 1 }
(-35..35).each { |k| puts "#{k.abs/10.0} #{'#' * h[k/10.0]}" }

A javascript version:


function rnorm(mean,sd) {
  n=30;
  if (null == mean) { mean = 0; }
  if (null == sd) { sd = 1;}
  for(sum=0, i=0;i < n; i++) {
    sum += Math.random();
  }
  zscore = (sum - (n/2))/Math.pow(n/12,0.5);
  return zscore * sd + mean;
}

6/16 update: code fix: tripped up by the fact that 3/2 == 1 in ruby. In javascript 3/2 == 1.5

A nice R feature that I haven’t seen in any other language is random number generators for distributions other than the uniform distribution. The most useful is the normal distribution: here’s a crude simulation of rnorm in ruby


def rnorm(mean=0, sd=1)
  coin_flips=37
  binomial = Array.new(coin_flips) { rand(2) }.inject { |m,o| m+o }
  zscore = 2 * (binomial - (coin_flips/2))/coin_flips**0.5
  zscore * sd + mean
end

and in javascript:


function rnorm(mean,sd) {
  coin_flips=37;
  if (null == mean) { mean = 0; }
  if (null == sd) { sd = 1;}
  binomial=0;
  for(i=0;i < coin_flips; i++) {
    if ( 0.5 < Math.random() ) {
      binomial += 1;
    }
  }
  zscore = 2 * (binomial - (coin_flips/2))/Math.pow(coin_flips,0.5);
  return zscore  * sd + mean;
}

Unlike the R version, there are only 38 possible values that above functions can assume, which might not be acceptable if repeated trials are being made. To fix that, here is a ruby version which mixes in another random number.


def fuzz(width)
  [(rand-0.5)*width, ((width**2)/12.0)**0.5]
end

def rnorm(mean=0, sd=1)
  coin_flips=37
  fuzz, fuzz_sd = fuzz(1/coin_flips**0.5)
  binomial = Array.new(coin_flips) { rand(2) }.inject { |m,o| m+o }
  zscore = 2 * (binomial - (coin_flips/2))/coin_flips**0.5
  zscore * (sd**2-fuzz_sd**2)**0.5 + mean + fuzz
end

The previous solution is a lot slower than the system provided grep:


Macintosh-2:Haskell clark$ ghc --make Main.hs
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
Macintosh-2:Haskell clark$ time ./Main -r teen regex-tdfa-1.1.2

real	0m0.643s
user	0m0.546s
sys	0m0.089s
Macintosh-2:Haskell clark$ time grep -r teen regex-tdfa-1.1.2

real	0m0.018s
user	0m0.008s
sys	0m0.010s

ghc has a profiler which we can use to find out where the code is spending its time and how much memory is getting allocated.

The profiler requires that options be passed to it on the command line. This is inconvenient when the program to be profiled itself takes command line options. Thus, to profile the program, the first step was to hardcode some command line arguments.

-- a <- getArgs
a <- return [ "-r", "teen", "regex-tdfa-1.1.2" ]

Next compile a profiled version of the code:

ghc --make -prof -auto-all Main.hs

Then run the executable:

./Main +RTS -p

The profiling output appears in Main.prof:


        Mon May 25 09:02 2009 Time and Allocation Profiling Report  (Final)

           Main +RTS -p -RTS

        total time  =        0.25 secs   (5 ticks @ 50 ms)
        total alloc = 179,197,136 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

doesLineMatch                  Main                  60.0   32.1
grepHandlePerLine              Main                  40.0   67.4

                                                                                               individual    inherited
COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc

MAIN                     MAIN                                                   1           0   0.0    0.0   100.0  100.0
  main                   Main                                                 262           0   0.0    0.0     0.0    0.0
   grepFiles             Main                                                 263           0   0.0    0.0     0.0    0.0
    grepFile             Main                                                 264           0   0.0    0.0     0.0    0.0
 CAF                     Main                                                 236           9   0.0    0.0   100.0  100.0
  grepHandle             Main                                                 252           0   0.0    0.0     0.0    0.0
  grepHandlePerLine      Main                                                 251           0   0.0    0.0     0.0    0.0
  main                   Main                                                 242           1   0.0    0.0   100.0  100.0
   getGrepPattern        Main                                                 259           1   0.0    0.0     0.0    0.0
    matchCaseInsensitive Main                                                 260           1   0.0    0.0     0.0    0.0
   grepFiles             Main                                                 245          47   0.0    0.0   100.0  100.0
    grepFile             Main                                                 246          37   0.0    0.1   100.0  100.0
     grepHandle          Main                                                 253       10705   0.0    0.0   100.0   99.9
      grepHandlePerLine  Main                                                 255       10705  40.0   67.4   100.0   99.8
       outputNonMatch    Main                                                 261       10677   0.0    0.3     0.0    0.3
       doesLineMatch     Main                                                 256       10677  60.0   32.1    60.0   32.1
        matchCaseInsensitive Main                                                 258       10677   0.0    0.0     0.0    0.0
        matchRegularExpression Main                                                 257       10677   0.0    0.0     0.0    0.0
      outputPerFile      Main                                                 254       10705   0.0    0.1     0.0    0.1
     recursiveDir        Main                                                 250           9   0.0    0.0     0.0    0.0
   getOptions            Main                                                 244           4   0.0    0.0     0.0    0.0
   getNonOptionArguments Main                                                 243           4   0.0    0.0     0.0    0.0
 CAF                     GHC.Int                                              201           1   0.0    0.0     0.0    0.0
 CAF                     GHC.Handle                                           189           6   0.0    0.0     0.0    0.0
 CAF                     System.Posix.Internals                               172           7   0.0    0.0     0.0    0.0
 CAF                     System.Directory                                     125           2   0.0    0.0     0.0    0.0
  main                   Main                                                 247           0   0.0    0.0     0.0    0.0
   grepFiles             Main                                                 248           0   0.0    0.0     0.0    0.0
    grepFile             Main                                                 249           0   0.0    0.0     0.0    0.0

The profiling did not immediately suggest ways to improve performance. It appears to be an IO issue, and this article demonstrates faster ways to do IO in Haskell.