2011-05-17

Parallel Collections in Scala 2.9.0

Just played around with Scala 2.9.0's parallel collections and so far i am really happy with it. I did not spend more than an hour just scratching on the surface of it but here is a little example usage and comparison between parallel and non parallel collections.


First I needed to have a timer function that can take the time of each run and also do som JIT warmup.


  def time(name: String, func:  => Unit) : Long = { 
    func // JIT warmup
    val then = System.currentTimeMillis
    func
    val timeDiff = System.currentTimeMillis - then
    printf("time for %-8s: %5dms\n", name, timeDiff)
    return timeDiff
  }



This higher order function takes a name of the run and a function to execute. The function that it is going to execute is without any parameters or return values. The first line is just a JIT warmup to get a better comparison between the runs.


I also needed some test cases to try it out so i thought that regexp would be a nice fit. There is a file called words on many unix/linux systems which contains a lot of dictionary words and that file was perfect for this test. I read all the lines from this file using the Source class.

  var lines = Source.fromFile("/usr/share/dict/words").getLines.toList

Now to the good stuff. To get a parallel version of this list of lines it is enough to just type:

   val par = lines.par

It is actually not a List but a ParSeq but that is not important for this specific test.  Then it comes down to running a test with the timer function and the regular expression.

  val scaleRe = """.*scala.*""".r
  val normalTime = time("normal", { lines.foreach(scaleRe.findFirstIn(_)) })

  val parTime    = time("parallel", { par.foreach(scaleRe.findFirstIn(_)) })

The results from test on my computer (Intel(R) Xeon(R) CPU W3530  @ 2.80GHz) looks like this:

98569 lines
time for normal  :    93ms
time for parallel:    24ms
parallel is 3.88 times faster


Not surprisingly the parallel version is much faster. Below is the full source of the test.

package test 

import scala.io.Source

object ParallelTest {

  def time(name: String, func:  => Unit) : Long = { 
    func // JIT warmup
    val then = System.currentTimeMillis
    func
    val timeDiff = System.currentTimeMillis - then
    printf("time for %-8s: %5dms\n", name, timeDiff)
    return timeDiff
  }
 
  def main(args : Array[String]) {
   
    var lines = Source.fromFile("/usr/share/dict/words").getLines.toList
    val par   = lines.par
   
    val scaleRe = """.*scala.*""".r
   
    printf("%d lines\n", lines.size)

    val normalTime = time("normal", { lines.foreach(scaleRe.findFirstIn(_)) })
    val parTime    = time("parallel", { par.foreach(scaleRe.findFirstIn(_)) })
   
    printf("parallel is %.2f times faster\n", normalTime.toDouble / parTime )
  }
  
}

2 comments:

  1. The warmup should be longer (doesn't matter much in the example, but for other stuff, it will) - code is only compiled after it runs 50-200 times I think (it's configurable). Also, the test lasts for a very short time, so a single GC run would throw your whole result down the drain.

    ReplyDelete
  2. Before i posted the blog i tried with adding the dataset to itself a couple of times to get more lines. When i had ~3 million lines I still got about 4 times the speedup. But I agree with you that it is better to run longer tests and have longer warmups to increase reliability of the results. But for this example I mostly wanted to show a quick demo of parallel collections in scala.

    ReplyDelete