heavytailed

September 29, 2015

Iterators in R

Filed under: Uncategorized — heavytailed @ 4:25 am

I’ve got an absurd number of post stubs at the moment, which should start coming out over the next month. This is a short code-related one: how to take many files, each sorted (by, say, p-value), and stream records into R as though they had all been merged into one file. All this needs is an implementation of a peekable iterator. This is one of those things that is boilerplate in most languages but a bit of a struggle in R (especially because of the need for local/global assignment: <<-).

The first little bit is to use the iterators and itertools packages to create a simple file iterator. I know there’s an iread.table, but it’s not exactly what I need.
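For reference, the protocol these packages expect is small: an iterator yields values via nextElem(), and itertools’ ihasNext() wraps any iterator so you can test hasNext() before pulling. A minimal sketch with the stock iter() (not the custom file iterator below):

```r
library(iterators)
library(itertools)

# Wrap a plain vector iterator so it supports hasNext()
it <- ihasNext(iter(c(10, 20, 30)))
while (hasNext(it)) {
  print(nextElem(it))  # 10, then 20, then 30
}
```

The same wrapping is used below, which is why fstream only needs to supply a nextElem that signals exhaustion.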


require('iterators')
require('itertools')

fstream <- function(filename, format, header=TRUE) {
  handle <- file(filename)
  open(handle)   # keep the connection open so reads maintain an offset
  hdr <- NULL
  if ( header ) {
    hdr <- readLines(handle, n=1)  # consume (and stash) the header line
  }
  nxt <- function() {
    x <- scan(handle, format, nlines=1, quiet=TRUE)
    if ( length(x[[1]]) == 0 ) {
      stop('StopIteration')  # the iterators protocol signals exhaustion this way
    }
    x
  }
  obj <- list(nextElem = nxt)
  class(obj) <- c('fstream', 'abstractiter', 'iter')
  obj
}

The file() function takes a path and turns it into a connection object, and open() lets you read from it (and maintains the offset between reads). Note that function chaining (“handle <- open(file(filename))”) won’t work here, nor will open(filename). Kind of ugly stuff.
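A minimal illustration of that connection behavior (the tempfile is just for the sketch):

```r
tmp <- tempfile()                  # throwaway file for the demonstration
writeLines(c("alpha", "beta", "gamma"), tmp)

con <- file(tmp)                   # file() only builds the connection object
open(con)                          # open() makes it readable and tracks the offset
first  <- readLines(con, n = 1)   # "alpha"
second <- readLines(con, n = 1)   # "beta" -- the offset advanced
close(con)
print(c(first, second))
```

Each readLines call picks up where the last one left off, which is exactly what the nxt closure above relies on.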

The next part is to write the merging algorithm, which assumes the files are sorted by a (specified) comparison function. I’ve used a very simple one (sorted, ascending, by the second column):


simple_compare <- function(rows) {
  row_vals <- sapply(rows, function(t) { t[[2]] })  # unlist the second column
  which.min(row_vals)  # index of the stream whose head sorts first
}

iterfiles <- function(files, format, cmp) {
  streams <- lapply(files, function(f) { ihasNext(fstream(f, format)) })
  heads <- lapply(streams, function(t) { nextElem(t) })  # one buffered record per file
  nxt <- function() {
    if ( length(heads) == 0 ) {
      return(NULL)  # every stream exhausted
    }
    best_idx <- cmp(heads)
    to_ret <- heads[[best_idx]]
    if ( hasNext(streams[[best_idx]]) ) {
      heads[[best_idx]] <<- nextElem(streams[[best_idx]])
    } else {
      # this stream is done: drop it (negative indexing covers every position,
      # including the single-stream case, where it leaves empty lists)
      streams <<- streams[-best_idx]
      heads <<- heads[-best_idx]
    }
    to_ret
  }
  nxt
}
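A quick aside on the “<<-” used inside “nxt”: it walks up the enclosing environments until it finds the variable, which is what lets a closure mutate shared state between calls; a plain “<-” would just create a local that vanishes when the call returns. The classic minimal sketch:

```r
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # reassigns 'count' in the enclosing environment
    count
  }
}
inc <- make_counter()
inc()
inc()
print(inc())  # → 3; the closure keeps its own state across calls
```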

Note the extensive use of “<<-” to reassign the streams and heads variables, which live one level above the scope of the “nxt” function. Putting this all together:


write.table(file='.foo1', x=sort(runif(1000)))
write.table(file='.foo2', x=sort(runif(1000)))
write.table(file='.foo3', x=sort(runif(1000)))

nxtline <- iterfiles(c('.foo1', '.foo2', '.foo3'), format=list(character(), double()), cmp=simple_compare)

line <- nxtline()
merged_pvals <- c(NA, line[[2]])
line <- nxtline()
while ( ! is.null(line) ) {
  merged_pvals <- c(merged_pvals, line[[2]])
  line <- nxtline()
}

This produces the sorted, merged list of p-values, as expected. The idea, of course, is that there are ~300,000,000 such values in total, and it’s only worthwhile plotting certain ones. That can be done with something like:


KEEP_ALL_BEFORE <- 100   # keep every record among the first 100 ranks
SAMPLING_POWER <- 1.1    # thereafter, grow the skip geometrically
skip <- 4
rank <- 1

fields <- nxtline()      # here the real records carry the p-value in column 5
ranks <- c(NA, 1)
pvals <- c(NA, fields[[5]])

fields <- nxtline()
rank <- rank + 1
while ( ! is.null(fields) ) {
  if ( rank <= KEEP_ALL_BEFORE ) {
    ranks <- c(ranks, rank)
    pvals <- c(pvals, fields[[5]])
  } else {
    if ( rank %% skip == 1 ) {
      print(sprintf('%d  %e', rank, fields[[5]]))
      ranks <- c(ranks, rank)
      pvals <- c(pvals, fields[[5]])
      skip <- 1 + round(skip * SAMPLING_POWER)
    }
  }
  fields <- nxtline()
  rank <- rank + 1
}

and of course the expected p-value will just be “ranks/rank”, since rank will, in the very end, equal the total number of data points.
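Those thinned (ranks, pvals) pairs are exactly what a QQ plot needs. A sketch, with synthetic data standing in for the real merged output (the values below are made up for illustration):

```r
set.seed(1)
# Hypothetical stand-ins for the loop's results: 'rank' ends as the total
# record count; 'ranks'/'pvals' are the thinned sample (leading NA dropped).
rank  <- 10000
ranks <- c(1:100, seq(101, rank, by = 97))
pvals <- sort(runif(rank))[ranks]

expected <- -log10(ranks / rank)   # uniform quantiles under the null
observed <- -log10(pvals)
plot(expected, observed,
     xlab = "-log10(expected)", ylab = "-log10(observed)")
abline(0, 1, col = "red")          # null line: observed == expected
```

Because the skip grows geometrically, the sample stays dense in the interesting small-p tail (top-left of the plot) while keeping the point count manageable.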
