FAQ
Author: Rares Vernica <rares (at) ics.uci.edu>
1 What should I do if I get java.lang.OutOfMemoryError: Java heap space in the Map phase of Stage 2, Kernel (ridpairsimproved or ridpairsppjoin)?
Stage 1, Token Ordering (tokensbasic or tokensimproved), produces a list of unique tokens that is loaded into memory by Stage 2. The list is written to the tokens.n directory in HDFS. The OutOfMemoryError may occur because this list of tokens does not fit into memory.
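One quick way to gauge whether the token list is the culprit is to compare its size in HDFS with the heap available to the map tasks. A minimal sketch (tokens.n stands for the Stage 1 output directory mentioned above; substitute the actual path used by your job):

  # Report the size of the unique-token output produced by Stage 1.
  # "tokens.n" is a placeholder for the actual directory name in your setup.
  hadoop fs -du tokens.n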
The first thing you should check is whether you are using the right tokenizer for your data. For example, if each join-field value is a list of words, then the word tokenizer is appropriate. If, instead, each join-field value is a contiguous string of characters, then an n-gram tokenizer might be appropriate. The tokenizer can be specified on the command line with the -Dfuzzyjoin.tokenizer= option or in the XML file passed with the -conf option. For more details, please see:
hadoop/fuzzyjoin/resources/conf/fuzzyjoin/default.xml
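As a rough sketch, the property can be set in either of these two ways. The jar name, driver argument, configuration file name, and tokenizer value below are placeholders, not the project's actual command line; only the -D/-conf flags and the fuzzyjoin.tokenizer property name come from this FAQ, and default.xml lists the accepted values.

  # Hypothetical invocation: set the tokenizer as a command-line property.
  hadoop jar fuzzyjoin.jar ridpairsimproved \
      -Dfuzzyjoin.tokenizer=word \
      <other job arguments>

  # Alternatively, keep the setting in an XML configuration file and pass
  # the file to the job with -conf:
  #   <property>
  #     <name>fuzzyjoin.tokenizer</name>
  #     <value>word</value>
  #   </property>
  hadoop jar fuzzyjoin.jar ridpairsimproved -conf my-fuzzyjoin-conf.xml \
      <other job arguments>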
2 Where can I get more help?
Please email Rares Vernica <rares (at) ics.uci.edu> with any questions you might have.
Date: 2010-04-23 17:00:41 PDT