FAQ

Author: Rares Vernica <rares (at) ics.uci.edu>

1 What should I do if I get java.lang.OutOfMemoryError: Java heap space in the Map phase of Stage 2, Kernel (ridpairsimproved or ridpairsppjoin)?

Stage 1, Token Ordering (tokensbasic or tokensimproved), produces a list of unique tokens that Stage 2 loads into memory. The list is written to the tokens.n directory in HDFS. A likely cause of the OutOfMemoryError is that the list of tokens does not fit into the available Java heap space.

The first thing to check is whether you are using the right tokenizer for your data. If each join field value is a list of words, the word tokenizer is appropriate. If instead each join field value is a contiguous string of characters, an n-gram tokenizer might be appropriate. The tokenizer can be specified on the command line with the -Dfuzzyjoin.tokenizer= option, or in the XML file passed with the -conf option. For more details, please see:

hadoop/fuzzyjoin/resources/conf/fuzzyjoin/default.xml
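
To see why the tokenizer choice matters for memory, the following minimal sketch counts the unique tokens that two simple tokenizers produce for the same records. It is an illustration only: word_tokens, ngram_tokens, unique_token_count and the sample records are hypothetical and are not the fuzzyjoin tokenizers, whose exact behavior may differ (for example in n-gram length or padding).

```python
def word_tokens(value):
    """Hypothetical word tokenizer: lowercase, split on whitespace."""
    return value.lower().split()


def ngram_tokens(value, n=3):
    """Hypothetical n-gram tokenizer: sliding window of n characters."""
    text = value.lower()
    if len(text) <= n:
        # Values no longer than n yield a single token.
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def unique_token_count(values, tokenizer):
    """Size of the unique-token list a tokenizer yields over all values."""
    seen = set()
    for value in values:
        seen.update(tokenizer(value))
    return len(seen)


# Made-up sample join field values, for illustration only.
records = [
    "efficient parallel set similarity joins",
    "parallel set similarity joins using mapreduce",
]

print("word tokens:  ", unique_token_count(records, word_tokens))
print("3-gram tokens:", unique_token_count(records, ngram_tokens))
```

Running a comparison like this on a sample of your own join field values gives a rough idea of how large the unique-token list becomes under each tokenizer before you run Stage 1 on the full data set.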

2 Where can I get more help?

Please email Rares Vernica <rares (at) ics.uci.edu> with any questions you might have.

Date: 2010-04-23 17:00:41 PDT