
FAQ

Author: Rares Vernica <rares (at) ics.uci.edu>

1 Copyright

Copyright 2010-2011 The Regents of the University of California

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

2 What should I do if I get java.lang.OutOfMemoryError: Java heap space in the Map phase of Stage 2, Kernel (ridpairsimproved or ridpairsppjoin)?

Stage 1, Token Ordering (tokensbasic or tokensimproved), produces a list of unique tokens that Stage 2 loads into memory. The list is written to the tokens.n directory in HDFS. An OutOfMemoryError in Stage 2 may mean that this list of tokens does not fit into memory.

The first thing to check is whether you are using the right tokenizer for your data. For example, if each join field value is a list of words, then the word tokenizer is appropriate. If instead each join field value is a contiguous string of characters, then an n-gram tokenizer might be appropriate. The tokenizer can be specified on the command line with the -Dfuzzyjoin.tokenizer= option or in the XML file specified with the -conf option. For more details please see:

hadoop/fuzzyjoin/resources/conf/fuzzyjoin/default.xml
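
To see why the tokenizer choice affects the size of the token list, the following Python sketch counts unique tokens for the same data under two tokenization schemes. It is only an illustration and not the fuzzyjoin implementation: the function names (word_tokens, ngram_tokens, unique_token_count), the sample data, and the n-gram length of 3 are all assumptions made for this example.

```python
from itertools import product


def word_tokens(value):
    """Split a join field value on whitespace (hypothetical word tokenizer)."""
    return value.split()


def ngram_tokens(value, n=3):
    """Slide a window of n characters over the value (hypothetical n-gram tokenizer)."""
    if len(value) <= n:
        return [value]
    return [value[i:i + n] for i in range(len(value) - n + 1)]


def unique_token_count(values, tokenizer):
    """Count the distinct tokens a tokenizer produces over all values."""
    unique = set()
    for value in values:
        unique.update(tokenizer(value))
    return len(unique)


# Assumed sample data: every 5-character string over the alphabet ACGT,
# i.e. contiguous strings with no whitespace (4**5 = 1024 records).
values = ["".join(chars) for chars in product("ACGT", repeat=5)]

# With a word tokenizer each record is a single, distinct token.
print(unique_token_count(values, word_tokens))   # 1024

# With 3-grams the token list is bounded by 4**3 possible tokens.
print(unique_token_count(values, ngram_tokens))  # 64
```

In this toy data set the word tokenizer yields one unique token per record, so the token list grows with the data, while the n-gram token list stays bounded by the alphabet size. Real data will behave differently, but the same comparison can hint at whether the tokenizer is the cause of the memory problem.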

3 Where can I get more help?

Please email Rares Vernica <rares (at) ics.uci.edu> with any questions you might have.

Date: 2011-04-12 09:58:19 PDT
