FAQ
Author: Rares Vernica <rares (at) ics.uci.edu>
1 Copyright
Copyright 2010-2011 The Regents of the University of California
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
2 What should I do if I get java.lang.OutOfMemoryError: Java heap space in the Map phase of Stage 2, Kernel (ridpairsimproved or ridpairsppjoin)?
Stage 1, Token Ordering (tokensbasic or tokensimproved), produces a list of unique tokens that Stage 2 loads into memory. The list is written to the tokens.n directory in HDFS. The OutOfMemoryError may occur because the list of tokens does not fit into memory.
The first thing to check is whether you are using the right tokenizer for your data. For example, if each join field value is a list of words, then the word tokenizer would be appropriate. Otherwise, if each join field value is a contiguous string of characters, then an n-gram tokenizer might be appropriate. The tokenizer can be specified on the command line with the -Dfuzzyjoin.tokenizer= option or in the XML file specified with the -conf option. For more details, please see:
hadoop/fuzzyjoin/resources/conf/fuzzyjoin/default.xml
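To get a feel for how the tokenizer choice changes the list of unique tokens, the following Python sketch tokenizes a small sample of join field values both ways and counts the distinct tokens. It is only an illustration: word_tokens, ngram_tokens, unique_tokens, the n-gram length of 3, and the sample titles are all hypothetical and are not the fuzzyjoin tokenizers themselves.

```python
def word_tokens(value):
    """Split a field value into lowercase words on whitespace (illustrative only)."""
    return value.lower().split()


def ngram_tokens(value, n=3):
    """Return the overlapping character n-grams of a field value (illustrative only)."""
    text = value.lower()
    if len(text) <= n:
        # Values shorter than n yield a single token: the value itself.
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def unique_tokens(values, tokenizer):
    """Collect the set of distinct tokens over all field values."""
    tokens = set()
    for value in values:
        tokens.update(tokenizer(value))
    return tokens


# Hypothetical sample of join field values.
titles = [
    "efficient parallel set similarity joins",
    "parallel set similarity joins using mapreduce",
]

print("unique word tokens:  ", len(unique_tokens(titles, word_tokens)))
print("unique 3-gram tokens:", len(unique_tokens(titles, ngram_tokens)))
```

Running the same comparison on a sample of your own records may help you judge which tokenizer matches your data before rerunning Stage 1.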
3 Where can I get more help?
Please email Rares Vernica <rares (at) ics.uci.edu> with any questions you might have.
Date: 2011-04-12 09:58:19 PDT