README
Author: Rares Vernica <rares (at) ics.uci.edu>
Table of Contents
1 Copyright
Copyright 2010-2011 The Regents of the University of California
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS"; BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
2 Overview
This guide describes how to use the source code developed for the study in:
Efficient Parallel Set-Similarity Joins Using MapReduce. Rares Vernica, Michael J. Carey, Chen Li SIGMOD 2010
3 Quick Start
The only requirement for running the code is a Hadoop cluster. It does not have to be a full-fledged cluster, a single-node pseudo-distributed installation of Hadoop is enough. For more details about starting a Hadoop cluster please see http://hadoop.apache.org/common/docs/current/quickstart.html The code works with Hadoop version 0.17 or higher.
3.1 Build
$ cd fuzzyjoin-hadoop fuzzyjoin-hadoop$ ant
3.2 Self-join
Here are the steps to perform a self-join on a small sample of the DBLP dataset. We use 100 DBLP entries, title and authors as the join attributes, Jaccard similarity and a 0.5 similarity threshold.
3.2.1 Upload raw data
fuzzyjoin-hadoop$ hadoop fs -put \ ../data/dblp-small/raw-000 dblp-small/raw-000
The file dblp-small.raw.txt contains one record per line. On each
line the fields are separated by ":" and contain DBLP id,
publication title, authors (concatenated with " ") and other
information available about the publication (concatenated with " ").
3.2.2 Generate records
fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ recordbuild -conf src/main/resources/fuzzyjoin/dblp.quickstart.xml
This job assigns unique record-IDs to each record. The RIDs are integers and are appended in front of each record. After this job, each record contains five fields: RID, DBLP id, title, authors, other information.
3.2.3 Balance records across nodes
fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ recordbalance -conf src/main/resources/fuzzyjoin/dblp.quickstart.xml
To skip this step, run:
fuzzyjoin-hadoop$ hadoop fs -mv \ dblp-small/recordsbulk-000 dblp-small/records-000
3.2.4 Run set-similarity self-join
fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ fuzzyjoin -conf src/main/resources/fuzzyjoin/dblp.quickstart.xml
This will run the three stages required to do fuzzy joins: token ordering (Tokens), kernel (RIDPairs), and record join (RecordPairs). It will use the basic alternative for each stage. In total it will run five Hadoop jobs (TokensBasic.phase1, TokenBasic.phase2, RIDPairsImproved, RecordPairsBasic.phase1, RecordPairsBasic.phase2).
Each stage can be run separately using different alternatives by
replacing fuzzyjoin in the above command with the name of the stage
and the alternative. For example, to run the one-phase token ordering
(TokensImproved), type:
fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ tokensimproved -conf src/main/resources/fuzzyjoin/dblp.quickstart.xml
To get the list with all the available stages and alternatives, type:
fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar
To see the results, type:
fuzzyjoin-hadoop$ hadoop fs -cat "dblp-small/recordpairs-000/part-*"
Each line contains a pair of records that fuzzy join and their
similarity. The format of the line is record 1;threshold;record2,
where record1 and record2 have the same format as described in
step 3.
3.3 R-S join
Here are the steps to perform a join between a small sample of the DBLP dataset and a small sample of the CITESEERX dataset. We use 100 DBLP entries and 100 CITESEERX entries, title and authors as the join attributes, Jaccard similarity and a 0.5 similarity threshold.
3.3.1 Upload raw data
fuzzyjoin-hadoop$ hadoop fs -put \ ../data/pub-small/raw.dblp-000 pub-small/raw.dblp-000 fuzzyjoin-hadoop$ hadoop fs -put \ ../data/pub-small/raw.csx-000 pub-small/raw.csx-000
The raw directory contains two files, one for each dataset.
3.3.2 Generate records
fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ recordbuild -conf src/main/resources/fuzzyjoin/pub.quickstart.xml \ -Dfuzzyjoin.data.suffix.input=dblp fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ recordbuild -conf src/main/resources/fuzzyjoin/pub.quickstart.xml \ -Dfuzzyjoin.data.suffix.input=csx
Each job generates records for one of the datasets.
3.3.3 Balance records across nodes
fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ recordbalance -conf src/main/resources/fuzzyjoin/pub.quickstart.xml \ -Dfuzzyjoin.data.suffix.input=dblp fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ recordbalance -conf src/main/resources/fuzzyjoin/pub.quickstart.xml \ -Dfuzzyjoin.data.suffix.input=csx
To skip this step, run:
fuzzyjoin-hadoop$ hadoop fs -mv \ pub-small/recordsbulk.dblp-000 pub-small/records.dblp-000 fuzzyjoin-hadoop$ hadoop fs -mv \ pub-small/recordsbulk.csx-000 pub-small/records.csx-000
3.3.4 Run set-similarity join
fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ fuzzyjoin -conf src/main/resources/fuzzyjoin/pub.quickstart.xml
To see the results, type:
fuzzyjoin-hadoop$ hadoop fs -cat "pub-small/recordpairs-000/part-*"
Each line contains a pair of records that fuzzy join and their
similarity. The format of the line is
record-DBLP;threshold;record-CITESEERX, where record-DBLP and
record-CITESEERX have the same format as described in the self-join
case.
4 Configuration
The XML files provided with the -conf argument above contain various
configuration parameters. Using the configuration parameters, a user
can specify the location of the data, the similarity function and
threshold, the join attributes and other settings. Moreover the user
can specify additional parameters in the command line using the -D
option.
The default parameters and more details about each parameter are in:
fuzzyjoin-hadoop/src/main/resources/fuzzyjoin/default.xml
All these parameters and other constants are defined in:
fuzzyjoin-core/src/main/java/edu/uci/ics/fuzzyjoin/FuzzyJoinConfig.java fuzzyjoin-hadoop/src/main/java/edu/uci/ics/fuzzyjoin/hadoop/FuzzyJoinDriver.java
5 Directory Structure and Tasks
The following directory structure is used for self-joins:
| |- raw-000 |- recordsbulk-000 |- recordsbulk-001 |- ... |- records-000 |- records-001 |- ... |- tokens-000 |- ... |- tokens.phase1-000 |- ... |- ridpairs-000 |- ... |- recordpairs-000 |- ... |- recordpairs.phase1-000 |- ...
The raw-000 directory contains the original files, one record per
line. The recordsbulk directory contains the original data where
each record starts with an integer RID. The number after the directory
name represents the copy number (000 is the original data, 001 is
the first copy, etc.). The records directory contains the same data
as the recordsbulk directory except that multiple copies are
aggregated and the data is balanced across nodes. The number after the
directory name represents how many copies are aggregated (000 is of
only one copy: recordsbulk-000, 001 is for two copies:
recordsbulk-000 and recordsbulk-001, etc.). So records-n
represents an increased dataset, where n denotes how many times the
dataset was increased. For the rest of the directories the number
after the directory name has the same meaning. The tokens directory
contains the list of tokens. The ridpairs directory contains the RID
pairs that fuzzy-join. The recordpairs directory contains the record
pairs that fuzzy-join. The phase1 prefix that appears for some
directories represents the output of the first MapReduce job for the
tasks with two MapReduce jobs (i.e., tokensbasic and
recordpairsbasic).
Bellow is a table with each task input and output directories:
| Task | Input | Output |
|---|---|---|
| recordbuild | raw | recordsbulk |
| recordbalance | recordsbulk | records |
| tokens basic/improved | records | tokens |
| ridpairs improved/ppjoin | records, tokens | ridpairs |
| recordpairs basic/improved | records, ridpairs | recordpairs |
| recordgenerate | recordsbulk-000, tokens-000 | recordsbulk |
For R-S joins, the first few directories also carry the name of the dataset (name of the R dataset or of the S dataset) in order to differentiate between them:
| |- raw.DATASET_R-000 |- raw.DATASET_S-000 |- recordsbulk.DATASET_R-000 |- recordsbulk.DATASET_R-001 |- ... |- recordsbulk.DATASET_S-000 |- recordsbulk.DATASET_S-001 |- ... |- records.DATASET_R-000 |- records.DATASET_R-001 |- ... |- records.DATASET_S-000 |- records.DATASET_S-001 |- ...
where DATASET_R and DATASET_S are the names of the two
datasets. In our R-S join example we used dblp for DATASET_R and
csx for DATASET_S.
6 Dataset
By default the dataset is assumed to have one record per line. The
fields of each record are delimited by ":". The first filed of each
record is an integer RID. This settings can be changed in:
fuzzyjoin-core/src/main/java/edu/uci/ics/fuzzyjoin/FuzzyJoinConfig.java
The dataset can be increased using the recordgenerate task:
fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ recordgenerate -conf src/main/resources/fuzzyjoin/dblp.quickstart.xml \ -Dfuzzyjoin.data.copy=10 \ -Dfuzzyjoin.data.norecords=100
This stats 9 MapReduce jobs, each of them generating a new copy of
the dataset. The fuzzyjoin.data.copy parameter specifies the number
of times the dataset should be increased, while the
fuzzyjoin.data.norecords parameter specifies the number of records
in the original dataset (it is used to generate unique and
increasing RIDs). All the following tasks also need to have the same
value for the fuzzyjoin.data.copy parameter in order to use the
increased dataset. This task can only be ran after running
recordbuild and tokensbasic or tokensimproved on the original
dataset. After this task, the recordbuild task needs to be ran (it
cannot be skipped on the increased dataset):
fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ recordbalance -conf src/main/resources/fuzzyjoin/dblp.quickstart.xml \ -Dfuzzyjoin.data.copy=10 fuzzyjoin-hadoop$ hadoop jar target/fuzzyjoin-hadoop-0.0.2.jar \ fuzzyjoin -conf src/main/resources/fuzzyjoin/dblp.quickstart.xml \ -Dfuzzyjoin.data.copy=10
7 Source Code Overview
The source code is divided into two modules:
-
fuzzyjoin-core: general fuzzy-join code infuzzyjoin-core/src/main/java-
edu.uci.ics.fuzzyjoin: main memory fuzzy-join -
edu.uci.ics.fuzzyjoin.similarity: similarity functions and filters -
edu.uci.ics.fuzzyjoin.invertedlist: inverted lists index -
edu.uci.ics.fuzzyjoin.recordgroup: alternatives for grouping records -
edu.uci.ics.fuzzyjoin.tokenizer: tokenizes -
edu.uci.ics.fuzzyjoin.tokenorder: alternatives for ordering tokens
-
-
fuzzyjoin-hadoop: Hadoop specific fuzzy-join code infuzzyjoin-hadoop/src/main/java-
edu.uci.ics.fuzzyjoin.hadoop: main program -
edu.uci.ics.fuzzyjoin.hadoop.datagen: classes for building records and increasing dataset size -
edu.uci.ics.fuzzyjoin.hadoop.recordpairs: Stage 3 -
edu.uci.ics.fuzzyjoin.hadoop.ridpairs: Stage 2 -
edu.uci.ics.fuzzyjoin.hadoop.ridrecordpairs: alternative to Stage 2 and 3 where records are not projected -
edu.uci.ics.fuzzyjoin.hadoop.tokens: Stage 1
-
Date: 2011-04-12 09:58:14 PDT
HTML generated by org-mode 7.4 in emacs 24