README
1 Overview
This guide describes how to use the source code developed for the study in:
Efficient Parallel Set-Similarity Joins Using MapReduce. Rares Vernica, Michael J. Carey, Chen Li. SIGMOD 2010.
2 Quick start
The only requirement for running the code is a Hadoop cluster. It does not have to be a full-fledged cluster; a single-node, pseudo-distributed installation of Hadoop is enough. For more details about starting a Hadoop cluster, see http://hadoop.apache.org/common/docs/current/quickstart.html. The code has been compiled against Hadoop-0.17.2.1.
2.1 Build
$ cd hadoop/fuzzyjoin
hadoop/fuzzyjoin$ ant
2.2 Self-join
Here are the steps to perform a self-join on a small sample of the DBLP dataset. We use 80 DBLP entries, title and authors as the join attributes, Jaccard similarity and a 0.5 similarity threshold.
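For reference, the join predicate used throughout the quick start can be sketched as follows. This is an illustrative Python sketch, not the project's Java implementation, and the whitespace tokenization is a simplification:

```python
# Jaccard similarity over word tokens, as used in the quick start with a
# 0.5 threshold (simplified tokenization for illustration only).
def tokenize(text):
    return set(text.lower().split())

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

t1 = tokenize("Efficient Parallel Set-Similarity Joins Using MapReduce")
t2 = tokenize("Parallel Set-Similarity Joins Using MapReduce")
print(jaccard(t1, t2) >= 0.5)  # True: this pair passes the 0.5 threshold
```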
2.2.1 Upload raw data
hadoop/fuzzyjoin$ hadoop fs -put resources/data/dblp.small/raw dblp.small/raw
Each line in resources/data/dblp.small/raw/dblp.small.raw.txt will become one record. The fields of a line are delimited by ":" and contain the DBLP id, the publication title, the authors (concatenated with " "), and the other information available about the publication (concatenated with " ").
2.2.2 Generate records
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    recordbuild -conf resources/conf/fuzzyjoin/dblp.quickstart.xml
This job prepends a unique integer RID to each record. After this job, each record contains five fields: RID, DBLP id, title, authors, and other information.
2.2.3 Balance records across nodes
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    recordbalance -conf resources/conf/fuzzyjoin/dblp.quickstart.xml
To skip this step run:
hadoop/fuzzyjoin$ hadoop fs -mv \
    dblp.small/recordsbulk.00 dblp.small/records.1
2.2.4 Run set-similarity self-join
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    fuzzyjoin -conf resources/conf/fuzzyjoin/dblp.quickstart.xml
This will run the three stages required to perform a fuzzy join: token ordering (Tokens), kernel (RIDPairs), and record join (RecordPairs). It uses the basic alternative for each stage. In total it runs five Hadoop jobs: TokensBasic.phase1, TokensBasic.phase2, RIDPairsImproved, RecordPairsBasic.phase1, and RecordPairsBasic.phase2. (Note that, despite its name, RIDPairsImproved is in fact the basic approach.)
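The reason token ordering is a stage of its own is prefix filtering: once tokens are globally ordered (typically by increasing frequency), two records can reach a Jaccard threshold t only if their prefixes share at least one token. A minimal sketch of the prefix-length rule (illustrative Python; the project's implementation is in Java):

```python
import math

def prefix_length(size, t):
    # For Jaccard threshold t, a record with `size` tokens can only join
    # with records that share a token within its first
    # size - ceil(t * size) + 1 tokens in the global token order.
    return size - math.ceil(t * size) + 1

# A 6-token record at threshold 0.5 needs only its 4 rarest tokens indexed.
print(prefix_length(6, 0.5))  # 4
```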
Each stage can be run separately, using different alternatives, by replacing fuzzyjoin in the above command with the name of the stage and the alternative. For example, to run the one-phase token ordering (TokensImproved), type:
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    tokensimproved -conf resources/conf/fuzzyjoin/dblp.quickstart.xml
To list all the available stages and alternatives, type:
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar
To see the results type:
hadoop/fuzzyjoin$ hadoop fs -cat "dblp.small/recordpairs.1/part-*"
Each line contains a pair of records that fuzzy-join, together with their similarity. The format of the line is record1;threshold;record2, where record1 and record2 have the same format as described in step 2.2.2 (Generate records).
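A result line in this format can be pulled apart as below. This is a hypothetical Python sketch; the sample line and its field values are made up, and only the ";" and ":" delimiters come from this guide:

```python
def parse_pair(line):
    # Split a result line of the form record1;threshold;record2, then
    # split each record into its ":"-delimited fields.
    rec1, value, rec2 = line.split(";")
    return rec1.split(":"), float(value), rec2.split(":")

# Made-up sample line for illustration.
line = "1:id/a:Title A:Alice Bob:x;0.75;2:id/b:Title A:Alice:x"
rec1, sim, rec2 = parse_pair(line)
print(rec1[0], sim, rec2[0])  # 1 0.75 2
```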
2.3 R-S join
Here are the steps to perform a join between a small sample of the DBLP dataset and a small sample of the CITESEERX dataset. We use 80 DBLP entries and 80 CITESEERX entries, title and authors as the join attributes, Jaccard similarity and a 0.5 similarity threshold.
2.3.1 Upload raw data
hadoop/fuzzyjoin$ hadoop fs -put \
    resources/data/pub.small/raw pub.small/raw
The raw directory contains two files, one for each dataset.
2.3.2 Generate records
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    recordbuild -conf resources/conf/fuzzyjoin/pub.quickstart.xml \
    -Dfuzzyjoin.data.suffix.input=dblp
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    recordbuild -conf resources/conf/fuzzyjoin/pub.quickstart.xml \
    -Dfuzzyjoin.data.suffix.input=csx
Each job generates records for one of the datasets.
2.3.3 Balance records across nodes
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    recordbalance -conf resources/conf/fuzzyjoin/pub.quickstart.xml \
    -Dfuzzyjoin.data.suffix.input=dblp
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    recordbalance -conf resources/conf/fuzzyjoin/pub.quickstart.xml \
    -Dfuzzyjoin.data.suffix.input=csx
To skip this step run:
hadoop/fuzzyjoin$ hadoop fs -mv \
    pub.small/recordsbulk.dblp.00 pub.small/records.dblp.1
hadoop/fuzzyjoin$ hadoop fs -mv \
    pub.small/recordsbulk.csx.00 pub.small/records.csx.1
2.3.4 Run set-similarity join
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    fuzzyjoin -conf resources/conf/fuzzyjoin/pub.quickstart.xml
To see the results type:
hadoop/fuzzyjoin$ hadoop fs -cat "pub.small/recordpairs.1/part-*"
Each line contains a pair of records that fuzzy-join, together with their similarity. The format of the line is record-DBLP;threshold;record-CITESEERX, where record-DBLP and record-CITESEERX have the same format as described in the self-join case.
3 Configuration
The XML files provided with the -conf argument above contain various configuration parameters. Using these parameters, a user can specify the location of the data, the similarity function and threshold, the join attributes, and other settings. Moreover, the user can specify additional parameters on the command line using the -D option.
The default parameters and more details about each parameter are in:
hadoop/fuzzyjoin/resources/conf/fuzzyjoin/default.xml
All these parameters and other constants are defined in:
fuzzyjoin/src/edu/uci/ics/fuzzyjoin/Config.java
hadoop/fuzzyjoin/src/edu/uci/ics/hadoop/fuzzyjoin/FuzzyJoinDriver.java
4 Directory structure and tasks
The following directory structure is used:
|- raw
|- recordsbulk.00
|- recordsbulk.01
|- ...
|- records.1
|- records.2
|- ...
|- tokens.1
|- tokens.1.phase1
|- ...
|- ridpairs.1
|- ...
|- recordpairs.1
|- recordpairs.1.phase1
|- ...
The raw directory contains the original files, one record per line. The recordsbulk directory contains the original data where each record starts with an integer RID. The number after the directory name represents the copy number (00 is the original data, 01 is the first copy, etc.). The records directory contains the same data as recordsbulk, except that multiple copies are aggregated and the data is balanced across nodes. The number after the directory name represents how many copies are aggregated (1 is for only one copy, recordsbulk.00; 2 is for two copies, recordsbulk.00 and recordsbulk.01; etc.). So records.n represents an increased dataset, where n denotes how many times the dataset was increased. For the rest of the directories, the number after the directory name has the same meaning. The tokens directory contains the list of tokens. The ridpairs directory contains the RID pairs that fuzzy-join. The recordpairs directory contains the record pairs that fuzzy-join. The phase1 suffix that appears on some directories marks the output of the first MapReduce job for the tasks that consist of two MapReduce jobs (i.e., tokensbasic and recordpairsbasic).
Below is a table with each task's input and output directories:
| Task                      | Input                    | Output      |
|---------------------------|--------------------------|-------------|
| recordbuild               | raw                      | recordsbulk |
| recordbalance             | recordsbulk              | records     |
| tokensbasic/improved      | records                  | tokens      |
| ridpairsimproved/ppjoin   | records, tokens          | ridpairs    |
| recordpairsbasic/improved | records, ridpairs        | recordpairs |
| recordgenerate            | recordsbulk.00, tokens.1 | recordsbulk |
5 Dataset
By default, the dataset is assumed to have one record per line. The fields of each record are delimited by ":". The first field of each record is an integer RID. These settings can be changed in:
fuzzyjoin/src/edu/uci/ics/fuzzyjoin/Config.java
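Under the default settings, a record line can be read as follows. This is an illustrative Python sketch, and the sample record is made up:

```python
def parse_record(line):
    # One record per line; fields delimited by ":"; the first field is an
    # integer RID (per the default settings described above).
    fields = line.rstrip("\n").split(":")
    return int(fields[0]), fields[1:]

rid, fields = parse_record("42:conf/sigmod/x:Some Title:Some Authors")
print(rid, fields[1])  # 42 Some Title
```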
The dataset can be increased using the recordgenerate
task:
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    recordgenerate -conf resources/conf/fuzzyjoin/dblp.quickstart.xml \
    -Dfuzzyjoin.data.copy=10 \
    -Dfuzzyjoin.data.norecords=80
The fuzzyjoin.data.copy parameter specifies the number of times the dataset should be increased, while the fuzzyjoin.data.norecords parameter specifies the number of records in the original dataset (it is used to generate unique and increasing RIDs). All the following tasks also need to use the same value for the fuzzyjoin.data.copy parameter in order to use the increased dataset. This task is run after running recordbuild and tokensbasic or tokensimproved on the original data. After this task, the recordbalance task needs to be run (it cannot be skipped on the increased dataset):
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    recordbalance -conf resources/conf/fuzzyjoin/dblp.quickstart.xml \
    -Dfuzzyjoin.data.copy=10
hadoop/fuzzyjoin$ hadoop jar build/jar/fuzzyjoin-hadoop-1.0.0.jar \
    fuzzyjoin -conf resources/conf/fuzzyjoin/dblp.quickstart.xml \
    -Dfuzzyjoin.data.copy=10
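One plausible way the norecords value can be used to keep RIDs unique across copies is a per-copy offset. Note that this exact scheme is an assumption for illustration; the actual logic lives in the datagen classes:

```python
def copy_rid(rid, copy_index, norecords):
    # Assumed scheme (not verified against the source): offset each copy's
    # RIDs by the original dataset size, so RIDs stay unique and
    # increasing across copies.
    return copy_index * norecords + rid

# With 80 records: copy 0 keeps RIDs 0..79, copy 1 uses 80..159, etc.
print(copy_rid(5, 2, 80))  # 165
```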
6 Source Code Overview
The source code is divided into two modules:
- fuzzyjoin: general fuzzy-join code
  - edu.uci.ics.fuzzyjoin: main-memory fuzzy-join
  - edu.uci.ics.fuzzyjoin.similarity: similarity functions and filters
  - edu.uci.ics.fuzzyjoin.invertedlist: inverted-list index
  - edu.uci.ics.fuzzyjoin.recordgroup: alternatives for grouping records
  - edu.uci.ics.fuzzyjoin.tokenizer: tokenizers
  - edu.uci.ics.fuzzyjoin.tokenorder: alternatives for ordering tokens
- hadoop/fuzzyjoin: Hadoop-specific fuzzy-join code
  - edu.uci.ics.hadoop.fuzzyjoin: main program
  - edu.uci.ics.hadoop.fuzzyjoin.datagen: classes for building records and increasing dataset size
  - edu.uci.ics.hadoop.fuzzyjoin.recordpairs: Stage 3
  - edu.uci.ics.hadoop.fuzzyjoin.ridpairs: Stage 2
  - edu.uci.ics.hadoop.fuzzyjoin.ridrecordpairs: alternative to Stages 2 and 3 where records are not projected
  - edu.uci.ics.hadoop.fuzzyjoin.tokens: Stage 1
Date: 2010-03-24 15:07:06 PDT