MIR-Final-Proj-V1.1.ppt

Notes on the Final Project of the
MIR Course
Part I: Crawling Phase
Modern Information Retrieval Course,
Semantic Web Research Laboratory
Crawling Phase

Crawling the Dmoz directory


It has a taxonomic (tree-like) structure
Each subdirectory is crawled by one group
Crawling Phase

This tree-like structure has two important components:


Internal Nodes (also known as “topics”)
Leaves (also known as “pages”)
[Figure: tree diagram with nodes labeled "Topics" and "Pages"]
Crawling Phase

Each topic has:

a list of children (subtopics)
a unique path to the root node (supertopics)
a description
a list of related pages

And each page has:

a topic
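
The topic and page structure described above can be sketched as two small record types. This is only an illustration: the class names, field names, and helper methods below are assumptions and are not prescribed by the project.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Page:
    number: str                       # document identifier (format is an assumption)
    url: str
    title: str
    topic: Optional["Topic"] = field(default=None, repr=False)


@dataclass(eq=False)
class Topic:
    number: int
    name: str                         # full name, e.g. "Top/Science/Agriculture"
    description: str = ""             # may be a zero-length string
    parent: Optional["Topic"] = field(default=None, repr=False)
    children: List["Topic"] = field(default_factory=list, repr=False)
    pages: List[Page] = field(default_factory=list, repr=False)

    def add_child(self, child: "Topic") -> "Topic":
        """Attach a subtopic and record this topic as its parent."""
        child.parent = self
        self.children.append(child)
        return child

    def add_page(self, page: Page) -> Page:
        """Attach a related page (a leaf) to this topic."""
        page.topic = self
        self.pages.append(page)
        return page

    def path_to_root(self) -> List["Topic"]:
        """Return the supertopics, from the direct parent up to the root."""
        path = []
        node = self.parent
        while node is not None:
            path.append(node)
            node = node.parent
        return path
```

Because every topic stores a single parent, the path to the root is unique, which matches the tree-like structure of the directory.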
Crawling Phase

Each topic has some characteristics:

[Figure: a topic with the following parts labeled]
The current topic (node)
Description of the current topic
List of supertopics
List of subtopics
List of related pages (leaves)
Crawling Phase

Deliverables for the first phase:

TopicNames.txt

Each line contains a topic number and the full name of that topic,
separated by a tab character (e.g. 46 Top/Science/Agriculture)

TopicDescs.txt

Each line contains a topic number and the description of that
topic, separated by a tab character. For some topics, the
description is a zero-length string.

TopicHierarchy.txt

Each line contains a pair of topic numbers (separated by a tab
character). The first of these two topics is the parent of the second
topic. Each topic has exactly one parent, except for the root (topic
0), which has no parent.
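
As a minimal sketch of how the three topic files could be produced, the code below formats the lines from a list of topic tuples. The tuple layout, the function names, the UTF-8 encoding, and the whitespace cleanup in descriptions are all assumptions made for illustration; the project only fixes the tab-separated line format.

```python
import os


def clean(text):
    """Collapse tabs and newlines so each record stays on one line (an assumption)."""
    return " ".join(text.split())


def topic_file_lines(topics):
    """Build the lines of the three topic files.

    topics: iterable of (number, full_name, description, parent_number),
            with parent_number set to None for the root.
    """
    names, descs, hierarchy = [], [], []
    for number, name, desc, parent in topics:
        names.append(f"{number}\t{name}")
        descs.append(f"{number}\t{clean(desc)}")   # description may be empty
        if parent is not None:                      # the root has no parent line
            hierarchy.append(f"{parent}\t{number}")
    return {
        "TopicNames.txt": names,
        "TopicDescs.txt": descs,
        "TopicHierarchy.txt": hierarchy,
    }


def write_files(files, out_dir):
    """Write each list of lines to its file inside out_dir."""
    for filename, lines in files.items():
        path = os.path.join(out_dir, filename)
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in lines)
```

A call such as `write_files(topic_file_lines(topics), "output")` would then create the three files, assuming the `output` directory already exists.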
Crawling Phase

Deliverables for the first phase:

DocUrls.txt

Each line contains a document number and its URL, separated by
a tab character

DocTitles.txt

Each line contains a document number and its title, separated by
a tab character

DocTopics.txt

Each line contains a document number and a topic number,
separated by a tab character. This indicates that the document
belongs to the given topic.
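
The document files share the same two-column layout, so one small reader can serve all three. The sketch below is only an illustration: the function names are hypothetical, and identifiers are kept as strings because the exact form of a document number is an assumption here.

```python
from collections import defaultdict


def parse_pairs(lines):
    """Yield (identifier, value) pairs from tab-separated lines.

    Only the first tab splits the line, so a value may itself contain tabs.
    Blank lines are skipped.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        identifier, _, value = line.partition("\t")
        yield identifier, value


def docs_by_topic(doc_topics_lines):
    """Group document identifiers by topic, given the lines of DocTopics.txt."""
    groups = defaultdict(list)
    for doc, topic in parse_pairs(doc_topics_lines):
        groups[topic].append(doc)
    return dict(groups)
```

An open file can be passed directly, for example `docs_by_topic(open("DocTopics.txt", encoding="utf-8"))`, since iterating over a file yields its lines.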
Crawling Phase

Deliverables for the first phase:

Documents.zip

The contents of the documents, stored separately

Samples of each output file (for the “Science” subdirectory) have been
added to the Assignments page
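
One possible way to build such an archive is sketched below with Python's standard `zipfile` module. The layout inside the archive is an assumption: one entry per document, named after its document identifier with a hypothetical `.html` extension. The samples on the Assignments page remain the reference for the expected layout.

```python
import io
import zipfile


def documents_zip_bytes(documents):
    """Return the bytes of a zip archive holding one entry per document.

    documents: dict mapping a document identifier to its content (str or bytes).
    The entry name "<identifier>.html" is an assumption made for this sketch.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for doc_id, content in sorted(documents.items()):
            # writestr encodes str content as UTF-8 before storing it
            archive.writestr(f"{doc_id}.html", content)
    return buffer.getvalue()
```

The returned bytes can be written to disk with `open("Documents.zip", "wb").write(data)`.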
Crawling Phase

Naming convention:

Names in each subdirectory start with a special character:

Subdirectory      Char
Arts              A
Business          B
Computers         C
Games             D
Health            E
Home              F
Kids and Teens    G
News              H
Recreation        I
Reference         J
Regional          K
Science           L
Shopping          M
Society           N
Sports            O
Crawling Phase


Then, for each subtree, generate numeric names for the
children in BFS (breadth-first search) order.
For example, in the Science subdirectory:

[Figure: example tree with the legend "Sample Topic" and "Sample Page"; nodes are labeled 1 to 5 and L1 to L8]
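
The BFS naming step can be sketched as follows. This is a minimal illustration under several assumptions: the tree is given as a dictionary from a node to its list of children, numbering starts at 1, the root itself is not named, and the subdirectory character is simply prepended to the counter. The function name `bfs_names` and the node labels in the usage example are hypothetical.

```python
from collections import deque

# Subdirectory characters, copied from the naming convention table.
SUBDIR_CHAR = {
    "Arts": "A", "Business": "B", "Computers": "C", "Games": "D",
    "Health": "E", "Home": "F", "Kids and Teens": "G", "News": "H",
    "Recreation": "I", "Reference": "J", "Regional": "K", "Science": "L",
    "Shopping": "M", "Society": "N", "Sports": "O",
}


def bfs_names(children, root, prefix=""):
    """Name every descendant of root as prefix + counter, in BFS order.

    children: dict mapping a node to the ordered list of its children.
    The input is assumed to be a tree, so no visited set is kept.
    """
    names = {}
    counter = 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            counter += 1
            names[child] = f"{prefix}{counter}"
            queue.append(child)      # its own children are named after this level
    return names
```

For instance, with `children = {"Science": ["Agriculture", "Biology"], "Agriculture": ["Forestry"]}`, calling `bfs_names(children, "Science", SUBDIR_CHAR["Science"])` names both children of Science before Forestry, because BFS finishes one level before descending to the next.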
Crawling Phase

Assignments of subdirectories to groups:

Subdir.           Group                          Subdir.       Group
Arts              Abbasi / Kord-Zadeh            Recreation    Mirjalaali / Sayyedi
Business          Ahangaraan / Samad-Zadegan     Reference     Nokhbe-Zaeim / Tabaatabaaei
Computers         Ashraf / Rahimi (M.)           Regional      Omid / Arab
Games             Darvishi / Rahimi (A.)         Science       Qaderian / KhorramZadeh
Health            Falaki / Vaezi                 Shopping      Qazvinian / Rsoulian
Home              Fathi / Sadjadi                Society       Saremi / Mashayekhi
Kids and Teens    Iranmanesh / Takhtaai          Sports        Shafi'i-Nowroozi
News              Kazemi-Tabar / Parsa           Computers     Vafaai / Jalili