Husky Internals
CSCI5570 Large Scale Data Processing Systems
Lab 2
Building Husky
• Build Master & applications (see the examples in the examples
directory: https://github.com/huskyteam/husky/tree/master/examples)
Log in as student_number@proj10
$ cd husky && mkdir build && cd build
$ cmake ..
$ make Master
$ make PI # PI calculation
Running Husky on a single machine
• Assumes that the Master and PI executables are already built
• Run the following commands:
$ ./Master --master_host=proj10 --master_port=10086 --comm_port=12306 --worker.info=proj10:3 --serve=0 --hdfs_namenode=proj10 --hdfs_namenode_port=9000
$ ./PI --master_host=proj10 --master_port=10086 --comm_port=12306 --worker.info=proj10:3 --hdfs_namenode=proj10 --hdfs_namenode_port=9000
Output: [INFO] 3.093333
Running Husky on a single machine
• Alternatively, you can use a configuration file
instead
$ ./Master --conf /path/to/your/conf
$ ./PI --conf /path/to/your/conf
Configuration example (INI format)
# Required
master_host=proj10
master_port=10086
comm_port=12306
serve=0
hdfs_namenode=proj10
hdfs_namenode_port=9000
# Worker information
[worker]
info=proj10:3
Running in Distributed Environment
• Start the Master on the master machine
$ ./Master --conf /path/to/your/conf
• Use the exec.sh script to run the application on all machines
$ ./exec.sh PI --conf /path/to/your/conf
Remark: replace ./exec.sh with /data/opt/tmp/exec.sh
exec.sh uses pssh, so configure the master machine so that it can ssh into the other
machines without a password. See ssh-copy-id in lab note 1.
exec.sh needs a file named "machine.cfg"
See the existing file at /data/opt/tmp/machine.cfg
Or create your own file with one hostname per line:
proj5
proj6
proj7
proj8
proj9
Change the configuration file
master_host=proj10
master_port=10086
comm_port=12306
serve=0
hdfs_namenode=proj10
hdfs_namenode_port=9000
# Worker information in distributed mode
[worker]
info=proj5:10
info=proj6:10
info=proj7:10
info=proj8:10
info=proj9:10
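Each info entry above has the form host:number. As a minimal sketch, assuming the number is the count of worker threads on that host, the entries could be parsed and totalled as follows. parse_worker_info and total_workers are hypothetical helpers for illustration, not part of Husky.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Husky): split "host:threads" into its parts.
std::pair<std::string, int> parse_worker_info(const std::string& entry) {
    const std::size_t colon = entry.rfind(':');
    if (colon == std::string::npos || colon == 0 || colon + 1 == entry.size()) {
        throw std::invalid_argument("expected host:threads, got '" + entry + "'");
    }
    const int threads = std::stoi(entry.substr(colon + 1));
    if (threads <= 0) {
        throw std::invalid_argument("thread count must be positive");
    }
    return {entry.substr(0, colon), threads};
}

// Total number of worker threads described by a list of entries
// (assumption: the number after the colon is a per-host thread count).
int total_workers(const std::vector<std::string>& entries) {
    int total = 0;
    for (const auto& e : entries) {
        total += parse_worker_info(e).second;
    }
    return total;
}
```

Under that assumption, the five entries in the distributed configuration describe 5 x 10 = 50 worker threads in total.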
Husky Components
• Object List – data abstraction
• Channel – object communication
• API – how to use them
Object List
• A list that stores all the objects across the distributed
cluster
• Can be defined as you need.
For example:
vertex object list -> graph
word object list -> sentence, article, corpus, …
data feature list -> machine learning, …
Object List
class Obj {
public:
    using KeyT = int;
    KeyT key;
    const KeyT& id() const { return key; }
    explicit Obj(const KeyT& k) : key(k) {}
};
Object List
auto& objlist=
ObjListFactory::create_objlist<Obj>("my_objlist");
Obj obj(3);
objlist.add_object(obj);
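To make the idea concrete, here is a toy, single-process stand-in for an object list. ToyObjList is a hypothetical class written for this note. It only mimics the add-by-object, look-up-by-id behaviour suggested above and says nothing about how Husky actually stores or distributes objects.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Same shape as the Obj class above.
class Obj {
public:
    using KeyT = int;
    KeyT key;
    const KeyT& id() const { return key; }
    explicit Obj(const KeyT& k) : key(k) {}
};

// Toy stand-in for an object list (hypothetical, not Husky's class).
template <typename ObjT>
class ToyObjList {
public:
    // Store a copy of the object, indexed by id(); returns its position.
    // Adding an object whose id already exists keeps the existing one.
    std::size_t add_object(const ObjT& obj) {
        auto it = index_.find(obj.id());
        if (it != index_.end()) {
            return it->second;
        }
        objects_.push_back(obj);
        index_[obj.id()] = objects_.size() - 1;
        return objects_.size() - 1;
    }

    // Look up an object by key; returns nullptr when it is absent.
    const ObjT* find(const typename ObjT::KeyT& key) const {
        auto it = index_.find(key);
        return it == index_.end() ? nullptr : &objects_[it->second];
    }

    std::size_t size() const { return objects_.size(); }

private:
    std::vector<ObjT> objects_;
    std::unordered_map<typename ObjT::KeyT, std::size_t> index_;
};
```

The only requirement on the object type in this sketch is a KeyT alias and an id() accessor, which is exactly what the Obj class provides.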
Channel
• How object lists communicate with each other
Push Channel – sends messages from source to destination
Push Combined Channel – sends combined messages
Channel
auto& ch =
ChannelFactory::create_push_channel<int>(src_list,
dst_list);
ch.push(msg, key);
auto& msgs= ch.get(obj);
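The difference between the two channel types can be sketched with two toy classes. Both are hypothetical, single-process illustrations (ToyPushChannel and ToySumCombinedChannel are names invented here), and the combiner is assumed to be a sum.

```cpp
#include <unordered_map>
#include <vector>

// Toy push channel (hypothetical): keeps every message sent to a key.
template <typename MsgT, typename KeyT = int>
class ToyPushChannel {
public:
    void push(const MsgT& msg, const KeyT& key) { inbox_[key].push_back(msg); }

    // All messages received by `key` (empty if there are none).
    const std::vector<MsgT>& get(const KeyT& key) const {
        static const std::vector<MsgT> empty;
        auto it = inbox_.find(key);
        return it == inbox_.end() ? empty : it->second;
    }

private:
    std::unordered_map<KeyT, std::vector<MsgT>> inbox_;
};

// Toy push combined channel (hypothetical): messages for the same key are
// folded into a single value, here by summing them.
template <typename MsgT, typename KeyT = int>
class ToySumCombinedChannel {
public:
    void push(const MsgT& msg, const KeyT& key) { combined_[key] += msg; }

    // The combined value for `key`, or a value-initialized MsgT if none.
    MsgT get(const KeyT& key) const {
        auto it = combined_.find(key);
        return it == combined_.end() ? MsgT() : it->second;
    }

private:
    std::unordered_map<KeyT, MsgT> combined_;
};
```

With the plain channel the receiver sees a list of messages. With the combined channel it sees one value, which reduces the amount of data that has to be kept and sent per destination.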
API
globalize(objlist);
list_execute(objlist, {&in_ch}, {&out_ch}, [&](Obj& obj)
{
auto msg = in_ch.get(obj);
...
out_ch.push(msg, key); // key: id of the destination object
});
Data Source
Applications need to read data from many sources
For instance:
use HDFSInputFormat to read data from HDFS
use MongoDBInputFormat to read data from MongoDB
use NFSInputFormat to read data from local files
Data Source
The source of a Push Channel can be an input format:
auto& word_list= ObjListFactory::create_objlist<Word>();
HDFSLineInputFormat infmt;
infmt.set_input(Context::get_param("file_path"));
auto& ch=
ChannelFactory::create_push_combined_channel<int,
SumCombiner<int>>(infmt, word_list);
load(infmt, {&ch}, parse_wc);
list_execute(word_list, [&ch](Word & word)
{ ch.get(word); });
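The body of parse_wc is not shown above. As a rough, standalone sketch of what such a parser combined with a sum combiner amounts to, the code below splits a line into whitespace-separated words and adds 1 per occurrence. WordCounts and parse_line are hypothetical names, and the map stands in for the combined channel.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the combined channel: word -> summed count.
using WordCounts = std::unordered_map<std::string, int>;

// Plays the role parse_wc presumably has: split one input line into
// whitespace-separated words and "push" a count of 1 for each word.
void parse_line(const std::string& line, WordCounts& counts) {
    std::istringstream in(line);
    std::string word;
    while (in >> word) {
        counts[word] += 1;  // sum combiner: messages with the same key are added
    }
}
```

After all lines have been parsed, each word's entry holds its total count, which is the value an object would read from the combined channel in this toy model.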
The Development of Husky Applications
Development: three steps
• Define classes
• Create objects
• Implement dataflow
Example: PI calculation
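• Generate random points whose x and y coordinates are both at most 1 in absolute value (the square [-1, 1] × [-1, 1], area 4)
• $\frac{\text{Area of circle}}{\text{Area of square}} = \frac{\pi \cdot 1 \cdot 1}{4} \approx \frac{\#\text{ of points in the circle}}{\text{total } \#\text{ of points}}$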
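The ratio above gives pi as roughly 4 times the fraction of points that land inside the circle. A minimal single-threaded sketch of that estimate in standard C++, with no Husky involved, is shown below. The function names and the seed parameter are choices made for this illustration.

```cpp
#include <random>

// Count how many of `num_points` uniform random points in the square
// [-1, 1] x [-1, 1] fall inside the unit circle. `seed` fixes the sequence.
long count_points_in_circle(long num_points, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    long inside = 0;
    for (long i = 0; i < num_points; ++i) {
        const double x = dist(gen);
        const double y = dist(gen);
        if (x * x + y * y <= 1.0) {
            ++inside;
        }
    }
    return inside;
}

// pi ~= 4 * (points in circle) / (total points); num_points must be > 0.
double estimate_pi(long num_points, unsigned seed) {
    return 4.0 * static_cast<double>(count_points_in_circle(num_points, seed)) /
           static_cast<double>(num_points);
}
```

The estimate gets tighter as the number of points grows. With only a few thousand points it can easily be off in the second decimal place.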
Implementation: PI calculation
• Define class: PIObject
• Aggregate # of points in the circle
Implementation: PI calculation
• Create objects
• Each thread generates some points
Implementation: PI calculation
• Create objects
• Each thread sends # of points in the circle to PIObject with id 0
• Messages dynamically create PIObject 0
Implementation: PI calculation
• Implement dataflow
• PIObject 0 aggregates messages and outputs the final result
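The create-objects and dataflow steps can be imitated in a single process. In the sketch below every worker's count is "pushed" to key 0, the messages are summed, and the object with id 0 is created on demand by the first message. PIObject here is a bare hypothetical struct and aggregate_pi is an invented name. The real application uses Husky's object list and channel instead of the two maps.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for PIObject: it only needs a key.
struct PIObject {
    int key;
};

// Simulates the dataflow in one process: every worker sends its in-circle
// count to key 0, messages for the same key are summed, and "PIObject 0"
// is created when the first message for key 0 arrives.
double aggregate_pi(const std::vector<long>& in_circle_per_worker,
                    long points_per_worker) {
    std::unordered_map<int, long> combined;        // key -> summed messages
    std::unordered_map<int, PIObject> pi_objects;  // objects created by messages
    for (long count : in_circle_per_worker) {
        const int dst = 0;
        combined[dst] += count;
        pi_objects.emplace(dst, PIObject{dst});  // no-op if it already exists
    }
    const long total =
        points_per_worker * static_cast<long>(in_circle_per_worker.size());
    if (total == 0) {
        return 0.0;  // nothing was generated
    }
    // What PIObject 0 would output: 4 * (points in circle) / (total points).
    return 4.0 * static_cast<double>(combined[0]) / static_cast<double>(total);
}
```

For example, three workers with 1000 points each and in-circle counts of 780, 790 and 785 give 4 * 2355 / 3000 = 3.14.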
Example: PageRank
• Google's webpage ranking algorithm
• Each page votes for other pages with its PageRank value
• Pages with high PageRank values are more important
• Notation
• $pr_i(v)$: PageRank value of a vertex $v$ at the $i$-th iteration
• $d$: a factor which is usually set to 0.85
• $\Gamma_{in}(v)$ and $\Gamma_{out}(v)$: in- and out-neighbors of vertex $v$
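With this notation, one common form of the PageRank update can be written as below. This is an assumption: the exact variant (for instance, whether the constant term is divided by the number of vertices) is not stated here.

```latex
pr_i(v) = (1 - d) + d \sum_{u \in \Gamma_{in}(v)} \frac{pr_{i-1}(u)}{|\Gamma_{out}(u)|}
```

Each in-neighbor $u$ splits its previous value evenly among its out-neighbors, and $v$ sums the shares it receives.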
Example: PageRank – a vertex view
• Aggregate votes from in-neighbors
• Calculate the PageRank value
• Vote to out-neighbors
Implementation: PageRank
• Define class: Vertex
• Id, out-neighbors and PageRank value
• Serialization and deserialization
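As an illustration of what serialization and deserialization of such a vertex involve, here is a standalone sketch that writes the three fields into a byte buffer and reads them back. The Vertex layout, the put/take helpers and the use of std::string as a buffer are all assumptions for this example. Husky provides its own stream type for this, which is not shown here.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical Vertex: id, out-neighbors and PageRank value.
struct Vertex {
    int id = 0;
    std::vector<int> out_neighbors;
    float pr = 0.0f;
};

// Append the raw bytes of a trivially copyable value to the buffer.
template <typename T>
void put(std::string& buf, const T& value) {
    buf.append(reinterpret_cast<const char*>(&value), sizeof(T));
}

// Read a value back from `buf` at `pos` and advance `pos`.
// No bounds checking: this is for illustration only.
template <typename T>
T take(const std::string& buf, std::size_t& pos) {
    T value;
    std::memcpy(&value, buf.data() + pos, sizeof(T));
    pos += sizeof(T);
    return value;
}

// Layout: id, neighbor count, neighbors, pr (host byte order).
std::string serialize(const Vertex& v) {
    std::string buf;
    put(buf, v.id);
    put(buf, static_cast<std::size_t>(v.out_neighbors.size()));
    for (int n : v.out_neighbors) {
        put(buf, n);
    }
    put(buf, v.pr);
    return buf;
}

// Reads the fields back in the same order they were written.
Vertex deserialize(const std::string& buf) {
    std::size_t pos = 0;
    Vertex v;
    v.id = take<int>(buf, pos);
    const std::size_t count = take<std::size_t>(buf, pos);
    v.out_neighbors.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        v.out_neighbors.push_back(take<int>(buf, pos));
    }
    v.pr = take<float>(buf, pos);
    return v;
}
```

The variable-length neighbor list is the reason a length prefix is needed: the reader must know how many ints to consume before the PageRank value.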
Implementation: PageRank
• Create vertex objects from an adjacency list
• Parse each line from each input splitter into a vertex
• Globalize the vertices (enables sending messages to a vertex)
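The parsing step might look like the sketch below. The input format is an assumption: each line is taken to be a vertex id followed by its out-neighbor ids, separated by whitespace. The actual lab data may use a different layout. parse_vertex and this Vertex struct are hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical Vertex for this sketch.
struct Vertex {
    int id = 0;
    std::vector<int> out_neighbors;
    float pr = 0.0f;  // to be set before the first iteration
};

// Parse one adjacency-list line of the assumed form "id n1 n2 ...".
// Returns false for a blank line or one that does not start with an integer.
bool parse_vertex(const std::string& line, Vertex& out) {
    std::istringstream in(line);
    int id = 0;
    if (!(in >> id)) {
        return false;
    }
    out.id = id;
    out.out_neighbors.clear();
    int neighbor = 0;
    while (in >> neighbor) {
        out.out_neighbors.push_back(neighbor);
    }
    return true;
}
```

A vertex with no out-neighbors is still valid in this sketch: a line holding only an id produces an empty neighbor list.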
Implementation: PageRank
• Implement dataflow with list_execute and channels
• Aggregate votes from in-neighbors
• Calculate the PageRank value
• Vote to out-neighbors
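The three steps can be tried out on a small in-memory graph before moving to list_execute and channels. The sketch below is plain C++ and assumes the update pr = (1 - d) + d * (sum of received votes), with each vertex splitting its value evenly among its out-neighbors. The map of votes plays the role of a combined channel. All names are hypothetical.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical Vertex for this sketch.
struct Vertex {
    int id;
    std::vector<int> out_neighbors;
    double pr;
};

// One iteration. Votes are sent first and then aggregated within the same
// call, which over repeated iterations matches the aggregate / calculate /
// vote cycle described above.
void pagerank_step(std::vector<Vertex>& vertices, double d = 0.85) {
    std::unordered_map<int, double> votes;  // destination id -> summed votes
    for (const Vertex& v : vertices) {
        if (v.out_neighbors.empty()) {
            continue;  // a vertex without out-neighbors sends nothing
        }
        const double share = v.pr / static_cast<double>(v.out_neighbors.size());
        for (int dst : v.out_neighbors) {
            votes[dst] += share;
        }
    }
    for (Vertex& v : vertices) {
        auto it = votes.find(v.id);
        const double received = (it == votes.end()) ? 0.0 : it->second;
        v.pr = (1.0 - d) + d * received;
    }
}

// Run a fixed number of iterations.
void pagerank(std::vector<Vertex>& vertices, int iterations, double d = 0.85) {
    for (int i = 0; i < iterations; ++i) {
        pagerank_step(vertices, d);
    }
}
```

On a three-vertex cycle where every vertex starts at 1.0, each vertex receives exactly one full vote per iteration, so all values stay at 1.0. This makes a convenient sanity check.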