Assignment-3_Paper

Cache-conscious Frequent Pattern Mining on a Modern Processor
The majority of reviewers commented on the section where the authors discuss
exploiting SMT. I agree that this section could give more details about the idea. Below I
comment on the weak points, since they are the only points worth discussing.
Adrian:
- I agree that the data sets and the chosen support thresholds are not well explained
and that, in general, all graphs show the same trends. As such, the experimental
section is not very solid.
- True, but the paper focuses on a specific domain of data mining to show
that data-mining algorithms should take cache locality into account.
Cansu:
- Valid point but this might not be straightforward.
- True. In general the section where the authors exploit multi-threading is not
well written.
Danica:
- The motivation of this work is to reduce the memory stalls that the processor
suffers during the execution of a cache-unconscious data-mining algorithm.
The difference between future processors and the processor that the authors
use is the size of the last-level on-chip cache. The authors exploit spatial
locality by improving cache-line utilization, and temporal locality by splitting
the tree into multiple tiles, each of which fits in the cache. So, on future
processors, the authors would simply need a different tile size when
partitioning the tree.
- In fact, in a chip multiprocessor each thread has its own L1 cache, so
threads will not compete for a shared L1 cache. However, they will
compete for the shared LLC. To see whether the current implementation
can be used on a chip multiprocessor, the authors should show how many
memory stalls SMT removes at each level of the cache hierarchy. The
algorithm might then need small modifications that take into account the
fact that cores benefit from sharing the LLC. There are many ways to
exploit multiple cores; the major contribution of the paper is that people
should take the CPU architecture into consideration.
Djordje:
- The authors indeed do not discuss the overhead of constructing the
cache-conscious prefix tree. However, this cost is included in the total
execution time, and the results show that the benefits of using this tree
outweigh the overhead. Of course, a quantitative analysis should be done.
- I think that the methodology is clear. They use hardware counters to evaluate
the CPI. Even if the numbers are rounded, the measured CPI is so large that
rounding does not change the conclusion that it is very far from the base CPI
of this processor. Still, I fully agree that they should show non-rounded
numbers, in case they didn't.
- For a CPU-bound algorithm you would expect this to be linear. The graph shows
readers that the performance of a memory-bound algorithm is very far from
optimal.
- Regarding the CPU utilization, I fully agree that the term is misleading. The
authors meant useful computation.
Farhan:
- In my opinion, simulation would be preferable if the authors wanted to
explore various hardware architectures.
- This might be the case on a multicore processor, where we can have false
sharing. But not only do the authors use a uniprocessor, the operations on the
tree are read-only (after it is built), so I don't see any negative effect from
placing more nodes in one cache line.
- Assuming a tree that does not fit in main memory is orthogonal to the
cache optimizations that the authors implement.
Ioannis:
- I agree that the authors should report how long the transformation into
the cache-conscious tree takes. However, this is included in the total
execution time, and hence we can see that it is clearly outweighed.
- True. It would be nice to use real datasets.
- True. Personally, I needed to go through the references to understand how the
algorithm works.
Manos:
- See above comments
Mutaz:
- See comments on Djordje’s reviews.
- I agree that the algorithms were not well explained and that readers who
are not familiar with them need to go through the seminal work.
- Indeed, the authors decline to use larger data sets, saying that the FIMI
implementations cannot handle them. Although this is not a good excuse,
I don't think that larger datasets would change anything.
Onur:
- Actually, the FP-tree is memory-resident and is much smaller than the
dataset; the dataset itself therefore does not need to fit in memory. The
focus of this paper is the cache performance of data-mining algorithms, so
I consider examining a non-memory-resident FP-tree to be beyond its scope.
- I think that the main contribution of this paper is that the authors are the
first in the data-mining community to take the underlying hardware
architecture into consideration. As such, I am sure that other data-mining
algorithms can benefit from this contribution.
- Indeed, specialized hardware for data mining can be very effective. However,
this would imply using architectural simulators.
Pinar:
- The overhead of rebuilding the prefix tree is outweighed by the benefits of
exploiting data locality.
- I don't consider complexity a weak point. It will indeed take more time to
build a cache-conscious algorithm or to exploit multithreading. However,
we cannot expect more performance without exploiting the cache, even if
this requires more programming effort.
Renata:
- This is true, but if you want to optimize CPU or cache performance you
cannot be I/O-bound. Having a non-memory-resident tree is orthogonal.
- Frequent pattern mining is an important data-mining problem. The
contribution of this paper is not to give an implementation of frequent
pattern mining that exploits cache locality, but to show that cache
consciousness is the right way to increase the performance of data-mining
algorithms.
Sotiria:
- True. The FPGrowth algorithm was not well explained.
- True. Prefetching does not significantly improve performance, but a
simple sequential or stride prefetcher already exists in modern processors
and hence adds no additional cost.