PROBLEM STATEMENT
Authorship Recognition: Who wrote this document?
Stylometry is a field that relies on linguistic information found in a
document to perform authorship recognition.
①  Does machine translation anonymize text?
②  Can we detect the machine translation tools?
③  What kind of a feature set do we need?
MOTIVATION
Stylometry is currently used in intelligence analysis and forensics.
RESULTS
Identifying translators in one-way translations:
93.60% correct classification
Authorship attribution in one-way translations:
84.38% correct classification
REAL WORLD APPLICATION
Underground forum users worldwide:
72.53% correct classification
Note: Underground forum data is not similar to common writing.
BACKGROUND AND RELATED WORK
•  State-of-the-art stylometry methods can
identify individuals in sets of 50 authors
with over 90% accuracy as shown in Abbasi
and Chen’s work.
•  Rao and Rohatgi introduced the idea of
translating text to a different language and
then back to its original language using a
machine translation tool to obfuscate a
text’s author.
•  Hedegaard and Simonsen researched
authorship attribution in translated text;
the results in this work outperform theirs.
APPROACH AND UNIQUENESS
•  Supervised Stylometry: Given a set of
documents of known authorship, use JStylo to
classify a document of unknown authorship.
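The supervised setup can be illustrated with a minimal Python sketch. It is not JStylo (a separate tool) and not the classifier used in this work. As a stand-in, it represents each document by relative letter frequencies and assigns the unknown document to the author whose averaged profile is nearest. The function names and the toy texts are assumptions made for this example only.

```python
from collections import Counter
import math
import string


def letter_profile(text):
    """Relative frequency of each ASCII letter in the text (case-insensitive)."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters) or 1  # avoid division by zero on empty input
    return [counts[c] / total for c in string.ascii_lowercase]


def train(labeled_docs):
    """Average the profiles of each author's known documents.

    labeled_docs: iterable of (author, text) pairs.
    Returns {author: centroid profile}.
    """
    by_author = {}
    for author, text in labeled_docs:
        by_author.setdefault(author, []).append(letter_profile(text))
    return {
        author: [sum(column) / len(profiles) for column in zip(*profiles)]
        for author, profiles in by_author.items()
    }


def classify(centroids, text):
    """Return the author whose centroid is nearest in Euclidean distance."""
    profile = letter_profile(text)
    return min(centroids, key=lambda author: math.dist(profile, centroids[author]))


# Toy data, invented for this example only.
known = [
    ("author_a", "zzz zebra zigzag zone zzz"),
    ("author_a", "zip zap zoom zzz"),
    ("author_b", "moon room soon noon"),
    ("author_b", "book look took hood"),
]
centroids = train(known)
predicted = classify(centroids, "zest zeal zzz")
```

A realistic system would swap `letter_profile` for a richer feature set and the nearest-centroid rule for a trained classifier; the train-then-classify structure stays the same.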
CONTRIBUTIONS
•  Translations do not obfuscate authors.
•  We can detect which machine translation tool
was used to translate text.
CONCLUSIONS AND FUTURE WORK
Are we anonymous?
TRANSLATION FEATURE SET
Average characters per word, Character count, Function words, Letters
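Several of the features named on this poster are simple counts over characters and words. The sketch below computes a handful of them in plain Python. The exact definitions used by JStylo may differ, and the function name, the short function-word list, and the `top_n` parameter are assumptions made for illustration.

```python
from collections import Counter
import re
import string

# Tiny illustrative list (an assumption); real function-word lists are much longer.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is")


def translation_features(text, top_n=3):
    """Compute a few of the poster's named features for one document."""
    lowered = text.lower()
    words = re.findall(r"[a-z]+", lowered)  # runs of ASCII letters only
    letters = [c for c in lowered if c in string.ascii_lowercase]
    word_counts = Counter(words)
    # Letter bigrams are taken inside words so they never span a space.
    bigrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    return {
        "character_count": len(text),
        "avg_chars_per_word": sum(map(len, words)) / len(words) if words else 0.0,
        "function_words": {w: word_counts[w] for w in FUNCTION_WORDS},
        "letters": Counter(letters),
        "word_lengths": Counter(len(w) for w in words),
        "top_letter_bigrams": [bg for bg, _ in bigrams.most_common(top_n)],
    }


features = translation_features("The cat sat on the mat.")
```

Punctuation and special-character counts would follow the same pattern, with `string.punctuation` or a chosen symbol set in place of the letter alphabet.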
A lot of non-English text of interest...
The 2009 Technology Assessment for the State of
the Art Biometrics Excellence Roadmap (SABER)
commissioned by the FBI stated that, “As non-handwritten communications become more
prevalent, such as blogging, text messaging and
emails, there is a growing need to identify writers
not by their written script, but by analysis of the
typed content.”
•  Brennan-Greenstadt Adversarial Stylometry
Corpus for two-way translations:
    -  Diverse topics in 21st century writing
    -  13 native English speaking authors
    -  Minimum of 5000 words per author
    -  Use Google’s and Bing’s translators
    -  Translate to German and back to English
    -  Translate to Japanese and back to English
    -  Translate to German, then to Japanese and back to English
•  Uniqueness: Find language independent
features to form the ‘Translation Feature Set’.
Punctuation, Special Characters
•  Anonymouth is a novel framework for
anonymizing writing style.
•  Suggest two-way translations on author’s writing
in Anonymouth and replace text with translations
with the lowest anonymity index.
•  Check anonymity of the edited text with the
‘Translation Feature Set’.
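The two-way translation chains from the corpus description, and the step of choosing among the resulting translations, can be sketched as follows. The poster does not specify a translation API or define the anonymity index, so `Translator`, `fake_translate`, and the `anonymity_index` callable are all hypothetical placeholders. No real Google or Bing call is made here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical backend signature:
# translate(text, source_language, target_language) -> translated text.
Translator = Callable[[str, str, str], str]

# Language chains from the corpus description; each ends back in English.
CHAINS: List[Tuple[str, ...]] = [
    ("en", "de", "en"),
    ("en", "ja", "en"),
    ("en", "de", "ja", "en"),
]


def round_trip(text: str, chain: Tuple[str, ...], translate: Translator) -> str:
    """Pass the text through each consecutive language pair in the chain."""
    for source, target in zip(chain, chain[1:]):
        text = translate(text, source, target)
    return text


def candidates(text: str, translators: Dict[str, Translator]):
    """One round-trip output per (translator name, chain) combination."""
    return {
        (name, chain): round_trip(text, chain, translate)
        for name, translate in translators.items()
        for chain in CHAINS
    }


def pick_lowest(outputs, anonymity_index):
    """Return the (key, text) pair whose text has the lowest score."""
    return min(outputs.items(), key=lambda item: anonymity_index(item[1]))


# Stand-in backend for demonstration: it only tags the text with each hop.
def fake_translate(text: str, source: str, target: str) -> str:
    return f"{text}[{source}>{target}]"


outputs = candidates("hello", {"fake": fake_translate})
# len is used as a dummy score purely so the example runs.
best_key, best_text = pick_lowest(outputs, anonymity_index=len)
```

Passing the translator as a callable keeps the chain logic independent of any particular service, so a Google-backed and a Bing-backed function could be supplied side by side in the `translators` dictionary.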
Top letter bigrams, Top letter trigrams, Words, Word lengths
ACKNOWLEDGEMENTS
We thank DARPA (grant N10AP20014) for supporting this work.
CONTACT: [email protected]