SHERPA/RoMEO Forthcoming Developments and the API

RoMEO and CRIS
Technical Issues & Efficiency Tips
Peter Millington
Centre for Research Communications
University of Nottingham
RoMEO and CRIS in Practice
Birmingham, 1st April 2011
Outline
• Patterns of usage
– Do we have a crisis?
• Approaches to using RoMEO in CRIS
– Real time queries
– Caching and reusing RoMEO query results
• Rates of change – Reality Check
– And their implications
• Other efficiency tips
Usage of Interactive RoMEO
One Month
[Chart: number of page views by date, 27/02/2011 to 27/03/2011]
Usage of Interactive RoMEO
Long Term
[Chart: number of page views by month, Dec-2009 to Mar-2011]
Usage of Interactive RoMEO
• Similar curve shapes for other measures
• Distinct weekly pattern
• ~4,500 Page views per day
• ~1,000 Visits per day
• ~700 Unique visitors per day
• Seems to be a stable seasonal pattern
Usage of the RoMEO API – All Users
One Month
[Chart: number of IP addresses by date, 01/03/11 to 29/03/11]
Usage of the RoMEO API – All Users
Long Term
[Chart: number of IP addresses by date, 01/11/2009 to 24/02/2011]
Usage of the RoMEO API – Requests
One Month
[Chart: number of requests by date, 01/03/2011 to 29/03/2011]
Usage of the RoMEO API – Requests
Long Term
[Chart: number of requests by date, 01/11/2009 to 24/02/2011]
Usage of the RoMEO API
• Much more variable pattern
– Weekly cycle of visits less distinct
– Number of requests very highly variable
• More usage by fewer users
– ~60 Unique visitors per day
– Over 250,000 hits per day (>50 times interactive)
• Significant growth
– Steady growth in number of API users
– Rapid growth in number of requests
Do we have a Crisis?
• Do you ever think RoMEO is slow?
– Most API usage is by CRIS-like applications
• How can we improve things?
– Higher capacity server?
• Funding? Unnecessary?
– Improve efficiency?
• Optimise the API? More efficient usage?
– Put a cap on number of requests per day?
• What level? 1000? 2000?
– Block commercial software users?
• N.B. Creative Commons License
API approaches in CRIS applications
• Real time requests when displaying data
– Acceptable for individual article displays
– Latency too high for lists of articles
• Caching RoMEO data for rapid local re-use
– Initial (bulk) checks against RoMEO
– Store the results locally
– Periodically recheck for updated policies
• Whole bibliography
• Additions and updates only
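The caching approach above can be sketched roughly as follows. This is a minimal illustration, not a description of any particular CRIS: the class name `PolicyCache`, the choice of ISSN as the cache key, the `fetch` callback and the seven-day recheck interval are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Optional, Tuple

# Hypothetical recheck interval; the slide only says "periodically".
RECHECK_AFTER = timedelta(days=7)


class PolicyCache:
    """Local store of RoMEO results, keyed by ISSN (an illustrative choice)."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # function that actually queries RoMEO
        self._store: Dict[str, Tuple[str, datetime]] = {}

    def get(self, issn: str, now: Optional[datetime] = None) -> str:
        now = now or datetime.now()
        cached = self._store.get(issn)
        if cached is not None and now - cached[1] < RECHECK_AFTER:
            return cached[0]        # fresh enough: no API request
        policy = self._fetch(issn)  # missing or stale: one API request
        self._store[issn] = (policy, now)
        return policy
```

Displaying a list of articles then reads from the local store, and RoMEO is only contacted for journals that are new to the cache or due for a recheck.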
Real Time Usage Pattern
[Chart: number of requests by date, 01/11/2009 to 16/03/2011]
Real Time Usage Pattern
One Month
[Chart: number of requests by date, 01/03/2011 to 29/03/2011, overlaid on the long-term chart for 01/11/2009 to 16/03/2011]
Real Time Usage Pattern
• Levels vary day by day
– Arguably high usage for one installation
• Occasional peaks
– Special system jobs
– Special end user projects
Caching with Monthly Updates
[Chart: number of requests by date, 01/11/2010 to 01/04/2011]
Caching with Monthly Updates
• Rechecking the whole database each cycle
– Seems to take three days. Low priority setting?
• Scheduled job – starts 1st of the month
– Could it be a weekend instead?
• Faster. Less intrusive.
• What is being checked?
– Each reference?
– Groups of records for each journal title?
• What about additions between cycles?
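If the check is made once per journal rather than once per reference, the number of requests falls to the number of distinct journals. A rough sketch of the grouping, assuming each reference is available as a hypothetical `(reference_id, issn)` pair:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_issn(references: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each ISSN to the IDs of the references published in that journal."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ref_id, issn in references:
        groups[issn].append(ref_id)
    return dict(groups)


# Made-up sample data: three references in two journals.
refs = [("a1", "1111-1111"), ("a2", "2222-2222"), ("a3", "1111-1111")]
journals = group_by_issn(refs)
print(len(refs), "references,", len(journals), "RoMEO checks")
# prints: 3 references, 2 RoMEO checks
```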
Caching with Daily Updates (1)
[Chart: number of requests by date, 01/06/2010 to 28/03/2011]
Caching with Daily Updates (1)
One Month
[Chart: number of requests by date, 01/03/2011 to 31/03/2011, overlaid on the long-term chart for 01/06/2010 to 28/03/2011]
Caching with Daily Updates (1)
• Whole database checked every day
– Institutions can easily have lists of 50,000 items!
– Lists constantly growing, slowing things down
• What is being checked?
– Each reference? Probably
• Additions and updates between checks?
– No accuracy problems
• Sledgehammer to crack a nut
Is the nut cracking the sledgehammer?
[Chart: number of requests by date, 01/05/2010 to 25/02/2011]
Caching with Daily Updates (2)
[Chart: number of requests (log scale) by date, 01/03/2010 to 24/02/2011]
Caching with Daily Updates (2)
• Note the logarithmic scale
• Large initial check of the whole database
• Daily check of added & changed items only
• Welcome low loading on the API
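A daily check of added and changed items only might be selected along these lines. The `Record` type and its `modified` timestamp are hypothetical; a real CRIS would use whatever "last added or edited" field its own schema provides.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Record:
    ref_id: str
    modified: datetime  # hypothetical "last added or edited" timestamp


def needs_check(records: List[Record], last_run: datetime) -> List[Record]:
    """Return only the records added or changed since the previous run."""
    return [r for r in records if r.modified > last_run]
```

After the large initial check, each daily run then sends requests only for the handful of records returned by `needs_check`.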
Rates of Change – Reality Check
• Institutional Bibliographies
– Up to 2,000 additions per year (<40 per week)
– Few bibliographic changes after initial QA
• RoMEO Publishers’ Policies
– c.25 additions or substantive changes per week
• Journal - Publisher Correlations
– Change of publisher - infrequent - mostly January
– Bulk changes - Business take-over or name change
• Expiry of archiving embargos
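To put these rates next to the request volumes, here is a back-of-envelope comparison. It combines the figures on this slide with the 50,000-item list size quoted on the "Caching with Daily Updates (1)" slide, and assumes one API request per item checked, which is a simplification.

```python
ADDITIONS_PER_YEAR = 2000  # upper figure from this slide
BIBLIOGRAPHY_SIZE = 50000  # list size quoted on an earlier slide

# Additions per week: consistent with the "<40 per week" on the slide.
additions_per_week = ADDITIONS_PER_YEAR / 52
print(f"{additions_per_week:.1f}")  # prints: 38.5

# Requests per year if the whole list is rechecked every day.
daily_full_recheck = BIBLIOGRAPHY_SIZE * 365
print(daily_full_recheck)  # prints: 18250000

# Requests per year if only additions are checked, once each.
additions_only = ADDITIONS_PER_YEAR
print(daily_full_recheck // additions_only)  # prints: 9125
```

On these assumptions a daily full recheck generates several thousand times more requests than the underlying rate of change requires.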
RoMEO Implications of Change Rates
• Institutional Bibliographies
– Only need to check additions & changes
– Weekly check probably sufficient, or on first use
• RoMEO Publishers’ Policies
– Recheck when the RoMEO record changes
– Store RoMEO ID with article/journal for bulk updates
• Journal - Publisher Correlations
– Full recheck annually on rolling cycle
– Specific rechecks for known business/name changes
• Expiry of archiving embargos
– Scope for improvement in RoMEO
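One possible way to spread an annual full recheck over a rolling cycle is to give each journal a fixed weekly slot. The sketch below derives the slot from a checksum of the ISSN; the use of `zlib.crc32` and the 52-slot year are illustrative choices, not anything RoMEO prescribes.

```python
import zlib
from typing import Iterable, List

WEEKS_PER_YEAR = 52


def recheck_week(issn: str) -> int:
    """Return a stable weekly slot (0 to 51) for a journal's annual recheck."""
    return zlib.crc32(issn.encode("utf-8")) % WEEKS_PER_YEAR


def due_this_week(issns: Iterable[str], week_of_year: int) -> List[str]:
    """List the journals whose slot matches the given week number."""
    slot = week_of_year % WEEKS_PER_YEAR
    return [issn for issn in issns if recheck_week(issn) == slot]
```

Each journal falls due once per 52-week cycle, and the load is spread across the year instead of arriving in a single burst.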
Caching of RoMEO Publisher Data
• Download the whole database with “?all=yes”
– Relatively fast
– Download as often as you wish
• Suggest weekly
• And/Or…
– Store key RoMEO data with bibliographic records
– Provide links to interactive RoMEO
• Full publisher records using RoMEO ID, or
• Journal level data using ISSN
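A weekly refresh of the full download could be organised roughly as below. The `all=yes` parameter is the one named on the slide; the base address is the API link given at the end of this presentation, and the exact endpoint path is an assumption that should be checked against the RoMEO API documentation. The helper names are hypothetical.

```python
import os
import time
from typing import Optional
from urllib.parse import urlencode

# Assumption: API address as given on the closing slide; the real endpoint
# path may differ, so check the RoMEO API documentation.
API_BASE = "http://www.sherpa.ac.uk/romeo/api"
ONE_WEEK = 7 * 24 * 60 * 60  # seconds


def full_download_url(base: str = API_BASE) -> str:
    """Build the URL that requests the whole publisher database."""
    return base + "?" + urlencode({"all": "yes"})


def snapshot_is_stale(path: str, now: Optional[float] = None) -> bool:
    """True if the local copy is missing or more than a week old."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > ONE_WEEK
```

A scheduled job would call `snapshot_is_stale` on the local file and only download again when it returns `True`.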
Caching Journal-level Data
• Schema/Organisation
– Per journal (efficient)
– Per article (probably inefficient)
• Fields
– Journal title
– ISSN and ESSN
– RoMEO Persistent Publisher ID
– RoMEO Colour and/or Version-specific permissions
• Normal – i.e. At the time of publication
• Adjusted after the completion of any embargo period
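A per-journal cache record with these fields might look something like this. The field names, the `embargo_months` field and the `colour_on` helper are hypothetical; only the list of fields comes from the slide.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class JournalPolicy:
    title: str
    issn: Optional[str]
    essn: Optional[str]
    romeo_publisher_id: str    # RoMEO persistent publisher ID
    colour: str                # "normal" colour, at the time of publication
    colour_after_embargo: str  # colour once any embargo has ended
    embargo_months: int = 0    # hypothetical field; 0 means no embargo

    def colour_on(self, published: date, today: date) -> str:
        """Return the colour that applies to an article on a given day."""
        elapsed = (today.year - published.year) * 12 + (today.month - published.month)
        if today.day < published.day:
            elapsed -= 1  # the current month is not yet complete
        if elapsed >= self.embargo_months:
            return self.colour_after_embargo
        return self.colour
```

Storing one such record per journal, rather than per article, means a policy change is updated in one place.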
Most Efficient RoMEO Queries
• Journals
– ISSN/ESSN or Exact Title
• Unique or far fewer results, so faster
• May avoid the overhead of needing to search Zetoc
• Publishers
– RoMEO ID
• Unique result. It gets no faster.
– Exact publisher name
• May sometimes find multiple results.
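The preference order above can be expressed as a small helper that picks the most specific key available for each record. The dictionary keys and the returned labels are hypothetical and are not RoMEO API parameter names; the labels would need mapping onto the real parameters from the API documentation.

```python
from typing import Dict, Optional, Tuple


def best_journal_query(record: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Prefer ISSN or ESSN; fall back to an exact title match."""
    for key in ("issn", "essn"):
        if record.get(key):
            return (key, record[key])
    if record.get("title"):
        return ("exact_title", record["title"])
    return None


def best_publisher_query(record: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Prefer the RoMEO ID; fall back to the exact publisher name."""
    if record.get("romeo_id"):
        return ("romeo_id", record["romeo_id"])
    if record.get("publisher"):
        return ("exact_publisher_name", record["publisher"])
    return None
```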
What to do with failed requests?
• Don’t just keep rechecking!
• Not a journal article?
– Outside RoMEO’s scope. Prevent rechecking
• Data error (e.g. typo, bad abbreviation)?
– Correct the source data, then recheck
• No publisher or no policy in RoMEO?
– Feedback to RoMEO – if important
– Recheck infrequently – say annually or quarterly
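These rules amount to a small decision table. The sketch below is one way to encode it; the `Failure` categories mirror the three cases on the slide, while the 91-day interval is just one reading of "annually or quarterly".

```python
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Failure(Enum):
    NOT_A_JOURNAL_ARTICLE = "outside RoMEO's scope"
    DATA_ERROR = "typo or bad abbreviation in the source data"
    NOT_IN_ROMEO = "no publisher or no policy in RoMEO"


QUARTERLY = timedelta(days=91)  # assumption: quarterly rather than annual


def next_recheck(failure: Failure, today: date) -> Optional[date]:
    """Return the next scheduled recheck date, or None if none is scheduled."""
    if failure is Failure.NOT_A_JOURNAL_ARTICLE:
        return None  # never recheck
    if failure is Failure.DATA_ERROR:
        return None  # wait until the source data has been corrected
    return today + QUARTERLY
```

A record with a data error is rechecked when its correction is saved, not on a timer, so it needs no scheduled date.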
Any Questions?
RoMEO: http://www.sherpa.ac.uk/romeo
API: http://www.sherpa.ac.uk/romeo/api
Blog: http://romeoblog.jiscinvolve.org
E-mail: [email protected]
Twitter: @SHERPAServices
Peter Millington:
[email protected]
0115 84 68481
http://www.sherpa.ac.uk/romeo/