Accession Collecting

Documenting PGR
A short summary about
Theories
and
Practical work
June 2008
SADC
How do you store your data?
•
•
•
•
•
In MS Excel spreadsheet?
Handwritten cards?
In a database (like MS Access)?
In you head?
In another database format?
June 2008
SADC
What is a database?
• Databases are designed to offer an organized
mechanism for storing, managing and retrieving
information. They do so through the use of
tables. If you’re familiar with spreadsheets like
Microsoft Excel, you’re probably already
accustomed to storing data in tabular form. It’s
not much of a stretch to make the leap from
spreadsheets to databases.
• The ability to link tables is a very powerful
resource
June 2008
SADC
So why not use a spreadsheet?
Databases are actually much more powerful than spreadsheets in the way you’re able to
manipulate data . Just a few actions you can perform on a database:
•
•
•
•
Retrieve all records that match certain criteria
Update records in bulk
Cross-reference records in different tables
Perform complex aggregate calculations
June 2008
SADC
What is a database?
I have
data!
In other words:
Me
too!
Me
too!
Me
too!
A well-organized
set of interrelated
data held in one or
more files which
are capable of
being managed by
a software.
June 2008
Me
also!
Me
too!
Me
too!
SADC
Step by step
• Analysis
• of gene bank activities to determine information
and documentation needs
• Design
• of the manual and/or computerized system
based on documentation and information needs
• Implementation
• of the system that has been developed
June 2008
SADC
Data
• Quantitative
• Qualitative
Qualitative, description of the
object being examined. Blue,
horizontal ..
Quantitative, dealing with
numbers 94 g, 34 mm,
67%
June 2008
You can also store
pictures, links urladdresses, maps,
scanned material..
SADC
A relation can be seen as a table that stores information.
Attributes or columns
Person
Rows
Name
Filip
Address
Lund
Telephone
12345
Records
Peter
Malmö
23456
Victoria
Malmö
34567
June 2008
SADC
But every table is not an relation.
• There can not exist two rows in a table
that are exactly the same
• Every column in the table has a unique
name
• There is only one value in every cell.
• All values in one column belongs to the
same domain or are undefined (null)
June 2008
SADC
Redundant data
Data in just one table repeat information…
Students taking courses in a school
Name
Address
Tel
Course
Filip
Lund
12345
Math
Peter
Malmö
23456
Math
Victoria
Malmö
34567
English
Peter
Malmö
23456
History
What if Peter wants to take a course in English?
June 2008
SADC
Example: Students taking courses in a University
Person_id
Name
Adress
Tel
1
Filip
Lund
12345
12
Peter
Malmö
23456
123
Victoria
Malmö
34567
Course_id
Study
C2
Math
C3
Engl
C4
History
Person_id
Course_id
1
C2
12
C2
12
C4
C3
123
How do I store information if Peter wants to take a course in English?
June 2008
SADC
Primary key
•
•
•
•
A primary key can identify the data row
It must be unique
Cant be null
Can be a ID
The easiest way is to let the
program hold track of the
rows and automatically
increase the key.
Remember the primary key
must be unique!
June 2008
SADC
Table of accessions, data of seedsample
accessi
on_num
ber
NGB1
NGB2
NGB3
NGB4
NGB5
Scientific name
Triticum aestivum ssp.
aestivum
Triticum aestivum ssp.
aestivum
Triticum aestivum ssp.
aestivum
Triticum aestivum ssp.
aestivum
Triticum aestivum ssp.
aestivum
genebank
_institute
_name
origin_c
ountry
culton
_type
acqusition
_date
IDUNA
Nordic
Gene
Bank
Sweden
CV
20080625
STANDARD
Nordic
Gene
Bank
Sweden
CV
20080625
JARL
Nordic
Gene
Bank
Sweden
ANKAR
Nordic
Gene
Bank
Denmar
k
SAXO
Nordic
Gene
Bank
accession_name
20080625
CV
20080625
CV
20080625
Sweden
CV
The id’s are the primary key, because they
identify the rest of the rows
June 2008
SADC
But how store data in a good
and reliable way?
• Just to put everything in a file, in one big table is not a
good idea.
• Use a relational database (think of the db as several
tables that are connected to each other)
• Normalize it!
– Split up information into different tables and give the
record an identifier!
– Save related information in one table. (For instance
all personal data in a table called “person”)
– Link different tables together so you can retrieve all
information you need
June 2008
SADC
Normalization
Normalization is the process of efficiently
organizing data in a database. There are
two goals of the normalization process:
• eliminating redundant data (for example,
storing the same data in more than one table)
and
• ensuring data dependencies make sense (only
storing related data in a table). Both of these are
worthy goals as they reduce the amount of space
a database consumes and ensure that data is
logically stored.
June 2008
SADC
This database is poorly normalized.
In fact, it is not normalized at all. Everything is stored in one big table.
Imagine how many things you have to repeat if you test the same
June 2008
SADC
accession three times!
Person
id
Name
Address
Zipcode/city
countr
Title
Institute
Ins_street
Inst_City
Telephone
Cou
123
Magdalena
Svärdh
Bondev 21
227 38 Lund
Swe
IT staff
Nordic Genetic
Resource center
Smedjev 3
Alnarp
+46 40 536640
Swe
124
Louise Bondo
Danske vej
Farum
Den
Section
leader
Nordic Genetic
Resource center
Smedjev 3
Alnarp
+46 40 536640
Swe
125
Pia Ohlsson
Blentarp
123 45
Swe
Lab
Nordic Genetic
Resource center
Smedjev 3
Alnarp
+46 40 536640
Swe
126
Johan
Bäckman
Dag H. vg
222 54 Lund
Swe
IT boss
Nordic Genetic
Resource center
Smedjev 3
Alnarp
+46 40 536640
Swe
127
Jonas Nordling
S Hans gr
124 53 Lund
Swe
IT staff
Nordic Genetic
Resource center
Smedjev 3
Alnarp
+46 40 536640
Swe
128
Slatan Ibrahim
Long Street
02
Copenhagen
Den
Apple expert
Pometet LIFE
Institute for
Agricultural
Science
Short
Street
2630
Taastrup
+45 35 28 4500
DNK
129
Per Person
A street
123 45Malmö
swe
Seed store
Nordic Genetic
Resource center
Smedjev 3
Alnarp
+46 40 536640
Swe
130
Sven Svensson
B street
123 45Malmö
swe
Seed store
Nordic Genetic
Resource center
Smedjev 3
Alnarp
+46 40 536640
Swe
131
Karl Krlssn
C street
123 45Malmö
swe
Seed store
Nordic Genetic
Resource center
Smedjev 3
Alnarp
+46 40 536640
Swe
Information about the institute where people
work is repeated several times!
Imagine how boring it will be to add the same
information for all staff members in NordGen!
June 2008
SADC
Person
id
Name
Address
Zipcode/city
countr
Title
Inst_id
123
Magdalena
Svärdh
Bondev 21
227 38 Lund
Swe
IT staff
45
124
Louise Bondo
Danske vej
Farum
Den
Section
leader
45
125
Pia Ohlsson
Blentarp
123 45
Swe
Lab
45
126
Johan
Bäckman
Dag H. vg
222 54 Lund
Swe
IT boss
45
127
Jonas Nordling
S Hans gr
124 53 Lund
Swe
IT staff
45
128
Slatan Ibrahim
Long Street
02
Copenhagen
Den
Apple expert
46
129
Per Person
A street
123 45Malmö
swe
Seed store
45
130
Sven Svensson
B street
123 45Malmö
swe
Seed store
45
131
Karl Krlssn
C street
123 45Malmö
swe
Seed store
45
Institutes
Inst_id
Institute
Ins_street
Inst_City
Telephone
Cou
45
Nordic Genetic
Resource center
Smedjev 3
Alnarp
+46 40 536640
Swe
46
Pometet LIFE Institute
for Agricultural Science
Short
Street
2630
Taastrup
+45 35 28 4500
DNK
June 2008
SADC
Much better!
If the Institute
changes telephone
number you just
have to change it
once!
Relational Database Managers
By linking files or information together, a
relationship is produced between the tables.
The shared field is only stored once, it is not
duplicated in each file
June 2008
SADC
You often visualize a table like this:
person
id
name
address
zipcode/city
countr
title
inst_id
123
Magdalena
Svärdh
Bondev 21
227 38 Lund
Swe
IT staff
45
124
Louise Bondo
Danske vej
Farum
Den
Section
leader
45
125
Pia Ohlsson
Blentarp
123 45
Swe
Lab
45
126
Johan Bäckman
Dag H. vg
222 54 Lund
Swe
IT boss
45
127
Jonas Nordling
S Hans gr
124 53 Lund
Swe
IT staff
45
128
Slatan Ibrahim
Long Street
02
Copenhagen
Den
Apple expert
46
129
Per Person
A street
123 45Malmö
swe
Seed store
45
130
Sven Svensson
B street
123 45Malmö
swe
Seed store
45
131
Karl Krlssn
C street
123 45Malmö
swe
Seed store
45
June 2008
SADC
person
id
name
adress
zipcode/city
country
title
inst_id
another example…
Fields which have a logical relation to Seed info
each other and to the identifying
Accession_id
field, should be grouped together in
Accession name
a relation.
Species
Date of acquisition
This is not a good example!
DonorInstituteName
Street Address
City Address
Other fields…
If the gene bank receives another accession at a later date , the entire address would need to be
entered again in a different record. It would be better to store information about the donors in a
separate table.
June 2008
SADC
… this would be better
Donor
Seed info
Accession_id
AccDon_id
Accession name
donPer_id
Species
donInst_id
Date of acquisition
donAccname
Amount of seeds
Don_info
AccDon_id
Other fields…
June 2008
Information about the donated material,
from which institute, person who donated
it, what the accession has been called in
the donating institute and other
information about the donated material
SADC
…yet another example
Storage
Registration
Passport
Accession name
Accession name
Accession name
Scientific name
Collecting institute
Freezer_number
Original country
Collector
Box_number
Other fields…
Date of collection
Storage_date
Other fields…
Other fields…
What happens if you have to change the name? You have to change it in all
three otherwise data will be lost.
The best thing is to store a link to the accession and store all the
information about the accession elsewhere.
June 2008
SADC
Accession
Passport
Storage
Accession_id
Accession_id
Accession_id
Accession name
Collecting institute
Freezer_number
Scientific name
Collector
Box_number
Original country
Date of collection
Storage_date
Other fields…
Other fields…
Other fields…
If you misspelled the accession name, you just have to change it once
June 2008
SADC
Visualized in a ER-model (Entity-Relationship model)
it looks like this:
Gene Bank
1:N
Seed sending info
Our institute (Gene Bank) can receive many seeds (with
seed sending information)
June 2008
SADC
Visualized in a ER-model (Entity-Relationship model)
it looks like this:
species
1
1
*
accessions
0..5
teacher
students
One species can have many accessions.
One accession belongs to just one species
One teacher can have up to five students to teach
June 2008
SADC
Different categories for data modeling
Object oriented model
Relational model
June 2008
SADC
Data modeling- how to get started
• To start – find the object
• Let the object become a relation
• Each object or relation should have its own information and
descriptive attributes
• The object becomes the table, the attributes columns and the
identifying attribute becomes the primary key
• An object could be
– Persons
– Places
– Objects (like Accessions, institutes…)
– Events (like seed drying, germination tests..)
Things you want to store information about
June 2008
SADC
Data modeling
• A data model makes it easier to understand the meaning of
the data
• A common and popular data model used in database
design is based on the concepts of the Entity –
Relationship (ER) model
• Conceptual
–
logical structure on databases. Designed by experts, information users, those who know and are
familiar with the information. Identification of important entity and relationship types
• Physical structure
–
designed by experts, database developers. How the logical structure is to be physically
implemented (as tables) on the target DBMS
June 2008
SADC
Data modeling:
• Should represent reality and the information that
need to be stored
• Help us to visualize our task and clarify rules
and restrictions and relations between objects
• Gives the staff a chance to participate early in
the development
• Is a established way of developing information
systems
• Is a way to document and explain the IT-system
June 2008
SADC
Accession and collecting information
Start to gather related information into tables and link them together
•
•
•
•
•
•
Accession number
Scientific name
Pedigree name
Donor name
Acquisition name
And more…
June 2008
• Collecting
organization
• Collecting date
• Country of collection
• And more…
SADC
A model of accession descriptors
and collecting descriptors:
Accession
Collecting
Accession_id
Accession_id
Accession name
Collection organization
Scientific name
Collecting date
Cultivar name
Country of collecting
Donor
Province/state
Acquisition date
Location of collecting site
More…
Type of sample…
The accession relation gather all the information about the accession.
The collecting relation describe all the information about the specific collecting
event
June 2008
SADC
Example of relations
Collect
1
Accession
Taxon
taxon_id
sci_name
eng_name
mandate
grin_no
thgw
1
N
1
1
accession_id
taxon_id
accession_name
country_id
acc_mandate
N
collect_id
accession_id
person_id
latitude
longitude
site_name
country_id
Seedstore
batch_id
accession_id
batch_no
collection
box_no
harvest_year
1
germin_test
N
June 2008
SADC
test_id
accession_id
batch_id
grm_pct
test_date
person_id
N
Wow
Instead of
repeating a lot of
information, You
store the
accession once
and link to other
tables with other
information!
By linking files or
information together, a
relationship is produced
between the tables.
Smart!
And effective!
June 2008
SADC
Part 2
Maybe a
short
brake?
June 2008
SADC
DBMS
A software system that enables users to define,
create, maintain and control access to the
database.
• The DBMS is the software that interacts with the users’ application
programs and the database.
• Typically a DBMS provides the following facilities…
June 2008
SADC
• It allows users to define the database
with types and structures and the
constraints on the data to be stored in
the database. (DDL)
• It allows users to insert, update, delete
and retrieve data from the database.
(DML)
• It provides controlled access to the
database.
June 2008
SADC
Example of DML and DDL
• Data manipulating language (DML)
– Select * from Person;
– Insert into Person (Name, Address, tel) values (‘John Book’,
‘California’, 123 45)
• Data Definition language (DDL) (Changes in structure)
– Create table Person (
Name varchar(40) NotNull,
Address char(10),
tel int)
June 2008
SADC
DBMS environment
Data
Procedures People
Hardware Software
Bridge
Human
Machine
June 2008
SADC
Users with software
and questions
User1
Web site, For
instance SESTO.
Question: List all
accessions
User2
DBMS
Interpretation of question
Data base language (SQL)
DDL & DML
June 2008
SADC
Data bases with information and meta data
Features in SESTO e.g.
project archive and pictures
archive
Entry levels with seed store
data: Genus, Taxons, Cultivars
and Accessions.
To list all Taxons in your gene
bank, click the
button
and information about the
taxons and their accessions
will be presented.
June 2008
SADC
To see the accession
list of this taxon
(Arachis hypogaea)
click the [select]-link
June 2008
SADC
To see detailed
information about this
accession click [select]
June 2008
SADC
The accession name SWZ1
from Zea mays gives this
information
To see all the information
from stored material click
this link
June 2008
SADC
To edit accession
information and storage
information, click the [edit]
links and a pop up window
will be presented
Two batches with 10
distribution bags are
stored in the Active
Storage in freezer 15 in
boxes 10 and 8.
June 2008
SADC
To add and edit
information
about an
accession.
The information
goes to the
database and will
be stored there
June 2008
SADC
A documentation system should
be…
• Reliable
• Retrieve information
fast
• User-friendly
• Flexible, should
anticipate changes
June 2008
SADC
Things you will not regret!
•Be organized
and structured
from the start!
•Be sure you
have a good
data model
Spend some
time on
normalization,
it will be worth
it later
June 2008
•Try to cooperate
and work
together with the
documentation
working group
•(Documentation is far to
important to be left alone
with the IT-staff)
SADC
Web services
• Web services are standardized programs
that send your data to databases
automatically.
• Your data will be published elsewhere
• This makes your work
recognized
• And will give your gene
bank credit and your
material will be asked for
• GBIF, Biocase
Have you heard
about the new
dataset…
June 2008
SADC
Last but not least…
Data is not information.
• Information must be retrieved and
understood.
• Then knowledge is created.
"Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?"
T.S.Eliot
(Where is the information we have lost in data?)
June 2008
SADC
Thank you for listening!
June 2008
SADC