Query: Artificial Intelligence and IGI Lists, etc

cgg · Posted by **cgg** » 15 December 2006, 4:34

Has anyone here attempted, or does anyone know of attempts at, applying AI
(e.g. PROLOG etc.) to IGI or other large family tree database files to find
"best match" parents, grandparents, and so on for an individual using criteria
such as name matches, naming conventions, birth-marriage-birth windows,
location, etc.? Given
a perfect database this might be relatively easy to do but where IGI records
miss a generation or two the difficulty is compounded and things could get a
little fuzzy. Any ideas or comments? What other uses could AI be put to in
genealogical research?

Colin

Kerry Raymond

I've thought about it from time to time (being an IT person) but it's a
nasty problem.

Even if you had a complete database of BDM events without any data entry
errors etc, it's still difficult because names are too common. You are
looking for William Shaw born in Dudley in about 1840, and the parish
registers have 5 of them being baptised in 1839-1841. Which is the right
one? We have no way of working these things out with human intelligence, so
it's hard to build artificial intelligence to do it.

Kerry

Guest · Posted by **Guest** » 16 December 2006, 2:14

Hi,

While I think that there may be some use in AI in matching/searching
some genealogy databases, such as census data, I suspect that trying to
apply machine logic to the IGI is akin to polishing a turd.

Having spent some years exploring names from all 19th century data
sources for my one-name study of Brebner/Bremner genealogy in the
north-east of Scotland, it's my experience that the IGI contains, at
best, perhaps 75% of actual births/christenings, and often less than
30% in certain areas. Trying to ascertain parents based on such a
small subset of actual data makes any computer program useless. I think
you'll find that even experienced researchers are often surprised,
mystified and frequently chagrined at the actual parents that turn up
from more reliable Scottish sources, such as post-1855 marriage and
death records, especially after making what seem to be iron-clad
assumptions.

Now having said that, there may be some use for a program that looks at
illegitimate children in a given area, based on father's occupation
(assuming that is not simply "Ag. Lab"!) I have had some success in
that area, but that doesn't require a lot of programming, just a few
years of data entry and collation of BMD, census and address
information! And more than a little bit of Scottish common sense!

Perhaps some computer programming could also be applied to looking at
immigrants... for example, to see if someone who was alive in Scotland
in 1851, 1861 and 1871, and having no death record after that time in Scotland,
could be matched to a Canadian or US census. But again, the fact that
those available censuses are only a subset of the larger census set
that unfortunately does not include Australian, South African or Hong
Kong censuses, to name only three, makes the accuracy of such a program
doomed from the start.

I'm sorry to suggest that while your idea may have some merit in the
full availability of 20th century records worldwide, I regret that I'm
unlikely to be around in 2100 to benefit from the release of those
pertinent records!

John

Joe Roberts · Posted by **Joe Roberts** » 16 December 2006, 2:49

You've gotten a couple of effective replies from folks who clearly
appreciate the complexity of that kind of task. I can't add a thing to help
there.

Here's just one small observation ... Really, a question.

I'm wondering if there's any available, effective software that could do
this:

... Search a set of multiple data (plain text) records, e.g. GEDCOM
files.

... Find selected criteria that exist in common inside those records,
in selected fields, e.g. names (and) birthplaces (and) dates ... (and...)

... But do it in a "smart" fashion, parsing data that are not
identical, but similar.

For example, finding some similarities in names, e.g. "Brebner" and
"Bremmer", as well as "Braebner".

- - -

So for example, one might load up a pile of files, e.g. GEDCOM but maybe
other text formats. Then one could do one search through all of them, for
multiple matches, in multiple types of data fields.

It's different from apps which do simple word-based searching through text
files. This would involve some kind of "smart" matching (for want of a
better term), to try to narrow down the searching to a desired combination
of matching names, dates, places, (and spouses, siblings, ... whatever).

- - -

I realize that may be what happens with a standard Ancestry search, for
example with a range of dates and Soundex enabled. Maybe or
maybe not -- I'm not privy to their algorithms for searching.

I'm just wondering if there's some offline, standalone application which can
do that kind of "smart" parsing and searching through multiple files.
Downloaded GEDCOMs would be one example of a format, but maybe there would
be others too.

Whether based on AI or otherwise, is there anything like that out there?

Joe

Hugh Watkins · Posted by **Hugh Watkins** » 16 December 2006, 4:21

try googling "semantic web" which has to come first

http://wc.rootsweb.com/
has 460 million names to play with

Ancestry World Tree and One World Tree are other ways of searching it

Hugh W

--

a wonderful artist in Denmark
http://www.ingerlisekristoffersen.dk/

Beta blogger
http://snaps4.blogspot.com/ photographs and walks

old blogger GENEALOGE
http://hughw36.blogspot.com/ MAIN BLOG

Kerry Raymond · Posted by **Kerry Raymond** » 17 December 2006, 0:32

... Find selected criteria that exists in common inside those records,
in selected fields, e.g. names (and) birthplaces (and) dates ... (and...)

... But do it in a "smart" fashion, parsing data that are not
identical, but similar.

For example, finding some similarities in names, e.g. "Brebner" and
"Bremmer", as well as "Braebner".

This is more feasible and to some extent it does exist in various forms in
various software. For example, Family Tree Maker has a "duplicate matching"
algorithm that is intended to spot likely duplicate people in your file, but
while it does spot some, it also gets a lot wrong, both false positives
(thinks people are duplicates who aren't) and false negatives (fails to spot
two people that seem identical to me).

However, I don't know of a tool that does it in a general way. I've often
thought about the problem myself. What you need to do is to take each pair
of entries and go through them comparing each type of fact and applying a
series of matching tests giving a matching score (out of 10 say) and then
using some kind of weighting algorithm across those tests to give an overall
matching score for that fact. Then repeat that process (matching tests and
weights) to combine those across the facts to give an overall "likelihood of
match" score. The ideal tool would allow you to configure which matching
tests were to be applied, what the weightings were within and between each
kind of fact.
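A minimal Python sketch of the scoring scheme described above. Everything named here (`exact`, `same_initial`, `CONFIG`, the weights) is a made-up placeholder to show the shape of the idea, not an existing tool: each fact has a list of weighted tests, and the facts themselves are weighted to give the overall score.

```python
def exact(a, b):
    """Test: 10 if the two values are identical apart from case and spacing."""
    return 10.0 if a.strip().lower() == b.strip().lower() else 0.0

def same_initial(a, b):
    """Test: 10 if both values start with the same letter."""
    return 10.0 if a and b and a[0].lower() == b[0].lower() else 0.0

def weighted_average(scored):
    """Combine (score, weight) pairs into one score on the same 0-10 scale."""
    total_weight = sum(weight for _, weight in scored)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in scored) / total_weight

# Hypothetical, user-configurable setup:
# fact -> (weight of the fact, [(test, weight of the test), ...])
CONFIG = {
    "name": (3.0, [(exact, 2.0), (same_initial, 1.0)]),
    "birth_year": (2.0, [(exact, 1.0)]),
    "birth_place": (1.0, [(exact, 1.0)]),
}

def fact_score(fact, a, b, config=CONFIG):
    """Weighted score of all tests configured for one fact."""
    _, tests = config[fact]
    return weighted_average([(test(a, b), weight) for test, weight in tests])

def match_score(rec_a, rec_b, config=CONFIG):
    """Overall 'likelihood of match' score across the facts both records have."""
    scored = []
    for fact, (fact_weight, _) in config.items():
        if fact in rec_a and fact in rec_b:  # skip facts missing from either record
            score = fact_score(fact, rec_a[fact], rec_b[fact], config)
            scored.append((score, fact_weight))
    return weighted_average(scored)
```

For instance, `match_score({"name": "William Shaw", "birth_year": "1840"}, {"name": "Wm Shaw", "birth_year": "1840"})` combines a weak name score with a perfect year score. Swapping in better tests or different weights only means editing `CONFIG`.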

Examples of facts (which include relationships):

* name
* mother
* father
* spouse
* child
* sibling
* date of birth
* date of death
* date of marriage(s)
* etc

Now different matching rules would apply to different facts.

For example with names, you would be looking to apply rules that look for
subsets and supersets of names, e.g. Mary Smith, Mary Jane Smith, as well as
variant names Mary/Maria/Marie Smith/Smythe. Now some of these variants you
can get using Soundex or similar algorithms (which examines the letters in
the name). Others would have to be drawn from pre-compiled lists of
variants, Bill/William, Minnie/Wilhelmina, Margaret/Peggy/Peg, since they
defy any kind of algorithmic detection!
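As a rough illustration of those name rules, here is a Python sketch that combines a standard Soundex code with a small variant table and a subset test on given names. The `VARIANTS` table, the `name_score` function and its score values are assumptions made up for the example; a real table would be far larger.

```python
def soundex(name):
    """Classic four-character Soundex code, e.g. 'Smith' -> 'S530'."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]

# Hypothetical pre-compiled variant list (assumption, far from complete).
VARIANTS = [
    {"bill", "william", "will", "wm"},
    {"minnie", "wilhelmina"},
    {"margaret", "peggy", "peg"},
]

def same_given_name(a, b):
    """True if two given names are equal, listed variants, or share a Soundex code."""
    a, b = a.lower(), b.lower()
    if a == b or any(a in group and b in group for group in VARIANTS):
        return True
    return soundex(a) == soundex(b)

def name_score(full_a, full_b):
    """Score 0-10; assumes each name has at least a surname as its last word."""
    *given_a, surname_a = full_a.split()
    *given_b, surname_b = full_b.split()
    if soundex(surname_a) != soundex(surname_b):
        return 0.0
    shorter, longer = sorted((given_a, given_b), key=len)
    if not shorter:
        return 5.0  # only a surname to go on
    matched = sum(any(same_given_name(g, h) for h in longer) for g in shorter)
    if matched < len(shorter):
        return 2.0  # surname agrees but a given name conflicts
    return 10.0 if len(shorter) == len(longer) else 8.0  # 8 = subset/superset
```

With these rules "Bill Smith" and "William Smythe" score 10 and "Mary Smith" against "Mary Jane Smith" scores 8. Note that Soundex alone pairs Brebner with Braebner (both B615) but not with Bremmer (B656), which is one reason the variant lists are needed as well.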

For example with dates, you need to consider how close exact dates are.
Clearly 18 August 1900 and 19 August 1900 are a "close match". But what
about matching rules on same birthday 18 Aug 1900 and 18 Aug 1801? Or
day/month reversal 6 Nov 1900 versus 11 June 1900 (common translation
mistake between British and American format)? Then there are inexact dates.
How close a match are "nested" dates like 18 August 1900 to about 18 August
1900 to August 1900 to 1900 to 1890-1910 to 1850-1950? How close are
1800-1900 to 1850-1950? And of course there are simple abbreviations and
format issues: 18 Aug 1900, 18 August 1900, August 18, 1900, 18/8/1900,
8/18/1900, 18-8-1900.
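One way to handle these cases is to treat every date as a range (earliest day, latest day) and score overlaps. The Python sketch below does that for a handful of formats only; the `parse` and `date_score` names and all the score values are arbitrary assumptions, and numeric forms such as 18/8/1900 are left out because they are ambiguous without knowing the writer's convention.

```python
import calendar
import re
from datetime import date

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

def parse(text):
    """Return (earliest, latest) for a few illustrative date formats."""
    t = text.strip().lower()
    m = re.fullmatch(r"(\d{4})-(\d{4})", t)             # 1890-1910
    if m:
        return date(int(m[1]), 1, 1), date(int(m[2]), 12, 31)
    m = re.fullmatch(r"(\d{4})", t)                     # 1900
    if m:
        return date(int(m[1]), 1, 1), date(int(m[1]), 12, 31)
    m = re.fullmatch(r"([a-z]+) (\d{4})", t)            # August 1900
    if m:
        year, month = int(m[2]), MONTHS[m[1][:3]]
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    m = re.fullmatch(r"(\d{1,2}) ([a-z]+) (\d{4})", t)  # 18 Aug 1900
    if m:
        day = date(int(m[3]), MONTHS[m[2][:3]], int(m[1]))
        return day, day
    raise ValueError("unsupported date format: " + text)

def swapped(d):
    """Date with day and month exchanged, or None if that is not a valid date."""
    try:
        return date(d.year, d.day, d.month)
    except ValueError:
        return None

def date_score(a, b):
    """Score 0-10 for how well two date strings agree (values are arbitrary)."""
    (a0, a1), (b0, b1) = parse(a), parse(b)
    if (a0, a1) == (b0, b1):
        return 10.0
    overlap_days = (min(a1, b1) - max(a0, b0)).days + 1
    if overlap_days > 0:
        # Nested or overlapping: the wider the looser range, the weaker the match.
        wider = max((a1 - a0).days, (b1 - b0).days) + 1
        for limit, score in ((31, 9.0), (366, 8.0), (366 * 21, 6.0)):
            if wider <= limit:
                return score
        return 3.0
    if a0 == a1 and b0 == b1:                 # two exact dates that disagree
        if abs((a0 - b0).days) <= 2:
            return 7.0                        # a day or two apart
        if swapped(a0) == b0:
            return 5.0                        # day/month reversal
        if (a0.day, a0.month) == (b0.day, b0.month):
            return 3.0                        # same birthday, different year
    return 0.0
```

Under these made-up values, "18 Aug 1900" against "1900" scores 8, against "1890-1910" scores 6, and "6 Nov 1900" against "11 June 1900" scores 5 as a likely reversal.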

With places, you have the similar issues. You have the "nesting" of places
within other places, e.g. "Annerley, Brisbane", "Brisbane, Australia",
"Brisbane, Queensland", "Brisbane, Queensland, Australia", "Queensland", all
of course with different degrees of matching. You have abbreviations and
variants New South Wales vs NSW, people adding and omitting words like County
and Shire. Unlike dates where you can work a lot of these out
programmatically (e.g. 18 Aug 1900 is within 1900), you need a precompiled
knowledge of what places are within other places (for nesting) as well as
pre-compiled lists of variants/abbreviations and words whose
inclusion/omission is not significant (Shire, County). And what about places
that have question marks, like "Queensland?"?
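A small Python sketch of the place rules, assuming three hand-made tables: `ABBREVIATIONS`, `IGNORED_WORDS` and a `PARENT` gazetteer that records which place lies inside which. All three are tiny illustrative stand-ins, and the scoring (2 points lost per level of nesting, a 20% discount for a question mark) is arbitrary. For simplicity only the most specific part of each place string is compared.

```python
# Hypothetical pre-compiled tables (assumptions for illustration only).
ABBREVIATIONS = {"nsw": "new south wales", "qld": "queensland"}
IGNORED_WORDS = {"county", "shire"}
PARENT = {  # each place maps to the place that contains it
    "annerley": "brisbane",
    "brisbane": "queensland",
    "queensland": "australia",
    "new south wales": "australia",
}

def normalise(place):
    """Return (parts, uncertain): cleaned parts, most specific first, plus a '?' flag."""
    uncertain = "?" in place
    parts = []
    for part in place.replace("?", "").split(","):
        words = [w for w in part.lower().split() if w not in IGNORED_WORDS]
        name = " ".join(words)
        name = ABBREVIATIONS.get(name, name)
        if name:
            parts.append(name)
    return parts, uncertain

def ancestors(name):
    """The place followed by every place that contains it, per the gazetteer."""
    chain = [name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def place_score(a, b):
    """Score 0-10 for two place strings, using the most specific part of each."""
    parts_a, unsure_a = normalise(a)
    parts_b, unsure_b = normalise(b)
    if not parts_a or not parts_b:
        return 0.0
    chain_a, chain_b = ancestors(parts_a[0]), ancestors(parts_b[0])
    if chain_a[0] == chain_b[0]:
        score = 10.0
    elif chain_a[0] in chain_b:
        score = max(10.0 - 2.0 * chain_b.index(chain_a[0]), 2.0)  # b lies inside a
    elif chain_b[0] in chain_a:
        score = max(10.0 - 2.0 * chain_a.index(chain_b[0]), 2.0)  # a lies inside b
    else:
        score = 0.0
    if unsure_a or unsure_b:
        score *= 0.8  # a question mark weakens the evidence
    return score
```

So "NSW" and "New South Wales" score 10, "Annerley, Brisbane" against "Brisbane, Queensland, Australia" scores 8, and "Queensland?" against "Brisbane, Queensland" scores 6.4.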

With relationships though, you get into more complex matching rules. If two
men have the same mother, how likely are they to be the same man? Compared
with if two men have the same wife?

Relationships are a recursive matching problem, because the people to whom
they are related are also being matched as part of the same process. For
example, what if Bill Smith has mother Peg Jones and William Smith has
mother Margaret Jones? If you have already assigned a high matching score to
Peg and Margaret based on other facts (similar names, same date of birth,
etc), then that in turn increases the likelihood that Bill Smith is the same
man as William Smith.

So I think an algorithm would need to work in two phases. Firstly come up
with matching scores for individuals without using relationships. Then in
the next phase, look to see if highly-matched individuals have relationships
that also generate high-matchings, so this phase would reinforce or reduce
the scores of the matchings generated in the first phase. Essentially this
phase is matching families more than just individuals. After all, no matter
how similar two individuals look, if they have different parents, spouses
and children, you'd have to think they were separate people.
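The second phase might look something like the Python sketch below. It assumes phase one has already produced a dictionary of pair scores, and that each person has a small table of relatives keyed by role; the `reinforce` name, the boost and penalty sizes and the thresholds are all invented for the example. It makes a single pass using the phase-one scores and treats an unscored pair of relatives as "different", which is a simplification.

```python
def reinforce(base, relatives, boost=2.0, penalty=3.0, high=7.0, low=3.0):
    """Adjust phase-one scores using the scores of matching relatives.

    base:      {(id_a, id_b): phase-one score 0-10}
    relatives: {person_id: {role: relative_id}}, e.g. {"bill": {"mother": "peg"}}
    """
    adjusted = {}
    for (a, b), score in base.items():
        new_score = score
        for role, rel_a in relatives.get(a, {}).items():
            rel_b = relatives.get(b, {}).get(role)
            if rel_b is None:
                continue  # no relative in this role on the other side to compare
            if rel_a == rel_b:
                rel_score = 10.0  # literally the same record
            else:
                # Unscored pairs default to 0.0, i.e. assumed to be different people.
                rel_score = base.get((rel_a, rel_b), base.get((rel_b, rel_a), 0.0))
            if rel_score >= high:
                new_score += boost      # relatives match well: reinforce
            elif rel_score <= low:
                new_score -= penalty    # relatives clearly differ: reduce
        adjusted[(a, b)] = min(10.0, max(0.0, new_score))
    return adjusted

# Example with made-up phase-one scores.
base = {("bill", "william"): 6.0, ("peg", "margaret"): 8.0}
relatives = {"bill": {"mother": "peg"}, "william": {"mother": "margaret"}}
result = reinforce(base, relatives)
```

Here Bill/William starts at 6 and rises to 8 because their mothers Peg/Margaret already scored 8. Had the mothers not matched at all, the same pair would have dropped to 3.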

This second phase would need to have a lot of knowledge about relationships
and probabilities in its rules. For example, two people with the same
parents and the same date of birth could be the same person, but they could
be twins.

Having come up with all these matchings, you then may want a merging phase.
At this point, you probably want to work interactively. Because deciding two
highly-matched individuals are the same may lead to a pairing-off of their
various relatives (matching whole families), that is probably something you
want the user to decide. But on the other hand, the user doesn't want to be
presented with every possible matching for consideration, so setting the
threshold for when to ask the user is important too (and should probably be
user-adjustable as well).
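The user-adjustable threshold could be as simple as this Python sketch, where `triage` and `ask_above` are hypothetical names: pairs at or above the threshold are queued for the user, best first, and the rest are left alone.

```python
def triage(scores, ask_above=7.0):
    """Split scored pairs into those to show the user (best first) and the rest."""
    to_review = sorted(
        (pair for pair, score in scores.items() if score >= ask_above),
        key=lambda pair: scores[pair],
        reverse=True,
    )
    skipped = [pair for pair, score in scores.items() if score < ask_above]
    return to_review, skipped

# Example with made-up scores.
scores = {("a", "b"): 9.1, ("c", "d"): 7.0, ("e", "f"): 4.2}
to_review, skipped = triage(scores, ask_above=7.0)
```

Lowering `ask_above` shows the user more borderline candidates; raising it keeps the review list short at the cost of missing some true duplicates.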

In merging, the problem is also about how to merge facts? Some facts are
"nested" so you would generally go with the more precise one (18 Aug 1900
instead of 1900). But what if they are similar but don't nest, e.g.
"18 Aug 1900" and "18 Aug 1901"?
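For dates held as (earliest, latest) ranges, the nesting rule for merging can be sketched in Python as follows. `merge_ranges` is a hypothetical helper: it keeps the more precise range when one nests inside the other, and otherwise refuses to choose and flags the pair for the user.

```python
from datetime import date

def merge_ranges(a, b):
    """Merge two (earliest, latest) ranges; return (merged, needs_review)."""
    if b[0] <= a[0] and a[1] <= b[1]:
        return a, False   # a nests inside b: keep the more precise a
    if a[0] <= b[0] and b[1] <= a[1]:
        return b, False   # b nests inside a: keep the more precise b
    return None, True     # similar or not, they do not nest: ask the user

# Example: an exact day, the whole year 1900, and a conflicting exact day.
exact_day = (date(1900, 8, 18), date(1900, 8, 18))
whole_year = (date(1900, 1, 1), date(1900, 12, 31))
next_year = (date(1901, 8, 18), date(1901, 8, 18))
```

`merge_ranges(exact_day, whole_year)` keeps the exact day, while `merge_ranges(exact_day, next_year)` returns `(None, True)` so that a person decides.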

All in all, it's a hard problem, but I think if you set up a framework using
matching rules and scores along these lines, then you can build up a
database of rules (and all those pre-compiled lists and tables) over time
and plug them in. It's an area where more rules probably get you better
overall outcomes. And then you can build rules around corner cases and so
forth.

Maybe when I retire ...

Kerry

Joe Roberts · Posted by **Joe Roberts** » 17 December 2006, 1:09

"Kerry Raymond" wrote:

(huge snip for brevity only -- please see Kerry's excellent explanation.)

It looks like it would be a daunting task just for the programming, even
before building the unique tables of rules.

It's no wonder that that kind of application apparently hasn't been tackled
yet. Thank you for giving such a comprehensive analysis.

Joe

Hugh Watkins · Posted by **Hugh Watkins** » 17 December 2006, 21:57

ancestry.com uses this on all searches
you may choose exact with / without soundex

or not exact

which is usually annoying but sometimes produces stunning results
shopping list style
most likely at the top

ancestry world tree is going in the same direction and improving
compares submitted data instead of just listing it

Hugh W

cgg · Posted by **cgg** » 19 December 2006, 22:08

Thanks for the useful insights. Clearly not a realistic expectation. Wouldn't
it be wonderful were it that easy? But then we might miss those special
serendipities that come from faithful research. Thanks again! - Colin
