[mb-users] Album Languages and Script discussion

Moving this from IRC to the mailing list as I missed Dupuy when he was
on there earlier.

Feature I'm referring to is currently documented at
http://musicbrainz.org/wd/AlbumLanguage and being tested on MB test
servers such as http://dev-mb.djce.org.uk/  The issue at the heart of
the matter is what the labeling should be based on.  As it says on the page
linked there:

"Albums have two linguistic attributes, language and script. The
language attribute (e.g. French) records the language of the album and
track titles, not the lyrics (if any), and the script attribute (e.g.
Latin) records the general type of characters in which the album and
track titles are written. In most cases, the script will be guessed
correctly, so you don't need to worry too much about it. The language
guessing is less accurate, and may be hard even for an experienced
person to determine."

This leads to cases such as Namie Amuro
http://musicbrainz.org/showartist.html?artistid=12204 (selected because
she's a fairly typical Japanese pop star with a long career).  Out of her
29 singles and 7 albums on Avex http://www.avexnet.or.jp/avexdb/amuro/
there's exactly one song named in Japanese, and that was in 1998.  By
the guidelines above they should almost all be labeled as English,
Latin.  This leaves me wondering what exactly is the point of having a
language feature if it is only used in this way.  My hope was that it
could be used, for example, for customized guess case functions (have it
apply the English or French style guide when you press guess, depending
on which language the release is labeled as being in) or to exclude
certain language combos from reports such as the TooMany/TooFew capitals
ones.  However, many of the Japanese albums that would need to be
excluded are going to be labeled as English, Latin, so those benefits
are lost.
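
To make the guess case idea concrete, here is a minimal Python sketch of
a guesser that dispatches on the release's language attribute.
Everything in it is hypothetical: the function names, the minor-word
list and the simplified French rule are assumptions for illustration,
not anything taken from the MB code or the actual style guides.

```python
# Hypothetical sketch: none of these names come from the MusicBrainz codebase.

ENGLISH_MINOR_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for"}


def guess_case_english(title):
    """Capitalize every word except minor words that are not first or last."""
    words = title.lower().split()
    result = []
    for i, word in enumerate(words):
        is_edge = i == 0 or i == len(words) - 1
        if is_edge or word not in ENGLISH_MINOR_WORDS:
            result.append(word[:1].upper() + word[1:])
        else:
            result.append(word)
    return " ".join(result)


def guess_case_french(title):
    """Simplified sentence-style casing: capitalize only the first letter."""
    lowered = title.lower()
    return lowered[:1].upper() + lowered[1:]


# Map the language attribute to the guesser that should handle it.
GUESSERS = {"English": guess_case_english, "French": guess_case_french}


def guess_case(title, language):
    """Apply the guesser for the release language; leave other titles untouched."""
    guesser = GUESSERS.get(language)
    return guesser(title) if guesser else title


print(guess_case("the end of the world", "English"))  # The End of the World
print(guess_case("LA VIE EN ROSE", "French"))         # La vie en rose
print(guess_case("CAN YOU CELEBRATE?", "Japanese"))   # left unchanged
```

A release labeled English, Latin would go through the English rules no
matter who performs it, which is exactly why the label chosen for
Japanese releases with English titles matters here.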

I'd try to summarize Dupuy's point of view but it was spread out over
several IRC conversations to the point that I can't.  I'd ask that he do
so here himself, please.  To address a few points from it that I was
clear on:

[13:09 05/12/05] <dupuy> mo, you don't have to use the language thing;
one of the nice things about using the titles as the determinant is that
Matthias code can use the language guesser, and it will usually guess
right (except for that 1% japanese stuff)
[13:09 05/12/05] <dupuy> so, "it just happens"
[13:10 05/12/05] <mo> ... it quesses the script you mean? (AFAC on test)
[13:10 05/12/05] <dupuy> guessing script is very easy, and 99% accurate
[13:10 05/12/05] <mo> yes
[13:11 05/12/05] <dupuy> guessing language is a little bit harder, not
much, and fairly accurate given a large corpus
[13:11 05/12/05] <dupuy> (body of works in the languages)
[13:11 05/12/05] <mo> it hasn't doe that for me so far on test, though
[13:12 05/12/05] <dupuy> I think that Dave actually may have added the
code for that
[13:12 05/12/05] <dupuy> and if it doesn't yet, it is at least
*possible* - guessing the language of lyrics that aren't stored in the
DB is impossible
[13:12 05/12/05] <dupuy> and we don't store lyrics, and won't do so any
time soon (i.e. ever)

What I get from that exchange is that we can label script automatically
but not language, and that we don't want to base the language on lyrics
because we don't store them.  My initial response is mostly puzzlement:
we also don't guess or store all sorts of other things, but we count on
users being able to know and enter them.  We don't store lyrics, fine,
but that doesn't mean we don't have people who know what language
something is in and can enter it manually.  What makes this so special
that we want to do it automatically as much as possible, even if that
means giving up other benefits?
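
For reference, a rough Python sketch of why script is the easy half to
guess.  This is not the code on the test server; it is an assumed
approach that takes the first word of each letter's Unicode character
name (LATIN, CYRILLIC, HIRAGANA, CJK and so on) and picks the most
frequent one.  The function name guess_script is made up.

```python
import unicodedata
from collections import Counter


def guess_script(titles):
    """Return the most common script keyword among the letters in the titles.

    The keyword is the first word of the Unicode character name, which is
    only an approximation of the real script property. Returns None when
    the titles contain no letters at all.
    """
    counts = Counter()
    for title in titles:
        for ch in title:
            if not ch.isalpha():
                continue  # skip digits, punctuation and whitespace
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(guess_script(["Can You Celebrate?", "Sweet 19 Blues"]))
print(guess_script(["Война и мир"]))
```

Note that the first call comes out as LATIN even though the titles are
from a Japanese release; the characters say nothing about who is singing
or in what language.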

[13:17 05/12/05] <dupuy> that's another reason why we have to use
titles, and not songs
[13:17 05/12/05] <dupuy> what language is beethoven's fifth symphony in?
[13:18 05/12/05] <dupuy> the title can be in german, english, japanese,

A good point.  But I have trouble seeing why pop music and classical
music couldn't be handled in different ways for this, just as they are
treated differently all over the rest of the site.

[13:09 05/12/05] <dupuy> mo, you don't have to use the language thing;
one of the nice things about using the titles as the determinant is that
Matthias code can use the language guesser, and it will usually guess
right (except for that 1% japanese stuff)
[12:25 05/12/05] <dupuy> that's enough; anyhow, i very much dislike
making general rules based on DJKC's weird edge cases

^^  This was one of the sections I was unclear on and wanted to hear
from Dupuy about before attempting a response, as I wasn't sure whether
he meant that only 1% of Japanese stuff would have problems, or that
only 1% of the stuff in the DB was Japanese so it didn't matter, or
something else.

To summarize, what I was unclear on and/or questioning was:
1, why does it matter if we don't have lyrics for determining language
automatically when we can just as easily have users enter it like the
rest of the information on the site?
2, while I can see the point of having fields for the title languages
and script, why only have those and not a third or fourth field for
language attributes such as performance language and original language?
3, To me it seems like most people not familiar with MB or not familiar
with Japanese artists, upon seeing a list such as
http://dev-mb.djce.org.uk/showartist.html?artistid=12204, would think the
albums were performed in English if they were all labeled as English,
Latin; wouldn't it therefore be more sensible to implement a performance
language album attribute before an album/song title language attribute?
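
As a rough illustration of points 2 and 3, the attributes could sit side
by side instead of the title language standing in for everything.  This
is only a sketch in Python; the class and field names are invented and
do not reflect the actual MB schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlbumLanguageInfo:
    """Hypothetical record; field names are not from the MB schema."""
    title_language: str                          # language of album/track titles
    title_script: str                            # script the titles are written in
    performance_language: Optional[str] = None   # entered by users, not guessed
    original_language: Optional[str] = None      # e.g. for translated works


def display_language(info):
    """Prefer the performance language when a user has entered one."""
    return info.performance_language or info.title_language


album = AlbumLanguageInfo("English", "Latin", performance_language="Japanese")
print(display_language(album))  # Japanese
```

The title attributes stay available for guess case and reports, while
the artist page can show the language people actually expect to see.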

Sending this to MB-users instead of the i18n list since users has a lot
more activity...
