Data Import Policy

Andre Wiethoff
Hello everybody,

it's me again, with some more weird ideas ;-)
I would like to ask whether it is fine to insert metadata into the
MusicBrainz DB that was collected automatically by crawling and mining
web data.

Basically, I would like to work on these two automatic crawling/data
mining tasks:

1) Add links for more artists to the AMG, Amazon and BBC web pages
(which would be matched automatically by crawling the relevant web
pages, very conservatively of course). Links to the "Musixmatch" lyrics
website ( https://www.musixmatch.com ) would also be interesting. It
seems legit, so can it be added to the lyrics site whitelist?

2) Add a new kind of relationship (which would need to be approved
first) that I would call "similar to". I would mainly add similarities
between two artists automatically, but other similarities would of
course also be possible (song similarity, etc.). As this kind of data is
highly subjective, it might be worth considering whether only data from
automatic web/database mining should be accepted as input (and no manual
input from users)... This kind of data is available on AMG, Amazon and
BBC, which could be crawled automatically (and only two artist IDs would
be added to the database as similar). Of course, later on similarity
could also be calculated by an algorithm using some scrobbling data.

Are the two scenarios permitted by the data import policy of MusicBrainz
(e.g. do they avoid any copyright problems)? I think the first case
shouldn't create any problems at all, since a link by itself credits the
data source (it is just a link).
The second case is more difficult, as the data is created by the
respective companies and would be inserted into a new database owned by
somebody else. On the other hand, only two IDs are stored (IDs which
only make sense in MusicBrainz). Can that data violate copyright? I
think it is a bit similar to Google crawling pages and providing the
results as their own...

What do you think?

Best regards,

Andre

PS: Here is a small breakdown of the URLs stored in the system for some
link targets:

URLs total: 2117159
    Discogs total: 567330
       Discogs Release: 212163
       Discogs Artist: 200194
       Discogs Master: 126747
       Discogs label: 26405
    Allmusic total: 55395
       Allmusic Artist: 29943
       Allmusic Album: 20290
       Allmusic Composition: 4216
    Amazon total: 183044
       Amazon Product: 182683
       Amazon Artist: 229
    BBC total: 9805
       BBC Artist: 1347
       BBC Reviews: 8208

    Soundcloud: 26166
    Youtube total: 29747
       Youtube User Channel: 15127
       Youtube Video: 12723
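
To put the counts above into proportion, here is a minimal sketch that computes each site's share of all stored URLs and how many of its links fall outside the listed subtypes (the subtype counts do not add up to the per-site totals). The numbers are copied from the list; the dictionary layout and the function name `summarize` are only illustrative assumptions.

```python
# Counts copied from the list above; the data layout is an illustrative assumption.
URLS_TOTAL = 2117159

link_counts = {
    "Discogs": (567330, [212163, 200194, 126747, 26405]),
    "Allmusic": (55395, [29943, 20290, 4216]),
    "Amazon": (183044, [182683, 229]),
    "BBC": (9805, [1347, 8208]),
    "Youtube": (29747, [15127, 12723]),
}


def summarize(total, parts):
    """Return (share of all URLs in percent, links not covered by the listed subtypes)."""
    share = round(100.0 * total / URLS_TOTAL, 1)
    uncategorized = total - sum(parts)
    return share, uncategorized


summary = {site: summarize(total, parts) for site, (total, parts) in link_counts.items()}

for site, (share, rest) in summary.items():
    print(f"{site}: {share}% of all URLs, {rest} links outside the listed subtypes")
```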

I wonder why there are not more BBC artist links, as these links are
just http://www.bbc.co.uk/music/artists/<musicbrainz artist id>. It
should be pretty easy to add them...
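
As a rough illustration of the URL pattern just mentioned, the following sketch builds a BBC artist address from a MusicBrainz artist ID. The function name `bbc_artist_url` is hypothetical, and the ID format check simply relies on Python's standard `uuid` module. The sketch only constructs the address; it says nothing about whether the resulting page exists or has content worth linking.

```python
import uuid

# URL pattern as quoted in the paragraph above.
BBC_ARTIST_URL = "http://www.bbc.co.uk/music/artists/{mbid}"


def bbc_artist_url(mbid: str) -> str:
    """Build a BBC Music artist URL from a MusicBrainz artist ID (a UUID)."""
    parsed = uuid.UUID(mbid)  # raises ValueError if the ID is not a well-formed UUID
    return BBC_ARTIST_URL.format(mbid=str(parsed))


# Artist ID taken from a BBC URL posted elsewhere in this thread.
print(bbc_artist_url("456eabce-d1dd-4481-a206-36ab4f2eaeb8"))
```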


_______________________________________________
MusicBrainz-devel mailing list
[hidden email]
http://lists.musicbrainz.org/mailman/listinfo/musicbrainz-devel

Re: Data Import Policy

tommycrock


On 11 May 2015 at 11:24, Andre Wiethoff <[hidden email]> wrote:
> Hello everybody,
>
> its again me, with some other weird ideas ;-)
> I would like to ask whether an automatic metadata collection/crawling
> for insertion in the Musicbrainz DB is fine, which will be created by
> web data mining?
>
> Basically I would like to work on these two automatic data
> crawlings/data minings:
>
> 1) Add links for more artists to the AMG, Amazon and BBC web pages
> (which will be automatically be matched by crawling the appropriate web
> pages - of course very conservatively).

I think the community appreciates *very* conservative bots. You might want
to check out https://musicbrainz.org/doc/Bots,
https://musicbrainz.org/doc/Code_of_Conduct/Bots and
https://github.com/murdos/musicbrainz-bot

> Also interesting would be links
> to the "Musixmatch" lyrics website ( https://www.musixmatch.com ). It
> seems legit, so can it be added to the lyrics site whitelist?

You should put in a ticket for this (see
https://musicbrainz.org/doc/Proposals). Lyrics sites used to be handled
differently from all other relationships (they may still be), needing
approval from ruaok.

> 2) Adding a new kind of relation (which also would need to be approved
> first), I would call it "similar to". Basically I would mainly
> automatically add similarities between two artists, but of course also
> other similarities would be possible (song similarity, etc.). As this
> kind of data is highly subjective, it might be a thought whether only
> data by automatic web/database data mining would be accepted as input
> (and no manual input of users)... This kind of data is available on AMG,
> Amazon and BBC, which could be automatically be crawled (and only two
> artist IDs would be added to the database as similar). Of course lateron
> similarity could also be calculated by an algorithm using some
> scrobbling data.

This sounds dodgy to me - I'd at least want to know what the basis of their
claim of similarity was. Also, be particularly careful what you do with AMG
data! But you could always discuss the general idea on the forums. That's
probably a better place to get wider views of the community. You could also
try the style list, where new relationships used to be discussed.

> <snip>
>
> I wonder why there are not more BBC artist links, as these links are
> just http://www.bbc.co.uk/music/artists/<musicbrainz artist id>. It
> should be pretty easy to add them...

Sounds like a good plan. I guess no-one's got around to it, or thought the
Beeb might do it themselves.


Re: Data Import Policy

Ian McEwen
On Mon, May 11, 2015 at 03:20:40PM +0100, Tom Crocker wrote:

> On 11 May 2015 at 11:24, Andre Wiethoff <[hidden email]>
> wrote:
>
> > Hello everybody,
> >
> > its again me, with some other weird ideas ;-)
> > I would like to ask whether an automatic metadata collection/crawling
> > for insertion in the Musicbrainz DB is fine, which will be created by
> > web data mining?
> >
> > Basically I would like to work on these two automatic data
> > crawlings/data minings:
> >
> > 1) Add links for more artists to the AMG, Amazon and BBC web pages
> > (which will be automatically be matched by crawling the appropriate web
> > pages - of course very conservatively).
>
>
> I think the community appreciate *very* conservative bots. You might want
> to check out https://musicbrainz.org/doc/Bots
> https://musicbrainz.org/doc/Code_of_Conduct/Bots and
> https://github.com/murdos/musicbrainz-bot
>
> Also interesting would be links
> > to the "Musixmatch" lyrics website ( https://www.musixmatch.com ). It
> > seems legit, so can it be added to the lyrics site whitelist?
> >
>
> You should put in a ticket for this https://musicbrainz.org/doc/Proposals -
> Lyric sites used to be different from all other relationships (they may
> still be), needing approval from ruaok
>
>
> > 2) Adding a new kind of relation (which also would need to be approved
> > first), I would call it "similar to". Basically I would mainly
> > automatically add similarities between two artists, but of course also
> > other similarities would be possible (song similarity, etc.). As this
> > kind of data is highly subjective, it might be a thought whether only
> > data by automatic web/database data mining would be accepted as input
> > (and no manual input of users)... This kind of data is available on AMG,
> > Amazon and BBC, which could be automatically be crawled (and only two
> > artist IDs would be added to the database as similar). Of course lateron
> > similarity could also be calculated by an algorithm using some
> > scrobbling data.
> >
>
> This sounds dodgy to me - I'd at least want to know what the basis of their
> claim of similarity was. Also, be particularly careful what you do with AMG
> data! But you could always discuss the general idea on the forums. That's
> probably a better place to get wider views of the community. You could also
> try the style list, where new relationships used to be discussed.
>
I agree; this is far too vague to be used properly, if there even is a
clear definition of using such a thing properly.

>
> >
> > <snip>
> >
> > I wonder why there are not more BBC artist links, as these links are
> > just http://www.bbc.co.uk/music/artists/<musicbrainz artist id>. It
> > should be pretty easy to add them...
> >
>
> Sounds like a good plan. I guess no-one's got around to it or thought the
> beeb might do it themselves
BBC Music links aren't to be added unless they have substantial content
that isn't just from MusicBrainz, because obviously if you want a BBC
Music link you can just construct it, so a link is really only useful if
there's something there beyond what is on MB.

See
https://musicbrainz.org/relationship/d028a975-000c-4525-9333-d3c8425e4b54

So it's not simple to add them automatically, and this fact also
explains the limited numbers to some degree.


Re: Data Import Policy

Paul Taylor-2
In reply to this post by Andre Wiethoff
On 11/05/2015 11:24, Andre Wiethoff wrote:

> Hello everybody,
>
> its again me, with some other weird ideas ;-)
> I would like to ask whether an automatic metadata collection/crawling
> for insertion in the Musicbrainz DB is fine, which will be created by
> web data mining?
>
> Basically I would like to work on these two automatic data
> crawlings/data minings:
>
> 1) Add links for more artists to the AMG, Amazon and BBC web pages
> (which will be automatically be matched by crawling the appropriate web
> pages - of course very conservatively). Also interesting would be links
> to the "Musixmatch" lyrics website ( https://www.musixmatch.com ). It
> seems legit, so can it be added to the lyrics site whitelist?
>
>
>
> PS: Here a small evaluation of the URLs stored in the system for some
> link targets:
>
> URLs total: 2117159
>      Discogs total: 567330
>         Discogs Release: 212163
>         Discogs Artist: 200194
>         Discogs Master: 126747
>         Discogs label: 26405
>
Andre

I have a particular interest in improving the links between MusicBrainz
(artists, releases and labels) and Discogs. My view is that it is very
difficult to link correctly 100% of the time, but quite easy to find
potential links that are right at least 95% of the time, so my solution
was to generate potential links and make them available so that others
can submit them if they are interested. This takes an artist-centric
approach and concentrates on finding links for releases, and on making
it easier to import releases from Discogs for any artist.

I do plan to create reports showing potential artist and label links as
well.

This is all available at http://albunack.net and might be of interest to
you.

Whilst adding a link doesn't modify existing data, it would be more
useful if additional data from the linked entity was also added, but of
course you then have to be even more sure it is correct. For example, if
two releases are matched on various metadata criteria but the
MusicBrainz release did not have a barcode entered and the Discogs one
does, then it makes sense to add the Discogs barcode at the same time as
linking it. It is possible to do this using the seed release mechanism;
what is not possible is to edit existing data, i.e. you can add a
barcode but not modify an existing barcode.
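
The "add but never modify" rule described above can be sketched as follows. This is only an illustration of the rule, not the seed release mechanism itself: the function name `proposed_barcode` and the use of plain strings for barcodes are assumptions made for the example.

```python
from typing import Optional


def proposed_barcode(mb_barcode: Optional[str], discogs_barcode: Optional[str]) -> Optional[str]:
    """Return a barcode to add alongside a new link, or None if nothing should be added.

    Hypothetical helper: an existing MusicBrainz barcode is never overwritten.
    """
    if mb_barcode:
        # MusicBrainz already has a value: leave it alone, even if Discogs differs.
        return None
    if not discogs_barcode:
        # Nothing to copy over.
        return None
    return discogs_barcode.strip() or None


# Made-up barcode values, for illustration only.
print(proposed_barcode(None, "0123456789012"))             # barcode is missing, so propose it
print(proposed_barcode("0123456789012", "9999999999999"))  # barcode exists, so propose nothing
```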

Paul/ijabz



Re: Data Import Policy

tommycrock

Sorry Paul. I meant to point out your site and say what I've seen of it seems great. I just wish I could find the time to try it out more!



Re: Data Import Policy

Paul Taylor-2
On 11/05/2015 19:30, Tom Crocker wrote:
>
> Sorry Paul. I meant to point out your site and say what I've seen of
> it seems great. I just wish I could find the time to try it out more!
>
Tom

Thanks, good to have some positive feedback, as so far I've had just
about zilch.

Paul


Re: Data Import Policy

Andre Wiethoff
In reply to this post by Paul Taylor-2
Dear Paul,

> I have a particular interest in improving the links between Musicbrainz
> (artist, releases and labels) with Discogs. I took the approach it is
> very difficult to correctly link 100% of the time but it is quite easy
> to find potential links that are right at least 95% of the time so my
> solution was generate potential links and make them available so others
> can submit them if they are interested. This takes an artist centric
> approach and concentrates on finding links for releases and making it
> easier to import releases from discogs as well for any artist.
>
> I do plan to create reports showing potential artist and labels as well.
>
> This is all available at http://albunack.net and might be of interest to
> you.
>
> Whilst adding a link doesn't modify existing data, it would be more
> useful if additional data from the linked entity was also added but of
> course you have to be even sure it is correct. For example if two
> releases are matched on various metadata criteria but the MusicBrainz
> release did not have a barcode entered and the Discogs one does then it
> makes sense to add the Discogs barcode at the same as linking it. It is
> possible to do this using the seed release mechanism, what is not
> possible is to edit existing data, i.e you can add a barcode but not
> modify an existing barcode.
>
Thanks for sending the information about your project, I will definitely
have a look at it!

Best regards,

Andre




Re: Data Import Policy

"Frederik “Freso” S. Olesen"
In reply to this post by Andre Wiethoff
Den 11-05-2015 kl. 12:24 skrev Andre Wiethoff:
> I would like to ask whether an automatic metadata collection/crawling
> for insertion in the Musicbrainz DB is fine, which will be created by
> web data mining?

You've got replies on most stuff, so I'll just skip over those and add
my own comments to some things.

> 2) Adding a new kind of relation (which also would need to be approved
> first), I would call it "similar to". […] Of course lateron
> similarity could also be calculated by an algorithm using some
> scrobbling data.

This should not be a relationship like the relationships we currently have.
This is an entirely subjective piece of information and some people will
consider two artists similar while others will not. There are already
projects (though I forget which, sorry) that group/cluster entities
based on relationships (which IIRC wasn't completely off), so it is
possible to do something like it with the data in MB already. There's
also AcousticBrainz which can be used to cluster entities based on the
acoustic properties of their recordings. If we get scrobbling hooked in
to our end at some point, that'd obviously be another usable source for
this, but it isn't required to get something like this going.

> I think it is a bit similar to Google crawling
> and provide the results as their own...

Google does pay at least some of their data sources. I know, for one,
that Google is MetaBrainz' biggest "customer" by far (in terms of how
much money they put in the project).

--
Namasté,
Frederik “Freso” S. Olesen <http://freso.dk/>
MB:   https://musicbrainz.org/user/Freso
Wiki: https://wiki.musicbrainz.org/User:Freso



Re: Data Import Policy

Andre Wiethoff
Hello Frederik,

thanks for your thoughts!
>
> This should not be a relationship like we currently have relationships.
> This is an entirely subjective piece of information and some people will
> consider two artists similar while others will not.
Yes, that is why I proposed not opening it for public editing, but
having it "computer generated" only (by whatever means, be it web
crawling or clustering on large data sets).
> There are already
> projects (though I forget which, sorry) that group/cluster entities
> based on relationships (which IIRC wasn't completely off), so it is
> possible to do something like it with the data in MB already.
I wonder which relationships were used to group/cluster the entities?
With the existing MetaBrainz data, I can only think of scrobbling data
for generating this kind of information...
I found this paper, but they used MusicBrainz only for the basic
metadata retrieval and Audioscrobbler (last.fm) for the similarity
clustering:
http://www.sfu.ca/~shaw/papers/musicianMap-VDA09.pdf
> There's
> also AcousticBrainz which can be used to cluster entities based on the
> acoustic properties of their recordings.
I don't think that clustering on the acoustic properties will bring any
good results for now. I guess it will still take ten years until
something exists that produces results comparable to those of a human
expert (or even an advanced amateur)...
> If we get scrobbling hooked in
> to our end at some point, that'd obviously be another usable source for
> this, but it isn't required to get something like this going.
But in the end you agree that the result of such a web crawl/clustering
algorithm/whatever should be stored in the database as a final result
(for speedier access to the results), if implemented at all? But perhaps
we should first discuss whether the new data would be beneficial for the
users (or the database)...

I thought that relationships would be the best place to put them, as it
is in fact a relation between e.g. two artists (even though the
definition of similarity would depend on the algorithm or the page that
is crawled). E.g. Amazon will most probably use the "customers who
bought stuff from this artist also bought stuff from these other
artists" similarity measure. I am unsure which measures are used by AMG
and BBC, but most probably also some kind of clustering algorithm...

Please see the similarity results of the pages for the artist "Herbert
Grönemeyer":
http://www.allmusic.com/artist/herbert-gr%C3%B6nemeyer-mn0000956217/related
http://www.amazon.de/Herbert-Groenemeyer/e/B000APL43M
http://www.bbc.co.uk/music/artists/456eabce-d1dd-4481-a206-36ab4f2eaeb8#more
>> I think it is a bit similar to Google crawling
>> and provide the results as their own...
> Google does pay at least some of their data sources. I know, for one,
> that Google is MetaBrainz' biggest "customer" by far (in terms of how
> much money they put in the project).
>
I also see this as a bit controversial, as even two IDs could be
intellectual property...
I think the biggest question is whether to allow web crawling for this
purpose at all.
Does anybody else have some insights on this?

Thank you in advance for your answers!

Best regards,

Andre



Re: Data Import Policy

"Frederik “Freso” S. Olesen"
Den 12-05-2015 kl. 16:00 skrev Andre Wiethoff:
>> This should not be a relationship like we currently have relationships.
>> This is an entirely subjective piece of information and some people will
>> consider two artists similar while others will not.
> Yes, therefore I did propose to not open it for public editing, but
> having it "computer generated" only (by whatever means, be it web
> crawling or clustering on large data sets).

But you proposed it to be stored as a relationship like the current
relationships in the MusicBrainz database… (read below)

> But in the end you agree that the result of such a web crawl/clustering
> algorithm/whatever should be stored in the database as final result (for
> speedier access of the results) - if implemented at all? But perhaps we
> should discuss at first whether the new data would be beneficial for the
> users (or the database)...

In *a* db, sure, in *the* (MB) db, no. It is not objective data and it
would not be user generated. It would be far more reasonable to place it
in another (sub)project. See e.g., AcousticBrainz and CritiqueBrainz for
two MetaBrainz projects expanding on the MusicBrainz data without being
inserted directly into the MB site/data themselves. A
RecommendationBrainz or SimilarityBrainz (or, heck, maybe it could be
part of CritiqueBrainz?) would be a better fit for this.

(Also note that having it in a separate project does not mean it cannot
be used by/on MusicBrainz; e.g., CritiqueBrainz reviews are pulled in
for relevant MB release( group)s.)

>> There are already
>> projects (though I forget which, sorry) that group/cluster entities
>> based on relationships (which IIRC wasn't completely off), so it is
>> possible to do something like it with the data in MB already.
> I wonder which relationships have been used to group/cluster the
> entities? […]

IIRC, all the relationships. The more times two entities were linked to
each other, the closer those two entities were. AFAIK, it's a fairly
simple heuristic, but given the number of relationships in the MB db, it
should give reasonable results for most fairly well-known artists.
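
A minimal sketch of the counting heuristic described above, assuming relationships are available as plain pairs of artist identifiers. The function name `closest_artists` and the sample data are made up for illustration; this is not the code of any existing project.

```python
from collections import Counter


def closest_artists(relationships, artist, limit=5):
    """Rank other artists by how many relationships connect them to `artist`."""
    counts = Counter()
    for a, b in relationships:
        if a == artist and b != artist:
            counts[b] += 1
        elif b == artist and a != artist:
            counts[a] += 1
    # The more links between two artists, the "closer" they are considered.
    return counts.most_common(limit)


# Made-up sample data: each pair stands for one relationship between two artists.
sample = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D"), ("A", "B")]
print(closest_artists(sample, "A"))
```

Because every relationship type counts equally here, a shared recording studio weighs as much as a shared band member. Weighting by relationship type would be one obvious refinement.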

>> There's
>> also AcousticBrainz which can be used to cluster entities based on the
>> acoustic properties of their recordings.
> I don't think that clustering regarding the acoustic properties will
> bring any good results for now, I guess this will still take ten years
> until there exist something that produces results matchable to a human
> expert (or even advanced amateur)...

I wouldn't make it stand on its own, no. ABz is still very much in its
infancy and the tools and algorithms in Essentia are not yet up to par
with this massive 2+ million song dataset currently available in the ABz
database. However, ABz can give you ranges about whether a group does
mostly vocal or instrumental things, whether they're mostly high or low
BPM, whether they have a predominant mood, etc.

These aren't necessarily 100% accurate, but by combining similarity on
these values with relationship clustering, I think it may be possible to
get some interesting results (e.g., two artists with a lot of
relationships connecting them that additionally both do mostly acoustic,
instrumental, happy+relaxed music are likely more similar than two
artists with no relationships connecting them, one doing mostly
instrumental and the other mostly vocal stuff).
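
One way the combination described above could look, as a sketch under stated assumptions: per-artist acoustic values are taken to be numbers between 0 and 1, the feature names are invented, and the equal weighting and the saturating relationship term are arbitrary choices for illustration. None of this reflects an actual AcousticBrainz API.

```python
def feature_similarity(a, b):
    """1.0 for identical feature values, lower as the shared features diverge."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return 1.0 - sum(abs(a[k] - b[k]) for k in shared) / len(shared)


def combined_score(rel_count, feats_a, feats_b, rel_weight=0.5):
    """Blend relationship closeness with acoustic similarity (hypothetical weighting)."""
    rel_part = rel_count / (rel_count + 1.0)  # 0 with no links, approaches 1 with many
    return rel_weight * rel_part + (1.0 - rel_weight) * feature_similarity(feats_a, feats_b)


# Invented per-artist averages in the range 0..1.
x = {"instrumental": 0.9, "acoustic": 0.8, "happy": 0.7}
y = {"instrumental": 0.8, "acoustic": 0.9, "happy": 0.7}
z = {"instrumental": 0.1, "acoustic": 0.5, "happy": 0.4}

print(combined_score(4, x, y))  # well connected and acoustically alike
print(combined_score(0, x, z))  # unconnected, one instrumental and one vocal
```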

When/if we get access to scrobbles, that's a third data source that can
be added to the mix, but I really do not think we need it to get started
on a similarity/recommendation engine.

--
Namasté,
Frederik “Freso” S. Olesen <http://freso.dk/>
MB:   https://musicbrainz.org/user/Freso
Wiki: https://wiki.musicbrainz.org/User:Freso



Re: Data Import Policy

"Frederik “Freso” S. Olesen"
In reply to this post by Andre Wiethoff
Den 12-05-2015 kl. 16:00 skrev Andre Wiethoff:
> Please see the similarity results of the pages for the artist "Herbert
> Grönemeyer":
> http://www.allmusic.com/artist/herbert-gr%C3%B6nemeyer-mn0000956217/related
> http://www.amazon.de/Herbert-Groenemeyer/e/B000APL43M
> http://www.bbc.co.uk/music/artists/456eabce-d1dd-4481-a206-36ab4f2eaeb8#more

I found the site I was thinking of. It turns out it isn't using just
MusicBrainz data for its clustering, but it isn't using scrobble data
either, only inter-artist relationships:
http://richseam.com/artist/m/02cskm

http://richseam.com/about-us has slightly more information on what they
are doing.

--
Namasté,
Frederik “Freso” S. Olesen <http://freso.dk/>
MB:   https://musicbrainz.org/user/Freso
Wiki: https://wiki.musicbrainz.org/User:Freso



Re: Data Import Policy

Andre Wiethoff
In reply to this post by "Frederik “Freso” S. Olesen"
Hello Frederik,

>> But in the end you agree that the result of such a web crawl/clustering
>> algorithm/whatever should be stored in the database as final result (for
>> speedier access of the results) - if implemented at all? But perhaps we
>> should discuss at first whether the new data would be beneficial for the
>> users (or the database)...
> In *a* db, sure, in *the* (MB) db, no. It is not objective data and it
> would not be user generated. It would be far more reasonable to place it
> in another (sub)project. See e.g., AcousticBrainz and CritiqueBrainz for
> two MetaBrainz projects expanding on the MusicBrainz data without being
> inserted directly into the MB site/data themselves. A
> RecommendationBrainz or SimilarityBrainz (or, heck, maybe it could be
> part of CritiqueBrainz?) would be a better fit for this.
> (Also note that having it in a separate project does not mean it cannot
> be used by/on MusicBrainz; e.g., CritiqueBrainz reviews are pulled in
> for relevant MB release( group)s.)
I see. So most probably I proposed this to the wrong project?
From that point of view, a recommendation engine (recommendation matrix,
probably a sparse matrix stored in a database) should also not be part
of MusicBrainz, but rather another project extending MusicBrainz, right?

>>> There are already
>>> projects (though I forget which, sorry) that group/cluster entities
>>> based on relationships (which IIRC wasn't completely off), so it is
>>> possible to do something like it with the data in MB already.
>> I wonder which relationships have been used to group/cluster the
>> entities? […]
> IIRC, all the relationships. The more times two entities linked to each
> other, the closer those two entities were. AFAIK, it's a fairly simple
> heuristic, but given the amount of relationships in the MB db, it should
> give reasonable results for most fairly well known artists.
>>> There's
>>> also AcousticBrainz which can be used to cluster entities based on the
>>> acoustic properties of their recordings.
>> I don't think that clustering regarding the acoustic properties will
>> bring any good results for now, I guess this will still take ten years
>> until there exist something that produces results matchable to a human
>> expert (or even advanced amateur)...
> I wouldn't make it stand on its own, no. ABz is still very much in its
> infancy and the tools and algorithms in Essentia are not yet up to par
> with this massive 2+ million song dataset currently available in the ABz
> database. However, ABz can give you ranges about whether a group does
> mostly vocal or instrumental things, whether they're mostly high or low
> BPM, whether they have a predominant mood, etc.
>
> These aren't necessarily 100% accurate, but combining similarity on
> these values with relationship clustering, I think it may be possible to
> get some interesting results (e.g., two artists with a lot of
> relationships connecting them that additionally does mostly acoustic,
> instrumental happy+relaxed music are likely more similar than two
> artists with no relationships connecting them and one doing mostly
> instrumental and the other doing mostly vocal stuff).
This is where we differ (but of course this depends on the definition of
the term "interesting" ;-)
I don't think that the relationship table will give sufficient
information to really find e.g. artists that are closely related (as
quite often only the band members are known). Combining it with a
large set of acoustic features, which are only probabilities of how
"similar" two songs are with regard to a given feature, will not improve
the result that much. I agree that you would get a list of songs (and
thereby artists) which are somewhat similar in the kind of music they
make, but this will not provide a (sorted) list of the most similar
artists/songs/whatever...
So, if the base data from the relationships is not good enough, adding
the acoustic properties will only allow grouping into very large groups
like the ones you mentioned, e.g. with/without vocals or fast/slow BPM.

Perhaps we should start by defining "similarity" first. Here is my attempt:
Similarity is the probability of a user also liking artist/song/etc. B
if they like artist/song/etc. A.
(This is a user-centric view of similarity. Of course each individual
user would judge differently how similar two bands are, but this is
only a probability...)
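
To make this definition concrete, here is a minimal sketch that estimates the probability from per-user sets of liked artists (for example, derived from scrobbles). The data, the user names and the function name `similarity` are invented for illustration.

```python
def similarity(likes, a, b):
    """Estimate P(user likes b | user likes a) from per-user sets of liked artists."""
    fans_of_a = [liked for liked in likes.values() if a in liked]
    if not fans_of_a:
        return None  # no user likes `a`, so there is nothing to estimate from
    return sum(1 for liked in fans_of_a if b in liked) / len(fans_of_a)


# Invented sample data: user -> set of artists that user likes.
likes = {
    "user1": {"A", "B"},
    "user2": {"A", "B", "C"},
    "user3": {"A"},
    "user4": {"B"},
    "user5": {"B", "C"},
}

print(similarity(likes, "A", "B"))  # share of A's fans who also like B
print(similarity(likes, "B", "A"))  # share of B's fans who also like A
```

Note that under this definition similarity is not symmetric: the value for (A, B) can differ from the value for (B, A), so a stored "similar to" link would need a direction or two separate scores.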

>
>> Please see the similarity results of the pages for the artist "Herbert
>> Grönemeyer":
>> http://www.allmusic.com/artist/herbert-gr%C3%B6nemeyer-mn0000956217/related
>> http://www.amazon.de/Herbert-Groenemeyer/e/B000APL43M
>> http://www.bbc.co.uk/music/artists/456eabce-d1dd-4481-a206-36ab4f2eaeb8#more
> I found the site using only MusicBrainz data for its clustering,
> except it isn't using just MusicBrainz data ­— but it isn't using
> scrobble data, only inter-artist relationships:
> http://richseam.com/artist/m/02cskm http://richseam.com/about-us has
> slightly more information on what they are doing.
Thanks for the links!

This shows exactly why the relationships wouldn't work out, using the
example of Herbert Grönemeyer (one of Germany's biggest artists). The
artist who is so similar that I often can't tell the two apart is
Westernhagen, who is listed on AllMusic and Amazon as related (BBC
shows only four related artists...). But analysing the connections with
richseam shows artists like John Smith (who doesn't seem to be a real
artist), Charles Aznavour (who is neither very similar nor even
singing in the same language), Little Axe (Blues!), ..., then eventually
"Die Fantastischen Vier" show up, who do sing in the same language but
do hip hop...
At the end there are actually a few who would match a bit, like
Philipp Poisel (using the relation "has played concert with Grönemeyer",
which would be the only relation that fulfils my definition of
similarity). But there is no sign of Westernhagen at all.
Just because two artists recorded their songs in a specific studio
doesn't make them related...
> When/if we get access to scrobbles, that's a third data source that can
> be added to the mix, but I really do not think we need it to get started
> on a similarity/recommendation engine.
Probably I just don't know where to start creating a similarity
algorithm using only the above two feature sets (and my definition of
similarity), but please prove me wrong.
Anyway, building a recommendation engine based on the mentioned features
will absolutely not be possible (or at least it will be no better than
picking random songs from "similar" artists - however "similar" is
defined), as there are far fewer relations on songs than on artists...

Something completely different: It seems that some audio fingerprints
are misdetected (meaning that one fingerprint has a bunch of results
with a high score, but not all of them are the correct recording). I
tested a live version, but it also found the regular version and even a
cover by a different group - I assume that either an algorithm has wrongly
assigned the song's metadata to the recording or a user has entered wrong
artist information...

Best regards,

Andre

_______________________________________________
MusicBrainz-devel mailing list
[hidden email]
http://lists.musicbrainz.org/mailman/listinfo/musicbrainz-devel

Re: Data Import Policy

tommycrock


On 12 May 2015 21:22, "Andre Wiethoff" <[hidden email]> wrote:
>
...
>
> Something completely different: It seems that some audio fingerprints
> are misdetected (meaning that one fingerprint has a bunch of results
> with a high score, but not all of them are the correct recording). I
> tested a live version, but it also found the regular version and even a
> cover by a different group - I assume that either an algorithm has wrongly
> assigned the song's metadata to the recording or a user has entered wrong
> artist information...
>

Yes, this is very common, and my understanding is that this is usually incorrectly submitted data. Various software submits AcoustIDs, so depending on how it's set up and what the user selects or what their existing data is, incorrect AcoustIDs can end up attached. Although I'd say that if there have been a lot of submissions, the right recording tends to have many more.
Having said that...
Sometimes two (or more) recordings will have one AcoustID. These are usually really similar mixes. Conversely, one recording can have multiple AcoustIDs because, for example, the speed has changed between released versions or one is truncated in a way we consider insufficient to treat them as two different recordings.
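
The "right recording tends to have many more" rule of thumb could be sketched roughly like this. The input format (pairs of recording MBID and submission count) and the `min_share` threshold are assumptions for illustration only, not how AcoustID itself decides anything.

```python
from typing import List, Tuple


def likely_recordings(
    submissions: List[Tuple[str, int]], min_share: float = 0.2
) -> List[str]:
    """Keep recordings whose share of all submissions is at least min_share."""
    total = sum(count for _, count in submissions)
    if total == 0:
        return []
    # Most-submitted recordings first.
    ranked = sorted(submissions, key=lambda item: item[1], reverse=True)
    return [mbid for mbid, count in ranked if count / total >= min_share]


# Hypothetical counts for one AcoustID: live version, studio version, cover.
example = [("live-mbid", 40), ("studio-mbid", 3), ("cover-mbid", 1)]
print(likely_recordings(example))
```

A share-based threshold rather than "take the top one" keeps both recordings when two really similar mixes legitimately share one AcoustID.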



AcoustID

Andre Wiethoff
Hello Tom,

> Yes, this is very common, and my understanding is that this is usually
> incorrectly submitted data. Various software submits AcoustIDs, so
> depending on how it's set up and what the user selects or what their
> existing data is, incorrect AcoustIDs can end up attached. Although
> I'd say that if there have been a lot of submissions, the right
> recording tends to have many more.
> Having said that...
> Sometimes two (or more) recordings will have one AcoustID. These are
> usually really similar mixes. Conversely, one recording can have
> multiple AcoustIDs because, for example, the speed has changed between
> released versions or one is truncated in a way we consider
> insufficient to treat them as two different recordings.
>
Thank you for the detailed explanation! I thought that it might be that
way...

If I understood correctly, the AcoustIDs themselves are stored externally
on the Acoustid.org server.
Are there any references from the MusicBrainz tables (e.g. a GUID) to the
data on Acoustid.org, or do all links point from AcoustID.org back to
MusicBrainz (so that the AcoustID.org API needs to be called in order to
show on the MusicBrainz webpage which fingerprints are assigned to a
recording)? I didn't find any reference to it in the
CreateTables.sql script...

Thanks in advance for your answer!

Best regards,

Andre


Re: AcoustID

Ian McEwen
On Wed, May 13, 2015 at 12:29:40PM +0200, Andre Wiethoff wrote:

> If I understood correctly, the AcoustIDs themselves are stored externally
> on the Acoustid.org server.
> Are there any references from the MusicBrainz tables (e.g. a GUID) to the
> data on Acoustid.org, or do all links point from AcoustID.org back to
> MusicBrainz (so that the AcoustID.org API needs to be called in order to
> show on the MusicBrainz webpage which fingerprints are assigned to a
> recording)? I didn't find any reference to it in the
> CreateTables.sql script...
>
All AcoustID data is stored on the AcoustID side. We've considered
setting up a system to move a view of that data back to MB as well, but
primarily so we can write reports based on that information and display
it without needing to make a request to a remote webservice. At present,
however, it's all displayed by way of a request made in JavaScript to the
AcoustID API.
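
For illustration only, such a request could be built roughly as below. This is a sketch under assumptions: it assumes the `track/list_by_mbid` endpoint of the AcoustID web service with an `mbid` parameter (please check the AcoustID web service documentation before relying on it), it is not the code the MusicBrainz website uses, and the MBID is a placeholder.

```python
from urllib.parse import urlencode

# Assumption: endpoint name as listed in the public AcoustID web service docs.
ACOUSTID_LIST_BY_MBID = "https://api.acoustid.org/v2/track/list_by_mbid"


def acoustid_lookup_url(recording_mbid: str) -> str:
    """Build the URL that lists the AcoustIDs linked to a recording MBID."""
    query = urlencode({"mbid": recording_mbid, "format": "json"})
    return ACOUSTID_LIST_BY_MBID + "?" + query


# Placeholder MBID, not a real recording; the URL is only printed, not fetched.
print(acoustid_lookup_url("00000000-0000-0000-0000-000000000000"))
```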


Re: AcoustID

Lukáš Lalinský
On Wed, May 13, 2015 at 3:29 AM, Andre Wiethoff <[hidden email]> wrote:
> If I understood correctly, the AcoustIDs themselves are stored externally
> on the Acoustid.org server.
> Are there any references from the MusicBrainz tables (e.g. a GUID) to the
> data on Acoustid.org, or do all links point from AcoustID.org back to
> MusicBrainz (so that the AcoustID.org API needs to be called in order to
> show on the MusicBrainz webpage which fingerprints are assigned to a
> recording)? I didn't find any reference to it in the
> CreateTables.sql script...

I generate monthly data files with the MBID-AcoustID links, but unfortunately they are currently out of date:


(It turns out that generating database dumps that are over 100GB compressed is a fairly non-trivial task and
I have been experimenting with the best setup for me to do this.)

Lukas



Re: AcoustID

Joseph Curtin

I have just sourced the database. As a user, it is possible to restore it using Ubuntu 14.04 and the postgresql-ppa.

Would you like me to document the process?

You're going to need at least one terabyte, maybe two. It's not a small db at all.

All that being said about the current backup process, here is my unsolicited opinion on a better one.

Set up a replication server for backups. Every nth hour, halt the replication server, tar the datadir, and then restart the replication server. This might cost a pretty penny when it comes to hosting.
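
A minimal sketch of that stop / tar / start cycle, assuming a plain `pg_ctl`-managed replica (on Debian/Ubuntu packages a wrapper such as `pg_ctlcluster` would normally be used instead). The paths and the function name are hypothetical, and the commands are only printed here, not executed.

```python
from typing import List


def cold_backup_commands(datadir: str, archive: str) -> List[List[str]]:
    """Return the commands for a stop / tar / start backup of a replica."""
    return [
        # Halt the replica cleanly so the files on disk are consistent.
        ["pg_ctl", "stop", "-D", datadir, "-m", "fast"],
        # Archive the contents of the data directory.
        ["tar", "-czf", archive, "-C", datadir, "."],
        # Restart the replica; it then catches up with the master again.
        ["pg_ctl", "start", "-D", datadir],
    ]


# Hypothetical paths; print the plan instead of running it.
for cmd in cold_backup_commands("/srv/pg/replica", "/backup/replica.tar.gz"):
    print(" ".join(cmd))
```

Each command could be passed to `subprocess.run(cmd, check=True)` from a scheduled job, so that a failed stop aborts before the tar step.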


Re: AcoustID

Lukáš Lalinský
On Mon, May 18, 2015 at 3:39 PM, Joseph Curtin <[hidden email]> wrote:

> I have just sourced the database. As a user, it is possible to restore it using Ubuntu 14.04 and the postgresql-ppa.
>
> Would you like me to document the process?


Definitely. I'd also welcome any patches/ideas on how to make the process easier. My goal so far was only to run the data on acoustid.org. Running a mirror was not my priority, because you need fairly expensive hardware to do it right, so not many people would actually do that anyway.
 

> Set up a replication server for backups. Every nth hour, halt the replication server, tar the datadir, and then restart the replication server. This might cost a pretty penny when it comes to hosting.


I actually already have a replicated server just for the data export. I currently run the export while replication is running, which often causes problems because PostgreSQL runs out of transactions (the export is one giant day-long serialized transaction). I have actually just moved it to a completely separate server with the intention of stopping replication during the process, but I need to figure out how to handle that with regard to monitoring and things like that.

Lukas



Re: AcoustID

Joseph Curtin


On May 18, 2015 4:14 PM, "Lukáš Lalinský" <[hidden email]> wrote:
>
>
> Definitely. I'd also welcome any patches/ideas on how to make the process easier. My goal so far was only to run the data on acoustid.org. Running a mirror was not my priority, because you need fairly expensive hardware to do it right, so not many people would actually do that anyway.

I hope to be able to help here. I took a quick look at what it might take to upgrade you to 9.4. I determined that it isn't trivial, but it should be rather straightforward. When I have the time, I'll see what I can do.

> I actually already have a replicated server just for the data export. I currently do it while the replication is running, which is often causing problems, due to PostgreSQL running out of transactions (the export is one giant day-long serialized transaction). I have actually just moved the server to a completely separate one with the intention to stop replication during the process, but I need to figure out how to handle that with regard to monitoring and things like that.
>

What kind of monitoring questions are you looking to answer?

If you copy the datadir, it would be cheaper. All you'd be doing is a bit-for-bit copy, versus scheduling and dumping data all within the context of Postgres. You can transplant datadirs fairly easily as long as you do it while the server is shut down.
