These are chat archives for yaskyj/fastcaption

13th
Feb 2015
Justin Rogers
@yaskyj
Feb 13 2015 15:41
Hey Michael, I also wanted to confirm that we still want to continue scraping the ASRs from YouTube as long as they're there, right?
Justin Rogers
@yaskyj
Feb 13 2015 20:18
Michael, wanted to keep you updated. We're going to have to change the primary id for the database. It doesn't like having the full URL being passed into the call. Also, Arthur and I are going to have another pairing session this Sunday and we're working on the blank template functionality and user authentications.
And let me know if you'd prefer to start using the NoMoreCraption private room that other Michael and Quincy set up for us or continue to use this room.
Thanks!
Michael Lockrey
@MichaelLockrey
Feb 13 2015 20:32
Hi guys
Yes I definitely want to continue offering the asr data as a starting point
what are your thoughts on this?
im happy to keep using this room
Michael Lockrey
@MichaelLockrey
Feb 13 2015 20:54
I'd love to be involved (even if it's just a 20-30 minute check-in) on Sunday or before Sunday's pair session
Let me know if that's ok
Also, what will we use as the primary ID for the database?
Will we need to use the YouTube 11-character video ID?
How will that impact the ability to cover other video hosting platforms such as Facebook, Vimeo, AWS etc?
Justin Rogers
@yaskyj
Feb 13 2015 21:11
ASR data - I looked over the old code and the ASRs can still be scraped from the Youtube page. If you look in the source of the Youtube page it's after 'ttsurl'.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:12
Right
So I'm just looking at this one
Justin Rogers
@yaskyj
Feb 13 2015 21:12
So it's not a problem to get them. It will still just be scraping them and if Youtube ever changes how the page is served then we'll have to change it.
The page source for that one?
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:13
Yes I found it
Thanks Justin
I had never done that before!
;)
Justin Rogers
@yaskyj
Feb 13 2015 21:13
NP!
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:14
Did you try doing a video using YouTube's own fan based tool?
i.e. doing the captions?
Just for some end user perspective etc?
Justin Rogers
@yaskyj
Feb 13 2015 21:15
You can actually look at the initial code I wrote to test the scraping. It's the request-test.js file in the main part of the nomorecraptions repo.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:15
OK
Justin Rogers
@yaskyj
Feb 13 2015 21:16
I still need to do that. We've been trying to setup the base stuff. Trying to get everything to work nicely together.
The request-test.js file grabs that URL with a regex and then replaces the Unicode escapes for the ampersands and commas. Then it removes all of the extra /.
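As a rough sketch of that cleanup (not the actual request-test.js code): this assumes the page source embeds the caption URL as a JSON-escaped string after a "ttsurl" key, with \u0026 standing in for ampersands and \/ for slashes. The function name extractTtsUrl and the regex are hypothetical.

```javascript
// Hypothetical helper: pull the escaped caption URL out of raw page source.
// Assumes the URL appears as "ttsurl": "<escaped url>" somewhere in the page.
function extractTtsUrl(pageSource) {
  const match = /"ttsurl":\s*"([^"]+)"/.exec(pageSource);
  if (!match) return null; // no ttsurl found in this page

  return match[1]
    // Turn \uXXXX escapes (e.g. \u0026 for "&") back into real characters.
    .replace(/\\u([0-9a-fA-F]{4})/g, (_, hex) =>
      String.fromCharCode(parseInt(hex, 16)))
    // Turn the escaped "\/" sequences back into plain "/".
    .replace(/\\\//g, '/');
}
```

If YouTube changes how the page is served, only the regex on the first line of the function should need to change.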
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:18
Right
I know how regex works
But it still comes across as "dark arts" to me sometimes!
Very hard to get my head around some regex expressions some days
;)
Justin Rogers
@yaskyj
Feb 13 2015 21:19
Right, I'm just saying that it looks like those three things are the only items that we have to worry about in that url.
And it's worked alright so far in my tests last night.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:19
I saw this on npm - might be useful too?
They also had a YouTube audio stream package
Justin Rogers
@yaskyj
Feb 13 2015 21:20
And yes, of course you can be involved this Sunday. We planned on doing the pairing around 12PST if somewhere around then is good for you.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:20
OK
I'm more front end than back end
:(
So is that West Coast time?
Justin Rogers
@yaskyj
Feb 13 2015 21:22
And for the id, I thought we could use a concatenation of the site name plus the unique video id. So it would be something like "youtubefPloDzu_wcI"
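A minimal sketch of that id convention, assuming Node's built-in URL class. The helper names buildVideoKey and youtubeKeyFromUrl are hypothetical, and only two common YouTube link shapes are handled.

```javascript
// Hypothetical helper: site name + that site's own video id.
function buildVideoKey(site, videoId) {
  return site.toLowerCase() + videoId;
}

// Hypothetical helper: derive the key from a pasted YouTube link.
// Handles only youtube.com/watch?v=... and youtu.be/... links.
function youtubeKeyFromUrl(rawUrl) {
  const url = new URL(rawUrl);
  const host = url.hostname.replace(/^www\./, '');
  let id = null;
  if (host === 'youtu.be') {
    id = url.pathname.slice(1); // drop the leading "/"
  } else if (host === 'youtube.com') {
    id = url.searchParams.get('v');
  }
  return id ? buildVideoKey('youtube', id) : null;
}
```

Other hosts would need their own small parser, but the resulting key keeps the same shape, e.g. buildVideoKey('vimeo', someId).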
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:22
Good thinking
Justin Rogers
@yaskyj
Feb 13 2015 21:22
Yeah, that's West Coast.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:22
Vimeo will obviously have a different system
Facebook seems to be the next frontier though
They are really pushing video content now
Justin Rogers
@yaskyj
Feb 13 2015 21:23
Yep, but I think that basic convention would still work for all video types.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:23
and I've seen videos that only have 20-30 views on YouTube
That have 50k plus on Facebook
Excellent
Justin Rogers
@yaskyj
Feb 13 2015 21:24
I had just envisioned that the call to the API would be something like '/captions/:videoID', but when the videoID was the url it throws an error.
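To illustrate one likely cause (an assumption, since the exact error isn't shown here): a raw URL contains slashes, so it no longer fits in a single path segment. The matcher below is a hypothetical stand-in for the real router, written without any framework.

```javascript
// Hypothetical stand-in for a '/captions/:videoID' route:
// the id must be exactly one path segment (no "/" inside it).
function matchCaptionRoute(path) {
  const match = /^\/captions\/([^/]+)$/.exec(path);
  return match ? decodeURIComponent(match[1]) : null;
}

const videoUrl = 'https://www.youtube.com/watch?v=fPloDzu_wcI';

// A raw URL spills across several segments, so the route does not match.
console.log(matchCaptionRoute('/captions/' + videoUrl));

// A compact key fits in one segment.
console.log(matchCaptionRoute('/captions/youtubefPloDzu_wcI'));

// Percent-encoding the URL also fits, at the cost of an ugly path.
console.log(matchCaptionRoute('/captions/' + encodeURIComponent(videoUrl)));
```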
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:25
BTW came across this event
Very low key
Organised by Google Accessibility team
I think it's just an update and could include the fan based captioning tool etc
Justin Rogers
@yaskyj
Feb 13 2015 21:26
That's a neat library. It would relieve the need for hand rolling the youtube url tests.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:26
Yes
Users could easily be copying links like this
They have quite a few YouTube related packages on that site
Justin Rogers
@yaskyj
Feb 13 2015 21:27
Yep, just added it to my bookmarks.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:28
what does the u0026 in your code refer to?
Justin Rogers
@yaskyj
Feb 13 2015 21:29
That's the ampersand.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:29
Right
Are the asr tracks expiring at the moment?
i.e. are they set to expire within some timeframe?
Or was that just a data field in their code that's unused at Google's end?
Justin Rogers
@yaskyj
Feb 13 2015 21:30
I don't know. But if you refresh the page the url will change itself around. Weird.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:31
So does this mean that if I try to watch a Saturday Night Live video on YouTube
Justin Rogers
@yaskyj
Feb 13 2015 21:31
Oh, I think I know what you mean. The ttsurl isn't in all pages. I went through a dozen or so popular pages last night looking for it, but it's missing from most.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:31
that's unallowed in Australia (geographic restrictions)
And change the meta properties in Chrome
I could watch it?
Justin Rogers
@yaskyj
Feb 13 2015 21:32
You could change your IP address to US by using a proxy router.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:32
That would just be because there's only 25% of videos with asr or published caption/subtitle tracks
Yes have used proxy services
But I just noticed in the view source there that the geographic regions are all enabled with two character country codes
So was just (thinking aloud)
Did you see my article on Medium how I claim that there's probably only 5% captioning tracks ?
20% are then ASR tracks (automated craptions)
and the rest are uncaptioned at all
Justin Rogers
@yaskyj
Feb 13 2015 21:37
I didn't. Did you post it on Twitter?
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:37
I did
But there's a lot of stuff on Twitter mate
Here it is:
Justin Rogers
@yaskyj
Feb 13 2015 21:38
Ha! Yeah, I'm bad with Twitter. I don't really tweet. I should probably remedy that at some point.
One of my goals is to use Twitter for this Justin
140 characters
it's a painless and easy way to crowdsource
Easy to expect people to do 1-2 tweets (140 - 280 characters)
Justin Rogers
@yaskyj
Feb 13 2015 21:38
Right, Arthur and I were talking about that on Wednesday evening.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:39
Not easy to expect people to volunteer more than 5 minutes of their time
Justin Rogers
@yaskyj
Feb 13 2015 21:39
He's working on setting up all of the APIs and authentications right now.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:39
For twitter too?
Or just the main ones that we need?
Justin Rogers
@yaskyj
Feb 13 2015 21:39
All the basics.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:40
Right
Justin Rogers
@yaskyj
Feb 13 2015 21:40
So that the users can login, but also connect their various accounts, so they can tweet it out.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:40
Excellent
Yes understood
But think about it Justin - the main reason Google is using ASR is because they are trying to scale the solution
They have too much content
Justin Rogers
@yaskyj
Feb 13 2015 21:41
I was thinking that the tweet would just contain a brief blurb and then the tinyurl (or whichever). The actual url would be to the video fast-forwarded to the last caption that was entered.
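A rough sketch of that idea. The base address, the /caption/ path, and the t parameter are all hypothetical, and the length check is a naive character count that ignores how Twitter itself counts links.

```javascript
const TWEET_LIMIT = 140;

// Hypothetical link format: <base>/caption/<videoKey>?t=<seconds>
function buildResumeLink(baseUrl, videoKey, seconds) {
  const url = new URL('/caption/' + encodeURIComponent(videoKey), baseUrl);
  url.searchParams.set('t', String(Math.max(0, Math.floor(seconds))));
  return url.toString();
}

// Join a blurb and a link, trimming the blurb so the result fits the limit.
function composeTweet(blurb, link) {
  const room = TWEET_LIMIT - link.length - 1; // 1 for the space before the link
  if (room < 1) throw new RangeError('link leaves no room for text');
  const text = blurb.length > room ? blurb.slice(0, room - 1) + '…' : blurb;
  return text + ' ' + link;
}
```

A shortener could be slotted in between the two calls without changing either function.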
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:42
I've actually been thinking of this:
Justin Rogers
@yaskyj
Feb 13 2015 21:42
That is the problem. How do you get proper captions for videos over a certain time threshold?
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:42
If it's more than 2 or 3 minutes
I think you need to use a different approach
NMC is fine for short videos
But beyond 2-3 minutes it gets hard to expect users to keep going
So that's where I'm spending a lot of time thinking of a better way at the moment
Justin Rogers
@yaskyj
Feb 13 2015 21:43
Yep. Since it takes ~hour just for the 2-3 minute video.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:44
In the old days of TV captioning on broadcast TV
They were happy to caption 23 minutes of content
in one working day!
That's starting from scratch
Justin Rogers
@yaskyj
Feb 13 2015 21:45
I guess we might want to add another element in the database for each video that contains the reference for the last caption that was saved.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:46
Yes
Or possibly keep track of each caption block that has been edited in any way?
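Both ideas could live in the same record. A minimal sketch, assuming each video is stored as one document and each ASR line has a start time and text; the field names (lastSavedIndex, blocks, edited) are hypothetical, not the project's schema.

```javascript
// Hypothetical record shape: one document per video.
function createVideoRecord(videoKey, asrLines) {
  return {
    _id: videoKey,
    lastSavedIndex: -1, // no caption saved yet
    blocks: asrLines.map((line) => ({
      start: line.start, // seconds into the video
      text: line.text,
      edited: false,
    })),
  };
}

// Save a corrected block and remember it as the most recent save.
function saveBlock(record, index, text) {
  const block = record.blocks[index];
  if (!block) throw new RangeError('no caption block at index ' + index);
  block.text = text;
  block.edited = true;
  record.lastSavedIndex = index;
  return record;
}

// First block nobody has touched yet, or -1 when everything is edited.
function nextUneditedIndex(record) {
  return record.blocks.findIndex((block) => !block.edited);
}
```

Tracking the per-block flag costs a little more storage than a single pointer, but it still works when people edit blocks out of order.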
Justin Rogers
@yaskyj
Feb 13 2015 21:46
But I was also thinking, "what if someone really popular tweeted it out?"
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:46
That's true
I don't have popular friends
Other than you (Justin) ;)
Justin Rogers
@yaskyj
Feb 13 2015 21:47
Thanks, but I believe that you might be mistaken :worried:
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:47
One thing Medium do well with their blog platform
is add an accurate data stat
on how long it takes to read a piece or article
I wonder if we could do something similar
such as 20 lines to be corrected
Estimated time x 3
10 minutes
or something along those lines?
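One way to read those numbers: 20 lines at roughly 30 seconds each comes to 10 minutes. The 30-second figure is an assumption picked to match the example, not a measured rate, and the function name is hypothetical.

```javascript
// Hypothetical estimate: lines left to correct x seconds per line.
// The default of 30 seconds per line is a guess, not a measurement.
function estimateMinutes(linesRemaining, secondsPerLine = 30) {
  return Math.ceil((linesRemaining * secondsPerLine) / 60);
}

console.log(estimateMinutes(20) + ' minutes'); // 20 lines at 30 s each
```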
Justin Rogers
@yaskyj
Feb 13 2015 21:49
I guess the url would just return you to the latest captions though. So if someone had started working on it, then someone who clicks on it later would still be sent to a good caption to start on.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:49
Yes
I think that's a good starting point
We could even quasi "crowdsource" via twitter campaign
Saying just click this link and edit one block of captions
Justin Rogers
@yaskyj
Feb 13 2015 21:50
But, still, if Ashton Kutcher or someone tweeted it, 1000s would be trying to get to the latest caption at once.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:50
Yes
Justin Rogers
@yaskyj
Feb 13 2015 21:51
I guess that's a problem that we hope to deal with though.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:51
He's on my speed dial list by the way
Justin Rogers
@yaskyj
Feb 13 2015 21:51
Excellent!
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:51
Yes
It's just a hypothetical at the moment
Justin Rogers
@yaskyj
Feb 13 2015 21:52
Yep, what we hope for is that too many people are trying to caption the videos. I think that would be a great success!
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:52
It would be very foreign to my years of experience in this space Justin
Justin Rogers
@yaskyj
Feb 13 2015 21:53
But it would be a nice problem to have.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:54
People have suggested to me that microtasking is a possible solution
Like Mechanical Turk
etc
So we could pay $0.001 per line etc
But that seems like virtual slavery to me
Justin Rogers
@yaskyj
Feb 13 2015 21:54
Haha.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:55
There are captioning service providers out there that only charge $1 per minute
Justin Rogers
@yaskyj
Feb 13 2015 21:55
Yeah, pretty much. When I heard about it years ago I tried it for about an hour.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:55
so they must be gouging on the costs side
and not paying much to their workers
Justin Rogers
@yaskyj
Feb 13 2015 21:55
I still get emails about my "credits" every once in a while.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:55
Right
Justin Rogers
@yaskyj
Feb 13 2015 21:57
But you're ok with the ASR scraping from youtube so far, right? I haven't started to look at all of the other sites and their apis. I'm guessing they'll all be different.
Does Facebook, Vimeo, etc. create asr tracks?
Michael Lockrey
@MichaelLockrey
Feb 13 2015 21:58
No
That's what I wanted to talk to you about
We could provide an ASR track for these other services quite easily using Speechmatics
Facebook has an appalling method for adding captions to video uploads
or something stupid like that
You have to specify a specific file name that ends in "_en"
Basically it was hard enough for me to get across the bar they had set
So it's unlikely others will be doing it
Justin Rogers
@yaskyj
Feb 13 2015 22:00
I see that it looks like there was a Speechmatics function that was created in the repo.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:00
Yes
It took quite a while
But it uses their API
Justin Rogers
@yaskyj
Feb 13 2015 22:01
How much does it cost though?
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:01
and serves back an ASR track
6p per minute
Great Britain company
So about 12c
Justin Rogers
@yaskyj
Feb 13 2015 22:02
How feasible is the cost for the project?
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:02
BTW Cloud Academy would like 450 hours of content captioned!
I spoke to them this week
That's their current / back catalogue of learning content
Justin Rogers
@yaskyj
Feb 13 2015 22:02
Was that after the initial conversation or the same one?
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:02
Oh sorry
Did I already update you?
Justin Rogers
@yaskyj
Feb 13 2015 22:03
We spoke about Cloud Academy the other day.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:03
I've been racking my brain trying to work out the best way to help them
But there's no easy way to fix up 450 hours of content when there's no transcripts etc
It's a big job - period!
Justin Rogers
@yaskyj
Feb 13 2015 22:04
That is a giant job.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:05
They also won't want to pay much
Justin Rogers
@yaskyj
Feb 13 2015 22:06
Would it cover Speechmatics costs even?
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:07
I am sure that they would cover that sort of pricing
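For scale, a quick back-of-envelope using the figures quoted in this chat (6p per minute, 450 hours). The function name is hypothetical, and the sum is done in whole pence to avoid floating-point rounding.

```javascript
// Hypothetical helper: ASR cost in pounds at a flat per-minute rate.
function asrCostPounds(hoursOfVideo, pencePerMinute = 6) {
  const minutes = hoursOfVideo * 60;
  const totalPence = minutes * pencePerMinute;
  return totalPence / 100;
}

// 450 hours = 27,000 minutes; at 6p each that is 162,000p.
console.log('£' + asrCostPounds(450));
```

That covers only the machine transcript; the human correction time on top of it is the larger cost.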
But Speechmatics isn't much better than Google's ASR
If at all
I did a bit of testing and wasn't that impressed
But the main advantage is that they will have a dig
at providing an ASR transcript
for videos that Google didn't even try to ASR
Justin Rogers
@yaskyj
Feb 13 2015 22:08
I really wish there was a free open source option for it, but I guess that amount of processing is just too cost prohibitive.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:09
The main game now - seems to be respeaking
Have you heard of that approach?
Justin Rogers
@yaskyj
Feb 13 2015 22:10
Nope, what is that (as I'm Googling it).
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:10
OK
So what some captioning companies have started doing
is this:
They realise that ASR is not cutting it and doesn't offer the quality they need
So they are using Dragon NS and other programs
Justin Rogers
@yaskyj
Feb 13 2015 22:11
Ah, that makes sense. (Just read about it)
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:11
And they have a trained person who is a strong user of DNS etc
to respeak the dialogue into this software
Giving a much better quality transcript
Yes
Justin Rogers
@yaskyj
Feb 13 2015 22:11
Enunciate correctly, etc.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:11
Punctuation
all of that
You have to speak very robotically
But if you're a strong user of DNS etc
It can be done effectively
Justin Rogers
@yaskyj
Feb 13 2015 22:13
I remember when speech recognition for home computers first came out, one of my friends was telling me about how they were using it, but it was a real strain on their voice because of the unnatural way of speaking that was necessary.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:13
That's right
The feedback I have
is that only 3 out of 10 people are suitable for this work
But it's better than court reporters or stenos
as there isn't 2-3 years of full time training required
You can be trained up in 6-8 weeks at a basic level
Justin Rogers
@yaskyj
Feb 13 2015 22:21
Do you have Screenhero?
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:22
I did have
But didn't use it much
Do you have it?
Justin Rogers
@yaskyj
Feb 13 2015 22:22
Just asking because that's what we're going to use on Sunday for the pairing.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:22
OK
Justin Rogers
@yaskyj
Feb 13 2015 22:23
If you deleted it, I believe I can still send you an invitation.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:23
OK please do that
Thanks for lending your ear this morning
I have given it a good chewing
;)
Justin Rogers
@yaskyj
Feb 13 2015 22:25
No problem. You're much more up on this information than I am. It's the only way we can build this. I've only been learning about all this for about three weeks now.
There's so much to learn.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:25
Are you still enjoying the project?
There's a steep learning curve
But there's lots of opportunities in the online video space I believe
Justin Rogers
@yaskyj
Feb 13 2015 22:26
Oh, it's a pain in the @$$ as far as problems go. I love it!
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:26
Ha ha
Justin Rogers
@yaskyj
Feb 13 2015 22:27
I just sent you an invite to Screenhero. You should get an email.
What kind of system do you have?
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:27
I'm on Mac
Yosemite
But it's an older MBP
Does having 3 people on at once slow things down?
Justin Rogers
@yaskyj
Feb 13 2015 22:28
Great. I've been using my Mac and Arthur has one so it won't look any different to anyone.
Don't know, never tried it before. But the beta is made for more than two people.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:29
OK
Justin Rogers
@yaskyj
Feb 13 2015 22:30
@QuincyLarson Have you found that Screenhero with more than two people impacts performance?
Also, I only have about 5% power right now in the cafe, so I'll be off pretty soon.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 22:35
OK
Thanks for the update and chat
Justin Rogers
@yaskyj
Feb 13 2015 22:35
Yes, absolutely.
We'll chat again on Sunday. Trying to get all of this stuff wired together properly.
Michael Lockrey
@MichaelLockrey
Feb 13 2015 23:17
Just came across this on GitHub