Working on an Offline Tool (w/ AI) to Make Transcript Wrangling Easier - Thoughts?

Hey fellow doc makers!

I want to chat about managing interview transcripts. If you’ve worked on a documentary, you know how overwhelming it can be, especially after conducting a ton o' interviews.

Wrangling all those files, trying to find specific quotes, or just keeping things organized is a time suck. In the past, I’ve built “fancy” templates for FileMaker Pro, which have worked, but I wanted more functionality. BTW, this isn’t leading to a sales pitch. :)

Of course, you can upload everything to an AI chatbot, but then all your proprietary data is out in the wild. Plus, I’ve done tests with three different AI options, and the search results were dicey. (More about that in a future video…)

However, I’m working on a solution—a simple offline app designed specifically for filmmakers to manage and search transcripts. Since it’s offline, privacy is in your hands/on your drives, and you won’t need to worry about uploading sensitive material to rando servers.

Here are some of the features I’m building in:
  • Quick, keyword-based search for finding quotes fast... much faster than online options
  • Advanced search to let you dig into specific phrases or themes (even nearby words)
  • Tagging and thematic organization so you can group quotes by topic or storyline
  • Transcript summaries for those okay with using an AI component... this could be toggled on/off, or could run on a local LLM
  • Options to copy search results and/or save them to different file types
  • Secure local storage (no internet required)
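The "nearby words" advanced search in the list above can be done entirely offline with plain proximity matching, no AI required. A minimal sketch (the function name, tokenization, and sample text are my own illustration, not the actual app):

```python
import re

def proximity_search(text, word_a, word_b, max_gap=5):
    """Find places where word_a and word_b occur within max_gap words
    of each other. Returns a list of (position_of_a, position_of_b)
    token-index pairs; an empty list means no nearby match."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", text)]
    pos_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return [(a, b) for a in pos_a for b in pos_b
            if a != b and abs(a - b) <= max_gap]

transcript = "We shot the interview at dawn; the light over the harbor was unreal."
print(proximity_search(transcript, "light", "harbor", max_gap=3))  # [(7, 10)]
```

A real index would store token positions per transcript up front so queries stay fast across hundreds of files, but the distance check itself is this simple.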
I’d love to get your thoughts! Is this something you’d find useful? Which features sound cool? Which are missing? Let me know – all feedback welcome!

I’ll let you know when I finish a demo video. Eventually, I'd like to recruit some tire kickers to test it for free, too. Thanks so much!
 
It's a good idea. There are tons of great practical applications for AI and I think right now is kind of the gold rush era.

I started building one yesterday that simply goes through a directory of images, gives each an aesthetic rating score based on composition, color balance, rule of thirds, stuff like that, then provides a slider control where you can specify the top X percent of images. So: here's 500 images from my movie in a folder, sort the top 3% into folder X, and now I can pick my thumbnail from the best 15 images rather than 500. It's already working, but it needs a few more days of work to improve the reliability of the aesthetic analysis.
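The top-X% part of that workflow is just rank-and-slice; the hard part is the scoring model, which is stubbed out below with fake scores (a rough sketch, not the actual tool):

```python
import shutil
from pathlib import Path

def top_percent(scores, top_pct):
    """Rank image paths by score and return the top `top_pct` percent."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, round(len(ranked) * top_pct / 100))
    return ranked[:keep]

def copy_winners(winners, dest_dir):
    """Copy the selected images into dest_dir for thumbnail picking."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for img in winners:
        shutil.copy2(img, dest / Path(img).name)

# 500 fake scores; in practice these come from the aesthetic model.
scores = {f"img_{i:03d}.jpg": i / 500 for i in range(500)}
best = top_percent(scores, 3)   # 3% of 500 -> the best 15 images
```

With real files, `copy_winners(best, "folder_x")` does the sorting-into-folder-X step.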

In terms of your design, the one thing I'm wondering about is whether the average consumer has enough computing power to run a local LLM that could come close to the results available from online options running on server farms. I have no doubt that you can accomplish the bullet points above with a reasonably sized model, but I can't help but wonder if the results might be far stronger if you simply guaranteed information privacy on the server end and ran a service based on rented H100 cards.

For example, if the bot could actually understand things like which lines were impactful or dramatic, or map the resolution of plot lines that emerge within the dialog, that could end up being incredibly useful.

I understand the value of a simple organizational program, though I wonder whether such a system with the AI disabled would offer anything competitive versus existing products.

Here's the version I would actually suggest: you drop the raw interview tapes into the AI; it learns the names and vocal tones of each speaker from context on the tapes, creates the transcript, identifies all speakers and topics, and marks when each subtopic begins and resolves. Actually, once you've done everything you talked about above and added the functionality I'm describing here, you'd be just a stone's throw from the AI being able to assemble the interview sections itself, essentially creating something far more powerful and salable: an automatic interview editor.
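The data model behind that pipeline could be as simple as timestamped, speaker-labeled segments with topic tags. A sketch, assuming diarization and topic detection happen upstream (all names and sample data here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    speaker: str                 # from a diarization step (assumed upstream)
    start: float                 # seconds into the tape
    end: float
    text: str
    topics: list = field(default_factory=list)

@dataclass
class Interview:
    segments: list

    def topic_span(self, topic):
        """First and last timestamps where a topic appears, i.e. where
        the subtopic 'begins and resolves'. None if the topic is absent."""
        hits = [s for s in self.segments if topic in s.topics]
        if not hits:
            return None
        return (hits[0].start, hits[-1].end)

iv = Interview([
    Segment("Martha", 0.0, 12.5, "The flood changed everything.", ["flood"]),
    Segment("Nate", 12.5, 30.0, "We rebuilt the pier that summer.", ["flood", "rebuild"]),
    Segment("Martha", 30.0, 41.0, "Tourism came back slowly.", ["rebuild"]),
])
```

An "automatic interview editor" would then be a pass over these spans, pulling segments per topic into a candidate assembly order.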
 
Hi Nate!
Thanks for the detailed notes. I appreciate it! To your points...

Local LLM: Good point. It would only be an option for folks who want to avoid all outside sharing of their content. I'd include basic spec guidelines for those interested in installing an LLM. There are lightweight models that can run on mid-tier laptops. Totally optional, though.

Thematic Processing: Yes, this would be valuable. Even the ability to tag clips with custom theme/emotional tags could be useful. Once clips are tagged, you could search by them.
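That tag-then-search loop needs nothing fancier than an inverted index from tags to clips. A minimal sketch (class and clip names are illustrative, not the actual app):

```python
from collections import defaultdict

class TagIndex:
    """Minimal sketch: custom theme/emotion tags on transcript clips."""
    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, clip_id, *tags):
        """Attach one or more free-form tags to a clip."""
        for t in tags:
            self._by_tag[t.lower()].add(clip_id)

    def find(self, *tags):
        """Return clips carrying ALL of the given tags."""
        sets = [self._by_tag[t.lower()] for t in tags]
        return set.intersection(*sets) if sets else set()

idx = TagIndex()
idx.tag("clip_014", "grief", "turning-point")
idx.tag("clip_022", "grief")
```

Combining tags narrows the search: `idx.find("grief", "turning-point")` returns only clips that carry both.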

Simplicity: Don't underestimate the power of a simple, all-in-one local app to search, tag, clip, and organize transcripts. I've been doing docs and news content for 20+ years and haven't found a tool that checks all the boxes. Plus, this would be FAST! I've run tests on numerous AI chatbots... slow and unreliable.

On-Board Transcription: To keep the price reasonable, users would import their own transcripts. Including transcription could double/triple the price. Plus, there are SO many free/affordable transcription options. Back in the day, I was dropping $1/min at Rev. Of course, that was a big upgrade from paying Martha to transcribe by hand. :) BUT, the app would be flexible with file types. I've built a small app that converts "messy" transcripts into organized CSV files.
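The messy-transcript-to-CSV conversion mentioned above can be sketched roughly like this. The real app's format rules are unknown, so the "Name:" prefix convention and continuation-line handling here are assumptions:

```python
import csv
import io
import re

def transcript_to_csv(raw):
    """Rough sketch: turn 'SPEAKER: line' text into two-column CSV.
    Lines without a capitalized 'Name:' prefix are treated as
    continuations of the previous speaker's text."""
    rows = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        m = re.match(r"^([A-Z][\w .'-]{0,30}):\s*(.*)$", line)
        if m:
            rows.append([m.group(1), m.group(2)])
        elif rows:
            rows[-1][1] += " " + line
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["speaker", "text"])
    writer.writerows(rows)
    return buf.getvalue()

raw = """MARTHA: We lost the farm
in the spring.

INTERVIEWER: What year?"""
out = transcript_to_csv(raw)
```

Real-world transcripts vary wildly (timestamps, all-caps headers, page numbers), so a production version would need per-format rules; this just shows the shape of the conversion.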

Again, I really appreciate your time and thought on this. I'd love to reach out in the future as development progress is made. Best!
 
A technical question: have you considered how you (or the software) will handle different accents? Most good quality voice-to-text software that I'm familiar with needs to be trained to recognise a single voice being "textified"; automated "textifiers" - such as YouTube's subtitling and transcription services - are absolutely dreadful in this regard and spew out all kinds of nonsense.

This may be a peculiarly "Old World" problem - there are, for example, four distinct accents spoken in my hometown of Dublin (Ireland) alone, with hundreds of variations across the rest of Ireland and Britain; and then you have all the nuances of non-anglophone Europeans speaking their version of English ... and that's just English.

So how will you ensure that the software is able to accurately find and tag important keywords in a block of audio, when it's fairly likely that those keywords are the ones that'll be pronounced significantly differently from one person to another? Tomato/tomato (/tomAAahzo/tamaha) 🥸
 
Hi! I can only imagine how frustrating this is. Unfortunately, the tech hasn't advanced far enough to be able to decipher strong accents. However, for this app:
To keep the price reasonable, users would import their own transcripts. Including transcription could double/triple the price. Plus, there are SO many free/affordable transcription options. Back in the day, I was dropping $1/min at Rev. Thankfully, we now have a bevy o' alternatives. BUT, the app would be flexible with file types. I've built a small app that converts "messy" transcripts into organized CSV files.
 