GNU social JP
GNU social JP is a GNU social server in Japan.
Conversation

Notices

  1. Cory Doctorow (pluralistic@mamot.fr)'s status on Monday, 25-Sep-2023 10:17:30 JST

    Hey folks! I've got a folder of 100+ HTML files; I'd like to get a word count of the rendered text (i.e. excluding tags). Is there an Ubuntu utility I could use for this?

    ETA: I figured it out. I just did

    > cat * > merged.html

    then I opened the file in Firefox, copied the text, pasted it into Gedit, and used the Document Statistics tool.

    Dirty and ugly, but it got the job done.

    • Blair Fix (blair_fix@mastodon.online)'s status on Monday, 25-Sep-2023 10:17:31 JST

      @pluralistic

      You could use pandoc to convert all the html files to plain text:

      > for f in *.html; do pandoc "$f" -s -o "${f%.html}.txt"; done

      Then cat all the files and pipe them to wc -w to get a word count (plain wc also prints line and byte counts):

      > cat *.txt | wc -w

    • Boyd Stephen Smith Jr. (boydstephensmithjr@hachyderm.io)'s status on Monday, 25-Sep-2023 10:17:33 JST

      @pluralistic Relevant XKCD: https://xkcd.com/763/

      Your EDIT is several times more menacing than your question, to me. :)

    • Mina (mina@swiss-talk.net)'s status on Monday, 25-Sep-2023 10:17:36 JST
      in reply to
      • deadbeefmonster

      @deadbeefmonster

      wc -w

      @pluralistic

    • deadbeefmonster (deadbeefmonster@infosec.exchange)'s status on Monday, 25-Sep-2023 10:17:43 JST

      @pluralistic you can use `lynx -dump file.html | wc -w`

    • clacke (clacke@libranet.de)'s status on Monday, 25-Sep-2023 10:17:45 JST
      in reply to
      • Blair Fix
      @blair_fix @pluralistic This is the canonical answer. (e|)l(ynx|inks) and w3m are cool, but they serve another purpose; pandoc is the document-oriented one.
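
The suggestions in this thread can be combined into one pipeline that skips the intermediate .txt files. The following is a minimal sketch, not anyone's posted solution: the function name `html_words` and the `HTML_TO_TEXT` override variable are hypothetical names introduced here for illustration. It assumes pandoc is installed and asks for plain-text output explicitly with `-t plain` rather than relying on the output file extension.

```shell
#!/bin/sh
# html_words DIR: print the total word count of the rendered text
# of every *.html file in DIR (default: current directory).
#
# HTML_TO_TEXT is a hypothetical override for the converter command.
# It must read HTML on stdin and write plain text on stdout.
html_words() {
    dir=${1:-.}
    for f in "$dir"/*.html; do
        [ -e "$f" ] || continue   # skip the literal pattern if nothing matched
        # Left unquoted on purpose so the command splits into words.
        ${HTML_TO_TEXT:-pandoc -f html -t plain} < "$f"
    done | wc -w                  # count words across all converted files
}
```

Calling `html_words ~/some/folder` prints a single number. Setting `HTML_TO_TEXT='lynx -stdin -dump -nolist'` should switch to the lynx approach, assuming lynx is installed; `-nolist` is meant to keep the trailing link list out of the count.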

GNU social JP is a social network, courtesy of the GNU social JP administrator (GNU social JP管理人). It runs on GNU social, version 2.0.2-dev, available under the GNU Affero General Public License.

Creative Commons Attribution 3.0 All GNU social JP content and data are available under the Creative Commons Attribution 3.0 license.