Hey to all!
Today we will be doing some nasty shell programming things 🙂 You must be wondering, how nasty can it get? It can get very, very nasty and above all sick, buuut for a start, I’m going to tell you something about grep and using it for grepping URL links out of text.
Basically, you can use grep to search for some sequence of characters in text. That can be a word, for example. Another very powerful thing is regular expressions. Even the name grep comes from them: it is usually expanded as Global Regular Expression Print, after the ed command g/re/p. Now, as regular expressions (regex) are way too complex for some small talk, I’ll only explain the example right now, and leave the regex talk for another time. Get ready! Here it goes:
grep -o 'http://[^"]*'
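To see what that one-liner does before the full story, here is a tiny sketch you can paste into a shell. The HTML line and the example.com addresses are made up for illustration, and it assumes your grep supports -o (GNU and BSD grep both do).

```shell
# Made-up sample: one line of HTML with two double-quoted links.
sample='<a href="http://example.com/a.jpg">one</a> <img src="http://example.com/b.png">'

# Without -o, grep prints the whole line that contains a match.
printf '%s\n' "$sample" | grep 'http://[^"]*'

# With -o, grep prints only the matched parts, one per line:
#   http://example.com/a.jpg
#   http://example.com/b.png
printf '%s\n' "$sample" | grep -o 'http://[^"]*'
```

Note the plain straight quotes around the pattern. Curly quotes copied from a web page will not work in the shell.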
Now, why did I decide to talk about this, you may ask. You can find this on other web sites, huh? Well, I tried, and all I found were some Python scripts written for doing it. For doing what? For getting the URL addresses out of text. Now, that may be confusing, so I’ll tell you the full story and the reason why I needed this grep regex.
I found a site with a lot of funny photos. Now, without the funny part, you may think: “Is that the nasty part he was speaking of?” 😀 No! The photos are about photobombing, and the girl who posted them, or was the photobomber, made faces that really made me laugh! Check them out here.
Anyway, I wanted to make a collection, cuz I too like to photobomb ;). Right-clicking on every one of them and saving them would be a challenge I’d rather escape. If I were on Windows, I’d have to kill myself first, and then do that kind of job. Luckily, I use Linux, and my golden shell is just a few clicks away from me! After that part (clicking all the way to the shell, or just using a keyboard shortcut), you have your shell running in a terminal.
Now, what then? First, you have to copy the source of the web page into some plain text file, or just save the HTML file. The name of the file is not important. Make sure you know which directory you saved it in, and navigate there in the shell. When you are in the directory where the file with the HTML data is, type cat <filename>. That will just print the whole file to the terminal.
Now, you need the URL links. We all know that those links start with http://, and inside HTML they sit between double quotes, so we will use grep. We could use other tools like awk, but grep will do the job just fine, and more than fine, just perfect!!! So, you type cat <filename> | grep -o 'http://[^"]*' and hit the return (enter for newbies) button. The -o option makes grep print only the matched part of each line instead of the whole line. That is the whole wisdom of getting the URL links out of a text file. You can redirect the text that would be written to the terminal into a file with > <file_for_result>, and then you’ll have the links in that file.
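The whole procedure can be sketched as a few lines of shell. The file names page.html and links.txt and the example.com addresses are made up, and the here-document only stands in for a page you would really save from the browser. The sort -u step is my own addition for dropping repeated links, not something the one-liner needs.

```shell
# Hypothetical working directory and file names, used only for this sketch.
workdir=$(mktemp -d)
page="$workdir/page.html"
links="$workdir/links.txt"

# Stand-in for the saved page source, so the example runs on its own.
cat > "$page" <<'EOF'
<html><body>
<img src="http://example.com/photos/1.jpg">
<a href="http://example.com/photos/2.jpg">second</a>
<img src="http://example.com/photos/1.jpg">
</body></html>
EOF

# grep can read the file directly, so the "cat <filename> |" part is optional.
# sort -u removes duplicate links; > redirects the result into a file.
grep -o 'http://[^"]*' "$page" | sort -u > "$links"

# Show what was collected.
cat "$links"
```

If the page also uses https:// links, a pattern such as grep -oE 'https?://[^"]*' should catch both kinds. That is an extension of the original command, so treat it as a suggestion rather than part of the recipe.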
Now, for the regex used in this example, the explanation goes like this:
http:// is the part that describes how the string we are looking for begins.
[^"] matches any single character except ", so the match can never run past a double quote. In other words, the returned string ends right before the first " that follows http://.
* tells grep that characters matching [^"] can occur zero or more times.
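Each of the three pieces can be poked at separately. This is only a sketch: the helper name urls and the sample strings are made up, and the last case shows a limit of the pattern rather than a feature.

```shell
# Hypothetical helper wrapping the pattern from this post.
urls() { grep -o 'http://[^"]*'; }

# [^"] stops the match right before the first double quote after http://
printf '%s\n' 'src="http://example.com/x.jpg" alt="funny"' | urls
# prints: http://example.com/x.jpg

# * allows zero characters, so a bare http:// followed by a quote still matches.
printf '%s\n' 'href="http://"' | urls
# prints: http://

# With no closing quote, [^"]* keeps going to the end of the line.
printf '%s\n' 'see http://example.com/page for more' | urls
# prints: http://example.com/page for more
```

So the pattern works nicely on HTML, where links are wrapped in double quotes, but on plain text it will swallow everything after the link up to the end of the line.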
For some strange reason, I think I did not explain this as well as I should have. Maybe my English is a bit rusty, but I’m willing to explain it to anyone who is having problems with this. So, just comment, and I’ll make sure to respond as soon as possible!